Real-time analytics using Postgres

Ramanan Balakrishnan
Engineering@Semantics3
5 min read · Oct 2, 2017

Build you a faster dashboard for great good.

Let’s face it, no one likes a slow website. Whether you are loading your favorite memes or trying to turn in those TPS reports, it is always frustrating to be kept waiting, staring at an animated spinner.

yeaa, that would be great

At Semantics3, in the process of collecting as much data as possible, the team often goes crazy gathering numbers across different engineering systems. And as databases balloon, queries become expensive and slowdowns in analytics dashboards become inevitable.

In an earlier article, we had described our pipeline for gathering stats and our use of time-series databases for measuring application performance.

Today, I would like to focus on a slightly different system, also an analytics component, but powered using a Postgres backend instead. This post discusses issues that we faced as we collected more data, and the steps that were taken to improve query performance.

*note: A lot of the content here is inspired by the excellent tutorial at PGConfSV by Andres Freund and Marco Slot of Citus Data.

Too much of a good thing

As a part of our offering for ecommerce stores, we track view, click and purchase events for various online platforms. By aggregating and analyzing these numbers, we are able to offer actionable insights for driving positive consumer behavior.

A while back, with increasing popularity, we were starting to onboard stores far beyond what we had initially planned for. At one point, we were collecting on the order of 10 million events every day — considering all the other problems that we could have had, the problem of having to scale our systems was probably a good one to have.

A typical event log might be as below.

{ "type": "view",  "time": "2017-09-01T05:19:50Z", "id": ... }
{ "type": "click", "time": "2017-09-01T05:19:52Z", "id": ... }
{ "type": "view", "time": "2017-09-01T05:20:10Z", "id": ... }
{ "type": "view", "time": "2017-09-01T05:21:21Z", "id": ... }
{ "type": "click", "time": "2017-09-01T05:21:31Z", "id": ... }
{ "type": "purchase", "time": "2017-09-01T05:24:59Z", "id": ... }
...

In order to perform our number crunching, we (obviously) had to record these events in some sort of data store. Considering our experience in past battles and looking at the inherent relational structure within the data, we decided to stick with Postgres.

So, to start simple, we created an events table to log these events.
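
A minimal sketch of what such a table might look like (the exact columns and types here are assumptions based on the event log above):

CREATE TABLE events (
    id   bigserial   PRIMARY KEY,   -- surrogate key for each event
    type text        NOT NULL,      -- 'view', 'click' or 'purchase'
    time timestamptz NOT NULL       -- when the event occurred
);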

To run daily counts, we could simply aggregate over this table.
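
A query along these lines would do the trick (assuming the columns sketched above):

-- events per day, broken down by event type
SELECT time::date AS day,
       type,
       count(*)   AS count
FROM events
GROUP BY 1, 2
ORDER BY 1, 2;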

Initially, this query seemed to run fast enough, providing us with the stats within a few seconds. But, as you can probably tell, as time went on, the query started slowing down.

We had collected a significant amount of data, and the time taken was slowly creeping upwards. For example, when run on a test machine against partial data (5 million rows), the query took ~22 seconds, definitely a cause for concern.

After confirming that such load times would not be acceptable to our users, we went in search of ways to improve the performance of these aggregation queries.

Quick fixes for easy gains

One quick way to solve this problem was to make use of Materialized Views.

Most people might be familiar with views, which are like predefined lenses for presenting results while still querying the underlying data. Materialized views are similar, with one major difference: they compute the query up front and save the results.

We defined a materialized view and ran the previous query again.
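
Roughly, the definition would look like the following (the view name daily_counts and the columns are carried over from the sketches above):

-- precompute the daily counts once and store the results
CREATE MATERIALIZED VIEW daily_counts AS
SELECT time::date AS day,
       type,
       count(*)   AS count
FROM events
GROUP BY 1, 2;

-- subsequent reads hit the stored rows instead of scanning events
SELECT * FROM daily_counts ORDER BY day, type;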

Apart from the lead time to set up the materialized view (71 seconds), subsequent queries ran much faster (1.2 ms).

The CREATE MATERIALIZED VIEW step performs the heavy lifting and calculates the rows, so that subsequent queries become relatively inexpensive. If we were to run 1000 queries, the raw aggregation commands would take about 6 hours, while the materialized view (including setup time) would take 72 seconds, a speed-up of roughly 300x!

But, where is the catch, you ask? Of course, there is a catch.

Materialized views can go out of sync with the underlying data and need to be refreshed whenever that data changes. So every time we load new events into the events table, we would need to run a REFRESH MATERIALIZED VIEW [CONCURRENTLY] to keep the view in sync.
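
For reference, a refresh is a single command (daily_counts being the view name assumed in the sketch above):

REFRESH MATERIALIZED VIEW daily_counts;

-- the CONCURRENTLY variant keeps the view readable during the refresh,
-- but requires a unique index on the materialized view
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_counts;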

And yes, these REFRESH commands take roughly as long as the original CREATE command. So, by the time a refresh completes, the events table has more rows and the view is again out of sync and considered stale.

Increasing complexity might be a good thing

Looks like materialized views don’t really enable us to provide real-time analytics when heavy write loads are expected.

An alternate architecture to solve this problem is to manually set up aggregate tables (caches) for existing data and keep track of incoming events separately (diff queues).

It is easier to describe this setup using the diagram below.

The table daily_counts_cached contains aggregate data from the events table, while the table daily_counts_queue serves as a placeholder for logging incoming values which have not yet been rolled up into daily_counts_cached.
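
As a rough sketch (the table names come from the setup above, but the column layout here is an assumption; the PGConfSV tutorial repository mentioned below has the full version):

-- rolled-up counts, one row per day and event type
CREATE TABLE daily_counts_cached (
    day   date   NOT NULL,
    type  text   NOT NULL,
    count bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (day, type)
);

-- pending increments appended by triggers on events,
-- periodically folded into daily_counts_cached
CREATE TABLE daily_counts_queue (
    day  date   NOT NULL,
    type text   NOT NULL,
    diff bigint NOT NULL
);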

By making use of INSERT/UPDATE/DELETE triggers listening on events, it is possible to keep values updated in daily_counts_queue and subsequently in daily_counts_cached as well.

Can’t we just roll up into daily_counts_cached directly? Why do we need the queue? Well, it is because the incoming event stream is collected across multiple machines. Each collector trying to modify the same row in daily_counts_cached would result in a lot of row locking, which can take a significant toll on write performance and delay the loading of events into the database.

Even though this setup is easy to describe, the code required to set it up is fairly verbose. The triggers need to be carefully set up to handle the various INSERT/UPDATE/DELETE operations on the events table. Instead of filling your page here with code blocks, I recommend you check out the files at the repository created for the PGConfSV tutorial.

It is now possible to run a query which combines data from both these tables to obtain stats.
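
A sketch of such a query, under the same assumed columns, sums the cached counts together with whatever is still sitting in the queue:

SELECT day,
       type,
       sum(count) AS count
FROM (
    SELECT day, type, count FROM daily_counts_cached
    UNION ALL
    SELECT day, type, diff  FROM daily_counts_queue
) AS combined
GROUP BY day, type
ORDER BY day, type;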

tada! 🎉

Looks like our new setup works, and still maintains the impressive speedup we saw before. However, this time, the write performance is not an issue and we can still ensure that the stats include the most recent values available in our database.

As the daily_counts_queue grows, the aggregation costs could potentially degrade performance. However, by tuning the periodic trigger frequency, we can make sure that this “intermediate” table does not become unmanageable.

That was a long ride

Using this setup, we are now able to use Postgres to reliably perform fast queries over large datasets.

The data/queue/cache model allows us to continue using our relational definitions, while still offering good performance. Now, we are able to easily support queries over larger ranges without worrying about the computational effort to aggregate over millions of rows. The significant reduction in response times is great too!

If you are interested in additional details (and a few other alternate implementations), do check out the full workshop referred to earlier.
