Real Time Data Warehousing

Granular Engineering
4 min read · Aug 22, 2018

By Zack Brunson

At Granular we spend a lot of time collecting data from our customers. For example, we track how much seed was put on a field and how much was harvested from that field. Collecting this data doesn’t do much by itself, so we also spend a lot of our time reporting on it. Initially, writing SQL queries (via SQLAlchemy) to aggregate various bits of data and serve them up to the front-end was good enough, but that method quickly became cumbersome and slow. Not only were we spending too much time developing and maintaining queries, they were also taking too long to run. To combat both of these problems we built a data warehouse that stores the data we need to report to our customers in a format that can be queried quickly and efficiently.

Traditionally this would be done via batch jobs that run periodically, perform their computation, and store the results so they can be served to whoever needs them. This works pretty well, but the data is only truly fresh right after a batch job has run. Since our customers use this data to manage their operations on a continual basis, we felt that limiting them to periodic snapshots of their data was not sufficient. To meet their needs we decided to provide an always up-to-date source of information to report from.

System Overview

Building this real time data warehouse proved to be an interesting challenge. The key components of the system are:

  • A change log for keeping track of every insert, update, and delete in our database
  • A guaranteed delivery system for sending the entries in this log to the processors
  • An ETL system for processing the log entries and storing the processed data

The Change Log

In order for our data warehouse to always be up to date, we first have to keep track of every change that happens in the system. We achieved this by creating a change log table in our database that stores the table name and primary key of the affected record whenever a change (insert, update, or delete) occurs. The change log table is populated by triggers on every other table in the database. Not only does this approach give us the list of changes we need to update our data warehouse, it is also easy to maintain, efficient, and transactional, since everything happens inside the database.
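To make the idea concrete, here is a rough sketch of what such a change log table and trigger could look like, driven from Python with SQLAlchemy. The table layout, the trigger function name, the `harvest` table, and the assumption that every table has an integer primary key named `id` are all illustrative, not our actual schema.

```python
# A minimal sketch of the change log, assuming PostgreSQL and SQLAlchemy.
# Names (change_log, record_change, harvest) are illustrative only.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/appdb")

DDL = """
CREATE TABLE IF NOT EXISTS change_log (
    id          bigserial   PRIMARY KEY,
    table_name  text        NOT NULL,
    record_id   bigint      NOT NULL,
    operation   text        NOT NULL,          -- INSERT, UPDATE, or DELETE
    changed_at  timestamptz NOT NULL DEFAULT now()
);

-- One trigger function shared by every table: write a change_log row per change.
-- Assumes each table has an integer primary key column named id.
CREATE OR REPLACE FUNCTION record_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO change_log (table_name, record_id, operation)
        VALUES (TG_TABLE_NAME, OLD.id, TG_OP);
        RETURN OLD;
    ELSE
        INSERT INTO change_log (table_name, record_id, operation)
        VALUES (TG_TABLE_NAME, NEW.id, TG_OP);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Attach the trigger to one table; in practice this is repeated for every table.
CREATE TRIGGER harvest_change_log
AFTER INSERT OR UPDATE OR DELETE ON harvest
FOR EACH ROW EXECUTE PROCEDURE record_change();
"""

with engine.begin() as conn:
    conn.execute(text(DDL))
```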

The Message Delivery System

Next we need to notify the ETL code about these changes so that the transformed data is ready to serve to our customers. This system has to deliver every message and ensure that it gets processed; otherwise we would be reporting incorrect data and making our system untrustworthy. Our primary delivery mechanism is to have PostgreSQL add a notification to an in-database queue whenever a row is inserted into the change log. A daemon listens to this queue and pushes the notifications onto an RQ queue that our ETL system subscribes to.
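A sketch of that daemon is below. It assumes the in-database queue is PostgreSQL’s LISTEN/NOTIFY (i.e. the change log insert also fires a `pg_notify` with a small JSON payload); the channel name, payload shape, queue name, and `process_change` entry point are hypothetical.

```python
# A minimal sketch of the notification daemon: listen on a PostgreSQL channel
# and forward each change to an RQ queue. Channel, payload, and function names
# are illustrative assumptions.
import json
import select

import psycopg2
import psycopg2.extensions
from redis import Redis
from rq import Queue

from etl import process_change  # hypothetical ETL entry point

conn = psycopg2.connect("dbname=appdb")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN change_log;")

rq_queue = Queue("etl", connection=Redis())

while True:
    # Block until PostgreSQL has something for us (5 s timeout to stay responsive).
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notification = conn.notifies.pop(0)
        payload = json.loads(notification.payload)  # e.g. {"table": "harvest", "id": 42}
        # Hand the change off to RQ so any available ETL worker can process it.
        rq_queue.enqueue(process_change, payload["table"], payload["id"])
```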

The queueing mechanism covers far more than 99% of our message delivery, but to provide the robust, guaranteed delivery that we require, we have a second mechanism: a script that runs every few minutes and requeues messages that have failed too many times. Having a catch-up script run every few minutes may seem like it breaks the real time piece of our real time data warehouse, but in practice the errors are duplicate key constraint violations caused by the large amount of concurrency in this system. That means another message has already caused the data to be updated, and we only retry the message as a precaution, to be sure that all of our updates are fully processed.
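The catch-up script can be as small as the sketch below, assuming an RQ version that provides `FailedJobRegistry`; the `etl` queue name carries over from the previous sketch.

```python
# A sketch of the straggler-requeue script. Assumes a recent RQ release with
# FailedJobRegistry; queue name is the illustrative "etl" queue from above.
from redis import Redis
from rq import Queue
from rq.registry import FailedJobRegistry

queue = Queue("etl", connection=Redis())
registry = FailedJobRegistry(queue=queue)

# Most failures here are duplicate-key races, so a blind requeue is safe:
# the worst case is reprocessing a change that has already been applied.
for job_id in registry.get_job_ids():
    registry.requeue(job_id)
```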

The ETL System

Now that the change notification system is in place, we just need the ETL piece. For this we have an event registration model where each piece of the ETL system registers for notifications about the tables that feed its computation. With the event registration in place, the ETL code can focus on the business logic needed to get the data ready for reporting. We can also have as many processes as we need listening to the RQ queues, since each piece of the data warehouse is self-contained and we are not trying to maintain shared state outside the database.
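One way to picture the registration model is the sketch below: each ETL step declares the source tables it cares about, and the worker dispatches incoming changes to the matching steps. The decorator, handler, and table names are hypothetical illustrations, not our production code.

```python
# A sketch of the event-registration idea: ETL steps subscribe to source
# tables, and the RQ worker fans each change out to the subscribed steps.
from collections import defaultdict

HANDLERS = defaultdict(list)  # table name -> list of ETL callables


def subscribes_to(*tables):
    """Register an ETL function for changes on the given source tables."""
    def decorator(func):
        for table in tables:
            HANDLERS[table].append(func)
        return func
    return decorator


@subscribes_to("planting", "harvest")
def update_field_yield(table, record_id):
    # Business logic only: recompute the affected rows of the yield report
    # and upsert them into the warehouse schema.
    ...


def process_change(table, record_id):
    """Entry point enqueued on RQ by the notification daemon."""
    for handler in HANDLERS[table]:
        handler(table, record_id)
```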

Between the change log, the message delivery system, and the ETL system, we have our real time data warehouse. From here we can use its read-friendly schema for quick and easy reporting.
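Because the heavy aggregation has already been done by the ETL step, serving a report becomes a simple select against a denormalized table, something like the sketch below (`field_yield_summary` and its columns are hypothetical names).

```python
# An illustrative report query against a denormalized warehouse table; the
# table and column names are made up for the example.
from sqlalchemy import create_engine, text

warehouse = create_engine("postgresql://localhost/warehouse")

with warehouse.connect() as conn:
    rows = conn.execute(
        text("SELECT field_name, seeded_lbs, harvested_bu "
             "FROM field_yield_summary WHERE operation_id = :op"),
        {"op": 1234},
    ).fetchall()
```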
