Building a Streaming Database for Fun (Not Profit)

I’m a really big fan of databases. I think it’s because the work is low-level enough that I feel like I have control over all data storage, all logic, and all concurrency issues, but high-level enough that I don’t have to deal with registers. I also love that it’s work that is applicable to every possible other project. Want to build anything that isn’t a static webpage? Get you some database.

That’s actually what drove me towards this project to begin with. Last fall, I worked with two other Caltech students to build a live-updating data visualization dashboard for tweets about the election (RIP). I designed the architecture of the project and set everything up on AWS. We used a couple of EC2 instances to connect to the Twitter Public Streaming API, analyzed the tweets that came through with a couple of machine learning models trained to classify them by sentiment and ideology of the author (liberal or conservative), performed some aggregation, and updated a DynamoDB instance with the aggregates. We used Elastic Beanstalk to host our Node app, which read data out of DynamoDB and used D3.js to update our dashboard. We had a couple of visualizations, illustrating differences in tweet volume, sentiment, and vocabulary between Twitter users with “liberal” and “conservative” ideologies (classified using our ML models) on four different topics concerning the election and election day.

These are some screenshots of our visualizations of tweets from inside of the USA containing keywords about polling places at 8pm on 11/09/2016. RIP.

That project was especially fun because of all of the new technologies I worked with in order to build it. I had the chance to learn some web development, a little bit of DevOps, and a little bit of data visualization. One of my biggest takeaways from this project was how many options there were for setting up streaming data pipelines, and how complicated they were to set up.

Note that our AWS architecture involved three different services, and I’d hacked it together in about two weeks. I was positive there had to be an easier way to set up a data pipeline, one that wouldn’t require learning NoSQL databases or juggling a zillion different Amazon products. In my research for pipeline solutions, I ran across a couple of options, like Kinesis Streaming and PipelineDB. I was inspired by these solutions (particularly PipelineDB), since their ability to stream data into relational databases made me significantly more comfortable. What I was really looking for was a filter between the firehose of a data stream and the calm, organized environment of a relational database. I know I can handle databases, and I know I can create apps that read data from databases, so I wanted a tool that would put me back in my comfort zone.

So, I built one!

“What do you mean, you built one?”

I mean that I was inspired by other streaming SQL solutions to build my own method of turning streaming JSON data (the current standard for most web APIs) into organized records in a relational database. I built my streaming solution on top of NanoDB, a relational database built by Donnie Pinkston, a lecturer at Caltech. Donnie built NanoDB for instructional purposes, and during winter 2017 I took his Relational Database Implementation class, where he deleted large chunks of the relational database system and we had to reimplement them. We worked on join planning, plan optimization, table statistics and plan costing, and B+ tree indexes, among other things. After taking this class, I was very familiar with the NanoDB codebase (written in Java), and felt it would save time to write my streaming solution on top of this database system instead of, say, PostgreSQL.

I call my database TrickleDB (because it’s a Nano stream, heh). TrickleDB can be used to connect to any source of JSON data and stream that data into a relational database. It can also be used to create tables of streaming aggregate functions, which can track the data stream over time. In theory, you could add TrickleDB on top of any relational database, not just NanoDB. It is functional and I am proud of it and I wanted to share its architecture with the world because designing this software was the most fun I’ve ever had in front of a computer (and yes, I do watch Silicon Valley on HBONow).

Design Choices

The TrickleDB system is modeled after a tree hierarchy. There is one global StreamManager object, instantiated on startup, which can have zero or more Stream objects. Each Stream object can have zero or more StreamView objects, and must have one Source object.

A class diagram of the four objects that make up the TrickleDB system.
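TrickleDB’s code isn’t public yet, so here is a minimal Java sketch of that ownership hierarchy under my own assumptions — the class shapes, fields, and method names are illustrative, not the actual NanoDB/TrickleDB API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the TrickleDB object tree: one StreamManager owns
// zero or more Streams; each Stream owns exactly one Source and zero or
// more StreamViews.
class Source {
    final String url;                      // JSON endpoint this Source polls
    Source(String url) { this.url = url; }
}

class StreamView {
    final String name;                     // e.g. "tweets_aggregate"
    StreamView(String name) { this.name = name; }
}

class Stream {
    final String name;                     // e.g. "tweets"
    final Source source;                   // exactly one Source per Stream
    final List<StreamView> views = new ArrayList<>();  // zero or more views
    Stream(String name, Source source) { this.name = name; this.source = source; }
}

class StreamManager {
    final List<Stream> streams = new ArrayList<>();    // zero or more Streams
    Stream createStream(String name, String sourceUrl) {
        Stream s = new Stream(name, new Source(sourceUrl));
        streams.add(s);
        return s;
    }
}
```

Creating a Stream through the manager mirrors what CREATE STREAM does: it wires up exactly one Source and an initially empty list of StreamViews.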

To set up the system, a user must start a NanoDB server, which instantiates the StreamManager, and then use the following SQL to create a Stream (for this example, I made up a Twitter API endpoint that would return JSON data):

CREATE STREAM tweets SOURCE (tweet_id INT, number_of_likes INT, text VARCHAR(140));

The Stream tweets creates a new Source object, which operates a connection to the made-up Twitter API endpoint. The Stream also creates a “buffer table,” a relational database table with a schema based on the column names and data types passed in with the CREATE STREAM command (plus two extra attributes: a timestamp for the new data when it arrives over the connection, and a boolean that tracks whether the record has been “processed”).

Once I tweet something new, the JSON event describing the new tweet will be available over the Source’s connection to the Twitter API. The Source will parse the JSON and generate a SQL INSERT statement for it and store this statement in a circular queue.

Take this tweet for example:

Tweet #1 for our example

The generated INSERT statement that gets added to the queue is:

INSERT INTO tweets VALUES (12345, 1, 'Baby''s first research poster 👩‍💻👩‍🏫', 1495224060, 0);

Every so often, the StreamManager will run a process that executes all of the INSERT statements in the Source’s queue, adding the raw data parsed from the JSON events into the Stream’s buffer table.

This is what the buffer table looks like after the StreamManager executes the INSERT statement in the Source’s queue.
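The post doesn’t show the parsing code, so here is a hedged Java sketch of how a Source might turn one parsed JSON event into a queued INSERT statement. The class name, the drop-oldest behavior of the circular queue, and the column order are my assumptions based on the example above; the JSON parsing itself is elided (values arrive already extracted):

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a Source's circular queue of generated INSERT
// statements. When the queue is full, the oldest statement is dropped.
class InsertQueue {
    private final ArrayDeque<String> queue = new ArrayDeque<>();
    private final int capacity;
    InsertQueue(int capacity) { this.capacity = capacity; }

    // Build the INSERT for one tweet event, appending the arrival timestamp
    // and the "processed" flag (0 = not yet processed).
    void enqueueTweet(long tweetId, int likes, String text, long arrivalTs) {
        String stmt = String.format(
            "INSERT INTO tweets VALUES (%d, %d, '%s', %d, 0);",
            tweetId, likes, text.replace("'", "''"), arrivalTs);
        if (queue.size() == capacity) queue.removeFirst();  // circular: drop oldest
        queue.addLast(stmt);
    }

    String poll() { return queue.pollFirst(); }  // next statement to execute
    int size() { return queue.size(); }
}
```

Note the `''` escaping: single quotes inside the tweet text have to be doubled for the statement to be valid SQL.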

Let’s say the user wants to track the total number of likes I get on all of my tweets. The user will create a StreamView to track this aggregate function on the raw data. To create a StreamView, the user will write a command like this:

CREATE STREAMVIEW tweets_aggregate ON tweets (SUM(number_of_likes));

This command creates a new StreamView object called tweets_aggregate that belongs to the Stream tweets. This StreamView creates a new table with a schema based on the passed-in aggregate functions on attributes of the Stream’s schema (in this case, SUM(number_of_likes)).

The StreamView registers all of the aggregate functions that its schema needs to keep track of, and passes this information to the Stream. For example, tweets_aggregate will pass “SUM(number_of_likes)” to the tweets Stream, but if tweets_aggregate were tracking an average instead of a sum, it would pass both SUM and COUNT to tweets, instead of AVERAGE. We’ll get back to complex aggregates later.
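As a sketch of that registration step — assuming a simple string-based representation, where the real system presumably passes richer objects — AVG can be rewritten into the SUM and COUNT that the Stream actually needs to maintain:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of how a StreamView expands the aggregates it tracks
// into the simple functions the Stream must maintain: AVG(x) is rewritten
// as SUM(x) and COUNT(x); simple functions like SUM, COUNT, MIN, and MAX
// pass through unchanged.
class AggregateRegistry {
    static Set<String> simpleFunctions(String aggregate, String column) {
        Set<String> needed = new LinkedHashSet<>();
        if (aggregate.equals("AVG")) {
            needed.add("SUM(" + column + ")");
            needed.add("COUNT(" + column + ")");
        } else {
            needed.add(aggregate + "(" + column + ")");
        }
        return needed;
    }
}
```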

The Stream uses the names of the aggregate functions from all of its StreamViews to create a temporary table (whose schema contains all of these aggregate functions).

The StreamView’s table.

Periodically, when triggered by the StreamManager, the Stream will perform these aggregate functions on its buffer table and insert all of the aggregate values into the temp table.

After the Stream is finished aggregating all of the data in its buffer table, it lets its StreamViews know that they can coalesce the data in their tables with the data in the Stream’s temp table.

Let’s say I tweet again.

Tweet #2 for our example

This tweet generates the following INSERT statement in the Source’s queue:

INSERT INTO tweets VALUES (12346, 1, 'Felt very smart after memorizing 74 SCOTUS cases for my history final, until I realized I put my underwear on backwards this morning', 1496832780, 0);

This is inserted into the Stream’s buffer table:

The buffer table after the new INSERT statement has been executed by the database server.

Note that the StreamManager handles garbage collection in a separate process, periodically deleting “processed” data from each Stream’s buffer table.
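Using the table and column names from the example, that cleanup pass might boil down to a statement like this — an assumption on my part, since the post doesn’t show the actual garbage-collection logic:

```java
// Hypothetical sketch of the StreamManager's garbage-collection pass: rows
// whose "processed" flag was set during aggregation are deleted from each
// Stream's buffer table.
class GarbageCollector {
    // Returns the SQL the manager would execute for one Stream's buffer table.
    static String cleanupStatement(String bufferTable) {
        return "DELETE FROM " + bufferTable + " WHERE processed = 1;";
    }
}
```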

Next, the Stream populates its temp table with aggregates of all non-processed data in the buffer table:

Then the StreamView coalesces its table with the temp table:

I am not very popular on Twitter

Here is another way to visualize data moving through the TrickleDB system, if you prefer seeing everything in one place.

On the left side is an example of the TrickleDB architecture. Red arrows represent data flow; blue arrows represent ownership. The right side shows the different forms that data takes as it flows through the system; the colored boxes represent the form that data takes during each step of processing. The blue box represents the data received over the connection on the Source’s URL. The pink box represents the INSERT statement parsed from the raw JSON and stored in the Source’s circular FIFO queue (the buffer). The orange box represents (1) the record in the Stream’s buffer table and (2) the INSERT statement used to populate the Stream’s temp table. The green box represents (1) the record of aggregates in the Stream’s temp table and (2) the coalesced record in the StreamView’s view table.

It is at the coalescing stage that more complex aggregate functions are handled. For example, if we wanted to track a streaming average of the number of likes on all of my tweets, the StreamView would collect SUM and COUNT functions in addition to the AVERAGE, so its schema would look like this:

The schema of the StreamView needed to track a streaming average.

The StreamView would ask the Stream to track the SUM and COUNT functions on the number_of_likes attribute in its buffer table, and then coalesce only the SUM(number_of_likes) and COUNT(number_of_likes) columns from the temp table. After the StreamView updates the SUM and COUNT columns in its table with the aggregated batch from the Stream, the StreamView performs the arithmetic required to update its AVG(number_of_likes) column.

This is the Stream’s temp table schema to track AVG(number_of_likes).
This is the Stream’s temp table updated with the number of likes on the records currently in the buffer table.
This is the StreamView’s table after coalescing with the Stream’s temp table.
Finally, this is the StreamView’s table after updating its complex aggregate function (the average) with the newly coalesced simple aggregate functions.
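The arithmetic in that last step is simple enough to sketch in a few lines of Java. The class and field names here are mine, and I’m assuming the view keeps running SUM and COUNT totals that each batch from the Stream’s temp table gets merged into:

```java
// Hypothetical sketch of the coalescing step for a streaming average: the
// view keeps running SUM and COUNT, merges each batch of simple aggregates
// from the Stream's temp table, then recomputes AVG from the merged totals.
class AverageView {
    long sum = 0;       // running SUM(number_of_likes)
    long count = 0;     // running COUNT(number_of_likes)

    // Merge one batch of simple aggregates, then derive the new average.
    double coalesce(long batchSum, long batchCount) {
        sum += batchSum;
        count += batchCount;
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

With the two example tweets (one like each, arriving in separate batches), the running average stays at 1.0 — I am, as noted, not very popular on Twitter.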

Note also that one could have many Streams, each with many StreamViews, running into the relational database at the same time. If you wanted to track averages in one StreamView table, and MIN and MAX in another, you could just create another StreamView on the same Stream to track different values on the same streaming data.

In the Future

I would like to add sliding-window functionality to StreamView tracking. This would allow a user to track only the last hour, last day, last six months, etc. of a data stream in a StreamView, and would allow users to have multiple StreamViews on the same Stream tracking different-length sliding windows. This could be implemented on the user side with syntax like this:

CREATE STREAMVIEW name ON stream_name ([{aggregate_function}[, …]]) [WINDOW range_start TO range_end];

Where the range_start and range_end keywords are in SQL DATETIME format. This functionality would make creating multiple StreamViews on the same Stream more useful, as one could track streaming averages over windows of different lengths in different tables.

Parting Thoughts

I loved working on this project, because it combined three of my favorite things: writing code, reading docs, and playing with data. I also simultaneously partook in my fourth favorite activity: drinking coffee.

TrickleDB is not yet public on my GitHub page because I have not yet fully separated it from the underlying NanoDB server. I will, in the future, make this project open source and available to sit atop any relational database your heart desires.

Finally, I had a blast writing this blog post and I look forward to writing more in the future! Please leave a comment with any constructive criticism you have on my writing/diagrams that will help me make my posts more clear or useful in the future.
