A Data Warehouse for Real-Time Crypto Market Feeds — Part 1

This post was originally published on the Block Street Blog

Optimizing some of our automated strategies requires us to develop a rich understanding of market dynamics on very short timescales. Price data as granular as one sample per minute is available through free tools like Enigma’s Catalyst, but given that variations on the order of milliseconds can meaningfully impact the profitability of our strategies, such data is too coarse-grained for the type of analysis we’d like to conduct. In theory, the highest resolution data we could hope to obtain from an exchange for a given trading pair would include every trade as well as every update to the order book, effectively allowing us to recreate the state of a market at any given point in time¹. Fortunately, this data is freely available, as many exchanges offer real-time feeds of market data via WebSocket APIs².

The WebSocket streams we use for data ingestion are often the very same streams responsible for providing real-time updates to the exchange’s trading dashboard.

This is the Binance trading dashboard for the BTC-USDT pair. We can see the order book updating on the left and recent trades appearing on the right.

WebSocket connections transmit data as discrete messages. We can inspect the content of individual messages as they stream into the browser. Here we see a message that contains a batch of order book updates:

The updates are expressed as arrays of bids and asks that were either added to the book or removed. Code running in your browser processes these updates, resulting in a real-time rendering of the market’s order book.

This Python script simply demonstrates that we can programmatically access the same stream of order book updates that we saw in the browser:

In practice, our system does not read a single stream, but rather thousands of different streams, covering various data feed types for all trading pairs across multiple exchanges³. Each stream is managed by a dedicated “worker.” Each worker builds up a buffer of incoming messages, then, after a fixed time interval, flushes the buffered messages into a file (thus emptying the buffer) and finally writes the file to Amazon’s S3 file storage. The buffer-flush cycle repeats.

Files in S3 containing trade data from Binance for the BTC-USDT pair from August 13th, 2018

Here’s what the content of one of these files looks like:

This file contains trade data that one of the workers ingested from a WebSocket stream, batched, and flushed to S3. Each row contains data for a single trade. You can see that the “trade_id” values increment by 1 — a quick sanity check that there are no gaps in this data.
Lots of data for lots of pairs

We can search for specific records within the millions of files we’ve written to S3 using another tool from Amazon called Athena, a serverless query service that spares us the setup and maintenance required for a traditional database. After a bit of basic configuration that essentially tells Athena what to look for inside these files and where to look for the files themselves, we can now search through our data using standard SQL syntax.

Using Athena to fetch all 239,945 trades on Binance for the BTC-USDT pair on August 13th, 2018, ordered by the trade timestamp (ascending)

Note the “directory” structure (S3 directories are actually just name prefixes) we use to write our files to S3:

This structure functions as an important optimization for when Athena retrieves the data. Athena has a concept of “partitions.” Partitions are simply mappings between virtual columns (in our case pair, year, month, and day) and the S3 directories where the corresponding data is stored. By pointing Athena directly to a particular subset of data, a well-defined partitioning scheme can drastically reduce query run times and costs. Our code automatically adds partitions when we start writing data to new S3 directories.

It’s worth taking a moment to appreciate the value proposition of S3 + Athena. These services are far cheaper and far more scalable than a bona fide SQL database, yet we can search for data using the same familiar query language. Granted, Athena can sometimes be slower than a proper SQL database and the complexity of queries is limited, but given that we mostly use it for ad-hoc investigations and fetching data sets for further analysis in Python, our workflow has been unencumbered by these minor shortcomings.

Speaking of Python, here’s an example of an interface we built to fetch records from Athena. It’s a thin wrapper around Amazon’s boto3 library with a few useful steps baked in.

This interface for retrieving Bittrex trade data takes a ticker pair, start date, and end date as arguments then returns the query results from Athena as a Pandas DataFrame. There’s also an optional argument to write the data to disk as a CSV. This query returned 126,434 trades from Bittrex for the BTC-USDT pair from August 1st, 2018 to August 3rd, 2018.

I’m glad I could provide a brief glimpse into a project I’ve enjoyed working on yet have had few opportunities to discuss (turns out real-time data ingestion doesn’t make for great small talk). In future posts, I’ll cover the motivation behind some of our design decisions, examples of specific insights we’ve been able to glean from these data sets, planned improvements, and more.

[1] The updates provided through the WebSocket streams are not sufficient for reconstructing order books. We need to periodically fetch snapshots of the order books and store those as well, which we can do using an exchange’s REST API. Then we can fetch a snapshot and apply the corresponding updates from the streams to “replay” the order book.

[2] We found several third-party vendors selling subscriptions to these data sets, typically in the form of CSV dumps delivered at a weekly or monthly cadence. This presented the question of “build vs buy.” Given that we felt capable of building a robust and reliable system for ingesting real-time market data in a relatively short amount of time and at a fraction of the cost of purchasing the data from a vendor, we were already leaning in favor of “build.” Further investigation made “buy” look like an increasingly unattractive option. Disclaimers issued by multiple vendors about their inability to guarantee data quality/consistency did not inspire confidence. Inspecting sample data sets revealed that some essential fields provided in the original data streams were missing — fields necessary for achieving our goal of being able to recreate the state of a market at an arbitrary point in time. We also recognized that a weekly/monthly delivery schedule would restrict our ability to explore relatively recent market data.

[3] All the connections required for such broad coverage as well as the resulting flood of incoming data raise some obvious concerns about data loss. We’ve taken several measures to mitigate such concerns, including the implementation of alerts for when non-contiguous messages are detected as well as a redundancy mechanism that allows us to spin up an arbitrary number of instances to read from the same data streams and then dedupe identical messages. It’s worth noting that, while these precautionary measures are useful, they’ve rarely been necessary — since first deploying this system, the only significant data loss we’ve encountered has been attributable to exchange downtime.