A Data Warehouse for Real-Time Crypto Market Feeds — Part 1
This post was originally published on the Block Street Blog
Optimizing some of our automated strategies requires us to develop a rich understanding of market dynamics on very short timescales. Price data as granular as one sample per minute is available through free tools like Enigma’s Catalyst, but given that variations on the order of milliseconds can meaningfully impact the profitability of our strategies, such data is too coarse-grained for the type of analysis we’d like to conduct. In theory, the highest resolution data we could hope to obtain from an exchange for a given trading pair would include every trade as well as every update to the order book, effectively allowing us to recreate the state of a market at any given point in time¹. Fortunately, this data is freely available, as many exchanges offer real-time feeds of market data via WebSocket APIs².
The WebSocket streams we use for data ingestion are often the very same streams responsible for providing real-time updates to the exchange’s trading dashboard.
WebSocket connections transmit data as discrete messages. We can inspect the content of individual messages as they stream into the browser. Here we see a message that contains a batch of order book updates:
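A schematic example of such a message (the exact field names vary by exchange — the shape below is illustrative; a size of "0.00" conventionally means the price level was removed):

```json
{
  "type": "orderbook_update",
  "pair": "BTC-USD",
  "time": "2018-09-21T15:33:11.452Z",
  "bids": [["6701.42", "0.84"]],
  "asks": [["6703.05", "0.00"]]
}
```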
The updates are expressed as arrays of bids and asks that were either added to the book or removed. Code running in your browser processes these updates, resulting in a real-time rendering of the market’s order book.
This Python script simply demonstrates that we can programmatically access the same stream of order book updates that we saw in the browser:
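A minimal sketch of such a script, using the third-party `websocket-client` package. The endpoint URL, subscription message, and field names are placeholders — each exchange defines its own protocol:

```python
import json

# Illustrative endpoint; every exchange publishes its own
# WebSocket URL and subscription protocol.
STREAM_URL = "wss://stream.example-exchange.com/ws"

def parse_update(raw):
    """Decode one raw message into its lists of bid and ask updates."""
    msg = json.loads(raw)
    return {"bids": msg.get("bids", []), "asks": msg.get("asks", [])}

def on_open(ws):
    # Many exchanges require an explicit subscribe message after connecting.
    ws.send(json.dumps({"op": "subscribe", "channel": "orderbook", "pair": "BTC-USD"}))

def on_message(ws, raw):
    # Each message is a batch of order book updates, just like in the browser.
    print(parse_update(raw))

if __name__ == "__main__":
    import websocket  # third-party: pip install websocket-client
    ws = websocket.WebSocketApp(STREAM_URL, on_open=on_open, on_message=on_message)
    ws.run_forever()
```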
In practice, our system does not read a single stream, but rather thousands of different streams, covering various data feed types for all trading pairs across multiple exchanges³. Each stream is managed by a dedicated “worker.” Each worker builds up a buffer of incoming messages, then, after a fixed time interval, flushes the buffered messages into a file (thus emptying the buffer) and finally writes the file to Amazon’s S3 file storage. The buffer-flush cycle repeats.
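A condensed sketch of that buffer-flush cycle. The S3 write is stubbed out here; in a real system it would be a boto3 `put_object` call, and the class names and interval are invented for illustration:

```python
import json
import time

class StreamWorker:
    """Buffers incoming messages and periodically flushes them to storage.
    Schematic sketch: one worker per stream."""

    def __init__(self, flush_interval_s=60):
        self.flush_interval_s = flush_interval_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_message(self, message):
        self.buffer.append(message)
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        if self.buffer:
            # Serialize the batch as newline-delimited JSON, then empty the buffer.
            body = "\n".join(json.dumps(m) for m in self.buffer)
            self.write_to_s3(body)
            self.buffer = []
        self.last_flush = time.monotonic()

    def write_to_s3(self, body):
        # In production: boto3.client("s3").put_object(Bucket=..., Key=..., Body=body)
        raise NotImplementedError
```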
Here’s what the content of one of these files looks like:
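Each file is a batch of newline-delimited JSON records; the fields below are illustrative:

```
{"time": "2018-09-21T15:33:11.452Z", "bids": [["6701.42", "0.84"]], "asks": []}
{"time": "2018-09-21T15:33:11.587Z", "bids": [], "asks": [["6703.05", "0.00"]]}
{"time": "2018-09-21T15:33:11.721Z", "bids": [["6701.40", "1.25"]], "asks": []}
```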
We can search for specific records within the millions of files we’ve written to S3 using another tool from Amazon called Athena, a serverless query service that spares us the setup and maintenance required for a traditional database. After a bit of basic configuration that essentially tells Athena what to look for inside these files and where to look for the files themselves, we can now search through our data using standard SQL syntax.
Note the “directory” structure (S3 directories are actually just name prefixes) we use to write our files to S3:
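Schematically (the bucket and file names are placeholders), keys follow the Hive-style `key=value` convention that Athena's partitioning expects:

```
s3://example-bucket/orderbook-updates/pair=btc-usd/year=2018/month=09/day=21/1537543991.json.gz
```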
This structure functions as an important optimization for when Athena retrieves the data. Athena has a concept of “partitions.” Partitions are simply mappings between virtual columns (in our case pair, year, month, and day) and the S3 directories where the corresponding data is stored. By pointing Athena directly to a particular subset of data, a well-defined partitioning scheme can drastically reduce query run times and costs. Our code automatically adds partitions when we start writing data to new S3 directories.
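A sketch of that partition-registration step, assuming a hypothetical table named `orderbook_updates` and illustrative bucket/database names. The `ALTER TABLE … ADD PARTITION` statement is standard Athena DDL, submitted through boto3 like any other query:

```python
def build_partition_ddl(table, bucket, pair, year, month, day):
    """Build the ALTER TABLE statement that maps a new S3 prefix to a partition."""
    location = (f"s3://{bucket}/{table}/pair={pair}/"
                f"year={year}/month={month:02d}/day={day:02d}/")
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (pair='{pair}', year={year}, month={month}, day={day}) "
            f"LOCATION '{location}'")

def add_partition(pair, year, month, day,
                  database="marketdata", table="orderbook_updates",
                  bucket="example-bucket"):
    import boto3  # third-party AWS SDK; requires AWS credentials
    boto3.client("athena").start_query_execution(
        QueryString=build_partition_ddl(table, bucket, pair, year, month, day),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},
    )
```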
It’s worth taking a moment to appreciate the value proposition of S3 + Athena. These services are far cheaper and far more scalable than a bona fide SQL database, yet we can search for data using the same familiar query language. Granted, Athena can sometimes be slower than a proper SQL database and the complexity of queries is limited, but given that we mostly use it for ad-hoc investigations and fetching data sets for further analysis in Python, our workflow has been unencumbered by these minor shortcomings.
Speaking of Python, here’s an example of an interface we built to fetch records from Athena. It’s a thin wrapper around Amazon’s boto3 library with a few useful steps baked in.
I’m glad I could provide a brief glimpse into a project I’ve enjoyed working on yet have had few opportunities to discuss (turns out real-time data ingestion doesn’t make for great small talk). In future posts, I’ll cover the motivation behind some of our design decisions, examples of specific insights we’ve been able to glean from these data sets, planned improvements, and more.
¹ The updates provided through the WebSocket streams are not sufficient for reconstructing order books. We need to periodically fetch snapshots of the order books and store those as well, which we can do using an exchange’s REST API. Then we can fetch a snapshot and apply the corresponding updates from the streams to “replay” the order book.
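The replay step can be sketched as follows, assuming each update is a (side, price, size) triple and that a size of zero removes a price level (a common exchange convention — the exact schema varies):

```python
def replay_order_book(snapshot, updates):
    """Reconstruct an order book by applying streamed updates to a REST snapshot.
    Schematic: snapshot maps price -> size per side; updates are
    [side, price, size] triples in stream order."""
    book = {"bids": dict(snapshot["bids"]), "asks": dict(snapshot["asks"])}
    for side, price, size in updates:
        levels = book["bids"] if side == "buy" else book["asks"]
        if float(size) == 0:
            levels.pop(price, None)  # level fully consumed or cancelled
        else:
            levels[price] = size  # new or updated level
    return book
```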
² We found several third-party vendors selling subscriptions to these data sets, typically in the form of CSV dumps delivered at a weekly or monthly cadence. This presented the question of “build vs buy.” Given that we felt capable of building a robust and reliable system for ingesting real-time market data in a relatively short amount of time and at a fraction of the cost of purchasing the data from a vendor, we were already leaning in favor of “build.” Further investigation made “buy” look like an increasingly unattractive option. Disclaimers issued by multiple vendors about their inability to guarantee data quality/consistency did not inspire confidence. Inspecting sample data sets revealed that some essential fields provided in the original data streams were missing — fields necessary for achieving our goal of being able to recreate the state of a market at an arbitrary point in time. We also recognized that a weekly/monthly delivery schedule would restrict our ability to explore relatively recent market data.
³ All the connections required for such broad coverage as well as the resulting flood of incoming data raise some obvious concerns about data loss. We’ve taken several measures to mitigate such concerns, including the implementation of alerts for when non-contiguous messages are detected as well as a redundancy mechanism that allows us to spin up an arbitrary number of instances to read from the same data streams and then dedupe identical messages. It’s worth noting that, while these precautionary measures are useful, they’ve rarely been necessary — since first deploying this system, the only significant data loss we’ve encountered has been attributable to exchange downtime.