One of the key components of nearly any software system is the database used to persist, retrieve, and analyze data. In this technical post, we talk about the specific characteristics that are important to us in the database serving our Algorithmic Trading Platform. We also take a few contenders for a spin and see how they stack up.
There isn’t a single perfect database that provides support for transactions, time-series data management, and ultra-fast analytics, along with being free or affordable. MemSQL comes close, and combined with its free tier for licensing, is a worthy contender for anyone looking to set up a multi-faceted database for an algorithmic trading system.
We would be remiss if we didn’t give an honorable mention to ClickHouse, which is fast, capable, and free, but just shy of being ready for prime time.
Taking a broad and somewhat simplified view, databases provide the ability to 1) record and manage data (OLTP) and 2) analyze data (OLAP). Most databases are good at one or the other, and even beyond these broad categories, databases have to face competing goals such as performance vs support for ACID transactions.
As an example, a relational database that is good at consistent and durable transactions may suffer in performance as it needs to lock its data structures and flush all writes to the disk. Conversely, a database that prioritizes performance may need to use a relaxed model of consistency.
This prioritization of certain capabilities and characteristics over others gives rise to different classes of databases that are suitable for different use-cases.
Data Storage for an Algorithmic Trading System
What if we wanted the best of all worlds — support for durable, consistent storage of large amounts of data and blazing-fast real-time analytics? While computer science theory warns us against wanting it all, some clever engineering can get us most of the way there.
The main design goals are:
- Fast ingest of a large number of events into durable storage (~250K-1M events/sec for several hours each day; we expect each event to be ~100–200 bytes and have 10–30 fields)
- Real-time analytics, including aggregations
- Ability to process vast amounts of historical data for patterns and trends
The solution can be constructed in a few different ways, but almost always consists of layering multiple data stores that offer complementary capabilities. An example of such a solution looks like this:
- Store all events from the trading system into a fast data store such as an append-only file journal (potentially replicated or recreated on multiple nodes of a highly-available system). This file provides durable storage but no real query capabilities.
- Ingest the contents of this journal file sequentially into an in-memory database/cache. This can be fast, even real-time, as there are no consistency checks, replication, or persistence requirements at this layer. This layer should provide real-time aggregations and analytics.
- Persist the contents of the in-memory database to disk periodically (hourly, end-of-day, etc.). Use a database that can operate over vast volumes of data stored on disk. The operations on this stored data are considered offline or batch-mode by their very nature, and do not come with an expectation of instantaneous response times.
- Optionally, ingest only the relevant parts of the journal file into a relational database for daily/monthly reporting. For example, only orders and executions are loaded into a relational database, while market quotes are skipped.
There are ways to simplify this pipeline of events — notably steps 2, 3, and 4 can sometimes be combined to use a single tool that offers multiple modes of storing and analyzing data. That is what the rest of the article is about — selecting a database tool that best fits this niche set of requirements.
Requirements for our Database
Below is the list of non-technical requirements we have:
- Cost: Being a cost-conscious start-up, we are looking for something free or relatively cheap. We don’t think this is a crazy requirement given the many FOSS alternatives these days. This requirement means that the standard paid databases like Oracle and MS SQL Server are not contenders.
- Good documentation and community support: If we are not paying for license and support, we will need good documentation and another way of getting our questions answered. This could be a mailing list, an active online community, or simply StackOverflow.
- Operational tools: We prefer a relatively mature product that comes with tools to set up, manage, and monitor deployments, including possibly multi-node clusters.
Now, on to the technical requirements:
- Fast ingest: We need the database to be able to ingest at upwards of 250K inserts/sec, the higher the better. If we need to batch inserts, that’s acceptable, and if we need to use multiple threads or connections, that’s fine too.
- Fast aggregations: We intend to use the Event Sourcing pattern for our system [See description here, here, and here]. As prescribed by this architectural pattern, we will record all state changes in the system as discrete immutable events. In order to recreate the latest state of the system from these events, we need support for fast in-memory aggregations, including window functions, upserts, and possibly other cross-sectional aggregations.
- Time-series operations: Support for operations such as time-bucketing, moving-window aggregations, and as-of joins.
- Expressive query language: SQL is ok, but often not expressive enough for advanced analytics. Ideally, the database would support data access and processing using a functional language with vectorized operations. The ability to create user-defined functions or server-side scripts are also helpful.
- In-memory tables: For fast analytics on working sets of data
- On-disk tables: We expect most databases in this genre to use column-oriented storage. This visual from ClickHouse does a great job of explaining why.
- The database should support an optimized on-disk data layout, which can be a significant contributor to the performance. One possible example of such a layout is:
- Data is partitioned by date and stored in segments for easier data management
- For each partition, data is sharded by Symbol across multiple nodes for parallelism and redundancy
- Within each partition and shard, data records are clustered by (Symbol + Exchange) to facilitate sequential disk reads
- Finally, within records for each clustered key, data is sorted by timestamp for faster time-series operations
- Further, data may be compressed on the disk, which has the benefit of reducing the total amount of data read from the disk
- Tiered data storage: The database could also support a tiered storage strategy for moving older data onto slower, and thereby cheaper, storage.
Here’s the approximate scale of data we are evaluating for:
- Daily data growth: 50–100 GB (uncompressed); ~1B records
- Historical data (eventually): 100 TB (uncompressed); ~1T records
The Test Setup
All of our tests were conducted on either a single or two AWS dedicated instances (m5n-2xlarge). These instances include 8 vCPUs, 32GB RAM, and 100–200GB SSD volumes, running the Amazon Linux 2 AMI.
We are aware that these are modestly sized instances, especially when it comes to memory, for some of the databases tested. This was intentional for two reasons. First, we think that these resources should be plenty for the tests we wanted to conduct, and second, we wanted to see how the tools would degrade or fail in the event that the resources weren’t deemed enough.
Within the time constraints we had, we made our best effort to configure each tool to perform its best, but we may not always have used the recommended configuration, hardware, or the number of nodes. We did try to follow the documentation and set up the data layouts in an optimal way (e.g., sharding scheme).
The actual tests we performed consisted of:
- Load NYSE TAQ data for a single day (files for 20180730). This involved loading 35 million trades into a table, as well as 719 million quotes into another table. As such, we don’t intend to use this database for tick data analysis, but this certainly forms a great sample data set.
- For each trade, find the prevailing quote for the symbol on the same exchange as the trade. We expect this query for a single busy symbol (say, SPY) to take under a minute, and we expect this query to complete for all symbols in under 30 minutes. This is a test of the query language’s ability to represent complex joins and the database’s ability to execute them in a reasonable time.
- For each symbol, calculate the count, average size, and VWAP of trades for each minute of the trading day. We expect this to take no more than 10 seconds for the entire trades table.
- Calculate OHLC bars for each symbol for each minute of the trading day
- Calculate the time-weighted average spread for each symbol for the trading day. This is an interesting test for two reasons: 1) determining the duration of a quote entails the usage of a window function such as
next2) every single quote must be processed, so this is a test of the raw scan speed.
We will note that we have extensive prior experience with kdb+, and our response time expectations are mostly borne out of that experience. We are not aware of any tool faster than kdb+ for raw single-core speed; however, we do not include kdb+ in the list of contenders, primarily due to its price, but also because of the steep learning curve and lack of operational tools.
While databases are the most common data stores, working directly with flat files is a real contender because it affords us the most flexibility in how we store data. These days, there are multiple tools available that can operate efficiently on flat files stored on local disks or S3 buckets such as: Python (Pandas with Jupyter), Apache Spark, Amazon Redshift Spectrum, and even clickhouse-local.
We tried an EMR (Elastic Map Reduce) cluster with Apache Spark on AWS, and while it was relatively easy to set-up, it took us a while to figure out how to load data from files and JDBC sources and work with Spark datasets and PySpark dataframes. Our general conclusion was that this may be usable for batch analytics with the appropriate scale-out but not as a primary database. We will admit that our lack of familiarity with Spark and Hadoop did not help here.
Even so, we can see an elaborate system consisting of files and directories organized in the right way, with corresponding tools and scheduled jobs, that could be a viable alternative for power-users who are able to allocate the appropriate resources. For us though, we decided this might be too fragile and unstructured, and we need something with a few more bells and whistles.
We included MySQL only as a starting point, mostly to confirm that a traditional RDBMS is not really the right answer for us. MySQL is not a time-series database, is not column-oriented, and does not support the type of advanced analytics or performance that we are looking for.
The advantages are that it is free, has a massive community around it, and fans will claim you can make it do anything if you only know how. In our tests, MySQL (InnoDB engine) could not keep up with fast, batched inserts of 250K/sec across a connection pool, and the insert rate dropped as the table grew to a few million records. The data sizes on disk looked quite large, and the response times when querying the few million records were in multiple seconds. Joining tables with several million records did not complete within an acceptable amount of time, even with indexes that made sense to add.
When proof-reading a draft version of this blog post, a former colleague pointed us to the MariaDB column store. Due to time constraints, we were not able to fully evaluate it, but this link does a good job of comparing it to ClickHouse, which we talk about below.
PostgreSQL and TimescaleDB
PostgreSQL did better than MySQL in our load tests, especially on insert rates and graceful degradation of response times with table sizes, but not well enough for real consideration.
TimescaleDB seems promising — it is a PostgreSQL extension that uses large numbers of regular PostgreSQL tables to create a virtual table that they call hypertable. All queries and operations on the hypertable are passed down to the appropriate chunk tables. The primary purpose is to improve insert rates and generally provide predictable query times when dealing with large data sizes. TimescaleDB also throws in some time-series related features to help with analysis.
If it worked as advertised, it would be the ideal solution, but for us, it did not. The initial insert rates were promising (250K/sec), but we were not able to ingest the 35M trade records — it inexplicably kept running out of memory. We also noticed that the text file loader was not able to utilize all cores available on the servers. When ingesting data, we observed much higher IOWait times on the server than other databases, possibly because of the lack of on-disk compression. There was also very high disk space usage — the stored data took more space than fully uncompressed text data, which is unusual (preallocation perhaps?). We are aware that native compression is supported in recent builds, but we were not able to utilize it automatically for newly ingested data.
ClickHouse is the new kid on the block with nearly all of the features we could dream of. It is FOSS, blazing-fast, horizontally scalable, fault-tolerant, hardware efficient, and with advanced data management on disk (including tiered storage). The development process is very transparent, with an active community on Github, and releases come out every 2–3 weeks with new features, improvements, and fixes. The documentation is good and it is easy to get questions answered from the maintainers.
ClickHouse is primarily an OLAP engine and has no real transactional support to speak of — for example, updates and deletes of inserted data are not supported, except through an awkward asynchronous
ALTER TABLE command. It also does not support window functions (except for special cases like
runningAccumulate), which is a bit of a surprise given the heavy time-series focus.
We tested ClickHouse on a single node without any of its replication features turned on. ClickHouse was able to load the 35M trades and 719M quotes at an ingest rate of over 1M/sec. It achieves this high rate by using a unique on-disk data structure (MergeTree) to write the data to temp files as quickly as possible and then consolidating them in the background. It never ran out of memory (with one exception) and used just over half of the disk space used by the gzipped source files, which is quite efficient. As seen in the table below, ClickHouse put up impressive performance numbers for nearly all of the queries.
Alas, there are some deal-breaker issues we cannot overcome:
- The only way to issue queries is by using a SQL-like query language with some severe limitations: only a single select statement can be issued per request, and there is no support for user-defined functions (UDFs) or Stored Procedures.
- Their philosophy can be summed up as my-way-or-the-highway. There are multiple tickets where the maintainers provided dismissive responses to reasonable user requests such as support for sub-second resolution in the datetime data type. To be fair, there are legitimate reasons for some of the responses, but still, it is disconcerting to see some of these exchanges.
All in all, we still think ClickHouse has a lot of potential and we’ll keep a close eye on its evolution. We may even find a way to deploy it within a non-critical part of our system, just to be able to learn and use it.
DolphinDB is an exotic, specialized product that we were completely unaware of before this evaluation. It presents itself as a fast, distributed time-series analytical database and a viable alternative to kdb+. Coming from a kdb+ background, this piqued our interest enough for us to take it for a spin even though this is a paid product.
Our overall impressions are overwhelmingly positive. It is faster than ClickHouse and possibly even faster than kdb+, based on our prior experience. It has native support for multi-node clusters, a rich functional programming language, and optimized in-memory as well as on-disk data structures. It loaded our 35M trades into a table in just 6 sec! It performed an as-of join between all trades of SPY with their prevailing quotes in just 358 ms, and the same join for all symbols in just 25 sec! We will note that these are ultra-fast numbers (that last query takes ~5 min on kdb+). Also, the disk usage of the stored data was less than half of the gzipped source files.
Some of the advanced features (that we didn’t test) include: support for streaming and publish/subscribe, real-time aggregation/windowing engine, real-time anomaly detection engine, advanced statistical analysis functions, and machine learning functions
Despite being blown away by the product, there are still a few negatives which we have not been able to overcome:
- Cost: While it appears to be cheaper than kdb+, it is still far too expensive for us
- Non-standard language with a learning curve (though much easier than kdb+); this is mitigated by having excellent documentation
- For a business-critical component, can we really consider paying for a closed-source product that is unproven (to us) and whose limitations we can’t yet state? To add to this hesitation, there were a couple of crashes and inexplicable out-of-memory conditions, which didn’t help.
Still, it appears we may have discovered something faster and richer than kdb+, which is high praise. We will keep an eye on this one and if a strong need arises for a product with these capabilities (such as a tick data research environment), we will certainly consider it.
This brings us to our current pick — MemSQL. Yes, MemSQL is a paid product, but it does provide a free commercial license for a starter cluster, which can include up to 4-nodes, 128 GB memory, and unlimited on-disk data. We think this will be enough to get us to a point where we can consider paid products.
MemSQL categorizes itself into a new class of databases called HTAP (Hybrid transactional / analytical processing). The main selling point of MemSQL is that it offers fast analytics while having rich transactional support and being SQL compliant to the extreme. It even goes so far as to be wire-compatible with MySQL, so you can use all of the MySQL tools and drivers to work with MemSQL. This is great for integration with a vast ecosystem of tools, but it is also somewhat of a handicap, in that it is hard to represent some of the advanced analytics using pure SQL. We accepted this particular drawback due to extensive procedural language support in UDFs and Stored Procedures [caveat: the procedural approach is at least an order of magnitude slower than the usual vectorized operations].
MemSQL supports in-memory rowstore tables as well as on-disk columnstore tables with sharding, sorting, and compression (they also released a hybrid singlestore format recently). We tested exclusively with Columnstore, especially given that we only had 32GB memory on our test instances. MemSQL was among the easiest tools to work with in terms of deployment, administration, monitoring, cluster setup, and even loading and querying of data.
We were able to load our trades and quotes at over 500K records/sec. We noticed that the load process on the server was able to parallelize ingestion using multiple cores. The loaded data took about the same amount of space as our gzipped source files. We also observed that using the JDBC interface, an external tool was able to read data out from MemSQL at well over 1Gbps, which is fairly impressive.
The overall performance of most single-table queries, as well as multi-table join queries, was quite good. It didn’t do too well on the as-of joins, but it is simply not designed for that use-case. We spent many hours trying to represent an as-of join optimally in SQL until we were finally able to coerce the engine to perform a (relatively) fast MergeJoin. It is not inconceivable that the vendor could add specialized support for the as-of join as a custom operation in the future.
All-in-all, MemSQL was the most balanced solution we could find in our investigations. It is mature, easy to use, free (for now), fast, efficient, and interoperable with all the standard tools that we can hope for.
Queries can be found here: https://github.com/prerak-proof/dbtests
We know there are many other tools out there that we could be evaluating, especially the various NoSQL databases. Our general sense is that while those options may be able to handle the scale of our data, they will not be able to satisfy our performance expectations.
At least as of right now, we think MemSQL is the right product for us that fits our needs as well as our constraints. Over the next year, as we put it into production and get more experience with it, we will be sure to share our learnings through another post.
If you have questions or concerns or want me to do a follow up on anything in this post, you can reach me on Twitter: @preraksanghvi