How to efficiently store and query time-series data
… or how I learned to stop worrying and love PostgreSQL.
Last summer, when I started to build the first version of ForecastCycles, what worried me the most was: how do I organize historical data for thousands of financial symbols, possibly spanning for decades for each of them?
Apple alone has, for example, almost 40 years of history in the stock market. So what’s the best possible way to store this data, in order to minimize the retrieval time, as well as the execution time of statistical algorithms applied on said data?
Luckily, the problem of storing and querying time-series data is very frequent today and there are many solutions available on the market. But… how do you choose the one that fits your needs? SQL or NoSQL? Let’s find out!
When developing such a system, you need to ask yourself these questions:
- What is my read/write ratio?
- What kind of reads and writes are going to be performed?
- How big do I expect my data to be at first, and at what pace will it grow?
When developing ForecastCycles, I knew that reads would outnumber writes by a huge factor, and that queries would be complex, time-based and with statistics involved. Writes on the other hand would be batched on a daily basis, after the financial markets close time. Finally, from a rough estimate, I expected the dataset to start with 10M data points and grow up to 100M in a couple years.
With that in mind, I started my journey for the best possible fit.
Solution 1: InfluxDB
InfluxData is a company that offers a set of tools for time-series data memorization, visualization, etc… (TICK stack). They are available as open source products, so that you can host them yourself and customize them to fit your needs, or with a commercial license that offers a managed service, with support for clustering and more enterprise features.
The core component of the stack is InfluxDB, a NoSQL database that is said to be a perfect fit for applications which require high availability and high data ingestion rate.
So I went on and set up a single node instance on a medium-sized server, and started to benchmark the initial import phase of financial data, consisting of thousands of CSV files with daily resolution.
The results were… baffling to say the least. Many insertions were failing, the InfluxDB node was crashing over and over, and waiting time was growing big.
At first I thought it was a problem with the wrapper I created using the
node-influx library, which maybe had some performance bottlenecks. So I tried
csv-to-influxdb, a Go library created for the exact purpose of bulk insertion of data in CSV form.
It barely changed anything.
A solution may have been to increase the node capabilities, or possibly to set up a cluster of nodes. However, the truth is that InfluxDB really shines when collecting metrics and events fired from various locations (think IOT), not to hold business-critical data.
Lesson learned: use a NoSQL solution for storing time-series data when you care about ingesting speed and throughput as much as possible, and for data that is not at the core of your application.
Solution 2: TimescaleDB
It basically organizes and indexes data in chunks of time, called buckets, which allows for really fast retrieval when performing range queries. Tables made this way are called hypertables.
Being still in beta when I tried it, I was a bit skeptical about Timescale. However, it looked too good of a match for my requirements so I decided to give it a try.
The setup experience was really pleasant: all I had to do was to install the extension on my database, and call
create_hypertable on the vanilla SQL tables I wanted to add the bucket index on.
Then I proceeded on the bulk import of CSV files, using the wonderful
pg-promise library. And again, I found an unexpected result: the underlying indexes Timescale was creating were so heavy, that waiting times on insertions had become huge.
So I decided to skip the NodeJS wrapper and went on using a Go tool that Timescale developed for ingesting a lot of data. Luckily the process sped up by orders of magnitude, but I started wondering if the bottleneck represented by all those indexes was actually worth it.
Furthermore, at the time of my tests, Timescale did not support any kind of Foreign Key constraint on hypertables, so it was a big no-no for me as referential integrity is one of the main selling points of SQL databases.
Eventually I decided to ditch Timescale and look for other solutions. However, I want to praise their wonderful Slack support and their community, which has evolved a lot since last year. Plus, Timescale now support FKs on hypertables!
Lesson learned: SQL is really powerful and fast, but you need to be careful when placing bets on software in beta state.
Solution 3: PostgreSQL
My journey seemed to come to an end when I tested out the good ol’ PostgreSQL, without any fancy addon or extension.
This ended up being the final schema of the table that holds all the financial data. As you can see, it’s really simple and straightforward: a composite index on both ticker (which references the actual financial symbol) and time, in order to minimize range table scans for specific symbols.
PostgreSQL burned through the tens of millions of records I tried to load using the wonderful
COPY command, with no kind of performance hit. It now has a central role in the application’s backend, performing all the hard queries and returning the results of the computations to the thin API server, which most of the time just embellishes PostgreSQL data and serves it to the clients.
PostgreSQL turned out to be a pretty solid choice as a general purpose database, which means that both customers data and financial time-series data live in the same database, with strong guarantees of referential integrity.
ForecastCycles is a SaaS built with React, Semantic-UI and PostgreSQL. We offer cyclical and statistical analyses on a wide variety of financial symbols, as well as personal coaching for getting the most out of the markets.