Time-Series Database Performance — Testing 7 Alternatives: ClickHouse vs. InfluxDB vs. PostgreSQL vs. Parquet (S3) vs. MongoDB vs. DuckDB vs. kdb+.

Everton Kozloski
5 min read · Jun 24, 2023


Introduction:

Driven by the need to build two machine learning models for trading forecasting, one based on decision-tree algorithms and the other on reinforcement learning, I wanted to improve the inference time of both models so that the system would perform as close as possible to a real-time application. Several peculiarities of this pipeline needed to be analyzed, from a suitable kernel, hardware, and scripting language to one of the first I ran into: choosing an efficient database that met certain criteria:

  • Primarily: the highest combined write and read speed for persisting a table held in a pandas DataFrame, captured (scraped) by a Python script (later to be reimplemented in Go), to disk storage; the stored data serves both as the ETL source for building the models and as inference data;
  • Be a dedicated time-series database: storage organized around time/value keys (timestamps);
  • Preferably a non-relational database that is also focused on analytics, since the data will be stored in tabular form only; solutions built for this purpose outperform the transactional databases commonly used in the market;
  • Good concurrency control, both for scalability and because reading and writing will be continuous and must run concurrently;
  • Offer a good on-premises deployment;
  • Be free or open source.

Selection:

Based on these assumptions, 7 candidates were chosen:
- ClickHouse
- InfluxDB
- PostgreSQL
- Parquet (on MinIO, an S3-compatible object store)
- DuckDB
- MongoDB
- kdb+

Two of these options deserve a note. PostgreSQL is not a time-series database, but it was included for comparison because the benchmark script uses the COPY strategy, which writes a whole table into the database much faster than the commonly used row-by-row transactional INSERT (see the sketch below). kdb+, in my view, would be the ideal option for this particular case, meeting (and exceeding) every specification without even needing a benchmark; however, its annual licensing price of US$ 100,000.00 puts it outside the scope of a project this size, even though it has a free version for non-commercial purposes, which is what this benchmark relies on.
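As an illustration of the COPY strategy (a minimal sketch, not the exact benchmark script), the captured pandas DataFrame `df` can be streamed into PostgreSQL through psycopg2's copy_expert; the table name `ticks` and the connection string are assumptions:

```python
import io
import psycopg2

# serialize the DataFrame to an in-memory CSV buffer
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# stream the buffer straight into the table with COPY,
# bypassing per-row INSERT statements entirely
conn = psycopg2.connect("dbname=bench user=postgres")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY ticks FROM STDIN WITH (FORMAT csv)", buf)
conn.close()
```

A single COPY moves the data in one round trip and lets the server parse the stream in bulk, which is where the speedup over transactional inserts comes from.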

System Specs:

OS: Proxmox VE 7.3 (Debian)
CPU: 4 x AMD A10-7860K Radeon R7, 12 compute cores (4C+8G)
RAM: 16 GB (2 x 8 GB Patriot DDR3 1600 MHz)
Guests: 7 LXC Alpine/Debian containers, each with 2 GB RAM and 2 cores reserved, gigabit network
Storage: Samsung 860 EVO 250 GB SSD

Scope and Conduct of the Tests:

The test, available in the repository mentioned below, consists of loading a CSV file (out.csv) with 9 columns and 1,000,000 rows, totaling 9 million data points. It contains six months of past Euro/Dollar (EURUSD) price variation, scraped into a DataFrame of the pandas data analysis library; the Python code was run in the Jupyter notebook available in the same repository, and all of the code is available in the following repository on GitHub:
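For reference, loading the file into pandas happens once, outside the timed sections:

```python
import pandas as pd

# load the scraped EUR/USD history; this step is never timed
df = pd.read_csv("out.csv")
print(df.shape)  # expected: (1000000, 9)
```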

Keep in mind that operations such as loading the data were performed outside the read and write functions, so only the time spent writing to or reading from the target database was measured.
The time for each operation, split into reading and writing, was recorded with Python's timeit library:
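A minimal sketch of the timing pattern (not the exact benchmark code): `insert_df_into_db` and `select_all_from_db` are hypothetical stand-ins for each database's own write and read routines.

```python
import timeit

timings = {}  # database name -> (write seconds, read seconds)

def timed(fn):
    # wall-clock duration of fn() using timeit's monotonic timer
    start = timeit.default_timer()
    fn()
    return timeit.default_timer() - start

# hypothetical stand-ins for each target's write and read routines
write_s = timed(lambda: insert_df_into_db(df))
read_s = timed(lambda: select_all_from_db())
timings["ClickHouse"] = (write_s, read_s)
```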

Running the code to the end produces a table with the time taken for each insert and select (rounded to two decimal places), presented in the following table as, respectively: write time, read time, and the total (their sum).

Finally, we render the results as a chart:
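The chart itself was an image in the original post; as a sketch, assuming the `timings` dictionary collected above, it could be produced like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# build a results table from the collected timings
results = pd.DataFrame(timings, index=["write", "read"]).T
results["total"] = results["write"] + results["read"]

results[["write", "read"]].plot.bar(stacked=True)
plt.ylabel("time (s)")
plt.title("Write vs. read time per database")
plt.tight_layout()
plt.show()
```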

And we filter the top 4 results:
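Again as a sketch, reusing the `results` table built above:

```python
# keep only the four databases with the lowest total time
top4 = results.sort_values("total").head(4)
top4[["write", "read"]].plot.bar(stacked=True)
plt.ylabel("time (s)")
plt.show()
```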

Then an additional chart: the result of manually collecting the database properties in DBeaver (https://dbeaver.io/), comparing the size occupied after all inserts were finished; the first entry is the size of the original CSV on disk, in megabytes:

The Choice and Final Considerations:

The choice would seem obvious if we only took speed into account, but there are a few more points to weigh. Even though DuckDB is perhaps the best current tool for performing ETL, and is OLAP-oriented, it is still similar to SQLite in that it stores everything in a single file, which does not give me much confidence in data reliability; the same goes for the Parquet format used with Amazon's S3. I believe performance would destabilize once simultaneous concurrency comes into play, given the scalability demands of a synchronous system that needs high 24/7 availability. DuckDB supports concurrency through MVCC (Multi-Version Concurrency Control), but that concurrency is optimized unilaterally for either reads or writes, as explained in the DuckDB FAQ: https://duckdb.org/faq.html.

With kdb+ already disqualified by its high cost, the choice fell on ClickHouse, both because it meets the criteria laid out at the beginning and because of its efficient compression rate (a 123 MB CSV occupied only 20 MB in the database). In truth, it was my choice even before carrying out the project, but I decided it was necessary to settle the doubt, and in the process I learned a lot about the other options. I do not consider any of the options strictly better than the others; each has its specific use in the context for which it was created.
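For completeness, a minimal sketch of writing the DataFrame to ClickHouse with the clickhouse-connect driver and then checking the compressed footprint; a local server and a `ticks` table matching the DataFrame's nine columns are assumptions:

```python
import clickhouse_connect

# connect to a local ClickHouse server (assumed already running,
# with a `ticks` table matching the DataFrame's columns)
client = clickhouse_connect.get_client(host="localhost")
client.insert_df("ticks", df)

# compressed on-disk size, as reported by ClickHouse itself
row = client.query(
    "SELECT formatReadableSize(total_bytes) "
    "FROM system.tables WHERE name = 'ticks'"
).result_rows
print(row)
```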

ClickHouse is the database used by the Russian search engine Yandex, a direct competitor of Google, and was built for metrics analytics. The project's source code was published as open source in 2016.

Reference:

Links to Databases used:
https://clickhouse.com/
https://www.influxdata.com/
https://duckdb.org
https://www.mongodb.com/pt-br
https://www.postgresql.org/
https://min.io/
https://kx.com/


Everton Kozloski

Enthusiast in #AI, #MachineLearning and #DataScience, especially in #embedded devices, #IoT and #CyberSecurity