Measuring vertical scalability for time series databases in Google Cloud

Time series databases usually have free single-node versions, which cannot scale to multiple nodes. Their performance is limited by a single computer. Modern machines may have hundreds of CPU cores and terabytes of RAM. Let’s test vertical scalability for InfluxDB, TimescaleDB and VictoriaMetrics on standard machine types in Google Cloud.

Setup

The following TSDB docker images were used in the test:

InfluxDB and VictoriaMetrics were run with default configs, while TimescaleDB config has been tweaked using this script for each n1-standard machine type in Google Cloud. This script tunes Postgresql for the available RAM on each machine type.

Time Series Benchmark Suite (TSBS) was used for the test. The following command has been used for generating test data:

./tsbs_generate_data -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-log-interval=10s \
-format=${TSDB_TYPE} | gzip > ${TSDB_TYPE}-data-40k.gz

The data has been generated for the following $TSDB_TYPE values:

  • influx
  • timescaledb

VictoriaMetrics has been fed with $TSDB_TYPE=influx data, since it supports Influx line protocol.

The generated data consists of 40K unique time series. Each time series contains 25920 data points. The summary number of data points is 25920*40K=1036800000. This number may be rounded to a billion.

The following n1-standard machine types were used for both client and server machines:

  • n1-standard-1: 1vCPU, 3.75GB RAM
  • n1-standard-2: 2vCPU, 7.5GB RAM
  • n1-standard-4: 4vCPU, 15GB RAM
  • n1-standard-8: 8vCPU, 30GB RAM
  • n1-standard-16: 16vCPU, 60GB RAM
  • n1-standard-32: 32vCPU, 120GB RAM
  • n1-standard-64: 64vCPU, 240GB RAM

We’ll see later the difference between vCPU and CPU.

Time series databases were configured to store data on a dedicated 2TB zonal standard persistent disk. The disk has the following characteristics:

  • IOPS: 1500 read, 3000 write
  • Throughput: 240MB/s read, 240MB/s write
  • Price: $80 / month

Data ingestion

The data ingestion tests were run with --workers=2*vCPUs config on the client side. This guaranteed full CPU utilization on the TSDB side.

Image for post
Image for post
Ingestion rate, thousands of data points / sec (higher is better)

The following notable things are visible on the chart:

  • Sub-optimal scalability between n1-standard-1 and n1-standard-2 machine types. The ingestion performance grew only by 1.6x-1.8x when switching from 1vCPU to 2vCPU for all the competitors. This is related to hyper-threaded vCPUs, which aren’t real CPU cores. Read more about hyper-threading pros and cons on Wikipedia. The following quote is from the official Google Cloud docs:
    vCPU is implemented as a single hardware hyper-thread
  • Almost linear vertical scalability for VictoriaMetrics: the ingestion performance scales from 800K data points / sec on 2vCPU machine to 19M data points / sec on 64vCPU machine.
  • Sub-linear vertical scalability for InfluxDB. It scales from 320K data points / sec on 2vCPU machine to 4.4M data points / sec on 64vCPU machine.
  • TimescaleDB reaches scalability limit at 2.3M data points /sec on n1-standard-16 machine and doesn’t scale further.
  • VictoriaMetrics outperforms both InfluxDB and TimescaleDB for each machine type. The performance gap reaches 19100/4410=4.3 times for InfluxDB and to 19100/2220=8.6 times for TimescaleDB on n1-standard-64 machine type. This means a single-node VictoriaMetrics may substitute moderately sized cluster built with InfluxDB or TimescaleDB. Put it another way, single-node VictoriaMetrics saves infrastructure costs additionally to licensing costs for clustered versions.

Why TimescaleDB didn’t scale on machines with more than 16vCPUs? The following chart answers the question:

Image for post
Image for post
Disk write bandwidth usage, MB/s (lower is better)

As you can see, TimescaleDB writes a lot to persistent disk comparing to competitors and reaches write bandwidth limit — 240MB/s — on n1-standard-8 machine. I have no idea why TimescaleDB exceeds the limit and reaches 260MB/s instead of 240MB/s.

The chart contains other notable info:

  • Both VictoriaMetrics and InfluxDB use much smaller amount of disk write bandwidth comparing to TimescaleDB. This means they may use cheaper disks with lower IO bandwidth comparing to TimescaleDB.
  • VictoriaMetrics has the best optimization for disk IO bandwidth usage. It uses only a half of the available disk IO bandwidth on n1-standard-64 while accepting 19M data points per second.

The final on-disk data sizes after ingesting 1B of test data points shows that TimescaleDB’s disk usage is far from optimal:

Image for post
Image for post

Let’s calculate how many data points each TSDB may store on a 2TB disk using data from the graph above:

  • VictoriaMetrics: 2TB/0.377(Bytes/data point) = 5.3 trillions
  • InfluxDB: 2TB/0.566(Bytes/data point) = 3.5 trillions
  • TimescaleDB: 2TB/29(Bytes/data point) = 69 billions

TimescaleDB requires 77 (seventy seven) times more disk space for storing the same amount of data points comparing to VictoriaMetrics. In other words, it requires disk space costing $80*77=$6160/month for storing the same amount of data as VictoriaMetrics with a disk costing $80/month.

While 69 billions of data points look really big from the first sight, this isn’t true. Let’s calculate how many data points are collected from a fleet of 100 nodes with node_exporter installed on each node. Node_exporter exports 500 time series on average, so 100 nodes would export 100*500=50K unique time series. Let’s suppose these time series are scraped with 10s interval. This results in an ingestion stream of 5K data points per second. A month-worth data would contain 5K*3600*24*30=13 billions of data points. So, TimescaleDB would fill the entire 2TB storage in 69/13=5 months.

Both VictoriaMetrics and InfluxDB store data in LSM trees, while TimescaleDB relies on data storage from Postgresql. Charts above clearly show that LSM trees are better suited for storing time series data comparing to general-purpose storage provided by Postgresql.

Querying

TSBS queries may be split into two groups:

  • “Instant” queries, which are performed in less than 100ms. Such queries have little opportunity to scale on machines with higher number of CPUs and RAM, so their results look similar. The only notable thing is extremely slow query performance on TimescaleDB if the queried data is missing in the OS page cache. “Extremely slow” means 100x-1000x slower comparing to the case when the queried data is in the OS page cache. This means that TimescaleDB scatters the queried data across the entire disk, so many I/O operations must be performed for gathering all this data from the disk.
  • “Heavy” queries, which usually require more than a second for execution. Such queries have good opportunities to scale with the number of CPUs, so let’s stick to a query from this group — double-groupby-1. This query scans 12 hours of data for 4K of time series. Each hour contains 360 data points, so the query must scan at least 360*12*4K=17.3M data points.

The following chart shows rpm (requests per minute) for double-groupby-1 queries serially performed by a single client worker:

Image for post
Image for post
double-groupby-1, single client, rpm (higher is better)

Interesting things from the chart:

  • InfluxDB has poor vertical scalability for “heavy” queries, while TimescaleDB and VictoriaMetrics have better scalability for such queries.
  • TimescaleDB performs poorly on n1-standard-1, n1-standard-2 and n1-standard-4, because the stored data doesn’t fit OS page cache — n1-standard-4 machine has only 15GB of RAM, while TimescaleDB’s data occupies 29GB on disk. This translates to heavy disk IO.
  • VictoriaMetrics shows the best vertical scalability with the number of CPU cores — from 62 rpm to 283 rpm, 4.5x scalability.
  • VictoriaMetrics outperforms both contenders on “heavy” queries:
    * InfluxDB by up to 23x
    * TimescaleDB by up to 9x
    Put it another way, VictoriaMetrics performs the query in 0.2s, while InfluxDB performs the same query in 4.9s on n1-standard-64 machine.
  • Performance for “heavy” queries stops scaling starting from n1-standard-32. This may be related to NUMA nodes, where different CPU cores have different latencies when accessing the same RAM regions.

The next chart shows the maximum possible rps for double-groupby-1 query. The corresponding test runs N concurrent client workers, where N is the number of vCPUs on the server. This results in full CPU utilization on the server.

Image for post
Image for post
double-groupby-1, clients=vCPUs, rpm (higher is better)

The chart shows good vertical scalability for all the contenders until n1-standard-32. TimescaleDB scales further on n1-standard-64, while InfluxDB and VictoriaMetrics have no gains in the maximum query bandwidth when switching from n1-standard-32 to n1-standard-64. It is likely TimescaleDB has better optimizations for NUMA nodes. But VictoriaMetrics without NUMA-aware optimizations serves 6.5x more queries comparing to TimescaleDB even on n1-standard-64 machine.

Conclusions

  • Modern time series databases have decent vertical scalability for both data ingestion and querying.
  • TimescaleDB quickly reaches disk bandwidth limit. The limit may be lifted by using more expensive disks with higher read / write bandwidth such as high-end SSDs.
  • TimescaleDB requires much more storage space comparing to VictoriaMetrics and InfluxDB for storing the same amount of data points. The most expensive part of the long-term storage for huge amounts of time series data is disk space. It is unclear how TimescaleDB deals with this issue.
  • TimescaleDB may perform poorly on queries touching data missing in the OS page cache, since it looks like it scatters data for a single time series across the entire disk.
  • VictoriaMetrics provides the best vertical scalability for both data ingestion and querying. This means that a free single-node VictoriaMetrics instance may easily substitute decent cluster built with InfluxDB or TimescaleDB. This saves money on both infrastructure and license costs.
  • Google Cloud provides good vertically scalable machines and durable disk storage with consistently high IOPS and read / write bandiwdth, which are suitable for modern time series databases.

Vertical scalability is limited by per-machine resources — CPU, RAM, storage or network. Users have to switch to cluster solutions when single-node solution reaches scalability limits. TimescaleDB, InfluxDB and VictoriaMetrics provide paid cluster solutions for users that outgrow a single node. Guess which cluster solution is the most cost-effective from infrastructure and licensing point of view :)

Update: VictoriaMetrics is open source now!

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store