Measuring vertical scalability for time series databases in Google Cloud

Aliaksandr Valialkin
Apr 10 · 8 min read

Time series databases usually have free single-node versions, which cannot scale to multiple nodes. Their performance is limited by a single computer. Modern machines may have hundreds of CPU cores and terabytes of RAM. Let’s test vertical scalability for InfluxDB, TimescaleDB and VictoriaMetrics on standard machine types in Google Cloud.

Setup

The following TSDB docker images were used in the test:

InfluxDB and VictoriaMetrics were run with default configs, while TimescaleDB config has been tweaked using this script for each n1-standard machine type in Google Cloud. This script tunes Postgresql for the available RAM on each machine type.

Time Series Benchmark Suite (TSBS) was used for the test. The following command has been used for generating test data:

./tsbs_generate_data -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-log-interval=10s \
-format=${TSDB_TYPE} | gzip > ${TSDB_TYPE}-data-40k.gz

The data has been generated for the following values:

  • influx
  • timescaledb

VictoriaMetrics has been fed with data, since it supports Influx line protocol.

The generated data consists of 40K unique time series. Each time series contains 25920 data points. The summary number of data points is 25920*40K=1036800000. This number may be rounded to a billion.

The following n1-standard machine types were used for both client and server machines:

  • n1-standard-1: 1vCPU, 3.75GB RAM
  • n1-standard-2: 2vCPU, 7.5GB RAM
  • n1-standard-4: 4vCPU, 15GB RAM
  • n1-standard-8: 8vCPU, 30GB RAM
  • n1-standard-16: 16vCPU, 60GB RAM
  • n1-standard-32: 32vCPU, 120GB RAM
  • n1-standard-64: 64vCPU, 240GB RAM

We’ll see later the difference between vCPU and CPU.

Time series databases were configured to store data on a dedicated 2TB zonal standard persistent disk. The disk has the following characteristics:

  • IOPS: 1500 read, 3000 write
  • Throughput: 240MB/s read, 240MB/s write
  • Price: $80 / month

Data ingestion

The data ingestion tests were run with config on the client side. This guaranteed full CPU utilization on the TSDB side.

Ingestion rate, thousands of data points / sec (higher is better)

The following notable things are visible on the chart:

  • Sub-optimal scalability between and machine types. The ingestion performance grew only by 1.6x-1.8x when switching from 1vCPU to 2vCPU for all the competitors. This is related to hyper-threaded vCPUs, which aren’t real CPU cores. Read more about hyper-threading pros and cons on Wikipedia. The following quote is from the official Google Cloud docs:
  • Almost linear vertical scalability for VictoriaMetrics: the ingestion performance scales from 800K data points / sec on 2vCPU machine to 19M data points / sec on 64vCPU machine.
  • Sub-linear vertical scalability for InfluxDB. It scales from 320K data points / sec on 2vCPU machine to 4.4M data points / sec on 64vCPU machine.
  • TimescaleDB reaches scalability limit at 2.3M data points /sec on machine and doesn’t scale further.
  • VictoriaMetrics outperforms both InfluxDB and TimescaleDB for each machine type. The performance gap reaches 19100/4410=4.3 times for InfluxDB and to 19100/2220=8.6 times for TimescaleDB on machine type. This means a single-node VictoriaMetrics may substitute moderately sized cluster built with InfluxDB or TimescaleDB. Put it another way, single-node VictoriaMetrics saves infrastructure costs additionally to licensing costs for clustered versions.

Why TimescaleDB didn’t scale on machines with more than 16vCPUs? The following chart answers the question:

Disk write bandwidth usage, MB/s (lower is better)

As you can see, TimescaleDB writes a lot to persistent disk comparing to competitors and reaches write bandwidth limit — 240MB/s — on machine. I have no idea why TimescaleDB exceeds the limit and reaches 260MB/s instead of 240MB/s.

The chart contains other notable info:

  • Both VictoriaMetrics and InfluxDB use much smaller amount of disk write bandwidth comparing to TimescaleDB. This means they may use cheaper disks with lower IO bandwidth comparing to TimescaleDB.
  • VictoriaMetrics has the best optimization for disk IO bandwidth usage. It uses only a half of the available disk IO bandwidth on while accepting 19M data points per second.

The final on-disk data sizes after ingesting 1B of test data points shows that TimescaleDB’s disk usage is far from optimal:

Let’s calculate how many data points each TSDB may store on a 2TB disk using data from the graph above:

  • VictoriaMetrics: 2TB/0.377(Bytes/data point) = 5.3 trillions
  • InfluxDB: 2TB/0.566(Bytes/data point) = 3.5 trillions
  • TimescaleDB: 2TB/29(Bytes/data point) = 69 billions

TimescaleDB requires 77 (seventy seven) times more disk space for storing the same amount of data points comparing to VictoriaMetrics. In other words, it requires disk space costing $80*77=$6160/month for storing the same amount of data as VictoriaMetrics with a disk costing $80/month.

While 69 billions of data points look really big from the first sight, this isn’t true. Let’s calculate how many data points are collected from a fleet of 100 nodes with node_exporter installed on each node. Node_exporter exports 500 time series on average, so 100 nodes would export 100*500=50K unique time series. Let’s suppose these time series are scraped with 10s interval. This results in an ingestion stream of 5K data points per second. A month-worth data would contain 5K*3600*24*30=13 billions of data points. So, TimescaleDB would fill the entire 2TB storage in 69/13=5 months.

Both VictoriaMetrics and InfluxDB store data in LSM trees, while TimescaleDB relies on data storage from Postgresql. Charts above clearly show that LSM trees are better suited for storing time series data comparing to general-purpose storage provided by Postgresql.

Querying

TSBS queries may be split into two groups:

  • “Instant” queries, which are performed in less than 100ms. Such queries have little opportunity to scale on machines with higher number of CPUs and RAM, so their results look similar. The only notable thing is extremely slow query performance on TimescaleDB if the queried data is missing in the OS page cache. “Extremely slow” means 100x-1000x slower comparing to the case when the queried data is in the OS page cache. This means that TimescaleDB scatters the queried data across the entire disk, so many I/O operations must be performed for gathering all this data from the disk.
  • “Heavy” queries, which usually require more than a second for execution. Such queries have good opportunities to scale with the number of CPUs, so let’s stick to a query from this group — . This query scans 12 hours of data for 4K of time series. Each hour contains 360 data points, so the query must scan at least 360*12*4K=17.3M data points.

The following chart shows rpm (requests per minute) for queries serially performed by a single client worker:

double-groupby-1, single client, rpm (higher is better)

Interesting things from the chart:

  • InfluxDB has poor vertical scalability for “heavy” queries, while TimescaleDB and VictoriaMetrics have better scalability for such queries.
  • TimescaleDB performs poorly on , and , because the stored data doesn’t fit OS page cache — machine has only 15GB of RAM, while TimescaleDB’s data occupies 29GB on disk. This translates to heavy disk IO.
  • VictoriaMetrics shows the best vertical scalability with the number of CPU cores — from 62 rpm to 283 rpm, 4.5x scalability.
  • VictoriaMetrics outperforms both contenders on “heavy” queries:
    * InfluxDB by up to 23x
    * TimescaleDB by up to 9x
    Put it another way, VictoriaMetrics performs the query in 0.2s, while InfluxDB performs the same query in 4.9s on machine.
  • Performance for “heavy” queries stops scaling starting from . This may be related to NUMA nodes, where different CPU cores have different latencies when accessing the same RAM regions.

The next chart shows the maximum possible rps for query. The corresponding test runs concurrent client workers, where is the number of vCPUs on the server. This results in full CPU utilization on the server.

double-groupby-1, clients=vCPUs, rpm (higher is better)

The chart shows good vertical scalability for all the contenders until . TimescaleDB scales further on , while InfluxDB and VictoriaMetrics have no gains in the maximum query bandwidth when switching from to . It is likely TimescaleDB has better optimizations for NUMA nodes. But VictoriaMetrics without NUMA-aware optimizations serves 6.5x more queries comparing to TimescaleDB even on machine.

Conclusions

  • Modern time series databases have decent vertical scalability for both data ingestion and querying.
  • TimescaleDB quickly reaches disk bandwidth limit. The limit may be lifted by using more expensive disks with higher read / write bandwidth such as high-end SSDs.
  • TimescaleDB requires much more storage space comparing to VictoriaMetrics and InfluxDB for storing the same amount of data points. The most expensive part of the long-term storage for huge amounts of time series data is disk space. It is unclear how TimescaleDB deals with this issue.
  • TimescaleDB may perform poorly on queries touching data missing in the OS page cache, since it looks like it scatters data for a single time series across the entire disk.
  • VictoriaMetrics provides the best vertical scalability for both data ingestion and querying. This means that a free single-node VictoriaMetrics instance may easily substitute decent cluster built with InfluxDB or TimescaleDB. This saves money on both infrastructure and license costs.
  • Google Cloud provides good vertically scalable machines and durable disk storage with consistently high IOPS and read / write bandiwdth, which are suitable for modern time series databases.

Vertical scalability is limited by per-machine resources — CPU, RAM, storage or network. Users have to switch to cluster solutions when single-node solution reaches scalability limits. TimescaleDB, InfluxDB and VictoriaMetrics provide paid cluster solutions for users that outgrow a single node. Guess which cluster solution is the most cost-effective from infrastructure and licensing point of view :)

Update: VictoriaMetrics is open source now!

Aliaksandr Valialkin

Written by

Founder and core developer at VictoriaMetrics