VictoriaMetrics, a stress-free Prometheus Remote Storage for 1 Billion metrics

Charles-Antoine Mathieu
Criteo R&D Blog
Sep 19, 2023

Starting point

Back in 2020, our homegrown centralized timeseries storage solution, BigGraphite, started to show the usual signs of aging. Could a more modern alternative centralize our metrics for a fraction of the infrastructure cost and operational burden while being more reliable, performant, and feature-rich?

At that time, our scale involved 1500+ Prometheus instances across eight datacenters worldwide, for about 1 billion active timeseries at a resolution of one datapoint per minute, with the goal of being able to ingest at least twice as much.

Series: sum(max_over_time(prometheus_tsdb_head_series[30d])) / 1e6 => 1000
Churn: avg_over_time(sum(rate(prometheus_tsdb_head_series_created_total[5m]))[30d:]) => 3000/s

Contenders

In the ever-evolving landscape of Observability, finding the perfect timeseries solution can be a daunting task. At that time, we surveyed the available open-source solutions, and these looked like the most viable options:

  1. Cortex: Horizontally scalable, highly available, multi-tenant, long term storage for Prometheus. (v1.4)
  2. VictoriaMetrics: Simple & Reliable Monitoring That Scales (v1.44)
  3. Thanos: Open source, highly available Prometheus setup with long term storage capabilities. (v0.16)
  4. TimescaleDB: PostgreSQL++ for time series and events
  5. M3DB: Open Source Metrics Engine (v0.15)
  6. Warp10: The most advanced/flexible/efficient/awesome/scalable Time Series platform
  7. InfluxDB: It’s About Time
  8. OpenTSDB: The Scalable Time Series Database

We first considered their suitability for our technical ecosystem and requirements. After careful evaluation, we eliminated InfluxDB (open source) and OpenTSDB due to concerns about their ability to scale. Thanos was ruled out due to concerns about read latency, which we will discuss in more detail later in this post. Warp10 and TimescaleDB didn't align so well with our existing tech stack. We also experimented with M3DB, but it was still pre-1.0 and we found it relatively immature at that time.

Consequently, we decided to focus our benchmarking efforts on Cortex, which emerged as the top contender, and VictoriaMetrics, which was “the dark horse” in our selection process.

Performance Benchmark

The performance of timeseries databases can vary significantly depending on factors such as dataset characteristics (cardinality, number of labels, etc.). To evaluate the candidate solutions, we conducted extensive performance benchmarking using a substantial amount of production metrics. To ensure accuracy and consistency, we used Thanos as a reference to systematically cross-compare the results from the tested solutions with the original data in Prometheus.

As Thanos retrieves and aggregates data from Prometheus instances across the globe, the intercontinental round-trip time (RTT) imposed a lower limit of approximately 300ms on its response time. While this latency fell short of our desired performance for user-facing queries, we were pleased to find that the solution performed admirably overall.

To conduct our benchmarking, we developed a specialized tool and executed a diverse range of query patterns. These included instant queries versus range queries, various time ranges (e.g., last 5 minutes, last 1 hour, last 6 hours, last 1 day, etc.), random offsets in the past, different levels of metric cardinality, aggregated metrics (such as sum or top-k), and different levels of parallelism. These tests ran daily for several weeks, generating comprehensive data for analysis.
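For illustration, the query shapes we replayed looked like the expressions below, evaluated both as instant and as range queries over various windows and offsets (the metric names are hypothetical placeholders, not our actual benchmark suite):

# Low-cardinality aggregated metric over the last 5 minutes
sum(rate(app_requests_total{datacenter="par"}[5m]))

# Same kind of expression with a random offset in the past, to avoid hitting only recently cached data
sum by (application) (rate(app_requests_total[5m] offset 2d))

# Top-k over a high-cardinality, instance-level metric
topk(10, max_over_time(node_memory_Active_bytes[1h]))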

The benchmark was conducted around fall 2020, and the results consistently demonstrated a clear pattern: VictoriaMetrics exhibited faster response times and maintained a shorter, more stable tail latency. Executing the same query twice also consistently yielded similar completion times with VictoriaMetrics, whereas Cortex response times displayed more entropy. Moreover, we achieved these favorable outcomes without needing to fine-tune any configuration parameters for VictoriaMetrics, whereas Cortex demanded significant effort to optimize performance.

Impact of series cardinality on response time

It is important to note that the benchmarks were conducted quite some time ago, and the Observability landscape has undergone rapid evolution since then. Both the Cortex/Mimir and VictoriaMetrics teams have released numerous exciting improvements to their projects. Cortex transitioned from a Cassandra backend to S3, the Cortex and Thanos codebases were partially merged, and Grafana Labs introduced Mimir, about which we have heard several success stories since. Therefore, if you are seeking a timeseries database solution, we highly recommend performing your own due diligence and benchmarking the available options using a representative dataset.

Other considerations

The VictoriaMetrics architecture is simpler than Cortex/Mimir's. As we were already fatigued by the complexity of BigGraphite, its streamlined design was a significant advantage for us. Its simplicity translated into easy setup, maintenance, and troubleshooting.

VictoriaMetrics architecture
Cortex architecture

Our deployment

When building a scalable timeseries solution, one crucial consideration is the cost involved. VictoriaMetrics’ impressive compression performance enables us to pack a significant number of datapoints in minimal disk space, averaging around 1 byte per datapoint.
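As a rough back-of-the-envelope illustration of what that ratio means at our scale (numbers rounded):

1e9 active series x 1440 datapoints/day ≈ 1.44e12 datapoints/day
1.44e12 datapoints x ~1 byte ≈ ~1.4 TB of disk per day, per copy, before replication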

A common approach to reducing the cost of timeseries storage is downsampling, which involves aggregating larger intervals of data (such as an hour or a day) into a single datapoint, typically computed as an average. While downsampling is a highly effective technique used by Graphite, it does result in smoothed-out graphs due to the averaging process, and graphs can look different depending on the time window you are looking at. Maintaining a distribution for each interval would mitigate these issues, but it adds complexity and significantly increases the number of timeseries to index, so we decided not to go in that direction.
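To give a feel for the smoothing effect, the two expressions below (on a hypothetical latency metric) can tell very different stories over the same hour, the average hiding short spikes that a max would preserve:

avg_over_time(request_duration_seconds:p99[1h]) => smoothed, spikes averaged away
max_over_time(request_duration_seconds:p99[1h]) => keeps the worst minute of the hour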

To be able to display graphs without having to query tens or hundreds of thousands of timeseries each time, our Prometheus instances already perform aggregations per datacenter, application and pool (prod, canary, dev, …). That has the benefit of reducing cardinality by the number of instances.

Instance metrics are really useful only as long as the instance lives, and a bit after for troubleshooting; that's why we decided to keep those metrics for only 7 days. Aggregated metrics, on the other hand, are relevant at the service level, and keeping them for 90 days was deemed enough for almost all live-monitoring use cases. Lastly, for some use cases like SLA/SLO/SLI and long-term trends it can be useful to have a longer retention, so we set up an extra cluster with a retention of 1+ year. For all other use cases, especially Business Intelligence, Criteo possesses a best-in-class data analytics infrastructure that will be happy to ingest and process any volume of data with no cardinality limits.
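As an illustration, the per-datacenter/application/pool aggregates mentioned above can be produced with Prometheus recording rules along these lines (metric and label names are hypothetical, not our actual rules):

groups:
- name: per_service_aggregation
  rules:
  # one series per (datacenter, application, pool) instead of one per instance
  - record: datacenter_application_pool:app_requests:rate5m
    expr: sum by (datacenter, application, pool) (rate(app_requests_total[5m]))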

By keeping low-value, high-cardinality and high-churn metrics only for a shorter period, we were able to cut the costs of the infrastructure by more than an order of magnitude.

Regarding high availability (HA), our VictoriaMetrics cluster is resilient to a single node failure due to its replication factor of 2. This means that the ingested data is always written to two vmstorage nodes; in the event of a node failure, the data is automatically rerouted to another node, ensuring data integrity and continuity. To safeguard against a datacenter failure or isolation, and to be able to drain a cluster for maintenance purposes, we simply run two identical infrastructures in two different regions.

Our multi-retention cluster architecture

Our current scale

Write path

We deployed a few vmagent nodes per datacenter; they are responsible for collecting metrics from that datacenter and forwarding them to the VictoriaMetrics clusters abroad. This approach offers several advantages:

  • Having only one remote write per Prometheus instance simplifies the remote write queue tuning process and enhances resilience, as any failures related to the remote write storage have a limited impact on Prometheus.
  • On-disk spooling. In scenarios where the connection to the remote datacenter becomes temporarily unavailable, the vmagent nodes are able to buffer data on disk until connectivity is restored, thus preventing data loss.
  • Reduced network costs thanks to the VictoriaMetrics remote write protocol (see the configuration sketch after this list).
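A minimal sketch of this write path, assuming placeholder hostnames (the vmagent flags shown exist upstream, but the values are purely illustrative; recent vmagent/vminsert versions negotiate the VictoriaMetrics remote write protocol automatically):

# prometheus.yml: each Prometheus only writes to the local vmagent
remote_write:
- url: http://vmagent.local-dc.example.com:8429/api/v1/write

# vmagent: fan out to the remote clusters, spooling to disk when a link is down
vmagent \
  -remoteWrite.url=http://vminsert.region-a.example.com:8480/insert/0/prometheus/api/v1/write \
  -remoteWrite.url=http://vminsert.region-b.example.com:8480/insert/0/prometheus/api/v1/write \
  -remoteWrite.tmpDataPath=/var/lib/vmagent/spool \
  -remoteWrite.maxDiskUsagePerURL=100GB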

The good thing about having one cluster per retention window is that it is easy to scale and tune each cluster according to its ingestion rate. It is for example possible to rely on different node flavors to optimize the cost/performance ratio, like in a hot-warm-cold Elasticsearch architecture.

Then it's just one pair of vminsert nodes per cluster with the default configuration. If a vmstorage node is down, the vminsert simply redirects the writes to another node.
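For reference, a vminsert invocation for one of our clusters looks roughly like the sketch below (hostnames are placeholders; -replicationFactor=2 reflects the HA setup described earlier, everything else stays at its default):

vminsert \
  -replicationFactor=2 \
  -storageNode=vmstorage-1.example.com:8400 \
  -storageNode=vmstorage-2.example.com:8400 \
  -storageNode=vmstorage-3.example.com:8400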

Read path

As much as we like some features of MetricsQL (like the better handling of rate extrapolation, or WITH templates), we wanted to avoid vendor lock-in. Our user base is wide; basically everybody in the Criteo R&D has to deal with metrics at some point, and they already have to know Graphite and PromQL, so we wanted to spare them from having to worry about subtleties like when you can or cannot use MetricsQL. To do so, we forked vmauth and added a PromQL check to ensure only valid PromQL queries are evaluated. Additionally, we implemented a replay mechanism that duplicates queries to our backup cluster, allowing us to perform proper load testing after maintenance operations, such as version upgrades.
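The routing part of that front end is plain vmauth configuration; the PromQL validation and the replay to the backup cluster live in our fork and are not shown here. A minimal sketch of the vmauth -auth.config file, with placeholder endpoints and credentials:

users:
- username: grafana
  password: "***"
  url_prefix: http://vmselect.region-a.example.com:8481/select/0/prometheus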

vmselect nodes are able to target vmstorage nodes from all clusters at once and that’s basically all we needed to unify the three layers of retention.

We kept all configuration parameters at their default values for the vmselect nodes; they are able to run queries over a maximum of 300k timeseries, whereas BigGraphite had a limit of 7k, which is a pretty big quality-of-life improvement for troubleshooting big applications.
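Concretely, a single vmselect tier points at the vmstorage nodes of all three retention clusters, as sketched below with placeholder hostnames; the 300k figure above is simply the default value of the -search.maxUniqueTimeseries flag:

vmselect \
  -storageNode=vmstorage-7d-1.example.com:8401 \
  -storageNode=vmstorage-90d-1.example.com:8401 \
  -storageNode=vmstorage-1y-1.example.com:8401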

Are we satisfied?

Absolutely! In terms of stability and reliability, the SLAs speak for themselves: we proudly display more than 99.8% availability over a year, so we do sleep really well.

In terms of performance, VictoriaMetrics has exceeded our expectations. It effortlessly handles the ingestion of 2 million datapoints per second per host without flinching. Even when fetching 300,000 timeseries for troubleshooting queries, the response remains impressively fast. The Grafana Metrics browser is highly responsive, enabling smooth exploration of available metrics. Furthermore, the VictoriaMetrics team has always been highly responsive and frequently releases updates, continuously pushing the boundaries of what the system can achieve.

Did we suffer from the lesser compatibility with PromQL? No. To be honest, if you only run PromQL on it, it makes little to no difference, and when it did differ we found that it was for the better more often than not.

Last but not least, in terms of cost we are able to store about 15 times more datapoints for about a quarter of the infrastructure cost, with better SLAs than with BigGraphite.

VictoriaMetrics: 15 compute nodes + 46 storage nodes for 20T datapoints
BigGraphite: 226 compute nodes + 156 storage nodes for 1.3T datapoints

What could be improved?

No solution is perfect, so here is what we think could be improved:

  1. Currently, replacing a vmstorage node is a manual process that involves copying data, as VictoriaMetrics does not support cluster rebalancing. However, our storage servers have a very low failure rate, so we have had to perform this operation only once so far.
  2. VictoriaMetrics is able to ingest data in the past, but it requires clearing the vmselect cache to make sure that the newly ingested data is visible.
  3. Our development teams, responsible for libraries, SDKs, and frameworks, would greatly appreciate the ability to query data from all tenants without having to federate and duplicate metrics in their respective tenants. This functionality could potentially be achieved with tools like promxy (github.com/jacksontj/promxy), an aggregating proxy to enable HA Prometheus, which has recently shown signs of renewed activity.
