Sismology: Iguana Solutions’ Monitoring System

Sally Team Iguane, Iguane Solutions
May 28, 2020

Timeseries, Long-Term Storage, Multi-Tenancy & High Availability

This article is a retrospective of several months of continuous improvement since the creation of our current monitoring system: the challenges we faced, how we overcame them, and how we finally switched to Victoria Metrics.

How it started

At Iguane Solutions we have created a multi-tenant system based on Prometheus for our alerting and metrology needs: Sismology. It began as a project to replace our monolithic Naemon and Graphite (fed by collectd) setup with a single system merging metrology and alerting, based on the current standard: Prometheus.

While Prometheus gave us a solid metrology and alerting core, we faced three challenges:

Multi-tenancy: since we were planning to let our customers access their own data, the single-tenant nature of Prometheus would have to be overcome.

Long-term storage: as long as several years; it is not uncommon for our customers (or ourselves) to compare a specific time of the year against year N-1 or N-2.

High availability: a zero-downtime target, while still being able to take some nodes offline for maintenance purposes.

Because Prometheus is single-tenant, the answer for multi-tenancy was obvious: each independent infrastructure would have its own dedicated Prometheus appliance, each node representing a single tenant. This also had the advantage of spreading the compute needed for scraping and alerting across multiple nodes and reducing the failure domain for alerting. Since we also wanted HA, each infrastructure would get two appliances with identical configurations and a virtual IP floating between the two. We started with 2 vCPU / 2 GiB RAM for each appliance.

However, this did not completely solve the HA need. Even though the virtual IP switches to the secondary appliance when the primary is down, giving access to the secondary node’s metrology while keeping alerting functional, there was no backfilling mechanism. This means that once back online, the primary node would have a gap in its own data. It was acceptable for our prototype, and we considered promxy in case the issue became problematic for us later on.

From the start, we felt that the long-term storage challenge would be the most problematic one, especially coupled with multi-tenancy. We added a local InfluxDB on each appliance, connected to the local Prometheus as remote storage in order to extend Prometheus’ retention. To do so, we scaled our appliances up: each of them was now 4 vCPU / 8 GiB RAM.

Unfortunately we faced two main issues:

  • Rapid growth of the data stored on disk
  • Explosion of RAM usage due to high cardinality

Fine tuning and custom development

Disk usage & remote read proxy

To tackle the disk space issue, we started to downsample our data into several retention policies using continuous queries (15s for 2 days, 30s for 7 days, 1m for 1 month and so on…). Querying these different retention policies transparently from a single entry point (the Prometheus remote read client) created a new problem: Prometheus would only target one of these retention policies.

To bypass this, we developed an OSI layer 7 proxy, rrinterceptor, which inspects the Prometheus remote read query range and automatically redirects the query to the retention policy with the best precision covering the requested time range.
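
The routing idea can be summarised with a short sketch. This is not the actual rrinterceptor code: the backend URLs, the retention policy names and the choice to route on the earliest query start timestamp are assumptions made for the example.

```go
// Minimal sketch of a remote read interceptor: decode the Prometheus remote
// read request, inspect the requested time range and forward the query to
// the retention policy with the best precision that still covers it.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// policy maps a maximum lookback to the backend serving that retention policy.
type policy struct {
	maxAge  time.Duration
	backend string
}

// Ordered from most precise to least precise, mirroring the continuous queries
// (illustrative InfluxDB endpoints and retention policy names).
var policies = []policy{
	{2 * 24 * time.Hour, "http://127.0.0.1:8086/api/v1/prom/read?db=sismology&rp=rp_15s"},
	{7 * 24 * time.Hour, "http://127.0.0.1:8086/api/v1/prom/read?db=sismology&rp=rp_30s"},
	{30 * 24 * time.Hour, "http://127.0.0.1:8086/api/v1/prom/read?db=sismology&rp=rp_1m"},
}

// pickBackend returns the most precise retention policy covering a query
// starting at startMs (milliseconds since the epoch).
func pickBackend(startMs int64) string {
	age := time.Since(time.UnixMilli(startMs))
	for _, p := range policies {
		if age <= p.maxAge {
			return p.backend
		}
	}
	return policies[len(policies)-1].backend // older than everything: coarsest policy
}

func handler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Remote read requests are snappy-compressed protobuf payloads.
	raw, err := snappy.Decode(nil, body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var req prompb.ReadRequest
	if err := proto.Unmarshal(raw, &req); err != nil || len(req.Queries) == 0 {
		http.Error(w, "invalid remote read request", http.StatusBadRequest)
		return
	}
	// Route on the earliest start timestamp across the queries in the request.
	start := req.Queries[0].StartTimestampMs
	for _, q := range req.Queries {
		if q.StartTimestampMs < start {
			start = q.StartTimestampMs
		}
	}
	// Forward the original (still compressed) body to the selected backend.
	out, err := http.NewRequest(http.MethodPost, pickBackend(start), bytes.NewReader(body))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	out.Header = r.Header.Clone()
	resp, err := http.DefaultClient.Do(out)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
	w.Header().Set("Content-Encoding", resp.Header.Get("Content-Encoding"))
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/api/v1/read", handler)
	log.Fatal(http.ListenAndServe(":9201", nil))
}
```

Prometheus only needs to point its remote read URL at such a proxy; the retention policy selection then happens transparently.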

With the retention policy selection issue solved, we discovered that the junction between Prometheus’ own data and the downsampled InfluxDB data could be problematic for some functions such as rate/irate (the counter reset detection system in Prometheus could go postal in some cases).

RAM usage, cardinality and the birth of our own agent

The RAM usage issue had no simple solution: cardinality needed to be lowered by any means necessary. We developed our own agent, an all-in-one, plugin-based agent inspired by InfluxData’s agent Telegraf: sismograf.

We created it in order to have strict control over cardinality: each metric and each tag of each plugin was carefully crafted so that we only had what we needed and not a single metric or tag more. This helped our cardinality problem A LOT, so much so that we thought our RAM usage issue was resolved.
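
To illustrate the principle (the types and the plugin below are hypothetical, not sismograf’s actual API), a cardinality-conscious agent can force every plugin to declare its metrics and tags up front and strip anything that was not declared:

```go
// Hypothetical plugin API for a cardinality-conscious agent: every plugin
// declares exactly which tags it is allowed to emit, nothing more.
package main

import "fmt"

// Metric is a single sample with an explicit, fixed tag set.
type Metric struct {
	Name  string
	Tags  map[string]string
	Value float64
}

// Plugin is implemented by each collector; AllowedTags acts as a whitelist
// so no plugin can accidentally introduce a high-cardinality label.
type Plugin interface {
	Name() string
	AllowedTags() []string
	Gather() ([]Metric, error)
}

// sanitize drops any tag a plugin did not explicitly declare.
func sanitize(p Plugin, metrics []Metric) []Metric {
	allowed := map[string]bool{}
	for _, t := range p.AllowedTags() {
		allowed[t] = true
	}
	for _, m := range metrics {
		for tag := range m.Tags {
			if !allowed[tag] {
				delete(m.Tags, tag)
			}
		}
	}
	return metrics
}

// cpuPlugin is a toy collector emitting a single, low-cardinality metric.
type cpuPlugin struct{}

func (cpuPlugin) Name() string          { return "cpu" }
func (cpuPlugin) AllowedTags() []string { return []string{"cpu", "mode"} }
func (cpuPlugin) Gather() ([]Metric, error) {
	return []Metric{{
		Name:  "cpu_usage_percent",
		Tags:  map[string]string{"cpu": "cpu0", "mode": "user", "pid": "1234"}, // "pid" gets stripped
		Value: 12.5,
	}}, nil
}

func main() {
	var p Plugin = cpuPlugin{}
	metrics, _ := p.Gather()
	for _, m := range sanitize(p, metrics) {
		fmt.Println(m.Name, m.Tags, m.Value)
	}
}
```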

To be fair, it was indeed resolved for several months. We slowly expanded the small share of infrastructures running on the new system until we started to get RAM usage alerts from the oldest appliances: InfluxDB was becoming greedy. A few OOM kills later, we upgraded these appliances to 16 GiB of RAM and stopped adding new infrastructures to the new Sismology system: if each appliance needs 16 GiB of RAM after only a few months, what about after 1 year? 2 years?

At that time, 18 infrastructures were on the new system: that meant 36 appliances with at least 16 GiB of RAM each. The system would soon face a new issue, an economic one. We stopped adding new infrastructures to the system while thinking of a way out of this dead end.

Victoria Metrics

How Victoria Metrics helped us improve Sismology

It’s over 9000

While trying entirely different systems (Thanos, Cortex, M3DB, etc.), we crossed paths with a newcomer: Victoria Metrics (even though, at that time, only the single-node version was open source and we were still looking for a solution with downsampling).

We were truly impressed by it: a lot of the issues we had with Prometheus were taken care of (extrapolation, automatic rate intervals, etc.), and its performance compared to InfluxDB was incredible (since then, an even more impressive benchmark has been released).

Back then, the lack of downsampling was considered problematic. Nevertheless, after discussing the pros and cons of downsampling versus keeping the original data and compressing it with the Victoria Metrics team, we decided to give it a try, specifically to evaluate the higher compression ratio Victoria Metrics was supposed to offer. Fortunately, a few days before we started the test, the cluster version of Victoria Metrics (with… multi-tenancy!) was open sourced.

We spawned a small cluster and configured a second remote write on one appliance of each of our Prometheus pairs (remember: we have two of them for each infrastructure). We were expecting to have to scale the cluster up while gradually adding the different remote writes, but in the end we did not have to. Victoria Metrics was so efficient that even after adding all the infrastructures, the load average was so low that we could have used only one node of each role (but we did not want to, for HA purposes):

  • 2 haproxy (each 2vCPU/2G RAM)
  • 2 vminsert (each 2vCPU/2G RAM)
  • 2 vmstorage (each 2vCPU/2G RAM)
  • 2 vmselect (each 2vCPU/2G RAM)

Total: 16 vCPU / 16 GiB RAM

Victoria Metrics seemed to be the answer we had sought for a long time, but a few last things were bothering us:

  • We did not want to keep a 15s precision for long-term storage: one of the two appliances was set to a 1m scrape interval (and exported its data to Victoria Metrics) while the other stayed at a 15s scrape interval, both for faster alerting and for direct access to high-frequency values
  • The “backfilling” issue was still present: taking the appliance that remote writes to Victoria Metrics down would still result in a gap in our metrology data on Victoria Metrics
  • Setting both appliances to 1m with remote write to Victoria Metrics would also be problematic:

- Victoria Metrics would end up with, at best, a ~30s precision or, at worst, a 1m precision with 2 points really close to each other (the two appliances scrape independently, so their samples can be arbitrarily offset), which is not ideal for Victoria Metrics to automatically determine the time series interval

-The alerting, still performed on both Prometheus appliances, would be slower

The final touch: deduplication

Victoria Metrics was a good lead but still not ready for our use case; we kept it active to see how it would perform over time. A few weeks later, the Victoria Metrics team added a feature which changed everything: deduplication.

By setting a 60s deduplication interval on the cluster, Victoria Metrics automatically de-duplicates points that are too close to each other (a simplified sketch of this behaviour follows the list below). This means:

  • We control the interval for the long-term storage (60s)…
  • … no matter the number of redundant points sent to the cluster, and therefore no matter the scrape interval configured on the Prometheus appliances
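
As a simplified model of this behaviour (not Victoria Metrics’ actual implementation), the sketch below merges the redundant 15s streams coming from both Prometheus appliances of a pair and keeps only one point per 60-second window:

```go
// Simplified model of interval-based deduplication: out of all samples
// falling into the same 60s window for a series, only one is kept.
package main

import (
	"fmt"
	"sort"
	"time"
)

type sample struct {
	ts    time.Time
	value float64
}

// dedup keeps the first sample of every `interval`-wide window.
func dedup(samples []sample, interval time.Duration) []sample {
	sort.Slice(samples, func(i, j int) bool { return samples[i].ts.Before(samples[j].ts) })
	var out []sample
	lastWindow := int64(-1)
	for _, s := range samples {
		window := s.ts.UnixNano() / int64(interval)
		if window != lastWindow {
			out = append(out, s)
			lastWindow = window
		}
	}
	return out
}

func main() {
	start := time.Date(2020, 5, 28, 0, 0, 0, 0, time.UTC)
	var merged []sample
	// Two appliances scraping the same target every 15s, offset by 5s:
	// eight redundant points per minute end up in the cluster.
	for i := 0; i < 8; i++ {
		t := start.Add(time.Duration(i) * 15 * time.Second)
		merged = append(merged, sample{t, float64(i)})                      // appliance A
		merged = append(merged, sample{t.Add(5 * time.Second), float64(i)}) // appliance B
	}
	// Only one point per 60s window survives (two points for this 2-minute span).
	for _, s := range dedup(merged, 60*time.Second) {
		fmt.Println(s.ts.Format(time.RFC3339), s.value)
	}
}
```

On the cluster itself this is just a configuration knob, the deduplication interval mentioned above (the -dedup.minScrapeInterval flag in current releases); nothing had to be implemented on our side.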

It allowed us to do the following:

  • Our pair of appliances for each infrastructure:

> could again have an identical configuration:

-15s scrape interval (which we are more comfortable with for the alerting part)

-both remote writing to the Victoria Metrics cluster

> go back from 4 vCPU / 16 GiB RAM to 2 vCPU / 2 GiB RAM for each appliance (yes, this means we recovered 72 vCPU and 504 GiB of RAM from our 36 appliances, 36 × 2 vCPU and 36 × 14 GiB, while still having an underused cluster with a total of 16 vCPU / 16 GiB RAM!)

  • The Victoria Metrics cluster deduplicates data points to keep only one point every ~60s: this means we can send data points from 15s scrape intervals from both Prometheus appliances, resolving the HA and backfilling issues in one shot.
  • We keep these 60s points for a duration of 3 years

Conclusion

After a lot of trial & error and custom development, we finally managed to reach our initial goals:

  • Multi-tenancy: the Victoria Metrics cluster supports multi-tenancy natively, and each pair of appliances uploads its data to a different tenant of the cluster. We also offer our customers a dedicated tenant within a highly available Grafana, with automatic datasource & dashboard deployment and updates.
  • Long-term storage: given our current ingestion rate (~12.1k datapoints/s), the average size of one data point (0.59 byte!), the average rate of data points dropped by deduplication (~8.6k datapoints/s) and our 36-month retention policy, the total projected space needed is only 191 GiB (a back-of-the-envelope check follows below this list). We also created a Victoria Metrics cluster dashboard to ease day-to-day cluster operations and capacity planning, feel free to try it!
  • High availability: the cluster, in its current deployment form, allows one node of each type to be offline (for vmstorage this still means some data unavailability, but no data loss*), and for each infrastructure we have identical Prometheus appliances sending their metrology data to the Victoria Metrics cluster and their alerts to an Alertmanager cluster. This Alertmanager cluster is composed of 3 nodes, each in a different datacenter. Finally, the Grafana we use for visualization is entirely serverless, with auto-recovery and auto-scaling policies.

* Actually, this is not the case anymore thanks to the replication feature added in the brand new v1.36.0!
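
For reference, the long-term storage projection above can be reproduced with simple arithmetic: only the datapoints that survive deduplication are stored, and each of them costs about 0.59 byte on disk. A quick sketch (approximating the 36 months as 3 × 365 days):

```go
// Back-of-the-envelope storage projection from the figures above.
package main

import "fmt"

func main() {
	const (
		ingestedPerSec = 12100.0             // ~12.1k datapoints/s ingested
		dedupedPerSec  = 8600.0              // ~8.6k datapoints/s dropped by deduplication
		bytesPerPoint  = 0.59                // average on-disk size of one data point
		retentionSec   = 3 * 365 * 24 * 3600 // 36 months, approximated as 3 years
	)
	storedPerSec := ingestedPerSec - dedupedPerSec // ~3.5k datapoints/s actually kept
	totalBytes := storedPerSec * bytesPerPoint * retentionSec
	fmt.Printf("projected storage: %.0f GiB\n", totalBytes/(1024*1024*1024))
	// prints roughly 182 GiB, in line with the ~191 GiB projection above
}
```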

Thank you and keep up the good work Victoria Metrics team!

Written by Edouard Hur — VP Engineering @ Iguana Solutions

Related Article: VictoriaMetrics — creating the best remote storage for Prometheus
