Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics

Luca Carboni
Published in Miro Engineering
Sep 14, 2020 · 10 min read

Why this article?

Prometheus is a great tool for monitoring small, medium, and big infrastructures.

Prometheus, and the development team behind it, are focused on scraping metrics. It is a particularly good solution for short-term retention of metrics. Long-term retention is another story, unless it is used to collect only a small number of metrics. This is somewhat expected: most of the time, when investigating a problem using the metrics Prometheus has scraped, we only need metrics no older than 10 days. But this is not always the case, especially when we are looking for correlations between different periods (different weeks of a month, or different months), or when we want to keep a historical synthesis.

Prometheus is actually perfectly able to collect metrics and to store them for a long time, but storage becomes extremely expensive because Prometheus needs fast disks, and Prometheus is not known as a solution that offers High Availability (HA) and Fault Tolerance (FT) in a sophisticated way (as we are going to explain, there is a way, not a sophisticated one, but it is there). In this article we will explain how to achieve HA and FT for Prometheus, and why long-term storage of metrics is better handled by another tool.

That said, over the past years many tools started to compete, and many are still competing, to solve these problems and more.

The common components of a Prometheus installation are:

  • Prometheus
  • Blackbox
  • Exporters
  • AlertManager
  • PushGateway

HA and FT of Prometheus

Prometheus can use federation (hierarchical and cross-service), which permits configuring a Prometheus instance to scrape selected metrics from other Prometheus instances (https://prometheus.io/docs/prometheus/latest/federation/). This kind of solution works well when you want to expose only a subset of selected metrics to tools like Grafana, or when you want to aggregate cross-functional metrics (like business metrics from one Prometheus and a subset of service metrics from another, federated one). It works in many use cases, but it is not compliant with the concept of High Availability, nor with the concept of Fault Tolerance: we are still talking about a subset of metrics, and if one of the Prometheus instances goes down, those metrics will not be collected during the downtime. Making Prometheus HA and FT must be done differently: there is no native solution from the Prometheus project itself.
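For reference, this is what a hierarchical federation scrape job looks like, adapted from the official documentation; the source hostnames and the match[] selectors are placeholders you would replace with your own:

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'      # metrics about the source Prometheus itself
        - '{__name__=~"job:.*"}'    # pre-aggregated recording rules
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'
```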

Prometheus can achieve HA and FT in a very easy way, without the need for complex clusters or consensus strategies.

What we have to do is duplicate the same configuration file, prometheus.yml, on two different instances configured in the same manner, which will scrape the same metrics from the same sources. The only difference is that instance A also monitors instance B and vice versa. The good old concept of redundancy is easy to implement, it is solid, and if we use IaC (Infrastructure as Code, like Terraform) and a CM (Configuration Manager, like Ansible) it is also extremely easy to manage and maintain. You do not want to duplicate an extremely big and expensive instance; it is better to duplicate a small instance and keep only short-term metrics on it. This also makes the instances quick to recreate.
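A minimal sketch of such a shared configuration, assuming hypothetical internal hostnames; the point is that the very same file is deployed to both instances:

```yaml
# prometheus.yml - identical on instance A and instance B (deployed via your CM tool)
global:
  scrape_interval: 15s

scrape_configs:
  # Both instances scrape the same sources...
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-1.example.internal:9100'
          - 'node-2.example.internal:9100'

  # ...and each instance also scrapes itself and its twin.
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 'prometheus-a.example.internal:9090'
          - 'prometheus-b.example.internal:9090'
```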

What about the other mentioned services?

Well, Alertmanager has the concept of a cluster: it is capable of deduplicating data received from multiple Prometheus instances and of coordinating with the other Alertmanager instances so that an alert is fired only once. So let us install Alertmanager on two different instances, perhaps the same two that are hosting Prometheus A and its copy, Prometheus B. Of course, we also keep the Alertmanager configuration in code with our IaC and CM solution.
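Clustering is configured with the --cluster.* flags; a sketch for instance A, with instance B running the mirror-image command pointing back to A (hostnames are placeholders):

```sh
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-b.example.internal:9094
```

Both Prometheus instances should then list both Alertmanagers in their alerting section; Prometheus handles the fan-out, and the cluster takes care of deduplication:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-a.example.internal:9093'
            - 'alertmanager-b.example.internal:9093'
```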

Node Exporters are installed directly on the nodes that are the source of the metrics you are collecting, so there is no need to duplicate anything there. The Prometheus configuration is the same on both instances, so the only requirement is to allow both Prometheus A and Prometheus B to connect to them.

Pushgateway is a little different: simply duplicating it is not enough, because you have to create a single injection point for the metrics that are pushed to it (whereas Prometheus normally pulls metrics). The way to make it HA and FT is to duplicate it on the two instances and put a DNS record in front of them, configured as an active/passive failover, so there is always one active Pushgateway, and in case of failure the second one is promoted to active. In this way, you can provide a unique entry point for batch processes, lambdas, sporadic functions, and so on. You could also use a load balancer in front of them; personally, I prefer an active/passive solution in this case, but it is up to you.
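From a batch job's point of view, it only ever talks to the single failover name. A sketch, where the DNS name, job name, and metric are illustrative:

```sh
# Push a completion timestamp to whichever Pushgateway is currently active
# behind the failover DNS record.
echo "nightly_backup_last_success_timestamp $(date +%s)" \
  | curl --data-binary @- \
    http://pushgateway.example.internal:9091/metrics/job/nightly_backup
```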

Blackbox Exporter is another tool with no built-in concept of HA and FT, but we can duplicate it as well, on the same two instances, A and B, that we have already configured.

Now we have two small Prometheus instances, two Alertmanagers working together as a cluster, two Pushgateways in an active/passive configuration, and two Blackbox Exporters: HA and FT are achieved.

There is no reason to use these two instances to collect all the metrics in your farm, which might be composed of different VPCs that reside in different regions, belong to different accounts, or are even hosted by different cloud providers (and, if you are lucky, your farm also includes something on-premises). There is no reason to do so because the small instances would become extremely big, and when something small fails it is normally easier to fix. It is common practice to have many Prometheus groups in the HA and FT configuration described above, each responsible for a specific part of the infrastructure. The definition of "part" is really up to you: it depends on your needs, requirements, network and security configuration, trust between your teams, and so on.

So, as a recap: we have small, or relatively small, Prometheus instances, duplicated together with all the services mentioned above; we have the code to recreate them quickly; and we can tolerate the complete failure of one instance per group. This is definitely a step in the right direction if our HA and FT plan used to be called "hope".

Prometheus

VictoriaMetrics

We now have Prometheus and its ecosystem configured for HA and FT, with multiple groups of Prometheus instances, each relatively small and focused on its own part of the infrastructure.

Cool, but we are keeping the data for only, let's say, 10 days. That is probably the most important period to query, but of course it is not enough. What about long-term storage for metrics?

Here come solutions like Cortex, Thanos, M3DB, VictoriaMetrics, and others. They can collect metrics from different Prometheus instances, deduplicate the duplicated samples (you will have a lot of them: remember, every Prometheus instance you have is duplicated, so every metric arrives twice), and provide a single storage point for all the metrics you are collecting.

Even though Cortex, Thanos, and M3DB are great tools, definitely capable of providing long-term storage for metrics and of being HA and FT themselves, we chose the relative newcomer VictoriaMetrics. This article will not compare all of those tools, but I am going to describe why we chose VictoriaMetrics.

VictoriaMetrics is available in two different configurations. One is an all-in-one solution, easier to configure, with all the components together (it is a good and stable solution, also capable of scaling, but only vertically, so it can be a valid choice depending on your needs). The other is the cluster solution, with separate components, so you can scale every single component both vertically and horizontally.

We like complex things (that’s definitely not true) so we decided to use the cluster solution.

The cluster version of VictoriaMetrics is composed of three main components: vmstorage (responsible for storing the data), vminsert (responsible for writing data into the storage), and vmselect (responsible for querying data from the storage). The tool is very flexible, and vminsert and vmselect are essentially proxies.

Vminsert, as said, is responsible for inserting data into the vmstorage nodes. There are many options you can configure, but for the scope of this article it is important to know that you can easily duplicate vminsert across an arbitrary number of instances and put a Load Balancer in front of them as a single injection point for incoming data. Vminsert is stateless, so it is easy to manage and duplicate, and it is a good candidate for immutable infrastructure and autoscaling groups. The component accepts several options; the most important are the storage addresses (you have to provide the list of vmstorage nodes) and "-replicationFactor=N", where N is the number of storage nodes the data will be replicated to. So, who sends the data to the balancer in front of the vminsert nodes? The answer is Prometheus, using the "remote_write" configuration (https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write) with the vminsert Load Balancer as a target.
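A sketch of how those pieces fit together, assuming hypothetical hostnames and the default cluster ports (8480 for vminsert HTTP, 8400 for the vmstorage insert port, tenant 0):

```sh
# Each vminsert instance knows about every vmstorage node and
# replicates every incoming sample to 2 of them.
./vminsert-prod \
  -storageNode=vmstorage-1.example.internal:8400 \
  -storageNode=vmstorage-2.example.internal:8400 \
  -storageNode=vmstorage-3.example.internal:8400 \
  -replicationFactor=2
```

On the Prometheus side, every instance ships its samples to the Load Balancer in front of vminsert:

```yaml
remote_write:
  - url: http://vminsert-lb.example.internal:8480/insert/0/prometheus/api/v1/write
```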

Vmstorage is the core component and the most critical one. Unlike vminsert and vmselect, vmstorage is stateful, and each instance does not really know about the other instances in the pool: every vmstorage is an isolated component from its own perspective. It is optimized to use high-latency IO and low-IOPS storage from cloud providers, which makes it definitely less expensive than the storage used by Prometheus. Its crucial options are:

  • "-storageDataPath": the path on disk where the metrics are saved;
  • "-retentionPeriod": as in Prometheus, how long the metrics are retained;
  • "-dedup.minScrapeInterval": deduplicates the received metrics in the background.

Every vmstorage node has its own data, but the "-replicationFactor" option passed to vminsert means that the data is sent to, and therefore replicated on, N storage nodes. The component can be scaled vertically if needed (bigger storage can be used), but because of the type of storage involved (high-latency IO and low IOPS), it will not be expensive even for long-term retention.
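For illustration, one vmstorage node might be started like this; its peers run the same command with their own data paths, and the values are examples, not recommendations:

```sh
# -storageDataPath: where the metrics live on disk
# -retentionPeriod: keep data for 36 months (months are the default unit for this flag)
# -dedup.minScrapeInterval: deduplicate samples that are closer than 15s apart
./vmstorage-prod \
  -storageDataPath=/var/lib/vmstorage-data \
  -retentionPeriod=36 \
  -dedup.minScrapeInterval=15s
```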

Vmselect is responsible for querying data from the storage nodes. Like vminsert, it can easily be duplicated across an arbitrary number of instances, with a Load Balancer in front of them creating a single entry point for querying metrics. You can scale it horizontally, and it also accepts many options. The Load Balancer, as said, becomes the single entry point for querying data that is now collected from multiple groups of Prometheus instances, with a retention period that can be arbitrarily long, depending on your needs. The main consumer of all this data will probably be Grafana. Like vminsert, vmselect can be configured in an Auto Scaling group.
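A matching sketch for vmselect, assuming the same hypothetical hostnames (the vmstorage select port defaults to 8401, and vmselect itself listens on 8481):

```sh
# Each vmselect instance fans a query out to all vmstorage nodes and merges the results.
./vmselect-prod \
  -storageNode=vmstorage-1.example.internal:8401 \
  -storageNode=vmstorage-2.example.internal:8401 \
  -storageNode=vmstorage-3.example.internal:8401
```

Clients then query the Load Balancer in front of these instances through the Prometheus-compatible endpoint, e.g. http://vmselect-lb.example.internal:8481/select/0/prometheus/ (we will point Grafana at it below).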

VictoriaMetrics

What About Grafana?

Grafana is a great tool for interacting with and querying metrics from Prometheus, and it can do the same with VictoriaMetrics via the Load Balancer in front of the vmselect instances. This is possible because VictoriaMetrics is compatible with PromQL (the Prometheus query language), even though VictoriaMetrics also has its own query language (called MetricsQL). Now that all of our components are HA and FT, let's make Grafana HA and FT as well.

In many installations, Grafana uses SQLite as the default solution for keeping its state. The problem is that SQLite is a great database for development purposes, mobile applications, and many other scopes, but not really for achieving HA and FT. For this purpose it is better to use a standard database; for example, we can use an RDS PostgreSQL instance with Multi-AZ capabilities (which will be responsible for the state of the application), and this solves our main problem.

For the Grafana application itself, and in order to provide users with a single entry point, we can create an arbitrary number of identical Grafana instances, all configured to connect to the same RDS PostgreSQL database. How many Grafana instances to create is up to your needs; you can scale them horizontally and also vertically. PostgreSQL can also be installed on your own instances, but I'm lazy and I like to use managed services from cloud providers when they do a great job and do not create vendor lock-in. This is a perfect example of a service that makes our lives easier.
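Concretely, the only state-related configuration that has to be identical on every Grafana instance is the database section of grafana.ini; a sketch with placeholder values (endpoint and credentials are illustrative):

```ini
# /etc/grafana/grafana.ini (same on every Grafana instance)
[database]
type = postgres
host = grafana-db.example.eu-central-1.rds.amazonaws.com:5432
name = grafana
user = grafana
password = <comes-from-your-secret-store>
```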

Now we need a Load Balancer that is responsible for balancing the traffic between the N Grafana instances and our users. We can also give the unfriendly Load Balancer address a friendly DNS name.

Grafana can be connected to the VictoriaMetrics vmselect Load Balancer using a data source of type Prometheus, and this completes our observability infrastructure. It is now HA and FT in all of its components, resilient, scope-focused, capable of long-term storage, and cost-optimized. We can also add an automated process that creates scheduled snapshots of the vmstorage nodes and sends them to an S3-compatible bucket, to make the retention period even longer.
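The data source can be clicked together in the UI or provisioned as code; a sketch of a provisioning file, reusing the hypothetical vmselect Load Balancer endpoint from above:

```yaml
# /etc/grafana/provisioning/datasources/victoriametrics.yml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus   # VictoriaMetrics speaks PromQL, so the Prometheus data source type works
    access: proxy
    url: http://vmselect-lb.example.internal:8481/select/0/prometheus
    isDefault: true
```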

Well, this was the metrics part; we are still missing the logging part, but that is another story :)

Grafana

The complete architecture:

Complete architecture

Join our team!

Would you like to be an Engineer, Team Lead or Engineering Manager at Miro? Check out opportunities to join the Engineering team.
