How to Serve 200K Samples per Second with a Single Prometheus

Jor Khachatryan
Picsart Engineering
9 min read · May 28, 2021

What is IT monitoring and why is it essential?

IT monitoring is the process of gathering metrics about the operations of an IT environment’s hardware and software, to ensure that everything is functioning as expected, to support applications and services, and to optimize the infrastructure. Basic monitoring is performed through hardware, software, and network operation checks, while more advanced monitoring gives detailed views on operational statuses, including average response times, number of application instances, error rates, cross-application traces, application availability, latency, etc.
Monitoring is an essential tool in the life of a DevOps Engineer: it puts them on the front line of the IT world, where they play a vital role.
Not having a monitoring system is like trying to find something in a dark room with your eyes closed. Without monitoring, it would be difficult, if not impossible, to detect anomalies and issues that need to be resolved swiftly.
Other tools also exist that can help with anomaly detection and problem-solving.

What tools do we use for monitoring?

There are many monitoring tools, some of them open source, and each has its advantages and disadvantages.
Today's topic covers solutions built from open-source tools, in particular Prometheus and Thanos. I will explain how to build a monitoring system that retains data for long periods and can handle up to 200K samples per second. The important point is that all of this runs on one centralized Prometheus and Thanos server.
To make the system easily manageable, you can use configuration management tools such as Puppet and Ansible to deploy and oversee alerting rules, and to create and keep backups of them.
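For illustration, a minimal Ansible sketch of what deploying alerting rules could look like; the inventory group, paths and file layout are assumptions rather than our actual playbook, and reloading via the HTTP endpoint requires Prometheus to run with --web.enable-lifecycle:

# Hypothetical playbook: copy rule files to the Main Prometheus and reload it
- name: Deploy Prometheus alerting rules
  hosts: main_prometheus                      # assumed inventory group
  become: true
  tasks:
    - name: Copy alerting rule files
      ansible.builtin.copy:
        src: files/alerting-rules/            # assumed local directory with *.yml rule files
        dest: /etc/prometheus/rules/
        owner: prometheus
        group: prometheus
        mode: "0644"
      notify: Reload prometheus

  handlers:
    - name: Reload prometheus
      ansible.builtin.uri:
        url: http://localhost:9090/-/reload   # needs --web.enable-lifecycle enabled
        method: POST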

What is Prometheus?

Prometheus is open-source software that serves as a systems monitoring and alerting toolkit. It records metrics in its own time-series database (allowing for high dimensionality), collects them over an HTTP pull model, and offers flexible queries and real-time alerting.
In PicsArt's infrastructure, we need a scalable architecture to handle high loads, and to solve the scalability problem we use Kubernetes. But to keep things simple, we use a centralized Prometheus server, which we call the Main Prometheus. This way, we can centralize the metrics of all the targets we monitor (bare-metal hosts, standalone containers, pods, etc.). We therefore manage Prometheus servers in different environments, both Kubernetes and non-Kubernetes, and in some clusters there is more than one replica.
Data centralization is realized with Prometheus' federation feature, which reads other Prometheus servers' time series and saves them locally. As mentioned in this discussion thread of the Prometheus community, the federation feature does not guarantee unlimited data ingestion, which we have also tested and verified ourselves. This is a problem for us, as we need to retain masses of data for longer terms.
Nevertheless, we have found a way to overcome this obstacle: use the federation feature in an optimal way. I will explain this in the config section.

Config
Let’s talk about configurations.
Below are the Federation configs:

Img.1: Federation config for AWS Kubernetes cluster

One important point: during federation, do not collect all metrics and then drop them, like this:

match[]:
  - '{__name__=~".+"}'

Use __name__=~".+" carefully in the match section. With a large quantity of data, it causes noticeable issues, because the action: drop rules under the metric_relabel_configs section apply only to data that has already been federated; they don't take effect until the data has been pulled into Prometheus. While the data is being pulled, the scrape can become too large and get corrupted, which causes the federation job to fail.
To avoid these issues, we need to define the metrics explicitly under the match[] section.
By managing clusters this way, we avoid running into any issues.
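For illustration, a federation job on the Main Prometheus with explicit selectors could look like the sketch below; the job name, selectors and target address are hypothetical, not our production values:

scrape_configs:
  - job_name: 'federate-k8s-aws'              # hypothetical job name
    scrape_interval: 30s
    honor_labels: true                        # keep the original labels from the federated server
    metrics_path: '/federate'
    params:
      'match[]':                              # pull only what we actually need
        - '{job="node-exporter"}'
        - '{job="kubelet"}'
        - '{__name__=~"kube_pod_container_status_.*"}'
    static_configs:
      - targets:
          - 'prometheus-k8s-aws.example.internal:9090'   # downstream Prometheus (placeholder)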
Firstly, we don’t want to have Alertmanager installed on every Prometheus server. We want the rules to be located in one place to easily manage and control them.
All of the aforementioned details are useful, but it is key to remember that the data is at a large scale and Prometheus is not designed for long-term storage. To overcome this, we have integrated Thanos.

Img. 2: The simple architecture of Thanos.

There are other useful tools that could also serve as a solution, such as Cortex, which writes data to object storage; but, as I need to mention once more, every case is different and each calls for its own solution. With this setup, even with a large quantity of data, we can read it from anywhere: from the Prometheus servers set up in Kubernetes clusters, from the exporters located across the infrastructure, or with the help of Netdata.
Thanos then integrates with existing Prometheus servers through a Sidecar process, which runs on the same machine or in the same pod as the Prometheus server. The purpose of the Sidecar is to upload Prometheus data into object storage and give other Thanos components access to the Prometheus instance that the Sidecar is attached to (via a gRPC API).

Img. 3: Thanos Sidecar
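Here is a docker-compose-style sketch of the Sidecar, assuming Prometheus stores its TSDB in /var/lib/prometheus and listens on 9090; the image tag, ports, paths and bucket settings are illustrative, not our exact setup:

services:
  thanos-sidecar:
    image: quay.io/thanos/thanos:v0.20.2        # version is an assumption
    network_mode: host                          # so it can reach Prometheus on localhost:9090
    volumes:
      - /var/lib/prometheus:/prometheus
      - /etc/thanos:/etc/thanos
    command:
      - sidecar
      - --tsdb.path=/prometheus                 # same data directory as Prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --grpc-address=0.0.0.0:10901            # StoreAPI consumed by Thanos Query
      - --http-address=0.0.0.0:10902

# /etc/thanos/bucket.yml (placeholder bucket name and credentials):
# type: S3
# config:
#   bucket: "thanos-metrics"
#   endpoint: "s3.us-east-1.amazonaws.com"
#   access_key: "<ACCESS_KEY>"
#   secret_key: "<SECRET_KEY>"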

Store
As the sidecar backs up data into S3 object storage, we decrease Prometheus retention and store less locally (seven days). However, we need a way to query all that historical data again. The store gateway does just that by implementing the same gRPC StoreAPI as the sidecars, but backing it with data it finds in the S3 bucket, downloading it and caching it locally on the Main Prometheus server. Just like the sidecars and query nodes, the store gateway's StoreAPI needs to be discovered by Thanos Query.

Img. 4: Thanos Store
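A similar sketch for the store gateway, which caches downloaded blocks on the local disk of the Main Prometheus server; the ports and paths are again illustrative:

services:
  thanos-store:
    image: quay.io/thanos/thanos:v0.20.2
    network_mode: host
    volumes:
      - /var/thanos/store:/data                 # local cache of blocks pulled from S3
      - /etc/thanos:/etc/thanos
    command:
      - store
      - --data-dir=/data
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --grpc-address=0.0.0.0:10905            # StoreAPI, discovered by Thanos Query
      - --http-address=0.0.0.0:10906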

One of the crucial components is the Thanos Compactor.
The compactor component scans the Amazon S3 bucket and performs compaction based on its configuration. The compactor is also responsible for downsampling the data:

  • Creating 5m downsampling for blocks larger than 40 hours (2d, 2w)
  • Creating 1h downsampling for blocks larger than 10 days (2w)

Img. 5: Thanos Compact
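A sketch of the compactor; the retention values per resolution are illustrative choices, not the ones we run in production:

services:
  thanos-compact:
    image: quay.io/thanos/thanos:v0.20.2
    volumes:
      - /var/thanos/compact:/data
      - /etc/thanos:/etc/thanos
    command:
      - compact
      - --wait                                  # keep running and compact continuously
      - --data-dir=/data
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --retention.resolution-raw=30d          # illustrative retention per resolution
      - --retention.resolution-5m=180d
      - --retention.resolution-1h=1y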

After these steps, we use the Thanos Query component.
The Query component is stateless and horizontally scalable, and it can be deployed with any number of replicas. Once connected to the Sidecars and Stores, it detects which servers need to be contacted for a given PromQL query. If the query covers less than the past seven days of data, it is served from the sidecar (the Prometheus TSDB); otherwise it is served from the store gateway, which connects to S3 and caches the data on the disk of the Main Prometheus server. The Query component implements Prometheus' official HTTP API.
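A sketch of the Query component wired to the Sidecar and Store gateway from the previous snippets (the ports are the illustrative ones used above):

services:
  thanos-query:
    image: quay.io/thanos/thanos:v0.20.2
    network_mode: host
    command:
      - query
      - --grpc-address=0.0.0.0:10903
      - --http-address=0.0.0.0:10904            # PromQL HTTP API used by Grafana / the query frontend
      - --store=localhost:10901                 # Sidecar: last seven days from the local TSDB
      - --store=localhost:10905                 # Store gateway: historical data from S3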

We have mentioned that there is a large volume of data and that everything is stored in object storage. Meanwhile, we have to be able to query everything from Grafana without noticeable latency.
Having already moved to large-scale loads, we decided to approach this issue with caching. A few tools can help here: Thanos Query Frontend, the Cortex query frontend for Prometheus, and Trickster.
From this list, we use the Cortex query frontend. Some of its capabilities are:

  • Redis as a cache datastore
  • Query splitting
  • Parallel querying

The Prometheus-frontend service runs the Cortex single binary and exposes HTTP port 9092. It is a proxy between Grafana and Thanos Query, and its responsibility is to cache the requested range_query data. We cache that data in a Redis database (db0), which is deployed next to the Prometheus-frontend service as a Docker container with a 30 GB maxmemory limit.
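A minimal sketch of the Cortex single-binary configuration when it is run as a query frontend (for example with -target=query-frontend); exact keys differ between Cortex versions, and the addresses, split interval and expiration below are assumptions:

auth_enabled: false

server:
  http_listen_port: 9092                  # the port Grafana talks to

query_range:
  split_queries_by_interval: 24h          # query splitting
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: "localhost:6379"        # the Redis (db0) container next to the frontend
        expiration: 1h

frontend:
  downstream_url: http://localhost:10904  # Thanos Query HTTP API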

Img. 6: The final architecture of the monitoring infrastructure

As shown in Img. 6, we have a distributed monitoring infrastructure, and the parameters of the servers are the same. Here are the parameters of monitoring-backend-server1, on which the Main Prometheus server is running.
From our experience, we can suggest a server with the following parameters:

  • CPU 16 cores
  • RAM 48 GB
  • SSD 1TB

Finally, in the image below, we can see the actual usage of resources by Prometheus:

Img.7: CPU usage of the Main Prometheus server

Img.8: Memory usage of the Main Prometheus server

Img.9: TSDB (disk) usage with 7 days retention

If Redis goes down, nothing bad happens; we only need to start it again.
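For completeness, a sketch of the Redis cache container described above; the image tag and eviction policy are assumptions, while the 30 GB maxmemory limit is the one mentioned earlier, and a restart policy is one optional way to bring it back automatically:

services:
  prometheus-frontend-redis:
    image: redis:6                              # image tag is an assumption
    restart: unless-stopped                     # optional: restart automatically after a crash
    ports:
      - "6379:6379"
    # allkeys-lru is an assumed eviction policy for a pure cache; 30gb matches the limit above
    command: ["redis-server", "--maxmemory", "30gb", "--maxmemory-policy", "allkeys-lru"]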
We have a second Prometheus server configured identically to the Main Prometheus. The only difference is that it is passive and serves as a reserve server: if any incident occurs with the Main Prometheus, e.g. the server goes down, the reserve server automatically replaces it by changing its state to active.
The next two images (Img. 10, Img. 11) show the performance of the system when querying without the caching component and with it. As we see in Img. 10, when we query data for the last 30 days without the caching component, the query executes in 4.4 seconds. In Img. 11, we see the result of the same query when we use the Prometheus-frontend cache: the query executed in 0.3 seconds, so we get up to 14x faster performance with caching.

Img.10: First time querying, with Thanos Query datasource (no cache)

Img. 11 Second time querying, with Prometheus datasource (with Cortex cache)

Now we query Prometheus-frontend, so every query gets cached in Redis. This way, Grafana goes through Prometheus-frontend, which then connects to Thanos Query. In turn, Thanos Query connects to the Prometheus Sidecar and the Store gateway to collect all of the data (from object storage and from Prometheus). We therefore just need to put the Prometheus-frontend URL:Port in our Grafana datasource instead of the Prometheus URL:Port.
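In Grafana this can be done either in the UI or with datasource provisioning; a provisioning sketch, assuming the frontend runs on the Main Prometheus host (monitoring-backend-server1) on port 9092:

apiVersion: 1
datasources:
  - name: Prometheus-frontend
    type: prometheus
    access: proxy
    url: http://monitoring-backend-server1:9092   # Cortex query frontend, not Prometheus directly
    isDefault: true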
This makes it possible to keep big data while still querying it easily and swiftly; it no longer matters how big the data or the infrastructure grows. We can access all of this data through a single Prometheus server, which handles approximately 200,000 samples per second, with no downtime.

Img. 12: Samples per second appended to the TSDB

To sum up, with the right server sizing, a known volume of metrics and number of servers, and one Main Prometheus server, we can handle large quantities of data.
We also have other solutions built around Loki, the ELK stack, Jaeger and other tools. It is worth mentioning that in large-scale companies like PicsArt, monitoring plays a major role and is extremely important, not only for alerting DevOps engineers, but also because such large volumes of data can be used by analytics servers and services.

A note on the gains

After the full deployment, we noticed some "invisible" errors, and with good dashboards it was possible to decrease the error rate in the PicsArt environment by 3x, which definitely affects app quality and user experience. Troubleshooting time for infrastructure-related cases decreased from 15 minutes to 2–3 minutes. With proactive monitoring, we predicted and fixed more than 200 cases during the last year without any of them causing system downtime.
Lastly, note that any large-scale data needs to be smartly visualized.
