Real-time monitoring using optimized Prometheus and Grafana

Abhisar Bharti
Machine Learning Reply DACH
9 min read · Oct 25, 2022

A one-stop monitoring solution

Photo by Miguel A Amutio on Unsplash

Buckle up for a long but very informative read!!!

In a large data-driven organization, multiple subsystems work together to generate value from raw data. Whether it is a complex data processing pipeline or a full-fledged AI model ecosystem, monitoring the health of the individual components plays a major role in efficient day-to-day operations. A real-time monitoring solution not only helps in identifying a breakdown instantly; some solutions go beyond that and support reporting and working on a fix at the same time. In this post, I will briefly introduce a combination of two tools that I have used in one of my projects to achieve near real-time monitoring for our data processing architecture. We used Prometheus.io to collect information about our systems in the form of metrics, and we used Grafana to visualize that information with Prometheus as a data source.

Since there are plenty of articles already floating on the internet about Prometheus and Grafana, I would just briefly like to introduce them here and then explain some of the challenges that we faced during our work, and how we went on to solve them. So, sit back, relax and let us go sailing the rough sea🙂.

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud [Overview | Prometheus]. It stores time series data identified as metrics, where each metric is further distinguished by key/value pairs called labels. We will investigate the metrics in more detail later in the article. Prometheus uses a pull-based approach to collect metrics data by sending HTTP requests to metrics endpoints, and this information can be queried using a flexible and easy-to-interpret query language called PromQL. For basic visualization of data, Prometheus also has a UI. However, the real potential of the collected information can be leveraged using visualization tools like Grafana. A well-tuned Prometheus along with Grafana can collect millions of metrics every second and visualize them in the form of dashboards. However, as promised, I will come back to the tuning part, since Prometheus can become very messy if it is not controlled properly. Below is an overview of the Prometheus architecture.
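To make the pull model a bit more concrete, here is a minimal, purely illustrative prometheus.yml scrape configuration (the job name and target address are placeholders and would need to point at a real exporter in your setup):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ['localhost:9100']

With this in place, Prometheus sends an HTTP GET to http://localhost:9100/metrics every 15 seconds and appends whatever samples the endpoint exposes to its local time series database.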

Architecture of Prometheus [Overview | Prometheus]

Prometheus works standalone, which ensures that even in the case of a component failure in a complex production setup, it will still be able to collect and report useful information. It is highly reliable and does not require expensive infrastructure to operate efficiently. Prometheus follows a pull-based approach for metrics collection, but in cases where that is not possible it also supports a push-based approach using the Pushgateway to record metric data. Prometheus can be run in a containerized environment, and it also works well with Kubernetes. It is important to understand that Prometheus only works with numerical, time series metrics, which it supports in a multi-dimensional data model. One of the strengths of Prometheus lies in the fact that it supports a huge number of data exporters, and both data collection and querying are highly robust. However, one should also understand that in some specific cases it might not be a good idea to use Prometheus. For use cases that require 100 % accuracy, for example billing, Prometheus is not a good option, since the collected data might not be complete enough to fulfill that requirement.

Prometheus Metrics

Prometheus metrics are basically numerical time series values. These metrics are a measurement of a certain behavior of an application over time and help the user to understand why the application behaved in a certain way. Metrics that record health tell us whether a certain application is running or not; metrics that record web traffic tell us how much load a specific web server is currently under. Similarly, core parameters like the CPU, memory, and disk usage of a server running a certain application can be monitored using Prometheus metrics for efficient day-to-day operations.

Now, let us have a brief overview of four types of metrics supported by Prometheus:

Counter: The counter metric type is used for anything that only ever increases, i.e. its value never goes back down (it only resets to zero when the process restarts). This metric can be used to record measurements like the number of errors, requests, completed tasks, etc.

Example: http_requests_total

Gauge: The gauge metric type can be used for anything whose value can go up or down. This metric type can be used to record current CPU usage, memory usage, etc.

Example: up{instance="kafka", job="apache"}

Histogram: The histogram metric type measures the frequency of observations that fall within predefined ranges (buckets). For example, we could measure the latency of an HTTP request using a histogram metric.

Example: prometheus_http_request_duration_seconds_bucket{handler="/graph"}

Summary: A summary also samples observations, but it calculates quantiles on the client side and does not require predefined buckets, so it can be used when suitable bucket sizes are not known upfront. The official Prometheus documentation nevertheless recommends using histograms instead of summaries wherever possible, mainly because summary quantiles cannot be aggregated across instances.
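To make the four types more tangible, here are a few illustrative PromQL queries built on the example metrics above (they will only return data if the corresponding targets are actually being scraped):

# Per-second request rate over the last 5 minutes, derived from a counter
rate(http_requests_total[5m])

# Current value of a gauge, e.g. whether a target is up (1) or down (0)
up{instance="kafka", job="apache"}

# 95th percentile request latency, derived from histogram buckets
histogram_quantile(0.95, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket[5m])))

Note that counters are almost always wrapped in rate() or increase(), because the raw, ever-growing total is rarely interesting on its own.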

Prometheus over other tools

There are plenty of tools that can act as an alternative to Prometheus. However, I will only pick some of the well-known ones and try to highlight the advantages of using Prometheus over them.

Graphite is one of the popular alternatives to Prometheus. However, Prometheus has a richer data model and query language compared to Graphite. Graphite acts as a passive time series database. Moreover, every time series is stored in a separate file, and new samples overwrite old ones after a certain time interval. In comparison, Prometheus also creates one local file per time series, but new samples are simply appended. Graphite can be a good option if the need is to store long-term data in a clustered solution.

InfluxDB and Prometheus have a lot of similarities. InfluxDB also uses labels like Prometheus, and there they are called "tags". InfluxDB supports the float64, int64, bool, and string data types, whereas Prometheus supports float64 and has only limited support for strings. InfluxDB is also more suitable for event logging. Commercial InfluxDB is easier to scale horizontally, but at the same time it requires managing a distributed storage system. In summary, both are good options depending on the use case. If the requirement is event logging, a clustered environment, and keeping data replicas, then InfluxDB should be preferred. If the goal is metrics, high availability, and a robust query language with alerting and notification support, then Prometheus should be preferred.

OpenTSDB is also similar to Prometheus when it comes to the data model. However, OpenTSDB does not have its own full query language and only allows simple aggregation and math. Its storage is built on Hadoop and HBase, which allows for horizontal scaling. Once again, Prometheus offers a much better query language, handles a high number of metrics, and provides a full-fledged monitoring solution. But if we are already running Hadoop, then OpenTSDB could be a good choice.

Optimization of Prometheus

In our work with Prometheus, we realized that as our usage grew and more and more systems were being monitored through Prometheus, we had a huge influx of metrics in our Prometheus environment. We reached a point where the performance of Prometheus and our alerting ecosystem was degrading. This is a very common problem with a production-level Prometheus when it is not optimized over time.

There are plenty of ways to handle this problem, and I am going to explain how we optimized our setup. The first thing to do is to identify the top culprits that are creating the problem; in other words, the goal is to identify the metrics with the highest number of time series in Prometheus. We can do this using the "topk" function. Below is an example that returns the 50 heaviest metrics in our Prometheus server:

topk(50, count by (__name__)({__name__=~".+"}))
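If the metric names alone do not tell the whole story, a couple of related queries, run directly in the Prometheus UI, can help narrow down where the series are coming from (job is the standard label attached by every scrape config):

# Total number of time series currently known to Prometheus
count({__name__=~".+"})

# Which scrape jobs contribute the most time series
topk(10, count by (job) ({__name__=~".+"}))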

In our use case, multiple teams use Prometheus for monitoring. So it was important to arrange a short workshop with all of them and go through the generated list of top 50 metrics to identify which of them they never used, do not use anymore, or do not plan to use in the future. We can then simply stop collecting those. This gives Prometheus some breathing space and helps it operate efficiently. Now, let us have a look at how to do it.

To drop metrics that are not being used, we need to make changes to the configuration file using metric relabeling. Below is an example of that:

- job_name: Kubernetes-dev
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: '(kube_pod_state|kube_pod_failures_reason|kube_pod_waiting_reason)'
      action: drop

Performing the above-mentioned action ensures that none of the three metrics is stored in the time series database, which helps in reducing disk usage.
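As a side note, the same mechanism can also be turned into an allowlist: with the action set to keep, everything that does not match the regex is dropped. This can be handy when a job exposes far more metrics than you actually use. A sketch, where the two kube_pod_* metric names are just examples:

- job_name: Kubernetes-dev
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: '(kube_pod_info|kube_pod_status_phase)'
      action: keep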

One more thing to consider is the fact that Prometheus is mostly used for real-time monitoring and not for long-term storage. So, if your organization does not require you to keep data for the default 15 days, then one parameter to change according to your needs would be the one below:

--storage.tsdb.retention.time: (7d/3d), whichever suits the best 😊

And finally, a good idea would be to play with the scrape interval and monitor disk usage and other parameters for some time.

scrape_interval: 5m
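For reference, the retention time is a command-line flag passed when the Prometheus server is started, while the scrape interval lives in the global block of prometheus.yml. A sketch of both (the paths and values are only examples):

# prometheus.yml
global:
  scrape_interval: 5m

# startup command
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=7d

Keep in mind that a per-job scrape_interval can still override the global one, so it is worth double-checking individual jobs after such a change.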

In an ideal situation, this will bring your Prometheus to good health, as was the case with us. Else, you can also move to the “Cloud” way, which I will briefly explain next.

AWS Managed Prometheus

During our work, we also had the opportunity to explore the managed Prometheus offering from AWS. AWS provides a workspace that is scalable and elastic for storing all our metrics data and thus solves the problems of storage and load. It is fast, removes the operational burden, and provides secure data access. The AWS managed Prometheus is already integrated with Amazon ECS, Amazon EKS, and the AWS Distro for OpenTelemetry.

Grafana

Finally, the last piece of the puzzle. I think Prometheus and Grafana are a match made in heaven. Grafana integrates well with Prometheus for all the visualization and alerting needs one can have. By default, Grafana listens on port 3000. Grafana supports a wide variety of data sources too. Unlike some other visualization tools, you do not have to ingest data into a backend store to use it, which I like. To me, it is very efficient to skip all that hassle, simply connect my data source wherever it lives, and quickly do an exploration or create a detailed dashboard to view all important metrics. Grafana also allows you to share dynamic dashboards with other users, and the customization options are really detailed. Users can start by copying an already created panel and customize it the way the requirements demand. I think this brings lots of transparency and generates a strong sense of collaboration. In my experience, the level of customization that Grafana brings is awesome, and every new version adds some important updates. Not to mention, the support for external plugins in Grafana is something that I find really useful.
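As a small illustration, a Prometheus data source does not even have to be clicked together in the UI: it can be provisioned from a YAML file placed in Grafana's provisioning/datasources directory. The URL below assumes Prometheus is running locally on its default port 9090:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true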

Finally, who does not like pre-built stuff? 😊 Thankfully, Grafana has a plethora of cool dashboards built and supported by its strong community. There is a good chance that you might find something here that suits your requirements, and the number of these dashboards is increasing every day. This brings me to my last point: a Grafana dashboard is stored as JSON, and for importing a pre-built dashboard all you need is its dashboard ID. Below is an example, the node exporter dashboard from the community with ID 10242. Happy exploring!

Node Exporter Full with Node Name | Grafana Labs
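If you prefer to keep such community dashboards under version control instead of importing them by ID through the UI, Grafana can also load dashboard JSON files from disk via a provisioning file. A minimal sketch (the folder name and path are placeholders):

apiVersion: 1
providers:
  - name: community-dashboards
    folder: Monitoring
    type: file
    options:
      path: /var/lib/grafana/dashboards

The exported JSON of dashboard 10242 would then simply be dropped into that path.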

Conclusion

Prometheus and Grafana together are a powerful set of tools for real-time monitoring of infrastructure. Depending on the use case and the requirements, these two tools can be extremely useful in catering to the needs of the operations team in an organization. In this article, we looked at what Prometheus and Grafana are, how to decide whether Prometheus is the tool we should opt for, and, if we do go with Prometheus, which problems we might face in the later stages of operation. We also looked at how we can solve these problems, and we finally had a brief look at Grafana for visualization. We at Machine Learning Reply also provide consultation and support related to Prometheus and Grafana, whether you are planning to start your monitoring journey or have already started but need a helping hand. Happy observing!


Abhisar Bharti
Machine Learning Reply DACH

Consultant Data Engineer at Machine Learning Reply. Passionate about time series analysis, statistics, visualization, and self-help books.