Overcoming scalability issues in your Prometheus ecosystem

Jürgen Etzlstorfer
Published in keptn
May 19, 2020

Prometheus is considered a foundational building block when running applications on Kubernetes, and it has become the de-facto open-source standard for visibility and monitoring in Kubernetes environments. In this blog we highlight the most common challenges for Prometheus operators and SREs, and provide guidance on how to overcome them. Finally, we discuss a solution that gets you there more quickly: building automated, future-proof observability with Prometheus, with Keptn as one possible implementation.

TLDR

  • Overcome the challenge of complex onboarding and ad-hoc configuration of applications by applying a GitOps approach to stay in control. Plus, get the option to revert your configurations at any time.
  • Use code generators instead of manually creating configuration files and dashboards. Significantly reduce the time you spend and errors you make when writing configuration files.
  • Too much code generation can lead to code duplication; the same is true for multiple environments with similar configurations. Use abstraction mechanisms to overcome code duplication.
  • Instead of building your own solution, use a cloud-native framework that helps you get the job done. Keptn has GitOps at its core, is easily extensible, and automates both the continuous delivery of microservices and their operation.

Your first starting points when operating Prometheus are most likely configuring scraping to pull metrics from your services, building dashboards on top of your data with Grafana, or defining alerts for important metrics breaching thresholds in your production environment.
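To make this concrete, here is a minimal sketch of those starting points: a scrape job plus an alerting rule. The job name, target address, metric names, and thresholds are illustrative assumptions, not taken from any particular setup.

```yaml
# prometheus.yml — scrape a hypothetical service every 15 seconds
scrape_configs:
  - job_name: "checkout-service"     # example service name
    scrape_interval: 15s
    static_configs:
      - targets: ["checkout:8080"]   # assumed service address

# alert_rules.yml — fire when the 5xx error rate breaches a threshold
groups:
  - name: checkout-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{job="checkout-service",status=~"5.."}[5m]) > 0.05
        for: 10m
        labels:
          severity: critical
```

Even this small example hints at the scaling problem: every additional service and stage multiplies the number of such files you have to write and keep consistent.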

As soon as you are comfortable with Prometheus as your weapon of choice, your next challenge will be scaling and managing Prometheus for your whole fleet of applications and environments. As the journey “From Zero to Prometheus Hero” is not trivial, you will find obstacles along the way.

Let us start our journey!

Challenge 1: Onboarding & configuring applications

As you know, a typical Kubernetes-based environment does not consist of only one application running in production. In fact, quite the opposite is true: multiple instances of multiple (microservice) applications communicating with each other run in different environments (or “stages” such as development, hardening, and eventually production) on one or more Kubernetes clusters in several datacenters spanning multiple regions across the globe. At Uber this had grown to over 4000 microservices by late 2019! To manage and operate these applications, you need full observability, which demands dedicated configurations for scraping, dashboarding, and alerting for each application. To keep the configuration in sync between all stages and applications, it is most often duplicated and applied manually in an ad-hoc manner every time something changes. The problem: huge effort to manage the configuration of your Prometheus ecosystem.

Solution 1: GitOps to stay in control

Versioning all “configuration-as-code” in a centralized Git repository that serves as the sole source of truth helps us overcome this challenge. Direct changes to the Prometheus configuration or Grafana dashboards are prohibited; instead, all changes must first be committed to the Git repository and are then synchronized to Prometheus, Grafana, and other tools. This approach is commonly known as “GitOps”: a Git repository holds all configuration (as well as documentation and code), and an agent or operator applies it to the systems being managed, e.g., Prometheus, Grafana, or even a Kubernetes cluster. The benefits are manifold: all configuration is versioned, and the audit log identifies when and why each change happened. In case of problematic changes, we can revert them easily. Moreover, with Git as the central repository, the workflow aligns well with developers, who base their workflows on Git anyway. Promoting a configuration (i.e., applying it in the next stage) is also possible using pull requests, a concept that has already proven successful in development processes. As you can see in the figure below, a Git repository plus an agent/operator are added as an intermediate layer to manage all configuration files. Obviously, the agent/operator must hold the logic and permissions to apply the configuration to the underlying systems.
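The core of such an agent/operator is a reconciliation loop: compare the desired state from Git with the live state of the managed tools and apply the difference. The following is a minimal sketch of that idea; the function and the data layout are invented for illustration, and a real operator would fetch the repo and call the tools' APIs instead of working on dictionaries.

```python
# Sketch of a GitOps reconciliation step (illustrative, not a real operator API).
# "desired" models the files in the Git repository, "live" the configuration
# currently applied to Prometheus/Grafana. The result is what must change.

def reconcile(desired: dict, live: dict) -> dict:
    """Return the config entries that must be (re)applied to reach the desired state."""
    changes = {}
    for path, content in desired.items():
        if live.get(path) != content:
            changes[path] = content   # new or drifted file: re-apply from Git
    for path in live:
        if path not in desired:
            changes[path] = None      # removed from Git: delete downstream
    return changes

# Example: one dashboard drifted, one alert rule file was added in Git.
desired = {"grafana/checkout.json": "v2", "prometheus/alerts.yml": "rules-v1"}
live = {"grafana/checkout.json": "v1"}
print(reconcile(desired, live))
# {'grafana/checkout.json': 'v2', 'prometheus/alerts.yml': 'rules-v1'}
```

Because the loop always reconciles toward the repository state, ad-hoc manual edits to the live systems are overwritten on the next sync, which is exactly what keeps Git the single source of truth.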

Directly ad-hoc applied configuration versus configuration via a GitOps approach

Challenge 2: Manual creation of configuration & dashboards

Although we have now mitigated the issue of uncontrolled ad-hoc application of configuration files, and instead have a version-controlled single source of truth that holds all configuration as code, we still have a lot of manually written configuration. Writing (and learning!) PromQL does not come for free, and that is only one piece of the bigger picture. Besides PromQL, we need Grafana dashboard configurations for a comprehensive overview of our applications, as well as alerting rules in Prometheus so that alerting for production issues is set up. A couple of engineers might be needed, since writing PromQL or creating alerting rules requires different knowledge than configuring dashboards in Grafana, not only from a technical perspective but also from an organizational one. The problem: a team of engineers knowledgeable in different configuration languages is needed to write and maintain all this manual configuration.

Solution 2: Code generation empowers scaling

Code generation to the rescue! Instead of manually writing queries and rules for Prometheus and Alertmanager, and dashboard configurations for Grafana, we use code generators to eliminate most of the manual work. A great example is generating Prometheus alerting and recording rules based on SRE concepts, such as the Golden Signals or the RED method (or even the USE method), which are widely considered the most useful and critical metrics. Another use case is generating Grafana dashboards (for which examples can be found here, here and here). Bottom line: code generators speed up configuration. The generated files are stored in the Git repository to keep all the benefits we discussed earlier. As you can see from the image below, a much smaller team is required, since code generators do the heavy lifting and even reduce the chance of errors.
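As a sketch of what such a generator looks like, the function below emits Prometheus alerting rules for two of the RED signals (error rate and duration) for a list of services. The metric names, label conventions, and thresholds are assumptions for illustration; a real generator would serialize the result to YAML for Prometheus to load.

```python
# Sketch of a rule generator: one small input (a list of service names) fans
# out into many alerting rules, instead of hand-writing each rule.

def generate_alert_rules(services, error_threshold=0.05, latency_threshold=0.5):
    """Build RED-style alerting rules (error rate, p95 latency) per service."""
    rules = []
    for svc in services:
        rules.append({
            "alert": f"{svc.capitalize()}HighErrorRate",
            "expr": (f'rate(http_requests_total{{job="{svc}",status=~"5.."}}[5m]) '
                     f'> {error_threshold}'),
            "for": "10m",
        })
        rules.append({
            "alert": f"{svc.capitalize()}HighLatency",
            "expr": (f'histogram_quantile(0.95, '
                     f'rate(http_request_duration_seconds_bucket{{job="{svc}"}}[5m])) '
                     f'> {latency_threshold}'),
            "for": "10m",
        })
    return {"groups": [{"name": "red-alerts", "rules": rules}]}

# Usage: ten services yield twenty consistent rules from one line of input.
rule_file = generate_alert_rules(["checkout", "cart", "payment"])
```

Checking the generated files into Git (rather than generating them on the fly) keeps the GitOps audit trail intact.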

Manual writing of configurations versus using code generators

Challenge 3: Code duplication

Now that we are using code generators, we end up with lots of auto-generated configuration files. When reusing those configurations stored in the Git repository, they might simply be copied from a development or hardening environment to production. The absence of an abstraction mechanism does not allow any reuse, so low-level configuration code has to be written (or, in our case, generated) repeatedly for each stage or copied from one stage to the other. If configurations deviate slightly from stage to stage, this is usually not covered by any kind of templating mechanism, since code generators (see challenge 2) mostly target one specific platform (e.g., Prometheus rules or Grafana dashboards) only. This is because those systems target different technical domains under the hood, although some projects do exist to mitigate this issue.

Still, another problem often arises: changing the input of code generator 1 produces a result that is now out of sync with the output of code generators 2 and 3, and there is no synchronization mechanism between all the generated files. To mitigate this, a change to one input could trigger the execution of all generators, but the actual problem remains: the input for each generated file is in a different format (since the code generators are independent solutions). The problem: manual work is required to translate a desired change into each input format and eventually produce a new generation of configuration files.

Solution 3: Abstraction fosters reuse

From software engineering we learned that abstraction fosters reuse, and this very concept can help to overcome this challenge. Introducing an intermediate language to cover common concepts of monitoring can help to give a common understanding as well as a technical foundation to build upon.

As you can see in the image below, we introduce an intermediate language that allows us to define common concepts and generate the specific configuration files for the different platforms like Prometheus and Grafana, e.g., using jsonnet or our own defined language. With the help of this language we can abstract away implementation details. Naturally, this language must provide all concepts that are prevalent in the Prometheus monitoring domain. Luckily, in recent years a consensus has emerged around terminology and concepts stemming from the site-reliability engineering (SRE) community. A mature concept is to build upon the notion of “service-level objectives” (SLOs), which allows us to define objectives for each microservice. Putting this into machine- and human-readable code (typically some YAML files) allows us to generate configuration for multiple tools, all of it conforming to the defined service-level objectives. Consequently, complexity is reduced, making it easier to operate and scale your Prometheus environments. Plus, talking about SLOs brings more people to the table, since the conversation is no longer about low-level technical details like APIs but about objectives.
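The following sketch shows the idea of one SLO definition driving several generators. The SLO schema, function names, and output shapes are invented for illustration; the point is that the alert rule and the dashboard panel can never drift apart, because both derive from the same source.

```python
# One SLO definition (the "intermediate language") is translated into both a
# Prometheus alerting rule and a Grafana panel. Schema and values are invented.

slo = {
    "service": "checkout",
    "indicator": "response_time_p95",
    "query": ('histogram_quantile(0.95, '
              'rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))'),
    "objective": 0.6,   # seconds
}

def to_prometheus_alert(slo):
    """Derive an alerting rule: fire when the SLI breaches the objective."""
    return {
        "alert": f'{slo["service"]}_{slo["indicator"]}_breached',
        "expr": f'{slo["query"]} > {slo["objective"]}',
        "for": "5m",
    }

def to_grafana_panel(slo):
    """Derive a dashboard panel: same query, objective drawn as a threshold."""
    return {
        "title": f'{slo["service"]}: {slo["indicator"]}',
        "targets": [{"expr": slo["query"]}],
        "thresholds": [{"value": slo["objective"], "color": "red"}],
    }
```

Changing the objective in one place regenerates both artifacts consistently, which is exactly the synchronization that independent per-tool generators lack.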

No abstraction mechanism versus fostering reuse by abstraction based on SRE concepts

The ultimate (root) problem: x+1 island solutions

Although we have overcome all the problems addressed so far, if every reader now implements their own solution based on what I have written, we basically end up with many individual island solutions. Each will be just fine for the problem at hand, but we have not yet thought about using Prometheus data in our CI/CD system to enable quality gates between development and hardening stages, or building on top of our newly built system with chatbots to enable self-service capabilities for developers and SREs. Even more extensions can be envisioned: why not integrate testing tools like JMeter and provide dedicated dashboards out of the box for each test run? And why not even query Prometheus metrics automatically for each test run and report them back to the user each time a test is triggered? The problem: building such a system also takes time and is a complex effort.

The ultimate solution: Use a ready-made framework built on industry standards

We started Keptn with exactly those thoughts in mind. How do you build a future-proof, extensible platform on Kubernetes that mitigates all these issues by providing out-of-the-box support for configuring and managing monitoring tools and integrating them into a bigger workflow? What we came up with is Keptn, an event-based control plane for continuous delivery and automated operations.

Configuration managed by Keptn

Keptn provides a ready-made framework addressing the challenges we have discussed earlier:

  • GitOps-based thanks to its internal Git repository (which can be connected with GitHub, GitLab, Bitbucket, etc.) that synchronizes changes to the attached tools.
  • Abstraction from low-level languages like PromQL to service-level indicator (SLI) and service-level objective (SLO) based configuration. Keptn helps you focus on ensuring service-level quality instead of dealing with APIs to fetch the correct metrics.
  • Code generation from SLI and SLO files to executable code that can be interpreted by the respective tools like Prometheus and Grafana. As a concrete example, thresholds defined in an SLO file can be reused to set up alerts in the Prometheus alert manager.
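To give an impression of what this SLI/SLO-based configuration looks like, here is a sketch of the two files Keptn works with. The exact schema may differ between Keptn versions, and the query and thresholds below are illustrative:

```yaml
# sli.yaml — maps an SLI name to a provider-specific query (Prometheus here)
spec_version: "1.0"
indicators:
  response_time_p95: histogram_quantile(0.95, sum(rate(http_response_time_seconds_bucket{job='$SERVICE'}[5m])) by (le))

# slo.yaml — objectives defined on top of the SLI, independent of any tool
spec_version: "1.0"
comparison:
  compare_with: "single_result"
objectives:
  - sli: response_time_p95
    pass:
      - criteria:
          - "<600"      # p95 response time below 600 ms
    warning:
      - criteria:
          - "<800"
total_score:
  pass: "90%"
```

Only the SLI file knows about PromQL; swapping the monitoring backend means replacing the indicator queries while the SLO file, and every workflow built on it, stays unchanged.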

There are a couple more things we have not covered in detail in this blog, like a mature extension mechanism, observability and auditing mechanisms of the framework, as well as developer tooling like CLIs and APIs instead of ad-hoc scripting. Besides, you might need even more tools in your toolbox to cover other aspects, such as logging.

Keep these things in mind when starting to create your own solution. Or take a look at Keptn to get these features out of the box!

Acknowledgments

Thanks to the Prometheus community for the valuable feedback on this article.
