Prometheus Monitoring at Scale: War Stories from the GumGum Trenches
Monitoring a system, application or any other IT component is one of the most basic and critical, yet often under-prioritized, IT processes that every company has to deal with.
The monitoring landscape has changed over the last 10–15 years in ways that nobody would have believed. The widespread popularity of cloud native applications, microservice-based systems and sprawling ephemeral environments introduces a host of new challenges, especially for legacy monitoring platforms that were designed for more static environments.
The Good Ol’ Days
A few years ago, monitoring at GumGum was pretty much a fairy tale: applications, servers, and services were largely fixed and well-known. Icinga, Nagios, Ganglia, AWS CloudWatch, and some other off-the-shelf SaaS tools were just right for the monitoring needs and challenges that we faced in those days.
We collected host metrics from a predefined list of clusters, services and applications. We’d send alerts based on some basic thresholds that usually indicated something was wrong. For some services, we created basic Ganglia and CloudWatch dashboards. That was it. Indeed, it was a simpler time — dashboarding was still new and the business didn’t yet have the appetite for real-time metrics that it does today.
Onboarding with Docker
But, as we mentioned earlier, the IT landscape at GumGum is vastly different nowadays: we have fully embraced the microservices pattern and the Docker ecosystem, with all its advantages like portable applications, homogeneous environments, fast and easy scaling, rapid deployments and more.
While these are great improvements for the business, these patterns often come at the cost of increased complexity for the GumGum DevOps team. The explosion in the quantity and complexity of the systems to be monitored was neither manageable nor sustainable with the legacy monitoring stack that we had in place.
For more details on what the legacy monitoring stack at GumGum looked like and the challenges it presented, check out this presentation by Florian Dambrine.
While we had a very good run with our old monitoring stack at GumGum, we needed something new to address the complexity and scale of our new containerized ecosystem.
Prometheus to the Rescue!
We needed a new monitoring solution for GumGum, taking into consideration the technical challenges and requirements of a containerized landscape. To accomplish this, we analyzed a variety of approaches and tools and decided to implement a monitoring stack based on Prometheus, an open-source monitoring solution built on a Time Series Database (TSDB) and a pull-based pattern for gathering (scraping) metrics from the target systems. We chose Prometheus over other systems mostly because of its flexibility, its extensibility and the huge open-source community backing the project, which is hosted by the Cloud Native Computing Foundation.
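To make the pull-based pattern a bit more concrete, here is a minimal sketch of what a scrape target looks like from the Prometheus side: an HTTP endpoint that returns metrics in the plain-text exposition format. It only uses the Python standard library; the metric names, the `collect()` function and port 8000 are made up for illustration, and label-value escaping is left out. In practice you would normally reach for an existing exporter or an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples in the Prometheus text format.

    Note: label values are not escaped here; this is only a sketch.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical collector; a real exporter would read live values here."""
    return [
        ("app_requests_total", {"method": "GET", "code": "200"}, 1027),
        ("app_up", {}, 1),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metrics on /metrics, the path Prometheus scrapes by default."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at that host and port would be enough for Prometheus to start pulling these two series on every scrape interval.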
Prometheus alone is not a complete or fully-fledged monitoring stack for a company of our size (100s of people, dozens of microservices, 1000s of containers). Prometheus is built for scraping metrics in a well-defined format from its target systems; the metrics are collected and exposed to Prometheus via HTTP by specialized agents called Exporters. The following diagram depicts a bird's-eye view of GumGum’s batteries-included Prometheus solution, including the components necessary for a highly available deployment.
Thus, to complete the monitoring puzzle, we make use of specialized open-source tools that are loosely coupled together, each one being the best fit for its particular use:
Data Visualization and Dashboarding: Grafana is the open-source analytics & dashboarding solution that can be easily coupled to Prometheus using native Grafana data sources. For building Prometheus dashboards on Grafana, you are encouraged to use the powerful PromQL language, which allows you to perform complex operations and aggregations over the data being queried.
Alerting: Alertmanager is the Prometheus-native solution for creating, managing, sending and silencing alerts via a variety of channels, such as email, on-call notification systems like PagerDuty, and chat platforms such as Slack.
- Consul: Consul is a flexible service mesh solution that can be used for multiple purposes, a popular one being service discovery. Prometheus can be easily integrated with Consul, which gives it a catalog of all the services and nodes to be monitored. Each node to be monitored has a Consul agent installed on it; the agent registers itself with the Consul cluster, letting the cluster know what services need to be monitored.
- Registrator: Registrator is a service registry bridge for Docker. It can automatically detect and register/deregister Docker containers as they come online and disappear from the host. A Registrator agent running on each ECS instance reports to the local Consul agent, so the Consul cluster knows every time a new Docker container appears, and thus Prometheus is aware of it as well.
- Node Exporter: A Prometheus exporter for exposing hardware and OS metrics on *nix systems. By default, Node Exporter publishes 600+ metrics, which is a double-edged sword. You gain access to a huge amount of metrics and valuable info about the system; on the other hand, Prometheus has to scrape and process a ton of different metrics per system, which can contribute to an issue known as Prometheus Cardinality Explosion.
- cAdvisor: Similar to Node Exporter, cAdvisor (Container Advisor) is a daemon or service that collects, aggregates, processes, and exports information about running Docker containers.
- JMX Exporter: An exporter that can be run as a Java Agent and is able to collect and expose JVM and JMX metrics in Prometheus format. We’ve used the JMX Exporter at GumGum to replace DataDog for monitoring Zookeeper.
- Spring Boot Actuator: Custom Java exporter for applications built using the Spring Boot framework. It can retrieve metrics from the JVM, the host and user-defined Java metrics (similar to JMX Exporter).
- CloudWatch Exporter: This exporter retrieves metrics from AWS CloudWatch and exposes them so Prometheus can scrape them. Take into account that this process has an extra cost because of the AWS API calls.
- Prometheus Pushgateway: Pushgateway is not a Prometheus Exporter per se, but it behaves like one. Pushgateway allows ephemeral and batch jobs to expose their metrics to Prometheus. The batch/ephemeral jobs push their metrics to Pushgateway, and Prometheus scrapes the metrics from it.
The Prometheus ecosystem has a lot of different exporters that you could potentially implement for your application/stack (PHP, Python, NodeJS, etc.). We could even implement the NewRelic exporter to have AdServer APM metrics on Prometheus.
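To illustrate the service discovery flow described in the Consul and Registrator items above, here is a small sketch that turns Consul catalog entries into the `host:port` scrape targets Prometheus ends up with. Prometheus does this natively through its Consul service discovery, so this Python is only meant to show the idea. The field names follow the Consul catalog API; the service name, node names, addresses and ports in the sample are invented.

```python
from collections import defaultdict

# Invented sample shaped like entries returned by Consul's catalog API
# for registered services (one entry per service instance).
SAMPLE_CATALOG = [
    {"Node": "ecs-node-1", "Address": "10.0.0.5", "ServiceName": "adserver",
     "ServiceAddress": "", "ServicePort": 8080},
    {"Node": "ecs-node-2", "Address": "10.0.0.6", "ServiceName": "adserver",
     "ServiceAddress": "10.0.1.9", "ServicePort": 32768},
    {"Node": "ecs-node-1", "Address": "10.0.0.5", "ServiceName": "node-exporter",
     "ServiceAddress": "", "ServicePort": 9100},
]


def targets_by_service(entries):
    """Group scrape targets by service name.

    Uses the service-level address when it is set and falls back to the
    node address otherwise.
    """
    grouped = defaultdict(list)
    for entry in entries:
        host = entry.get("ServiceAddress") or entry["Address"]
        grouped[entry["ServiceName"]].append(f"{host}:{entry['ServicePort']}")
    return {name: sorted(targets) for name, targets in grouped.items()}
```

Each time Registrator registers or deregisters a container, the catalog changes and the resulting target list changes with it, which is why nobody has to maintain a static list of hosts.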
Putting all the pieces together
So far, we have covered all the pieces of the Prometheus monitoring puzzle. Next, we should understand how they are glued and deployed together, using Terraform, ECS and Docker.
We won’t explain the Terraform project that sets up all this infrastructure in much detail, as it is out of the scope of this article. The Terraform project consists of a Terraform module (tf_monitoring) and submodules (alertmanager, cloudwatch, grafana, monitoring-cluster, prometheus and pushgateway). It is worth mentioning that the Consul cluster required by Prometheus is managed in a separate Terraform project, as it is also used by other services.
All services/components are run on a single ECS cluster, one per AWS account (Advertising, Sports and AI/Verity), so each AWS account has its own Prometheus Monitoring Stack, fully independent.
So far so good. Prometheus is up and running, we are collecting metrics from the servers, applications and containers, we are able to create beautiful and meaningful Grafana dashboards with those metrics. But not everything is a fairy tale, as we would like it to be.
The way we architected, configured and deployed Prometheus and all the other pieces has some challenges and drawbacks that are really worth mentioning.
Biting the dust
Now that we have a general understanding of the GumGum Prometheus architecture and implementation, it is worth mentioning the lessons learned so far and the challenges and drawbacks we found.
- Configuration Changes. The Prometheus, Alertmanager and CloudWatch Exporter configuration is hardcoded inside the Docker image for each of those services. Any change to the configuration therefore requires a change to the Bitbucket repo managing the specific Docker project (docker-prometheus, docker-alertmanager, docker-cloudwatch). Once the new Docker image is built and pushed to the JFrog repo, we need to modify the Terraform project managing the whole Prometheus cluster, apply the changes and wait for ECS to update/recycle the modified service (e.g. Prometheus), incurring unwanted downtime for a service as critical as Prometheus. At the time of writing, the DevOps team is looking for a way to make configuration changes and apply them with hot reloads.
- Prometheus High Availability: Prometheus High Availability is achieved by having 2 “Twin” Prometheus servers with the very same configuration, scraping the very same metrics from the same hosts. It’s basically duplicated monitoring. This pattern is effective for achieving HA on Prometheus, but it has some drawbacks worth mentioning:
- The monitored servers/services/applications need to perform some extra work to deliver their metrics to a pair of Prometheus servers, thus having a small performance overhead.
- The way each Prometheus service is configured at ECS level requires having separate EBS volumes for storing metrics, which means paying more for AWS storage and resources.
- When Grafana is querying metrics from Prometheus for building dashboards, we could potentially have inconsistent information, as Grafana is querying a pair of Prometheus “twin” servers.
- Scaling Prometheus to scrape millions of metrics from thousands of nodes with the current architecture/deployment is not feasible.
How can we improve the Prometheus HA architecture so that it can scale out? Other architectures have been proposed for scaling out Prometheus, such as the Federated Prometheus Architecture.
- Prometheus Metrics Storage. The standard storage method used by Prometheus requires space on local disks attached to the server, which means AWS EBS volumes in the GumGum deployment. In addition, Prometheus saves all its metrics to just one location. This limitation does not allow Prometheus to easily scale up in terms of storage as the company grows the number of metrics, nodes and applications being monitored. The retention time for the stored metrics is also bounded by the space available on the EBS volumes; ours is set to 15 days, which is the default value. Even the official documentation mentions these storage limitations. There are different options for overcoming them, like Prometheus Remote Storage, which allows using different storage backends, such as InfluxDB, OpenTSDB, Cassandra, etc.
- Prometheus Cardinality Explosion: We briefly mentioned this issue before. The main Prometheus server for the Advertising account has ~2,221,978 different time series right now! That’s a huge number, and it could be way bigger in a matter of months. Prometheus creates a time series for every combination of metric name and labels. So, for every Docker container that spins up on ECS, it will create N new time series (N being the number of metrics collected for that particular container). We love having a lot of metrics to work with, but we care about Prometheus performance and resources as well. This explosion in cardinality could potentially lead to:
- Severe memory inflation in Prometheus
- Increased scrape durations
- Querying becomes effectively impossible
Is there anything we can do about this? For sure! We could work on scraping only the metrics we really need and discarding duplicated or meaningless ones. There are ways of doing that at the exporter level, the Prometheus level, or both. There are also tools that help analyze cardinality in the Prometheus TSDB; check this article.
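The storage and cardinality items above are closely related, so here is a rough sketch that ties them together. It counts how many series each metric name contributes in a scraped payload, keeps only an allowlist of metrics (the same idea as dropping metrics at the exporter or with Prometheus relabeling), and estimates disk usage with the rule of thumb from the Prometheus storage documentation: retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. The function names and the sample payload used with them are ours, not part of any Prometheus API.

```python
from collections import Counter


def metric_name(sample_line):
    """Return the metric name of one exposition-format sample line."""
    return sample_line.split("{", 1)[0].split()[0]


def series_per_metric(exposition_text):
    """Count series (one per sample line) for each metric name."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and HELP/TYPE lines
            counts[metric_name(line)] += 1
    return counts


def keep_metrics(exposition_text, allowed_names):
    """Keep only sample lines whose metric name is in the allowlist."""
    kept = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and metric_name(line) in allowed_names:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


def estimate_disk_bytes(series, scrape_interval_s, retention_s, bytes_per_sample=2):
    """Rule-of-thumb disk estimate: retention * samples/second * bytes/sample."""
    samples_per_second = series / scrape_interval_s
    return retention_s * samples_per_second * bytes_per_sample
```

As a back-of-the-envelope example, plugging in the ~2.2 million series quoted above, the 15-day retention and an assumed 30-second scrape interval (our real interval may differ), `estimate_disk_bytes(2_221_978, 30, 15 * 86_400)` comes out at roughly 192 GB per Prometheus server. Sorting the output of `series_per_metric` by count is a quick way to spot which metrics are worth dropping first.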
The DevOps team still has a lot of monitoring work to do, but the job done so far with Prometheus has been amazing (thanks, Flo!), and we’ve learned a lot along the way. The following are some general monitoring thoughts worth considering:
- Prometheus is not a monitoring silver bullet, though it’s great.
- Monitoring is not a role or job; it’s a skill everyone at GumGum should embrace as part of the software development process. The Ops team is responsible for designing, building and maintaining the core monitoring infrastructure, while the development teams should define (with DevOps guidance) meaningful KPIs, dashboards and alerts that reflect the value and impact on the business.
- We need to pay more attention to metrics and alerts closer to the end-users: application response times, HTTP errors, etc. You can monitor as deep and wide as you want, but always keep in mind: How will these metrics show me end-user impact?
- Individual instance/host metrics and alerts are not that important in modern microservices architectures and ephemeral environments.
- Metrics may have different meanings and relevance depending on the context, workload or application.
Thanks to the Ops team, especially Corey Gale, for helping me with the writing.