Metrics Management with Google Cloud Managed Service for Prometheus
Maisons du Monde is a furniture and home decor company that was founded in France over 25 years ago. We have 360 stores across France, Italy, Spain, Belgium, Luxembourg, Germany, Switzerland and Portugal. We have a team of Operations Engineers and Site Reliability Engineers (SRE) that manage the MDM website as well as our APIs and systems that run omnichannel services such as orders, users, and carriers. Having telemetry data like logs and metrics is critical to our ability to run secure and reliable applications and services.
Commonly, organizations approach metrics using either a full-service monitoring and metrics storage tool, such as Zabbix or Centreon, or a database like InfluxDB to store their metrics, which they then display using a visualization tool like Grafana.
For the past 5 years, we’ve been using managed monitoring and storage services from several vendors for our metrics. As we moved more of our operations to managed Kubernetes, we evaluated new metrics platforms that would allow applications to expose granular, Kubernetes-specific metrics. About 8 months ago we decided on Prometheus. It is a good fit for our environment, which contains cloud native applications, built on Kubernetes, run on ephemeral compute infrastructure. Having a dedicated application metrics environment made it easier for our application teams to own the metrics, instead of a central operations team. In the new paradigm of microservices built in Kubernetes, it is our belief that ownership of the metrics should reside with the application teams. It allows developers and product owners to maintain the metrics they deem essential for alerting and dashboarding.
Advantages of using Prometheus
Prometheus has been widely adopted by organizations that are dealing with paradigm shifts such as microservices, cloud native development, interoperability between multiple monitoring solutions, and auto-discovery of services.
Its architecture is extensible by design, which we find particularly appealing.
When using open source Prometheus in production environments, we find the following attributes very useful:
Prometheus is pull-based by design, though push-based behavior can be added through an external component. For example, Pushgateway can be used for specific workloads such as Kubernetes Jobs.
Thanks to its API, we can connect external dashboarding tools such as Grafana. From Grafana, we use the Prometheus query language (PromQL) to query metrics from Prometheus instances and display them on rich dashboards.
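As an illustration of what a dashboarding tool does behind the scenes, here is a minimal Python sketch that builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and reads the values out of a response body. The host name and the sample response are hypothetical, and the sketch performs no network call.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant PromQL query against the HTTP API."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def extract_values(response_body):
    """Return (labels, value) pairs from an instant-vector API response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]


# Hypothetical response, shaped like an instant-vector result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "api"},
             "value": [1700000000, "1"]},
        ],
    },
})
```

Calling `instant_query_url("http://localhost:9090", "up")` gives the URL a client would request, and `extract_values(SAMPLE_RESPONSE)` shows how the label set and the numeric value are recovered from the JSON body.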
By default, Prometheus collects metrics in pull mode by scraping URLs that are exposed over HTTP. The endpoint format is generic by design, so that any application can expose its own metrics.
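To show how little an application needs in order to be scraped, here is a minimal Python sketch of a `/metrics` endpoint that serves one counter in the Prometheus text exposition format. The metric name `demo_http_requests_total`, the port, and the in-memory counter are all hypothetical; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter, keyed by HTTP method.
REQUEST_COUNTS = {"GET": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_http_requests_total Total HTTP requests handled.",
        "# TYPE demo_http_requests_total counter",
    ]
    for method, value in sorted(counters.items()):
        lines.append(f'demo_http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUEST_COUNTS["GET"] += 1
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` and pointing a scrape job at `http://<host>:8000/metrics` would let Prometheus pull the counter on every scrape interval.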
Alerting is supported natively through Prometheus Alertmanager. Prometheus’ rules configuration files, written in YAML with PromQL expressions, define the alert conditions, and Alertmanager routes fired alerts to notification channels.
Prometheus has its own web UI which is not simple but can be useful for testing or debugging purposes. The option that is more widely used, including by us, is using a Prometheus-compatible web UI such as Grafana.
Prometheus works well but we need more
Our experience running open-source Prometheus was great when we first started out. However, as we deployed it on an increasing number of Kubernetes clusters used to run our production applications, we ran into some constraints. These included:
Support for scaled management
More organizations are leveraging infrastructure-as-code to deploy and manage resources because it is more efficient and results in fewer errors. We need a simple way to deploy Prometheus in each Kubernetes cluster by policy.
Prometheus’ default time series database retention is set to 15 days. The database retention is configurable, but it will increase your costs and resource consumption to keep your metrics on disk for longer periods. We need a better way to manage metrics retention for longer periods of time for all our Kubernetes clusters and applications.
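To get a feel for what longer retention costs, here is a rough Python sketch based on the capacity-planning rule of thumb from the Prometheus storage documentation (retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample, with roughly 1 to 2 bytes per sample on average). The ingestion rate used below is a hypothetical figure, not one measured on our clusters.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate local TSDB disk usage for a given retention period.

    bytes_per_sample defaults to the upper end of the 1-2 byte
    rule of thumb, which makes the estimate conservative.
    """
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def to_gigabytes(size_in_bytes):
    """Convert bytes to decimal gigabytes."""
    return size_in_bytes / 1e9
```

With a hypothetical 100,000 samples per second, `to_gigabytes(needed_disk_bytes(15, 100_000))` comes to about 259 GB for the default 15 days, while a full year at the same rate is about 6.3 TB per Prometheus instance.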
Backup/Disaster Recovery and restoration
Business continuity considerations are important for any service used in production. With Prometheus, we found disk failures and backups to be a pain point. We need ways to back up and restore data on Prometheus instances at scale, so that failures do not lead to data loss.
Furthermore, Prometheus doesn’t offer a native sharding feature, which may be a strength from an administration or deployment point of view, but ends up being a weakness if you have multiple clusters to monitor.
Prometheus stores rules in static files, which means you have to reload or restart your Prometheus instances in order to apply rules file updates.
Using Prometheus and Thanos to address some (but not all) needs
As we’ve seen, Prometheus is known for its simplicity and reliability. But at scale, there are some limits such as metrics retention and storage.
To address some of these constraints, we adopted Thanos, an open-source project released in 2018 by Improbable. It helps us with multi-cluster management and data storage by sending Prometheus metrics to object storage such as Google Cloud Storage, Azure Blob Storage or Amazon S3.
Like Prometheus, Thanos’ architecture is extensible by design.
Advantages to using Prometheus and Thanos together
Unlike Prometheus, Thanos is query-based instead of collection-based. Thanos sidecars are deployed alongside Prometheus instances and gather only metrics they are asked to expose.
It’s worth reading the documentation, which clearly explains the role of each component: https://thanos.io/tip/thanos/getting-started.md/
If Prometheus retention has been configured and the requested metrics are no longer available on the local disk, Thanos asks its Store Gateway component to retrieve them from the remote storage location.
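The following Python sketch is a simplified mental model of that behavior, not Thanos’ actual implementation: given a query time range and the local retention window, it decides whether the data can come from the sidecar next to Prometheus, from the Store Gateway, or from both. The source names and the 15-day retention in the usage note are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def sources_for_range(start, end, now, local_retention):
    """Return which components hold data for the range [start, end].

    Anything newer than (now - local_retention) is assumed to still be
    on the Prometheus local disk; anything older is assumed to live
    only in object storage, behind the Store Gateway.
    """
    if start > end:
        raise ValueError("start must not be after end")
    cutoff = now - local_retention
    sources = []
    if end >= cutoff:
        sources.append("sidecar")
    if start < cutoff:
        sources.append("store-gateway")
    return sources
```

For example, with `now = datetime(2022, 1, 31, tzinfo=timezone.utc)` and `timedelta(days=15)` of local retention, a query over the last 24 hours only needs the sidecar, while a query over the last 30 days needs both components.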
With this feature, we can address the metric retention issue raised with standalone Prometheus.
Thanos allows us to set up a global view of our multi-cluster environments, whereas Prometheus could not. This requires us to set up one Querier per Kubernetes cluster and one Querier “federator,” which you can see in the diagram below.
The Querier components can be added to our multi-cluster environments via the addition of a simple configuration (see example code below) to get a global view of our metrics.
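A minimal sketch of what such a configuration boils down to, written in Python for illustration: each Querier is started with a list of `--endpoint` flags, and the “federator” simply lists the per-cluster Queriers as its endpoints. The host names are hypothetical, the ports are the Thanos defaults, and the real deployment would express the same arguments in Helm values or Kubernetes manifests rather than in Python.

```python
def querier_args(endpoints, replica_label=None):
    """Build the argument list for a `thanos query` process."""
    args = [
        "query",
        "--http-address=0.0.0.0:10902",
        "--grpc-address=0.0.0.0:10901",
    ]
    if replica_label:
        # Deduplicate series coming from replicated Prometheus instances.
        args.append(f"--query.replica-label={replica_label}")
    args.extend(f"--endpoint={endpoint}" for endpoint in endpoints)
    return args


# Per-cluster Querier: talks to the local sidecar and Store Gateway
# (hypothetical in-cluster service names).
cluster_querier = querier_args(
    ["thanos-sidecar.monitoring.svc:10901",
     "thanos-store-gateway.monitoring.svc:10901"],
    replica_label="replica",
)

# "Federator" Querier: talks to one Querier per cluster
# (hypothetical host names).
federator_querier = querier_args(
    ["querier.cluster-a.example.internal:10901",
     "querier.cluster-b.example.internal:10901"],
)
```

Because a Querier exposes the same gRPC API as the components it queries, stacking Queriers this way is what provides the global view across clusters.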
Prometheus and Thanos aren’t perfect
Thanos helps us a lot by dealing with issues raised by standalone Prometheus. However, it comes with a lot of components which increase complexity.
The multi-cluster, global environment that we described above requires engineering resources and time to set up and maintain. Our engineers’ time is very valuable and we would rather spend it developing new features instead of maintaining a state-of-the-art metrics system.
Increased infrastructure load
Configuring each Kubernetes cluster with the Thanos Queriers to enable remote storage leads to increased network bandwidth consumption. In addition, we now have more components added to Prometheus, which means higher resource consumption (CPU, RAM).
Google Cloud Managed Service for Prometheus
In October of 2021, Google Cloud released the public preview of Managed Service for Prometheus, which aims to be a drop-in replacement for an existing Prometheus stack. We now use Google Cloud’s service to monitor our workloads and manage alert notifications, without having to operate or maintain the underlying metrics infrastructure ourselves.
Metrics for the service are retrieved by collectors, which are a fork of the open source Prometheus technology. The collectors send metrics to Google’s global time-series database named Monarch, removing the need for Thanos.
Google Cloud gives us two modes for using Managed Service for Prometheus. In our case we are using managed collection, which allows us to reduce the complexity of deploying and managing Prometheus instances. Managed Service for Prometheus provides an operator to configure Custom Resources (CRs) for scraping metrics, evaluating rules, and more. All Prometheus operations are handled by the Kubernetes operator.
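As an illustration of how lightweight these Custom Resources are, here is a minimal Python sketch that builds a PodMonitoring manifest as a dictionary and prints it as JSON, which `kubectl apply -f -` accepts as well as YAML. The resource name, namespace, label, and port name are hypothetical, and the `apiVersion` should be checked against the operator release in use (preview releases used an alpha version).

```python
import json


def pod_monitoring(name, namespace, app_label, port="metrics", interval="30s"):
    """Build a PodMonitoring manifest that scrapes pods labelled app=<app_label>."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Which pods to scrape.
            "selector": {"matchLabels": {"app": app_label}},
            # Which named container port to scrape, and how often.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


manifest = pod_monitoring("orders-api", "orders", "orders-api")
print(json.dumps(manifest, indent=2))
```

The printed manifest can be piped to `kubectl apply -f -`, or the same structure can be templated in a Helm chart alongside the application it monitors.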
In addition, this solution supports more current Prometheus use cases (e.g. migrating from ServiceMonitor to PodMonitoring scrape configs).
We want to focus our attention on building a functional and strategic metrics-based operations practice, instead of building a competency in managing long-term storage and Prometheus infrastructure. Because we expect our metrics data to steadily grow alongside our company’s growth, we know that managing metrics at scale ourselves will become very painful. Google Cloud Managed Service for Prometheus helps us achieve scaled metrics infrastructure in a straightforward way, as a managed service, without devoting hundreds of servers to this effort.
Managed Service for Prometheus is not a perfect solution. It can only be deployed using the Google Cloud Console, the gcloud CLI, or the kubectl tool, although we hear that Terraform support is coming. You may need to dedicate additional engineering resources if you want to deploy it using Helm charts.
We chose Google Cloud Managed Service for Prometheus because it allows us to focus on using our metrics instead of managing metrics infrastructure. It provides us with:
- Long-term retention of metrics
- Seamless support for high availability of Prometheus instances
- Scraping and evaluating rules using lightweight Kubernetes Custom Resources
- A global query view
- An out-of-the-box, fully managed solution
We have configured and tested Managed Service for Prometheus in our development environment, and we are currently in the process of bringing it to our production environments.
Path to Production: Our Helm Charts
To automate deployment of Managed Service for Prometheus, we created Helm charts and implemented them in our cluster with Terraform and Terragrunt as described in detail below:
- Design a Helm Chart for the Google Cloud Managed Service for Prometheus Operator which includes:
  - Manifests such as Deployment, Service, ClusterRole, ClusterRoleBinding, and OperatorConfig
- Create a Terraform external module which is responsible for:
  - Adding Google Cloud IAM authorization to Google Cloud service accounts, which are linked to Kubernetes service accounts
  - Creating the Managed Service for Prometheus CRDs that are required by the service’s operator using the Terraform kubernetes_manifest resource
  - Deploying the Operator Helm Chart (see below) using the Terraform helm_release resource
- This Terraform external module is called from a Terraform/Terragrunt project that manages all of our infrastructure-as-code from development to production environments.
- Design a Helm Chart for the Managed Service for Prometheus Frontend which includes:
  - Deployment and Service Kubernetes manifests
- Create a Terraform external module which is responsible for:
  - Adding Google Cloud IAM authorization to Google Cloud service accounts, which are linked to Kubernetes service accounts
  - Deploying the Frontend Helm Chart within our common GKE cluster, close to the Grafana instances for each environment
We spent a bit of engineering time designing our Helm Charts and Terraform external modules. But now we are more efficient, and ongoing maintenance is painless.
Below, you’ll find an example of our Helm charts and the Terraform module we’ve used.
Moreover, we have deployed gitlab-ci-pipelines-exporter, which gets metrics from the GitLab API (such as pipeline or deployment information), and we use Managed Service for Prometheus to scrape this exporter. Then, we display the data through some awesome Grafana dashboards.
Our plan is to offer monitoring as a service for our developers by adding Managed Service for Prometheus objects within Helm charts with preconfigured channels.
Article co-written with Michael Lopez and Gmarceau