How Grafana Mimir helped Pipedrive overcome Prometheus scalability limits

Karl Martin
Published in Pipedrive R&D Blog
6 min read · Aug 9, 2022

Co-authored with Rasmus Rüngenen and Tambet Paljasma

Introduction

In sales, as in life, you can’t control your results, but you can control your actions.

With that in mind, in 2010 a team of sales professionals set out to build a customer relationship management (CRM) tool that helps users visualize their sales processes and get more done. So, they created Pipedrive — the first CRM platform made for salespeople by salespeople. Pipedrive is based on activity-based selling — a proven approach that’s all about scheduling, completing and tracking activities.

Soon, Pipedrive’s team realized you can’t control your actions (or scale a business) if you don’t control your observability. This is why over 5 years ago we implemented Prometheus, which quickly became the company standard for exporting monitoring metrics.

In our stack, when a new microservice is deployed, all a developer needs to do is add the “prometheus_exporter_*” annotation and wire the correct client libraries into their service. The metrics are then picked up automatically and become queryable in Grafana within a single Prometheus scrape interval.
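
For context, this kind of annotation-driven discovery is typically wired together with Kubernetes service discovery and relabeling on the Prometheus side. The sketch below shows the general idea; the annotation keys are illustrative stand-ins for our internal prometheus_exporter_* convention, not our literal config:

```yaml
# Hypothetical sketch of annotation-driven scraping; the
# prometheus_exporter_* annotation keys below are stand-ins for
# our internal convention, not the literal names.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_exporter_scrape]
        action: keep
        regex: "true"
      # Scrape the port declared in the annotation instead of the default
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_exporter_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Allow services to override the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_exporter_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
```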

The problem: Scaling a federated Prometheus deployment

Our monitoring setup has grown along with the company, expanding from one physical location to a federated setup spanning three physical data centers plus AWS deployments in five product regions and five test/dev regions. Grafana is used across the entire company, with roughly 500 active users relying on it for visualization and alerting.

Up until this year, we were using a federated deployment model: a Prometheus federator in each data center scraped metrics from all services, VMs and hosts, and two main Prometheis (test and live) scraped those federators into a central, aggregated location.
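
As a rough sketch, a federation job on the central Prometheis looked something like the following (the federator hostnames are placeholders, not our real targets):

```yaml
# Sketch of a federation job on the central, aggregated Prometheus.
# Federator hostnames are placeholders.
scrape_configs:
  - job_name: federate-dc
    honor_labels: true        # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'       # pull everything the federators scraped
    static_configs:
      - targets:
          - prometheus-federator.dc1.internal:9090
          - prometheus-federator.dc2.internal:9090
          - prometheus-federator.dc3.internal:9090
```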

But around eight months ago, we started noticing problems with our live Prometheus instance, which began to crash and run out of memory for no obvious reason. Increasing resources only worked up to 32 vCPUs and 256 GB of memory; adding more resources beyond that point turned out to be futile and did not fix our issues. Prometheus restarts also took up to 15 minutes to replay the WAL. We couldn’t afford these delays, as our entire observability and alerting strategy depended on Prometheus’s availability.

For the aggregated Prometheus instance, problems started when we hit ~8 million active series, ~20 million chunks and ~200k label pairs.

As we were looking into on-site tracing and logging solutions, we landed on Grafana Tempo and Grafana Loki. Around the same time that we wanted to replace our aggregated Prometheus instances, Grafana Labs rolled out Grafana Mimir. Taking into account the features Mimir had introduced, such as fast query performance and enhanced compactors, and our previous experience with Grafana tools, we decided to jump right in and implement Mimir in our stack.

Configuring Grafana Mimir at Pipedrive

It took a team of two engineers roughly two to three months to fully implement, scale and optimize Grafana Mimir in our setup. We started by gradually implementing Mimir while our entire Prometheus stack was still operational and replicated all the data without any changes or service disruption.

To gracefully migrate from Prometheus to Mimir, we configured our Prometheis federators to remotely write to Mimir and made Mimir available as a new data source in Grafana for everyone to test out during the PoC period.
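
The Prometheus-side change for this is small. Below is a hedged sketch of such a remote_write block; the URL, tenant ID and queue numbers are placeholders rather than our production values:

```yaml
# Sketch: Prometheus remote-writing into Mimir's push endpoint.
# URL, tenant ID and queue numbers are placeholders.
remote_write:
  - url: https://mimir.example.internal/api/v1/push
    headers:
      X-Scope-OrgID: infra          # tenant ID, if multi-tenancy is enabled
    queue_config:
      capacity: 10000
      max_samples_per_send: 2000    # tune for your sample rate
```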

After Mimir data was synced with all of our Prometheis, we started to redirect query requests going to our aggregated Prometheus instances to Mimir, and we gradually optimized the Mimir configuration and deployment for all users and services using Prometheus metrics. In particular, we spent some time fine-tuning the Mimir per-tenant limits as well as CPU and memory requirements for each Mimir microservice, tailoring the configuration to our usage patterns.

Pipedrive’s Grafana Mimir production cluster is currently running in Kubernetes, using the microservices deployment mode, handling the following traffic:

  • Between 12 and 15 million active series, depending on the time of day
  • 400K samples/sec received on the write path
  • 600 queries/sec successfully executed on the read path on average
  • Pod resources (see the sketch after this list):
    - 52 ingesters (1.5 CPU / 9 GB memory)
    - 30 distributors (1.2 CPU / 1.5 GB memory)
    - 10 store-gateways (3 CPU / 5 GB memory)
    - 2 compactors (2 CPU / 3 GB memory)
    - 18 queriers (1.5 CPU / 3 GB memory)
    - 18 query-frontends (0.3 CPU / 0.5 GB memory)
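
To make the sizing above concrete, here is roughly how the ingester numbers translate into a Kubernetes container resource spec. This is a trimmed, illustrative snippet; the image tag, flags and the choice to omit a CPU limit are assumptions, not our exact manifests:

```yaml
# Sketch: the ingester sizing above as a Kubernetes container spec
# (trimmed; only the relevant fields are shown).
containers:
  - name: ingester
    image: grafana/mimir:2.2.0            # example tag; pin whatever you run
    args:
      - -target=ingester                  # microservices mode: one component per pod
      - -config.file=/etc/mimir/mimir.yaml
    resources:
      requests:
        cpu: 1500m
        memory: 9Gi
      limits:
        memory: 9Gi                       # no CPU limit here, to avoid throttling
```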

The migration was fully transparent for our users and a huge win for our observability team.

The biggest benefit of migrating to Mimir was solving the Prometheus scalability limits we had been facing. Previously, our main Prometheus instances were running out of memory under high load. This led to on-call engineers receiving pages day and night, and the whole company losing valuable observability data until Prometheus recovered, which typically took a long time. Migrating to Mimir gave us much more room to scale as Pipedrive services started to expose more and more metrics each day.

Moreover, with Grafana Mimir’s cardinality analysis API and active series custom trackers, we were able to see which services expose the highest-cardinality metrics, which allows us to continuously fine-tune them.
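
As a sketch of how that looks in practice (the tracker names and selectors are made up, and the exact config key can vary between Mimir versions):

```yaml
# Sketch: active series custom trackers in the Mimir limits config.
# Tracker names and selectors are made up; the exact key name may
# differ between Mimir versions.
limits:
  active_series_custom_trackers:
    deals_service: '{service="deals"}'
    activities_service: '{service="activities"}'

# The cardinality analysis API lives under the Prometheus HTTP prefix, e.g.:
#   GET <mimir>/prometheus/api/v1/cardinality/label_names
#   GET <mimir>/prometheus/api/v1/cardinality/label_values?label_names[]=service
```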

Also, a single (unintentionally!) malicious query cannot bring down the entire Mimir stack. The worst-case scenario we have observed in Mimir is a few query-frontend and/or querier pods getting OOM-killed, but that does not have a significant impact on other requests happening at the same time. With Prometheus, the same scenario would have meant restarting the entire service, which, in our case, is roughly 10 minutes of downtime for the entire metrics stack and all on-call engineers receiving false alerts.

Mimir per-tenant limits have been very useful, too. They cap the amount of data ingested and queried on a per-tenant basis and prevent the whole system from being overloaded or even crashing under heavy load. Limits let us address the root cause of high-cardinality metrics at the level of the offending product service, without the whole observability stack being compromised while the issue was being resolved.
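
For illustration, per-tenant limits can be applied through Mimir’s runtime overrides file; the tenant ID and numbers below are examples, not our production settings:

```yaml
# Sketch of a Mimir runtime overrides file; the tenant ID and values
# are illustrative, not our production settings.
overrides:
  infra:
    ingestion_rate: 500000                # samples/sec this tenant may push
    ingestion_burst_size: 1000000
    max_global_series_per_user: 20000000
    max_fetched_chunks_per_query: 2000000
```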

The future of the Grafana LGTM stack at Pipedrive

We continue to work on optimizing our Grafana Mimir setup within our stack and finding what configurations and limits work for us. We are also leveraging preconfigured Grafana Mimir dashboards and alerts.

Grafana combined with Grafana Mimir has become the standard toolset we use to monitor services and processes at Pipedrive. And we indoctrinate our engineers from the get-go: every new engineer joining Pipedrive receives an onboarding session explaining the basics of Grafana and Mimir to get them jump-started with our stack, which is only expanding from here. Since we’ve had a great experience with Grafana platforms so far, we are planning to test Grafana Loki and Grafana Tempo in the coming months to replace our existing logging and tracing platforms.

Interested in working at Pipedrive?

We’re currently hiring for positions in several countries/cities.

Take a look and see if something suits you

Positions include:

  • Front-End, Back-End and Lead Engineers
  • Junior Site Reliability Engineer
  • Principal Solutions Architect
  • React Native Developer
  • And several more…

Karl Martin
Observability team lead @ Pipedrive. Implementing, maintaining and supporting several observability platforms such as Grafana, Prometheus, Graylog, and New Relic.