From Deployment to Actionable Insights (In Under a Minute)

How Pipedrive enables developers to get their monitoring data rolling within the deployment pipeline.

Kristjan Hiis
Pipedrive R&D Blog
Jul 16, 2020



To start things off, I’d like to shed light on a problem we had at the very beginning of our monitoring journey at Pipedrive. If you haven’t yet, please check out the post about how Pipedrive is “Fueling the Rocket for 500 deploys per week”, as it is central to the story here.

As you may know, developers usually want to get actionable data immediately after they hit the Deploy button. This is reasonable: who wouldn’t want to know, as soon as possible, whether the thing they are implementing is behaving as it should?

That’s where we step in, “we” being the Monitoring Platform team at Pipedrive. To give you a little background — we used to use Zabbix as our monitoring platform, coupled with Graylog for logging and some homegrown scripts (which performed a few little magical things that I won’t go into now). Over the last two years we have improved our stack drastically, and the following will give you insight into how we managed to do it.

We started off by mapping out the pain points of the day-to-day work for the developers and came to the following conclusions:

  • It was really hard to make sure each microservice had monitoring
  • You needed to contact someone from the monitoring team to set up the aforementioned monitoring
  • There wasn’t a solid process for visibility and alerting
  • It took a lot of time setting things up

What did we do to make it better?

First, we researched which tools would be best to implement and how far we could customize them to our needs.
I believe you may already know what tools we landed on.

Grafana and Prometheus were the best choices 2 years ago and I think they’re still the best options now — they provide developers and platform engineers the flexibility that is needed for a constantly changing environment.

Our setup is complete with automation in place: the usual workflow goes from deployment to seeing the newly deployed metrics inside Grafana in under a minute (or so).

Once we settled on a solution where all the metrics developers need to expose can be picked up through a single Prometheus Service Discovery tag, we started saving time.
You could always give the power to the people and let everyone alter the Prometheus config files on the fly, but that could lead to several other issues, which I’ll talk about in the next post about how monitoring is done at Pipedrive.

So far, this solution has served us well, but I want to unpack something from the workflow above: there are some tags somewhere, and something gets discovered and then scraped. What does that even mean?

To put it simply, Consul is our service discovery software, and it allows us to put tag values on any resource that has registered to it. We use a simple value of pipedrive_exporter_ServiceName, where ServiceName is a variable.
So far, pretty simple, right?
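
In practice the registration usually happens through the deployment tooling rather than by hand, but as a rough sketch, registering a service with such a tag through Consul’s agent HTTP API could look like this (the service name, address, and port are made up for illustration):

```python
import requests

# Hypothetical service definition carrying the pipedrive_exporter_<ServiceName> tag.
# Name, ID, address, and port are illustrative values, not real Pipedrive services.
service_definition = {
    "Name": "deal-service",
    "ID": "deal-service-1",
    "Address": "10.0.0.12",
    "Port": 8080,
    # The tag that Prometheus Service Discovery will later look for.
    "Tags": ["pipedrive_exporter_deal-service"],
}

# The local Consul agent listens on port 8500 by default.
response = requests.put(
    "http://localhost:8500/v1/agent/service/register",
    json=service_definition,
)
response.raise_for_status()
```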

As soon as Consul registers a container/service/pod with the right tag, Prometheus Service Discovery kicks in and checks it against a rule we have configured. If the tag in Consul matches the Prometheus SD rule, the target is added to the scrape list and gets scraped periodically from then on.
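
The actual configuration isn’t shown here, but a minimal Prometheus scrape job built on Consul Service Discovery could look roughly like the sketch below; the job name, Consul address, and exact relabeling rules are assumptions for illustration, not our real setup.

```yaml
# Hypothetical Prometheus scrape job using Consul Service Discovery.
scrape_configs:
  - job_name: "consul-discovered-services"
    scrape_interval: 30s
    consul_sd_configs:
      - server: "consul.internal:8500"
    relabel_configs:
      # Keep only targets that carry a pipedrive_exporter_* tag in Consul.
      - source_labels: [__meta_consul_tags]
        regex: ".*,pipedrive_exporter_.*,.*"
        action: keep
      # Attach the Consul service name as a "service" label on every metric.
      - source_labels: [__meta_consul_service]
        target_label: service
```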

This all does come with a small prerequisite, though: the code that has been put into the container has to be instrumented to use Prometheus as its monitoring solution. With the (correct) instrumentation in place, the service serves a /metrics endpoint that exposes all the instrumented data in plain text for Prometheus to scrape into its Time Series Database (TSDB).
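
To make that concrete, here is a minimal, hypothetical instrumentation sketch using the official Python prometheus_client library; the metric name and port are made up, and any language with a Prometheus client library works the same way.

```python
import time

from prometheus_client import Counter, start_http_server

# Illustrative metric; real services would define counters, gauges, and
# histograms that match their own domain.
REQUESTS_TOTAL = Counter(
    "app_requests_total",
    "Total number of handled requests",
    ["endpoint"],
)

def handle_request(endpoint: str) -> None:
    # Application logic would live here; we only record the metric.
    REQUESTS_TOTAL.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    # Expose the /metrics endpoint on port 8000 for Prometheus to scrape.
    start_http_server(8000)
    handle_request("/deals")
    # Keep the process alive so the endpoint stays up (sketch only).
    while True:
        time.sleep(60)
```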

As our scrape interval defaults to 30 seconds, the data that should appear in Grafana arrives in roughly under a minute.

This is merely the first step towards a better tomorrow, and we are constantly trying to improve developer happiness inside our workflows. You can read more about this in upcoming articles.

