The GitOps Way for Consistent Monitoring

Lior Lieberman
Riskified Tech

--

Anyone who has worked with Kubernetes probably knows that tools like Prometheus and Grafana are great for monitoring containerized environments, and are among the most common choices to deploy in a cluster.

Prometheus and Grafana integrate well with Kubernetes. However, as an Infra team, we still carry some overhead configuring and maintaining the monitoring stack in all environments.

The main components in a basic Prometheus monitoring stack are:

  • Prometheus: collects and aggregates metrics
  • Grafana: a visual UI for exploring metrics and creating dashboards
  • Alertmanager: groups alerts and routes them to integrations such as email, Slack, and PagerDuty

In this blog post, we’ll focus on simplifying the work with Grafana.

Maintaining multiple dashboards

Grafana is indeed an amazing, powerful tool to create, explore and share your data through beautiful, flexible dashboards — no matter where it’s stored.

However, growing as a company brings new challenges in maintaining a large number of dashboards. Let’s take a look at some of them…

Inconsistent dashboards in different environments
We often update a dashboard for a specific need we have at the time, and then forget to update it in the other environments. This creates inconsistency across environments, which makes monitoring harder and sometimes even ineffective. A common example: most changes to Grafana dashboards are made in the production environment, and staging workloads are left terribly under-monitored.

In some cases, outdated dashboards caused us to miss issues in the staging environment, and faulty code reached production.

Ownership
Dashboards do not have clear owners. Development teams create and change dashboards, but migrating the dashboards is the Infrastructure team’s responsibility.

Infrastructure teams are still the go-to for help with dashboard maintenance.
When it comes to disaster recovery, creating a new environment, or migrating an existing one to a new cluster, Infra teams have to restore everything to the desired state and make the necessary adjustments.

ChangeLog
As the Development department grows, the dashboards change more often. Sometimes a single person changes multiple dashboards in different environments for a specific case or test.
Grafana has its own change management, but it works per dashboard. We lack a central place where we can see all the changes someone made, with the ability to roll back everything at once if needed.

Lost dashboards
As mentioned above, developers often create a dashboard for a specific, current need, and when they need it again later they have forgotten what they named it. Requests to the Infra teams to find dashboards keep coming.

Dashboards, ConfigMaps & GitOps

I have a dream: what if we could put the dashboards in one place and they would magically appear in Grafana, in all the environments?

Luckily, Grafana version 5.0 introduced a provisioning system based on config files. This means we can now provision dashboards from config files (config maps in Kubernetes). Grafana loads them automatically, with no need to roll out the deployment again (as we used to do when working with configmaps in Kubernetes).

We now need to find a way to distribute configmaps from one place to all environments.

The GitOps methodology can help us achieve that. We store the dashboard configmaps in Git, and any change merged to the main branch is automatically applied to our system.

There are many GitOps tools out there like Flux, Jenkins and more — we chose ArgoCD.

ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes.

ArgoCD continuously watches a specified GitHub repository, compares it with the manifests currently deployed in the cluster, and effectively runs kubectl apply -f <all> on them.
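
To make the compare-and-apply idea concrete, here is a minimal conceptual sketch in Python. It is not ArgoCD's implementation: the function names (manifest_key, plan_sync) are hypothetical, manifests are plain dictionaries, and a real tool also has to ignore fields that the cluster sets by itself before comparing.

```python
from typing import Any, Dict, List, Tuple

Manifest = Dict[str, Any]
Key = Tuple[str, str, str]


def manifest_key(manifest: Manifest) -> Key:
    """Identify a manifest by kind, namespace, and name."""
    metadata = manifest["metadata"]
    return (manifest["kind"], metadata.get("namespace", "default"), metadata["name"])


def plan_sync(desired: List[Manifest], live: List[Manifest]) -> Dict[str, List[Key]]:
    """Compare the manifests in Git (desired) with the cluster (live).

    Returns the keys that would have to be created, updated, or pruned
    to bring the cluster back in line with Git.
    """
    desired_by_key = {manifest_key(m): m for m in desired}
    live_by_key = {manifest_key(m): m for m in live}

    plan: Dict[str, List[Key]] = {"create": [], "update": [], "prune": []}
    for key, manifest in desired_by_key.items():
        if key not in live_by_key:
            plan["create"].append(key)   # in Git, missing from the cluster
        elif live_by_key[key] != manifest:
            plan["update"].append(key)   # present in both, but drifted
    for key in live_by_key:
        if key not in desired_by_key:
            plan["prune"].append(key)    # in the cluster, removed from Git
    return plan
```

Applying the "create" and "update" entries is what corresponds to the kubectl apply step; whether the "prune" entries are deleted is a policy choice.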

For a wide range of use cases, this results in an incredibly simple and visible continuous delivery process — and in our case continuously deploying Grafana dashboards within configmaps.

New GitOps workflow

We created a Helm chart that reads a directory containing all the dashboards as JSON files. The Helm template creates a separate configmap for each file in the directory.
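
Our chart is a Helm template, but the same "one configmap per file" logic can be sketched in Python for readers who want to see the shape of the output. Everything named here is an assumption for illustration: the function names, the "monitoring" namespace, the "dashboard-" name prefix, and the grafana_dashboard label (a convention used in setups where a sidecar picks dashboards up by label).

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def slugify(text: str) -> str:
    """Lowercase the text and replace anything but ASCII letters/digits with '-'."""
    lowered = text.lower()
    cleaned = "".join(ch if (ch.isascii() and ch.isalnum()) else "-" for ch in lowered)
    return cleaned.strip("-")


def build_configmaps(dashboards_dir: str, namespace: str = "monitoring") -> List[Dict[str, Any]]:
    """Build one ConfigMap manifest per *.json file in dashboards_dir."""
    manifests: List[Dict[str, Any]] = []
    for path in sorted(Path(dashboards_dir).glob("*.json")):
        content = path.read_text(encoding="utf-8")
        json.loads(content)  # fail early if the dashboard is not valid JSON
        slug = slugify(path.stem)
        manifests.append({
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {
                "name": f"dashboard-{slug}",
                "namespace": namespace,
                # Assumed label; adjust to whatever your Grafana setup selects on.
                "labels": {"grafana_dashboard": "1"},
            },
            # One key per ConfigMap: the dashboard file itself.
            "data": {f"{slug}.json": content},
        })
    return manifests


if __name__ == "__main__":
    # Kubernetes accepts JSON manifests, so the result can be printed as-is.
    for manifest in build_configmaps("dashboards"):
        print(json.dumps(manifest, indent=2))
```

Keeping one configmap per dashboard keeps each object small and makes the diff for a single dashboard change easy to review.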

When someone needs to change a dashboard, they first edit it in the UI as before (unless they really want to write thousands of lines of JSON). When they try to save it in the Grafana UI, the JSON describing the dashboard pops up on the screen.
This is the JSON they should copy into Git.
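
Before committing the copied JSON, a small check can catch common mistakes. The sketch below is an assumption about what such a check might look like, not part of our chart: prepare_for_git is a hypothetical helper, and requiring "uid" and "title" is a convention chosen for this example. It clears the numeric "id", which each Grafana instance assigns on its own, and sorts keys so that Git diffs stay stable.

```python
import json
from typing import Any, Dict


def prepare_for_git(exported_json: str) -> str:
    """Validate a dashboard copied from the Grafana UI and normalize it for Git."""
    dashboard: Dict[str, Any] = json.loads(exported_json)  # raises on invalid JSON

    # Assumed convention: every committed dashboard has a stable uid and a title.
    missing = [field for field in ("uid", "title") if not dashboard.get(field)]
    if missing:
        raise ValueError(f"dashboard is missing required fields: {', '.join(missing)}")

    # The numeric id differs between Grafana instances, so do not commit it.
    dashboard["id"] = None

    # Sorted keys and fixed indentation keep diffs small and reviewable.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"
```

A check like this could run locally or in CI on every pull request, so reviewers only see meaningful changes.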

Whenever the changes are pushed to the master/main branch, ArgoCD detects the diff and the CD process starts.

We at Riskified are avid practitioners of Agile and DevOps, delivered through increasing developer independence. This new workflow strengthens our DevOps culture and adds value in several ways:

More developer independence, less overhead for the Infra teams
All monitoring dashboards and metric assessments are handled by the Development team. The team is responsible for creating the dashboards, reviewing peers’ dashboard commits, and deploying them right away to all desired environments.

Consistency
Since dashboards are now stored in Git, ArgoCD monitors the changes and deploys them to all defined environments.
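
One way to picture "all defined environments" is one ArgoCD Application per environment, all pointing at the same chart in the same repository. The sketch below generates such manifests in Python. The repository URL, chart path, cluster URLs, and namespaces are placeholders, not our real values, and the automated sync policy is one possible choice.

```python
import json
from typing import Any, Dict, List


def dashboards_application(environment: str, cluster_url: str, repo_url: str) -> Dict[str, Any]:
    """Build an ArgoCD Application that deploys the dashboards chart to one cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"grafana-dashboards-{environment}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,                   # placeholder repository
                "targetRevision": "main",              # follow the main branch
                "path": "charts/grafana-dashboards",   # hypothetical chart path
            },
            "destination": {"server": cluster_url, "namespace": "monitoring"},
            # Sync automatically; prune configmaps whose dashboards left Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def applications_for(environments: Dict[str, str], repo_url: str) -> List[Dict[str, Any]]:
    """One Application per environment name -> cluster URL pair."""
    return [
        dashboards_application(name, url, repo_url)
        for name, url in sorted(environments.items())
    ]


if __name__ == "__main__":
    clusters = {
        "staging": "https://staging.example.com",
        "production": "https://production.example.com",
    }
    repo = "https://github.com/example-org/grafana-dashboards.git"
    for app in applications_for(clusters, repo):
        print(json.dumps(app, indent=2))
```

Because every environment reads the same path on the same branch, a merged change reaches staging and production together, which is what removes the drift described earlier.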

No more lost dashboards
They are all available in the team’s GitHub repository.

Increasing velocity
Create and test dashboards in one environment; once merged into the master/main branch, the dashboards are released automatically.

Wrapping up

Consistent and accurate monitoring was never an easy goal to achieve. If you face similar challenges, whether they are monitoring related or not, you are more than welcome to share them in the comments section or reach out via LinkedIn.

--

Lior Lieberman
Riskified Tech

Site Reliability Engineer, Kubernetes tutor, and former handball player