Scaling monitoring dashboards at Wise
In this blog post, I’ll deep-dive into how Wise (formerly TransferWise) leverages dashboards as code to monitor more than 500 microservices running across multiple Kubernetes clusters and environments.
Wise’s platform is rapidly evolving to support our customer growth and expansion into new markets: scaling up to more than 500 microservices on Kubernetes presented us with some serious technical and organisational challenges.
We started a long journey towards standardising how we provision, monitor and deploy services on Kubernetes, defining the so-called paved road for product engineering teams. In this article, we’re going to present how we addressed monitoring at scale, focusing on dashboards and the related maintenance challenges in a multi-environment setup.
How we started
Our Observability team, while moving to AWS, chose to adopt Prometheus and Thanos to monitor our platform. Almost 3 years in, we’re ingesting more than 50 million samples per minute. To visualise this data we adopted Grafana, which lowered the barrier to entry for product teams looking to visualise their metrics.
After an initial learning phase with PromQL and Grafana, product teams started writing their own instrumentation for reporting service health (e.g. how many 5xx HTTP responses) and business metrics (e.g. how many card payments per minute). With this data, they created service dashboards via the Grafana UI to visualise “business as usual”.
This approach worked great initially, but it started showing its limitations as we scaled.
Our challenges
Product teams’ toil grew quickly with the number of services they owned. Engineers started copy-pasting dashboards, instrumentation and alerts across services, adding a significant maintenance burden to these teams and making those assets difficult to evolve.
The complexity of our platform has also grown over time as we introduced new environments and Kubernetes clusters (we’re now running around 20 different clusters!).
This resulted in teams having to run, test and support their services across multiple environments and monitoring stacks, exacerbating the problem of keeping their dashboards in sync across environments.
Consistency of visualisation and reusability also became a problem over time: SREs and product engineers couldn’t easily triage major incidents spanning multiple services without extra cognitive load, due to different representations of the same concept (e.g. error ratio vs availability, seconds vs milliseconds, rpm vs rps, etc.).
Moreover, it became hard for SREs to centrally roll out fixes or updates to dashboards when changing or deprecating underlying libraries, with teams quickly falling behind.
We also realised that creating a dashboard from the UI was so easy that it led to dashboard proliferation and stagnation (~1,000 dashboards), making it hard to tell which ones were up to date (something no one wants to deal with during an incident).
Finally, we wanted to be able to reuse open source dashboards provided by the community (e.g. monitoring-mixins) without forking them.
Our solution
We’ve been using configuration as code extensively within the organisation, so what stopped us from doing the same for our monitoring dashboards? Google’s SRE Workbook supported our idea, so at the end of 2019 we started exploring some of the options available to implement dashboards as code at Wise. This initiative moved alongside the standardisation of telemetry via shared libraries, to offer product teams out-of-the-box monitoring for their services.
In this article we will focus on dashboards as code and provisioning; alert provisioning will be covered in a separate one.
Dashboards as code
When Grafana announced dashboard provisioning support in version 5, we knew it would solve some of our problems. We carefully evaluated a few solutions for programmatically defining Grafana dashboards, such as grafanalib (written in Python) and grafana-dash-gen (written in JavaScript), but we decided to use grafonnet because of the flexibility offered by jsonnet.
Since jsonnet is a superset of JSON, it’s possible to write plain JSON inside jsonnet and merge it with, or override, the output of a function provided by the grafonnet library, reducing maintenance and catering for custom needs not yet covered by the upstream library.
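As a minimal sketch of that pattern (using the open source grafonnet-lib; the service name, query and threshold below are hypothetical), a panel generated by the library can be patched with plain panel JSON via jsonnet’s `+` merge:

```jsonnet
// error-ratio-panel.jsonnet -- minimal sketch using grafonnet-lib;
// service name, query and threshold values are hypothetical.
local grafana = import 'grafonnet/grafana.libsonnet';

grafana.dashboard.new(
  'payments-service overview',
  tags=['payments-service'],
)
.addPanel(
  // Panel generated by the grafonnet library...
  grafana.graphPanel.new(
    'HTTP error ratio',
    datasource='Prometheus',
    format='percentunit',
  )
  .addTarget(grafana.prometheus.target(
    |||
      sum(rate(http_requests_total{status=~"5..", service="payments-service"}[5m]))
      / sum(rate(http_requests_total{service="payments-service"}[5m]))
    |||,
    legendFormat='error ratio',
  ))
  // ...merged with plain panel JSON for options the library helper
  // doesn't expose: `+` layers our JSON on top of the generated output.
  + {
    thresholds: [
      { value: 0.01, op: 'gt', colorMode: 'critical', fill: true, line: true },
    ],
  },
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)
```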
We also took inspiration from another open source community project, monitoring-mixins, which defines Grafana dashboards in jsonnet.
Requirements
We set some clear requirements while designing the solution:
- great development experience to ease adoption and contribution;
- GitOps flow, with automatic deployments to the different environments on pull request merge, to avoid drift;
- promote reusability and consistency via shared components to enable composability;
- great discoverability across environments with unique references and tagging;
- easily reuse open source dashboards and keep them up to date, allowing upstream contribution.
Let’s see how we fulfilled these requirements both from a tooling and contribution perspective.
Workflow & Tooling
We spent quite some time figuring out a frictionless flow for engineers to develop, test and ship their dashboards: we knew that we had to streamline the process given the extra work required compared to the smooth Grafana UI experience.
We optimised the workflow for a quick feedback loop and short lead time, without compromising on safety: engineers can easily develop and test their local dashboards against staging or production metrics, with hot reload support, just by launching a Docker container.
We created a dashboards monorepo with opinionated scaffolding where teams could contribute via pull requests: we integrated automated CI checks such as jsonnet linting and validation, and promoted meaningful code reviews via code owners.
Finally, we automated deployments across environments on merge to the main branch, striving for a propagation time of under one minute.
But good tooling and a smooth workflow were not enough to get many teams on board: the entry barrier, especially around jsonnet and grafonnet, was still high enough to hold back adoption despite the benefits offered.
Boosting adoption & contribution via abstractions
We then decided to build abstractions on top of grafonnet to improve dashboard generation and achieve consistency and reusability: ideally, we wanted to provide out-of-the-box building blocks that promote composability, leveraging our standardised telemetry.
We started scripting and providing generic, centrally managed (by SRE) dashboards (e.g. for JVM services, Node.js apps, Python services), built on top of reusable components.
Teams could then decide either to use these generic dashboards alongside their custom business dashboards, or to create their own, mixing and matching existing components with context-specific ones.
We also heavily streamlined dashboard definition for new services (with some well-known layout trade-offs) and libraries, as shown in the example below.
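As a hypothetical sketch of the idea (our internal helper has its own API; the names below are illustrative stand-ins), a streamlined definition built on top of grafonnet could look like this:

```jsonnet
// service-dashboard.jsonnet -- illustrative only: the real abstraction
// lives in internal shared libraries; this helper is a hypothetical
// stand-in showing the shape of the idea.
local grafana = import 'grafonnet/grafana.libsonnet';

local serviceDashboard = {
  // Given a service name, produce a dashboard from shared, reusable panels.
  new(service)::
    grafana.dashboard.new('%s overview' % service, tags=[service, 'generated'])
    .addPanel(
      grafana.graphPanel.new('Requests per second', datasource='Prometheus')
      .addTarget(grafana.prometheus.target(
        'sum(rate(http_requests_total{service="%s"}[5m]))' % service,
        legendFormat='rps',
      )),
      gridPos={ x: 0, y: 0, w: 12, h: 8 }
    ),
};

// A team's whole dashboard definition then shrinks to something like:
serviceDashboard.new('payments-service')
```

With this kind of abstraction, adding a dashboard for a new service becomes a one-liner, at the cost of accepting the generic layout the helper produces.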
To achieve this reusability, we also standardised variable and annotation definitions to allow contextual navigation across dashboards, i.e. keeping the time range and variables when following dashboard links.
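To illustrate what standardised variables can mean in practice (the variable and label names below are examples, not our exact conventions), a shared set of template variables defined once in jsonnet might look like:

```jsonnet
// standard-templates.libsonnet -- illustrative sketch; the
// "environment" and "cluster" labels are hypothetical examples.
local grafana = import 'grafonnet/grafana.libsonnet';

{
  // Shared variables added to every generated dashboard, so that links
  // between dashboards can carry ${environment} and ${cluster} along
  // with the selected time range.
  standardTemplates:: [
    grafana.template.datasource('datasource', 'prometheus', 'Prometheus'),
    grafana.template.new(
      'environment',
      '$datasource',
      'label_values(up, environment)',
      refresh='time',
    ),
    grafana.template.new(
      'cluster',
      '$datasource',
      'label_values(up{environment="$environment"}, cluster)',
      refresh='time',
    ),
  ],
}
```

Each generated dashboard can then pull in the same variables (e.g. via grafonnet’s addTemplates), so links configured to include variables and time range drop the reader into the same context on the next dashboard.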
Reusing open source dashboards (and alerts)
Jsonnet’s flexibility lets us tap into the open source community’s experience and learnings without losing the benefits highlighted above, allowing us to contribute back and keep up to date without much effort.
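As a simplified sketch of the general pattern (using the node-exporter mixin as an example; the vendored path and config keys vary per mixin), jsonnet lets us import an upstream mixin, tweak its configuration and render its dashboards without maintaining a fork:

```jsonnet
// node-mixin-dashboards.jsonnet -- simplified sketch of the mixin pattern;
// the vendored import path and config keys depend on the specific mixin.
local nodeMixin = (import 'github.com/prometheus/node_exporter/docs/node-mixin/mixin.libsonnet') + {
  _config+:: {
    // Override upstream defaults (e.g. label selectors) without forking.
    nodeExporterSelector: 'job="node-exporter"',
  },
};

// Mixins expose ready-made Grafana dashboards as a map of
// file name -> dashboard definition, which we can provision directly.
nodeMixin.grafanaDashboards
```

Bumping the upstream version (e.g. with jsonnet-bundler) then picks up new panels and fixes with no changes on our side, while local tweaks stay confined to the config override.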
We are actively working and experimenting on this, but we might share more in a future article.
Where we are now
We started this journey more than a year ago and we’re seeing good results: we have more than 100 scripted dashboards, allowing us to reliably and consistently monitor our multiple environments.
We also managed to ease central migrations and upgrades of libraries and monitoring components, making them seamless for teams and bringing everyone back up to speed.
SRE and Platform teams are also expected to script and provision all their dashboards — we’re in the process of consolidating and migrating existing ones.
Moreover, every new library in our system must ship with out-of-the-box instrumentation, a (scripted) dashboard and alerts (though the latter are still at an early stage).
We’ve received good feedback from product teams so far, and we’ve already seen the benefits of the approach during more than one incident, so we’ll keep evolving the solution, addressing some of the trade-offs we initially accepted.
Kudos to Toomas Ormisson for helping develop and improve the overall solution.
P.S. Interested in joining us? We’re hiring. Check out our open Engineering roles.