Scaling monitoring dashboards at Wise

Massimo Pacher
Apr 21

In this post, I’ll take a deep dive into how Wise (formerly TransferWise) uses dashboards as code to monitor more than 500 microservices running across multiple Kubernetes clusters and environments.

Wise’s platform is rapidly evolving to support our customer growth and expansion into new markets. Scaling up to more than 500 microservices on Kubernetes presented us with some serious technical and organisational challenges.

We’ve started a long journey towards standardising how we provision, monitor and deploy services on Kubernetes, defining the so-called paved road for product engineering teams. In this article, we’re going to present how we addressed monitoring at scale, focusing on dashboards and the related maintenance challenges in a multi-environment setup.

How we started

After an initial learning phase with PromQL and Grafana, product teams started writing their own instrumentation to report service health (e.g. the number of 5xx HTTP responses) and business metrics (e.g. card payments per minute). With this data, they created service dashboards via the Grafana UI to visualise “business as usual”.

This approach worked great initially, but it started showing its limitations as we scaled.

Our challenges

The complexity of our platform has also grown over time as we introduced new environments and Kubernetes clusters (we’re now running around 20 different clusters!).

This resulted in teams having to run, test and support their services across multiple environments and monitoring stacks, exacerbating the problem of keeping their dashboards in sync across environments.

Consistency of visualisation and reusability also became a problem over time: SREs and product engineers couldn’t easily triage major incidents spanning multiple services without extra cognitive load, because the same concept was represented in different ways (e.g. error ratio vs availability, seconds vs milliseconds, requests per minute vs requests per second).

Moreover, it became hard for SREs to centrally roll out fixes or updates to dashboards when changing or deprecating underlying libraries, with teams quickly falling behind.

We also realised that creating a dashboard from the UI was so easy that it led to dashboard proliferation and stagnation (around 1,000 dashboards), making it hard to tell which ones were up to date (something no one wants to deal with during an incident).

Finally, we wanted to be able to reuse open source dashboards provided by the community (e.g. monitoring-mixins) without forking them.

Our solution

In this article we will focus on dashboards as code and provisioning, while alert provisioning will be covered in a separate article.

Dashboards as code

Because jsonnet is a superset of JSON, it’s possible to write plain JSON in jsonnet and merge it with, or use it to override, the output of a function provided by the grafonnet library. This reduces maintenance and caters for custom needs not yet covered by the upstream library.
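jsonnet performs this kind of merge natively. To illustrate the idea outside jsonnet, here is a minimal Python sketch of overriding a library-generated panel with hand-written JSON. The `library_panel` function is a hypothetical stand-in for a grafonnet helper, and the field values are assumptions chosen for illustration, not our actual definitions.

```python
import copy
import json
from typing import Dict


def deep_merge(base: Dict, override: Dict) -> Dict:
    """Return a new dict with `override` merged into `base`, recursing into nested dicts."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def library_panel(title: str) -> Dict:
    """Hypothetical stand-in for a panel produced by an upstream library function."""
    return {
        "type": "timeseries",
        "title": title,
        "fieldConfig": {"defaults": {"unit": "short"}},
    }


# Plain JSON written by hand, covering options the library function does not expose.
custom_json = json.loads('{"fieldConfig": {"defaults": {"unit": "reqps", "min": 0}}}')

# The override wins where keys collide; everything else comes from the library output.
panel = deep_merge(library_panel("Requests per second"), custom_json)
print(json.dumps(panel, indent=2))
```

The library output stays untouched, so an upstream upgrade only changes `library_panel`, while the hand-written override keeps applying on top of it.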

We also took inspiration from another open source community project, monitoring-mixins, which defines Grafana dashboards in jsonnet.

The building blocks of our solution

Requirements

  • a great development experience to ease adoption and contribution;
  • a GitOps flow, with automatic deployments to every environment on pull request merge, to avoid drift;
  • reusability and consistency via shared, composable components;
  • great discoverability across environments with unique references and tagging;
  • easy reuse of open source dashboards, keeping them up to date and allowing upstream contribution.
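As an illustration of the "unique references and tagging" requirement, the sketch below derives a stable dashboard identifier and a set of tags from a dashboard's owner and name, so that the same dashboard resolves to the same reference in every environment. The naming scheme, the tag format and both function names are assumptions made for this example, not a description of our actual convention.

```python
import hashlib
from typing import List


def dashboard_uid(team: str, service: str, name: str) -> str:
    """Derive a short, deterministic identifier from the dashboard's logical name.

    The same inputs always give the same uid, regardless of the environment
    the dashboard is deployed to (hypothetical scheme).
    """
    key = f"{team}/{service}/{name}".lower()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def dashboard_tags(team: str, service: str, runtime: str) -> List[str]:
    """Build a sorted, de-duplicated list of tags used for discovery (hypothetical format)."""
    return sorted({f"team:{team}", f"service:{service}", f"runtime:{runtime}"})


uid = dashboard_uid("payments", "card-service", "overview")
tags = dashboard_tags("payments", "card-service", "jvm")
print(uid, tags)
```

A deterministic identifier means links between dashboards can be written once and remain valid across clusters.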

Let’s see how we fulfilled these requirements both from a tooling and contribution perspective.

Workflow & Tooling

We optimised the workflow for a quick feedback loop and a short lead time, without compromising on safety: engineers can easily develop and test their dashboards locally against staging or production metrics, with hot reload support, just by launching a Docker container.

We created a dashboard monorepo with opinionated scaffolding where teams could contribute via pull requests. We integrated automated CI checks, such as jsonnet linting and validation, and promoted meaningful code reviews via code owners.
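To give a flavour of what a validation step in CI could look like, here is a minimal Python sketch that checks a set of rendered dashboards for a title, a unique `uid` and at least one tag. The specific rules and the `validate_dashboards` function are assumptions for illustration; the checks we actually run are not reproduced here.

```python
from collections import Counter
from typing import Dict, List


def validate_dashboards(dashboards: List[Dict]) -> List[str]:
    """Return a list of human-readable errors; an empty list means all checks passed."""
    errors: List[str] = []
    # Count every non-empty uid so duplicates can be reported on each offender.
    uid_counts = Counter(d["uid"] for d in dashboards if d.get("uid"))
    for index, dash in enumerate(dashboards):
        label = dash.get("title") or f"dashboard #{index}"
        if not dash.get("title"):
            errors.append(f"{label}: missing title")
        uid = dash.get("uid")
        if not uid:
            errors.append(f"{label}: missing uid")
        elif uid_counts[uid] > 1:
            errors.append(f"{label}: duplicate uid {uid!r}")
        if not dash.get("tags"):
            errors.append(f"{label}: at least one tag is required")
    return errors


rendered = [
    {"title": "card-service / overview", "uid": "a1b2c3", "tags": ["team:payments"]},
    {"title": "card-service / jvm", "uid": "a1b2c3", "tags": []},
]
for problem in validate_dashboards(rendered):
    print(problem)
```

In a pipeline, a non-empty result would fail the pull request before anything is deployed.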

Finally, we automated deployments across environments on every merge to the main branch, aiming for a propagation time of less than one minute.

High level workflow definition

But good tooling and a smooth workflow were not enough to get many teams on board: the entry barrier, especially around jsonnet and grafonnet, was still high enough to prevent adoption despite the benefits on offer.

Boosting adoption & contribution via abstractions

We started scripting and providing generic (e.g. JVM services, Node.js apps, Python services, etc.) and centrally managed (by SRE) dashboards, built on top of reusable components.

Teams could then decide either to use these generic dashboards alongside their custom business dashboards, or to create their own by mixing and matching existing components with context-specific ones.
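The idea of mixing shared components with context-specific ones can be sketched in Python as follows. Everything here is hypothetical: the component names, the metric names in the PromQL expressions (`http_requests_total`, `card_payments_total`) and the fixed two-column layout are assumptions for illustration, and the real components are written in jsonnet.

```python
from typing import Dict, Iterable, List


def panel(title: str, expr: str, unit: str) -> Dict:
    """Build a minimal panel with a single query (hypothetical shared helper)."""
    return {
        "type": "timeseries",
        "title": title,
        "targets": [{"expr": expr}],
        "fieldConfig": {"defaults": {"unit": unit}},
    }


def request_rate_panel(service: str) -> Dict:
    expr = f'sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    return panel("Request rate", expr, "reqps")


def error_ratio_panel(service: str) -> Dict:
    expr = (
        f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    )
    return panel("Error ratio", expr, "percentunit")


def layout(panels: List[Dict], columns: int = 2, height: int = 8) -> List[Dict]:
    """Place panels left to right on a 24-unit-wide grid, `columns` per row."""
    width = 24 // columns
    placed = []
    for index, item in enumerate(panels):
        row, col = divmod(index, columns)
        grid = {"x": col * width, "y": row * height, "w": width, "h": height}
        placed.append({**item, "gridPos": grid})
    return placed


def service_dashboard(service: str, extra_panels: Iterable[Dict] = ()) -> Dict:
    """Combine the shared panels with any team-specific ones."""
    panels = [request_rate_panel(service), error_ratio_panel(service), *extra_panels]
    return {"title": f"{service} / overview", "panels": layout(panels)}


business = panel("Card payments per minute", "sum(rate(card_payments_total[1m])) * 60", "short")
dashboard = service_dashboard("card-service", [business])
print([p["title"] for p in dashboard["panels"]])
```

Fixing the grid in one shared function is what keeps dashboards visually consistent, at the cost of giving up free-form layouts.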

We also heavily streamlined dashboard definition for new services and libraries (with some well-known layout trade-offs), as shown in the example below.

Snippet of dashboard definition for a service

To achieve this reusability, we also standardised variable and annotation definitions to allow contextual navigation across dashboards, i.e. keeping the time range and variables when following dashboard links.
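A minimal sketch of what such standardisation could look like, expressed in Python over the dashboard JSON: every dashboard receives the same variables, plus links that preserve the time range and variable values. The field names (`keepTime`, `includeVars`, `templating`) follow the Grafana dashboard JSON model as I understand it and may differ between Grafana versions; the variable names and helper functions are assumptions for this example.

```python
from typing import Dict, List


def standard_variables() -> List[Dict]:
    """Variables shared by every dashboard, so values carry over between them."""
    return [
        {"name": "datasource", "type": "datasource", "query": "prometheus"},
        {
            "name": "job",
            "type": "query",
            "datasource": "$datasource",
            "query": "label_values(up, job)",
        },
    ]


def tag_link(title: str, tag: str) -> Dict:
    """Link to all dashboards carrying `tag`, keeping time range and variables."""
    return {
        "type": "dashboards",
        "title": title,
        "tags": [tag],
        "asDropdown": True,
        "keepTime": True,
        "includeVars": True,
    }


def with_navigation(dashboard: Dict, related_tags: List[str]) -> Dict:
    """Return a copy of `dashboard` with the standard variables and links attached."""
    result = dict(dashboard)
    result["templating"] = {"list": standard_variables()}
    result["links"] = [tag_link(f"Related: {tag}", tag) for tag in related_tags]
    return result


overview = with_navigation({"title": "card-service / overview", "panels": []}, ["team:payments"])
print(overview["links"])
```

Because every dashboard declares the same variable names, a link that includes variables lands on the target dashboard with the same selection already applied.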

Reusing open source dashboards (and alerts)

Reusing k8s community dashboards with custom overrides

We are actively working on and experimenting with this, and we might share more in a future article.

Where we are now

We also managed to ease central migrations and upgrades of libraries and monitoring components, making them seamless for teams and bringing them back up to speed.

SRE and Platform teams are also expected to script and provision all their dashboards; we’re in the process of consolidating and migrating existing ones.

Moreover, every new library in our system must ship with out-of-the-box instrumentation, (scripted) dashboards and alerts (although the latter is still at an early stage).

We’ve received good feedback from product teams so far, and we’ve already seen the benefits of the approach in more than one incident, so we’ll keep evolving the solution and addressing some of the trade-offs we initially made.

Kudos to Toomas Ormisson for helping to develop and improve the overall solution.

P.S. Interested in joining us? We’re hiring. Check out our open Engineering roles.
