Why Observability is a must for product engineering teams

Massimo Pacher
Wise Engineering


Thanks to James Bach, Tony Qin & Kostas Stamatoukos for their feedback and support in writing this article.

The TransferWise platform has dramatically changed in the last few years. We have moved from our Grails monolith to a microservices architecture, which lets us move faster, unlock new product opportunities, and grow our engineering teams.

As we extracted more and more domains into separate microservices, the system became harder to reason about, which led to multiple red herrings during incidents.

In our previous article about Observability, we defined the term as:

…to get insights into why services behave the way they do, and to make better decisions to continuously improve the TransferWise platform, customer experience, and development processes.

We also highlighted some of the issues arising from our architectural decisions, and discussed the effects of poor observability in some parts of our system.

As part of continuously trying to improve the product, we practice a blameless postmortem culture, which promotes learning from incidents and preventing them from reoccurring. There was a common theme across many of our incidents: product teams had no visibility of their services, and no common tooling to either publish metrics to, or debug from. We have since remediated this by bringing up a monitoring stack consisting of Prometheus, Thanos, AlertManager, Grafana, Jaeger, ELK, VictorOps, and others.

With this technical stack in mind, and our definition of Observability at hand, this article will go into some of the technical problems our product teams faced on the journey to becoming observable. We will discuss how the initial process worked with just a few services, analyse its flaws when scaling to a fleet, and explain some of our practical solutions.

We will leave the Road to SLOs and the ongoing cultural SRE effort at TransferWise for another article, but we consider the following approach an important milestone on the way there.

A flawed process for Product Teams — Hold & Receive case study

To give some context around the state of the TransferWise platform, we’re currently running on a set of Kubernetes clusters hosting around 250 microservices, mainly powered by Spring Boot 2 and fully owned by autonomous teams.

The Hold & Receive team, which is responsible for the TransferWise Account powering our neon green Debit Card, owns 10 microservices. As the team built these services, they encountered a lot of problems around code duplication and maintainability, which became quite painful over time, especially when starting a new service.

Some of these services, like balance (our financial ledger), participate in most of our customers’ interactions (for example, when making a card payment) and must be highly available in order to provide the quality of service our customers expect.

Things are going to go wrong sooner or later, so quickly triaging and mitigating the effects of an incident, while supporting the daily financial lives of millions of customers, is critical for our engineers and business.

The Hold & Receive team has put a lot of effort into making their services more observable. They have approached it from a technical and cultural point of view, because Observability is more than just your logs, your metrics, and your traces. But it has been a painful journey to get there.

For the technical side of implementing Observability, the team had to go through the following process for each new service they developed:

  1. introduce and enable Micrometer to expose common metrics (e.g. HTTP latency, HTTP errors)
  2. define a Prometheus registry and endpoint with common labels (e.g. instance, app) and make sure Prometheus can scrape it (steps 1 and 2 are sketched below)
  3. enable Prometheus scraping in our Kubernetes Helm manifests directory (which configures a Prometheus Operator)
  4. go to our Grafana dashboards, copy-paste or import another dashboard/template and change some values
  5. set up alerting via AlertManager
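
As a rough illustration of steps 1 and 2, the sketch below shows the kind of boilerplate every Spring Boot 2 service had to repeat: pull in Micrometer’s Prometheus registry and attach common tags by hand. The property names and tag values here are illustrative assumptions, not our exact configuration.

```kotlin
// Step 1 (dependency, not shown): io.micrometer:micrometer-registry-prometheus,
// plus management.endpoints.web.exposure.include=prometheus in application.yml,
// so Spring Boot Actuator exposes /actuator/prometheus for scraping.

import io.micrometer.core.instrument.MeterRegistry
import org.springframework.beans.factory.annotation.Value
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class MetricsConfiguration {

    // Step 2: tag every metric with common labels so dashboards and alerts
    // can be templated per app/instance (label names here are illustrative).
    @Bean
    fun commonTags(
        @Value("\${spring.application.name}") app: String,
        @Value("\${HOSTNAME:unknown}") instance: String
    ): MeterRegistryCustomizer<MeterRegistry> =
        MeterRegistryCustomizer<MeterRegistry> { registry ->
            registry.config().commonTags("app", app, "instance", instance)
        }
}
```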

The process was a great way to kickstart the Observability initiative and gave us some quick results, but it has a lot of flaws, which we will discuss in the following sections.

Platform concerns leaking to product teams

Our Platform team defined some guiding design principles, called The Platform Commandments, so that teams, services and operations can scale when it comes to production.

This process represents a clear violation of one of those commandments, which states that “no infrastructure details shall be leaked to the services”.

Product teams should not have to worry about low-level platform details; they should be free to focus on delivering value to customers through improvements or new features. Instead, most of these steps require a product engineer to have a low-level understanding of instrumentation, metrics collection and deployments, introducing a non-negligible overhead.

Spreading bad and reinvent-the-wheel practices across the engineering team

Copy-and-paste driven development can lead down dangerous paths and make migrations and deprecations really difficult and time consuming.
It also slows down development, because engineers have to start discussions whenever different configurations for the same setup are floating around.

This, again, violates two more of our commandments:

  • The Platform team shall provide sane defaults for all the tooling
  • The Platform shall make it hard to make mistakes

Decentralisation has a lot of benefits for development speed, but it can get out of control pretty quickly if some abstraction and standardisation is not introduced into the process.

If you have been on call for multiple services using totally different visualisations (e.g. rpm vs rps, seconds vs milliseconds) and processes (no SLIs), you might also understand the pain during an incident.

Maintenance becomes a nightmare and observability a burden

The process, which can be described as toil, is quite repetitive, error-prone and hard to maintain, besides not being scalable at all: with just 10 services the team was already feeling the pain, from development to being on call.

It is a burden for the product teams to implement even the most basic instrumentation.

How can we iteratively improve this process, providing immediate value to teams while moving in the right direction? How can we move towards establishing an SLO culture, and achieving better Production Excellence?

The Hold & Receive team approached the problem by embracing a culture of risk prevention, making Observability a precondition for shipping a feature to Production, and fully revisiting the process.

A maintainable approach for our observability stack

The problem can be tackled from multiple angles, but we identified several components which, combined, improve the process:

  • Service template that introduces instrumentation out of the box as well as other common patterns and libraries
  • Instrumentation as a library that configures all the necessary parameters
  • Enabling Prometheus scraping by default
  • Moving our dashboard definitions to GitHub and automating dashboard provisioning for new services (GitOps — Dashboards as code)
  • Standardise alerting for classes of services (e.g. Tier 1, Tier 2) for better maintainability

Service template

Our Engineering-Experience team, whose mission is to make engineering at TransferWise smooth and painless, developed an opinionated Spring Boot service starter with all the internal libraries and recommended scaffolding, providing out-of-the-box instrumentation for the most common use cases.
They took advantage of Gradle 5 and implemented an internal BOM (Bill of Materials), promoting the use of semantic versioning for internal libraries.
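
To give a flavour of what this looks like from a product team’s side, here is a minimal sketch of a service build file consuming such a starter and BOM through Gradle’s platform support; the coordinates and versions are made up for illustration and are not our real internal artifacts.

```kotlin
// build.gradle.kts — a minimal sketch, assuming hypothetical internal artifact names
plugins {
    id("org.springframework.boot") version "2.2.6.RELEASE"
    kotlin("jvm") version "1.3.71"
    kotlin("plugin.spring") version "1.3.71"
}

repositories {
    mavenCentral() // plus an internal repository for the hypothetical artifacts below
}

dependencies {
    // Internal BOM: a single semantic version pins all internal libraries,
    // so services never pick library versions by hand.
    implementation(platform("com.transferwise.platform:service-bom:3.2.0"))

    // The opinionated starter pulls in the recommended scaffolding,
    // including the instrumentation library described below.
    implementation("com.transferwise.platform:service-starter")
    implementation("org.springframework.boot:spring-boot-starter-web")
}
```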

This initiative not only reduced the time to production for a new service, removing a lot of friction, but also addressed points 1) and 2) in the process above, strongly contributing to standardisation.

Instrumentation as a library

We developed a thin Spring Boot library which sets sensible defaults (e.g. histogram buckets, percentiles) and configures the metrics endpoint and its security out of the box.
It also standardises metrics naming, a key aspect for templating dashboards and for easing the monitoring of multiple services.
Finally, it provides optional extra instrumentation classes on top of the Micrometer ones, covering things like thread pool executors, health status and version.
This library has been added to the common service template, so it will be part of every new service as well as the existing ones.

Extract from the instrumentation library documentation
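
We can’t reproduce that documentation here, but the sketch below shows the kind of defaults such a library can apply through Micrometer: a MeterFilter that enables histogram buckets and a few percentiles for the standard HTTP server timer. The class name and percentile choices are illustrative assumptions, not the library’s actual API.

```kotlin
import io.micrometer.core.instrument.Meter
import io.micrometer.core.instrument.config.MeterFilter
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

// A minimal sketch of what the instrumentation library's auto-configuration
// could look like; the real library configures more (endpoint, security,
// naming, extra instrumentation) than is shown here.
@Configuration
class InstrumentationDefaults {

    @Bean
    fun httpTimerDefaults(): MeterFilter = object : MeterFilter {
        override fun configure(
            id: Meter.Id,
            config: DistributionStatisticConfig
        ): DistributionStatisticConfig {
            // Only touch the standard Spring MVC/WebFlux server timer.
            if (!id.name.startsWith("http.server.requests")) return config

            return DistributionStatisticConfig.builder()
                .percentilesHistogram(true)   // publish Prometheus histogram buckets
                .percentiles(0.5, 0.95, 0.99) // precomputed percentiles as a fallback
                .build()
                .merge(config)                // keep anything already configured
        }
    }
}
```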

Enabling Prometheus scraping by default

We already had automated discovery of new targets in our clusters, but engineers were asked to enable Prometheus scraping manually via a flag when launching a new service: we decided to enable it by default, moving towards a blacklist approach to further streamline the process and address point 3).

GitOps: Dashboards as code and automatic provisioning

The underlying idea of treating configuration as code is not new: no more copy-paste, reuse of existing components, and easy composition of your own dashboards, committed and versioned in a version control system.

This approach allows us to define standard SLI/SLO visualisations for classes of services (e.g. Tier 1, Tier 2), promoting a consistent look and feel and sharing fixes and improvements across every tier.

Example of a standard SLI visualisation for an http-service

Grafana introduced provisioning in version 5, so you can, for example, automatically provision a JSON dashboard when creating a new service.
This solves point 4), providing an alternative way of creating and maintaining dashboards for product teams.

We will describe how we implemented GitOps on Grafana in a separate article, so keep an eye on this blog.

Standardise alerting for classes of services

Along the lines of treating configuration as code, the team faced the same issues while defining alerts for multiple services of the same category/tier: for example, every Tier 1 http-service with a hypothetical availability SLO of 99.95% should share the most effective alerting strategy identified internally, for instance multi-window, multi-burn-rate alerting (see alerting at scale).
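
To make the numbers concrete, here is a small, self-contained sketch (not our production alerting code) of the thresholds the commonly published multi-window, multi-burn-rate recipe yields for a hypothetical 99.95% availability SLO over a 30-day window.

```kotlin
// Burn rate = how many times faster than "exactly on budget" we are burning
// the error budget; a burn rate of 1 exhausts a 30-day budget in exactly 30 days.
data class BurnRateRule(val longWindow: String, val shortWindow: String, val burnRate: Double)

fun main() {
    val slo = 0.9995
    val errorBudget = 1 - slo // fraction of requests allowed to fail: 0.05%

    // Window/burn-rate pairs follow the widely used recipe; tune to taste.
    val rules = listOf(
        BurnRateRule("1h", "5m", 14.4), // page: budget gone in ~2 days at this rate
        BurnRateRule("6h", "30m", 6.0), // page: budget gone in ~5 days
        BurnRateRule("3d", "6h", 1.0)   // ticket: budget gone in exactly 30 days
    )

    rules.forEach { r ->
        // Alert only when BOTH windows exceed the threshold, so alerts reset quickly.
        val errorRateThreshold = r.burnRate * errorBudget
        println(
            "alert if error rate over ${r.longWindow} AND ${r.shortWindow} " +
                "exceeds ${"%.4f".format(errorRateThreshold)} (burn rate ${r.burnRate})"
        )
    }
}
```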

This generalisation will make the alerting strategy much more maintainable and easier to reason about, allowing us to provide standard, validated alerting out of the box for each service category whenever a new service is created.

We’re just at the beginning of our journey towards scalable alerting: we still experience page bombs and a low signal-to-noise ratio, with the resulting alert fatigue, but we believe we’re moving in the right direction.

P.S. Interested to join us? We’re hiring. Check out our open Engineering roles.
