How we monitor JVM services on Kubernetes at TransferWise

Massimo Pacher · Wise Engineering · Jun 11, 2020

This post was originally written for our internal engineering blog and later adapted for an external audience.

In the past months, the Observability and Central SRE teams here at TransferWise have been quite busy building the foundations for standardised observability and instrumentation across our fleet of more than 300 services.

In my previous article we started to lay out how to implement the vision we defined for our product teams, focusing on tooling and libraries as the means to achieve standardisation.

What’s happened since our initial efforts

We’ve come a long way since then, creating two new internal libraries, tw-observability-base and tw-service-comms, which are the first concrete and visible results of those efforts.

The first one is a Java library which provides configuration and standardisation for all observability aspects of a Spring Boot 2.x service, enabling features like structured logging, tracing and metrics out of the box.
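
Since Spring Boot 2.x uses Micrometer as its metrics facade, "out of the box" in practice means product code only needs to inject a MeterRegistry and record what it cares about, while the library wires up exporters, common tags, structured logging and tracing. A minimal sketch of what that looks like from a product team's perspective (the metric and tag names are illustrative, not tw-observability-base APIs):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

@Service
public class TransferMetrics {

    private final Counter transfersCreated;

    public TransferMetrics(MeterRegistry registry) {
        // The registry, exporters and common tags (service, environment, ...)
        // are assumed to be configured centrally by the observability library.
        this.transfersCreated = Counter.builder("transfers.created")   // illustrative metric name
                .description("Number of transfers created")
                .tag("source", "api")
                .register(registry);
    }

    public void onTransferCreated() {
        transfersCreated.increment();
    }
}
```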

The second, tw-service-comms, is a much more ambitious project which aims to consolidate how we perform synchronous requests (gRPC or REST) across our services, leveraging the capabilities offered by our service mesh, Envoy.
But tw-service-comms offers product teams much more than consolidation or observability, embedding, among other things, resiliency practices like retries, traffic shedding and deadlines into each of our services.
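
To make the deadline idea more concrete, here is a rough sketch of propagating a per-request deadline on an outgoing REST call with the plain JDK HTTP client; the x-request-deadline header name, the endpoint and the budget handling are illustrative assumptions, not the actual tw-service-comms API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

public class DeadlineAwareClient {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Calls another service with whatever is left of the caller's deadline,
    // so no hop keeps working past the point where the original caller has given up.
    public static HttpResponse<String> getQuote(Instant deadline) throws Exception {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            throw new IllegalStateException("Deadline already expired, not calling out");
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://pricing/quotes")) // hypothetical endpoint
                .header("x-request-deadline", deadline.toString()) // illustrative header name
                .timeout(remaining)                                // never wait past the deadline
                .GET()
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```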

We’ve also backported this instrumentation to the few legacy Grails services we’re still running, like our public-facing monolith, to achieve feature parity and guarantee consistent triaging during incidents.

These libraries are part of a much bigger and more ambitious goal: to standardise and consolidate how we monitor and measure availability in the company. We defined a vision for how to get there incrementally via multiple milestones, each with its own preconditions to satisfy before it can be fully delivered.

We’ve started this journey to scale observability and SRE practices in our organisation because we can’t afford competing definitions or diverging interpretations. By analogy with Conway’s law, any inconsistency in our processes would be passed along to our customers and partners. There is also an efficiency cost to inconsistency: having a fraction of our product engineers deal with low-level platform details and reinvent the wheel, writing the same instrumentation or dashboard with a slightly different flavour, is clearly a waste of time that could be better spent delivering value to our customers and partners through new features and improvements.

Today we’re happy to announce we’ve completed one of those milestones, introducing a major release of a new standardised dashboard for our JVM (HTTP) services.

Component view of the standard JVM HTTP service dashboard

What’s the fuss about a dashboard?!

The dashboard comes packed with almost everything needed to monitor the health of a JVM HTTP service. It adds some very powerful visualisation features, like canary support and traffic split by upstream / downstream, which we’ll discuss later in this post. Above all, we’ve tried to bake SRE best practices into its design, with a flow and layout that guide product engineers towards the right decisions when pressure is mounting, such as when triaging an incident at 3am or releasing a complex refactoring (because we all know PR complexity sometimes explodes by accident). We’ve done this because we want product engineers to look at exactly the same data Central SRE sees when investigating problems.

Last, but not least, this is one of the first generic dashboards to be scripted using our grafana-dashboards tooling, our internal dashboard-as-code project based on grafonnet, which unlocks benefits like multi-environment support and component reusability (we’ll talk about composition and reusing official components in another post). Because the dashboard is provisioned and scripted, the SRE & Observability teams will maintain and improve these officially supported dashboards, along with the underlying instrumentation. It lets us roll out new features in a centralised way instead of having to change hundreds of customised team dashboards.

Features

As part of our effort to consolidate tooling and provide a consistent experience for our developers, here are some highlights of the features this dashboard provides.

Canary deployments support

This is one of the features we believe will be most useful, because it makes canary deployments first-class citizens in our release pipeline.

When running a canary deployment through our internal deployment tool, Octopus, an annotation becomes visible in Grafana and engineers can split metrics between the primary and canary deployments by selecting their preference from a dropdown. They can compare error rates between deployments or quickly assess any performance degradation introduced by the new code (with lots of caveats around long-running processes and different traffic weights, but let’s set those aside for now).
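
This kind of split relies on every metric carrying a label identifying which deployment emitted it. The real wiring is handled by our internal tooling, but as a rough illustration (the tag name and environment variable below are assumptions), a Spring Boot service could stamp such a tag on all of its meters like this:

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DeploymentTagConfig {

    // Tags every meter with the deployment flavour (e.g. "primary" or "canary") so
    // dashboard panels can split or filter on it. DEPLOYMENT_TYPE is a hypothetical
    // variable that would be set by the release pipeline.
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> deploymentTag() {
        String deployment = System.getenv().getOrDefault("DEPLOYMENT_TYPE", "primary");
        return registry -> registry.config().commonTags("deployment", deployment);
    }
}
```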

Canary release with panels showing error rates

Upstream / downstream traffic split

Some drive-by terminology:
Downstream: A downstream host connects to Envoy (which powers our service mesh layer), sends requests, and receives responses.
Upstream: An upstream host receives connections and requests from Envoy and returns responses.

One of the features most requested by our engineers was the ability to track consumers of legacy and deprecated APIs (as maintaining multiple versions isn’t exactly fun…).

The tw-service-comms library makes this information visible, thanks to context propagation via our service mesh. Once the library is added to your service, you can filter its ingress and egress traffic by Envoy upstream or downstream.
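
Envoy can forward the caller’s identity on a request header (x-envoy-downstream-service-cluster, when the downstream’s service cluster is configured). Exactly how tw-service-comms propagates and exposes this is internal to the library, but a simplified sketch of surfacing the caller, here just into structured log context, could look like this:

```java
import java.io.IOException;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CallerContextFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // Header set by Envoy when configured; "unknown" keeps the field present
        // in the structured logs so it can still be filtered on.
        String caller = request.getHeader("x-envoy-downstream-service-cluster");
        MDC.put("downstream_service", caller != null ? caller : "unknown");
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("downstream_service");
        }
    }
}
```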

Overview of upstream traffic split — balance → pricing — both at service and mesh level, letting engineers spot underlying Envoy retries (on idempotent calls) and compare response codes.
Example of downstream traffic split — rest-gateway → balance — which shows endpoints statistics coming from rest-gateway traffic.

Kubernetes insights and deep linking to specific dashboards

The dashboard also gives a quick overview of Kubernetes usage in the form of golden signals (e.g. how close a service pod is to being OOM-killed), with panel links to specific platform dashboards so engineers can dive deeper when needed.

Service overload and deadline visibility

Our tw-service-comms library defines overload protection and traffic shedding policies based on request criticality (e.g. CRITICAL vs SHEDDABLE). This lets our engineers monitor their service’s saturation level and see which requests will be dropped to ease recovery and preserve availability. The dashboard also provides visibility into deadline expiration, making it possible to spot bottlenecks in the call chain.
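
The real policies live inside tw-service-comms, but the core idea of shedding by criticality is simple enough to sketch. The header name, saturation signal and plain 503 below are illustrative assumptions rather than the library’s actual behaviour:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.filter.OncePerRequestFilter;

public class LoadSheddingFilter extends OncePerRequestFilter {

    private static final int MAX_IN_FLIGHT = 200;           // illustrative saturation threshold
    private final AtomicInteger inFlight = new AtomicInteger();

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        String criticality = request.getHeader("x-request-criticality"); // illustrative header
        boolean sheddable = "SHEDDABLE".equalsIgnoreCase(criticality);

        if (sheddable && inFlight.get() >= MAX_IN_FLIGHT) {
            // Drop low-priority work early so capacity is kept for critical requests.
            response.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        inFlight.incrementAndGet();
        try {
            chain.doFilter(request, response);
        } finally {
            inFlight.decrementAndGet();
        }
    }
}
```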

Service dropping traffic and discarding useless work during overload

What’s next

We’re opening up this work internally to get feedback from daily usage and improve it while delivering value to our product engineers.
We hope this will spare teams from dealing with low-level platform details, like implementing custom error monitoring, and let them focus on their own business metrics and dashboards as a first step towards being ready for Service Level Objectives.

We’re also adding gRPC support for the services powering our green debit card, and Node.js support for our (server-side rendered) frontend.

We’ll continue on the journey to consolidate our engineering experience and improve, at scale, the reliability of our products.

P.S. Interested in joining us? We’re hiring. Check out our open Engineering roles.

@massimo_pacher on Twitter. Principal Engineer @ Wise.