In this post, I’ll take a deep dive into how Wise (formerly TransferWise) uses dashboards as code to monitor more than 500 microservices running across multiple Kubernetes clusters and environments.
Wise’s platform is evolving rapidly to support our customer growth and expansion into new markets. Scaling up to more than 500 microservices on Kubernetes presented us with some serious technical and organisational challenges.
This post was originally written for our internal engineering blog and later adapted for an external audience.
Over the past few months, the Observability and Central SRE teams here at TransferWise have been busy building the foundations for standardised observability and instrumentation across our fleet of more than 300 services.
In my previous article, we started to lay out how to implement the vision we defined for our product teams, focusing on tooling and libraries as a means to achieve standardisation.
We’ve come a long way since then, creating two new internal libraries, tw-observability-base and tw-service-comms, which are the first concrete…
Thanks to James Bach, Tony Qin & Kostas Stamatoukos for their feedback and support in writing this article.
The TransferWise platform has dramatically changed in the last few years. We have moved from our Grails monolith to a microservices architecture, which lets us move faster, unlock new product opportunities, and grow our engineering teams.
As we extracted more and more domains into separate microservices, the system became harder to reason about, which led us to chase multiple red herrings during incidents.
In our previous article about Observability, we defined the term as:
…to get insights into the why services behave the way they are…
@massimo_pacher on Twitter. SRE @ TransferWise.