
Observability At TransferWise — The Problem with Not Being Observable

James Bach · Published in Wise Engineering · Oct 22, 2019 · 5 min read


What does it mean to be Observable?

The goal of observability is both technical and social. The technical goal is to expose and measure the inner workings of services: their internal state. By doing so we achieve the social goal: the ability to ask better questions of the platform and its parts.

To be observable, then, is to gain insight into why services behave the way they do, and to make better decisions that continuously improve the TransferWise platform, the customer experience, and our development processes.

This blog post is one in a series documenting the journey we at TransferWise are taking to become Observable. It aims to be a guide and an aspiration: where we want to go, how we intend to get there, and the troubles we face along the way.

Problem Statement

Before jumping into the solutions, we need to provide context for why we need to be observable. These are observations of problems we are currently facing at TransferWise, and some comments on how we might action them.

Sometimes we don’t know when we’re down until someone tells us

When customers use our platform they have an intuition about the level of service to expect from us. When a customer requests a quote, they expect it back quickly. When they add money to their account, they expect to get it without issue. When they pay for things with their card, they expect it to just work. And more often than not we are able to provide the service they expect.

But what happens when we can’t? Below is an example of such an event that happened recently:

On 2019/10/09 at 14:04, an outage occurred between two of our services, which resulted in no conversions for any of our customers until the problem was fixed at 14:54. An incident follow-up revealed the following:

- Alerting had been turned off.

- The only reason we even knew there was a problem was that some customers had complained they weren’t seeing converted balances in their accounts.

Our Quote/Auth/Transfer/Funding (QATF) payment end-to-end test showing the outage

Thankfully, the observability and alerting gap was fixed immediately after this incident. But it highlights the fact that as our platform becomes more complex, it is getting harder to identify where we are not observable until it is too late. We shouldn’t need to rely on our customers to tell us when we have problems within our platform.

It is necessary, then, to instrument (that is, to measure) the internal state of your service, and to use that data in meaningful ways, such as being notified of problems and debugging them.
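To make this concrete, here is a minimal sketch of what that instrumentation could look like. It assumes a Python service exposing Prometheus metrics via the prometheus_client library; the metric names and the do_conversion function are hypothetical, not our actual code.

```python
# Minimal sketch: expose conversion outcomes and latency as metrics so that
# alerting, rather than customer complaints, tells us when conversions stop.
import time

from prometheus_client import Counter, Histogram, start_http_server

CONVERSIONS = Counter(
    "balance_conversions_total",
    "Balance conversion attempts by outcome",
    ["outcome"],  # "success" or "failure"
)
CONVERSION_LATENCY = Histogram(
    "balance_conversion_seconds",
    "Time taken to perform a balance conversion",
)

def do_conversion(request):
    # Placeholder for the real business logic (hypothetical).
    return {"converted": True, "request": request}

def convert_balance(request):
    start = time.monotonic()
    try:
        result = do_conversion(request)
        CONVERSIONS.labels(outcome="success").inc()
        return result
    except Exception:
        CONVERSIONS.labels(outcome="failure").inc()
        raise
    finally:
        CONVERSION_LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9100)  # metrics become scrapeable at :9100/metrics
```

With a counter like this in place, an alert on something like the PromQL expression rate(balance_conversions_total{outcome="success"}[15m]) == 0 could have paged someone on 2019/10/09 well before a customer noticed, assuming the metric is scraped by a system such as Prometheus.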

Being opaque leads to wasted effort

How can we be sure that what we’re working on right now is what really needs to be worked on? And how do we know that what we’re focusing on will deliver the results we need?

As a practical example: a core database of ours was overloaded, and adding resources only bought us time. So we needed to tackle the performance issues by looking at the services making requests to it. But how do you do this when you can’t answer the question, ‘Which services are abusing the database and causing the load?’

You could ask the database ‘tell me about the slowest queries you have running,’ but that wouldn’t tell you where the slow query came from, or the context for why it happened. And since a lot of services run with an ORM, figuring out what the query was doing without substantial reverse-engineering is a waste of your time — unless you like this sort of thing.

But we do know which services use this particular database. So why not instrument them and expose that database information? And so we created a drop-in instrumentation library to do exactly that.
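To give a flavour of the idea (a sketch under the same Python/prometheus_client assumption as above, not our actual library), the core of such a drop-in is a decorator that wraps data-access methods and records which method made each call and how long it took:

```python
# Sketch of a "drop-in" database-call instrumentation helper: every wrapped
# data-access method records its duration, labelled with the service and
# method that issued the call. All names here are illustrative.
import functools
import time

from prometheus_client import Histogram

DB_CALL_SECONDS = Histogram(
    "db_call_seconds",
    "Duration of database calls, labelled by calling service and method",
    ["service", "method"],
)

def instrument_db_call(service_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                DB_CALL_SECONDS.labels(
                    service=service_name,
                    method=func.__qualname__,
                ).observe(time.monotonic() - start)
        return wrapper
    return decorator

# Hypothetical usage inside a service's data-access layer:
class BalanceRepository:
    @instrument_db_call("balance-service")
    def find_pending_conversions(self, customer_id):
        ...  # the actual ORM/SQL call would go here
```

Once every service using the database emits a metric like this, aggregating it by method over a day is what produces a ‘top calls’ view like the one below.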

Top 10 calls to the database over a 24 hour period

Since the library instruments database calls, it exposed exactly which methods inside the services were causing the major load. With this level of detail available, you can start to have conversations about where to focus specific development effort, and identify what needs to be optimised in the future.

Developer finding bottlenecks after database-calls instrumentation

It is important, then, not only to instrument your service so you can see what it is doing, but also to expose its business logic. This gives you better insight into where you should focus your efforts, and lets you react better when your service isn’t responding correctly.
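‘Exposing business logic’ can be as simple as publishing how much work is sitting in each stage of a flow, so that a stalled step shows up as a growing backlog rather than as an absence of errors. A minimal sketch, again with hypothetical names and the same prometheus_client assumption:

```python
# Sketch of exposing business-level state, not just technical health:
# a gauge of transfers waiting in each stage of the flow makes it obvious
# where work is piling up. Stage names and the counting callable are
# hypothetical.
from prometheus_client import Gauge

TRANSFERS_IN_STAGE = Gauge(
    "transfers_in_stage",
    "Number of transfers currently waiting in each stage of the flow",
    ["stage"],
)

STAGES = ("quote", "auth", "transfer", "funding")

def refresh_business_metrics(count_transfers_in_stage):
    # `count_transfers_in_stage` is a hypothetical callable, for example a
    # repository method that counts rows per stage in the service's database.
    for stage in STAGES:
        TRANSFERS_IN_STAGE.labels(stage=stage).set(count_transfers_in_stage(stage))
```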

Being observable is more than the tools

As a technology company it is very easy for us to emphasise tools as the solution to particular problems, especially during outages or bumpy releases. Calls for more tooling tend to follow traumatic or avoidable outages, where the line of thinking ‘if only we had more tools, we could have caught this outage!’ prevails, but is rarely acted upon.

This type of problem-solving is a symptom of looking too deep into a puzzle. We’re engineers. We get paid for this mindset. But we need to take a step back in these situations and ask ourselves ‘what actually happened to cause this incident?’ A better monitoring tool won’t expose the internal state of your service. A better alerting tool won’t make your alerts better. A new time series database won’t make you any more or less observable than you already are right now.

This ‘old way’ of thinking, with its emphasis on tooling, suited the problems we had at the time. But the TransferWise platform has matured to meet the needs of our customers, and our engineering culture needs to adjust and adapt its processes to meet the new challenges this complexity gives us. We need to recognise that every bumpy release made without visibility into why or how it went wrong could hurt our customers.

This isn’t to say that we should move to an ITIL-like change management process to reduce the risks associated with software engineering. It’s instead about team leads making Observability a priority alongside their feature development. It is about engineers calling out their colleagues when new functionality is implemented, or something old is fixed, without observability in mind. It’s ultimately about creating a self-reinforcing feedback loop that makes new features observable from the get-go, rather than something bolted on at the tail end of development.

Conclusions

So now we know why we need to be observable, and we have a sense of how we intend to get there. In our next post, Becoming Observable, we’ll talk more about some of the steps that TransferWise engineering is taking to solve the problems outlined here.

P.S. Interested in joining us? We’re hiring. Check out our open Engineering roles.
