The importance of observability

Steve Vaughan · MPB Tech · Oct 4, 2022

What is observability? And why does it matter?

Consider the following question: “Something’s not right on the platform. Our customers can’t log in. What’s going on?!”

Naturally, we want to be able to answer quickly and accurately. And, ultimately, fix whatever is wrong—as soon as we can. What we don’t want is to be left scratching our heads, providing multiple “we’re looking into it” updates.

Enter observability, which the provider Elastic defines as deriving usable insights from complex distributed systems.

To put that another way, it’s about making full use of the tools at our disposal to help us find out what’s really happening:

  • Our services: logs, traces, resource usage and more can help us determine the system’s health
  • End-users: we can look at session activity and on-page behaviour to understand patterns and issues.
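To make the service-side signals above concrete: structured (JSON) log lines are far easier for a log platform to index and search than free-form text. Here's a minimal sketch using Python's standard logging module; the field names (`service`, `user_id`, `duration_ms`) are illustrative choices, not a prescribed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include any extra structured fields attached to the record.
        for key in ("service", "user_id", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A searchable, structured event rather than a free-form string.
logger.info("login failed",
            extra={"service": "auth", "user_id": "u123", "duration_ms": 84})
```

Because every event carries the same machine-readable fields, questions like "how many users saw login failures in the last hour?" become a query instead of a grep.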

Observability matters beyond simply answering the immediate question. It broadens access to the answer, empowering teams and organisations to move away from knowledge silos and reliance on subject-matter experts.

At its best, observability might even avoid the need to ask the question in the first place.

The impact of observability

To understand the wider impact observability can have, we need to frame it with something tangible. Let’s use bugs in software as our example.

Furthermore, we’re going to split those software bugs into two categories: good bugs and bad bugs.

I know you might be thinking, “there’s no such thing as a good bug”, but bear with me …

Good bugs

These are the ones any engineer dreams of working on. They’re the bugs that fit the following four criteria:

  1. Easy to replicate. These bugs happen every time, with no complicated steps to reproduce.
  2. Easy to gauge impact. They affect a known number of users or workflows, workarounds are understood, etc.
  3. Easy to triage. Due to the two previous points, it’s easy to prioritise these types of bugs.
  4. Easy to validate as fixed. Because the issue was easy to replicate, the solution is easy to validate too.

Bad bugs

Let’s face it, we’d all be over the moon if the majority of software bugs were the good ones. But, more often than not, it’s the bad ones that make for a bad day at the WFH/office. A bad bug fits the following criteria:

  1. Hard to replicate. These bugs are intermittent or affect only a niche subset of users, and may never be reproduced at all.
  2. Hard to gauge impact. They affect an unknown number of users or workflows — the issue might sound bad, but it’s hard to know for sure.
  3. Hard to triage. Due to the two previous points, it’s much harder to prioritise these types of bugs.
  4. Hard to validate as fixed. Because the issue was hard to replicate, it’s also hard to validate any solution.

The lifecycle of a bug

With our two polar-opposite bug types laid out, let’s consider the lifecycle of a bug in software so we can see the potential impact of observability.

Good observability allows us to identify the impact of a reported issue much sooner. That directly informs prioritisation and can improve our ability to recreate the issue, often providing much-needed context and information about what’s going wrong.

When observing the bug, the criteria used can then be reused to validate the success of a fix. This can be particularly useful for the type of issue that is difficult to reproduce, and where fixes can sometimes be speculative.

Perhaps most importantly, though, the same approach to discovering the impact of a bug and validating the fix can also be used to monitor the issue and ensure that the bug doesn’t start happening again. Sometimes this is already covered by unit, integration or end-to-end testing, but being able to observe and monitor a production system is invaluable regardless.

All this is not to say there won’t still be times when good observability falls short of providing answers. For entirely new and exceptionally novel errors, the chances are that very little will help us achieve a quick diagnosis.

Observability toolbox

At MPB, we use an evolving suite of tools to observe our platform. It’s critical to continuously adapt to close gaps in observability as our platforms and software evolve.

Elastic

We make extensive use of the ELK Stack here at MPB, relying heavily on its Filebeat and APM capabilities, and we’re continuously growing our use of it.

Elastic offers us a couple of notable areas for observability. First, we use the Discover area of Elastic to monitor log statements from our various services that run within Kubernetes, allowing us to understand the prevalence of an issue. Second, there’s APM, which has allowed us to see exactly how our platform and services are talking to each other, providing high levels of detail about what goes on under the hood.

A notable benefit of APM is that it lets us capture errors that happen in a browser environment, something we’d normally be entirely blind to. This is critical when it comes to observing the health of a browser-based platform.

Prometheus

When it comes to monitoring known interactions between our platform and services, Prometheus gives a fantastic insight into how the Kubernetes infrastructure is performing. We also use the Prometheus Push Gateway to leverage a more event-driven approach for monitoring the health of certain hard-to-observe areas — for example, the success rate of our platform’s communications with a payment gateway.
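To sketch what an event-driven push looks like: the Push Gateway accepts metrics in the Prometheus text exposition format via an HTTP PUT to `/metrics/job/<job>`. In practice you'd normally use an official Prometheus client library, but this stdlib-only Python sketch shows the shape of the interaction; the metric name and gateway address are illustrative assumptions, not our actual configuration.

```python
from urllib import request

PUSHGATEWAY = "http://localhost:9091"  # assumed local Push Gateway address

def exposition(metric, value, help_text):
    """Render one counter in the Prometheus text exposition format."""
    return (
        f"# HELP {metric} {help_text}\n"
        f"# TYPE {metric} counter\n"
        f"{metric} {value}\n"
    )

def push(job, body):
    """PUT the metrics body to the Push Gateway for the given job."""
    req = request.Request(
        f"{PUSHGATEWAY}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",
    )
    request.urlopen(req)

body = exposition(
    "payment_gateway_success_total", 42,
    "Successful calls to the payment gateway.",
)
# push("payments", body)  # uncomment when a Push Gateway is reachable
```

Pushing after each batch of work suits short-lived or hard-to-scrape processes, which is exactly why the Push Gateway fits the payment-gateway example above.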

We use Prometheus as an alert system when errors start to spike and it’s a key part of our production support offering.

Cloudflare

Sometimes an issue originates outside our infrastructure. Cloudflare gives us the ability to understand this traffic, allowing us to see trends around automated traffic vs real users, spikes in activity at different times of day, potential attacks and more.

There are many other tools we use for observability, but the above three are our ‘daily drivers’.

Building for observability

While the available tools are useful, they don’t necessarily provide their full benefits ‘out-of-the-box’.

Often, we need to change the way we think about the software we’re building in order to allow it to be easily observed.

This can take many forms: from the way the software is architected and the technology that’s used, to the way in which a particular feature or capability might need to be observed.

When thinking about how to build for good observability, it can help to ask the following questions:

  • What information will we need in order to know whether the thing we’re building is working correctly? Is there a way we can assert good performance—for example, a success-vs-failure ratio? Is there a way to understand whether the behaviour of something changes once it’s out in the wild?
  • How can the information we provide help, and what might be noise? It can help to add logging, but too much can add noise. Will exceptions be easy to understand? What happens if there are asynchronous processes? Can we still be clear on what is going on, or will we lose visibility?
  • When should we notify? It’s tempting to notify when something goes wrong, but it’s equally useful to notify when it’s successful. This allows the ratio of successes to be understood, which in turn allows those thresholds to scale with the platform. This is great when considering things like Service Level Objectives.
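The success-ratio idea in the questions above can be made concrete with a small sketch. Counting successes as well as failures means the alert threshold is a ratio rather than an absolute number, so it scales with traffic. The names and the 99% objective here are illustrative, not a real MPB SLO.

```python
from dataclasses import dataclass

@dataclass
class Outcomes:
    """Counts successes and failures so the alert ratio scales with traffic."""
    success: int = 0
    failure: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.success += 1
        else:
            self.failure += 1

    def success_ratio(self) -> float:
        total = self.success + self.failure
        return 1.0 if total == 0 else self.success / total

SLO_TARGET = 0.99  # illustrative objective: 99% of checkouts succeed

checkouts = Outcomes()
for ok in [True] * 97 + [False] * 3:
    checkouts.record(ok)

breached = checkouts.success_ratio() < SLO_TARGET
```

Three failures in a hundred checkouts breaches a 99% target, but the same three failures in ten thousand checkouts would not, which is precisely the scaling behaviour that notifying only on errors can't give you.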

Regardless of whether it’s a backend service, a user interface, a command-line tool or an automated pipeline, defining its behaviour—and how to observe it—is always important.

A key aspect of building for observability, we’ve learned from experience, is to decouple the code required to do the observing as much as possible from the rest of the application. The last thing we want is for the code responsible for observability to fail to run because of an issue in the application.
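One common way to achieve that decoupling is to wrap the observing code so its failures are logged but never propagated. A hedged Python sketch (the decorator and function names are our own, not a specific MPB implementation):

```python
import functools
import logging

logger = logging.getLogger("observability")

def observed(observer):
    """Run `observer` on each result, but never let it break the application."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            try:
                observer(result)
            except Exception:
                # Observation failures are logged, never raised.
                logger.exception("observer failed for %s", func.__name__)
            return result
        return wrapper
    return decorate

def broken_observer(result):
    # Simulates the metrics backend being down.
    raise RuntimeError("metrics backend unreachable")

@observed(broken_observer)
def place_order(total):
    return {"status": "ok", "total": total}
```

Even with the observer raising on every call, `place_order` still returns normally: the application keeps working when the observability pipeline does not.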

Final observations

We’ve looked at what observability is within software, why it matters and the impact it can have, as well as some of the tools we use here at MPB. We’ve also covered some areas to think about when building for good observability.

While we’ve focused heavily on software and bugs, it doesn’t have to just be about understanding issues. Observability can provide the basis for discovering areas of focus, opportunities to improve performance and far more.

Whether it’s new software or a change to an existing codebase, improving observability is always possible. At MPB it’s something we’re wholeheartedly embracing.

Steve Vaughan is an Engineering Manager at MPB, the world’s largest platform to buy, sell and trade used photography and videography kit. https://www.mpb.com

MPB is now hiring. Apply now at careers.mpb.com.
