The Importance of Mental Models for Incident Response

Jack Vanlightly
Published in Splunk-MaaS
Oct 18, 2021

Being a member of the Splunk Messaging-as-a-Service (MaaS) team means that you spend your time working directly on the Apache Pulsar and Apache BookKeeper projects, on CI/CD and automation tooling and, finally, on incidents. Being on-call is one of the most challenging aspects of the job, as incidents can come at any time and the contributing factors can be diverse.

With a messaging system like Apache Pulsar, one typical incident that can get an engineer paged is a message backlog that is either not getting drained or not draining fast enough. You, as the engineer, are called in to fix the problem and you’ve got to figure out what is going on. The symptom is so high level that the underlying cause, or causes, could be anywhere in the system. So where do you start?

It starts with your mental model of the system: the more complete and accurate the model, the faster you are likely to diagnose and remediate the issue. The model is both a map and an understanding of how components interact, which lets you infer how signals in one part of the system explain behaviour seen in another.

Without a mental model, metrics can be meaningless numbers and logs can lead you down lines of investigation that are totally fruitless. Your lines of investigation are limited because you don’t have a fundamental understanding of how things work beneath the external APIs.

Having a mental model of the system is a necessity for fast and effective incident response, as is having the right signals available to you — the metrics and logs.

Metrics, logs and a good mental model

Good metrics and logs work hand-in-hand with a good mental model and help you build a picture of what is going on. Unfortunately, there can be blind spots in those metrics and logs that leave parts of the mental model in the dark. Likewise, you may have blind spots in your mental model that leave you ill-equipped to understand what certain metrics mean. These blind spots are often only noticed once engineers have had time in the field, but the good thing about open source projects is that they can be fixed by the very engineers who need them.

Unfortunately, Apache Pulsar and Apache BookKeeper have a few of these blind spots, but work is in progress to fix that. The MaaS team currently have two work streams dedicated to making Apache Pulsar and Apache BookKeeper easier to troubleshoot during incidents. One is introducing structured logging to Pulsar (and later to BookKeeper), see PIP 89; the other is adding USE metrics to BookKeeper (and hopefully later to Pulsar), see BP-44.

The aim of BP-44 is to fill in some of those blind spots, but also to go beyond emitting metrics for each sub-component in isolation and tell us about the utilization and saturation of the read and write paths.

If publishers are being throttled, we know that the Pulsar cluster cannot meet write demand. If topic backlogs are growing, we know that the Pulsar cluster cannot meet read demand. But where is the problem? Wouldn’t it be wonderful if you could take that mental model you’ve built up and overlay it with colours of green, amber and red that tell you which sub-components of the whole system are under pressure or overloaded?

An example is the BookKeeper journal. Data can only flow into a BookKeeper node (bookie) as fast as the journal can write it. How do you know the current capacity level of the journal? Is it at 40% of its throughput capacity, so you can still push it further, or is it dangerously close to overload?

The USE Method

This is where the USE method comes in. It stands for Utilization, Saturation and Errors and is an effective strategy for diagnosing where the bottlenecks are in your system.
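To make the idea concrete, here is a minimal sketch in Java of how utilization, saturation and error readings for a single resource could be folded into the kind of colour overlay described above. It is entirely illustrative: the thresholds and the green/amber/red levels are assumptions of mine, not anything defined by BookKeeper, Pulsar or BP-44.

```java
// A USE-style health check for a single resource (illustrative only).
public class UseCheck {

    enum Status { GREEN, AMBER, RED }

    static Status classify(double utilization,  // 0.0 - 1.0, fraction of time busy
                           double saturation,   // 0.0 - 1.0, e.g. queue fill ratio
                           long errors) {       // error count in the sample window
        if (errors > 0 || utilization > 0.9 || saturation > 0.8) {
            return Status.RED;    // overloaded or failing
        }
        if (utilization > 0.7 || saturation > 0.5) {
            return Status.AMBER;  // under pressure, limited headroom
        }
        return Status.GREEN;      // comfortable headroom
    }

    public static void main(String[] args) {
        // A journal busy 45% of the time, with a 10% full queue and no errors: GREEN.
        System.out.println("journal: " + classify(0.45, 0.10, 0));
        // A journal busy 95% of the time with a nearly full queue: RED.
        System.out.println("journal: " + classify(0.95, 0.85, 0));
    }
}
```

The thresholds would need tuning per resource; the point is that once utilization and saturation are exposed directly, the overlay becomes a simple mapping rather than guesswork.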

Not all metrics are useful for telling you about utilization and saturation. Taking the journal as an example, we have latency histograms and rates for disk writes and fsyncs, histograms that show the size of those writes, and also the total throughput. Do all these metrics tell you if the journal is saturated? Not really. Latency can be a good indicator, but it won’t tell you if you’ve reached the point of overload. To use latency, you need to have a good idea of what normal is, and without knowing what normal is, it’s hard to use latency metrics effectively.
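To illustrate the baseline problem, here is a small sketch of the extra machinery a latency-based check needs just to decide whether a p99 value is unusual. It is illustrative only; the window size and the 3x multiplier are arbitrary assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A latency check can only say "this looks abnormal" relative to learned history.
public class LatencyBaseline {

    private final Deque<Double> recentP99s = new ArrayDeque<>();
    private static final int WINDOW = 60; // e.g. the last 60 one-minute samples

    boolean looksAbnormal(double currentP99Ms) {
        double baseline = recentP99s.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(currentP99Ms); // no history yet: treat current as normal
        recentP99s.addLast(currentP99Ms);
        if (recentP99s.size() > WINDOW) {
            recentP99s.removeFirst();
        }
        // Flag only when latency is well above its own history; the number on
        // its own says little about how close the journal is to saturation.
        return currentP99Ms > baseline * 3.0;
    }

    public static void main(String[] args) {
        LatencyBaseline check = new LatencyBaseline();
        for (int i = 0; i < 30; i++) check.looksAbnormal(4.0); // build up "normal"
        System.out.println(check.looksAbnormal(5.0));  // false: close to normal
        System.out.println(check.looksAbnormal(20.0)); // true: far above normal
    }
}
```

Even then, an unusual latency only hints at a problem; it still doesn’t tell you how much headroom is left.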

What we need are metrics that can tell us whether the journal is currently a bottleneck or not, and to what degree. If our mental model tells us that the journal writer is single-threaded and writes to disk synchronously, then we can simply measure the time the thread is busy — time utilization. If data is submitted to it via a bounded in-memory queue, then we can measure saturation by the size of that queue.
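As a rough illustration of that idea (and only an illustration: this is not BookKeeper’s actual journal code, and the queue capacity is an arbitrary assumption), a single-threaded writer fed by a bounded queue could expose those two signals like this:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one writer thread drains a bounded queue and we record how long it is busy.
// Utilization = busy time / elapsed time; saturation = how full the queue is.
public class JournalWriterSketch {

    private static final int CAPACITY = 10_000;

    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(CAPACITY);
    private final AtomicLong busyNanos = new AtomicLong();
    private final long startNanos = System.nanoTime();

    // Producers call this; a rejected offer (full queue) is itself a saturation signal.
    public boolean submit(byte[] entry) {
        return queue.offer(entry);
    }

    // The single journal writer thread runs this loop.
    public void writerLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            byte[] entry = queue.take();
            long writeStart = System.nanoTime();
            writeAndFsync(entry); // synchronous disk write, stubbed out below
            busyNanos.addAndGet(System.nanoTime() - writeStart);
        }
    }

    // Fraction of elapsed time the writer thread has spent writing (0.0 to 1.0).
    public double utilization() {
        long elapsed = System.nanoTime() - startNanos;
        return elapsed == 0 ? 0.0 : (double) busyNanos.get() / elapsed;
    }

    // Fraction of the bounded queue currently occupied (0.0 to 1.0).
    public double saturation() {
        return (double) queue.size() / CAPACITY;
    }

    private void writeAndFsync(byte[] entry) {
        // Placeholder for the real synchronous write and fsync.
    }
}
```

In a real bookie these numbers would be exported as gauges and sampled per time window, but the principle stands: utilization and saturation are directly interpretable without first knowing what “normal” latency looks like.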

Rather than trying to interpret what a latency histogram means in terms of journal capacity, we should be able to see metrics that directly tell us whether the journal has reached capacity and, if not, how much more load it might accept.

New Blog Series!

With all the above in mind, we are publishing a series of blog posts for BookKeeper operators and developers that aims to build that mental model and teach how to use the existing metrics to find bottlenecks in both the read and write paths.

We will also be contributing improvements to BookKeeper to fill in the blind spots and better expose utilization and saturation metrics, and we will write about how operators can use those new metrics to more quickly diagnose BookKeeper performance issues.

Apache BookKeeper internals series for building the mental model:

Apache BookKeeper metrics guide coming soon.
