Troubleshooting: First, Do No Harm

Paul Smith
Published in EverestEngineering
4 min read, Jul 4, 2023


Things are not looking good. The Help Desk team is flooded with calls demanding to know when the problem will be fixed. Your boss is breathing down your neck, and your engineering team is looking to you for an answer. What do you do? You have no idea. But the urge to do something, anything, is overwhelming.


In my career, I have had the "luxury" of participating in many production incidents. Far too often, joining the incident calls or Slack channels, I get a sense I've entered a chicken coop invaded by a fox. There's a real sense of panic going on.

I've seen bright, talented people rushing, ready to try almost anything to fix the problem, usually with NO idea what they think will happen when they try it.

The ̶o̶l̶d̶ wise engineer knows from experience that this is a dangerous time, when bad decisions can have far-reaching consequences and leave you with lasting regret. This is where I try to channel Gene Kranz of Apollo 13 fame and pull out his familiar quote for the incident team:

Let's not make the problem worse by guessing.

I have seen people make rash decisions without using a calm, methodical approach. Panic-induced, knee-jerk reactions can result in painful consequences. One wrong move can be devastating. At best, you’re wasting your time.

We are engineers. We do not gamble and roll the dice.

First, Do No Harm. It is excellent advice, with a proven history in the medical profession to back it up. Each action taken to correct or mitigate a problem needs a rational basis and, more importantly, a good idea of what you expect will happen. Keep the potential consequences of an action front of mind, and balance every action against the risk of making things worse.

If you're not confident you understand the problem or what your change will do, my advice is not to do anything.

So, what SHOULD you do?

Rather than knee-jerk reactions, you need to get organized. Use another technique from the medical profession: Differential Diagnosis. Dan Slimmon has many great posts and talks on this topic.

Here's an outline of Dan's process I've used many times. It's not my idea, but it's too good not to share. It'll help you make informed decisions.

  • If you can, create a Trello-like board. It's an excellent tool for ad-hoc incident tracking.
  • Create four columns: Symptoms, Hypothesis, Tests, Done
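If no board tool is at hand, the same structure fits in a few lines of code. The sketch below is only an illustration of the four-column layout, not part of Dan's process; the names `Card`, `IncidentBoard`, `add`, and `finish` are made up for this example.

```python
from dataclasses import dataclass, field

# The four columns named above, in board order.
COLUMNS = ("Symptoms", "Hypothesis", "Tests", "Done")


@dataclass
class Card:
    """One card on the board (hypothetical structure)."""
    title: str
    notes: list[str] = field(default_factory=list)


@dataclass
class IncidentBoard:
    """A minimal in-memory stand-in for a Trello-like board."""
    columns: dict[str, list[Card]] = field(
        default_factory=lambda: {name: [] for name in COLUMNS}
    )

    def add(self, column: str, title: str) -> Card:
        """Create a card in the given column and return it."""
        card = Card(title)
        self.columns[column].append(card)
        return card

    def finish(self, column: str, card: Card, result: str) -> None:
        """Record what happened on the card, then move it to Done."""
        self.columns[column].remove(card)
        card.notes.append(result)
        self.columns["Done"].append(card)


board = IncidentBoard()
board.add("Symptoms", "Many customers report errors; it works sometimes")
board.add("Symptoms", "Unusually low User CPU level on the load balancer")
hypothesis = board.add("Hypothesis", "Apache httpd is in a restart loop")

# Print a quick summary for anyone joining the incident.
for name, cards in board.columns.items():
    print(f"{name}: {[card.title for card in cards]}")
```

Calling `board.finish("Hypothesis", hypothesis, "<what you observed>")` would move the card to Done with its result attached, which is the record a post-incident review needs.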

Symptoms

List out the known symptoms or signals you believe are relevant. This area is fantastic for new people joining the incident. Think of it like a "Medical History." There's no need to be distracted by bringing new people up to speed. Just point them here to read up on the current state. You and others can continue to focus on the problem.

Include external-facing symptoms as well as anything internal you've discovered. You can consider this analogous to noting patient-reported feelings alongside concrete physiological observations.

e.g.

  • Many customers report errors, but not consistently; it works sometimes.
  • Unusually low User CPU level on the load balancer

Hypothesis

List out a series of hypotheses on what the problem might be. Nothing is too crazy to list here. Be thoughtful, but get creative. Look for ideas on why a hypothesis might produce the symptoms. This is Idea Generation time.

e.g.

  • The Apache httpd is in a restart loop. It looks up, but it's never up long enough to be useful.

You can prioritize these based on your own experiences, judgment, or feeling in your joints.

Tests

Based on the set of hypotheses generated, look for tests to help you rule out one or more hypotheses. Tests are concrete actions that provide evidence to rule something out. It's essential to structure the test to help you rule things out, not in.

e.g.

  • Use Sysdig to trace processes that are exiting. If httpd isn't in the list, it's not the restart loop.

When only one hypothesis is left that hasn't been ruled out, you've narrowed in on the problem. When you know the problem, or are very confident you do, your decisions are informed and not guesses.

If you have no more tests and you've ruled out all your hypotheses, go back to Idea Generation time (perhaps also the coffee machine).
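The elimination logic described above can be sketched in code. This is a minimal illustration, not a tool from Dan's process: `DiagnosticTest`, `remaining`, and `next_step` are hypothetical names, the DNS hypothesis is invented to give the example a second candidate, and the test outcome is set by hand purely to show the flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiagnosticTest:
    """A test structured to rule one hypothesis out (hypothetical structure)."""
    description: str
    rules_out: str                    # the hypothesis this test can eliminate
    ruled_out: Optional[bool] = None  # None means the test has not been run


def remaining(hypotheses: list[str], tests: list[DiagnosticTest]) -> list[str]:
    """Return the hypotheses that no completed test has ruled out."""
    eliminated = {t.rules_out for t in tests if t.ruled_out is True}
    return [h for h in hypotheses if h not in eliminated]


def next_step(hypotheses: list[str], tests: list[DiagnosticTest]) -> str:
    """Suggest what to do based on how many hypotheses are still open."""
    left = remaining(hypotheses, tests)
    if not left:
        return "Back to idea generation"
    if len(left) == 1:
        return f"Leading hypothesis: {left[0]}"
    return f"{len(left)} hypotheses still open; keep testing"


hypotheses = [
    "Apache httpd is in a restart loop",
    "Upstream DNS lookups are timing out",  # invented for illustration
]
tests = [
    DiagnosticTest(
        description="Trace exiting processes with Sysdig; no httpd means no restart loop",
        rules_out="Apache httpd is in a restart loop",
    ),
]

print(next_step(hypotheses, tests))  # test not run yet, nothing eliminated

tests[0].ruled_out = True            # assumed outcome, for illustration only
print(next_step(hypotheses, tests))
```

Note that a test only ever removes a hypothesis from the list. Nothing in the sketch marks a hypothesis as confirmed, which mirrors the advice to rule things out, not in.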

Done

This column holds all the cards/tasks/hypotheses you've tried and completed. Record results of what happened and link tests that ruled out a hypothesis. This log area also helps the new people who've joined by outlining what you've already tried and, again, preventing you from being distracted by those joining the incident.

By taking a calm, methodical approach and “working the problem,” you're more likely to reach a positive outcome and gain a deeper understanding of what happened. The Trello board is a great asset for any post-incident review.

But first, do no harm. Don’t guess. Don't do something for the sake of doing something. Be decisive in your choice because you have the information to back it up.

Paul Smith
EverestEngineering

Software Architect, Technical Leader, Troubleshooter and Story Teller