You’ve heard of the 5 whys, but what about the “But 4s”? — Use this to think like a staff software engineer.

Willy Xiao
5 min read · Jul 31, 2023

Most software engineers who have gone through a post-mortem process have heard of the famous “5 whys”.

The principle goes: If something went wrong, ask yourself “why” 5 times to understand the root cause of a problem.

One common problem I’ve seen when engineers apply the 5 whys in Root Cause Analyses (RCAs) is that they use it only as a way to “drill down” on one specific reason.

For instance, let’s ask ourselves “why” 5 times:

  • Production Outage: the user couldn’t schedule themselves to come into the office on our mobile app.
  • Why? Because the API server was down.
  • Why? Because the last deploy to production had code that threw an exception at runtime.
  • Why? Because there was a bug in line 37 of file XYZ.
  • Why? Because there wasn’t a test.
  • Why? Because there’s no culture of writing tests on our team.

The problem with this is that it ignores that there are *multiple* causes for any given problem.

In the example explained above, the API server was down because the last deploy to production had code that threw an exception. While that is a valid cause, it is not itself sufficient to cause the outage.

But for any one of a number of other issues, the server also would not have gone down. These include:

  • Our servers don’t attempt a restart when they go down.
  • We don’t have backup servers when one goes down.
  • We don’t have global exception handlers and “error boundaries” in our API server.
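To make the last of those “but for” causes concrete, here is a minimal sketch of an error boundary around a request handler. It is not tied to any particular web framework: `with_error_boundary`, `schedule_office_visit`, and the dict-shaped request are all hypothetical names invented for illustration, and a real API server would use whatever hook its framework provides.

```python
import logging
from typing import Callable

logger = logging.getLogger("api")

# Hypothetical handler shape: takes a request dict, returns (status, body).
Handler = Callable[[dict], tuple[int, str]]


def with_error_boundary(handler: Handler) -> Handler:
    """Wrap a handler so an uncaught exception becomes a 500 response
    instead of propagating up and taking the server process down."""

    def wrapped(request: dict) -> tuple[int, str]:
        try:
            return handler(request)
        except Exception:
            # Log the traceback so the bug is still visible in the RCA.
            logger.exception("unhandled error in %s", handler.__name__)
            return 500, "internal error"

    return wrapped


@with_error_boundary
def schedule_office_visit(request: dict) -> tuple[int, str]:
    # Illustrative bug: raises KeyError when "date" is missing.
    return 200, f"scheduled for {request['date']}"
```

With the boundary in place, the buggy code path still fails for that one request, but the failure is contained: the same deploy would not, on its own, have been sufficient to cause the outage.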

Of course, more comprehensive explanations of the 5 whys acknowledge the need to have “multiple lanes,” but in practice, when post-mortems ask the driver to “do the 5 whys”, the result is often a lackluster, non-exhaustive root cause analysis.

My Recommendation

So, to make this clear, I’ve started telling engineers to use the “But 4, 5 whys” when doing post-mortems.

At each layer of “why” that you’re asking, you should try to generate a set of reasons that follow the rule: “But for X cause, we would not have experienced Y problem.”

Philosophically, there’s not one cause for any effect in the world; there’s a long list of conditions that make an outcome possible.

You might’ve slipped on the banana peel because the banana peel was there, but it was also because you weren’t looking where you were stepping. And you weren’t looking where you were stepping because you were on TikTok. But for any of these (and many more!) reasons, you wouldn’t have slipped.

So, instead of your root causes looking like a single list, they should look like a tree in both breadth (but 4s) and depth (5 whys).

Example of “But 4, 5 Whys”
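One way to picture the tree is as a small data structure, sketched here with the outage example from earlier. The `Cause` class and its method names are assumptions made for illustration, not part of any RCA tool. Each node’s children are its “but for” causes (breadth), and following children downward is asking “why” again (depth).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One node in the RCA tree: a statement plus its 'but for' causes."""

    statement: str
    # Breadth: but for any one of these, `statement` would not have happened.
    but_for: list[Cause] = field(default_factory=list)

    def depth(self) -> int:
        """Longest chain of 'whys' starting at this node (counting itself)."""
        return 1 + max((c.depth() for c in self.but_for), default=0)

    def leaves(self) -> list[str]:
        """Deepest cause in each lane; these are candidate action items."""
        if not self.but_for:
            return [self.statement]
        return [leaf for c in self.but_for for leaf in c.leaves()]

    def render(self, indent: int = 0) -> str:
        """Indented outline of the tree, one cause per line."""
        lines = ["  " * indent + "- " + self.statement]
        lines.extend(c.render(indent + 1) for c in self.but_for)
        return "\n".join(lines)


outage = Cause(
    "User couldn't schedule an office visit on the mobile app",
    [
        Cause("API server was down", [
            Cause("Last deploy had code that threw an exception at runtime", [
                Cause("Bug in line 37 of file XYZ", [
                    Cause("There wasn't a test", [
                        Cause("No culture of writing tests on the team"),
                    ]),
                ]),
            ]),
            Cause("Servers don't attempt a restart when they go down"),
            Cause("No backup servers when one goes down"),
            Cause("No global exception handlers or error boundaries"),
        ]),
    ],
)

print(outage.render())
```

Here only the first lane has been drilled all the way down. Under “API server was down” there are four sibling causes, so `outage.leaves()` returns four entries rather than the single root cause the plain 5 whys produced. In a full RCA, each of the other three lanes would get its own “whys” as well.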

What happens when you do Root Cause Analysis well?

  • You start thinking like a senior / staff / higher-level engineer. Getting really tight about RCAs allows you to become a much deeper thinker about the problems surrounding your team, the organization, and the company. You will start to notice patterns about technical choices, processes, and architecture, which become your “hit list” of improvements you can make at your company. You will also develop a perspective on problems that are not immediately in your control; this is natural to reflect on as you think at a higher level.
  • You learn to do this intuitively, really fast, and all the time for anything that doesn’t match expectations. I’ve noticed that engineers who go through this become much more natural at noticing these things, not just in production outages but also in planning retros, team process issues, and even support tickets for infra/platform teams. You are generally learning to learn. Running solid retro processes is critical to learning, especially if you want to be at the bleeding edge of any field or are creating something new: no one has written the playbook for you, so you have to reflect on your experiences.
  • Use this to develop technical opinions and to align others towards your views. If your technical convictions and opinions are formed through multiple real-life examples that are relevant to your team and company, they are going to be more grounded. It will also allow you to more easily align others towards your opinions by giving you a data-driven approach to pointing out issues. Creating a culture of strong RCAs gives you a platform for discussing those decisions.

And here are a few helpful notes to remember about great root cause analysis in general.

  • Have a blameless / faultless culture around post-mortems. If you do this well, people will feel comfortable saying: “I did X at Y time, which caused Z” without hesitation.
  • Use bullet-points and succinct statements to describe each cause.
  • Actually figure out what the causes are precisely. If you find yourself not confident about a cause, or if you’re “murky” on the details — that’s a sign to dig deeper.
  • Use the “but 4s” at every level of the tree, not just at the root.
  • Your “but 4” reasons should be relevant. One common philosophical critique of the “but for” test is that there are infinite reasons why something occurred. E.g. “because the sun came up today” is a reason why the user decided to go into the office. Focus on the ones that matter, which also means you have to have an inkling of what “good” practices and “bad” practices are. The “bad” practices are your “but for” causes.
  • Good RCAs result in a long list of “ideals” that give you a roadmap to improve the technical architecture at your company. Each “problem” can result in a potential action item, but not all of them have to. Not every “problem” is worth solving. Knowing that they won’t all become action items actually gives you more confidence to explore all possible solutions.
