Hey British Airways, Here’s Five Whys It Was Not The IT Person’s Fault!
Unless you’ve been living under a rock, you’ve probably heard about the massive British Airways outage that took down their backend computer systems. The outage stranded 75,000 customers for over three days. If that wasn’t bad enough, it now looks like they’re placing the blame solely on the IT person.
British Airways is doing what many tech companies typically do: when a bad outage happens, scapegoat the individual! Individuals are never the root cause of an outage; the root cause is always a missing system, process, or control. And now I’m going to tell you Five Whys.
If you really want to learn the root cause of a bad outage or situation, I propose using the Five Whys. Wikipedia tells us that it “is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem.”
I think my explanation is much simpler. We use recursion to determine the root cause of a situation by simply asking ourselves why. OK, maybe that wasn’t simpler, but the best way to describe this is by example. Let’s try to understand why I was late to work this morning. On each iteration, once we make a statement, we ask why against that statement, and we do that at least five times or until we get to the root cause.
- Problem: I was late to work this morning.
- Question: Why?
- Answer: Because my car did not start.
- Question: Why didn’t your car start?
- Answer: Because the battery was dead.
- Question: Why was the battery dead?
- Answer: Because the alternator did not recharge the battery.
- Question: Why didn’t the alternator recharge the battery?
- Answer: The alternator belt was broken.
- Question: Why was the alternator belt broken?
- Answer: It was old and had not been replaced.
- Question: Why wasn’t the old belt replaced?
- Answer: I did not take my car in for its 100,000-mile maintenance.
- Root Cause: I was late to work this morning because I did not take my car in for regularly scheduled maintenance.
My example iterates more than five times, but we still get to the root cause. If I had just performed regularly scheduled maintenance on my car, I would not have been late to work. In this case the root cause was my fault because I was lazy, but in a corporate environment with complicated distributed systems and lots of people involved it will never ever be one person’s fault.
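To make the loop concrete, here is a minimal sketch of the late-to-work example in Python. The `CAUSES` map and the `five_whys` function are names made up for illustration; the technique itself is just a conversation, not a library. Each statement maps to the answer we gave when we asked why, and the function keeps asking until nobody has a further answer.

```python
# Hypothetical cause map for the late-to-work example: each statement maps
# to the answer given when we ask "why?" about it.
CAUSES = {
    "I was late to work this morning": "My car did not start",
    "My car did not start": "The battery was dead",
    "The battery was dead": "The alternator did not recharge the battery",
    "The alternator did not recharge the battery": "The alternator belt was broken",
    "The alternator belt was broken": "The belt was old and had not been replaced",
    "The belt was old and had not been replaced": "I did not take my car in for scheduled maintenance",
}


def five_whys(problem, causes):
    """Follow the chain of "why?" answers until no further cause is recorded."""
    chain = [problem]
    seen = {problem}
    while chain[-1] in causes:
        answer = causes[chain[-1]]
        if answer in seen:  # guard against a circular explanation
            break
        chain.append(answer)
        seen.add(answer)
    return chain


chain = five_whys("I was late to work this morning", CAUSES)
for depth, statement in enumerate(chain):
    # Indent each answer one level deeper than the statement it explains.
    print("  " * depth + statement)
print("Root cause:", chain[-1])
```

A single chain like this only works when every why has exactly one answer. Real incidents branch, so the map would need to hold a list of answers per statement.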
Now let’s try this on the British Airways outage. I don’t have all the details, and their case may be much more complicated (and involve multiple branches). We’re going to try anyway, to show why we should all stop pointing the finger of blame at people and instead look at the entire system and keep questioning ourselves until we discover the true root cause.
Problem: 75,000 customers were stranded for three days.
- Question: Why weren’t we able to route passengers, baggage, or launch flights for three days?
- Answer: The core mainframe was down and we could not look up any passenger, baggage, or flight information.
- Question: Why was the core mainframe down?
- Answer: It lost battery backup power during a power outage because of operator error.
The battery backup issue would be its own branch. I’m skeptical that this even happened, but if it did, it should not be possible to turn off a UPS feeding critical components when there is no “shore” power. That said, we’ll just focus on what happened after the power outage.
Notice, blaming things on a single operator is not enough. We need to dive deeper.
- Question: The power outage only lasted 15 minutes. Why didn’t the system return to a normal state after power had been restored?
- Answer #1: All of the machines and mainframe components turned on at the same time. This caused a huge surge that drew too much power, which made several machines rapidly power-cycle and damaged several components.
- Question: Why wasn’t there a documented procedure detailing how core systems should be restarted from scratch to avoid drawing too much power?
- Takeaway #1: There needs to be a documented procedure for restarting the system from a cold-start to avoid drawing too much power at once and overloading circuits.
- Answer #2: The system did not immediately return to a normal state because some key services came online before their dependencies did and immediately errored out. It took lots of debugging time to discover why services weren’t working and to fix the errors.
- Question: Why did key services come online before their dependencies?
- Answer: No one had ever restarted the datacenter from scratch before. We did not really know in what order to turn services on or what our dependencies were.
- Takeaway #2: There needs to be a dependency map for all services. This should be combined with Takeaway #1 so that hardware and services are restarted in the right sequence.
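Here is a rough sketch of what combining the two takeaways could look like, assuming Python 3.9+ for the standard-library `graphlib` module. Everything specific in it is hypothetical: the service names, the wattage figures, and the `cold_start_plan` function are invented for illustration, since I do not know what British Airways actually runs. The idea is to sort components by their dependencies, then split each group of ready-to-start components into batches that stay under a power budget.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up before it.
DEPENDS_ON = {
    "network": [],
    "storage": ["network"],
    "database": ["storage", "network"],
    "auth": ["database"],
    "baggage-api": ["database"],
    "booking-api": ["database", "auth"],
    "checkin-api": ["auth"],
}

# Hypothetical startup power draw per component, in watts.
POWER_W = {
    "network": 200,
    "storage": 500,
    "database": 600,
    "auth": 300,
    "baggage-api": 400,
    "booking-api": 300,
    "checkin-api": 300,
}


def cold_start_plan(depends_on, power_w, budget_w):
    """Return batches to power on in order, honoring dependencies and a power budget."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the map has a circular dependency
    plan = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already started.
        ready = sorted(sorter.get_ready())
        batch, used = [], 0
        for name in ready:
            # Start a new batch when adding this component would exceed the budget.
            # A component that alone exceeds the budget still gets its own batch.
            if batch and used + power_w[name] > budget_w:
                plan.append(batch)
                batch, used = [], 0
            batch.append(name)
            used += power_w[name]
        plan.append(batch)
        sorter.done(*ready)
    return plan


plan = cold_start_plan(DEPENDS_ON, POWER_W, budget_w=650)
for step, batch in enumerate(plan, start=1):
    print(f"Step {step}: power on {', '.join(batch)}")
```

A real runbook would also wait for each batch to pass health checks before moving on. The point of the sketch is only that the start order can be computed from a dependency map instead of being rediscovered in the middle of an outage.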
I’m going to stop here, but there are probably lots of other branches we could keep going down, and to be honest, I do not have all of the details. When we try to understand and resolve complex IT issues, we always need to go deeper. Ask questions past the point of “this engineer or operator did X or Y” and on to what happened after that. Don’t stop until you get to the system or process that was not in place or did not exist.
One of the reasons air travel is so safe is that the NTSB does some of the best root cause analysis I have ever seen. As a result, I love reading reports and watching documentaries about why planes crash. I suspect that in the British Airways incident, the root cause was probably several things:
- No dependency maps for hardware and services, and no documentation on how to bring up the datacenter from a cold start.
- No secondary datacenter. If the loss of your datacenter can result in a 100-million-dollar outage, you need a secondary datacenter that is always online and ready to take over. If you have one and it was not used in this incident, you need to root-cause why it did not work.
- No testing of data backup and restore procedures. I’m positive that there were some data consistency/loss issues if all of the machines truly did lose power.
- Not enough training and practice. It is pointless to have a procedure and never actually practice it. Humans learn through repetition.
The big takeaway here, however, and what my gut tells me, is that they did not know how to restore the system from a cold start. This is a scenario in which everything is off and you need to turn it all back on again. What’s scary is that this is probably true for most companies, and we should all ask ourselves what we’d do if our datacenter was in the same situation.
It would probably take all of us days to get everything working correctly again too. It is just something that most companies never do. We can all probably learn from this situation.
I’ll be waiting with bells on for the true Root Cause Analysis.