The 3 Elements of Reliability Engineering

Don’t let your legacy systems ground your operations

Jeroen Heijmans
Software Improvement Group
Aug 15, 2016


On August 8th, Delta Airlines suffered a major computer outage, causing 1,800 flights to be cancelled and many more delayed. This is, once again, proof that software has become part of society’s DNA: we depend on it for every aspect of our lives. And if that software breaks down, some aspect of society also breaks down, as many thousands of passengers have experienced.

The root cause of the outage was a brief power interruption and subsequent failure in the hardware that was supposed to switch over systems to backup power. Such an incident predictably leads to snarky “this could have been prevented” comments, ranging from “They probably used up all the fuel doing weekly backup generator tests” to suggestions to download Netflix’s Chaos Monkey, a tool that introduces random failures into a system to test its reliability. On the other end of the spectrum, these events are seen as inevitable. Airline industry expert Seth Kaplan, for example, reasoned on NPR that “airlines aren’t like other businesses”, which in turn has led to a uniquely complex IT landscape in which some part is bound to fail at some point.

As usual, the truth is somewhere in the middle.

The 1st Element of Reliability Engineering: Prevention

One important element of reliability engineering is attempting to prevent this kind of problem from occurring by building a mature system. This means well-designed and well-written software, the right type of hardware, and solid operational and testing procedures (including, indeed, tools such as Chaos Monkey). But this alone can never guarantee reliability. No matter how often it’s tested, a hardware component can still malfunction as it did at Delta, perhaps due to wear or outside circumstances.

Software is also notoriously hard to get right: even in the absence of outright bugs, unexpected circumstances can still lead to incorrect outcomes. And yes, the backup generator refueler might have taken a day off without telling his colleagues to take over his tasks. In that sense, Kaplan is right: something is bound to go wrong, particularly in complex landscapes.

The 2nd Element of Reliability Engineering: Fault Tolerance

Fortunately, one such error doesn’t have to lead to flights being cancelled. That’s where fault tolerance comes in. A system that needs to be reliable should be designed so that it can deal with faults when they occur. This typically consists of two aspects: redundancy and isolation. Consider an airplane: despite all the effort going into designing a reliable engine, it is still possible that an engine will catch fire, for example due to a flock of birds being sucked into it. Most commercial airliners are equipped with two or four engines (redundancy), so that one burning engine will not bring the plane down. As for isolation, the plane is designed in such a way that the fire cannot (easily) spread to the other engines and other parts of the plane: fuel lines can be cut, electronics are compartmentalized and of course fire extinguishers and retardants are used.

Clearly, Delta Airlines is aware of this approach. The fact they employed backup power is a good example of building in redundancy, even if it failed in this case. It seems, however, that isolation was not taken care of sufficiently. Per Delta’s COO Gil West, “the systems that failed to switch over suffered from instability, affecting the performance of a customer service system used to process check-ins, conduct boarding, and dispatch aircraft.”

This is a problem that is both common and preventable.

One typical software solution for this issue is the “circuit breaker”, a concept popularized by Michael Nygard. Similar to its electric equivalent, it detects if a dependent system is unavailable or unreliable, and “cuts” the connection if that is the case. Of course, merely detecting a problem with a dependent system isn’t sufficient, but it is a first and necessary step and can at least prevent the aforementioned performance problems.
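
To make the idea concrete, here is a minimal Python sketch of a circuit breaker. It is an illustration under simplifying assumptions, not Nygard’s reference design and certainly not what Delta runs: the class name, the `max_failures` and `reset_timeout` parameters, and the single-threaded model are all hypothetical choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency marked unavailable")
            # Timeout elapsed: let one trial call through ("half-open").
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependent system in `breaker.call(...)` and treat `CircuitOpenError` as the signal to degrade gracefully, for example by serving cached data or queuing the work, rather than letting slow requests pile up.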

The 3rd Element of Reliability Engineering: Recovery

It’s possible that, despite all your efforts, a complete system failure occurs. Delta CEO Bastian says this is what happened: “it caused our entire system effectively to crash and we had to reboot and start the operation up from scratch”. At Delta, initial cancellations meant flight crews and planes were displaced, which in turn led to more delays and cancellations. West indicated this was hard to revert as re-planning all these flights was very time-consuming.

Recovery is often time-consuming (particularly in case of disasters), so a common approach is to have a failover system available. This is a second installation of your system at a different location. The failover system can be on stand-by to take over as soon as possible, but often some switch-over period is required. Per Kaplan, Delta (and other airlines) did not deploy their systems at multiple locations as this would make them more vulnerable to hackers or terrorists. I am not sure if this reasoning is correct, but from a risk analysis perspective it’s at least an unusual choice. Having multiple hosting locations indeed means there is more opportunity to attack, but having just a single location will increase the impact of a successful attack as it will take down everything.
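
The switch-over itself can be sketched in a few lines of Python. This is a deliberately simplified illustration: `Site` and `FailoverRouter` are hypothetical names, the site names are made up, and the hard parts of real failover (keeping data replicated between locations, detecting failure reliably, and the switch-over period mentioned above) are left out.

```python
class Site:
    """Hypothetical stand-in for one installation of the system."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled {request}"


class FailoverRouter:
    """Sends requests to the active site and switches to the standby on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Switch over: the standby becomes the active site.
            # If it is also down, its ConnectionError propagates to the caller.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


primary = Site("location-a")
standby = Site("location-b")
router = FailoverRouter(primary, standby)

primary.healthy = False            # simulate an outage at the primary location
print(router.handle("check-in"))   # prints "location-b handled check-in"
```

Note that the sketch only helps if the two sites do not share a single point of failure, which is exactly the argument for hosting them at different locations.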

Of course, failover locations aren’t the only measure you can take to improve recoverability. One that is often overlooked but can be very effective is having a fallback system available. Having a completely different system sounds expensive, but doesn’t have to be. I’ve seen cases where employees were trained on handling their tasks on paper as well as using the IT system, meaning they could fall back to the manual process when needed. It’s hard for me to tell if something like this would have worked at Delta Airlines, but it’s just one of many options.

In conclusion, the good news is that, contrary to Kaplan’s view, system failures like Delta’s are not systemic and can be prevented. It’s not easy, but following these Three Elements of Reliability Engineering will materially lower both the probability and the impact of a catastrophic failure like Delta’s on August 8th.
