Single Point Of Failure
Lessons from the Boeing 737 MAX and the Fukushima Daiichi Nuclear Accident
The story of each company's or person's success is different and difficult to replicate. The story of failures, however, has many common elements and is, sadly, very easy to replicate.
Many would have read accounts of the Boeing 737 MAX accidents and the subsequent analysis of the problem. There are now hundreds of articles examining the different aspects: the reliance on a single Angle of Attack (AoA) sensor that was not reliable; the much-maligned MCAS software system, which depended on that one sensor alone and was designed to compensate for the bad flight dynamics caused by the bigger, 'better' engines latched onto the older airframe to reduce cost and time; and then a slew of small but important misses that undermined the many fail-safes built into complex systems. For example, MCAS was not advertised, nay not even documented in the heavily documented flight manuals, since a similar system had been used in their fighter planes before and it was not considered really important information for the pilots.
The Boeing 737 MAX Saga: Lessons for Software Organizations (https://embeddedartistry.com/wp-content/uploads/2019/09/the-boeing-737-max-saga-lessons-for-software-organizations.pdf) is an excellent read that draws out the lessons from this episode for software organisations.
The usual suspects all come out: rushed deliveries, bad design decisions, and business strategy. Along with them come nuggets of wisdom, such as: to improve something, it is not enough to remove the things you do not want; you have to actively go after the things you do want.
Fukushima Daiichi Nuclear Accident
Let's look at another such disaster and see if there are common threads. Here is an excerpt from the report on the Fukushima Daiichi Nuclear Accident (https://www.ncbi.nlm.nih.gov/books/NBK253938/). As you can see from that link, this was a very complex root cause analysis to carry out.
The accident at the Fukushima Daiichi nuclear plant was initiated by the March 11, 2011, Great East Japan Earthquake and tsunami. The earthquake knocked out offsite AC power to the plant and the tsunami inundated portions of the plant site. Flooding of critical plant equipment resulted in the extended loss of onsite AC and DC power with the consequent loss of reactor monitoring, control, and cooling functions in multiple units. Three reactors sustained severe core damage (Units 1, 2, and 3); three reactor buildings were damaged by hydrogen explosions (Units 1, 3, and 4); and offsite releases of radioactive materials contaminated land in Fukushima and several neighbouring prefectures. The accident prompted widespread evacuations of local populations and distress of the Japanese citizenry, large economic losses, and the eventual shutdown of all nuclear power plants in Japan. (https://www.ncbi.nlm.nih.gov/books/NBK253938/)
While the plant was designed to fail safely with regard to the earthquake, there were no good fail-safes in place for the tsunami that the massive earthquake triggered. In hindsight this looks like a clear oversight in the system design, but in reality it is very hard to think through all the complex interactions that can cause failures and to build a perfect system.
Here the events were triggered by an earthquake and tsunami that knocked out many systems. So many systems failed or fell back onto their fail-safes that the operators were overwhelmed. The most damage was done by flooding from the tsunami, which took out many of the backup generators that were the only power source after the earthquake downed the transmission towers.
However, the disaster was classified as man-made: there was a set of fail-safes and, sadly, all of them failed, and in the end a single sensor that indicated the cooling system was working was itself not working, a fault that human error left uncorrected in time.
The two systems are in completely different domains. The only common characteristics are their complexity and their seemingly long safety records, which gave a false sense of safety.
Like many, I have been fascinated by the discussions and articles related to the Boeing system design failure. Applying those learnings to our own system, we set out to list its single points of failure.
It was actually surprising to see the number of systems that were single points of failure. Here is a rough chart depicting them. Listing SPOFs is the first step, as surprisingly many go unnoticed, especially as systems evolve and new components get added over time.
There is also a myth that many systems are highly available de facto. Take the example of a Kubernetes cluster in an HA configuration with multiple masters and multiple workers.
If these masters and workers are virtual machines on something like OpenStack, then the bare-metal server on which all the masters happen to be hosted could itself be a single point of failure. Or if you are using a storage system like Rook-Ceph to provide the persistence layer for applications in the cluster, and the Rook-Ceph stack is broken by an upgrade gone bad or a wrong configuration, the whole cluster becomes useless to those applications. Or everything may be redundant, but you have only one Docker registry and it has an outage, or only one Jenkins server, or … the list goes on.
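As a minimal sketch of how such hidden SPOFs can be surfaced, the snippet below (using the official Kubernetes Python client) groups control-plane nodes by a failure-domain label and warns when they all share one domain. The label topology.kubernetes.io/zone is standard; the fallback label example.com/bare-metal-host is purely hypothetical and stands in for whatever label your virtualisation layer sets to identify the underlying host.

```python
# Sketch: flag the case where all control-plane nodes share one failure domain.
# Assumes kubectl access and the Kubernetes Python client (pip install kubernetes).
from collections import defaultdict
from kubernetes import client, config

# 'example.com/bare-metal-host' is a hypothetical label your platform would set.
FAILURE_DOMAIN_LABELS = ["topology.kubernetes.io/zone", "example.com/bare-metal-host"]
CONTROL_PLANE_LABELS = ["node-role.kubernetes.io/control-plane",
                        "node-role.kubernetes.io/master"]

def control_plane_domains():
    config.load_kube_config()                 # or config.load_incluster_config()
    nodes = client.CoreV1Api().list_node().items
    domains = defaultdict(list)
    for node in nodes:
        labels = node.metadata.labels or {}
        if not any(l in labels for l in CONTROL_PLANE_LABELS):
            continue                          # only control-plane nodes matter here
        domain = next((labels[l] for l in FAILURE_DOMAIN_LABELS if l in labels),
                      "unknown")
        domains[domain].append(node.metadata.name)
    return domains

if __name__ == "__main__":
    domains = control_plane_domains()
    for domain, names in domains.items():
        print(f"{domain}: {names}")
    if len(domains) == 1:
        print("WARNING: all control-plane nodes share one failure domain - a hidden SPOF")
```

The same grouping idea applies to worker nodes, storage daemons, or anything else whose replicas are only nominally independent.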
This may be one of the simplest lessons to take away from the failures of supposedly indestructible systems.
Sometimes many highly available systems are all connected to one master system, usually one that handles user authentication. Even with our SPOF analysis we missed one such interaction. It just goes to show that there is a deeper design pattern here to be uncovered and addressed during system design. Until then, workarounds like a SPOF inventory and a checklist help.
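As an illustration of such an inventory (not our actual one; the component names and counts below are hypothetical), the checklist can be as simple as a list of components, how many independent instances of each are running, and which shared services they all depend on:

```python
# Illustrative SPOF checklist; names and counts are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    instances: int                                     # independent replicas running
    depends_on: list = field(default_factory=list)     # shared services it needs

inventory = [
    Component("kubernetes-control-plane", instances=3,
              depends_on=["openstack-hypervisor-1"]),
    Component("docker-registry", instances=1),
    Component("jenkins", instances=1),
    Component("auth-service", instances=2, depends_on=["ldap-master"]),
]

# A single-instance component is a SPOF candidate; a dependency shared by
# several components (like the authentication master above) deserves the same scrutiny.
single_instances = [c.name for c in inventory if c.instances < 2]
shared_deps = {d for c in inventory for d in c.depends_on}

print("Single-instance components:", single_instances)
print("Shared dependencies to review:", sorted(shared_deps))
```

Running through even a crude table like this regularly catches the registries, build servers, and authentication masters that quietly become single points of failure as the system grows.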