Great article, Mathias!
A note about the paradoxical flavor of the last item (18) “Failure free operations require experience with failure”.
The paradox is that by striving for high-quality performance we can often reduce the frequency of overt, catastrophic failure to quite low levels, so low that the systems may appear to be failure free. That appearance encourages people to pursue other goals that then reduce safety.
After a long period without a major failure, some people will conclude that their system is ‘safe’. It then seems that there are opportunities for cost or effort savings, and production pressure (which is everywhere and all the time!) leads to actions that generate new exposures to failure [e.g. seeking higher production with the same resources].
Of course, after the ‘big one’ it is easy to see how the decisions that favored production were unwise. I doubt that anyone in Delta Airlines today regards the 2014 decision to bring its IT processing entirely in-house as an entirely good idea! The difficulty is appreciating the reduction in safety in advance.
Our studies of expert practice in high-consequence domains reveal that experienced, thoughtful practitioners nearly always have some direct experience with catastrophic events and learn a great deal from them. These people are a great resource in any organization but, because they are often fairly low in the chain of command, it can be hard to draw on that experience. Perhaps the most important observation by Rochlin, LaPorte, and Roberts in their paper on aircraft carrier operations is that the US Navy works deliberately and quite hard to build and maintain a cadre of experts with the necessary knowledge of how things work: in this case, the people of Chief Petty Officer rank on the carrier.
Perhaps the greatest challenge in the IT world today is generating, connecting, and sustaining such a cadre. Wherever I look in IT I find groups struggling with this and I believe our future depends on success here.
Again, thank you for the nice shout-out. If you will allow me a plug for my own work: our SNAFUcatchers project is examining this and other issues in collaboration with some big and small internet-facing businesses. We hope to have results to present in a few months and will discuss the work at the next Velocity meeting in New York. I’d be happy to talk more about How Complex Systems Fail with anyone at that meeting — please say hi if you are there!