Where’s Wald?

During World War II, the Allies were dealing with a problem: increasing the survivability of airplanes using armor. A completely armored plane would be too heavy to fly useful missions. The question was how to apply enough armor to get the airplanes back without reducing their mission capacity. Although the survival of a given plane might have much to do with the skill of the crew, the weather, or other specific factors, the sheer number of aircraft and the repetitive nature of their losses implied that statistical analysis could guide the placement of armor. The prevailing idea was to analyze the bullet hole locations statistically and then apply armor in that pattern. Although they tried different forms of armor, the aircraft losses continued.

Abraham Wald (1902–1950), a Romanian Jewish mathematician who emigrated to the U.S. to escape persecution just before WWII, looked at the problem and came up with a simple idea.


Instead of looking at where the bullet holes were, they should be looking at where they weren’t. The surviving planes must not have been hit in vital areas, since they had survived to return and be examined, and the absence of holes would point out where those vital areas were. Wald’s contribution was to recognize that the bullet holes they could see on the returning planes were the non-lethal ones. The statistics demonstrated that holes were more or less evenly spread over the surviving aircraft, with the exception of the engines. Wald asked the question, “Where are the missing holes?” and this led to the realization that armor should be added to the areas where there were no bullet holes in the surviving planes. It was the engines that were vital. Reminds me of the old statistician joke…

“How many statisticians does it take to change a light bulb? None. They can’t do it, but they can prove that it can be done.”

One of Wald’s strengths was his mathematical background. He lived and breathed in a world of abstractions. Wald looked at the planes as mechanical upholstery, an abstraction of guns, metal, and struts. Excuse my pun, but this allowed him to step back from the nuts and bolts that other analysts were so busy trying to understand that they couldn’t see the simple answer. Another form of this that we see in everyday scenarios is the expression “thinking outside the box.” “Out of the box” thinking allows us to see the problem from a different viewpoint: an abstraction. The phenomenon Wald pointed out is known as survivorship bias.

Survivorship bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to false conclusions in several different ways. It is a form of selection bias — Wikipedia
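To make the definition concrete, here is a small simulation sketch of the bullet-hole problem. Everything in it is an assumption for illustration: the four sections, the lethality numbers, and the number of hits per plane are made up and are not Wald’s data. Hits land evenly across the plane, but a hit to the engine is assumed to be far more likely to bring the plane down, so engine holes end up under-represented on the planes that make it home.

```python
import random

# Hypothetical sections and numbers, chosen only for illustration (not Wald's data).
SECTIONS = ["engine", "fuselage", "wings", "tail"]
# Assumed chance that a single hit in this section brings the plane down.
LETHALITY = {"engine": 0.6, "fuselage": 0.1, "wings": 0.05, "tail": 0.1}


def simulate(num_planes=10_000, hits_per_plane=4, seed=42):
    """Return (holes on all planes, holes on returning planes) per section."""
    rng = random.Random(seed)
    all_holes = {s: 0 for s in SECTIONS}
    survivor_holes = {s: 0 for s in SECTIONS}
    for _ in range(num_planes):
        # Hits land uniformly at random across the sections.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if every single hit fails to down it.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        for s in hits:
            all_holes[s] += 1
            if survived:
                survivor_holes[s] += 1
    return all_holes, survivor_holes


def shares(holes):
    """Convert raw hole counts into each section's fraction of the total."""
    total = sum(holes.values())
    return {s: count / total for s, count in holes.items()}


if __name__ == "__main__":
    all_holes, survivor_holes = simulate()
    all_shares, survivor_shares = shares(all_holes), shares(survivor_holes)
    for section in SECTIONS:
        print(f"{section:9s} all planes: {all_shares[section]:.2f}  "
              f"returning planes: {survivor_shares[section]:.2f}")
```

Under these assumptions, an analyst who only sees the returning planes would find the fewest holes on the engines and might conclude the engines rarely get hit. The “all planes” column, which nobody on the ground could observe, shows the hits were spread evenly all along.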

In IT, our selection process is often based on looking at negatives, typically in the form of failures: how many outages, transaction failures, or missed SLAs? We apply tremendous amounts of telemetry around failure. It’s human nature to observe failure and to try to control it. We think we can make it better if we can identify who or what broke it. Sidney Dekker, a leading researcher in the area of safety and resilience engineering, says that in the safety field they measure things by the absence of negative events, such as “no injuries in the last ninety days.” He suggests that this is the wrong way to look at it and that they need to do safety differently. It’s not about what failed; it’s about people’s success in not failing or, even better, succeeding. The counterintuitive idea is that things sometimes fail because most things actually succeed. We have to find out how people succeed. We need to stop measuring the quality of work by the absence of negatives and start seeing it as the presence of adaptive capacity.

In IT, we have the same problem. When we see one failure out of a hundred, we tend to go down the rabbit hole on the one failure instead of trying to understand the other ninety-nine that went right. I recently had an opportunity to have a discussion with John Allspaw (former CTO of Etsy) and Richard Cook (author of the paper “How Complex Systems Fail”) about the difference between Lean and Safety. We all agreed that IT is a complex adaptive field. One of the things that Cook says he does to understand complexity in patient care is to look for smaller functional teams that are coping with complexity and try to understand their coping strategies. How do they learn to steer the complex system and keep it away from the sharp boundaries, the so-called cliffs? He looks for the things they get right and looks for ways to facilitate their efforts so as to enhance their capacity to cope with complexity and chaos. You could say he avoids getting lost in survivorship bias because he is not telling them how to fix things. Dekker would say he is looking for the presence of positive capacity.

Finally, I want to tell you about a story I heard recently from Dr. Dekker. I had the privilege of spending a day with him at Griffith University in Brisbane, Australia. After lunch, he invited me to participate in a lecture he was giving to some of his graduate and PhD students. Dr. Dekker often refers to the Abraham Wald story in his lectures. In fact, I first heard it from him in 2013 at DevOpsDays Brisbane, where he and I were both keynote speakers. In this lecture, he told the Wald WWII story again. However, a few slides later, he showed a picture of a guy standing in a field, wearing a hard hat with a sticker that said GSD. Dr. Dekker explained that the GSD sticker was on his hard hat so that everyone knew he was the guy who “Got Shit Done”. Dr. Dekker told the class that the guy’s name was Bob. Then he asked everyone in the class how Bob was related to the Abraham Wald story. I didn’t feel too bad when I couldn’t come up with an answer, because the other attendees in the lecture were post-doc and master’s thesis candidates studying under Dr. Dekker, and they didn’t get it either. The answer was that Bob “Got Shit Done” in a place where it was impossible to get things done.

In other words, other people continually failed to get things done. It was a chemical plant that was fraught with dysfunction and uncontrolled, misunderstood complexity. Bob “got shit done” in this crazy environment. Do you have a Bob in your organization? Just like Abraham Wald’s observation about the bullet holes, we need to look at our Bobs more closely. How the hell do they get things done? So in our own environments, my advice to you is to play the “Where’s Wald” game and try to find the missing holes. Find the people who, in the worst-case situations, consistently get things done where others can’t. If you can follow this individual’s behaviors, you might be able to enable others to “Get Shit Done”.

Update: Tom Hall @tmhall99 sent me an interesting note about this post. He suggested that Brent in Gene Kim’s Phoenix Project would have been a great candidate as someone to follow to figure out how he continually “Got Shit Done.”

I want to add a special thanks to Richard I. Cook, MD for his review and input on this post.
Like what you read? Give John Willis a round of applause.
