Why don’t planes crash?

In 2017, the Aviation Safety Network reported that there were zero recorded deaths in commercial passenger jets. Given that planes fly at 600mph, at 40,000ft and are filled with flammable fluids, this is a pretty amazing achievement. How does the aviation industry maintain such high levels of safety?

To understand why, we have to travel back in time to 1940s Chicago and begin to understand the mindset of the aviation industry. From there, we’ll journey forward and see how the industry reacts to setbacks and how its continuous improvement culture really works.

The Chicago Convention

The International Civil Aviation Organization (ICAO) was established at the Chicago Convention on International Civil Aviation. This took place at a dark time in human history; WW2 was still raging. Despite this, 54 nations sent over 700 delegates to attend, often at great personal risk. The conference ran from 1 November to 7 December 1944, and at the end the following declaration was signed:

“the future development of international civil aviation can greatly help to create and preserve friendship and understanding among the nations and peoples of the world, yet its abuse can become a threat to the general security; and it is desirable to avoid friction and to promote that co-operation between nations and peoples upon which the peace of the world depends;”
The signing ceremony

This is a very grand statement, but that’s not what has set the airline industry up for safety. Instead it’s Article 37, which establishes International Standards and Recommended Practices.

Article 37 — Recommended Practices

The delegates agreed to standardize the most important practices. Even more importantly, the Convention established the process of investigating accidents and sharing the findings openly and transparently.

In the Westrum model of organizational culture this is known as a generative culture. In such a culture, failure prompts inquiry, not justice. It’s the establishment of this culture that has helped airline safety improve throughout the years.

The Comet

Our journey begins in the 1950s with the de Havilland Comet jetliner, built near Hatfield in the UK. This was a truly amazing plane, and at the time it flew faster and higher than any other airliner. It beat Boeing’s 707 to be the first passenger plane powered by jet engines.

The first commercial Jetliner!

Unfortunately, within a year of entering service three planes were lost. All suffered catastrophic breakups and there seemed to be no common cause. The fleet was grounded. With no root cause identified, the manufacturer (de Havilland) made approximately 60 distinct modifications to cover all possible causes of the accidents (including adding control surfaces, extra reinforcing etc). The authorities believed this made the plane safe and the fleet returned to service.

Comet failure

This proved to be a poor decision and just two weeks after re-entering service a similar failure occurred over Naples. Comet operations were suspended until the cause could be discovered.

Structural breakups were uncommon at the time and investigators went to extraordinary lengths to find the root cause.

The image below shows a pressure tank created around a Comet fuselage. This tank was filled with 200,000 litres of water to simulate the pressure of flight. The tank was drained, inspections occurred and this was repeated. It took 3060 pressure cycles before the fuselage failed. This provided the pivotal evidence that turned the direction of the crash investigation.

Simulating pressure for the Comet plane

Scientists found the unmistakable fingerprint of metal fatigue.

Metal fatigue in the corners of the windows is the most commonly attributed cause of these crashes. The square windows caused stress to concentrate at the corners, which eventually led to failure. Legend has it that this is why modern airplanes all have oval windows.

Comparing the results from simulated pressure (left) and wreckage (right)

But this isn’t quite the whole story. The airline industry is always striving for safety, and later analysis came to a different conclusion. The root cause was not in fact the square windows; it was the method of fixing them to the airframe. For the square windows rivets were used, whereas for the oval windows glue was used. This was the true cause of the failure of the windows.

My point here isn’t the details; it’s the relentless process of inquiry to find out exactly what went wrong.

Grand Canyon

Our next story takes us to 1956. Just after 9am, two planes left Los Angeles airport only minutes apart: a Lockheed Super Constellation and a Douglas DC-7.

Lockheed Super Constellation (bottom), DC-7 (top)

As they departed LAX they were operating under what’s known as instrument flight rules. Under these rules, Air Traffic Control and the instruments on the plane provide the primary control mechanisms for the pilots.

Both planes were heading across the US, and as they neared Nevada they switched to what’s known as visual flight rules (VFR). In VFR flying, pilots are responsible for avoiding collisions. I’m not a pilot, so I find that scary but it’s a common method of flying.

Both pilots encountered clouds near the Grand Canyon. As they approached the clouds they applied the same rule: climb to get better visibility. As they climbed through the clouds they collided. Both planes were fatally damaged, and this was the first commercial accident in which more than 100 people died.

Snippet from crash report.

The investigation concluded the cause was depressingly simple. The pilots simply did not see each other in time to avoid the collision; clouds in the area restricted visibility. Sadly the technology existed at the time to solve this problem and the report pulled no punches, pointing to the:

“insufficiency of en-route air traffic advisory information due to the inadequacy of facilities and lack of personnel in air traffic control”

As a result of this report, action was demanded in Congress and in 1958 Congress passed the Federal Aviation Act. This established a single agency controlling airspace across the US. It resulted in much more investment in both air-traffic control technology and the recruitment of more air traffic controllers. As a direct result of the establishment of this body, air collisions became less frequent.

United Airlines

United Airlines Flight 389 left New York’s LaGuardia airport bound for Chicago. As the plane made its approach to O’Hare at 9pm, Air Traffic Control asked the plane to descend to an altitude of 6000 feet.

The plane started descending in a controlled manner at a rate of about 2000 feet/minute. As it passed through 6000 feet it failed to level off. A few minutes later, United Airlines Flight 389 crashed into Lake Michigan, resulting in the loss of all on board.

Investigations began immediately. The plane was in top mechanical condition, and examination of the bodies showed no in-flight incapacitation. The black box showed no issues and air traffic control had issued the correct instructions. There was no obvious attributable cause for the plane’s demise.

But that’s not good enough for the aviation industry. No stone was left unturned. All possible causes of the crash needed to be investigated. Investigators noted that the design of the altimeter was different to that of other types of plane and began to investigate.

Different types of Altimeter

The hypothesis was that the pilots failed to read the altitude correctly and just continued to descend, unaware they were lower than they thought.

You might be wondering how reading the altitude could be at fault? Surely it’s simple?

Altitude was read from a three-pointer altimeter. As the name suggests, it has three pointers, indicating 10,000s, 1,000s and 100s of feet. The longest pointer indicates 100s of feet. The short stubby one indicates 1,000s of feet. And the middle one (with the cross) indicates 10,000s of feet.
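To see why a quick glance can go wrong, here is a minimal sketch of how an altitude maps onto three pointers. It assumes each pointer sweeps continuously around a dial numbered 0 to 9; the function names are hypothetical, and 16,000 feet is picked purely for illustration against the 6,000 feet the crew had been assigned.

```python
def pointer_positions(altitude_ft):
    """Return where each pointer sits on the 0-9 dial for a given altitude."""
    return {
        "ten_thousands": (altitude_ft / 10_000) % 10,
        "thousands": (altitude_ft / 1_000) % 10,
        "hundreds": (altitude_ft / 100) % 10,
    }


def differing_pointers(altitude_a, altitude_b, tolerance=1e-9):
    """List the pointers whose dial positions differ between two altitudes."""
    a = pointer_positions(altitude_a)
    b = pointer_positions(altitude_b)
    return [name for name in a if abs(a[name] - b[name]) > tolerance]


# Illustrative altitudes only: 6,000 ft (assigned) versus 16,000 ft.
print(pointer_positions(6_000))
print(pointer_positions(16_000))
print(differing_pointers(6_000, 16_000))
```

In this sketch the hundreds and thousands pointers sit in exactly the same place at both altitudes. Only the ten-thousands pointer moves, and only by a single numeral on the dial, which is the sort of difference that is easy to miss at a glance.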

The grainy photographs show the altimeters of the day. It’s pretty easy to imagine that you could misread these by taking a quick glance. Further investigation concluded that the design in that particular plane was the most susceptible to misreading, particularly around the altitude at which the plane failed to stop its descent.

Subsequent research looked at simplifying the design. In this design the 10,000 feet indication is always visible, minimising the chances of a bad reading.

I’ll be the first to admit, this isn’t much clearer to me, but apparently the research shows it was clearer for pilots of the time. This improved altimeter design was recommended and rolled out for aircraft instrument panels.

Air Florida Flight 90

Florida is one of the warmest states in the US. This meant that the crew of this plane had little experience of taking off in cold weather conditions. Like all pilots they were aware of the danger, but it wasn’t at the forefront of their minds.

As they sat on the runway the captain and his co-pilot began working their way through the take-off checklist, just as they’d done a million times before. As they worked through the list they were on auto-pilot, not taking into account the context they were in (it was very cold!). A simplified version of the dialog (available on the black box) is shown below:

Wings? 2, Check!
Landing gear? Present, Check!
Engine anti-ice on? Off, Check!

We now move onto the runway. Here’s the dialog from the black box. Look at the way in which the co-pilot is speaking:

Captain: It’s spooled. Real cold, real cold.
Co-pilot: God, look at that thing. That don’t seem right, does it? Uh, that’s not right.
Captain: Yes it is, there’s eighty.
Co-pilot: Naw, I don’t think that’s right. Ah, maybe it is.
Captain: Hundred and twenty.
Co-pilot: I don’t know.
Captain: Vee-one. Easy, vee-two.

The language is not directive. The plane hasn’t developed as much power as it needs for take-off, despite the instruments saying otherwise. The fluffy language has masked the true danger that awaits. It’s easy for the captain to dismiss these concerns and he lets the take-off proceed.

Captain: Forward, forward, easy. We only want five hundred.
Captain: Come on forward….forward, just barely climb.
Captain: Stalling, we’re falling!
Co-pilot: Larry, we’re going down, Larry….
Captain: I know it.
[SOUND OF IMPACT]

Only at the last minute does the co-pilot get directive: “We’re going down, Larry”. The plane stalled because it failed to generate enough lift, and Flight 90 crashed.

Accident report

The investigation concluded that multiple factors were to blame for the crash. I’ve highlighted one area that led to change as a result of this crash, namely the captain’s failure to reject the take-off when attention was called to anomalous engine readings in these particular weather conditions.

The crew were inexperienced; however, if they had realized the gravity of the situation then take-off could have been aborted and lives could have been saved.

This (and similar accidents) resulted in the introduction of something known as “cockpit resource management”.

Cockpit Resource Management

This is a set of training procedures for use in environments where human error can have devastating effects.

The goal is to create a psychologically safe environment. That means being able to “show and employ one’s self without fear of negative consequences of self-image, status or career”.

In the case of the 1982 crash there were two problems. The checklist at the beginning of the flight was read out without situational awareness (it was cold!), and a further opportunity was missed when the co-pilot realised the plane was going too slowly but failed to communicate his concerns clearly to the captain.

The core of CRM is to foster a climate or culture where authority may be respectfully questioned (no matter the role).

For communicating problems, the process is deceptively simple: Attention → Concern → Problem → Solution → Agreement.

Let’s imagine running through this process to abort the take-off. It might go something like this:

First off, grab Attention: “Hey Captain.”
Next state your Concern: “I’m concerned the engine instruments are not accurate.”
Now the Problem: “We may not have enough speed to achieve lift.”
A possible Solution: “We should abort the take-off.”
Seek Agreement: “Captain, please confirm.”

This makes it sound easy; it’s not. Now imagine going through a scenario like this when extremely stressed. That’s why CRM training is mandatory across the world and all crew have regular refresher sessions.

This, more than any other technology improvement, is the biggest single factor in saving lives worldwide.

Air Canada Flight 143

Air Canada Flight 143 was a routine flight from Montreal to Edmonton.

As the plane was over Ontario at a normal altitude of 41,000 feet and travelling at over 400mph, the 767’s alerting system chimed four times in quick succession, alerting the crew to a fuel pressure problem.

At that point the pilots believed they had a failed fuel pump in the left wing and switched it off. Believe it or not, this isn’t a big problem once the plane is running, because gravity does the job of feeding fuel to the engine. The flight computer showed more than adequate fuel and there were no other indications of any problems.

A few minutes later a second fuel light came on. When recalling the story, the captain said:

“Circumstances then began to build fairly rapidly”.

He’s not wrong!

More alarms triggered in the cockpit. The pilots remained calm and diverted to Winnipeg.

Two minutes later a completely new alarm sounded. It wasn’t one they had heard in the simulator. Everything went very quiet.

Starved of fuel, both Pratt & Whitney engines had flamed out.

At 1:21 this $40M state-of-the-art plane had become a glider.

With both engines out, the pilots searched their emergency checklist for a section entitled “How to fly a plane with no engines”. Unsurprisingly, no such section existed.

In an amazing stroke of luck, the captain, Bob Pearson, was an experienced glider pilot. When a plane is gliding it needs to descend in order to maintain enough lift. He found the optimal glide angle whilst his co-pilot began to calculate how far they could get.

The altitude at this point was roughly 35,000 feet. They weren’t going to reach Winnipeg.

The co-pilot had served in the Canadian Air Force at a base known as Station Gimli, so this is where they aimed to land. At their current descent rate it was in range. There was hope.
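As a rough illustration of the kind of arithmetic involved in estimating glide range, here is a minimal sketch. The 12:1 glide ratio is an assumption chosen for illustration, not a figure from this account, and the function name is hypothetical. The real calculation in the cockpit would also have had to allow for wind and manoeuvring.

```python
FEET_PER_NAUTICAL_MILE = 6076.12


def glide_range_nm(altitude_ft, glide_ratio):
    """Still-air distance covered while gliding down from altitude_ft.

    glide_ratio is feet travelled forward per foot of height lost.
    """
    return altitude_ft * glide_ratio / FEET_PER_NAUTICAL_MILE


# Assumed glide ratio of 12:1 from roughly 35,000 feet.
print(f"{glide_range_nm(35_000, 12):.0f} nautical miles")
```

The point of the sketch is that range scales directly with height: every thousand feet lost takes a fixed slice off the distance still reachable, so the choice of landing site has to be made early.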

How the pilot remembered the training camp…

Unbeknown to both pilots, there’d been some changes to the training area.

Gimli Motorsports Park

As the plane descended it became apparent they were coming in too high and too fast, raising the danger they’d run off the end of the track.

The plane landed on the drag strip and two factors helped avert disaster. The plane slammed into the guardrail between the drag strips, and this extra friction slowed it down. The front landing gear also failed, further increasing the friction.

The plane came to a complete stop. Everyone left the plane safely. Despite the engines cutting out at around 40,000 feet, EVERYONE survived.

Safely landing on the drag strip!

So what went wrong?

Well, at the time of the incident Canada was converting to the metric system, and the new 767s were the first to be calibrated in metric units instead of imperial.

When you’re filling up a plane with fuel, the quantity is measured by volume but the flight needs a given mass, and the mass of a given volume varies with temperature, so you use a conversion factor. The mass of a litre of fuel is 0.803 kg. Unfortunately, the technicians used the figure for the weight of a litre of fuel in pounds (1.77). This incorrect conversion factor (applied both to the fuel added and to the fuel already in the tank) led to the plane taking off with less than half the fuel it needed!
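The effect of the wrong factor is easier to see with the arithmetic written out. This is a minimal sketch: the two conversion factors are the ones quoted above, while the required fuel load (22,300 kg) and the volume already in the tanks (7,682 litres) are widely reported figures that are treated here as assumptions. The function name is hypothetical.

```python
KG_PER_LITRE = 0.803  # correct factor, from the text above
LB_PER_LITRE = 1.77   # factor actually used, from the text above

REQUIRED_KG = 22_300     # assumed fuel load needed for the trip
LITRES_IN_TANKS = 7_682  # assumed volume measured before refuelling


def litres_to_add(required_mass, litres_on_board, mass_per_litre):
    """Volume to pump in so that the total reaches required_mass."""
    mass_on_board = litres_on_board * mass_per_litre
    return (required_mass - mass_on_board) / mass_per_litre


# What should have been added versus what the wrong factor suggested.
correct_litres = litres_to_add(REQUIRED_KG, LITRES_IN_TANKS, KG_PER_LITRE)
actual_litres = litres_to_add(REQUIRED_KG, LITRES_IN_TANKS, LB_PER_LITRE)

# Real mass on board after refuelling with the mistaken volume.
actual_kg = (LITRES_IN_TANKS + actual_litres) * KG_PER_LITRE

fraction_of_required = actual_kg / REQUIRED_KG
fraction_of_top_up = actual_litres / correct_litres

print(f"Should have added {correct_litres:,.0f} litres, added {actual_litres:,.0f}")
print(f"Fuel on board: {fraction_of_required:.0%} of what was needed")
print(f"Top-up delivered: {fraction_of_top_up:.0%} of what was needed")
```

Under these assumed figures, the top-up was roughly a quarter of what it should have been, and the total on board was a bit under half of the required load. Because the same wrong factor was applied to both the tank reading and the top-up, the total always ends up at 0.803 / 1.77 of the target, whatever the starting quantity.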

Investigators tackled a range of issues

Aircraft investigators didn’t stop there. The in-depth investigation revealed multiple failures in how ground crew interacted with the plane, and recommended dedicated roles, independent checks and upgraded fuel calculations.

This (again) led to reform and a defence in depth strategy for making sure that planes don’t run out of fuel.

The F22 Raptor

Our final story is about the F22 Raptor.

The F22 Raptor, one of the most amazing planes I’ve ever seen.

It’s one of the few aircraft that can maintain supersonic flight without using afterburners, and if it turns the wick up it can fly at speeds in excess of Mach 2.

It’s a completely bonkers plane. The engines have directional nozzles, increasing the manoeuvrability of the aircraft. That comes at a cost though: you need an awful lot of software to keep this flying in a straight line.

In 2007, a group of these stealth fighters was deployed to Okinawa. Six jets flew from Hawaii to Japan.

Hawaii to Japan.

This is roughly the route they took. It’s a long journey over the North Pacific Ocean, and it includes that purple line in the middle: the International Date Line.

As the squadron crossed the International Date Line, the Flight Management System failed. The jets were unable to return the “Present Position Location”. As you might imagine, a fair few systems on a plane depend on knowing where you are. This resulted in a chain reaction of failures, leaving the pilots completely in the dark.

They tried to reset their systems, but that didn’t work. They were helpless, having to fall back to visual navigation.

Luckily, this was a non-combat situation and the weather was good. They were accompanied by KC-10 refuelling planes. The billion dollar warplanes slowly followed the refuelling planes and eventually limped home.

So why did this happen?

Why did the F22 fail?

The software for the operational flight program consisted of four programs, all written in Ada 83 and later transitioned to Ada 95.

Even back then, some 20 years ago, it was hard to find Ada programmers. It was a time when technology was moving from analogue to digital flight controls (fly by wire), and this was one of the first aircraft to require so many lines of code.

There are about 800,000 lines of code. Ada was forced upon the teams by government mandate, so a lack of training in the environment may have been partly responsible.

It was a first Agile project too, and I’m not sure I’d want my programme manager for that project saying “fail fast, fix fast”. Investigation showed the requirements were poorly specified and this particular case was not tested. There was not enough testing; in fact the test harness of the flight simulator didn’t even include this as a possible destination to fly to. They hadn’t baked quality in.

As a consolation, at least the fix came out in 2 days!

This failure resulted in more guidance about the software in planes. For flight avionics software, more tools were used to encourage safety, for example formal methods or model-based development. This allowed avionics manufacturers to converge on common approaches, apply more rigorous testing and have the results independently verified.
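The details of the actual defect aren’t covered here, so the following is only an illustration of the kind of boundary case the date line creates, not a reconstruction of the F22 software. It assumes longitudes in degrees, positive east and negative west, and both function names are hypothetical.

```python
def naive_longitude_delta(lon_from, lon_to):
    """Change in longitude, ignoring the wrap-around at +/-180 degrees."""
    return lon_to - lon_from


def wrapped_longitude_delta(lon_from, lon_to):
    """Change in longitude taking the shortest way round, in [-180, 180)."""
    return (lon_to - lon_from + 180.0) % 360.0 - 180.0


# Flying west across the date line: 179.5 degrees West to 179.5 degrees East.
before, after = -179.5, 179.5

print(naive_longitude_delta(before, after))
print(wrapped_longitude_delta(before, after))
```

The naive version reports a jump of 359 degrees east for what is really a one degree step west. A test that flies a simulated aircraft across 180 degrees would catch this immediately, which is why a harness that never offers a destination on the other side of the date line leaves such a dangerous gap.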

With the recent 737 MAX incidents, this is clearly still an area that can be further improved.


Trending in the right direction.

So that brings me to the end. Why don’t planes crash? Well, the truth is planes do crash, but it’s the processes afterwards, stemming from that first convention in 1944, that have continually improved airline safety. Every incident (whether fatal or not) prompts inquiry. From inquiry, the aviation authorities seek answers (not blame) and commit to adapting based on the results.

It’s this growth mindset from the entire aviation industry that has dramatically improved safety and will continue to do so into the future.


So what can the software development industry learn from this? There are some great books below that provide some inspiration.


This was originally a talk given at Redgate’s Level Up 2019 conference. See more about us at https://ingeniouslysimple.com and we’re hiring!