How lines of code made a rocket explode.
Story of the most expensive software bug in the world.
On June 4, 1996, just 37 seconds after its celebrated launch, the Ariane 5 rocket flipped 90 degrees in the wrong direction. Alarms started blaring inside mission control in French Guiana.
The rocket was making an abrupt course correction that was not needed, compensating for a wrong turn that had not taken place. In short, the on-board computer had concluded that the rocket had deviated from its course and tried to correct it, turning the rocket 90 degrees in the wrong direction. The computer was looking at the flight data coming from the inertial guidance system, the eyes and ears of the rocket, and what it saw was bizarre and nonsensical. Nevertheless, the inertial guidance system was passing that data off as authentic flight data, which it was not.
Somewhere inside the code base of the inertial guidance system, a subroutine written in Ada (a statically typed programming language descended mainly from Pascal) was trying to cram a 64-bit floating-point number into an unprotected 16-bit signed integer. The value did not fit, and the conversion raised a hardware exception: an operand error, informally known as an overflow.
At T plus 37 seconds, the guidance system of Ariane 5 shut itself down.
Less than two seconds later, at a height of about 4 km, massive aerodynamic forces ripped the boosters away from the main stage of the misaligned rocket. This triggered the self-destruct mechanism, and the spacecraft was engulfed in a spectacular fireball of burning propellant, along with its payload of four expensive, uninsured scientific satellites.
Someone hadn't written proper exception handlers into the code base, and Ariane 5 was now fiery rubble scattered across roughly 12 square kilometers of the mangrove marshlands of French Guiana.
It took the European Space Agency 10 years and $7 billion to produce Ariane 5, and it was intended to give Europe an overwhelming supremacy in the commercial space business. But the mission was doomed.
The destruction of the scientific satellites delayed research into the workings of the Earth's magnetosphere by almost 4 years.
The European Space Agency assembled a team to recover the logs from the two inertial reference systems. The debris from the exploded rocket was spread over almost 12 square kilometers. The team had to navigate dangerous marshland terrain and hazardous chemicals dispersed from the rocket to recover the logs.
The forensic analysis quickly identified the fault as a software bug hidden deep in the rocket's inertial reference system. Among other things, this system computed a value related to the rocket's horizontal velocity, known as the Horizontal Bias, or BH. A 64-bit floating-point variable was used to represent it, which was perfectly adequate.
A 64-bit floating-point variable can represent an enormous range of values, while a 16-bit integer has only 65,536 distinct values. For a signed 16-bit integer, that means the range from -32,768 to 32,767.
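To make the size difference concrete, here is a small Python sketch. Python is used purely for illustration (the flight software was written in Ada), and the helper name `fits_in_int16` is made up for this example. It derives the range of a 16-bit signed integer from its bit width and uses the standard `struct` module to check whether a number fits in two bytes.

```python
import struct

BITS = 16
DISTINCT_VALUES = 2 ** BITS          # total number of 16-bit patterns
INT16_MIN = -(2 ** (BITS - 1))       # smallest signed 16-bit value
INT16_MAX = 2 ** (BITS - 1) - 1      # largest signed 16-bit value


def fits_in_int16(value: int) -> bool:
    """Return True if `value` can be packed into a 16-bit signed integer."""
    try:
        struct.pack("<h", value)     # "h" is a 2-byte signed short
        return True
    except struct.error:
        return False


print(DISTINCT_VALUES, INT16_MIN, INT16_MAX)
print(fits_in_int16(32767), fits_in_int16(32768))
```

The last line shows the cliff edge: one step past the maximum and the number simply has nowhere to go.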
But later in the code, in layman's terms, this value was being assigned to a much smaller variable. That means this was being done:
16-bit integer a = 64-bit float BH; (BH stands for Horizontal Bias)
This line of code was the ticking time bomb. The Achilles heel. The assignment works as long as the BH value stays at or below 32,767. In the initial seconds of the launch, the rocket was still moving slowly, so the assignment operation worked seamlessly.
The bomb would go off at the exact moment the BH variable surpassed the magic value 32,767, the largest number that can be stored in a 16-bit signed integer.
As the rocket's velocity increased, the 64-bit variable exceeded 32,767, and it could no longer be stored in the 16-bit variable a. In that split second, the type conversion failed and raised an exception.
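The failure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual flight code: the names `OperandError` and `to_int16_unprotected` and all the BH sample values are made up, and because Python integers never overflow, an explicit range check stands in for the hardware trap.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Illustrative stand-in for the processor's operand error."""


def to_int16_unprotected(value: float) -> int:
    """Convert a float to a 16-bit signed integer, trapping if it does not fit.

    The range check below plays the role of the hardware trap. Nothing in
    this function handles the error; it simply propagates to the caller.
    """
    converted = int(value)  # drops the fractional part (a simplification)
    if not INT16_MIN <= converted <= INT16_MAX:
        raise OperandError(f"{value} does not fit in 16 bits")
    return converted


# Hypothetical BH samples growing as the rocket speeds up (made-up numbers).
bh_samples = [1200.5, 15800.0, 32767.0, 32768.0]

for bh in bh_samples:
    try:
        print(bh, "->", to_int16_unprotected(bh))
    except OperandError as err:
        print("operand error:", err)
        break
```

The first three samples convert without complaint. The fourth is only one unit larger than the third, and that is enough to bring everything down.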
The processor encountered an operand error and the system populated the 64-bit BH variable with a diagnostic message.
That means that at T + 37 seconds, the BH variable inside the inertial unit held a diagnostic error from the processor instead of actual flight data, and this diagnostic message was passed on as flight data to the on-board computer, which couldn't make sense of it.
When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case the primary system failed.
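Here is a minimal sketch of how an error report can end up being read as a measurement. Everything in it is hypothetical: the `Frame` layout, the `is_diagnostic` flag, and the error bytes are invented for illustration and say nothing about the real data bus format. The point is only that two bytes are two bytes, and a receiver that does not check what kind of message it got will happily turn an error code into a number.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """Hypothetical two-byte message from the inertial unit."""
    is_diagnostic: bool   # True when the payload is an error report
    payload: bytes        # two raw bytes


def decode_naive(frame: Frame) -> int:
    """Treat every payload as a signed 16-bit flight value, whatever it is."""
    return struct.unpack(">h", frame.payload)[0]


def decode_checked(frame: Frame) -> Optional[int]:
    """Return flight data, or None when the frame is a diagnostic report."""
    if frame.is_diagnostic:
        return None
    return decode_naive(frame)


flight = Frame(is_diagnostic=False, payload=struct.pack(">h", 1500))
diagnostic = Frame(is_diagnostic=True, payload=b"\xee\x01")  # made-up error code

print(decode_naive(flight))        # 1500
print(decode_naive(diagnostic))    # -4607: an error code read as a number
print(decode_checked(diagnostic))  # None
```

A receiver built like `decode_checked` at least has the chance to fall back on its last good value instead of steering by garbage.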
But the second unit had failed in the identical manner a few milliseconds before. Why?
It was running the same software from Ariane 4.
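Redundancy protects against a broken part, not against a shared mistake. The sketch below is a toy model with made-up names (`convert_bh`, `run_unit`, `run_redundant_pair`) and made-up values: two units run the same conversion routine on the same input, so whatever kills one kills the other.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def convert_bh(value: float) -> int:
    """The shared routine: identical code runs on both units."""
    converted = int(value)
    if not INT16_MIN <= converted <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return converted


def run_unit(name: str, bh: float) -> str:
    """Run one unit; it shuts down if the conversion fails."""
    try:
        convert_bh(bh)
        return f"{name}: ok"
    except OverflowError:
        return f"{name}: shut down"


def run_redundant_pair(bh: float) -> list:
    """Feed the same input to the backup and primary units."""
    return [run_unit("backup", bh), run_unit("primary", bh)]


print(run_redundant_pair(100.0))    # both units fine
print(run_redundant_pair(40000.0))  # both units fail the same way
```

A second copy of the hardware helps when a chip dies. It does nothing when both copies faithfully execute the same flawed instruction.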
A few key takeaways for a better engineer.
Several factors, primarily blatant ignorance, contributed to this failure. One ridiculous fact is that the BH value wasn't even required after launch on Ariane 5. The code, including the infamous variable, was simply left over from the rocket's predecessor, the Ariane 4, which actually required this variable post-launch.
This is an instance of what is known as cargo cult programming, which has caused various accidents worldwide. I have written extensively about it in my article here.
Another cause is that there were hardware limitations on Ariane 4, which led to performance constraints. To save processing time, exception handling was left out for three variables in Ariane 4, including the BH variable. This hardware performance constraint never existed for Ariane 5; exception handling could have been written in without any performance loss, but someone chose to copy the code over instead of analyzing it first.
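What would a protected conversion look like? One common option is to saturate: clamp the value to the nearest limit and report that clamping happened. The sketch below assumes that approach purely for illustration. It is not taken from the Ariane code, and the function name `to_int16_saturating` is made up.

```python
from typing import Tuple

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_saturating(value: float) -> Tuple[int, bool]:
    """Convert to a 16-bit signed integer, clamping instead of trapping.

    Returns (result, was_clamped) so the caller can tell a real reading
    from a value that hit the limit.
    """
    converted = int(value)  # drops the fractional part (a simplification)
    clamped = max(INT16_MIN, min(INT16_MAX, converted))
    return clamped, clamped != converted


print(to_int16_saturating(1200.9))
print(to_int16_saturating(40000.0))
print(to_int16_saturating(-50000.0))
```

Saturation is not free of risk either, since a clamped value is still a wrong value. But a wrong value with a warning flag is far easier to live with than a processor that stops talking.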
The last nail in the coffin was the failure to accommodate a change in requirements, specifically the flight plan. The Ariane 5 flew a different early trajectory than the Ariane 4, one that built up horizontal velocity considerably faster. Because of this trajectory difference, it was all but certain that the BH value would exceed what a 16-bit integer can hold and trigger the conversion error.
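This is the kind of check that could have been run on paper before launch: take the expected values for the new flight profile and see whether they stay inside the range the old code assumes. The sketch below uses entirely made-up profiles and numbers (they are not real trajectory data), and the names `PROFILES` and `first_overflow_time` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

INT16_MIN, INT16_MAX = -32768, 32767

# (seconds after lift-off, BH value) pairs. All numbers are invented.
PROFILES: Dict[str, List[Tuple[int, float]]] = {
    "older rocket": [(10, 2000.0), (20, 6000.0), (30, 11000.0), (40, 17000.0)],
    "newer rocket": [(10, 5000.0), (20, 15000.0), (30, 28000.0), (40, 41000.0)],
}


def first_overflow_time(profile: List[Tuple[int, float]]) -> Optional[int]:
    """Return the first time at which BH leaves the 16-bit range, if any."""
    for seconds, bh in profile:
        if not INT16_MIN <= bh <= INT16_MAX:
            return seconds
    return None


for name, profile in PROFILES.items():
    print(name, "->", first_overflow_time(profile))
```

The code is identical for both rockets. Only the input changed, and that was enough to turn a safe assumption into a fatal one.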
However, the crowning absurdity is that the calculation containing the bug, which crashed the guidance system, which confused the on-board computer, which steered the rocket off course, served zero purpose once the rocket was in the air. It was as if some software engineers had chosen to include the module to serve a makeshift purpose: a "special feature" meant to make it easy to restart the system if the countdown was paused. It is almost as if they had to find feeble justifications for copying the modules over from Ariane 4 instead of writing new ones.
A lazy over-slept engineer or a bad programmer maybe.
It is so absurd that the inclusion of that module almost feels ritualistic.
Therein lies the difference between a coder and a software engineer. I am writing about this bug to shed light on the design thinking you need to go through before you can be a good engineer. By going through mankind's bugs, we get a clearer picture of where we screwed up. We have man-killing self-driving Ubers, exploding SpaceX rockets, and iPhones failing to recognize faces during their product launches. All these bugs create new stories. The story of the software itself and the way it evolved. A study of bugs is a study of software itself.
“When the great library burnt, the first 10,000 years of stories were reduced to ashes. But those stories never really vanished. They became a new story. The story of the fire itself.” - Dr. Ford, Westworld.
A study about bugs is like learning history. You can spot it when it repeats; primarily because it arises from abstract mind-set and attitude.
Software evolves through time; it accumulates the logic and wisdom of the thousands of engineers nurturing it, and grows into millions of lines of code, branching, unfolding and intertwining across multiple threads and branches. It is more like a living organism. Anyone who authors software in the current world should understand this. You write code for the world. So you can't skip the world. It is your target environment. Your garden of Eden. You need to know as much as you can. You don't need a syllabus. Essentially, everything operating around you is your source of knowledge and wisdom. Even the suggested feed in which you are reading this story.
So develop a kind of thinking and attitude that lets you see this world as a massive treasure hunt, where your infinity stones are the information and knowledge you acquire from around you. Observe the processes, form hypotheses, validate them, talk with others about them, make notes, and write about them. Unwrap the wrappers. Dive into the code libraries.
Get your hands dirty with code.
If you want to be a better software engineer, you need to be more than someone who just types functional requirements into an IDE. You need to understand the world for which you write code. You need to understand the people who are going to use your software. You need to acknowledge that other programmers are going to add their wisdom to your code at some point, so it has to be as future-ready and accommodating as it can be.
You have to understand the underlying design patterns of the world itself, to write better code. You have to take the red pill.
These radical notions serve as the dividing line between a coder and an engineer.
“The problem is that most software engineers no longer understand the problem they’re trying to solve, and don’t care to,” says Nancy Leveson, the MIT software-safety expert. A colossal amount of time is spent on getting the code to work, instead of thinking through and anticipating scenarios for the real world. “Software engineers like to provide all kinds of tools and stuff for coding errors,” she says, referring to IDEs. “The serious problems that have happened with software have to do with requirements, not coding errors.”
These days, it can be said that new worlds are built in code. People perceive the world through your code: through the maps they use to navigate, the songs and movies suggested in their feeds, the cars they drive, the filters they apply to their selfies, and all the virtual worlds they immerse themselves in. Everything has code. Everything is code. It's a big, huge matrix we are building and collaborating on in code, and it had better be a good one.
So when you code, be like Tony Stark expecting the Endgame. See you next time!