Chernobyl: The True Cost Of Technical Debt
A comparison of technical debt in the Chernobyl disaster and software development
HBO’s Chernobyl drew me in instantly. Not just because of the drama or history around the famed incident, it was the way actor Jared Harris (Valery Legasov) approached mitigating the nuclear explosion that was mesmerizing to me. Watching his character mentally engineer a solution with the least risks based on the current assumptions only to run into unforeseeable issues later resonated with my engineering side.
More than anything else, Chernobyl is a stark reminder of the true cost of technical debt.
Technical debt, for those who don’t work in the software world, is the concept that bad design decisions made in code or poorly written code will need to be fixed or will cause future problems. Unlike in the Chernobyl disaster, technical debt is usually invisible in software but can cost companies millions to fix. What drives technical debt is very similar to what was portrayed in the final episode of the mini-series: A combination of bad design decisions due to cost cutting (graphite tips), human error and productivity/business needs.
For the purpose of this analogy, I will use the term “business needs” to reference the productivity targets that the factories were trying to meet.
In the case of Chernobyl, the technical debt started with the graphite tips of the control rods. The control rods were made of boron, which helped slow the reaction rate in the nuclear reactor. However, the tips of these control rods were actually graphite which increases the reaction rate. So, when the engineer went to slow the rate of the core's reaction by pressing the emergency shut off switch, the rods got stuck with just the tips in the reactor and the reactor exploded.
The graphite tips were not revealed as the straw that broke the camels back until the final episode. In the mini-series, the moment Valery mentions the graphite tips the prosecutor asks, “Why graphite?”, to which Valery responds in short “because it is cheaper”.
The graphite tips were the physical manifestation of a programmer deciding to not write robust code and failing to consider how future engineers will interact with the code. It was a short-cut that the Soviet Union used to save money at the time which led to one of the worst disasters of all time.
In the end, they did not account for the number of mistakes that the engineers of Chernobyl would make that would place nuclear reactor #4 in a situation where it would explode.
Technical debt does not often surface in the normal day-to-day use of software, just as it wasn’t revealed in the standard use of the Chernobyl nuclear reactor.
More than likely, similar to software, the engineers designed the reactor to handle 95% of the possible scenarios that it would be placed in. We often reference these scenarios as use cases.
The use cases that aren’t in that 95% are called edge cases. These are typically scenarios that seem rare, shouldn’t happen or are caused by user-error instead of system error.
Yet, because of how complex these systems are you can never really foresee every possible scenario. Predicting every edge case is impossible to do.
Based on the show (as I am not a nuclear engineer), I assume no one thought that someone would first poison the nuclear reactor with xenon which would cause it to stall and then try bringing it up to power again without putting the rods back in.
And then…to make matters worse…have the graphite tips of the control rod get stuck causing an increase rather than a decrease of nuclear reactivity.
Perfect engineering accounts for every possible scenario, it accounts for every possible human error and ensures that the users are steered in the right direction. But, there is no such thing as perfect engineering. We make assumptions, we focus on the majority of use cases and are pushed by deadlines. All this being said, the nuclear reactor would have never been in said position without pressure from upper management to meet productivity quotas.
Although dramatized, the final episode depicts the directors fantasizing about being promoted because of their successful test of reactor number 4. When they get the call to stall the test for a few hours, they conclude that it will be safe.
Did they conclude it was safe because they thought through the process of stalling the test for 10 hours or did they want to keep their superiors happy? I won’t presume to know, but I can say I have seen the need to meet deadlines or quarterly budgets force decisions that affect short term goals but eventually cause long term problems.
The push to meet short term goals like monthly quotas and quarterly forecasts often force management to game the system. They will focus on just meeting the current needs or demands without considering what might happen later on. For example, perhaps an engineering manager pushes his or her team to not include a security module of code because it doesn’t impact the functionality and it will look good on them. They might be tempted to do so even if they know it is wrong.
After all, once they are promoted, it is someone else’s problem.
You’ll Always Have That One Engineer
I enjoyed the engineer speaking out in the final episode. The truth is there is always one good engineer who attempts to stop bad practices from occurring. They are not worried about the bottom line and are not worried about speaking up to management to let them know they are making an error. These recommendations can be passed over, put into the “fix it later” category (which means never), or just ignoring it altogether.
Honestly, we need to thank these people in every company.
These are the only people sitting between us and planes falling from the sky or our banking systems going on the fritz. Their discipline for good QA and integration testing is what ensures we don’t die.
Chernobyl is the physical embodiment of technical debt in the software world. It can be hard to see the impact technical debt has on a company because it is invisible. Oftentimes, it won’t strike until the original engineers that developed the systems are long gone.
This doesn’t make it any less real.
This HBO show speaks to a concern I have as software becomes integrated into everything.
What kind of technical debt is being laced into our cars and planes as we further try to integrate machine learning and AI into everything?
With the sheer complexity of managing IoT devices, what are the chances engineers won’t make a mistake?
What edge cases are we failing to capture? How are tight deadlines and bombastic CEOs forcing their engineers to make short term decisions in order to make sure they get their code shipped on time?
Companies attempt to pretend they are driven by higher mission statements, but at the end of the day, they are driven by the same productivity metrics that acted as a catalyst for the disaster at Chernobyl.