On August 14, 2003, at 3:41 PM Eastern Daylight Time, nothing happened.
Or at least, that’s the story that operators working for FirstEnergy in a power substation in Wadsworth, Ohio, just a few miles southwest of Akron, had in front of them. All systems seemed operational. No alarms were firing. So when the phone rang and operators from the neighboring American Electric Power substation informed them that a 345 kV interconnect with Canton was failing, they did what any operators would: they checked the system and informed American Electric that everything seemed fine on their end.
Sixty seconds later, every control panel at the substation lost power. Wadsworth was in the dark.
HBO’s wonderful miniseries Chernobyl—about, well, the 1986 Chernobyl Disaster—just wrapped this week, and as someone who debugs and operates computer systems for a living, and reads books about Chernobyl in his spare time, I thought I’d take the opportunity to point out some of the lessons that Chernobyl has to teach us about software engineering.
I don’t work at a nuclear power plant. The stakes when production goes down and tensions rise seem much lower, but as an industry, we experience issues similar to some of the ones that happened in the control room and its aftermath quite frequently. And some software bugs can be lethal—whether it’s in an avionics system or a car’s throttle control or a tiny piece of monitoring software installed on a Unix computer that just happens to be used in a power substation.
Obviously I’m going to spoil everything that happens in Chernobyl, much like the Wikipedia article on the real Chernobyl disaster would.
An All-Too-Brief and Poorly Explained Rundown of Nuclear Power and Chernobyl (With Several Errors)
The Chernobyl miniseries saves most of its explanation of what actually goes wrong at the power plant for the gripping final episode, where we get to see what a postmortem looks like when it’s conducted by a kangaroo court.
Cost, schedule delays, poor management, flaws in manufacturing processes, inexperienced operators, misunderstood interactions between complicated engineering systems, politics, and hacks to make things work better and more efficiently are all responsible for the disaster…sound familiar yet?
Let’s quickly review what happened on (and leading up to) April 26, 1986.
Worth noting here that I’m not a nuclear engineer of any variety. I took a class on it six years ago and read a bunch of books about Chernobyl. I’m entirely unqualified to give this rundown and I’m definitely gonna mess something up.
Most power plants around the world operate on the same basic principle: they do something to heat up water and produce steam (the something is how they all differ). That steam is run through a turbine, which is basically a giant fan that turns when a bunch of pressurized steam is shot at it. The turbine is attached to a generator, which is a large-scale magnet that moves around inside a solenoid (coil of wire), creating changes in flux and in turn producing electricity.
In a coal plant, burning coal creates a fire which is used to heat water, producing steam which turns a turbine which turns a generator producing electricity. In a natural gas plant, the burning of gas directly pushes a turbine which turns a generator producing electricity. In a solar power tower, the sun reflects off mirrors to heat up water which boils into steam which turns a turbine which turns a generator producing electricity. Garbage plants burn garbage to boil water to turn a turbine which turns a generator producing electricity. Hydroelectric and wind plants both skip the boiling water step and just turn a generator directly. Photovoltaic solar is basically the only major form of electric generation that’s entirely different.
In a nuclear power plant, nuclear fission is used to heat up water (fusion plants are still an experimental concept not viable for commercial generation). Heavy unstable element isotopes like Uranium-235 absorb an additional neutron, causing the isotope to break apart, in this case into multiple lighter elements and a few free moving neutrons. Since energy is pretty much never 100% transferred perfectly, some of that energy is given off as heat. We use that heat…to boil water into steam which turns a turbine which turns a generator which produces electricity.
Radioactive elements and radiation occur naturally. If you’re sitting next to someone while you read this, you’re getting a very small dosage from the potassium-40 in their body, and you’re also giving some back off to them (for what it’s worth, radiation doesn’t automatically turn other things radioactive, and small dosages don’t have a measured effect on your health, so you can scoot back over to them now). It just so happens we have a lot of uranium on Earth, and a good chunk of it is U-235. We take naturally occurring uranium (which has U-235 in it) and enrich it (separate out enough of the parts that aren’t U-235 so we have usable fuel) in centrifuges. Now we just need a few key components to make it react:
- A moderator slows down free neutrons, allowing them to strike fissile uranium-235 and cause a reaction
- Control rods made of materials which absorb neutrons but are stable (won’t cause fission reactions) are used to capture free neutrons and slow down (by being inserted) or speed up (by being removed) the reactions
- A coolant transfers heat energy from the reactor core to useful places (steam turbine generators) and ensures that the reactor does not overheat and melt through the reactor vessel
And we’re basically done. In the US designs for nuclear power plants, we use light water (i.e. normal drinking water) as both a coolant and a moderator. This design is called a light water reactor. This design also has a nice side effect: if we lose too much coolant (due to evaporation into steam), we also lose the moderator, and the reaction slows (or stops entirely). This is called a negative void coefficient; if there’s a void (empty space) in our core, it creates a negative reinforcement effect which lowers (or stops) reactivity.
Chernobyl was not a light water reactor. The design used at Chernobyl was the RBMK, or Reaktor Bolshoy Moshchnosti Kanalnyy (it means High Power Channel-type Reactor in English). Chernobyl used graphite as a moderator, which is a solid form of carbon and can’t act as a coolant. For coolant and transfer of energy, cold water is run through the reactor core, where it heats, rises, and is boiled off as steam…you guessed it, turning a turbine turning a generator producing electricity.
There are things that can go wrong, and safety conditions built in to handle them. Here’s a point by point summary of what definitely went wrong at Chernobyl, much of which is covered by the show:
- Proper containment structures were not built due to an absence of materials. Due to pressure to complete the plant on time, poor materials were often substituted for requested construction materials. Most notably, the roof of the reactor hall used bitumen (a highly flammable, tar-like petroleum product) instead of flame retardant material. This would be kind of like if you asked me to build you a seat belt and I gave you a stick of dynamite.
- The plant was being run at low power levels for a long period of time, causing excess xenon gas to form within the reactor. Xenon acts as a neutron absorber similar to control rods, slowing the reactivity of the core.
- The plant was moved to even lower power and passed the subcritical threshold, effectively stopping all reactivity. Once a nuclear power plant is shut down, xenon gas must be allowed to dissipate before it can be restarted.
- Despite this, it was still possible to remove an unsafe number of control rods and bring the plant back up to some power, and this was done. The minimum safe operating constraints for the RBMK-1000 called for at least 30 of the removable control rods to remain in the core. Chernobyl operators removed all but 6 of them (leaving 18 in total, counting the 12 that were cemented in place and could not be removed) to restart the plant.
- A planned test of the plant’s emergency generation capabilities was run, shutting down the externally powered water pumps, despite the fact that the backup system being tested wasn’t expected to work at this low power anyways.
- Loss of water led to increased reactivity in the core due to the positive void coefficient in the RBMK design. Chernobyl Reactor 4 began to rapidly increase reactivity.
- Xenon gas isn’t an issue when a nuclear plant is running at full power because it’s burned up in the reaction. The increased reactivity burns up xenon gas, increasing reactivity further, burning more xenon gas, causing the reactor to rise rapidly and dramatically to full power.
- Fully removed control rods took between 18 and 23 seconds to fully reinsert during a SCRAM. This was a flawed design decision made because it was believed an abrupt stop of a power plant would cause other issues in the Soviet electrical grid.
- Control rods were tipped with graphite (the moderator used in the reactor) because they could not be fully removed from the core. Tipping them with graphite ensured that the part of the control rod that would always be in the reactor would not negatively affect reactivity.
- Activating the emergency SCRAM system to reinsert all control rods further increased reactivity as graphite tips reentered the core and displaced the water beneath them.
- As reactivity increased beyond plant design limitations, extreme heat cracked open fuel rods. This damage blocked many of the control rods from being fully inserted.
- Water being used as a coolant was evaporated into steam, which, under pressure, produced an explosion.
- A second, larger explosion that destroyed the reactor and took out the roof of the reactor building occurred a few seconds later. This was probably caused by either the buildup of hydrogen gas being ignited, a thermal explosion due to total loss of coolant, or another steam explosion.
- Following the explosions, nuclear reactions continued with no coolant or control rods to manage them and no containment structure to keep in radiation or radioactive materials. Oxygen reacted with hot graphite ejected from the core and started a fire.
There are a lot more interesting details here; I highly recommend Midnight in Chernobyl or Chernobyl: The History of a Nuclear Catastrophe, both excellent books published recently that go in depth on the disaster and fallout.
At any rate, the miniseries now gives us a lot to think about in how not to manage an incident or run a postmortem.
A lot of shame can be thrown at the deputy chief engineer of the Chernobyl plant, Anatoly Dyatlov. Dyatlov was responsible for conducting the emergency safety test that was happening on the night of the disaster and was the ranking person in the control room of reactor 4 when the explosion occurred.
Dyatlov does not come off well in pretty much any retelling of Chernobyl, and for good reason. He’s the dangerous mix of arrogance and incompetence that causes the disaster in the first place, overriding the correct but inexperienced staff under him and refusing any explanation that can’t result in a successful test. He’s basically that boss who thinks that yelling “get it done anyways” is an effective leadership strategy.
Dyatlov didn’t know about the low-power positive void coefficient or control rod reinsertion issues because there wasn’t anyone at Chernobyl who did. But he definitely should have known about xenon poisoning, and he should have called off the test when he didn’t have the proper staffing for it in the first place, and he should have listened to his operators when he didn’t know what to do and they were telling him to stop.
But the fact that Dyatlov might be at fault doesn’t make for a good blameless postmortem. There are still questions to be asked: why was someone without good leadership qualities and nuclear plant experience put in charge? Why did operators feel compelled to listen to him despite not trusting his judgement? What pressures was Dyatlov himself under to deliver a successful test result? The answers to these reveal deep cultural problems that go beyond one man in charge or even one power plant.
Chernobyl does a great job showing that the culture surrounding the disaster investigation does not start from a blameless postmortem state. And this creates huge problems. In one scene in the second episode, plant director Viktor Bryukhanov hands the arriving incident management crew, led by politician Boris Shcherbina, a list of names. Not root causes, but people to blame.
Of course, Bryukhanov is engaging in some self-preservation behavior here; by the end of this all, he’ll be on trial with the other leaders of the plant. The fear of being blamed and the need to blame others is intertwined.
In these early moments after the explosion, several people ask the question that becomes something of arc words for the series: “How does an RBMK reactor explode?” Asked here, it’s meant to discredit people claiming that there’s been an explosion at all. It’s a weapon for the blameful postmortem culture, shutting down exploration of the problem because no one wants to be seen as stupid, even if that means ignoring the clear truth (that there’s a giant hole where reactor 4 once was).
When it’s asked again later in the series, it’s a legitimate question, driven by the fact that the scientists investigating the incident don’t actually know how this kind of failure can happen. This is a failure that has never happened before or even been imagined before. Or…was it?
Debugging After The Fact
It’s December of 1983. Operators at an RBMK nuclear power reactor scheduled to come online by New Year’s Day of 1984 are conducting one of the final remaining safety tests to ensure that the plant is ready to be fully operational.
The Ignalina Nuclear Power Plant in Visaginas, Lithuania was essentially a sibling plant to Chernobyl. Both plants were using the new RBMK reactor design. Both plants were intended to be among the world’s largest and most powerful nuclear power plants ever constructed. Both plants were part of a plan for a new atomic golden era in the USSR. And both plants were scheduled to come online by 1984.
At Ignalina Reactor 1, operators were preparing to SCRAM the reactor as part of the test. They’d removed a number of control rods and pushed the AZ-5 button to reinsert the rods and fully shut down the reactor. But something peculiar happened. The reactivity went up. The power output spiked.
In this test Ignalina Reactor 1 didn’t have a dangerous number of removed control rods, and wasn’t running at low power for a day before the experiment. The power output went back down as the control rods fully inserted, and the plant shut down as intended. No disaster occurred. Still, the operators knew that the spike in reactivity wasn’t just an equipment fluke. They reported it in their test results. As translated in the International Nuclear Safety Advisory Group’s 1992 report on the Chernobyl disaster (pdf), their report read in part:
When the reactor power decreases to 50% (for example, when one of the turbines is switched off), the reactivity margin is reduced as a result of poisoning…Triggering of the EPS [emergency protection system] in this case may lead to the introduction of positive reactivity. It seems likely that a more thorough analysis will reveal other dangerous situations.
They made a number of safety recommendations based on their observations, including conducting further investigation and that until that investigation was done, “the number of rods which may be fully withdrawn from the core (up to the upper limit stop switch) should be limited to 150 for the RBMK-1000 reactor.” Chernobyl Reactor 4 had nearly 200 fully withdrawn control rods on April 26, 1986.
It’s not that Chernobyl operators simply weren’t reading the postmortems put out by colleagues at other plants. The successful SCRAM caused the results in the report to be buried. The test was fine. The spikes were a fluke, probably a mis-measurement, and at any rate, reporting them widely would damage the reputation of the state-of-the-art RBMK reactor.
Now the good news for software engineers is that there’s generally not a state government censoring our access to Hacker News. We are thus allowed to read all the postmortems about how Panacea.js is actually severely flawed in spite of all the posts two years ago about how it was going to solve all our problems.
However, the human factors that lead us to disregard little incidents crop up all the time. When’s the last time you started trying to trace a problem causing a script to crash at seemingly arbitrary times after running for a couple days, couldn’t quite figure it out or force it to reproduce, needed the system to work, and just wrapped the service in a systemd unit and told it to restart automatically if it crashed (this is definitely not a thing I did last month why do you ask)?
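For the record, that band-aid looks something like this (a hedged sketch; the unit name, paths, and numbers are all made up):

```ini
# /etc/systemd/system/flaky-script.service — hypothetical unit that
# papers over a crash we never root-caused.
[Unit]
Description=That script that crashes every couple of days

[Service]
ExecStart=/usr/local/bin/flaky-script
# Restart on any exit, clean or not, and wait a bit between attempts
# so a crash loop doesn't spin the CPU.
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target
```

It works, which is exactly the problem: the bug is still there, and now nothing even crashes loudly enough to remind us it exists.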
Getting things working now and ignoring systems that aren’t actively broken is a classic move in the software engineering playbook. When things (probably inevitably) explode from a lack of understanding the real bug, you’re likely to hear an engineer exclaim “oh yeaahh…” at some point during the incident.
The understanding we have of a system affects how we operate it, so teams need time to debug and explore issues and an environment where “get it working right now” and “be able to debug the issue” are kept out of conflict as much as possible. Bryan Cantrill’s Debugging Under Fire talk goes into great detail on this, just watch it.
Once we understand our system well and put a bunch of monitoring into it, we’ll have the details we need to observe the system live as the incident unfolds and understand it. If we do that, there’s nothing we can do wrong when it comes to debugging it, right?
The Data We Have and the Data We Need
The inside of a reactor core, especially a very large reactor core like the RBMK-1000, is hard to observe. Reactions happen on the order of tens or hundreds of nanoseconds, neutrons are moving somewhere between 2500 m/s and 20,000,000 m/s, and everything is a tiny subatomic particle.
Control rooms have a lot of high level information about what’s going on with the reactor; how much power the generator is producing, for instance, and which pumps are operating, and which control rods are inserted and how far. All this information lets operators make decisions like “we have a stuck valve, we should fall back to the backup pumps” or “there’s too much power rising too quickly, we should insert a few more control rods.”
These systems also have alarms to notify operators when something requires immediate attention or is outside the designed parameters of the system, like “there’s radiation leaking from the core,” or “the core temperature is too hot” or “there’s no power being generated.”
There is not an alarm that says “by the way, the whole core just exploded, destroying literally every sensor and electronic monitoring system you had in place, and it’s now just a smoldering fire pit spewing more radiation than you can even comprehend into the night air, so you might want to do something about that.”
The control room of a nuclear power plant is effectively a monitoring dashboard. It’s designed very well by engineers thinking about what could go right, what could go wrong, and what could go horribly wrong all at once, but it still is going to have shortcomings of information.
In episode 5, we see that one of the key pieces of information is “how much power is the plant outputting?” Immediately following the explosions, the answer to this question will be zero. This is not because the reactor has stopped—in fact, the nuclear reactions are still occurring, just without any control. It’s because the reactor has been destroyed, as has its ability to turn a turbine turning a generator producing any electricity. The electrical output from the reactor is zero, but why?
Lack of information drives a number of immediate decisions that turn out to be flawed—for instance, Dyatlov’s obsession with getting coolant flowing back into the reactor core isn’t a bad one, but it assumes a reactor core still exists, which it doesn’t. Dosimeters on site are designed to be high-precision and measure the kind of radiation dosages that come from potential leaks, not the kind of radiation one would find in the dead center of a reactor core in the process of reacting all of its fuel at once, so they mostly max out or break.
Since no one has seen an event like this before, good data is often assumed to be bad. Unexpected results are discarded as related to the other issues but not useful for building a real picture of what’s going on. In episode 5, before the explosion, we see Dyatlov ignore an alarm saying the reactor is running with too few control rods and needs to be shut down, pointing out that the computer “doesn’t know about the test.” How often in software do we silence data that we didn’t explicitly expect but that we can quickly fill in an expectation for?
Further, the radiation numbers themselves are staggering. When the site’s military DP-5 dosimeter is recovered and that also maxes out, the device is assumed to be defective. There’s only the one, so at this point, it’s potentially a reasonable assumption (though I’d assume the usual failure state of a dosimeter is to measure no dose, not all the dose). Several times people get into an argument over whether someone means milliroentgen or roentgen, which is reasonable. Imagine if someone told you your web app’s average request latency was hovering around 2400 seconds (that’s 40 minutes). You’d tell them the result is definitely in milliseconds, right? 2400 seconds is just an unreasonable number!
When we’re managing incidents and debugging situations we don’t understand and haven’t seen before, we have only the sensors and monitoring that people who haven’t been in this exact situation have built. There are almost certainly going to be deficiencies in our data.
The only way to proceed is to act like a scientist. Make a hypothesis about what’s going on: “maybe the control rods inserted and the reactor shut down, but the water pipes overheated and turned to steam and exploded first.” Make statements about what you’d expect to see and what you wouldn’t expect to see if your hypothesis were true before checking that data: “the reactor would have no electrical output” (check) and “coolant pressure would drop to zero” (check) but also “the control rods would report being inserted” (no check).
You’ll also want to come up with other things that might explain these outputs as well. Avoiding confirmation bias means coming up with alternate explanations for your results and disproving those. So the dosimeter might be reporting 3.5 roentgen per hour because there’s that much radiation. But it might also be maxed out, and if that were the case, a more powerful dosimeter would report a higher radiation level (check). We would now have to reject the 3.5 roentgen per hour number.
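This loop—hypothesis, predictions, checks—is mechanical enough to sketch in code. Everything below is invented and wildly simplified; the point is only that the hypothesis generates the predictions *before* we look at the data:

```python
# A hedged sketch of hypothesis-driven debugging. All field names and
# values here are hypothetical, for illustration only.

# What our (damaged, incomplete) sensors report:
observations = {
    "electrical_output_mw": 0,
    "coolant_pressure": 0,
    "control_rods_inserted": False,
}

# Hypothesis: "the rods inserted and the reactor shut down, but the
# water pipes overheated, turned to steam, and exploded first."
# Written down as predictions before checking the data:
predictions = {
    "electrical_output_mw": lambda v: v == 0,
    "coolant_pressure": lambda v: v == 0,
    "control_rods_inserted": lambda v: v is True,
}

# Check each prediction against what we actually observed.
results = {name: predictions[name](observations[name]) for name in predictions}
print(results)
# A False here means the hypothesis needs revising, not that the
# sensor must be lying.
```

The asymmetry matters: a failed prediction falsifies the hypothesis, while a passed one merely fails to rule it out.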
The good news is that following the scientific method when you’re in the middle of an incident is super easy. You make a slide deck, present to your team, and then everyone remembers it when it’s 2 AM and production has crashed and everything is on fire and your key management server won’t start up so how the heck do you even log in to anything else and why is the MOTD on this box a pentagram anyways? As every good engineer knows, the best judgements are always made on limited sleep during an outage.
Incident Management, Status, and Resolution
In September of 2010, construction began on a building known as the “New Safe Confinement.” The price tag for constructing this building was about $1,700,000,000 (1.7 billion US dollars). It involved thousands of workers from dozens of nations and is still (as of June 2019) undergoing testing, with most of the major construction work having wrapped up at the end of 2018.
So is the Chernobyl incident resolved? The miniseries focuses heavily on the efforts to understand and handle the incident that occurred following the explosion; we open on the explosion as it happens, and only at the very end go back to the moments leading up to it. Those efforts span months and include everything from firefighting to evacuations to boring a tunnel underneath the plant to prevent the contamination of groundwater.
The effects and costs of Chernobyl are likely to be felt far into the future, though. In many ways, the incident is ongoing, having passed through many people in charge, some of whom did a good job, and some of whom did not.
The real impacts of Chernobyl could have been much, much worse. A lot of credit is owed to the decisions made during incident management, especially after the severity of the incident was truly acknowledged. Acknowledging the incident was a crucial first step that was dangerously delayed, however.
State secrecy isn’t normally an issue in software engineering incidents (unless you’re collecting zero days for a government agency), but other similar factors are. Trying to quickly resolve an issue before clients notice, or not disclosing a bug that might have allowed for a major privacy breach between users in the hopes no one noticed or exploited it in time, is a pretty frequent practice at tech companies.
The danger of moving fast during an incident is clear. According to Midnight in Chernobyl, Dyatlov ordered two engineers to try to push down the stuck control rods manually. He quickly realized the flaw in that idea—control rods are extremely heavy, and if they weren’t falling under their own weight, no human was going to be able to force them in—but he was too late to call the engineers back. Both received fatal doses of radiation and died a few days later.
Quick fixes are often the enemy of careful, methodical incident management. On one team I was on, an engineer noticed a bug in the deployment of a data processing job that was causing the job to be queued but not properly run. Thinking the issue had to do with concurrency settings, they dropped the cap on the number of jobs allowed to simultaneously run in production. Since it was, in fact, 1 AM, they then went to sleep.
In fact, the jobs were running, but just rapidly failing and requeuing to retry. With many more allowed to simultaneously run, the system continued spinning up more and more EMR clusters until it eventually exhausted the number of available resources in us-east-1.
Since Amazon at the time billed on an hourly basis, an EMR cluster that came up and instantly shut down still cost as much as one that successfully ran the jobs. These weren’t small clusters, and what could have been a minor bug ended up being a massive incident, not only because the resource exhaustion was contending with otherwise functional services, but also because we had managed to rack up a multi-million dollar AWS bill doing it (Amazon was kind and refunded us most of that cost later on).
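One structural defense against this kind of retry storm is a hard retry budget, so a persistently failing job fails loudly instead of requeuing forever. A minimal sketch (the helper name and parameters are mine, not from any real scheduler):

```python
import time

def run_with_retry_budget(job, max_attempts=3, base_delay=1.0):
    """Run `job`, retrying with exponential backoff, but give up after
    `max_attempts` so a persistently failing job surfaces an error
    instead of silently requeuing forever. (Hypothetical helper.)"""
    for attempt in range(max_attempts):
        try:
            return job()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: fail loudly
            time.sleep(base_delay * (2 ** attempt))

# A job that always fails, standing in for the cluster that could
# never bootstrap:
calls = []
def flaky_job():
    calls.append(1)
    raise RuntimeError("cluster bootstrap failed")

try:
    run_with_retry_budget(flaky_job, max_attempts=3, base_delay=0.0)
except RuntimeError:
    pass
print(len(calls))  # 3: exactly three attempts, then a loud failure
```

A budget like this turns “infinitely spawning clusters at 1 AM” into “a paged error after three tries,” which is a much cheaper way to learn the job is broken.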
We need to fight the first instinct of wanting to immediately get everything back up. Proper incident management means putting someone in charge whose main job is to take deep breaths, remain calm, and filter operational and communications decisions. Rotating this person helps ensure everyone on the team develops these skills, too.
It takes work to fight these instincts, but panic and poor communication cause problems everywhere, not just in the Chernobyl control room.
Let’s check back in on our friends at FirstEnergy from the start of the article. We’re looking in on a single control room that happens to be in the middle of a series of events which led to the 2003 Northeast Blackout. The blackout winds up lasting for days (weeks in some areas) and leaves an estimated 55 million people without power in major cities like New York, Toronto, Detroit, and Cleveland.
This will contribute to an estimated hundred deaths and many more injuries. Home medical equipment will fail. Air conditioning units will fail, causing heat stroke. People will run generators in their houses and die from carbon monoxide poisoning. Inadequate lighting will lead to traffic deaths overnight. In one case, candles being used for lighting will lead to a lethal fire.
A race condition prevented the alarms at FirstEnergy from firing properly. Operators were relying on outdated information and didn’t react because of it. This is far from the only cause of the blackout, but it goes to show that software bugs are becoming more and more prevalent in the causes of serious, real disasters.
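The FirstEnergy bug reportedly came down to multiple threads touching the same alarm data structure without proper synchronization. Here's a deliberately simplified, deterministic model of that class of bug—the "threads" are played by generators so we can force the unlucky interleaving on purpose (real races are nondeterministic, and all the names here are illustrative):

```python
# A lost-update race in miniature: two concurrent increments of a
# shared alarm counter, each performed as a non-atomic read-then-write.

shared = {"pending_alarms": 0}

def raise_alarm(state):
    value = state["pending_alarms"]      # step 1: read the counter
    yield                                # ...the scheduler preempts us here
    state["pending_alarms"] = value + 1  # step 2: write the stale value + 1

# Two alarms fire at nearly the same time, and both read before
# either writes:
a, b = raise_alarm(shared), raise_alarm(shared)
next(a)  # A reads 0
next(b)  # B reads 0
for gen in (a, b):
    try:
        next(gen)  # each writes 0 + 1
    except StopIteration:
        pass

print(shared["pending_alarms"])  # 1, not 2: one alarm silently vanished
```

With a lock held around the whole read-modify-write (e.g. `threading.Lock`), the second increment would see 1 and write 2, and no alarm would be lost.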
We won’t be able to prevent the software we write from being used in critical situations (though we can try). Critical applications aren’t going to build their own containerization solutions, data processing engines, runtimes, monitoring frameworks, operating systems, dashboards, and web browsers just to avoid interfacing with your code.
Of course, most of us don’t deal with life-threatening incidents on a day-to-day basis. But the same basic principles can guide us through outages and ensure our software runs just a little bit better. To summarize:
- Keep a blameless culture where people are digging for why the incident has happened instead of who caused it. Even when an individual seems directly responsible, there should be questions around how they were able to get into that situation.
- Don’t ignore little bugs or issues that seem to go away. Dig into what caused them. Give your team time to debug and understand issues that cause someone to say “hmm, that’s funny.” Surface resolution of an incident is never enough.
- Make hypotheses about what is happening and then test them using the data you need instead of relying on the immediately available data to form a hypothesis in the first place. Don’t fall victim to confirmation bias and actively look for things that would disprove your hypothesis.
- Understand the impact of your incident and don’t consider the situation resolved until the impact is fully mitigated and has been communicated to everyone affected. Have clear delegated responsibilities and react deliberately rather than hastily.
- Put a containment structure around your nuclear reactors. You know, before anything explodes.
Thanks for reading!