“That’s one small step for [a] man, one giant leap for mankind” — Many people are familiar with one of the most famous quotations of all time, but how many people know that Neil Armstrong was only seconds away from saying “We didn’t land on the Moon because the computer rebooted”?
2019 marks the 50th anniversary of the first Moon landing. While this may seem to be merely an historical event, there are many relevant lessons to be learned from the actions of NASA engineers and astronauts when they designed, built and ran their systems — from the fragile spacesuits and Lunar Module, through the gargantuan Saturn V rocket, all the way through to Mission Control and the computer data centers that served them.
The gap between exploring space and concepts such as Service Management, DevOps, Observability and Site Reliability is not as great as might be first imagined. Indeed, many Network and System Operators explain their work using metaphors from the movie Apollo 13. Google even calls part of their Site Reliability Engineering (SRE) program “Mission Control”.
In this series of articles, I will discuss some concepts and processes central to the field of Cloud Service Management & Operations that can be illustrated by similar concepts from the space program.
Minutes before the planned moment of the first landing, flying at a speed of over 1,300 kilometers per hour (1,200 feet per second) and 10 kilometers (33,500 feet) above the barren surface, the astronauts of the Apollo 11 mission, Neil Armstrong and Buzz Aldrin, were concentrating on flying their strange-looking lunar lander and making sure they weren’t running out of fuel. The last thing they needed was the computer suddenly flashing esoteric error messages on its display.
While the astronauts continued flying, the engineers at Houston Mission Control rushed to interpret the errors — the computer was rebooting mid-flight. And not once, but repeatedly.
With the astronauts low on fuel and skimming closer and closer to the surface of the Moon, a decision had to be made quickly: abort, or continue flying with a misbehaving computer and risk a crash?
Fortunately, a rigorous training and operations system had prepared them for nearly every eventuality.
The engineers (flight controllers) had thick books, with descriptions of every possible behaviour or combination of reactions in the Apollo systems. Today, we’d call them Runbooks and use automated search engines to comb through them for the error messages. In 1969 they used a combination of index cards and human memory to match the cryptic error code to the explanation. In a nutshell, the computer was saying “I’ve run out of resources, I’m rebooting and starting over!” again and again, every few seconds.
While this error was documented, it was a deep diagnostic error, never expected to be encountered in flight — and worse, no remediation action was available.
The astronauts tersely requested information and the flight controllers needed to find an answer quickly — could the Lunar Module be trusted?
So close to the Moon, the margin between landing successfully and crashing was very thin.
Steve Bales, the flight controller responsible for the Lunar Module’s guidance computer, used a second analytical tool at his disposal. The trigger to abort was not merely “the computer is behaving abnormally” but “the Lunar Module is not flying properly”. He checked what today we would call his most important Key Performance Indicators (KPIs) or “Golden Signals”.
Is the Lunar Module flying at the right speed, direction and angle?
Can the astronauts control it?
Does it have enough fuel to land?
As long as the reboots were only intermittent, the lander could still function correctly. The astronauts got the answer they needed: “You are go for landing!”
The rest, as they say, is history.
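The three questions above amount to a go/no-go rule that looks at how the vehicle is behaving rather than at the alarm itself. A minimal Python sketch of that rule might look like the following; the `FlightState` fields and the `landing_decision` function are hypothetical names invented for illustration and are not taken from any NASA procedure.

```python
from dataclasses import dataclass


@dataclass
class FlightState:
    """Hypothetical snapshot of the three questions asked by the controller."""
    trajectory_nominal: bool  # right speed, direction and angle?
    crew_has_control: bool    # can the astronauts control it?
    fuel_sufficient: bool     # enough fuel to land?


def landing_decision(state: FlightState, computer_alarm_active: bool) -> str:
    """Decide go/no-go from the vehicle-level signals only.

    An active computer alarm is recorded in the answer, but it does not
    trigger an abort on its own.
    """
    vehicle_healthy = (
        state.trajectory_nominal
        and state.crew_has_control
        and state.fuel_sufficient
    )
    if not vehicle_healthy:
        return "ABORT"
    if computer_alarm_active:
        return "GO, alarm noted"
    return "GO"
```

With all three signals healthy and an alarm active, this sketch returns "GO, alarm noted". If any one of the three signals fails, it returns "ABORT" whether or not an alarm is showing.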
In the more down-to-earth world of modern-day Cloud Service Management & Operations, our Golden Signals do not reflect our readiness to land on the Moon but the capability of our systems to serve our customers. Therefore we often define these critical Golden Signals as Latency (how long the system takes to respond), Traffic (how many requests the system is getting), Saturation (how “full” the queues or containers are) and Errors (how many requests are failing).
When we monitor our systems, while it’s important to measure many metrics, it is critical to define and measure the Golden Signals because they are the ones that tell us if there is a problem that may be affecting our end users. Other technical issues certainly need to be solved, but as long as the Golden Signals are healthy, the business can continue as usual in the meantime.
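As a rough illustration, the four Golden Signals can be derived from a window of request records and compared against limits. The sketch below is in Python; the `Request` type, the `golden_signals` and `breached` functions and every limit value are assumptions made for this example, not part of any particular monitoring product.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float  # time taken to answer this request
    failed: bool       # True if the request ended in an error


def golden_signals(requests: List[Request], window_s: float,
                   queue_depth: int, queue_capacity: int) -> Dict[str, float]:
    """Summarise one observation window as the four Golden Signals."""
    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile; 0.0 when the window is empty.
    p95 = latencies[math.ceil(0.95 * count) - 1] if count else 0.0
    return {
        "latency_p95_ms": p95,                        # Latency
        "traffic_rps": count / window_s,              # Traffic
        "saturation": queue_depth / queue_capacity,   # Saturation
        "error_rate": (sum(r.failed for r in requests) / count
                       if count else 0.0),            # Errors
    }


def breached(signals: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Names of the signals that exceed their (hypothetical) limits."""
    return [name for name, limit in limits.items() if signals[name] > limit]
```

For example, a two-second window containing four requests, one of which failed, gives a traffic of 2 requests per second and an error rate of 0.25. An empty list from `breached` would suggest that, whatever else is going wrong, users are probably not feeling it yet.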
I hope you’ve learned something from the first article of this series. In the following article I will discuss a problem that occurred during John Glenn’s 1962 flight and how it was one small light bulb away from disaster.
I’d be pleased if you joined me.
Learn more about IBM’s concepts of Cloud Service Management and Operations.
Watch a video which will show you how you can simplify application monitoring with Golden Signals.
As well as links relevant to the technical lessons of each article, I’ll also be leaving at least one link where you can learn more about the historical event discussed.
You can watch a real-time journey through the first landing on the Moon on the https://apolloinrealtime.org/11/ website — see if you can find the exact moment when the alarms went off!