On July 16th, 1969, three astronauts set off for the moon. On July 20th, four days later, two of them landed on the moon while the third kept solitary watch. Over 400,000 people worked on the project as a whole, and dozens of engineers sat in Mission Control. How did they manage to work smoothly together?
In previous articles, we saw that the Apollo moon landing was seconds away from disaster when the on-board guidance computer started rebooting. One of the major factors in solving the problem was that the team in Mission Control worked in a smooth, well-practiced manner, with each member of the team knowing what to do, how to do it, when to do it, and what the limits of their responsibility were (be proactive, but don’t interfere).
Before describing the way they managed this cooperation, let’s describe a few key roles in Mission Control.
Most of the people in Mission Control are “Flight Controllers”, expert engineers each responsible for one or more aspects of the flight. Although many had started at NASA straight out of college, by 1969 they were experienced and had spent months or years preparing and planning for the flight.
In the acronyms competition, NASA is right up there with the military (and IBM), so each “seat” has a special name. Here is a (very) partial list:
FIDO — The Flight Dynamics Officer is responsible for the flight path of the spacecraft. FIDO monitors the existing flight and prepares calculations for future changes.
To continue the discussion from a previous article about functional and non-functional requirements, FIDO is one of the engineers in charge of the functional status of the spacecraft: Is it flying in the right direction? Are we ready to make any necessary changes in the future?
EECOM — Electrical, environmental, and consumables manager. This engineer makes sure that the many components of the spacecraft are functioning correctly. EECOM checks that there is enough oxygen for the astronauts to breathe and enough fuel for FIDO to plan flight changes.
GUIDO— Guidance Officer. This is the engineer in charge of the guidance and control computers.
EECOM and GUIDO are responsible for the non-functional status of the spacecraft: Do we have enough electrical power? Is the spacecraft overheating? Are the computers working correctly?
Not all the members of Mission Control were Flight Controllers:
CAPCOM is the Capsule Communicator, the only person who speaks to the astronauts directly. Any question or announcement that another controller wants the astronauts to respond to will be funneled through CAPCOM.
Since the space flights were planned down to the smallest margins, and every second and every action could be the difference between success and failure, the focused attention of the astronauts was often the scarcest resource available. An early decision, therefore, was that communication between space and the ground should go through someone the astronauts were familiar with, who could act as their representative on the ground. Ergo, CAPCOM is an astronaut himself. When Neil Armstrong told Houston that the Eagle had landed, he was talking to CAPCOM.
FLIGHT — The Flight Director has overall authority and control of the mission. Every significant event and question goes through FLIGHT and he has final say. (The doors to Mission Control are locked during critical moments of a space flight, perhaps so that high ranking NASA officials can’t even attempt to override FLIGHT’s decisions).
In addition to the flight controllers themselves, there are many levels of support; while the flight controllers were clustered together, there were many “backrooms” where specialists for each subject worked to support their Mission Control comrade.
During a flight, the flight controllers needed to be aware of the entire room and to pay attention to many simultaneous conversations. Their backroom support could dedicate time to looking deeper into their own area of responsibility.
Returning to the 1202 computer alarm from a previous article: when the astronauts reported the problem, the GUIDO on duty, Steve Bales, and his backroom support engineer, Jack Garman, saw the error on their consoles too and started investigating. Simultaneously, engineers from Grumman (the manufacturer of the Apollo lander) and MIT (the designer of the guidance computer), who had their own room in the Houston complex, started looking for answers too.
Now there were multiple people asking questions (astronauts, CAPCOM, FLIGHT) and multiple people trying to answer them (GUIDO, backroom, multiple manufacturers/contractors) — but while a lesser organization might have descended into chaos, the NASA engineers had planned and prepared for crises like this.
Although this is not the precise process they used, a useful way to illustrate how they worked is a RACI matrix. This complicated-sounding phrase means a table showing how different people need to react to different situations. For every given issue, the RACI matrix defines who is:
- Responsible for actually doing the work and solving the problem
- Accountable for the outcome: the person who’s in charge and calls the shots.
- Consulted about the issue. They may be asked to deliver advice or be questioned to get further information.
- Informed about the status of the issue (e.g. the client or astronaut).
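The four roles can be sketched as a small data structure. This is a minimal Python illustration, not NASA's process: the names `RaciRow` and `validate` are hypothetical, and the checks (at least one Responsible party, a single Accountable party, no one listed twice) are common RACI conventions assumed here rather than rules stated in this article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RaciRow:
    """One row of a RACI matrix: who plays which role for a single issue."""
    issue: str
    responsible: frozenset[str]   # the people doing the work
    accountable: str              # single owner who calls the shots (assumed convention)
    consulted: frozenset[str] = frozenset()
    informed: frozenset[str] = frozenset()


def validate(row: RaciRow) -> list[str]:
    """Return a list of problems found in the row (an empty list means none)."""
    problems = []
    if not row.responsible:
        problems.append("nobody is Responsible")
    if not row.accountable:
        problems.append("nobody is Accountable")
    # Anyone already doing or owning the work need not also be consulted or informed.
    active = row.responsible | {row.accountable}
    overlap = active & (row.consulted | row.informed)
    if overlap:
        problems.append(f"redundant roles for: {sorted(overlap)}")
    return problems


# Hypothetical example row with generic job titles, not real assignments.
row = RaciRow(
    issue="disk almost full",
    responsible=frozenset({"on-call engineer"}),
    accountable="service owner",
    consulted=frozenset({"storage vendor"}),
    informed=frozenset({"customer"}),
)
print(validate(row))  # prints []
```

Keeping `accountable` as a single name while the other roles are sets reflects the idea that many people can help, but only one person has the final say.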
So when the 1202 alarms started going off, Steve Bales was immediately Responsible for answering two questions:
- Do we need to abort?
- What actions, if any, should we take to save the mission and the lives of the astronauts?
Jack Garman was the primary person Consulted and both he and Steve Bales could have consulted the manufacturers.
The astronauts and CAPCOM Charles Duke Jr. were kept Informed. But the final decision on whether to continue rested with Flight Director Gene Kranz, because he was Accountable.
So when Armstrong said “Give us a reading on the 1202 Program Alarm” and got the response seconds later “Roger. We got — We’re GO on that alarm”, it was not a simple decision made by the CAPCOM, but the result of a delicate ballet of consultations and checks made by 4–5 people within a few seconds.
In this case, since Bales and Garman resolved the problem in seconds, there was no need to consult anyone else.
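The assignments described above for the 1202 alarm can be written down as a single RACI row. The sketch below is only an illustration in Python: the dictionary layout and the helper `roles_of` are hypothetical, and the contractors appear under Consulted because the text says they could have been consulted, not because they were.

```python
from __future__ import annotations

# One RACI row for the 1202 alarm, using the people named in the text.
ALARM_1202 = {
    "Responsible": ["GUIDO (Steve Bales)"],
    "Accountable": ["FLIGHT (Gene Kranz)"],
    "Consulted": ["Backroom (Jack Garman)", "Contractors (Grumman, MIT)"],
    "Informed": ["CAPCOM (Charles Duke Jr.)", "Astronauts"],
}


def roles_of(person: str, row: dict[str, list[str]]) -> list[str]:
    """Return every RACI role whose entries mention `person`.

    Matching is a case-insensitive substring search, which is enough
    for a small illustrative table like this one.
    """
    needle = person.lower()
    return [
        role
        for role, people in row.items()
        if any(needle in entry.lower() for entry in people)
    ]


print(roles_of("Kranz", ALARM_1202))   # prints ['Accountable']
print(roles_of("garman", ALARM_1202))  # prints ['Consulted']
```

Looking a person up by role, rather than the other way round, mirrors the question each controller had to answer instantly: "what is my part in this one?"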
But this is only one very simple example of a RACI table for a computer problem. What would have happened if Bales himself had detected a computer problem on his console? How would he have acted? Would he have asked CAPCOM to inform the astronauts?
Probably not. If the problem is not critical, why overload the astronauts with unnecessary information?
Do the Flight Director and CAPCOM need to be aware of every problem? Suppose there’s a minor error… why overload them with the problem? So in certain cases, the GUIDO position can take ownership of the issue and be accountable. In addition, in this case there’s probably time to involve the contractors too, since a less urgent problem leaves more time to work with.
Combined, these three scenarios already form a small RACI matrix, with one row per situation.
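A combined matrix along these lines could be sketched as follows. This is a rough Python illustration, not a reconstruction of any NASA document: only the first row follows the text closely, the cells marked "assumed" in the comments are guesses made for the example, and the names `MATRIX`, `render` and `accountable_for` are hypothetical.

```python
# Rows are situations, columns are parties, cells hold RACI letters.
PARTIES = ["Astronauts", "CAPCOM", "FLIGHT", "GUIDO", "Backroom", "Contractors"]

MATRIX = {
    "1202 alarm reported by crew": {
        "Astronauts": "I", "CAPCOM": "I", "FLIGHT": "A",
        "GUIDO": "R", "Backroom": "C", "Contractors": "C",
    },
    "Non-critical problem on console": {
        "CAPCOM": "I",      # assumed
        "FLIGHT": "A",      # assumed
        "GUIDO": "R",
        "Backroom": "C",    # assumed
    },
    "Minor error": {
        "GUIDO": "AR",      # GUIDO owns the issue and does the work
        "Backroom": "C",    # assumed
        "Contractors": "C",
    },
}


def render(matrix, parties):
    """Format the matrix as a plain-text table, one line per situation."""
    width = max(len(name) for name in matrix)
    lines = [" " * width + "".join(f" {p:>11}" for p in parties)]
    for situation, cells in matrix.items():
        # Parties with no role in a situation are shown as "-".
        row = "".join(f" {cells.get(p, '-'):>11}" for p in parties)
        lines.append(f"{situation:<{width}}{row}")
    return "\n".join(lines)


def accountable_for(matrix, situation):
    """Return the parties whose cell contains 'A' for the given situation."""
    return [p for p, letters in matrix[situation].items() if "A" in letters]


print(render(MATRIX, PARTIES))
print(accountable_for(MATRIX, "Minor error"))  # prints ['GUIDO']
```

Even with three rows, the pattern is visible: as urgency drops, accountability moves down from FLIGHT to GUIDO and the astronauts drop out of the picture entirely.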
Just imagine a table with thousands of rows and dozens of columns and you’ll get a taste of the complexity of juggling so many people and so many potential problems. It’s not for nothing that being a flight controller required nerves of steel.
The spaceflights had two key documents (there were actually hundreds, if not thousands, of documents, but two were key) which described these rules, roles and responsibilities: the Flight Plan, which described the mission down to the minute, and the Flight Mission Rules, which described the actions Mission Control would take to resolve problems that might occur during flight.
There are many parallels between these Apollo RACIs and the ones used to map out responsibilities in modern organizations.
Astronauts are the equivalent of clients, with CAPCOM as their representatives. The Flight Director is the owner of the business service while the flight controllers and backroom experts are a combination of Operations, Developers, DevOps teams, Site Reliability Engineers and so on.
Contractors and Manufacturers remain Contractors and Manufacturers, despite the gap of years and technologies.
Making a good RACI matrix looks simple and should be simple, but the effort is often bogged down by ignoring the bigger picture and the lessons that can be learned during the creation process. In the IBM Garage, we have a series of well-practiced techniques which we use to help our customers develop useful RACI matrices which they use for their own Moon Missions.
I hope you’ve learned something from this, the fourth article of the series.
As I write this (and if you are reading this during the week of publication), it is 50 years to the day since the historic Apollo 11 mission to the moon. I cannot help but be inspired by what happened back then.
Next time, we’ll discuss some of the technicalities of how the controllers and astronauts communicated. Back then, the channels were called “loops”. Today we’d call it ChatOps.