We’re Go On That Alarm: Inside the Apollo Operating System

Joe Kutner
Software’s Giant Leap
Apr 10, 2019

This is Part 3 of the Software’s Giant Leap Series

Mission control sent only two messages to the Apollo 11 crew in the last minute before landing on the Moon — both of which told them how little fuel they had left. Gene Kranz, the flight director, knew that Neil Armstrong’s concentration was paramount as he piloted the Lunar Module (LM) past a boulder field. Any distraction would cost precious seconds.

Equally important was the computer’s concentration. Its most important job was performing the calculations that controlled the craft’s rate of descent, but each signal from the LM’s sensors generated an interrupt, which took time to process and distracted the computer. Managing interrupts efficiently was key to the computer’s successful operation.

In the previous installment of this series, you learned how to run an Apollo Guidance Computer (AGC) emulator and debug an interrupt. As you’ll recall, an interrupt tells the operating system to stop what it is doing and handle some event. An event can be a button click, incoming sensor data, a timer, or a request to run a new job.

In this post, you’ll learn how the AGC operating system handled interrupts with mechanisms that are still in use today. In fact, they’re used to handle every click, swipe, and tap on the device on which you’re reading these words. Later on, we’ll look at what Linux inherited from the design of the AGC. But we first need to get to the bottom of the 1201 and 1202 alarms discussed earlier in the series because they will illustrate why these mechanisms are important and why Houston could report “We’re Go on that alarm” to the crew.

We’ll continue our investigation by reading some of the AGC source code. Even if you’ve never programmed before, you’ll learn how the AGC handled interrupts, performed context switching, and scheduled jobs.

Exploring the AGC code

You can browse the AGC source code in either the virtualagc GitHub repository or the virtual machine running your AGC emulator (you’ll have this installed if you chose to follow the deep dive in the previous installment of this series). Both the GitHub repo and the emulator contain several different versions of the code. Some versions never flew in space, some were only used in training, and some correspond to the versions that ran on each of the Apollo missions. The version we’ll explore is Luminary099, which ran on the LM during Apollo 11.

Open the Luminary099 directory by clicking the link in the GitHub repo, and you’ll see its contents as shown here:

The Luminary099 directory contains ninety-two .agc source code files. The original code was stored on paper in a binder the size of a suitcase, but a group of dedicated enthusiasts have scanned every page into these digital files.

Now we need to find the part of this code that generates the alarm in question by using GitHub to search for the string “1202”. Among the many results, you’ll see one that looks like this:

Search result showing a reference to 1202 in the EXECUTIVE.agc

This result is found in the EXECUTIVE.agc file, which was a part of the AGC’s operating system (OS). In modern computers, the OS is usually a separate piece of software from the applications that run on it. But in the AGC, the OS was more tightly coupled to the programs that provided the essential functions of the mission (there was no App Store for the AGC).

The EXECUTIVE scheduled large blocks of software, called jobs, in order of priority. It was combined with another program, called WAITLIST, which executed short tasks that had to occur exactly at a given time. Typically, a WAITLIST task would take some measurement (such as a radar signal) and schedule a job in the EXECUTIVE to do further processing of that value (such as calculating Delta-H).

Let’s look at the code around our search result in EXECUTIVE.agc. Click the result and scroll down to line 206. You’ll see this:

Even without being able to read the AGC programming language, we can begin to infer what may have triggered the 1202 alarm.

The NEXTCORE label on line 206 defines a location in the program called a subroutine. Other parts of the program can call this subroutine with the TC command (for Transfer Control).

NEXTCORE supports one of the most important functions in the EXECUTIVE: scheduling more than one job at a time. Unlike WAITLIST tasks, which were short (no more than five milliseconds), EXECUTIVE jobs could be any size. To prevent a very long job from blocking all other jobs, the AGC allowed jobs to be interrupted and paused by higher-priority ones. When that happened, information about the paused job had to be preserved so it could be reloaded when the job began executing again. The memory used to preserve a job’s information was a group of twelve registers called a core set.

When the EXECUTIVE schedules a new job, it calls NEXTCORE to allocate a core set. There were eight core sets in the AGC on the Apollo 11 Lunar Module, which meant only eight jobs could be scheduled at once.

Each new job in the AGC required a core set for its data. The NEXTCORE subroutine handles the allocation of these core sets.

But what happens when a ninth job needs to be scheduled? If no core sets are available, NEXTCORE will fall through to the last two lines of the subroutine. The TC instruction on line 212 calls another subroutine, BAILOUT1, which tries to rescue the program by stopping all nonessential jobs and restarting only the highest priority jobs. Don Eyles compared this to “phoning in a bomb scare”.

The last line in NEXTCORE is a call to the OCT pseudo-operation, which places the 15-bit octal value 1202 directly into the byte stream. This value will ultimately be displayed on the DSKY when the operator enters the command V5N9E.
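
The allocation behavior described above can be sketched in a few lines of Python. This is a simplified model, not a translation of the AGC assembly: the `CoreSetPool` class, its method names, and the `Alarm` exception are invented for illustration. Only the pool size of eight and the 1202 alarm code come from the text.

```python
class Alarm(Exception):
    """Hypothetical stand-in for the jump to BAILOUT1."""

    def __init__(self, code):
        # Alarm codes are octal, so format the number in base 8.
        super().__init__(f"program alarm {code:o}")
        self.code = code


class CoreSetPool:
    """Toy model of the EXECUTIVE's fixed pool of core sets."""

    def __init__(self, size=8):
        # None marks a free core set; otherwise the slot holds a job name.
        self.slots = [None] * size

    def next_core(self, job_name):
        """Claim the first free core set, or raise alarm 1202 if none is free."""
        for index, owner in enumerate(self.slots):
            if owner is None:
                self.slots[index] = job_name
                return index
        raise Alarm(0o1202)

    def release(self, index):
        """Return a core set to the pool when its job ends."""
        self.slots[index] = None


pool = CoreSetPool()
for n in range(8):
    pool.next_core(f"job-{n}")      # the first eight jobs each get a core set

try:
    pool.next_core("job-8")         # the ninth job finds the pool exhausted
except Alarm as alarm:
    print(alarm)                    # program alarm 1202
```

The point of the sketch is that the pool is a fixed-size resource: nothing grows on demand, so the only possible response to a ninth request is an error path.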

The code for the BAILOUT1 subroutine can be found in the Alarms and Abort section of the code, which you can find on GitHub. BAILOUT1 allowed the EXECUTIVE to discard low-priority jobs, such as the Verb 16 Noun 68 that Aldrin keyed in, and continue running higher-priority jobs like the SERVICER, which issued throttle and attitude commands (eliciting those exclamation points in the mission transcript). This is why the DSKY went blank: feeding data to the display was one of the lowest-priority jobs.

At a high level, each event (like Aldrin’s V16N68 or a signal from a radar) triggered the following sequence:

  1. An interrupt occurs, and the current job pauses.
  2. The EXECUTIVE tries to schedule a new job and invokes NEXTCORE.
  3. NEXTCORE finds that all core sets are consumed and invokes BAILOUT1.
  4. BAILOUT1 triggers a software restart of the AGC, which discards the low-priority jobs and allows the EXECUTIVE to continue running the high-priority jobs.
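
The sequence above can be modeled roughly as follows. This is a sketch under stated assumptions rather than the AGC’s actual restart logic: the `Job` fields, the `essential` flag, the priority numbers, and the `schedule` function are all hypothetical, chosen only to show how a restart sheds low-priority work.

```python
from dataclasses import dataclass

CORE_SETS = 8          # from the text: eight core sets on the Apollo 11 LM


@dataclass
class Job:
    name: str
    priority: int      # larger number = more important (assumed convention)
    essential: bool    # True if the job is re-created after a restart


def schedule(jobs, new_job):
    """Add new_job; if all core sets are in use, perform a software restart.

    Returns the resulting job list and the alarm code raised (or None).
    """
    if len(jobs) < CORE_SETS:
        return jobs + [new_job], None
    # BAILOUT-style restart: drop everything, then bring back only
    # the essential jobs, highest priority first.
    survivors = sorted(
        (job for job in jobs + [new_job] if job.essential),
        key=lambda job: job.priority,
        reverse=True,
    )[:CORE_SETS]
    return survivors, 0o1202


jobs = [Job("SERVICER", 20, True)]
jobs += [Job(f"display-{n}", 5, False) for n in range(7)]   # all eight core sets in use
jobs, alarm = schedule(jobs, Job("V16N68 monitor", 5, False))
print(oct(alarm), [job.name for job in jobs])   # 0o1202 ['SERVICER']
```

After the restart only the essential job remains, which matches what the crew saw: guidance kept running while the display went quiet.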

This explains the behavior the astronauts saw, but it doesn’t explain why there weren’t enough core sets. The computer was able to cope with the excessive load by invoking BAILOUT1, but we don’t know why the load existed in the first place. Something was “stealing time” from the processor by generating so many interrupts that the AGC couldn’t do everything it was asked to do.

Something is Stealing Time

The team in Cambridge spent the night of July 20, 1969 poring over the telemetry data, reviewing radio communications, and trying to reproduce the alarms. But the man who figured it out was watching the landing on television in his living room. George Silver called the phenomenon “cycle stealing” and he had seen it happen twice in more than 100 tests of the AGC. When he realized he was seeing the same problem on television, he called Mission Control and drove into the MIT office.

In a 1968 test of the Apollo 9 Lunar Module (LM), Silver had traced the problem to the LM’s Rendezvous Radar (RR). The RR measured the craft’s speed and distance from the Command Module using essentially the same principle as a radar gun: bouncing a signal off an object and measuring the frequency shift of the returning signal. The RR fed into a component called the Coupling Data Unit (CDU), which converted its analog signals to digital signals for the AGC.

Unfortunately, the signal from the RR was out of phase (i.e. the wave form of their electrical signals did not match) with the signal the CDU used as a reference point because each component had a different power supply. This phase shift led to inaccurate position measurements, which the CDU detected as errors and attempted to correct by sending a request to the AGC. But because of the phase shift, its correction attempts were futile and the CDU began to send requests at its maximum rate of 6,400 per second.

source: https://doneyles.com

Each pulse from the CDU to the AGC triggered an interrupt that forced the EXECUTIVE to stop what it was doing and find a core set for the new job. Eventually, this overwhelmed the machine, and BAILOUT1 was invoked.

In other words, the radar was spamming the computer with requests.

After the mission, the team at MIT delivered an exegesis report describing the alarms and their cause. The report identifies the circumstances related to the RR, but also calculates the exact loss of computation time (called TLOSS) that the AGC experienced. With the CDU sending requests at its maximum rate, the AGC lost about 150 milliseconds per second (15% TLOSS). The team was able to calculate this number because they had investigated the impact of TLOSS well before the mission began. They even designed the AGC to handle a 10% TLOSS during landing. Unfortunately, no one anticipated such a high rate of requests, which the report identifies as a failure in communication. The author wrote:

There were folks who knew about the RR resolver interface mechanism. There were also a great many people who knew the effect which a 15% TLOSS would have on the landing program’s operation. These folks never got together on the subjects.
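
The gap between what the software was designed to tolerate and what it actually faced is simple arithmetic. The short Python sketch below uses only the two figures quoted from the report (150 milliseconds lost per second, and a 10% design margin); the variable names are ours.

```python
LOST_MS_PER_SECOND = 150     # time lost to the radar, per the report
DESIGN_MARGIN = 0.10         # TLOSS the landing software was designed to tolerate

tloss = LOST_MS_PER_SECOND / 1000        # fraction of each second lost
shortfall = tloss - DESIGN_MARGIN        # how far past the margin the AGC was pushed

print(f"TLOSS: {tloss:.0%}")
print(f"Beyond the design margin by {shortfall:.0%} "
      f"({shortfall * 1000:.0f} ms per second)")
```

In other words, the computer was roughly 50 milliseconds short in every second, which is why jobs piled up until the core sets ran out.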

The analysis of the interactions between the RR, the AGC, and the DSKY illustrates just how important the Apollo software became to the entire mission. It mediated the astronauts’ interactions with the system, incorporated data from numerous sensors, and integrated complex components built by a variety of contractors into a coherent machine. Originally an afterthought, the software became the central hub of the entire system.

Today, software plays an essential role in nearly every complex machine — many of which use the same mechanisms developed for the AGC. Let’s take a look at one example by comparing how the AGC handled interrupts and job scheduling with the mechanisms used in a modern computer running Linux.

Interrupt Handling in Modern Computers

Linux is one of the most popular operating systems in the world. Even though you may never have installed it or owned a computer that runs it, there’s a nearly 70% chance the words you’re reading now are hosted on a server that’s running Linux.

Like most modern operating systems, Linux uses interrupts to handle events and switch between jobs. We can watch those interrupts as they occur by inspecting some special files the OS creates. You’ll need a Linux machine to do this, but the virtual machine you’ve used to run the AGC emulator happens to use Linux. Alternatively, you can run the commands that follow on any Linux machine available to you.

Open a terminal in your virtual machine, and run this command to watch the contents of a file called /proc/interrupts as it changes:

$ watch -n1 "cat /proc/interrupts"
            CPU0       CPU1
   0:         45          1   IO-APIC  2-edge      timer
   1:          0      18581   IO-APIC  1-edge      i8042
   8:          0          0   IO-APIC  8-edge      rtc0
   9:          0         11   IO-APIC  9-fasteoi   acpi
  12:          0      35598   IO-APIC  12-edge     i8042
  14:          0          0   IO-APIC  14-edge     ata_piix
  15:          0     111920   IO-APIC  15-edge     ata_piix
  18:          0          0   IO-APIC  18-fasteoi  vboxvideo
  19:      60376        188   IO-APIC  19-fasteoi  ehci_hcd:us...
  20:          0  168087976   IO-APIC  20-fasteoi  vboxguest
  21:          0      53788   IO-APIC  21-fasteoi  0000:00:0d....
  22:        406         28   IO-APIC  22-fasteoi  ohci_hcd:usb2
 NMI:          0          0   Non-maskable interrupts
 LOC:   33558972   66395926   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:          0          0   Performance monitoring interrupts
 IWI:          0          0   IRQ work interrupts
 RTR:          0          0   APIC ICR read retries
 RES:    9482712    5066797   Rescheduling interrupts
 CAL:      19007        233   Function call interrupts
 TLB:      89910      77881   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 DFR:          0          0   Deferred Error APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:        383        383   Machine check polls
 PIN:          0          0   Posted-interrupt notification event
 PIW:          0          0   Posted-interrupt wakeup event

The command displays a table containing the different types of interrupts including timers, function calls, and schedulers. Each second, it will update with new values. Each increment of those values represents an instance when the computer had to stop what it was doing, handle an event, and then resume the previously running process — just like in the AGC.
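
If you’d rather process these counters than watch them scroll by, the table is easy to parse. Here is a minimal Python sketch that totals the interrupts handled by each CPU. The function name `total_interrupts` is our own, and the sample text is a three-row excerpt of the output shown above; the parsing assumes the layout seen there (a header row of CPU names, then one labeled row per interrupt source).

```python
def total_interrupts(text):
    """Sum the per-CPU counters in the contents of /proc/interrupts.

    Returns a list with one total per CPU column.
    """
    lines = text.splitlines()
    cpu_count = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = [0] * cpu_count
    for line in lines[1:]:
        fields = line.split()[1:1 + cpu_count]   # skip the "NN:" label
        for cpu, value in enumerate(fields):
            if value.isdigit():                  # ignore any non-numeric field
                totals[cpu] += int(value)
    return totals


sample = """\
CPU0 CPU1
0: 45 1 IO-APIC 2-edge timer
1: 0 18581 IO-APIC 1-edge i8042
LOC: 33558972 66395926 Local timer interrupts
"""
print(total_interrupts(sample))    # [33559017, 66414508]
```

On a Linux machine you can pass it the real file with `total_interrupts(open("/proc/interrupts").read())`. Calling it twice a second apart and subtracting the totals gives the interrupt rate.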

In Linux, and most modern operating systems, the process of handling an interrupt causes a context switch, which is essentially the same mechanism the AGC used even though the term was not common in the 1960s. When an interrupt occurs, the operating system saves the context of the currently running process so it can be reloaded at a later time.

We can inspect context switches in much the same way as interrupts. First, we’ll need the ID of a process that we can inspect. We’ll use the terminal session process, which is called “bash”. You can get its ID by running the ps command:

$ ps
PID TTY TIME CMD
9506 pts/8 00:00:00 ps
15593 pts/8 00:00:00 bash

In this case, the ID is 15593, but it will likely be different for you. Now use that ID in the following command:

$ cat /proc/15593/status
Name: bash
State: S (sleeping)
Tgid: 15593
...
voluntary_ctxt_switches: 386
nonvoluntary_ctxt_switches: 69

The last two lines of the output show how many context switches have been performed since this process started running. A voluntary context switch happens when the process has nothing else to do or is waiting on some external input (like the response from a network request). A non-voluntary context switch occurs when the operating system decides the process needs to be interrupted so the CPU can do something more important (we’ll discuss how this decision is made in a moment). In either case, the OS saves the state of the process, like the AGC did.
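
The same counters can be read from a program. The Python sketch below pulls the two context-switch lines out of the status text. The helper name `context_switches` is ours, and the sample is trimmed from the output above.

```python
def context_switches(status_text):
    """Extract the two context-switch counters from /proc/<pid>/status text."""
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counters[key] = int(value)     # int() tolerates surrounding whitespace
    return counters


sample = """\
Name: bash
State: S (sleeping)
voluntary_ctxt_switches: 386
nonvoluntary_ctxt_switches: 69
"""
print(context_switches(sample))
# {'voluntary_ctxt_switches': 386, 'nonvoluntary_ctxt_switches': 69}
```

On Linux, `context_switches(open("/proc/self/status").read())` reports the counters for the Python process itself.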

Instead of using core sets like the AGC, Linux uses a data structure called a Process Control Block (PCB). This structure contains all the necessary information for representing a process, including the process state (new, running, sleeping, etc.), the process ID, the program counter (the address of the process’s next instruction), scheduling and memory information, a list of open files, and various CPU memory values (similar to the core sets).

A Process Control Block is a structure an operating system uses to store information about a process.
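
To make the idea concrete, here is a toy Process Control Block in Python, along with the save-and-restore step of a context switch. It is an illustration only: the field selection, the dictionary standing in for the CPU, and the `context_switch` function are assumptions made for this sketch. In the Linux kernel the corresponding structure is `task_struct`, which holds far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessControlBlock:
    """Illustrative subset of what an OS records about a process."""
    pid: int
    state: str = "new"                 # new, running, sleeping, ...
    program_counter: int = 0           # address of the next instruction
    registers: dict = field(default_factory=dict)
    open_files: list = field(default_factory=list)


def context_switch(cpu, current, incoming):
    """Save the CPU state into `current`, then load `incoming` onto the CPU."""
    current.program_counter = cpu["pc"]
    current.registers = dict(cpu["registers"])
    current.state = "sleeping"

    cpu["pc"] = incoming.program_counter
    cpu["registers"] = dict(incoming.registers)
    incoming.state = "running"


cpu = {"pc": 0x40, "registers": {"a": 7}}
shell = ProcessControlBlock(pid=15593, state="running")
editor = ProcessControlBlock(pid=9506, program_counter=0x80, registers={"a": 1})

context_switch(cpu, shell, editor)
print(shell.program_counter, cpu["pc"])    # 64 128
```

The shell’s saved program counter and registers play the same role as an AGC core set: they are everything needed to resume the paused work later.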

One important difference between Linux and the AGC is how the operating systems decide when a context switch will occur. This is determined by their process scheduling algorithms.

Process Scheduling in Apollo and Linux

Like the AGC, all modern computers need to give the appearance of doing many different things at the same time when in reality they can only do one thing at a time (most modern personal computers can actually do two or four things at a time, but the same principle is true).

The few computers that existed in the early 1960s handled concurrent jobs by allocating a slice of time to each program and indiscriminately switching between them many times per second. This often meant that very long jobs would never get enough time to finish, and very short jobs could take a very long time to finish.

Early on in the Apollo project, Hal Laning (co-creator of an early algebraic compiler for the Whirlwind computer) decided that this fixed-time-slice mode of scheduling jobs (sometimes called “round-robin”) would not be adequate for spaceflight. Instead, he designed what is called priority scheduling. A priority scheduling algorithm associates a priority value with every job, and schedules the jobs with the highest priority first.

The AGC was one of the first computers to use a priority scheduling algorithm. Each interrupt preempted a job (i.e. stopped it without cooperation from the job itself) when something more important needed to execute. For example, a higher-priority task, a button press, or a radar signal would cause a context switch. When the interrupt had been completely handled, the computer would reload the job with the highest priority, which was not necessarily the same job that was running before the interrupt.
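
A priority scheduler of this kind can be sketched with Python’s standard `heapq` module. This is a minimal model, not the EXECUTIVE’s algorithm: the class, the job names, and the priority numbers are all invented for the example.

```python
import heapq


class PriorityScheduler:
    """Always hands back the waiting job with the highest priority."""

    def __init__(self):
        self._queue = []
        self._counter = 0          # tie-breaker: earlier jobs win among equals

    def add(self, priority, name):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._queue, (-priority, self._counter, name))
        self._counter += 1

    def next_job(self):
        """Remove and return the name of the highest-priority job."""
        _, _, name = heapq.heappop(self._queue)
        return name


scheduler = PriorityScheduler()
scheduler.add(5, "update display")
running = scheduler.next_job()          # the display job starts running

# An interrupt arrives: a more important job is scheduled, and the
# running job goes back into the queue instead of finishing first.
scheduler.add(20, "guidance")
scheduler.add(5, running)
print(scheduler.next_job())             # guidance
print(scheduler.next_job())             # update display
```

Notice that the job resumed after the interrupt is whichever one has the highest priority, not necessarily the one that was interrupted.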

In Linux, important events like a mouse click will cause a context switch, but there are other triggers too. A process can voluntarily yield the CPU, or the operating system can decide that the current process has had enough time to run and interrupt it. This is called fair scheduling, and it allows processes to share the CPU more equally.

The scheduler orders jobs into a queue based on priority and other criteria. Then it selects a job from the queue to run on the CPU.

Linux uses priority values in its algorithm, but not as directly as the AGC. Instead, it determines the time a task is permitted to execute based on its priority in combination with other factors like how much time has elapsed since it was created. In general, lower-priority tasks are given less time while higher-priority tasks are given more.
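
One way to picture this is a scheduler that always runs whichever task has received the least CPU time so far, scaled by its weight. The Python sketch below is a deliberately simplified model of that idea and should not be read as the kernel’s implementation: the `run_fair` function, the weights, and the fixed 10 ms slice are assumptions made for illustration.

```python
def run_fair(tasks, slices, slice_ms=10):
    """Simulate a simplified weighted fair scheduler.

    tasks maps a task name to its weight (higher weight = higher priority).
    Each step runs the task with the smallest weighted runtime for one slice.
    Returns the total CPU time in ms that each task received.
    """
    virtual = {name: 0.0 for name in tasks}    # runtime scaled by weight
    actual = {name: 0 for name in tasks}
    for _ in range(slices):
        name = min(virtual, key=virtual.get)   # least-served task goes next
        actual[name] += slice_ms
        virtual[name] += slice_ms / tasks[name]
    return actual


print(run_fair({"editor": 2, "backup": 1}, slices=30))
# {'editor': 200, 'backup': 100}
```

The higher-weight task ends up with twice the CPU time, yet the lower-weight task is never starved. That is the key contrast with the AGC, where a low-priority job could be dropped entirely.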

Linux uses fair scheduling because it has different goals than the AGC. A Linux machine may have multiple users, perhaps thousands, and they all need to feel as though the system is dedicated to their own processes. The AGC, on the other hand, had to make sure the astronauts didn’t die, which meant some jobs needed to run at all costs.

Despite this difference, priority scheduling is an essential part of the modern computer experience. Without it, our jobs and processes would queue up to get time on a mainframe computer. In many ways, your smartphone and laptop owe the AGC designers a debt of gratitude.

Ascending to Lunar Orbit

After the team at MIT confirmed George Silver’s hypothesis, they sent their findings to Houston. Only nineteen minutes before the scheduled liftoff, Mission Control gave Aldrin and Armstrong the instructions they needed to avoid the problem reoccurring on ascent.

“We want to make sure you leave the rendezvous radar circuit breakers pulled to avoid a computer overload,” they told the astronauts.

Pulling the breakers was necessary because even with the rendezvous radar in passive mode, it could still trigger the interrupts that resulted in the 1202 and 1201 alarms.

The consequence of disabling the rendezvous radar, which measured speed and distance between the LM and CM, was that the AGC could not incorporate that data into the guidance equations. The AGC would need to navigate the LM into lunar orbit using only mathematics and inertial sensors, but the team at MIT and Mission Control were confident in its capabilities.

The crew made it back to the CM, of course, and the mission was a success. But without the preemptive priority-based scheduling algorithm of the AGC, this might have been a very different story.

The next time your computer freezes up and you start rapidly clicking your mouse to see if anything happens, remember that each click generates an interrupt that steals more time from the computer — just like the rendezvous radar on Apollo 11.

References

Now that you’ve read this account of the Apollo 11 landing, try watching it in real time:

An annotated portrayal of the Apollo 11 landing on the Moon with captions to explain the various stages of the descent.


I’m an architect at Salesforce.com who writes about software and related topics. I’m a co-founder of buildpacks.io and the author of The Healthy Programmer.