The 21st anniversary of the USS Yorktown incident offers an opportunity to reflect upon computer system defects, human error, organizational flaws, and the best principles and practices for solution delivery in the information technology industry. In this blog and my upcoming book, Bugs: A Short History of Computer System Failure, I will chronicle some important system failures of the past and discuss ideas for improving the future of system quality. As IT becomes increasingly woven into life, the quality of hardware and software impacts our commerce, health, infrastructure, military, politics, science, security, and transportation. The Big Idea is that we have no choice but to get better at delivering technology solutions because our lives depend on it.
On 21 September 1997, the USS Yorktown halted for almost three hours during training maneuvers off the coast of Cape Charles, Virginia, due to a divide-by-zero error in a database application that propagated throughout the ship’s control systems. The Yorktown had successfully served the US Navy since 1984 without a major incident during multiple combat operations; however, as part of an IT modernization program dubbed Smart Ship, its control systems were modified in 1996 to use a network of PCs running Windows NT 4.0. This essay will examine the details of the software error that stopped the Yorktown and discuss the IT matters that contributed to the system failure.
The USS Yorktown (CG-48) was launched on 17 January 1983 and commissioned on 4 July 1984. A Ticonderoga-class cruiser, the Yorktown was designed to use the American Aegis system, which integrated computers with radar data to track targets and guide weapons. Weighing nearly 10,000 tons and spanning 173 meters in length, the Yorktown supported a variety of armaments including two Mark-26 surface-to-air missile (SAM) launchers, eight RGM-84 Harpoon anti-ship/anti-submarine missiles, two Mark-32 torpedo tubes, and four lightweight mounted guns. Propelled by four General Electric gas turbine engines with 80,000 horsepower, the Yorktown could reach speeds greater than 30 knots. The ship could also carry two Sikorsky Seahawk helicopters, the Navy relative of the US Army’s Blackhawk. The Yorktown’s weapons, aerial resources, speed, and crew of 33 officers and more than 340 enlisted personnel made it one of the US Navy’s most versatile surface combatants, able to support carrier battle formations, amphibious assaults, escort missions, and interdiction assignments.
The Yorktown was built in sections, called modules. The modules were joined to form the ship’s hull; once that was done, the deckhouse sections were lifted aboard. During module construction, hundreds of subassemblies were made and equipped with piping, ventilation ducts, and other shipboard hardware. These subassemblies were then joined to make the modules, which were in turn outfitted with larger equipment, such as electrical panels and propulsion and power generation machinery. At the Ingalls shipyard in Pascagoula, Mississippi, this modular process is supported by a Computer-Aided Design (CAD) and Manufacturing program; the CAD system directs the operation of digital equipment used to cut steel plates, cut and bend pipes, and form sheet metal assemblies. For launching, the ship was moved several hundred yards across land by a wheel-on-rail transfer system to the floating dry dock. The dock was then moved towards the water’s edge and ballasted down so that the ship could float free; thereafter the ship was moved to an outfitting berth in preparation for the traditional christening ceremony. Upon completion of post-launch outfitting, the ship went through extensive dockside and at-sea testing to ensure that the ship and crew were ready to work safely at sea. Litton Industries needed 15 months to manufacture the Yorktown; the ship cost approximately $1 billion to build and $28 million to operate annually.
In its first deployment during 1985–1986, the Yorktown undertook several successful expeditions including the interception of the Achille Lauro hijackers, two Black Sea excursions, and three military operations off the Libyan coast. During its second and third deployments, it participated in several US and NATO exercises around Europe and the Mediterranean, including the exercise of the “right of innocent passage” in Soviet territorial waters on 12 February 1988, which triggered a collision with the Soviet frigate Bezzavettny that some observers have called the “last incident of the Cold War”. It also played a key role in Operation Provide Comfort, the relief effort for Kurdish refugees that followed the 1991 Gulf War. In its fourth and fifth deployments spanning the years of 1993–1995, it served in counter-narcotic operations in the Caribbean as well as UN-sponsored actions related to the war in Yugoslavia. By this time, the Yorktown and its crew had earned awards for naval gunfire support (1987), electronic warfare excellence (1991), sustained combat readiness (1992), and superior safety (1993).
In October 1995, the US Naval Research Advisory Committee (NRAC) published a paper recommending approaches to reduce manning; the report’s thesis was that culture and tradition, not technology, were the obstacles to lower staff and lifecycle costs. Thus in December 1995, the US Navy established the Smart Ship Program Office (SSPO) to pursue the goal of reducing staff while maintaining combat readiness through new technology and process changes. So-called “smart ships” would consist of several new systems to automate navigation, monitor equipment sensors, control machinery and fuel, and communicate over both fiber optic and wireless networks. The SSPO chose the Yorktown as its first testbed, and by December 1996, the ship was equipped with the first prototype of the Smart Ship System. The system was designed and built by a subsidiary of Litton Industries; it consisted of a Local Area Network (LAN) of 27 client PCs communicating over fiber optic cable with a server. All the Smart Ship machines ran Microsoft Windows NT 4.0. The system was projected to save $2.8 million per year by reducing manual operations and maintenance costs, savings associated with shrinking the ship’s staff by 4 officers and about 40 enlisted personnel. In May 1997, the Yorktown, with its reduced crew, successfully completed a five-month deployment in which it served in Caribbean counter-narcotic operations and performed test exercises alongside the USS George Washington in her carrier battle group. The Navy Man Power Analysis Center (MPAC) and Operational Test and Evaluation Force (OPTEVFOR) groups subsequently reviewed the Yorktown’s crew and ship capabilities and concluded that the ship could meet its operational requirements.
On 21 September 1997, the USS Yorktown was performing training exercises off the coast of Cape Charles, Virginia when a crew member began troubleshooting a fuel valve that was physically closed, but according to the Smart Ship’s Standard Machinery Control System (SMCS) was open. The technician tried to digitally calibrate and reset the fuel valve by entering a 0 value for one of the valve’s component properties into the SMCS Remote Database Manager (RDM). The RDM program then attempted to perform a division operation by the valve property; a divide-by-zero arithmetic exception was thrown, not caught by the program, and the RDM crashed. Since other Smart Ship systems were dependent on RDM availability across the LAN, these other SMCS components including ones controlling the motor and propulsion machinery began to fail in a domino-like sequence until the ship stopped dead in the water. The crew was able to troubleshoot and restart the ship’s systems after two hours and forty-five minutes, and the Yorktown returned to base in Norfolk, Virginia.
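The failure chain described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the actual SMCS code, which is not public: the `Rdm` class, `run_process`, `dependent_status`, and the valve arithmetic are all hypothetical names invented for this sketch.

```python
class Rdm:
    """Hypothetical stand-in for the Remote Database Manager process."""

    def __init__(self) -> None:
        self.alive = True
        self.scale = 1.0

    def calibrate(self, raw: float, divisor: float) -> None:
        # No input validation and no handler: a zero divisor raises
        # ZeroDivisionError, which escapes this method.
        self.scale = raw / divisor


def run_process(rdm: Rdm, raw: float, divisor: float) -> None:
    """Simulate the OS: an exception the program never catches ends the process."""
    try:
        rdm.calibrate(raw, divisor)
    except ZeroDivisionError:
        rdm.alive = False  # the process has terminated


def dependent_status(rdm: Rdm, name: str) -> str:
    """A component that cannot work without the RDM (assumed hard dependency)."""
    if rdm.alive:
        return f"{name}: ok"
    return f"{name}: failed (RDM unavailable)"


rdm = Rdm()
run_process(rdm, 10.0, 0.0)  # the operator enters 0
for component in ("fuel control", "propulsion"):
    print(dependent_status(rdm, component))
```

Every component that treats the RDM as a hard dependency reports a failure once the single process is gone, which is the domino effect in miniature.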
There are conflicting reports on several aspects of the Yorktown incident that we shall explore now. One controversy is whether the Yorktown returned to base on its own or was towed by another vessel. Anthony DiGiorgio, a civilian contract engineer with 26 years of experience working on naval control systems in the Atlantic Fleet Technical Support Center, initially stated in a critical article he penned for the June 1998 issue of the Naval Institute Proceedings (NIP) journal that the ship had been towed. His account was later disputed by Captain Richard Rushton, the commanding officer of the Yorktown; Rushton stated that the Yorktown had two emergency power units that were activated when the propulsion system failed. Rushton also indicated that similar program crashes had occurred twice since the Smart Ship installation due to incorrect values entered into the RDM; in each case, the ship’s systems were restarted, RDM values were reset, and the ship performed as expected and required. In the same June 1998 NIP article, DiGiorgio elaborated further on the incident and alleged that the ship needed two days of pierside repairs and maintenance. However, in the August 1998 issue of Government Computer News (GCN) magazine, DiGiorgio retracted his earlier declaration and suggested that GCN reporters had altered his original story; GCN published a statement standing by its journalists and the original narrative. While Rushton’s story does explain what we know, the retraction by DiGiorgio suggests intense organizational and political pressure to suppress the full facts of the story.
The other major point of contention was whether the use of Microsoft Windows NT 4.0 contributed to the failure. Windows NT 4.0 was selected in March 1997 as the standard OS for both networks and PCs as part of the Navy’s Information Technology for the 21st Century initiative (IT-21). Bill Gates even went so far as to nominate the Smart Ship program for the Computer World and Smithsonian Awards. DiGiorgio assigned some blame to Windows NT in his June 1998 article, stating that “using Windows NT… on a warship is similar to hoping that luck will be in our favor.” Ron Redman, deputy IT director of the Aegis Program Executive Office, is quoted in a June 1998 issue of GCN as saying that “UNIX is a better system for control of equipment and machinery, whereas NT is a better system for the transfer of information and data. NT has never been fully refined and there are times when we have had shutdowns that resulted from NT.” However, Rushton again defended the Navy’s choice of operating system, stating that “NT was never the cause of any problem on the ship. The problems were all in programs, databases, and code within the individual pieces of software we were using.” Based on the public record, while there was organizational pressure to use Windows NT, this decision does not appear to have directly contributed to the Yorktown incident, notwithstanding the negative comments from DiGiorgio and Redman.
The US Navy CIO office commissioned an inquiry led by Ron Turner into the USS Yorktown incident. Although the Navy’s investigation report was not made public, several lessons can be learned from this event that are useful for software development professionals.
- Software applications should validate input data before processing it. Such validation can prevent operator error, maintain system reliability, and enforce security controls. The recommendation to validate input data comes from multiple trusted sources including the Top 25 List for Preventing Software Vulnerabilities by CWE, OWASP and SANS, the Pernicious Kingdoms paper by Tsipenyuk, Chess, and McGraw, and the Building Secure Software book by McGraw and Viega. Several serious problems can result from not validating data, such as arithmetic overflow, buffer overflow, command/data injection, cross-site scripting, and remote process control. Had the RDM program validated the 0 value for the fuel valve component before storing it, the Yorktown might not have stalled.
- Software programs should also catch and handle exceptions. Exception handling is the process of detecting and reacting to computational anomalies, events, or exceptional conditions that require a different flow than normal instruction execution. A hardware-level example of exception handling is the IEEE 754 standard for floating point arithmetic, which defines result values for various exceptions (e.g. infinity for divide-by-zero), provides status flags for later checking which kind of exception occurred, and optionally allows a handler routine (e.g. from the OS or compiler) to be registered. Operating systems use interrupts, signals, and traps to notify kernel and user-level software programs about I/O events from the disk, network, and audio/video devices as well as errors related to the circuit board, thermal components, out-of-bound references to arrays or pointers, bad memory addresses such as a null pointer, stack faults, mis-aligned memory, arithmetic overflow from integer and floating point units, and divide by zero. The default exception handling behavior of most language runtimes is to search and unwind the stack of functions until a suitable exception handler is found to execute; if no exception handler is found, the program is terminated with the appropriate OS error code. Modern programming languages have offered constructs to throw and catch exceptions since LISP did so in the 1970s; these language features are found in Ada, C++, Java, Python, and Smalltalk, all of which were available at the time that the SMCS was developed by Canadian Aviation Electronics (CAE). CAE is a technology provider of simulators, control systems, and training services to the aerospace and defense industries; it earned more than $2.7 billion CAD of revenue in 2017 and is publicly traded on the TSX and NYSE stock markets. Again, had CAE used a modern programming language with exception handling to develop the RDM system, perhaps the ship’s SMCS would have reliably dealt with the divide-by-zero error and not crashed.
- Software system components should be as fault tolerant as is reasonably practical. The dependency graph amongst Smart Ship components meant that an active fault in a foundational one such as the RDM could result in failure for all. Comparing the Yorktown’s fragility to the more robust arithmetic of a $2.95 pocket calculator that could survive a divide-by-zero error, DiGiorgio expressed surprise “that the computers on the Yorktown were not designed to tolerate such a simple error” and “that there is very little segregation of error when software shares bad data”. That said, one has to balance the decoupling and resilience of distributed system components against their complexity and cost. One approach might have been to replicate and cache the RDM in a read-only state across the LAN so that the last known safe state would have been available to a local process as a second option in case the primary RDM hosted on the LAN was down; this design improves system reliability, but could also introduce a data consistency problem. Of course, a simpler approach would have been for the RDM to validate the input data and handle exceptions if they occurred. The ship’s hardware engineers did have the foresight to include the redundant power units that were able to support the ship’s return to base. Such backup components are essential for eliminating single points of failure and improving reliability in a larger system.
- Another contributing factor to the incident was the lack of time to design, construct, install, and test the software. Rushton stated that “we pushed the envelope and knew that events such as what happened… were possible”. DiGiorgio also recommended more upfront engineering in his NIP article, writing that “installing a control system on a warship and resolving problems as the project progresses is a costly and naive process.” Nevertheless, one has to weigh the benefit of these activities against their cost. Furthermore, the broader shift in project management from waterfall processes toward Agile methodologies, earlier prototypes, and smaller iterations suggests that the Navy was balancing caution and progressive experimentation in a pragmatic manner, with the year-long construction phase mitigated by six months of at-sea testing. The goal of the Smart Ship program was to reduce the operating costs of Navy ships without compromising combat readiness, and one can appreciate the conflict of interests that might have curtailed delivery time.
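The first three lessons above (validate input, handle exceptions, and keep a read-only copy of the last known state) can be sketched together in Python. This is a minimal sketch under stated assumptions, not a reconstruction of the SMCS: `Rdm`, `ReadOnlyReplica`, `InvalidCalibration`, and the calibration arithmetic are hypothetical names chosen for illustration.

```python
class InvalidCalibration(ValueError):
    """Raised when an operator-entered value fails validation."""


class Rdm:
    """Hypothetical stand-in for the primary database manager."""

    def __init__(self) -> None:
        self.scale = 1.0
        self.available = True

    def calibrate(self, raw: float, divisor: float) -> str:
        # Lesson 1: validate operator input before using or storing it.
        if divisor == 0:
            raise InvalidCalibration("divisor must be non-zero")
        self.scale = raw / divisor
        return "ok"

    def safe_calibrate(self, raw: float, divisor: float) -> str:
        # Lesson 2: catch exceptions at the boundary so the process survives.
        # ZeroDivisionError should be unreachable after validation; it is
        # listed as defense in depth.
        try:
            return self.calibrate(raw, divisor)
        except (InvalidCalibration, ZeroDivisionError) as err:
            return f"rejected: {err}"

    def read_scale(self) -> float:
        if not self.available:
            raise ConnectionError("RDM unavailable")
        return self.scale


class ReadOnlyReplica:
    """Lesson 3: a local cache of the last value read from the primary."""

    def __init__(self, primary: Rdm) -> None:
        self.primary = primary
        self.cached_scale = primary.read_scale()

    def read_scale(self) -> float:
        try:
            self.cached_scale = self.primary.read_scale()
        except ConnectionError:
            pass  # primary is down: serve the last known value (may be stale)
        return self.cached_scale
```

A short usage example: the zero entry is rejected without disturbing the stored value, and a dependent component keeps reading from the replica while the primary is unavailable.

```python
rdm = Rdm()
print(rdm.safe_calibrate(10.0, 0))   # rejected: divisor must be non-zero
print(rdm.safe_calibrate(10.0, 4))   # ok
replica = ReadOnlyReplica(rdm)
rdm.available = False
print(replica.read_scale())          # 2.5, served from the local cache
```

The replica trades freshness for availability, which is the data consistency concern raised in the third lesson.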
CAE corrected some of the aforementioned SMCS software issues, and the Yorktown underwent further testing after updates to its SMCS, returning to active service over a year later. On 25 September 1999, the ship departed Pascagoula for a four month counter-narcotics deployment in the Caribbean, and it served without problems. The Yorktown’s final mission was patrolling the Persian Gulf from February to August of 2004; the ship assisted with protecting Iraqi oil terminals and conducting other maritime security operations. Approaching the end of its 35-year estimated life cycle, the Yorktown was decommissioned and struck on 10 December 2004; since that time, it has been berthed at the Naval Inactive Ships Maintenance Facility in Philadelphia, Pennsylvania, but not yet scrapped.
On the other hand, the SSPO’s ambitions and budget faced considerable scrutiny after the Yorktown incident. The original contract awarded to Litton Industries was worth $138.6 million USD, and the program’s scope was installing the Smart Ship System on all 27 CG-47 cruisers. Installing the Smart Ship System on the USS Ticonderoga (CG-47) took 70 weeks and cost the company $30 million USD. Contract renegotiations took place, and under the amended agreement, Litton would complete work on two additional ships already started, with an option to build four more. Litton engineers were able to install the system on the USS Monterey in 20 weeks, but the costly lessons of complexity and retrofitting new technology on legacy systems meant that the SSPO’s transformation goals would have to await a new generation of Navy ships.