Guest post: An Introduction to STPA and its Application to Safety-Critical Technologies

Published in Ike Blog · 5 min read · Oct 10, 2019

This guest post was written by Dr. John Thomas, Director of MIT’s Safety and Cybersecurity Group and MIT’s Engineering Systems Lab. Dr. Thomas’s research studies engineering mistakes and human error to develop systematic methods to prevent them. His primary research is focused on STPA and related techniques for proactive system safety, security, and human factors engineering. His background over the last 20 years includes work in autonomous vehicles (land, air, and space), automotive advanced driver assistance systems (ADAS), aircraft automation, medical devices, nuclear power plants, oil & gas, and others. Dr. Thomas teaches classes on software engineering, system engineering, system safety, cybersecurity, human factors, and system architecture. He serves on automotive industry committees responsible for international standards including ISO 26262 (Automotive Functional Safety) and ISO/PAS 21448 (SOTIF), as well as STPA-specific automotive safety standards like SAE J3187.

System-Theoretic Process Analysis (STPA) is an increasingly popular hazard analysis method developed at MIT for modern complex safety-critical systems. While traditional techniques focus on individual component failures, faults, and combinations thereof, more complex safety-critical systems often exhibit unsafe and undesirable behavior that does not involve any component failures or was never anticipated by failure-based analysis. Components, especially software components, may operate exactly as designed and may perform their intended function perfectly at the component level while their interactions often lead to unexpected, dysfunctional, or unsafe system-level behavior. This often occurs when engineering assumptions are incorrect or become violated, requirements are incomplete or otherwise flawed, components behave in conflicting or otherwise unanticipated ways, and when human interactions are not fully understood or anticipated.

The Mars Polar Lander accident is one example of a loss caused by interactions among individual components working exactly as designed and specified. The intended landing sequence involves a parachute that is designed to slow the spacecraft, landing legs that are designed to drop into position to prepare for landing, and descent engines that are designed to operate until the moment of touchdown. The descent engines are controlled by an onboard computer, which is designed to immediately shut down the engines when sensors on the landing legs detect a physical force consistent with a touchdown.

Figure 1: Mars Polar Lander Entry, Descent, and Landing Sequence. Source: Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions, March 2000. https://spaceflight.nasa.gov/spacenews/releases/2000/mpl/mpl_report_1.pdf

As a result of these components all working exactly as designed and as specified, the computer shut down the engines prematurely and the spacecraft crashed into the surface. The problem that was overlooked was that the sensors, which had to be designed to be fairly sensitive in order to detect touchdown events, would also detect the same physical force during leg deployment prior to touchdown. This expected behavior was not addressed in the software requirements. Perhaps it was not addressed because the software was not originally planned to be operating at this time, but the software engineers decided to start the process earlier than originally planned in order to even out the load on the processor.
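The interaction flaw described above can be sketched in a few lines of code. This is purely illustrative (not flight software, and it omits details such as when touchdown sensing was enabled during descent): each function below satisfies its own component-level specification, yet together they shut the engines down early.

```python
# Illustrative sketch (not flight code): each component behaves exactly
# as specified, yet the interaction crashes the lander.

def leg_sensor_signal(event):
    """Spec: report True when a touchdown-level force is sensed.
    The same physical force occurs during leg deployment, so the
    sensor -- working as designed -- reports that too."""
    return event in ("touchdown", "leg_deployment")

def descent_engine_controller(sensor_readings):
    """Spec: shut down the engines as soon as touchdown is indicated."""
    for reading in sensor_readings:
        if reading:
            return "engines_off"
    return "engines_on"

# Descent timeline: the legs deploy well before touchdown.
events = ["cruise", "leg_deployment", "descending", "touchdown"]
readings = [leg_sensor_signal(e) for e in events]

# The controller sees the deployment transient first and shuts down
# the engines while the lander is still far above the surface.
print(descent_engine_controller(readings))  # -> engines_off
```

No component here has failed, and no single specification is violated; the unsafe behavior lives entirely in the unexamined assumption connecting the sensor spec to the controller spec.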

Obviously, nobody intended for the spacecraft to crash, and nobody specified at the system level that the spacecraft ought to crash or prematurely shut down the descent engines. However, the interactions between many components, all working exactly as designed and all performing their individual component functions as intended, together did not achieve the overall system goals and ultimately resulted in system behavior that was obviously not intended. This is a difficult problem in engineering because component-based and failure-based techniques, which had been used to analyze the Mars Polar Lander before this accident, tend to overlook accidents that arise from components functioning and interacting as intended. It’s also difficult to identify with testing alone — unless the flaw is already known, it’s hard to know whether a test plan will cover all of the scenarios that matter. There is a need for systems-based methods that can identify these potential flaws, produce the correct requirements and specifications, and generate critical test scenarios that challenge the system.

STPA provides a powerful way to anticipate and address component interaction accidents as well as component failure accidents. STPA uses a model called a control structure to determine how controls, feedback, and other interactions between failed or non-failed components can lead to accidents. STPA treats safety as a dynamic control problem rather than a failure prevention problem, and the emphasis is on enforcing constraints on system behavior rather than simply preventing individual failures. STPA is especially powerful when applied to complex software-intensive systems or systems with the potential for unsafe or unexpected human interactions.

The four basic steps of STPA are shown in Figure 2 along with a graphical representation of each step.

Figure 2: Overview of the basic STPA method. Source: Dr. John Thomas, MIT

Defining the purpose of the analysis is the first step with any analysis method. What kinds of losses will the analysis aim to prevent? Will STPA be applied only to traditional safety goals like preventing loss of human life or will it be applied more broadly to security, privacy, performance, and other system properties? What is the system to be analyzed and what is the system boundary? These and other fundamental questions are addressed during this step.

The second step is to build a model of the system called a control structure. A control structure captures functional relationships and interactions by modeling the system as a set of feedback control loops. The control structure usually begins at a very abstract level and is iteratively refined to capture more detail about the system. This step does not change regardless of whether STPA is being applied to safety, security, privacy, or other properties.

The third step is to analyze control actions in the control structure to examine how they could lead to the losses defined in the first step. These unsafe control actions are used to create functional requirements and constraints for the system. This step also does not change regardless of whether STPA is being applied to safety, security, privacy, or other properties.

The fourth step identifies the reasons why unsafe control might occur in the system. Scenarios are created to explain:

1. How incorrect feedback, inadequate requirements, design errors, component failures, and other factors could cause unsafe control actions and ultimately lead to losses.

2. How safe control actions might be provided but not followed or executed properly, leading to a loss.

Once scenarios are identified, they can be used to create additional requirements, identify mitigations, drive the architecture, make design recommendations and new design decisions (if STPA is used during development), evaluate or revisit existing design decisions and identify gaps (if STPA is used after the design is finished), define test cases and create test plans, and develop leading indicators of risk during operations.
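To make the four steps concrete, here is a minimal sketch of the working artifacts an STPA analysis produces, using the Mars Polar Lander example from this post. The identifiers (L-1, H-1, UCA-1, S-1) and data structures are illustrative conventions, not a standard data model; the four ways a control action can be unsafe follow the categories commonly used in STPA.

```python
from dataclasses import dataclass

# Step 1: define the purpose -- losses to prevent and system-level hazards.
losses = {"L-1": "Loss of the spacecraft"}
hazards = {"H-1": "Descent engines shut down before touchdown [L-1]"}

# Step 2: model the control structure -- controllers, control actions,
# and feedback, modeled as a feedback control loop.
@dataclass
class ControlLoop:
    controller: str
    controlled_process: str
    control_actions: list
    feedback: list

loop = ControlLoop(
    controller="Onboard computer",
    controlled_process="Descent engines",
    control_actions=["Shut down engines"],
    feedback=["Leg sensor touchdown indication"],
)

# Step 3: identify unsafe control actions. STPA examines four ways
# any control action can be unsafe:
uca_types = [
    "Not provided when needed",
    "Provided when it causes a hazard",
    "Provided too early, too late, or out of order",
    "Stopped too soon or applied too long",
]
ucas = {
    "UCA-1": "Onboard computer provides 'Shut down engines' "
             "before touchdown [H-1]",
}

# Step 4: create causal scenarios explaining how each UCA could occur.
scenarios = {
    "S-1": "Leg deployment produces the same force signature as "
           "touchdown; the computer receives a touchdown indication "
           "while still descending and shuts down the engines [UCA-1]",
}
```

Each unsafe control action traces back to a hazard and forward to one or more causal scenarios, which is what lets the analysis generate requirements, mitigations, and test cases in a traceable way.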

Today, STPA is being used widely in many industries and applications including autonomous vehicles, Advanced Driver Assistance Systems (ADAS), unmanned space, air, sea, and land vehicles, nuclear power plants, chemical plants, oil & gas facilities, automated train control systems, and many other safety-critical applications. It is referenced explicitly in international standards like ISO/PAS 21448 (SOTIF) and RTCA DO-356A (cybersecurity). Several automotive suppliers have already described their success in applying STPA for hazard analysis of autonomous cars and ADAS applications.

It’s exciting to see Ike beginning to use STPA for a brand new application — automated trucks — and the progress they’re making in this challenging field!
