Resilience Engineering Process

What do Systems Theory, Explainable Artificial Intelligence, Cybersecurity, and Safety Engineering all Have in Common?

Nicolas Malloy
The Interlock
10 min read · Aug 17, 2019


Photo by Ruby Lane Vintage

When it comes to Resilient Systems — Rubber balls are the Original Gangstas.

Resilient, resilience, and resiliency. Boy, do those terms get thrown around a hell of a lot. A few weeks ago I was pretty amused, and a little shocked, to hear resilience used in a toothpaste commercial. Ah yes, buzzwords. You’ve gotta love ’em. But this got me thinking about the term, what it really means versus how it’s used.

I’d like to get it all out on the table by first proposing that system safety and cybersecurity engineers are leading the charge on resiliency. Their work is foundational. In fact, if you’re planning to build a resilient system and haven’t brought one of these folks onto your team, you’re off to a bad start.

Characterizing Resiliency

Photo by YWRC

The standard definition of resilience is the capacity to recover quickly from difficulties; toughness. This is pretty solid in general terms, but what about the resilience of complex systems? If you were to consider the resiliency of a fighter jet, what characteristics come to mind?

In my opinion (and I’m totally open to discussion on this), resilience refers to how well a system can handle disturbances and variations that fall outside of its designed adaptive mechanisms. Taking this a bit further, we can characterize resiliency as the union of disturbance and adaptation within a system’s operating space. In a perfect world, truly resilient systems can handle both anticipated and unanticipated disturbances. This is the essence of adaptation.

Photo by The Interlock

Of course, designing such a solution is easier said than done, but there are a number of techniques design engineers can use to maximize the resiliency of the system they intend to create. System safety and cybersecurity are two engineering disciplines whose sole purpose in life is to identify system-related scenarios leading to loss. Loss of life, personal data, and money are some of the heavy hitters. We’ll get to the most popular techniques shortly.

But first, to set the mood, let’s consider an autonomous driving system design and its need for a solution insusceptible to inadvertent control actions. This is certainly a desirable resilience feature. A number of losses can be attributed to such an event. The first, and perhaps most obvious, is the loss of an expensive asset, which has financial ramifications. Another, clearly more serious, outcome is loss of human life. Fratricide, harm the system inflicts on its own occupants, may be one specific context. What if a pedestrian is struck as an autonomous car swerves?

Referencing a real-world example: Tesla automobiles have an onboard autonomous driving system that controls steering. Many traditional auto manufacturers are close behind with their own versions of such systems. Yet Tesla, which is leading the pack, has come under much scrutiny for a number of accidents involving the autonomous system it has created. What can be done to improve these systems and reduce the likelihood of serious, even fatal, accidents?

Loss prevention is achieved through thoughtful identification and management of the causes.

The causality of system loss can be tough to pinpoint. Often it involves not a single contributing factor but many. As the permutations of loss and associated causality are teased out of such an event, it quickly becomes apparent that the analysis could get messy. The problem is that we’re attempting to build systems that are beyond our intellectual ability to manage; increased complexity of all types makes it difficult for designers to consider all potential system states.

Losses result when system constraints are not enforced. Requirements and constraints are used in the design of the control structure at the organizational and social system levels above the physical system. They are responsible for the system’s ability to adapt. Resilience engineering design principles are used to determine the most suitable system response to disturbances. For instance, physical redundancy is a very common resilience engineering design principle found in complex systems. It ensures that a backup is available in the event of a system component failure. Additional design principles and descriptions are provided in the table below, and a small code sketch of the redundancy principle follows it.

Photo by Kenneth V. Stavish
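
To make the redundancy principle concrete, here’s a minimal Python sketch of triple modular redundancy with majority-style voting. The sensor values, the `AGREEMENT_TOLERANCE` threshold, and the median-vote strategy are all illustrative assumptions on my part, not anything prescribed by the table above.

```python
from statistics import median

# Illustrative tolerance for deciding whether redundant readings agree.
AGREEMENT_TOLERANCE = 0.5

def vote(readings: list[float]) -> float:
    """Majority-style vote over redundant sensor readings.

    The median masks a single faulty channel in a triple-redundant set:
    one wild value cannot drag the output outside the two good readings.
    """
    if len(readings) < 3:
        raise ValueError("Triple modular redundancy needs at least 3 channels")
    return median(readings)

def channel_disagrees(reading: float, voted: float) -> bool:
    """Flag a channel whose reading strays too far from the voted value."""
    return abs(reading - voted) > AGREEMENT_TOLERANCE

# Channel B has failed high; the vote still returns a sane value.
readings = [101.2, 250.0, 100.9]   # sensors A, B, C (same physical quantity)
voted = vote(readings)             # -> 101.2
faulty = [i for i, r in enumerate(readings) if channel_disagrees(r, voted)]
print(f"voted value: {voted}, suspect channels: {faulty}")  # suspect: [1]
```

The design choice here is that the voter both masks the fault (the output stays usable) and exposes it (the disagreeing channel is flagged for maintenance), which is what makes redundancy a resilience mechanism rather than just spare parts.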

A Survey of Resilience Engineering Processes and Techniques

Photo by Robert Bye on Unsplash

Resilience engineering seeks to identify the causality of system loss. By establishing and adhering to a consistent and repeatable process, resilience engineering design principles can be used to determine how best to express a system’s resiliency through its performance. Mirroring processes from the fields of system safety engineering and cybersecurity leads to more resilient system designs.

STAMP (Systems-Theoretic Accident Model and Processes) [1]

STAMP is an accident causality model based on systems theory and systems thinking. STAMP integrates into engineering analysis causal factors such as software, human decision-making and human factors, new technology, social and organizational design, and safety culture, which are becoming ever more threatening in our increasingly complex systems. [1]

STPA (Systems-Theoretic Process Analysis) [1]

STPA is a powerful hazard analysis technique based on STAMP, while CAST (Causal Analysis based on STAMP) is the equivalent for accident and incident analysis. These tools are increasingly used across diverse industry sectors. Application areas have included aviation, air traffic control, space, defense, the automotive industry, railways, chemicals, oil and gas, medical devices, health-care, and workplace safety, with a growing interest coming from new areas such as the pharmaceutical industry and the finance and insurance sectors. Ongoing developments aim at extending the application field of STPA to include security. [1]
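
To give a feel for how an STPA pass begins, here is a minimal sketch that enumerates candidate unsafe control actions (UCAs) using the four standard STPA guide phrases. The braking and steering control actions are invented for illustration; in a real analysis each candidate is then assessed against the system’s defined hazards.

```python
from itertools import product

# The four ways a control action can be unsafe, per STPA.
GUIDE_PHRASES = [
    "not provided when needed",
    "provided when unsafe",
    "provided too early, too late, or out of sequence",
    "stopped too soon or applied too long",
]

# Illustrative control actions for an autonomous-driving controller.
CONTROL_ACTIONS = ["apply brakes", "steer left", "steer right"]

def candidate_ucas(actions, phrases):
    """Cross every control action with every guide phrase to form the
    candidate-UCA worksheet an analyst then assesses against hazards."""
    for action, phrase in product(actions, phrases):
        yield f"UCA: '{action}' is {phrase}"

for uca in candidate_ucas(CONTROL_ACTIONS, GUIDE_PHRASES):
    print(uca)
```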

STPA for Safety and Security [2]

STPA-SafeSec provides a single approach to identify safety and security constraints that then need to be ensured by the system in order to operate loss free. This single approach allows the interdependencies between safety and security constraints to be detected and used in mitigation strategies. Through this the most critical system components can be prioritized for in-depth security analysis (e.g. penetration testing). Furthermore, the results from the analysis show the potential system losses that can be caused by a specific security or safety vulnerability in the system; and lastly, mitigation strategies can be more readily designed and their effectiveness evaluated — changes in the physical process can be used to mitigate cyber-attacks, while control algorithms can mitigate safety limitations of the physical processes or devices. [2]

Safety-Security Assurance Framework (SSAF) [3]

SSAF is based on a core set of assurance principles. This is done so that safety and security can be co-assured independently, as opposed to unified co-assurance which has been shown to have significant drawbacks. This also allows for separate processes and expertise from practitioners in each domain. With this structure, the focus is shifted from simplified unification to integration through exchanging the correct information at the right time using synchronization activities. [3]

Fault Tree Analysis (FTA) [4]

FTA is a deductive analysis that begins with a general conclusion, followed by attempts to determine the specific causes of the conclusion by constructing a logic diagram called a fault tree. This is also known as taking a top-down approach. The main purpose of the fault tree analysis is to help identify potential causes of system failures before the failures actually occur. It can also be used to evaluate the probability of the top event using analytical or statistical methods. These calculations involve system quantitative reliability and maintainability information, such as failure probability, failure rate and repair rate. After completing an FTA, you can focus your efforts on improving system safety and reliability. [4]
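
As a sketch of the quantitative side of FTA, the snippet below computes a top-event probability from AND/OR gates, assuming statistically independent basic events (an assumption a real analysis must justify). The tree structure and failure probabilities are invented for illustration.

```python
from math import prod

def gate_probability(gate: str, child_probs: list[float]) -> float:
    """Top-event math for a fault tree with independent basic events.

    AND gate: all children must fail -> product of probabilities.
    OR gate : any child failing is enough -> 1 - product of survivals.
    """
    if gate == "AND":
        return prod(child_probs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate type: {gate}")

# Illustrative tree: top event occurs if the primary sensor fails AND
# (the backup sensor fails OR the voter logic fails).
backup_branch = gate_probability("OR", [1e-3, 5e-4])        # ~1.50e-3
top_event = gate_probability("AND", [1e-3, backup_branch])  # ~1.50e-6
print(f"P(top event) = {top_event:.3e}")
```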

Failure Modes and Effects Analysis (FMEA) [5]

FMEA is a structured way to identify and address potential problems, or failures and their resulting effects on the system or process before an adverse event occurs. In comparison, root cause analysis is a structured way to address problems after they occur. FMEA involves identifying and eliminating process failures for the purpose of preventing an undesirable event. FMEA is effective in evaluating both new and existing processes and systems. For new processes, it identifies potential bottlenecks or unintended consequences prior to implementation. It is also helpful for evaluating an existing system or process to understand how proposed changes will impact the system. [5]
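
One common (though not universal) way FMEA teams rank the failure modes they identify is the Risk Priority Number, the product of severity, occurrence, and detection scores. The worksheet rows below are invented for an autonomous-driving sensor suite purely to show the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (frequent)
    detection: int   # 1 (certain to detect) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number, the classic FMEA ranking metric."""
        return self.severity * self.occurrence * self.detection

# Invented worksheet rows for an autonomous-driving sensor suite.
worksheet = [
    FailureMode("lidar", "returns stale frames", severity=9, occurrence=3, detection=4),
    FailureMode("camera", "lens obscured", severity=7, occurrence=5, detection=2),
    FailureMode("CAN bus", "message drop", severity=8, occurrence=2, detection=6),
]

# Address the highest-RPN failure modes first.
for fm in sorted(worksheet, key=lambda f: f.rpn, reverse=True):
    print(f"{fm.item:8s} {fm.mode:22s} RPN={fm.rpn}")
```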

Hazard and Operability Analysis (HAZOP) [6]

HAZOP is a structured and systematic technique for system examination and risk management. In particular, HAZOP is often used as a technique for identifying potential hazards in a system and identifying operability problems likely to lead to nonconforming products. HAZOP is based on a theory that assumes risk events are caused by deviations from design or operating intentions. Identification of such deviations is facilitated by using sets of “guide words” as a systematic list of deviation perspectives. This approach is a unique feature of the HAZOP methodology that helps stimulate the imagination of team members when exploring potential deviations. [6]
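
The guide-word mechanics are easy to show in code: cross a set of guide words with the design parameters under study, and every pairing becomes a candidate deviation for the team to examine. The guide words below are a representative subset of the standard set; the brake-by-wire parameters are my own illustrative assumption.

```python
from itertools import product

# Representative subset of the standard HAZOP guide words.
GUIDE_WORDS = ["NO", "MORE", "LESS", "REVERSE", "EARLY", "LATE"]

# Illustrative design parameters for a brake-by-wire node.
PARAMETERS = ["brake pressure", "actuation timing"]

# Build the blank worksheet: every (guide word, parameter) pairing is a
# candidate deviation; the team fills in causes, consequences, safeguards.
worksheet = [
    {"deviation": f"{word} {param}", "causes": "", "consequences": "", "safeguards": ""}
    for param, word in product(PARAMETERS, GUIDE_WORDS)
]

for row in worksheet:
    print(row["deviation"])
```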

Event Tree Analysis (ETA) [7]

An ETA is an inductive procedure that shows all possible outcomes resulting from an accidental (initiating) event, taking into account whether installed safety barriers are functioning or not, and additional events and factors. By studying all relevant accidental events (that have been identified by a preliminary hazard analysis, a HAZOP, or some other technique), the ETA can be used to identify all potential accident scenarios and sequences in a complex system. Design and procedural weaknesses can be identified, and probabilities of the various outcomes from an accidental event can be determined. [7]
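
Here is a minimal sketch of the event-tree arithmetic: start from an initiating-event frequency, branch on each safety barrier working or failing, and multiply probabilities along each path. The numbers are invented, and for simplicity the sketch treats barrier outcomes as independent; in a real tree, branch probabilities are conditioned on the path taken so far.

```python
from itertools import product

# Illustrative numbers: initiating-event frequency and per-barrier
# failure probabilities (probability the barrier does NOT work).
INITIATING_FREQ = 1e-2        # loss of lane tracking, per hour (assumed)
BARRIERS = {
    "driver takes over": 0.1,
    "automatic braking": 0.01,
}

# Each outcome is one path through the tree: a success/failure state
# for every barrier, with frequency = initiating freq * branch probs.
for states in product([True, False], repeat=len(BARRIERS)):
    freq = INITIATING_FREQ
    labels = []
    for (name, p_fail), works in zip(BARRIERS.items(), states):
        freq *= (1 - p_fail) if works else p_fail
        labels.append(f"{name}: {'works' if works else 'fails'}")
    print(f"{' | '.join(labels):60s} -> {freq:.2e} /h")
```

A useful sanity check on any event tree: the outcome frequencies must sum back to the initiating-event frequency.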

System Safety Process [8]

The core systems safety process involves establishing a System Safety Program (SSP) to implement the mishap risk management process. The SSP is formally documented in the System Safety Program Plan, which specifies all of the safety tasks that will be performed, including the specific hazard analyses, reports, etc. As hazards are identified, their risk will be assessed, and hazard mitigation methods will be established as determined necessary to mitigate the risk. Hazard mitigation methods are implemented into system design via System Safety Requirements. All identified hazards are converted into hazard action records (HARs) and placed into a hazard tracking system (HTS). Hazards are continually tracked in the HTS until they can be closed. [8]
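
As a rough sketch of the HAR/HTS bookkeeping the process describes, here is a minimal record type and tracking loop; the field names, risk levels, and example hazards are my own assumptions, not a format from [8].

```python
from dataclasses import dataclass, field

@dataclass
class HazardActionRecord:
    """Minimal stand-in for a hazard action record (HAR)."""
    hazard: str
    risk: str                      # e.g. "high", "serious", "medium", "low"
    mitigations: list[str] = field(default_factory=list)
    status: str = "open"           # tracked until formally closed

# A toy hazard tracking system: HARs stay visible until closed.
hts: list[HazardActionRecord] = [
    HazardActionRecord("unintended acceleration", "high",
                       ["redundant pedal sensor", "plausibility check"]),
    HazardActionRecord("stale map data", "medium", ["map age monitor"]),
]

hts[1].status = "closed"  # verification evidence accepted
open_hazards = [har for har in hts if har.status == "open"]
print(f"{len(open_hazards)} hazard(s) still open: "
      f"{[har.hazard for har in open_hazards]}")
```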

Where is Resilience Engineering Headed and What are its Future Challenges?

Photo by Nathan Dumlao on Unsplash

You can’t build Resilient Systems without a Resilience Engineering Process.

That process is clearly rooted in the heuristics used by system safety and cybersecurity engineers. While no single technique or process discussed in this article is a silver bullet, together they may point to a solution. Perhaps a unified theory lies in wait, one which captures the overlap in the capabilities each discipline shares.

I believe the greatest challenge to a unified theory is the rapid evolution of technology, of which developments in Artificial Intelligence (AI) are likely the most daunting. This is largely due to our inability to explain the results these systems generate. We call this a black-box model.

Many questions arise: How will design engineers build mechanisms to cope with undesirable results based on unknown causal factors? Bad decisions? Actions that may lead to catastrophe? I’ll cover this in detail in another article.

Being able to determine why AI system decisions are made will allow for the development of system features that ensure the appropriate response to undesirable decisions. These are the next steps in resilience engineering. Explainable AI (XAI) could very well be the next iteration in the resilience engineering process.

XAI [9]

Dramatic success in machine learning has led to a torrent of AI applications. Continued advances promise to produce autonomous systems that will perceive, learn, decide, and act on their own. However, the effectiveness of these systems is limited by the machine’s current inability to explain their decisions and actions to human users. The Department of Defense (DoD) is facing challenges that demand more intelligent, autonomous, and symbiotic systems. XAI — especially explainable machine learning — will be essential if future warfighters are to understand, appropriately trust, and effectively manage an emerging generation of artificially intelligent machine partners.

The XAI program aims to create a suite of machine learning techniques that:

  • Produce more explainable models, while maintaining a high level of learning performance (prediction accuracy); and
  • Enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners.

New machine-learning systems will have the ability to explain their rationale, characterize their strengths and weaknesses, and convey an understanding of how they will behave in the future. [9]
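
The DARPA description above states goals rather than mechanisms, so here is one concrete, widely used post-hoc explanation technique as a minimal sketch: permutation importance. Shuffle one feature at a time and measure how much the model’s accuracy drops; a large drop means the model leaned on that feature. The synthetic data, the scikit-learn random forest, and the accuracy metric are all illustrative assumptions, not part of the XAI program.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: only feature 0 actually drives the label.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
baseline = accuracy_score(y, model.predict(X))

# Permutation importance: shuffle one feature at a time and measure how
# much accuracy drops relative to the unshuffled baseline.
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drop = baseline - accuracy_score(y, model.predict(X_perm))
    print(f"feature {j}: importance ~ {drop:.3f}")
```

Explanations like this give the resilience engineer something to monitor: if a deployed model suddenly starts leaning on a feature it shouldn’t, that is a detectable disturbance rather than a silent one.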

References

[1] Reykjavík University, https://en.ru.is/stamp/what-is-stamp/

[2] Friedberg, I., McLaughlin, K., Smith, P., Laverty, D., & Sezer, S. (2017). STPA-SafeSec: Safety and security analysis for cyber-physical systems. Journal of Information Security and Applications, 34(Part 2), 183–196.

[3] Johnson, N., & Kelly, T. (2018). An Assurance Framework for Independent Co-assurance of Safety and Security.

[4] Pilot, S. (2002). What is a Fault Tree Analysis? Quality Progress.

[5] API. (n.d.). Guidance for Performing Failure Mode and Effects Analysis with Performance Improvement Projects. Retrieved from CMS: https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/QAPI/downloads/GuidanceForFMEA.pdf

[6] Product Quality Research Initiative. (n.d.). Hazard & Operability Analysis (HAZOP). Retrieved from Product Quality Research Initiative: http://pqri.org/wp-content/uploads/2015/08/pdf/HAZOP_Training_Guide.pdf

[7] Rausand, M. (n.d.). Event Tree Analysis. Norway: Norwegian University of Science and Technology.

[8] Ericson, C. A., II. (2016). Hazard Analysis Techniques for System Safety (p. 16). Wiley.

[9] DARPA. Explainable Artificial Intelligence, https://www.darpa.mil/program/explainable-artificial-intelligence.
