A New Paradigm for Improving System Resilience: Safety-I vs. Safety-II

Leandro Fernandez
syngenta-digitalblog
8 min read · Nov 13, 2023

Resilience and availability are our top priorities for Syngenta products. We firmly believe that delivering high-quality services is the only way to build customer trust. For this reason, within the Digital Product Engineering (DPE) group, we continually seek innovative ways to achieve operational excellence in all the services and products we develop.

In this article, I will introduce Erik Hollnagel’s groundbreaking concept of Safety-I vs. Safety-II and explain how we anticipate this new model will enhance Syngenta’s ability to improve the resilience of our digital systems. While we are still in the early stages of implementing Safety-II at Syngenta, we believe it will lead to breakthroughs with a profound impact on how we design our services, especially in identifying ways to enhance the resilience of our products.

Safety-I: The Traditional Approach

For a long time, the industry has relied on Safety-I as the standard for resilience management. This reactive approach involves waiting for something to go wrong, tracing back to identify the root cause, and implementing measures to prevent or mitigate similar errors in the future. Most companies engage in certain practices to improve their response to critical situations, mainly outages and incidents. For instance, we conduct annual disaster recovery exercises to improve our systems and refine our runbooks, so that we are well prepared in the event of a significant outage. Additionally, when incidents strike our production systems, we carry out postmortems or root cause analyses to learn and identify areas for improvement. But that may not be enough. Why?

As Erik Hollnagel states, while Safety-I has helped us build more reliable systems, it has limitations and flaws that hold back our efforts to build more resilient systems. These are the main reasons:

1. Complexity: Debugging complex systems is challenging due to cascading cause-and-effect relationships. Identifying the root cause is often difficult or even impossible, as complex systems do not adhere to a linear cause-consequence model.

2. Expertise and Bias: Postmortems and incident analysis outcomes can depend heavily on the individuals involved in the process. As more technologies, components, or systems are involved, gathering the necessary expertise to identify the correct root cause and implement improvements becomes more challenging. Therefore, final conclusions of postmortems might be biased.

3. Lack of Holistic Understanding: Addressing root causes in isolation without a comprehensive understanding of the problem leads to ineffectiveness.

4. Overreliance on Defense Layers: In our attempt to prevent future incidents, we often focus on increasing compliance and implementing additional layers of defense. However, it is important to recognize that each layer introduces its own set of flaws and vulnerabilities, as illustrated by the Swiss cheese model.

The Swiss Cheese Model (author: User:BenAveling; license)

5. Human Accountability: Safety-I often perceives humans as liabilities rather than contributors to problem-solving, potentially causing disengagement among development teams. This situation becomes evident during postmortems, where companies aim to create blameless environments. However, there is a tendency to attribute blame, resulting in certain teams being stigmatized. Unfortunately, these stigmatized teams often become disengaged and unmotivated, ultimately leading to an increase in incidents and perpetuating a downward spiral.

6. Limited Learning Opportunities: Safety-I primarily learns from failures and unwanted outcomes, which represent only a small percentage of normal operations. This limited focus prevents the broader understanding required for resilience improvement.

At Syngenta we currently apply Safety-I approaches such as annual disaster recovery exercises and incident management processes that include root cause analysis, and we are starting to run gamedays and use chaos engineering. The benefits of these Safety-I approaches have been evident in improved reliability and readiness. However, we have encountered limitations like the ones described above. To enhance our practices, we need to address these limitations and explore ways to overcome them. But how can we improve further?

Safety-II: A New Perspective on Resilience

Safety-II presents a fresh perspective on resilience and safety, shifting away from considering people as liabilities and instead viewing them as the solution to problems. This approach involves teams actively and continuously striving to anticipate events and incidents. Safety-II focuses on investigating and analyzing why and how things typically go right, serving as a foundation for understanding occasional failures.

Things go right most of the time. However, we tend to focus on incidents and outages, which represent only a small fraction of all the events occurring within a system. Safety-II takes a different approach, telling us to concentrate on events in which the system functions as expected and learn from them. Since the system operates effectively most of the time, there are plenty of opportunities to gain insights from these instances. Typically, we do not invest time in studying a system that functions well. The logic goes: if something is working correctly, why invest more time in it? However, Safety-II suggests that in order to enhance resilience, we should dedicate effort to understanding why things succeed and develop a better grasp of the mechanisms underlying expected performance.

In his Safety-I to Safety-II whitepaper, Erik Hollnagel emphasizes that the purpose of investigation within Safety-II is to comprehend how things consistently go right, serving as a basis for explaining occasional failures. Within this context, humans are seen as a valuable resource that contributes to system flexibility and resilience. Teams and developers need to actively invest time in comprehending why and how their systems operate effectively and anticipate how minor variations can potentially lead to incidents.

From Reactive to Proactive

To shift towards Safety-II, we need to redirect our attention to events producing an expected outcome of the system, which are often ignored as they appear to work well. The goal is to learn from what goes right, understanding the reasons behind successful system performance and anticipating potential failures.

Erik Hollnagel also suggests focusing on frequent events rather than emphasizing severity. By examining everyday performance variability and making small improvements, we can enhance system reliability. Small yet frequent events are easier to understand and fix than rare events with severe consequences. This is a very interesting thought. We often focus on the most severe incidents to drive improvements and refactoring in our systems, but we might be missing the point completely. The real value may lie in focusing on frequent incidents!
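As a rough illustration of prioritizing by frequency rather than severity, the sketch below counts how often each kind of event occurs and ranks the kinds from most to least frequent. The event categories, the severity labels and the function name rank_by_frequency are hypothetical, chosen only for this example; they are not part of Hollnagel's work or of any Syngenta tooling.

```python
from collections import Counter


def rank_by_frequency(events):
    """Rank event categories by how often they occur, ignoring severity.

    `events` is an iterable of (category, severity) pairs. The shape of
    these records is an assumption made for this sketch.
    """
    counts = Counter(category for category, _severity in events)
    # most_common() returns (category, count) pairs, highest count first.
    return counts.most_common()


# Hypothetical sample data: one severe but rare event, several small frequent ones.
events = [
    ("timeout", "low"),
    ("disk_full", "high"),
    ("timeout", "low"),
    ("retry_storm", "medium"),
    ("timeout", "medium"),
    ("retry_storm", "low"),
]

for category, count in rank_by_frequency(events):
    print(f"{category}: {count}")
```

In this sample the single high-severity event ends up last in the ranking, which is the shift in attention the paragraph above describes.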

It’s important to note that adopting Safety-II doesn’t imply abandoning Safety-I. Analyzing incidents and identifying root causes remains crucial. However, the real potential lies in complementing this approach with a focus on understanding how things go right.

Using DORA metrics for a Safety-II transition

In the area of DevOps and software engineering, the DevOps Research and Assessment (DORA) metrics have emerged as valuable indicators of high-performing teams. These metrics, including deployment frequency, lead time, change failure rate, and time to restore service, provide insights into team efficiency, productivity, and reliability.
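To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The Deployment record shape, its field names and the dora_metrics function are assumptions made for illustration, and measuring time to restore from the moment of the failed deployment is a simplification. Real pipelines and incident tools expose this data in different ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    failures = [d for d in deployments if d.failed]
    # Lead time: from commit to running in production.
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Time to restore: from the failed deployment to service restoration.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here instead of means so that a single unusually slow change or recovery does not dominate the result; that is a design choice of this sketch, not a requirement of the metrics.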

At Syngenta we are starting to adopt Safety-II principles, leveraging DORA metrics to identify teams that consistently demonstrate positive outcomes and exhibit excellent performance. These high-performing teams serve as valuable sources of best practices and examples of effective system resilience. By analysing their performance metrics and practices, we can gain insights into what contributes to their success.

Rather than solely focusing on teams with a history of incidents or failures, Safety-II encourages us to shift our attention to the teams consistently delivering exceptional results. By studying these successful teams, we can learn and replicate their approaches, while also identifying patterns and practices that contribute to their consistent performance.

The use of DORA metrics enables us to objectively measure and assess team performance, providing empirical evidence of their effectiveness. By identifying top-performing teams through these metrics, we have the opportunity to understand the factors driving their success and leverage that knowledge to improve the resilience of other teams and the overall system.
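One simple way to shortlist teams worth studying is sketched below, assuming each team's metrics are already available as a dictionary. The team names, the metric keys and the selection rule (at or above the median deployment frequency and at or below the median change failure rate) are hypothetical choices for this example, not an established DORA classification.

```python
from statistics import median


def high_performing_teams(team_metrics: dict[str, dict[str, float]]) -> list[str]:
    """Return teams that deploy at least as often as the median team
    and fail no more often than the median team (assumed criteria)."""
    freq_median = median(
        m["deployment_frequency_per_day"] for m in team_metrics.values()
    )
    cfr_median = median(m["change_failure_rate"] for m in team_metrics.values())
    return sorted(
        team
        for team, m in team_metrics.items()
        if m["deployment_frequency_per_day"] >= freq_median
        and m["change_failure_rate"] <= cfr_median
    )


# Hypothetical sample data for three teams.
team_metrics = {
    "team-a": {"deployment_frequency_per_day": 2.0, "change_failure_rate": 0.05},
    "team-b": {"deployment_frequency_per_day": 0.5, "change_failure_rate": 0.30},
    "team-c": {"deployment_frequency_per_day": 1.0, "change_failure_rate": 0.10},
}

print(high_performing_teams(team_metrics))
```

A shortlist like this is only a starting point: the metrics show which teams to talk to, while the practices behind their results still have to be learned from the teams themselves.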

Combining the principles of Safety-II with DORA metrics allows us to foster a culture of learning from success, rather than solely focusing on failures. It enables us to identify and amplify best practices, encouraging the wider adoption of successful strategies for enhanced resilience. By recognizing and celebrating high-performing teams through the lens of Safety-II, we can further drive improvements in system reliability while promoting collaboration and knowledge sharing across the organization.

This is a cultural change

The transition from a Safety-I culture to a Safety-II culture, or integrating both, poses a challenging cultural shift. Initially, our focus is primarily on reacting to outages and incidents, often neglecting the allocation of resources towards availability and resilience until a major outage affects the company. Similarly, investments in data security usually receive inadequate attention until the first breach occurs.

To facilitate this cultural change, the first step is to bring visibility to the current status of key availability, resilience and delivery metrics. This is always the first step when dealing with a new challenge: find out where you are, identify key metrics to measure the current status and start meaningful discussions around those metrics. Those discussions are the foundation that leads to a behaviour change and eventually to a cultural shift. That is why at Syngenta we started our Safety-II journey by streamlining the use of DORA metrics across the whole organization, providing visibility of key operational and delivery metrics for all our products and services, and creating forums for discussion around those metrics. This has been our first step, but more are coming.

In the coming months we will use those metrics to identify overperforming teams, learn from them, and select best practices that can be spread across the whole organization. We will identify systems that perform well in the same operational stress situations where others underperform. We will have a better understanding of how our systems work, and this will help us identify our weak points and bottlenecks.

At Syngenta, we are just starting this exciting journey to a Safety-II driven organization. The road ahead is still full of uncertainty about how to properly apply these concepts in a DevOps and SRE environment and about what Safety-II really means in the context of digital product development. In coming posts, we will share our journey and our learnings with you. Stay tuned!

Leandro Fernandez, Global Head of DevOps and Software Reliability in Syngenta Digital Product Engineering