At the Sharp End of the Knife: Lessons on SRE Strategy from the Healthcare Space

Sarah Butt
Published in Sarah on SRE
Apr 30, 2021

I often describe the discipline of Site Reliability Engineering (SRE) as “being an emergency room doctor for the world’s technical systems.” As I set out on this journey to apply frameworks and learnings from other industries to SRE, I wanted to explore how best practices in healthcare could be applied within the discipline.

Healthcare is a field full of “life and limb” critical decisions. Over the past few decades, significant work has been done to make healthcare safer and more accurate. In the early days of technology, the reliability of technical systems was seldom considered critical. But as technology has become more central to our daily lives, technical systems have moved from being auxiliary systems, to important business systems of record, to sometimes even life-and-limb-critical systems. As this convergence has taken place, reliability has gone from a luxury to a necessity. In this article, we will examine three lessons from healthcare that can improve SRE strategy and execution.

Lesson 1: Operating at the Edge of Failure (Rasmussen’s model)

Rasmussen’s model for mapping system failures was introduced in his 1997 paper “Risk Management in a Dynamic Society: A Modelling Problem” and has since gained widespread respect for standing the test of time and for its far-reaching application across industries. The model was embraced by the medical community in the late 1990s and early 2000s as the concept of “gradients of failure,” and it has also been used to examine points of failure in other high-risk settings such as factories and nuclear reactors. In 2012, Dr. Richard Cook, an anesthesiologist by trade who helps lead Ohio State University’s Cognitive Systems Engineering Lab, first brought the idea to the technology field.

The basis of Rasmussen’s model is that there are three boundaries. As long as an effort remains within all three, it does not experience failure; once it crosses any one of them, it does. The three boundaries are the boundary of economic failure, the boundary of unacceptable workload, and the boundary of functionally acceptable performance.

The boundary of economic failure is the point at which an effort becomes so expensive that there is no longer a positive ROI or a long-term strategic value that outweighs the short-term cost. In most cases, management pushes away from this boundary in the pursuit of efficiency. Occasionally, such as after a substantial incident, other parts of the business may forcefully push back towards the boundary (for example, when an incident causes a significant public image problem and leadership utters the phrase “I don’t care what it costs, make sure it never happens again”). Ultimately, these efforts tend to produce only short-term change, as eventually even the most significant incidents fade from memory.

The boundary of unacceptable workload is the edge at which people start to fail. This failure can stem from an unsustainable workload, an unhealthy work environment, and other workforce-related factors. In many cases, the forces pushing away from this boundary fall into two types: the gradient towards least effort, and the efforts of employees to create a sustainable and healthy work environment.

This leaves the final boundary, the boundary of functionally acceptable performance. Because of the strong forces pushing away from the other two boundaries, this is the boundary most often breached, frequently resulting in a technical system outage. The tricky part of managing the boundary of functionally acceptable performance is that it is difficult to locate. Unlike bridges with safety factor ratings, many technical systems, particularly those with legacy components, are designed and spoken of in terms of inclusion, not exclusion: a system is most often architected to work within known boundaries, not to fail at known limits. Ideas therefore form about how much pressure a system can take before it might fail, without any definitive answer as to when it will truly fail, because pushing to that point would cause a business-damaging outage. The difference between where people believe the system might fail and where it actually fails is the unknown error margin, which produces a perceived boundary of acceptable performance that does not always reflect reality.

Rasmussen’s Model (Source: https://brooker.co.za/blog/2014/06/29/rasmussen.html)

When this model was originally presented to the technical community in 2012, the discipline of SRE was in its infancy. In the years since, several SRE strategies have been developed that can help companies define and modify the boundary of functionally acceptable performance. Presented here are three strategies where SRE thinking can aid in managing this boundary and ultimately increase customer-perceived reliability.

Strategy 1: Turn the Unknown into the Known
Chaos engineering and other forms of “controlled boundary-pushing” make up the first SRE strategy for understanding where the failure points in a system actually occur and for turning an unknown error margin into a known one. This information allows for business optimization along both the economic and workforce axes without risking an unknown point of failure. SRE can support these efforts through controlled, planned stress testing and failure exercises such as chaos experiments. Especially when dealing with older systems and legacy codebases, where parts of a system were designed before the system became business-critical, identifying these boundaries allows either for proactive remediation or for identifying and avoiding the point at which a system will cross the boundary of functionally acceptable performance.
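To make this concrete, here is a minimal sketch of what one such controlled experiment could look like. The service name, health endpoint, Docker-based test environment, and tolerance threshold are all hypothetical assumptions for illustration; the pattern is simply to measure a steady-state signal, inject one well-scoped fault, and record how far behavior actually degrades.

```python
# Hypothetical chaos experiment: turn an unknown error margin into a known one
# by measuring steady state, injecting one controlled fault, and re-measuring.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://payments.internal/healthz"  # hypothetical health endpoint


def error_rate(samples: int = 50) -> float:
    """Probe the health endpoint and return the fraction of failed requests."""
    failures = 0
    for _ in range(samples):
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                if resp.status >= 500:
                    failures += 1
        except (urllib.error.URLError, TimeoutError):
            failures += 1
        time.sleep(0.1)
    return failures / samples


def run_experiment() -> None:
    baseline = error_rate()
    print(f"steady-state error rate: {baseline:.2%}")

    # Inject a single, well-scoped fault: kill one replica of a stateless
    # service (assumes a Docker-based *test* environment, not production).
    subprocess.run(["docker", "kill", "payments-replica-2"], check=False)

    degraded = error_rate()
    print(f"error rate during fault: {degraded:.2%}")

    # The gap between baseline and degraded behavior is now a known margin.
    if degraded - baseline > 0.05:
        print("hypothesis violated: system does not tolerate single-replica loss")


if __name__ == "__main__":
    run_experiment()
```

A real experiment would be scoped, reversible, and run with an abort condition, but even a sketch like this replaces guesswork about the boundary with a measurement.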

Strategy 2: Enlist Automation for Perpetual Tension
Discussed above is the role of “safety campaigns,” generally launched in response to an outage. This is the traditional flurry of post-outage activity that involves instructing an organization to proceed with extreme caution to prevent another outage (anyone familiar with the “post-outage knee-jerk change freeze” situation?). Eventually, the collective memory of the outage fades, and the gradients of economics and workload begin to overpower the “safety campaign,” producing only temporary results. In contrast, creating automation and low-overhead, low-friction processes that need little or no human effort allows for a lasting version of the traditional safety campaign, one that is not nearly as easily moved and instead pushes back on the other two boundaries with perpetual tension. Once created, [good!] automation and processes should require significantly fewer economic and human resources to maintain than a traditional “safety campaign,” and they are largely immune to the “safety fatigue” and false sense of security that humans are prone to once an outage fades into memory. This creates a repeatable way to continuously hold performance within the three boundaries by establishing a perpetual gradient away from the boundary of functionally acceptable performance that stands in tension with the other gradients.
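As one illustration of automation that holds this tension without relying on anyone’s memory of the last outage, here is a hedged sketch of an error-budget gate that a CI/CD pipeline could run before each deploy. The SLO target and request counts are placeholder values that would come from a real monitoring system.

```python
# Sketch of automation that applies "perpetual tension": a pre-deploy gate
# that blocks a release once the SLO error budget is spent, with no reliance
# on anyone remembering the last outage. The SLO target and request counts
# below are illustrative; a real gate would pull them from monitoring.
import sys

SLO_TARGET = 0.999           # 99.9% of requests should succeed in the window
WINDOW_REQUESTS = 1_000_000  # total requests observed in the SLO window
FAILED_REQUESTS = 1_200      # failed requests observed in the same window


def remaining_error_budget(total: int, failed: int, slo: float) -> float:
    """Return the fraction of the error budget still unspent (1.0 = untouched)."""
    allowed_failures = total * (1 - slo)
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed / allowed_failures)


if __name__ == "__main__":
    budget_left = remaining_error_budget(WINDOW_REQUESTS, FAILED_REQUESTS, SLO_TARGET)
    print(f"error budget remaining: {budget_left:.1%}")
    if budget_left <= 0.0:
        # Exit non-zero so the CI/CD pipeline halts the rollout automatically.
        print("error budget exhausted: blocking deploy until reliability recovers")
        sys.exit(1)
```

Because a gate like this runs on every deploy, the pressure away from the boundary of functionally acceptable performance never fades the way a human-driven safety campaign does.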

Strategy 3: Bounce, don’t Smash
The third SRE strategy revolves around making the boundary of functionally acceptable performance “bouncy.” SRE can encourage and enable a model of service ownership that treats reliability as an important non-functional requirement (in contrast to the traditional “build the features and chuck it over the proverbial fence because keeping it up is IT Ops’ problem” mindset). By building in features such as graceful degradation, appropriate queueing, caching done correctly (watch out for this one!), failure pathways, and other resiliency that is invisible to customers, systems can be designed to adapt dynamically to changing conditions and prevent failure. Additional examples include using modern cloud concepts such as elasticity to manage pressure on a system and allocate more resources when needed, and using predictive analytics to mitigate failures before they occur. These efforts apply a “resiliency gradient” that can help counteract the efficiency and workforce gradients pushing towards a breach of the boundary of functionally acceptable performance. The idea is to create some cushion at the boundary, allowing a system either to “bounce” off it back into acceptable performance or to “absorb” into it while minimizing impact. It is highly unlikely a complex system will ever perform at 100%, but it is entirely possible to degrade gracefully so that, to customers, the system doesn’t appear to smash into a million pieces as it crosses into failure.
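As a small illustration of “bounce, don’t smash,” here is a hedged sketch of graceful degradation via a cached fallback. The recommendation service, cache, and default response are hypothetical; the pattern is simply to return something degraded but still useful instead of an error when a dependency fails.

```python
# Sketch of "bounce, don't smash": serve a degraded-but-useful response from a
# local cache when the primary dependency is slow or down, instead of failing
# the whole request. The recommendation service and cache are hypothetical.
import time

_cache: dict[str, tuple[float, list[str]]] = {}  # user_id -> (timestamp, value)
CACHE_TTL_SECONDS = 300


class DependencyUnavailable(Exception):
    """Raised when the primary dependency times out or errors."""


def fetch_recommendations_live(user_id: str) -> list[str]:
    """Call the primary recommendation service (stubbed as unavailable here)."""
    raise DependencyUnavailable("recommendation service timed out")


def fetch_recommendations(user_id: str) -> list[str]:
    try:
        result = fetch_recommendations_live(user_id)
        _cache[user_id] = (time.time(), result)  # refresh the cache on success
        return result
    except DependencyUnavailable:
        cached = _cache.get(user_id)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]      # slightly stale, but invisible to the customer
        return ["top-sellers"]    # static default: degraded, not broken


if __name__ == "__main__":
    print(fetch_recommendations("user-42"))  # falls back to the static default
```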

Lesson 2: Halt Impact and Prevent Further Damage with a First Responder Mindset and Frameworks

Anyone who has watched a medical drama is probably familiar with the concept of a code. Doctors swarm the patient, CPR is administered, various medicines are given urgently, and so forth. The casual TV viewer may not realize that each time physicians run a code, they follow a series of frameworks and runbooks called ACLS, or Advanced Cardiac Life Support. The ACLS protocol is triggered by a significant event (the patient has an irregular heart rhythm and is approaching, or already in, clinical death) and follows various algorithms based on the symptoms presented. A patient in SVT, for example, would be given the drug adenosine, whereas a patient in VF will receive CPR and a shock. When the patient is stabilized and returned to normal sinus rhythm, they are placed on monitoring and transferred to the ICU, where specialists determine the exact cause of their heart issues and how to resolve them.

Much like a patient going into cardiac arrest in the ER, SRE can be considered the “emergency room” of technical response teams. This analogy highlights the value of having general frameworks (“ACLS algorithms”) and a first responder mindset to stabilize systems and bring customers out of impact, sometimes even before determining the exact cause of failure. Far too often, when a technical system breaks, teams immediately huddle on a call trying to figure out what went wrong. I would propose that is the wrong question to ask in an emergency (there is a reason you don’t see medical staff stopping to ponder the intricacies and nuanced details of various organs during a code!). Instead, much like in ACLS, the question should be, “what are the things currently broken that are of immediate concern, and how do we bring the system out of impact?” Similar to how ACLS sorts patients into large buckets by heart rhythm, with pre-determined frameworks and runbooks for each, teams should break down complex technical systems into general buckets (e.g., “a database block,” “a network switch failure,” “a problem caused by a code release”) and determine a general runbook for how to quickly bring an incident in each bucket out of impact. Again, the focus is not on solving the problem; it is on quickly getting the system to a stable, out-of-impact state. For example, if a network switch fails, the runbook would address how to promptly fail over to a different switch, with little concern for how to fix the original switch. If a problem is caused by new code released into the environment, the runbook should address how to quickly roll back the code, not figure out which line of code caused the problem.
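One lightweight way to capture these “ACLS-style” buckets is to encode the stabilization runbooks as data the on-call team can pull up instantly. The buckets and steps below are purely illustrative examples, not a prescribed taxonomy.

```python
# Sketch of ACLS-style stabilization runbooks: broad failure buckets mapped to
# pre-agreed actions that bring the system out of impact first, with root-cause
# work deferred. Bucket names and steps are illustrative, not prescriptive.
STABILIZATION_RUNBOOKS = {
    "network_switch_failure": [
        "Fail traffic over to the standby switch",
        "Confirm packet loss has returned to baseline",
        "Hand the failed switch to the network team for later diagnosis",
    ],
    "bad_code_release": [
        "Roll back to the last known-good release",
        "Verify error rates and latency recover to SLO",
        "Open a follow-up for the owning team to find the offending change",
    ],
    "database_contention": [
        "Identify and kill the blocking sessions",
        "Shed or queue non-critical load",
        "Schedule deeper query analysis after impact ends",
    ],
}


def stabilization_steps(bucket: str) -> list[str]:
    """Return the out-of-impact steps for a failure bucket, or a safe default."""
    return STABILIZATION_RUNBOOKS.get(
        bucket,
        ["Engage the incident commander to classify the failure before acting"],
    )


if __name__ == "__main__":
    for step in stabilization_steps("bad_code_release"):
        print("-", step)
```

The value is less in the code than in the pre-agreement: the buckets and their first moves are decided calmly in advance, not argued about on the bridge.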

Similar to how an emergency department patient is transferred to the ICU for specialist involvement and more precise diagnostics once vital signs have returned thanks to plans like the ACLS algorithm, only once runbooks have brought a technical system to a stable, out-of-impact state should service owners and development teams focus on discovering what caused the outage and how to fix the underlying causes. The idea is to paint with broad brush strokes, using a first responder mindset, before painting in the details at the appropriate later time. Bringing a system outage out of impact through general frameworks and runbooks before finding and resolving the underlying causes provides a standardized way to reduce the business impact of an outage and shorten the time customers spend in impact.

Lesson 3: Don’t Leave to Memory What You Can Checklist

Dr. Marty Makary, a surgeon at Johns Hopkins, is widely credited for his work creating the concept of a surgical checklist to improve patient safety and outcomes during surgery. The process is supported by the World Health Organization and is now used globally, but many would be surprised at how simple it is. The checklist covers 19 items in three phases: “sign in,” “time out,” and “sign out.” The questions are seemingly obvious, such as “does the patient have any known drug allergies” and “is this the correct surgical site,” and the checklist even includes introducing surgical team members by name and role. The rationale behind this seemingly obvious set of questions is simple: why leave to memory or chance what you could trust to a checklist? Why create opportunities for small errors early in a process that could have catastrophic consequences down the line?

Much like a surgical checklist, there are critical areas of SRE that benefit from checklists. While almost every organization uses some form of checklist for change management in its production environment, fewer organizations use one in areas such as major incident bridges. A useful exercise is to think well in advance of the next Sev0 incident and consider what should go on a checklist. Which roles do you need to ensure are on the incident bridge? Do you need to implement a temporary halt on other changes in the production environment? Which logs need to be captured before bouncing a server or resetting IIS? And so forth.
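As a sketch of what this might look like in practice, an incident-bridge checklist can be captured as data so the bridge can see at a glance what has not yet been done. The items below are illustrative examples, not a complete or authoritative list.

```python
# Sketch of a Sev0 incident-bridge checklist captured as data rather than
# memory, so the bridge can see at a glance what is still outstanding.
# The specific items are examples only.
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    description: str
    done: bool = False


@dataclass
class IncidentChecklist:
    items: list[ChecklistItem] = field(default_factory=lambda: [
        ChecklistItem("Incident commander identified and announced on the bridge"),
        ChecklistItem("Service owner, network, and database roles paged and present"),
        ChecklistItem("Temporary change freeze declared for the affected environment"),
        ChecklistItem("Logs and diagnostics captured before any restart or failover"),
        ChecklistItem("Status page / stakeholder communication sent"),
    ])

    def outstanding(self) -> list[str]:
        """Return the descriptions of items that have not yet been completed."""
        return [item.description for item in self.items if not item.done]


if __name__ == "__main__":
    checklist = IncidentChecklist()
    checklist.items[0].done = True
    print("Still outstanding:")
    for description in checklist.outstanding():
        print("-", description)
```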

The basic concept is that, when faced with the adrenaline of a major outage, people may forget to do things they would otherwise have no problem remembering. Steps may be skipped that would ordinarily not be forgotten if not for the stress of the situation. People’s attention may be splintered. Everyone may assume someone else was taking care of that one important “thing”. Don’t create an unnecessary failure point — don’t leave to chance or memory what you can checklist.

In Conclusion:

As technical systems become increasingly critical and complex, they begin to closely resemble systems in other industries in both criticality and complexity. While much can be learned from applying lessons from fields such as transportation, construction, and energy to the discipline of SRE, some of the most substantial insights come from the healthcare space, a profession that deals daily with both life-and-limb criticality and extreme complexity.
