A resilient system continues to operate successfully in the presence of failures. There are many possible failure modes, and each exercises a different aspect of resilience. The system needs to maintain a safety margin that is capable of absorbing failure via defense in depth, and failure modes need to be prioritized to take care of the most likely and highest impact risks.
In addition to the common financial calculation of risk as the product of probability and severity, engineering risk includes detectability. Failing silently represents a much bigger risk than the same failure that is clearly and promptly reported as an incident. Hence, one way to reduce risk is to make systems more observable. Another problem is that a design control intended to mitigate a failure mode may not work as intended. Infrequent failures exercise poorly tested capabilities that tend to amplify problems in unexpected ways rather than mitigate them, so it's important to carefully exercise the system to ensure that design controls are well tested and operating correctly. In the same way that we have moved from a few big software releases a year to continuous delivery of many small changes, we need to move from annual disaster recovery tests (or suffering when things actually break) to continuously tested resilience. Staff should be familiar with recovery processes and the behavior of the system when it's working hard to mitigate failures. A learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a continuously resilient system.
This discussion focuses on hardware, software and operational failure modes. It's also important to consider capacity overload, where more work arrives than the system can handle; security vulnerabilities, where a system is attacked and compromised; and safety failures, but we'll defer those discussions for now.
There are many possible failure modes, and since they aren’t all independent, there can be a combinatorial explosion of permutations, as well as large scale epidemic failures to consider. While it’s not possible to build a perfect system, here are five good tools and techniques that can focus attention on the biggest risks and minimize impact on successful operations.
The first technique is the most generally useful: concentrate on rapid detection and response. When you've done everything you can to manage the failures you can think of, this is all you have left when that weird complex problem that no one has ever seen before shows up! Figure out how much delay is built into your observability system. It may be taking samples once a minute, processing them for a minute or two, then watching for several bad samples in a row before it triggers an alert. That means it could take 5–10 minutes after the problem occurred for it to appear as a problem; then people have to notice and respond to emails or pager text messages, dial into a conference call, and log in to monitoring dashboards before any human response can start. Collecting some critical metrics at one-second intervals, with a total observability latency of ten seconds or less, matches the human attention span much better. Try to measure your mean time to respond (MTTR) for incidents. If your system is mitigating a small initial failure that is getting worse, and your team responds and prevents a larger customer-visible incident, then you can record a negative MTTR, based on your estimate of how much longer it would have taken for the problem to consume all the mitigation margin. It's important to find a way to record "meltdown prevented" incidents and learn from them, otherwise you will eventually drift into failure [Book: Sidney Dekker, Drift into Failure]. Systems that have an identifiable capacity trend, for example filling up disk space at a predictable rate, have a "time to live" (TTL) that can be calculated. Sorting by TTL identifies the systems that need attention first and can help focus work during a rapid response to a problem.
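The TTL calculation and sort are simple enough to sketch directly. A minimal Python example, where the system names, capacities and fill rates are all hypothetical:

```python
def time_to_live(capacity_gb, used_gb, fill_rate_gb_per_hour):
    """Hours until a resource filling at a steady rate is exhausted."""
    if fill_rate_gb_per_hour <= 0:
        return float("inf")  # not trending towards exhaustion
    return (capacity_gb - used_gb) / fill_rate_gb_per_hour

# Hypothetical fleet: (name, capacity GB, used GB, growth in GB/hour)
systems = [
    ("db-primary", 1000, 900, 5.0),
    ("log-store", 2000, 1000, 100.0),
    ("cache-node", 500, 100, 0.0),
]

# Sort ascending by TTL: the systems that will fail first need attention first.
ranked = sorted(systems, key=lambda s: time_to_live(s[1], s[2], s[3]))
for name, cap, used, rate in ranked:
    print(f"{name}: TTL {time_to_live(cap, used, rate):.1f} hours")
```

In real use the capacities and fill rates would come from the observability system, and the fill rate would be a trend fitted over a recent window rather than a constant.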
The second technique starts with the system constraints that need to be satisfied to maintain safe and successful operation, and works in a top-down manner using System Theoretic Process Analysis (STPA), a hazard analysis technique built on the System Theoretic Accident Model and Processes (STAMP) [Book: Engineering a Safer World by Nancy G. Leveson]. STPA is based on a functional control diagram of the system, and the safety constraints and requirements for each component in the design. A common control pattern is divided into three layers: the business function itself, the control system that manages that business function, and the human operators who watch over the control system. The focus is on understanding the connections between components and how they are affected by failures. In a "boxes and wires" diagram, most people focus on specifying the boxes and their failure modes, and are less precise about the information flowing between them. STPA puts more focus on the wires: what control information flows across them, and what happens if those flows are affected. There are two main steps. First, identify the potential for inadequate control of the system that could lead to a hazardous state, resulting from inadequate control or enforcement of the safety constraints. These could occur if a control action required for safety is not provided or followed; an unsafe control action is provided; a potentially safe control action is provided too early, too late, or in the wrong sequence; or a control action required for safety is stopped too soon or applied for too long. Second, examine each potentially hazardous control action to see how it could occur. Evaluate controls and mitigation mechanisms, looking for conflicts and coordination problems. Consider how controls could degrade over time, including change management, performance audits, and how incident reviews could surface anomalies and problems with the system design.
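The first STPA step can be mechanized as a simple cross-product: every control action is checked against the four ways it can be inadequate. A minimal sketch in Python; the control actions listed are hypothetical examples for an autoscaling control loop:

```python
# The four categories of unsafe control action from STPA step one.
UNSAFE_CONTROL_ACTION_TYPES = [
    "required for safety but not provided or followed",
    "unsafe control action provided",
    "provided too early, too late, or in the wrong sequence",
    "stopped too soon or applied for too long",
]

def candidate_hazards(control_actions):
    """Cross each control action with the four inadequacy categories,
    producing the list of candidate hazards to examine in step two."""
    return [(action, uca)
            for action in control_actions
            for uca in UNSAFE_CONTROL_ACTION_TYPES]

# Hypothetical control actions for an autoscaler.
hazards = candidate_hazards(["scale up", "scale down", "restart instance"])
```

The output is not the analysis itself; it's the checklist of questions that step two then works through one by one.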
The third technique is lineage-driven fault injection [Paper: Peter Alvaro]. The idea is to start with the most important business-driven functions of the system and follow the dependency tree or value chain that is invoked when it is working correctly. You can then ask what happens from a business perspective when component-level failures occur. Most online services have a sign-up flow that acquires new customers, and a value flow that is the main purpose of the service. For Netflix, the primary value flow is "streaming starts", where someone starts watching a show. The global total rates for new customers and streaming starts were made into a dashboard that was the starting point for most people during an incident, to see how big the effect on the business was. At some point Netflix also figured out how to measure the total number of customers dialed into its call centers globally and added it to the same dashboard; this became a somewhat noisy but extremely valuable and sensitive metric for understanding whether customers were unhappy. Starting with the purpose of the system, we can walk through all the steps that provide value to its users, see what might go wrong with each step, come up with an observability and mitigation strategy, and find ways to run chaos experiments to validate our design controls. This is effectively a top-down approach to failure mode analysis, and it avoids the trap of getting bogged down in all the possible things that could go wrong in a bottom-up approach.
The fourth technique is to apply the no single point of failure (SPOF) principle. If something fails, there should be another way for the system to succeed. For high-resiliency systems, it's even better to use the "rule of three" and quorum-based algorithms. This is why most AWS regions have three availability zones. When there are three ways to succeed, we still have two ways to succeed when a failure is present, and if data is corrupted, we can tell which of the three is the odd one out. When safely storing data, it's very helpful to have three locations to write to, because once a majority have succeeded, you can move on. There is no need to retry and no extra time taken when a failure is present. For systems that are latency sensitive, creating two independent ways to succeed is an important technique for greatly reducing the 99th percentile latency. Chaos tests are also an important way to validate the hypothesis that two mechanisms really are independent.
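As a sketch of the quorum write idea in Python: issue the write to all three replicas in parallel and return as soon as a majority acknowledge, so one slow or failed replica adds no retry latency. The replica writer callables are hypothetical stand-ins for real storage calls:

```python
import concurrent.futures

def quorum_write(replica_writers, quorum=2):
    """Issue a write to every replica in parallel and return as soon as a
    majority have acknowledged, so a slow or failed replica adds no latency."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_writers))
    futures = [pool.submit(w) for w in replica_writers]
    acks = 0
    try:
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()
                acks += 1
            except Exception:
                continue  # one replica failing does not block the write
            if acks >= quorum:
                return True  # majority durable; stragglers finish in the background
        return False  # quorum not reached: the write must be retried elsewhere
    finally:
        pool.shutdown(wait=False)
```

A real quorum store also has to handle read repair and anti-entropy for the replica that missed the write; this only shows the latency-hiding property of returning on majority success.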
The fifth technique is risk prioritization. There is an industry standard [ISO] engineering technique called Failure Mode and Effects Analysis (FMEA) which uses a set of conventions to produce an estimated risk priority number (RPN) between 1 and 1000, by ranking probability, severity and detectability on a 1–10 scale, where 1 is good and 10 is bad, and multiplying them. A low probability, low impact, easy to detect risk has an RPN of 1. An extremely frequent, permanently damaging, impossible to detect risk has an RPN of 1000. By listing and rating failure modes, it's easy to see which ones to focus on. Next you record what effect you expect your mitigation strategy to have, which should drop the RPN, and then focus on the new highest RPN, until there aren't any high values left. In practice, the easiest way to reduce RPN is to add observability, so you aren't working blind. You can then get empirical measurements of probability: once a failure is visible, you can see how often it occurs. We'll use FMEA to work through the sign-up flow example in a later section.
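The RPN arithmetic is simple enough to sketch directly. In this Python example the failure modes and their ratings are hypothetical:

```python
def rpn(severity, probability, detectability):
    """Risk Priority Number: each factor rated 1 (good) to 10 (bad)."""
    for factor in (severity, probability, detectability):
        assert 1 <= factor <= 10, "ratings must be on the 1-10 scale"
    return severity * probability * detectability

# Hypothetical failure modes: (name, severity, probability, detectability)
failure_modes = [
    ("auth service timeout", 7, 4, 3),
    ("silent data corruption", 9, 2, 9),
    ("slow page render", 3, 6, 2),
]

# Highest RPN first: work the top of the list until no high values remain.
ranked = sorted(failure_modes, key=lambda f: -rpn(*f[1:]))
for name, s, p, d in ranked:
    print(f"{name}: RPN {rpn(s, p, d)}")
```

Note how the hard-to-detect corruption case outranks the more likely timeout: detectability is what pushes silent failures to the top of the work list.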
Taking a top down approach we can divide the failure modes into four general categories. Each category is centered around the responsibilities of a different team of people, and should be discussed and developed in partnership with those teams.
Following the value chain from the business perspective, the first team is the product managers and developers who specify and build the unique code that makes up the business logic of the system. The system itself could be a single microservice with a small team, which makes it easier to reason about, or a large monolithic application. The point is to focus on faults that are in the scope of control of the developers of the system, and that are tied directly to the business value of the application.
The second team is the software platform team, who provide standardized supported libraries, bundled open source projects, packaged commercial software, operating systems, build pipelines, language runtimes, databases, external high level services etc. that are used across multiple applications. They are indirectly supporting business value for multiple teams and use cases, have to manage components that they aren’t able to modify easily, and have to deal with a complex set of versions and external supply chains.
The third team is the infrastructure platform team, who deal with datacenter and cloud based resources. They are concerned with physical locations and cloud regions, networking failures, problems with infrastructure hardware, and failures of the control planes used to provision and manage infrastructure.
The fourth team, reliability engineering, provides observability, incident management and automated remediation tools for the entire system. Failures of observability and incident management can compound a small problem into a large one, and make the difference between a short and a long time to fix problems.
In all four cases, there is a common starting point and structure to the failure modes which should be extended to take account of a particular situation. The criticality and potential cost of each failure mode is context dependent, and drives the available time and budget for prioritized mitigation plans. The entire resiliency plan needs to be dynamic, and to incorporate learnings from each incident, whether or not the failure has noticeable customer impact.
Failure Modes and Effects Analysis
The FMEA spreadsheet is used to capture and prioritize risks based on Severity, Probability and Detectability, where each is rated on a 1 to 10 scale. A standard model for each follows. The exact values chosen are somewhat arbitrary, and some forms of FMEA use a 1 to 5 scale, but all we are trying to do is come up with a rough mechanism for prioritization, and in practice this is good enough for the purpose.
Severity starts with several high levels that destroy things, in other words irreversible failures like the death or incapacitation of a person, the destruction of machinery, or flood and fire in a datacenter. The next few levels cover temporary incapacitation, then failures that are recoverable with degraded performance, and finally ratings of minor or no effect.
For probability, we use an exponential scale, from almost inevitable and repeated observed failures down through occasional failures to failures that haven’t been seen in practice. The probabilities are guesses during the design phase, but should be measured in real life when a system is operating, and the risk updated based on what is seen in practice.
The spreadsheet is organized into sections, listing failure modes for each function. Each row also has a recommended action, listing who is responsible and when, the actions taken, and the updated severity, occurrence and detectability that lead to a planned reduction in the RPN. The rows are shown split below for readability. The only formula needed is RPN = Sev × Prob × Det.
Application Layer FMEA
This first example FMEA models the application layer, assuming it implements a web page or network-accessed API. Each step in the access protocol is modelled as a possible failure mode, starting with authentication, then the access itself. This is followed by some common code-related failure modes. A specific application team should discuss and prioritize these, and add failure modes of their own. Judgement and discussion are needed to finish filling in all the levels and actions, but some common failure modes have been completed.
Software Stack FMEA
The software stack failure modes start along the same lines, with authentication and a request–response sequence analysis that needs to be repeated for each of the projects, packages and service dependencies. However, the more specific failure modes relate to the control planes for services hosted in cloud regions. In general, a good way to avoid customer-visible issues caused by control plane failures is to pre-allocate identity, network, compute and storage/database structures wherever possible. The cost of failure should be weighed against the cost of mitigation.
It’s not generally useful to talk about “what to do if an AWS zone or region has an outage” because it depends a lot on what kind of outage and what subset of services might be impacted. Service specific control plane outages are part of the software stack FMEA. If a datacenter building is destroyed by fire or flood, we have a very different kind of failure than a temporary power outage or cooling system failure, and that’s very different to losing connectivity to a building where all the systems are still running, but isolated. In practice, we can expect individual machines to fail randomly with very low probability, groups of similar machines to fail in a correlated way due to bad batches of components and firmware bugs, and extremely rare availability zone scoped events caused by power and cooling failures, bad weather, earthquake, fire and flood.
Operations and Observability
Misleading and confusing monitoring systems cause a lot of failures to be magnified rather than mitigated. While some of the failure modes can be prioritized with an FMEA, these higher level failures are better modelled using Systems Theoretic Process Analysis (STPA), which also captures the business level criticality of the application. The service interactions that make up the monitoring system can be examined starting with the same patterns used for the applications FMEA, but it’s more interesting to look at the interactions with human operators and derive hazards from the information flows.
Simplified STPA Model
There is a lot more to STPA, but a simplified approach shows how it can be applied to the problem of finding failure modes in high availability systems. One of the models shown in the book is our starting point: the controlled process (data plane), the automated controller (control plane), and the human controller (who is looking at dashboards).
If we rewrite the labels to show a specific application, such as a financial services API that collects customer requests and performs actions, then the human controller monitors the throughput of the system to make sure it's completing actions at the expected rate. The automated controller could be an autoscaler that watches the CPU utilization of the controlled process, scaling the number of instances supporting the traffic up and down to keep CPU utilization within a fixed maximum and minimum range.
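A minimal sketch of such an autoscaler's decision step in Python, with hypothetical thresholds and instance limits:

```python
def autoscale_step(instances, cpu_utilization, target_low=0.4, target_high=0.7,
                   min_instances=2, max_instances=20):
    """One step of a hypothetical autoscaler: keep average CPU utilization
    inside [target_low, target_high] by adding or removing instances,
    bounded by the configured min/max limits."""
    if cpu_utilization > target_high and instances < max_instances:
        return instances + 1  # scale up before the service maxes out
    if cpu_utilization < target_low and instances > min_instances:
        return instances - 1  # scale down to save cost
    return instances  # utilization in range, or a limit has been reached
```

Note the failure mode embedded in the last line: once the maximum limit is reached, high CPU produces no further scaling action, and that is exactly the situation where the human controller has to notice and intervene.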
If the service CPU utilization maxes out and throughput drops sharply, the human controller is expected to notice and decide what to do about it. The controls available are to change the maximum autoscaler limit, restart the data plane or control plane systems, or to roll back to a previous version of the code.
The hazards in this situation are that the human controller could do something that makes it worse instead of better. They could do nothing, because they aren't paying attention. They could reboot all the instances at once, which would stop the service completely. They could freak out after a small drop in traffic caused by customers deciding to watch the Super Bowl on TV, and take an action before it is needed. They could act too late, eventually noticing after the system has been degraded for a while and then increasing the autoscaler maximum limit. They could do things in the wrong order, like rebooting or rolling back before they increase the autoscaler limit. They could stop too soon, increasing the autoscaler limit but not far enough to get the system working again, and go away assuming it's fixed. They could spend too long rebooting the system over and over again. The incident response team could get into an argument about what to do, or multiple people could make different changes at once. The run-book is likely to be out of date and to contain incorrect information about how to respond to the problem in the current system.
Each of the information flows in the control system should be examined to see what hazards could occur. In the monitoring flows, the typical hazards are a little different to those in the control flows. The sensor that reports throughput could stop reporting and get stuck on the last value seen. It could report zero throughput even though the system is working correctly. The reported value could numerically overflow and appear as a negative or wrapped positive value. The data could be corrupted and report an arbitrary value. Readings could be delayed by different amounts so they are seen out of order. The update rate could be set too high, so that the sensor or metric delivery system can't keep up. Updates could be delayed so that the monitoring system shows out-of-date status and the effects of control actions aren't seen soon enough; this often leads to over-correction and oscillation in the system, which is one example of a coordination problem. Sensor readings may also degrade over time, perhaps due to memory leaks or garbage collection activity in the delivery path.
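Several of these sensor hazards can be screened for before a control loop or dashboard trusts a sample. A minimal sketch in Python; the function name, reading format (value, timestamp) and thresholds are all hypothetical:

```python
import time

def sanitize_throughput(reading, last, now=None, max_age_s=30.0, max_value=1e9):
    """Screen a (value, timestamp) throughput sample against common sensor
    hazards. `last` is the previous accepted (value, timestamp) or None.
    Returns (value, reason); value is None when the sample is rejected."""
    now = now if now is not None else time.time()
    value, timestamp = reading
    if now - timestamp > max_age_s:
        return None, "stale"            # delayed or stuck delivery path
    if last and timestamp <= last[1]:
        return None, "out of order"     # reordered sample
    if value < 0 or value > max_value:
        return None, "out of range"     # overflow, wraparound or corruption
    if last and value == last[0]:
        return value, "unchanged"       # possibly stuck on the last value seen
    return value, "ok"
```

A repeated value is flagged rather than rejected, since a genuinely steady workload looks the same as a stuck sensor; only a run of "unchanged" readings should raise an alert.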
The identified STPA hazards in each control or sensor path in the system need to be prioritized in a similar manner to the FMEA failure modes. They also provide good inputs for test cases to make sure the code used to sanitize inputs from each sensor flow can mitigate each hazard.
The STPA three level control structure provides a good framework for asking questions about the system. Is the model of the controlled process looking at the right metrics and behaving safely? What is the time constant and damping factor for the control algorithm, will it oscillate, ring or take too long to respond to inputs? How is the human controller expected to develop their own models of the controlled process and the automation, and understand what to expect when they make control inputs? How is the user experience designed so that the human controller is notified quickly and accurately with enough information to respond correctly, but without too much data to wade through or too many false alarms?
The STPA model shown assumes a single instance of the controller. The next step, for a future post, is to derive a multi-region STPA model and examine the hazards during a disaster recovery primary to secondary failover event.