The scientific method for resilience

Vanguard Tech · Mar 7, 2024

At a young age, most elementary school children spend time in science class learning about the concept of experimentation. They’re taught the scientific method, a cyclical process that consists of six steps to follow when conducting experimental research. These steps are probably familiar to the majority of readers: Observation, Question, Hypothesis, Experiment, Analysis, and Conclusion.

We’ve likely all experienced applying the scientific method in a school science lab, donning oversized white coats, latex gloves, and plastic protective goggles while tinkering with various solutions in beakers and test tubes. However, as professional technologists, it may have been many years since we’ve been in that environment, and chances are our memories of these hands-on experiments have faded. Our high-resolution computer monitors and tricked-out integrated development environments (IDEs) feel like a far cry from the long, cold lab tables where we partnered up with peers to complete our school projects.

Though our “labs” don’t look anything like the ones where we learned in school, the term “experimentation” still comes up in our jobs. Specifically, we talk about conducting experiments when we are trying to test our systems to ensure that they’re ready for prime time in the production environment. Our automated scripts and application programming interface (API) calls may not evoke the same scientist stereotypes as the primary school lab experiments where we dissolved salt into water or dropped objects from various heights to see how fast they’d fall, but much of the high-level process remains the same.

At Vanguard, we apply the scientific method to software resilience through the practices of Failure Modes and Effects Analysis, Chaos Engineering, documentation, and planning. In this post, adapted from a presentation given at QCon and SRECon in 2022, we’ll explain how each of our processes maps back to the six steps of the method, and illustrate through a fictional example how you might apply the same steps to achieve improved system resilience in your own organization.

Scenario

Let’s imagine that we’re on a team that has been building a new business application, and we are preparing for a major release in just a few weeks. To build confidence in the new app’s ability to withstand various turbulent conditions in the production environment, and in our ability as engineers to respond to any issues that arise, we will conduct experiments that put our app to the test. Over the course of a hypothetical week, we’ll dedicate a portion of our time to completing the steps of the scientific method. Here’s our anticipated schedule for the week:

Step 1: Observation

Before we get to our Failure Modes and Effects Analysis meeting at 2 p.m., we’ll familiarize ourselves with the system architecture by reviewing an architecture diagram. In this upcoming meeting, we’ll be discussing the architecture diagram in detail, so it’s best if we aren’t seeing it for the first time when we get to the meeting. Usually, this diagram has recently been updated for accuracy and is sent out as a pre-read before the meeting.

Simple sample architecture

For the purposes of the hypothetical scenario, we’ll stick with a simple system architecture that follows a typical 3-tier structure. For this fictional business application hosted in the Amazon Web Services (AWS) public cloud, there’s a web user interface (UI) hosted in an Elastic Container Service (ECS) cluster, an API layer hosted in its own ECS cluster, and a Relational Database Service (RDS) data store as the back end. In this web application, users might perform both reads and writes against the database, such as retrieving their saved user profile details and updating their email notification preferences.
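For readers who think better in code than in diagrams, here is a minimal sketch of what this three-tier setup might look like as infrastructure-as-code, using the AWS CDK in Python. The construct names, container images, and engine version are assumptions for the fictional application, not details from the post, and a real deployment would also need networking, security, and service-to-service wiring.

```python
# Hypothetical AWS CDK (v2, Python) sketch of the fictional three-tier app:
# a web UI service on ECS, an API service on ECS, and an RDS back end.
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from aws_cdk import aws_rds as rds
from constructs import Construct


class ThreeTierAppStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "AppVpc", max_azs=2)
        cluster = ecs.Cluster(self, "AppCluster", vpc=vpc)

        # Web UI tier: a load-balanced Fargate service serving the front end.
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "WebUiService",
            cluster=cluster,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("example/web-ui"),  # placeholder image
            ),
        )

        # API tier: a second load-balanced Fargate service, internal-facing.
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "ApiService",
            cluster=cluster,
            public_load_balancer=False,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("example/api"),  # placeholder image
            ),
        )

        # Data tier: a single RDS instance as the system of record.
        rds.DatabaseInstance(
            self, "AppDatabase",
            engine=rds.DatabaseInstanceEngine.postgres(
                version=rds.PostgresEngineVersion.VER_14,
            ),
            vpc=vpc,
        )
```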

It’s helpful to go into the Failure Modes and Effects Analysis meeting with a high-level understanding of both the key technical components of the system and the business process flow, if possible. By completing as much of the “observation” step as we can asynchronously, we are able to maximize our synchronous time in the meeting and jump right into asking and answering questions about the system’s expected behavior.

Step 2: Question

After reviewing the pre-read, we feel prepared to discuss the simple system architecture in our Failure Modes and Effects Analysis (FMEA) meeting. The meeting typically has one facilitator who may or may not also be the scribe. The facilitator isn’t necessarily a member of the application team who built this system — in fact, it can be helpful when this person is an unbiased third party, as their questions may probe at unanticipated failure modes that the rest of the team hadn’t yet considered.

The facilitator starts by sharing the architecture diagram that was previously distributed and establishing ground rules for the scope of the discussion. For example, failure modes related to external request ingress and authentication/authorization may be left out of scope for a discussion if those patterns are already well established, so that the focus can be placed on new application components that have never been discussed before. Once the ground rules are set, the group begins traversing the architecture diagram component by component, led by the facilitator asking questions about the different ways each one might fail.

Anyone in the room is welcome to ask questions at any time. We have always encouraged the most junior members of our teams to attend FMEA meetings and ask questions when something is unclear to them. The FMEA meeting is a space where all questions are welcome, which makes it a great opportunity to gain a deeper understanding of how a system works. Through examination of the “unhappy path” scenarios, participants will generally leave with a better understanding of the “happy path,” too!

Though a real FMEA meeting would cover questions about each layer of the architecture, for the purposes of this blog post, we’ll focus on one key question raised about the data layer: In a hypothetical situation, what would happen if the RDS database became unavailable?

Step 3: Hypothesis

Based on what the team knows about the system already, they’ll discuss the answers to each question, and ultimately develop hypotheses when the group comes to a consensus. Of course, people may not always agree when deliberating expected system behavior. Quite often, this is where the most value can be gleaned from these meetings. It’s important to note any points of contention or uncertainty and mark them for further research, possibly through experimentation.

One thing notably omitted from our FMEA discussions is quantitative analysis. When we first introduced the FMEA at Vanguard, we included three quantitative dimensions for each failure mode, which we estimated on a scale from 1 to 10. Those dimensions were the likelihood of occurrence, the severity if the failure mode did occur, and the difficulty of detecting such a fault. While the purpose of these estimates was to deepen our understanding and aid with prioritization of test cases, the inclusion of quantitative estimation in the discussion more often led to splitting hairs and talking in circles, debating whether a certain failure mode was a five or a six on the likelihood scale for far longer than was valuable. Since the estimates were often not much better than ‘guesses,’ we didn’t believe that there was sufficient return on investment (ROI) for the excessive time we had invested. By eliminating the estimation from the FMEA exercise, we drastically reduced the time spent in the meetings while retaining the bulk of the value.
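The post doesn’t say how those three estimates were combined, but in classic FMEA practice they are often multiplied into a Risk Priority Number (RPN) used to rank failure modes. Purely as an illustration of the kind of scoring step that was dropped, a sketch might look like the following; the failure modes and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    likelihood: int     # 1 (rare) to 10 (almost certain)
    severity: int       # 1 (negligible) to 10 (catastrophic)
    detectability: int  # 1 (easily detected) to 10 (hard to detect)

    @property
    def risk_priority_number(self) -> int:
        # Classic FMEA ranks failure modes by the product of the three estimates.
        return self.likelihood * self.severity * self.detectability


# Illustrative entries only; in practice the estimates were little better than
# guesses, which is why the quantitative step was dropped.
modes = [
    FailureMode("RDS database becomes unavailable", 3, 8, 2),
    FailureMode("API ECS tasks fail health checks", 5, 6, 3),
]

for mode in sorted(modes, key=lambda m: m.risk_priority_number, reverse=True):
    print(f"{mode.risk_priority_number:>4}  {mode.description}")
```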

In our hypothetical scenario, the team is discussing how the system would behave if the database became unavailable. We can summarize our conjectures in the typical “if this, then that” hypothesis format. If the API layer can’t communicate with the database, then any write actions will fail. However, due to an in-memory cache built into the API layer, the same behavior is not expected in the read scenario. If the API layer can’t communicate with the database, then read actions will continue to succeed for a while, as long as we can serve up the in-memory cached data. This hypothesis is documented in the excerpt below from some fictional FMEA meeting notes.

Simple failure modes and effects analysis output
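To make the hypothesis concrete, here is a rough sketch of the kind of read and write paths it assumes in the API layer: a database call wrapped in a fallback to in-memory cached data. The function names, cache structure, and TTL are hypothetical; the real application’s caching implementation isn’t described in this post.

```python
import time

# Hypothetical in-memory cache inside one API task: profile data keyed by user ID.
_profile_cache: dict[str, dict] = {}
_CACHE_TTL_SECONDS = 300


def get_user_profile(user_id: str, db) -> dict:
    """Read path assumed by the hypothesis: prefer the database,
    fall back to cached data if the database is unreachable."""
    try:
        profile = db.fetch_profile(user_id)  # hypothetical data-access call
        _profile_cache[user_id] = {"data": profile, "cached_at": time.time()}
        return profile
    except ConnectionError:
        cached = _profile_cache.get(user_id)
        if cached and time.time() - cached["cached_at"] < _CACHE_TTL_SECONDS:
            return cached["data"]  # reads keep succeeding for a while
        raise  # no usable cache entry, so the read fails too


def update_notification_preferences(user_id: str, prefs: dict, db) -> None:
    """Write path: no fallback, so writes fail while the database is down."""
    db.save_preferences(user_id, prefs)  # hypothetical data-access call
```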

Step 4: Experiment

After our FMEA meeting notes have been published and distributed throughout the team, it’s time to decide what type of experiment we’d like to run in our gameday activity on Thursday and prepare our environment for testing.

We came out of the FMEA with a lengthy list of documented hypotheses, and we don’t have time to test every single one. To prioritize our list, we start with scenarios where we expect the system to remain available. There isn’t much value in confirming that, yes, an outage of a certain component is probably going to cause us to have a rough day. There’s always a chance that our systems may surprise us with unexpected behavior, but it is better to be surprised by a system’s resilience than to be surprised by a client-impacting incident that we never anticipated.

With that in mind, the scenario from our fictional FMEA that we’ll test is the brief database outage. While we work to bring the database back online, we want to confirm our hypothesis that our application will still be able to serve clients’ read requests from the API’s in-memory cached data.

At Vanguard, we can test failure scenarios like this one in the non-production environment with our homegrown tool for chaos experimentation, the “Climate of Chaos,” which has been highlighted in multiple external presentations, including at SRECon in 2020. The specific tool that is used for fault injection may vary across organizations, as there are many options on the market, including large-scale vendor products, open-source libraries, developing custom scripts or toolkits, and manual efforts. For example, a team looking to test their own system against a brief RDS outage in non-production may be able to do so with just a few clicks in the AWS console.
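As one concrete (and hypothetical) version of that last option, a brief non-production RDS outage can be injected and reverted with a short script against the AWS API, for example with boto3 in Python. The instance identifier below is a placeholder, and this is not how the “Climate of Chaos” itself works; it is just a minimal stand-in for a fault-injection tool.

```python
import time

import boto3

rds = boto3.client("rds")
DB_INSTANCE_ID = "nonprod-app-db"  # placeholder identifier


def inject_database_outage() -> None:
    """Stop the non-production RDS instance to simulate a database outage."""
    rds.stop_db_instance(DBInstanceIdentifier=DB_INSTANCE_ID)


def revert_database_outage() -> None:
    """Bring the database back online and wait until it reports 'available'."""
    rds.start_db_instance(DBInstanceIdentifier=DB_INSTANCE_ID)
    while True:
        status = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)[
            "DBInstances"][0]["DBInstanceStatus"]
        if status == "available":
            break
        time.sleep(30)
```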

A key component of any good resilience test is generating load, so before we inject our fault, we’ll use our performance testing tool, built with Locust.io, to start sending some low- to moderate-volume load at our system, performing a combination of writes and reads mimicking the expected real-user behavior.
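A minimal Locust load profile for this scenario might look something like the sketch below. The endpoints and task weighting are assumptions for the fictional application rather than an actual test plan; a real profile would mirror observed production traffic.

```python
from locust import HttpUser, task, between


class AppUser(HttpUser):
    # Simulated users pause between actions, like real clients would.
    wait_time = between(1, 5)

    @task(3)  # reads outweigh writes in the assumed traffic mix
    def read_profile(self):
        self.client.get("/api/profile/12345")  # hypothetical endpoint

    @task(1)
    def update_preferences(self):
        self.client.put(
            "/api/profile/12345/notifications",  # hypothetical endpoint
            json={"email_opt_in": True},
        )
```

Running something like `locust -f loadtest.py --host https://nonprod.example.internal --headless -u 50 -r 5` would then drive a steady, modest stream of reads and writes while the fault is injected (host and numbers illustrative).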

Step 5: Analysis

After we trigger our experiment using the “Climate of Chaos” tool to temporarily shut down our database, we’ll analyze the system’s behavior using the available telemetry to see whether it aligns with our expectations. In our hypothetical scenario, we’ll look at the responses to read and write requests from the client side and any error codes reported by the load balancers in front of the two ECS clusters. We’ll also take note of saturation metrics and scaling events for our clusters, and we’ll leverage our application logs and distributed tracing to better understand any failing requests.
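Much of that telemetry likely lives in whatever observability stack a team already uses; as one hedged example of pulling the raw numbers programmatically, the sketch below queries CloudWatch for load balancer 5XX counts and API-cluster CPU saturation during the experiment window. The dimension values are placeholders for the fictional system.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)  # the experiment window


def metric_stats(namespace, metric_name, dimensions, stat):
    return cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=[stat],
    )["Datapoints"]


# 5XX errors reported by the load balancer in front of the API cluster (placeholder dimension).
api_errors = metric_stats(
    "AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
    [{"Name": "LoadBalancer", "Value": "app/api-alb/0123456789abcdef"}], "Sum")

# CPU saturation on the API service (placeholder cluster and service names).
api_cpu = metric_stats(
    "AWS/ECS", "CPUUtilization",
    [{"Name": "ClusterName", "Value": "api-cluster"},
     {"Name": "ServiceName", "Value": "api-service"}], "Average")
```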

In our hypothetical experiment, things don’t seem to go as planned. Initially, the system behavior matches our expectations: reads are succeeding while writes produce errors. But after a minute or two of downtime, the system starts behaving in a way we didn’t expect. Saturation metrics are through the roof on our ECS cluster hosting the API layer, and auto scaling can’t keep up with the demand. ECS tasks start crashing left and right, and the new tasks that start up as part of auto scaling and task replacement don’t have any cached data. The new tasks become overwhelmed with failing requests and the entire system begins to thrash. It doesn’t take long before all requests are failing, reads and writes alike!

Seeing the deviation from our expectations, we end our experiment early and bring our non-production RDS database back online. After a few minutes of recovery time, the system stabilizes, and we’re able to spend some time figuring out where we went wrong. As it turns out, our retry logic in the Web UI isn’t very smart. Every failed write request was being retried in perpetuity, with no exponential backoff and no limits. Since all write requests were failing, this quickly snowballed into a retry storm that overwhelmed our API layer’s ECS tasks, spiking saturation metrics and causing them to crash faster than they could scale. Each crashed task took out the in-memory cached data along with it, and replacement tasks spun up by auto scaling weren’t able to access the offline database to populate their own caches, leading to the failure of both write and read requests from that point on.
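In rough terms (shown in Python for consistency with the other sketches here, even though the real logic lives in the Web UI), the failure pattern boils down to an unbounded retry loop like this hypothetical one:

```python
import requests


def save_preferences_naive(url: str, prefs: dict) -> None:
    """The problematic pattern: retry forever, immediately, on every failure."""
    while True:
        try:
            response = requests.put(url, json=prefs, timeout=5)
            if response.ok:
                return
        except requests.RequestException:
            pass
        # No backoff, no jitter, no retry cap: every client immediately
        # re-sends each failed write, so failures multiply into a retry storm.
```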

Step 6: Conclusion

Bringing the database back online allowed us to eventually stabilize the system, but the team knows that we can improve the behavior we observed during our experiment. Now, we’re able to draw some conclusions about the current state of our system and plan some actions to take to improve. We know now that we’re not as resilient to a database outage as we thought, and we can take steps such as implementing a circuit breaker and better retry logic to prevent the retry storm scenario that we encountered. How will we know that our improvements were effective at resolving the issues we observed? We’ll run another experiment, of course. The scientific method can be repeated over and over again as we continue to learn about our ever-changing systems.
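As a hedged sketch of what the improved client-side behavior might look like, the snippet below shows bounded retries with exponential backoff and jitter, so a failing dependency sees a trickle of retries rather than a storm. A full circuit breaker, which stops calling the dependency entirely after repeated failures, would typically come from a shared library rather than being hand-rolled; this only illustrates the retry half of the fix, with hypothetical names throughout.

```python
import random
import time

import requests


def save_preferences_with_backoff(url: str, prefs: dict, max_attempts: int = 4) -> bool:
    """Bounded retries with exponential backoff and jitter (illustrative only)."""
    for attempt in range(max_attempts):
        try:
            response = requests.put(url, json=prefs, timeout=5)
            if response.ok:
                return True
        except requests.RequestException:
            pass
        if attempt < max_attempts - 1:
            # Exponential backoff (1s, 2s, 4s, ...) plus jitter so that
            # clients don't all retry at the same instant.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
    return False  # give up and surface the error instead of retrying forever
```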

Below is an example documentation excerpt from the hypothetical completed chaos experiment.

Sample documentation and action plan

In addition to making our systems more resilient through technical implementation, it’s important that we always remember to document our work so that these learnings aren’t lost and the experiment can be repeated in the future. We’ll write down every step that we followed, including any preparatory actions we took to make our non-production environment more prod-like, a link to the script we used to produce load during the test, and the instructions to inject the database fault with the “Climate of Chaos.” We’ll include screenshots of the saturation metrics, error codes, and scaling events that we tracked. This will provide a helpful basis of comparison in future tests conducted long after our observability systems have purged the old telemetry data. All of these notes are published to our internal library of experiments so that any other application teams using similar architecture patterns can learn from our experimentation.

Though the scenario in this blog post wasn’t a real story from a Vanguard application team, it was inspired by the many very real meetings and experiments that we have conducted for our applications and platforms over the past several years. The practices of Failure Modes and Effects Analysis and Chaos Experimentation have become key components of our software development lifecycle for systems with critical availability needs. We hope that this blog post will inspire you to try out the scientific method for resilience in your own organization.

Come work with us!
Vanguard’s technologists design, architect, and build modernized cloud-based applications to deliver world-class experiences to 50+ million investors worldwide. Hear more about our tech — and the crew behind it — at vanguardjobs.com.

©2024 The Vanguard Group, Inc. All rights reserved.
