Chaos Engineering — Looking back to look forward

Thoughts on chaos engineering and AWS Fault Injection Simulator

Adrian Hornsby
Feb 10 · 13 min read

In this blog post, I want to look back a little bit because, as Rachel Carson said:

“To understand the living present and the promise of the future, it is necessary to remember the past.” — Rachel Carson

I love this quote and the work Rachel has done for our world — and today, I want to apply her wisdom to chaos engineering.

As you probably already know, Chaos engineering is the process of:

1) stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling

2) observing how the system responds, and

3) implementing improvements.

And we do that to prove or disprove our assumptions about our system’s capability to handle these disruptive events. But rather than let those disruptive events happen at 3 am, during the weekend, and in prod, we create them in a controlled environment and during working hours.

Resilience — one of my favorite topics — is also the most commonly quoted benefit of Chaos Engineering, and rightly so.

But today, as I am looking back, I would like to say that improving resilience using chaos engineering is just the tip of the iceberg.

Chaos engineering is not simply about improving the resilience of your system.

It also helps:

Expose monitoring, observability & alarm blind spots

Improve recovery time and operational skills

To name a few.

But most importantly, chaos engineering is a trigger for remarkable cultural changes.

Let me ask you a couple of questions:

Do you remember your first outage?

How smart did you feel back then?

How confident?

I was very nervous and sweating a lot.

My heart rate was through the roof.

I started to panic and couldn’t think straight. I felt very unprepared.

Decisions that normally would seem easy, seemed difficult and unclear.

I started making mistakes I normally wouldn’t.

It was a disaster.

I had no idea what I was doing.

It took me an awful lot of time to figure out the issues. I didn’t know where to start and what to look for.

I poked around for a long time, trying to get a sense of what could be wrong.

After what felt like an eternity, I found the issue: the worker_connections setting on the load balancers exceeded the open-file resource limit set in the limits.conf file of the Linux OS.
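To make that failure mode concrete, here is a minimal sketch of the kind of check that surfaces it. It is an illustration rather than the procedure used during the outage: the function names and the sample value of 4096 are hypothetical, and it assumes a Unix host where Python's resource module is available.

```python
import resource


def exceeds_open_file_limit(worker_connections: int, soft_limit: int) -> bool:
    """Return True when the configured connections cannot fit under the limit."""
    # An unlimited soft limit can never be exceeded.
    if soft_limit == resource.RLIM_INFINITY:
        return False
    # Every open connection needs at least one file descriptor.
    return worker_connections > soft_limit


def current_open_file_limit() -> int:
    """Return the soft open-file limit (RLIMIT_NOFILE) of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    configured = 4096  # hypothetical worker_connections value
    limit = current_open_file_limit()
    if exceeds_open_file_limit(configured, limit):
        print(f"worker_connections={configured} exceeds the open-file limit ({limit})")
    else:
        print(f"worker_connections={configured} fits within the open-file limit ({limit})")
```

The comparison is deliberately conservative: a proxying load balancer typically holds more than one descriptor per client connection, so a real check would leave extra headroom.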

Today, it would probably take me only a few minutes to get a feel for what was wrong back then.

I wasn’t trained to fix and recover from outages.

Did you learn how to survive an outage at school?

All of us get trained in the field, during real outages. And that’s the real problem!

And that, too, took me a while to understand.

A few years after the “I have no idea what I am doing” event, I came across this fantastic talk from Jesse Robbins.

“GameDay: Creating Resiliency Through Destruction” — Jesse Robbins

In the early 2000s, Jesse Robbins, whose official title was Master of Disaster at Amazon, created and led a program called GameDay, a program inspired by his experience training as a firefighter.

Firefighters — these highly trained specialists risk their lives every day fighting fires.

Did you know that one needs to spend approximately 600 hours in training before becoming an active-duty firefighter?

And that’s just the beginning. After that, some firefighters — according to reports — spend over 80% of their active-duty time in training.

Why?

To acquire that lifesaving intuition, they need to train hour after hour.

Like the old adage says, practice makes perfect.

“They seem to get inside the head of the fire, sort of like a Dr. Phil for a fire” —Helen Snyder, Fighting Wildfires With Computers and Intuition.

Jesse Robbins created the GameDay to increase Amazon retail website resilience by purposely injecting failures into critical systems.

During these exercises, tools and processes such as monitoring, alerts, and on-call rotations were used to test and expose flaws in the standard incident-response capabilities. GameDay became very good at exposing classic architectural defects and sometimes also what are called “latent defects”: problems that appear only because of the failure you’ve triggered.

For example, incident management systems critical to the recovery process fail due to unknown dependencies triggered by the fault injected.

As a former volunteer firefighter, Jesse Robbins brought an emergency-responder mindset to his work.

He taught Amazon a few things.

First, with data centers distributed worldwide, a large e-commerce business, and even larger fulfillment operations, some unpredictable and spectacular failures were inevitable. And rather than trying to avoid failures, Jesse Robbins and his GameDays made it safe for Amazon to fail.

Second, that distributed, modularized applications are robust and enable a high level of reliability, availability, and resilience; a good foundation for what became AWS.

In that talk, Jesse taught me why it took me ages to identify the outage issue described earlier.

He said:

“Just as firefighters train to build an intuition to fight live fires, the goal with GameDays was to help the team build an intuition against live, large-scale catastrophic failures.” — Jesse Robbins.

The question now is how?

How do you build intuition?

Today, the average person consumes approximately 50GB of data every day.

How do we deal with the complexity derived from this abundance of information? How do we cut through it all and make essential decisions in life and business?

Intuition has received increasing interest from scientists in the past few decades. And what they found is that intuition does exist.

Scientist and Nobel Prize winner Daniel Kahneman wrote the book Thinking, Fast and Slow.

The book explains how the brain has two different operating systems: system one and system two.

System one is fast thinking and often dictates our subconscious way of operating. It is where feelings and intuition dominate.

System two is our slower, more analytical way of operating.

System one has been shown to know the right answer in a wide range of situations long before system two.

E.g., people who had to make a quick, intuitive decision about their car purchase later turned out to be satisfied with it 60% of the time. Meanwhile, buyers who had a lot of time to think carefully about all the different options were happy only 25% of the time.

Our intuition is based on emotion, and it derives from things that have already happened to us, for example, when we’re kids — you know, when we burn ourselves on something hot.

Our intuition remembers, and the more we use it, the more accurate it can become.

However, I recently read the book Range by David Epstein, in which he says,

“In a wicked world, relying upon experience from a single domain is not only limiting, it can be disastrous.”― David Epstein, Range: Why Generalists Triumph in a Specialized World

“Modern work demands knowledge transfer: the ability to apply knowledge to new situations and different domains. Our most fundamental thought processes have changed to accommodate increasing complexity and the need to derive new patterns rather than rely only on familiar ones.”

And that is true at Amazon itself with the 2-pizza teams, in startups, small and medium-sized companies, or large enterprises.

Do your teams a favor, and let them gain experience in a wide range of skills, for only then will they have acquired the knowledge required and be able to learn and derive new patterns during these unknown outages.

David Epstein calls that Interleaving.

Interleaving refers to mixing up the types of problems you train on so that you don’t know what kind of situation is coming next. So, instead of practicing the same procedure over and over again, you’re forced to understand the structure of the problem.

After all, how many times have you experienced the same outage twice? (assuming you have fixed it the first time, of course)

Coming back to intuition. According to science, to build an intuition bank — you need to:

1- Increase your awareness

2- Fight biases

3- Train

Let’s dive into each of these.

1- Increasing your awareness

Engage with people, ask questions, look at the data — but most importantly, listen to anecdotes.

We do that a lot at Amazon.

While Amazon uses plenty of metrics to make decisions, Jeff Bezos — founder and CEO of Amazon — explained why he reads customer emails and forwards them to the appropriate executive.

“I’m a big fan of anecdotes in business. Often, the customer anecdotes are more insightful than data. […] I’ve noticed that when the anecdotes and the metrics disagree, the anecdotes are usually right”.

I often get asked: “Where can we start?” or “What hypothesis should we start with?”

My favorite answer has become: “listen for anecdotes” from the team.

Ask your development and operations teams:

“What are you worrying about?”

That is the single most powerful question — because often, if your team worries, it is intuition-based — and it is serious.

So serious that this question — “What are you worrying about?” — is part of every Operational Readiness Review (ORR) we do at Amazon before launching a new service.

Listen to these anecdotes. If several team members worry about the same thing, rank it up. And then design experiments that address these concerns.

Verify and never assume!
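As a small illustration of ranking those anecdotes, the sketch below counts how many people raised each concern and sorts the result. The names and worries are made up for the example, and counting mentions is only one possible ranking rule.

```python
from collections import Counter


def rank_worries(worries_by_person: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Rank worries by how many different people mentioned them."""
    counts: Counter = Counter()
    for worries in worries_by_person.values():
        # Use a set so one person repeating a worry only counts once.
        counts.update(set(worries))
    # Most mentioned first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical answers to "What are you worrying about?"
anecdotes = {
    "dev_a": ["database failover", "cache stampede"],
    "dev_b": ["database failover", "certificate expiry"],
    "ops_a": ["database failover", "cache stampede"],
}

for worry, mentions in rank_worries(anecdotes):
    print(f"{mentions} mention(s): {worry}")
```

The worry at the top of the list is a natural candidate for the first experiment hypothesis.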

All these years, I have learned that listening to intuition has never been a waste of time. Quite the opposite — BUT — there is always a BUT.

Have you heard the saying?

“Dogs Not Barking”

The reference is to a Sherlock Holmes story by Sir Arthur Conan Doyle called The Adventure of Silver Blaze. At one point, Inspector Gregory asks Holmes:

Gregory: Is there any point to which you would wish to draw my attention?

Holmes: To the curious incident of the dog in the night-time.

Gregory: The dog did nothing in the night-time.

Holmes: That was the curious incident.

The lesson of ‘dogs not barking’ is to pay attention to what isn’t there, not just what is.

Absence is just as important and just as telling as presence.

Paying attention to absence requires intentional focus. So, stop and also ask yourself:

“What are you NOT worrying about?”

If you think about it, increasing our awareness is a higher form of improving our monitoring and observability, one that is not limited to software engineering.

Now let’s talk about biases.

2- Fighting biases

Interestingly — intuition often gets confused with bias.

Intuitions like: “It never happened before, it shouldn’t fail” — this is where intuition gets confused with bias.

Intuition is our genius insight, but we shouldn’t blindly follow it.

How to distinguish bias from intuition?

When you hear an overly optimistic intuition, subject it to testing and validation as much as the rest, or even more. Look for evidence that the intuition might be wrong.

Ask yourself if it looks like something you’ve seen before. If it doesn’t, that’s a warning sign that it might not be intuition but biased thinking.

The two most common biases relevant here are confirmation bias and the sunk cost fallacy.

The confirmation bias is “the tendency to search for, interpret, favor, and recall information that confirms or supports one’s prior personal beliefs or values.”

The sunk cost fallacy is “the tendency for people to believe that investments justify further expenditures.”

Experimenting on a system, using chaos engineering, helps fight and challenge our assumptions and biases.

But be careful, biases are tricky and even dangerous because of the different factors that influence them. This is particularly true when team members are deeply invested in a specific technology or when they’ve already spent a lot of time “fixing” things.

So remember, handle with care. We all love our code; it is our baby.

And finally, training.

3- Training

To make it easy to train and improve intuition, you have to remove friction.

Our customers have told us that they feel it is hard to get started with chaos engineering.

Hard because currently, you have to stitch together different tools, scripts, and libraries to cover the full spectrum of faults you can inject into a system.

Hard because you often have to know programming languages and OS-specific functions.

The infrastructure, network, and applications have different tools to do the job, whether it is a library or an agent.

Customers often don’t want to install anything extra in their applications — more things to install means more complexity.

And I am guilty here. Just take a look at the different tools, libraries, and scripts I myself have published in the last few years.

How easy is it to get started with all of that?

It is also challenging to ensure a safe environment to inject faults. Ideally, you want your tools to stop automatically and roll back if any alarm goes off. You also want them to integrate nicely with your monitoring solution.
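The stop-and-roll-back behavior described above can be sketched generically. This is not how any particular tool implements it: run_experiment and its parameters are hypothetical names, and the alarm is a plain callable standing in for a real monitoring query.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    alarm_is_firing: Callable[[], bool],
    checks: int,
    interval_seconds: float = 0.0,
) -> str:
    """Inject a fault, poll an alarm, and always roll back at the end."""
    inject()
    try:
        for _ in range(checks):
            if alarm_is_firing():
                # Stop early: the blast radius is larger than expected.
                return "stopped"
            time.sleep(interval_seconds)
        return "completed"
    finally:
        # Runs on early stop, normal completion, and unexpected errors alike.
        rollback()


# Simulated run: the alarm fires on the third check.
events: list[str] = []
readings = iter([False, False, True])
outcome = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    alarm_is_firing=lambda: next(readings),
    checks=5,
)
print(outcome, events)  # stopped ['inject', 'rollback']
```

Putting the rollback in a finally block is the key design choice: the fault is removed even if the monitoring call itself raises an error.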

And finally, some faults are hard to reproduce.

It is essential to realize that outages rarely happen because of one single fault. It is often a combination of small faults occurring simultaneously or in a sequence, and that’s just hard to reproduce.

Just as firefighter training is most valuable when it feels like a real live fire, the benefits of chaos engineering experiments correlate with how accurately they reproduce real-world situations.

To remove friction and let customers train and learn about their systems as easily as possible, we recently announced AWS FIS at re:Invent 2020.

AWS Fault Injection Simulator

AWS Fault Injection Simulator (FIS) is a fully managed chaos engineering service.

It makes it easier for teams to discover an application’s weaknesses. It helps you uncover hidden issues, verify your system’s performance, and improve its observability and resilience.

More importantly, FIS eliminates the need to manage complex tooling.

No need to maintain different libraries, scripts, or agents. Everything is in one place now.

Easy to get started

Getting started with the service is very easy; this is something we have put a lot of effort into since, as mentioned previously, our customers have repeatedly told us it was hard to get started with chaos engineering.

You can use the console to get familiar with the service — try things out, explore.

You can then use the CLI to take full advantage of the templates and integrate the service with your CI/CD pipeline. This will enable you to repeatedly test the impact of failure as part of your software delivery process, something Casey Rosenthal has called continuous verification.

The templates are JSON or YAML files that you can share with your team; you can version-control them to benefit from all the best practices associated with code review.
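To give a feel for what such a template might contain, here is a minimal sketch that builds one as a Python dictionary and prints it as JSON. The field names follow the publicly documented experiment template format as best understood here and should be checked against the current FIS documentation. The account ID, ARNs, tag values, and the undefined_targets helper are placeholders invented for the example.

```python
import json

# Hypothetical template: stop one tagged EC2 instance, restart it after 2 minutes.
template = {
    "description": "Stop one test instance and observe recovery",
    "targets": {
        "testInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Environment": "test"},
            "selectionMode": "COUNT(1)",
        }
    },
    "actions": {
        "stopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT2M"},
            "targets": {"Instances": "testInstances"},
        }
    },
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            # Placeholder ARN: replace with a real alarm before use.
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate",
        }
    ],
    # Placeholder role: FIS needs a role it can assume to act on resources.
    "roleArn": "arn:aws:iam::111122223333:role/ExampleFisRole",
    "tags": {"Team": "example"},
}


def undefined_targets(experiment: dict) -> list[str]:
    """List action target references that are not declared under 'targets'."""
    declared = set(experiment.get("targets", {}))
    missing = []
    for action_name, action in experiment.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in declared:
                missing.append(f"{action_name} -> {target_name}")
    return missing


print(json.dumps(template, indent=2))
print("undefined targets:", undefined_targets(template))
```

A small check like undefined_targets is the sort of thing that can run in code review or CI before a template is ever submitted to the service.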

Real-world conditions

You can run your experiments in sequence or in parallel. Sequences are used to test the impact of gradual degradation, like a sequence of actions to increase latency gradually, and parallel experiments are used to verify the effect of multiple concurrent issues, which is often how real-world outages happen.
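The difference between the two modes can be sketched with a small scheduling function. It assumes that ordering is expressed per action as a list of actions to wait for, which is how the startAfter field of an FIS template is understood to work. The action names are hypothetical, and grouping actions into waves is a simplification for illustration.

```python
def start_waves(actions: dict[str, list[str]]) -> list[list[str]]:
    """Group actions into waves from their start-after dependencies.

    Actions in the same wave run in parallel. This is a simplification:
    a real scheduler may start an action as soon as its own dependencies
    finish, without waiting for the rest of the previous wave.
    """
    remaining = dict(actions)
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(
            name for name, deps in remaining.items() if set(deps) <= done
        )
        if not ready:
            raise ValueError("cyclic or unknown start-after dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Gradual degradation: each latency step waits for the previous one.
gradual = {
    "latency_100ms": [],
    "latency_500ms": ["latency_100ms"],
    "latency_1s": ["latency_500ms"],
}
# Concurrent issues: no dependencies, so everything starts together.
concurrent = {"cpu_stress": [], "api_throttling": []}

print(start_waves(gradual))     # [['latency_100ms'], ['latency_500ms'], ['latency_1s']]
print(start_waves(concurrent))  # [['api_throttling', 'cpu_stress']]
```

Three waves of one action each model the gradual sequence, while a single wave of two actions models the concurrent case.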

FIS actions are identical to real-world events; for example, memory is actually consumed, and API requests actually return 500s.

Supported services at launch will be EC2, RDS, ECS, and EKS, and more will follow, so you can create complex conditions that span across a large set of AWS services.

Safeguards

Safeguards act as the automated stop button: a way to monitor the blast radius of the experiment and make sure that it is contained, and that failures created by the experiment are rolled back if alarms go off.

IAM controls with tag-based policies can also be used to control which fault types are used and which resources can be affected.

To learn more about AWS FIS, I invite you to check my re:Invent talk available on YouTube — it is a deep dive with plenty of demos.

More recently, I talked about FIS with my colleague and friend Gunnar Grosch. In that episode, I demoed some of the new control plane fault injections.

And finally, go watch Laura Thomson, Sr. Product Mgr for AWS FIS, launching the service live on Twitch, seconds after Werner Vogels announced it during his re:Invent keynote.

Wrapping up

In summary, we have seen that to increase our intuition and fully embrace the benefits of chaos engineering, we have to:

1 — Increase our awareness — That is the broader domain of monitoring and observability.

2 — Set up hypotheses based on intuitions and avoid biases. Remember that dogs don’t always bark.

3 — Use tooling that promotes training and removes friction.

To conclude, I would like to leave you with a quote from Brian Tracy. He beautifully said:

“It is not failure itself that holds you back; it is the fear of failure that paralyzes you.”

Practicing chaos engineering has been the greatest and most gratifying experience of my professional life.

While the discipline itself is amazing, the people and the community surrounding it are where the diamonds are.

That’s it for now, folks!

-Adrian
