Resiliency and Chaos Engineering — Part 6

Pradip VS
10 min read · Mar 30, 2022

--

In this part, we will talk about Chaos Engineering and its phases: who can do it, how an organization can plan for it by bringing in the right teams, and its benefits.

This part is a continuation of Parts 1, 2, 3, 4, and 5. Kindly go through them for broader context.

Firefighters and firefighting are commonly used as an analogy for Chaos Engineering. Source: Bing

Before talking about Chaos Engineering, there are two aspects I would like to touch upon, as they give context on how it improves resiliency.

First and foremost is resiliency engineering (Parts 1 to 5 of this series), where we saw various patterns and best practices that reduce or minimize failures. But once you have applied those best practices, how will you evaluate them? To test whether they are working, that is, reducing failure rates or helping the system recover from failures quickly, we need a technique. We cannot wait until the next production outage to verify it. We need some form of simulation (ideally in the production environment) to verify that our architecture and systems are resilient, and that is where Chaos Engineering, or chaos testing principles and practices, helps.

Some posts speak of resiliency engineering and chaos engineering as the same thing, but there is a difference, and it is key to understand it. Resiliency engineering applies architectural best practices to reduce the system's failures, while chaos engineering helps validate them by running experiments.

E-commerce apps have many, many moving parts. Source

One more key thing to note: the e-commerce giant I work closely with performs many stress tests as well as resiliency tests throughout the year. A stress test checks the robustness of the system, that is, how far the systems can stretch during an event (they usually use OPM (Orders Per Minute) and PVH (Page Views per Hour) as the base metrics to validate it), while resiliency tests check how well their systems recover after failures, or how quickly the system fails over to and fails back from another region in the event of a failure. Understanding these differences is important, as there are a lot of moving parts in a complex application like e-commerce.

Let us talk about firefighting!

Firefighting principles apply very much to Chaos Engineering. Source: BabyBus

I highly recommend watching this video on Game Days and firefighting — GameDay: Creating Resiliency Through Destruction — YouTube

Why is firefighting compared with chaos engineering?

Before becoming active-duty firefighters, trainees go through 600 hours of rigorous training, and firefighters spend 80% of their active-duty time in training.

Why?

Because they need to build an intuition for fire.

A fire in a forest needs to be handled with a different strategy than a fire in the middle of the ocean caused by an oil spill, or a fire in a skyscraper. Even though all of them are fires, the extinguishing strategy is not the same. We need a different intuition for different fire scenarios.

Just as firefighters train to build an intuition for fighting live fires, the goal of Chaos Engineering is to help teams build an intuition for handling live, large-scale catastrophic failures.

Intuition plays a key role in making chaos engineering successful, but to avoid biases we need to do the following:

  1. Engage with different people and teams to increase awareness.
  2. Create hypotheses backed by data. There should be no reasoning like "I feel like this, so this will not happen"; every hypothesis needs data backing it.
  3. Help the teams with training and provide tools like Azure Chaos Studio so they can validate the various hypotheses.

Chaos Engineering

There are various definitions; let us look at a few popular ones below.

Chaos engineering is the practice of subjecting a system to the real-world failures and dependency disruptions it will face in production.

http://principlesofchaos.org

Fault injection is the deliberate introduction of failure into a system in order to validate its robustness and error handling.

Netflix Blog
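
To make the fault-injection idea concrete, here is a minimal Python sketch (my own illustration, not taken from either source above): a decorator that deliberately injects extra latency or an exception into a small fraction of calls, which is exactly the kind of failure these definitions describe. The failure rate and the get_recommendations function are made up for the example.

```python
import random
import time
from functools import wraps

def inject_fault(failure_rate=0.10, extra_latency_s=2.0):
    """Deliberately fail or slow down a fraction of calls to validate
    the caller's robustness and error handling (illustrative only)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate / 2:
                # Simulate a hard dependency failure.
                raise ConnectionError("chaos: injected dependency failure")
            if roll < failure_rate:
                # Simulate a slow dependency instead of a failed one.
                time.sleep(extra_latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.10)
def get_recommendations(user_id: str) -> list:
    # Placeholder for a real downstream call.
    return ["movie-1", "movie-2"]

try:
    print(get_recommendations("user-42"))
except ConnectionError as err:
    print(f"caller must handle this gracefully: {err}")
```

What the experiment really exercises is the caller's timeout, retry, and fallback behavior when such a fault fires.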

Chaos engineering can be used for a wide variety of resilience validation scenarios — shift right or shift left

Shift right — preparing for or running in production. Shift left — ensuring resilience earlier in the development cycle

Shift right — needs real user traffic and real customer data:

Targeted HA drills

BCDR

Game Days

AZ / Region outages

Validate on-call, live site process, monitoring

Shift left — uses no load or simulated load, or a small percentage of user traffic, and gates deployments:

Pre-production, canary validation

Gate production code flow with CI/CD automation
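
To illustrate the gating idea, below is a hedged Python sketch of such a CI gate: start a chaos experiment against a pre-production environment, let it soak, read an error-rate metric, and fail the build if the SLO is breached. The start_experiment, stop_experiment, and observed_error_rate helpers are placeholders I made up; in a real pipeline they would call your fault-injection tooling (for example, Azure Chaos Studio) and your monitoring system.

```python
import random
import sys
import time

SLO_MAX_ERROR_RATE = 0.01  # fail the gate if more than 1% of requests error

def start_experiment(name: str) -> None:
    # Placeholder: a real pipeline would call its fault-injection tooling here.
    print(f"starting chaos experiment: {name}")

def stop_experiment(name: str) -> None:
    # Placeholder: always clean up the injected fault.
    print(f"stopping chaos experiment: {name}")

def observed_error_rate() -> float:
    # Placeholder: a real pipeline would query its monitoring system here.
    return random.uniform(0.0, 0.02)

def main() -> int:
    start_experiment("restart-25-percent-of-pods")
    try:
        time.sleep(5)  # stand-in for the real soak period
        error_rate = observed_error_rate()
    finally:
        stop_experiment("restart-25-percent-of-pods")
    if error_rate > SLO_MAX_ERROR_RATE:
        print(f"chaos gate FAILED: error rate {error_rate:.2%} exceeds the SLO")
        return 1
    print(f"chaos gate passed: error rate {error_rate:.2%}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A non-zero exit code is all the CI/CD system needs to block the deployment.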

The phases of Chaos Engineering

Before starting with the phases, the key aspect is to do our homework properly. Below is a non-exhaustive list of best practices to apply.

Source: Adrian Hornsby Blog

At Microsoft, we follow a four-phased approach while performing Chaos Engineering:

Microsoft’s Chaos Engineering Life Cycle.

In some firms, the Chaos Engineering life cycle consists of five phases: they start by defining the steady state or baseline metrics, followed by the hypothesis. Here at Microsoft, we club the two together, as every hypothesis needs a steady state / baseline metric to be defined (e.g., OPM in the case of the e-commerce giant, where we define peak-time OPM and off-peak-time OPM).

Alternatively, some companies use a 5-phase approach for Chaos Engineering

Hypothesize

In this step, the system's normal behavior, or what we call the baseline, is defined. In case of an unexpected event, once the system recovers, the baseline metric is the first thing that helps one understand whether things are back to normal. In the case of the e-commerce giant, we define OPM, or Orders Per Minute, which gives a measure of both customer experience and operational health. If this is not done, one cannot observe drifts from the steady state.
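
As a concrete (and purely hypothetical) illustration of a steady-state check, the sketch below compares the observed OPM against an agreed baseline and flags a drift when it falls outside a tolerance band; the numbers are invented, not the e-commerce giant's actual baselines.

```python
from dataclasses import dataclass

@dataclass
class Baseline:
    peak_opm: float           # agreed Orders Per Minute during peak hours
    off_peak_opm: float       # agreed Orders Per Minute off peak
    tolerance: float = 0.10   # 10% drift allowed around the baseline

    def is_steady(self, observed_opm: float, peak_hours: bool) -> bool:
        expected = self.peak_opm if peak_hours else self.off_peak_opm
        return abs(observed_opm - expected) <= self.tolerance * expected

# Hypothetical numbers for illustration only.
baseline = Baseline(peak_opm=12_000, off_peak_opm=3_000)
print(baseline.is_steady(observed_opm=11_500, peak_hours=True))  # True: within the band
print(baseline.is_steady(observed_opm=7_000, peak_hours=True))   # False: steady state drifted
```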

Start the hypothesis by asking questions like:

What if 75% of the pods restart?

What happens if Azure Cosmos DB throttles requests, or there is an arbitrary loss of network packets or a loss of connection?

What happens if the VMs are experiencing high latency or some VMSS instances have bad nodes?

What happens if ports are blocked or the datastore or cache fails?

What if my product page latency is more than 500ms for 10 minutes?

Start small and pick a valid hypothesis. Talk to the entire team and ask them to list the scenarios: What will be their Plan B? Has the team created a playbook? Are the scripts that invoke the DR/BC plan automated? What will be the impact of this issue on upstream or downstream systems? Is the chosen application a critical or non-critical one?

Let us take a simple scenario from a video streaming site.

If the Recommended Movies section is not available

Amazon Prime. My Personal Subscription

Will you show a 404 or a blank box?

Or will you provide a degraded experience by populating Amazon Originals in place of the recommended movies?
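
As a sketch of that degraded-experience option (the function and catalog names below are illustrative, not Amazon's actual implementation), the page can fall back to a curated rail whenever the personalized call fails or times out:

```python
import logging

CURATED_FALLBACK = ["original-series-1", "original-movie-2", "original-series-3"]

def fetch_personalized_recommendations(user_id: str) -> list:
    # Placeholder for the real recommendation-service call, which may
    # time out or throw while a chaos experiment is running.
    raise TimeoutError("recommendation service unavailable")

def recommendations_rail(user_id: str) -> list:
    try:
        return fetch_personalized_recommendations(user_id)
    except Exception:
        # Degrade gracefully: log the failure and serve curated content
        # so the page still renders something useful instead of a blank box.
        logging.warning("recommendations unavailable, serving curated fallback")
        return CURATED_FALLBACK

print(recommendations_rail("user-42"))  # prints the curated fallback list
```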

The outcome of this phase is to outline clear expectations and outcomes.

Experiment

In this phase there are four things to be addressed:

Pick one hypothesis
Scope your experiment
Identify the relevant metrics to measure
Notify the organization

The best way to pick a hypothesis is to take the outage history and pick the top 3 or 5 frequently recurring issues.

Before the experimentation is done, the following should be taken care of:

The Principles of Chaos preach chaos engineering in production. This should be the end goal but NOT where anyone starts. (Start in lower environments; for example, we recommend that our clients test Cosmos DB region failover in lower environments first.)

Earn trust first, then do it in production. If one goes straight to production and fails miserably, the SRE team will lose the confidence of the organization.

Have an emergency stop button, or a way to stop your experiment and get back to the normal steady state as fast as possible.

One of the most important things during the experiment phase is understanding the potential blast radius of the experiment and the failure you're injecting — and minimizing it.

The best way to start chaos testing is to do it in a canary region and then advance to the critical regions where customer traffic is high.

The outcome of this phase is to experiment, but in a controlled way where the impact radius is known and there are ways to stop immediately when unexpected results are observed.
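
A minimal sketch of such a controlled run is shown below, assuming placeholder inject/rollback/metric helpers: the fault is scoped to a small blast radius, the steady-state metric is checked periodically, and the rollback (the emergency stop) always runs, whether the experiment finishes or is aborted.

```python
import random
import time

BLAST_RADIUS = 0.05       # only 5% of instances receive the fault
ABORT_ERROR_RATE = 0.02   # abort immediately above 2% errors
DURATION_S = 60           # shortened for the sketch; real runs are longer
CHECK_INTERVAL_S = 5      # how often the steady-state metric is checked

def inject_fault(scope: float) -> None:
    print(f"injecting fault into {scope:.0%} of instances")  # placeholder

def rollback() -> None:
    print("fault removed, system returning to steady state")  # placeholder

def read_error_rate() -> float:
    return random.uniform(0.0, 0.03)  # stand-in for a real metrics query

def run_experiment() -> None:
    inject_fault(BLAST_RADIUS)
    deadline = time.time() + DURATION_S
    try:
        while time.time() < deadline:
            if read_error_rate() > ABORT_ERROR_RATE:
                print("steady state breached, aborting experiment")
                break
            time.sleep(CHECK_INTERVAL_S)
    finally:
        rollback()  # the emergency stop always runs

run_experiment()
```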

Analyze

The only way we can analyze, learn, and improve the resiliency of the system is by measuring every aspect (recollect Part 5, where we recommend that our customers measure everything). Invest heavily in measurement systems.

When a major disaster is encountered, there is rarely one isolated cause; problems start small and compound into a major incident. Do a complete postmortem or analysis of each issue, put the learnings into a document, and create a playbook out of it.

The analysis document should cover at least the aspects below:

1 — What happened, timeline-wise?
2 — What was the impact on the customers?
3 — Why did the error occur?
4 — How was the error mitigated? Is it a long-term or short-term mitigation?
5 — What did you learn?
6 — And how will you prevent it from happening again in the future?
7 — What is the impact on our firm?

Point #7 is important for Microsoft, as we need to measure both the customer's impact and our own impact resulting from our platform: what loss (brand or monetary) was incurred by our customers as well as by us, and what are the ways we can mitigate it?

At Microsoft, our customer's customer's problem may impact our trust and brand, and hence we take ample measures to make our platform resilient.

This phase's outcome is to analyze every experiment's result, create a playbook, and see how to proactively avoid or minimize most of the errors. Do continuous inspection in the form of weekly operational metrics review meetings.
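
As a small illustration of the kind of measurement that feeds such a review, the sketch below computes MTTD (detection time minus start time) and MTTR (resolution time minus detection time) from a couple of made-up incident records.

```python
from datetime import datetime
from statistics import mean

# (started, detected, resolved): hypothetical incident timestamps.
incidents = [
    (datetime(2022, 3, 1, 10, 0), datetime(2022, 3, 1, 10, 12), datetime(2022, 3, 1, 11, 5)),
    (datetime(2022, 3, 8, 21, 30), datetime(2022, 3, 8, 21, 34), datetime(2022, 3, 8, 22, 0)),
]

mttd_minutes = mean((d - s).total_seconds() / 60 for s, d, _ in incidents)
mttr_minutes = mean((r - d).total_seconds() / 60 for _, d, r in incidents)

print(f"MTTD: {mttd_minutes:.1f} min, MTTR: {mttr_minutes:.1f} min")
```

Tracking these numbers release over release shows whether the fixes coming out of the experiments are actually improving recovery.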

Improve

The most important lesson here is to prioritize fixing the findings of your chaos experiments over developing new features!

Get the leadership team to enforce that process and buy into the idea that fixing current issues is more important than continuing the development of new features.

Chaos Engineer

Since it is a fairly new stream, we get many questions like:

  • Who is a Chaos Engineer?
  • Who are the best people to start looking into Chaos Engineering?
  • Under whose responsibility does this fall?
  • How can performance engineers drive chaos engineering ideas?

Let's clear up some myths:

  • A Chaos Engineer will NOT go around service teams and surprise them by breaking things randomly, without notifying them.
  • Chaos engineers are more like advocates, helping teams with Chaos Engineering and preparing and executing various experiments to test resiliency.
  • They work WITH the teams, NOT against them; they are evangelists, not the ones who pull the trigger.

This is not what a chaos engineer does ;) Source

Now comes the next set of questions,

Who can be a Chaos Engineer?

  • Chaos engineering is a practice more than a job definition. Thus, anyone in a software engineering or ops team can do it to improve their systems.
  • The best people to do fault injection are the ones most intimate with the software systems, viz. the developers!
  • The best way to start is to elect a champion with a very strong software engineering background and a passion for it.

Best way to start?

  • Work with teams and decide which applications/systems are to be tested every week/month.
  • Ask simple questions like "What do we want to learn?" and create a program/goals around it.
  • Have realistic goals; start simple and gain confidence.
  • Chaos engineering experiments should be treated as a deployment pattern.
  • The safest way to inject failure in the environment is by using the canary deployment pattern (a minimal sketch follows below).
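
Here is a minimal sketch of canary-scoped fault injection, with illustrative names and percentages: the fault is applied only to requests already routed to the canary slice, keeping the blast radius small and known.

```python
import random

CANARY_FRACTION = 0.02  # 2% of traffic goes to the canary slice
FAULT_ENABLED = True    # flipped on only during the experiment window

def is_canary(request_id: str) -> bool:
    # Hash-based routing; a real router would use a stable, consistent hash.
    return (hash(request_id) % 100) < CANARY_FRACTION * 100

def handle_request(request_id: str) -> str:
    if FAULT_ENABLED and is_canary(request_id):
        # Inject the failure only for canary traffic.
        if random.random() < 0.5:
            raise RuntimeError("chaos: injected failure for canary request")
    return f"ok: {request_id}"

for rid in ("req-1", "req-2", "req-3"):
    try:
        print(handle_request(rid))
    except RuntimeError as err:
        print(err)
```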

This is a great blog to understand the role of Chaos Engineer — Chaos Engineering — What and who is a chaos engineer? | by Adrian Hornsby | The Cloud Architect | Medium

While this is all cool, what are the benefits of following all this? Well, there are several benefits to practicing resiliency and chaos engineering; some are listed below.

Exposes flaws in systems (e.g., architectural flaws)

Helps the team to improve system reliability.

Helps to gain confidence in the system / applications.

It helps to recover quickly from failures and improves Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR).

We can understand how the system behaves under real-time pressure.

Helps developers and architects understand their applications and systems better.

Brings a big shift in the culture of an organization: a "non-blaming culture" where teams move from "Why did you do that?" to "How can we avoid doing that in the future?", resulting in happier and more successful teams!

Source: Bing

The results are huge and rewarding. Why would one say NO when these practices directly impact your customer experience, brand loyalty, and of course your revenue?

In the next part, we will talk about Microsoft's advanced resiliency programs and Azure Chaos Studio: its concepts and a couple of quick demos. Till then, thank you and stay tuned….

Pradip

Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)
