Resiliency and Chaos Engineering — Part 1

Pradip VS
5 min read · Mar 15, 2022

In the world of distributed systems and multi-cloud / hybrid-cloud operating models, one cannot expect systems to run without failures. In fact, the new norm is that failures are inevitable, and one has to embrace them.

Failures can be embraced and overcome through resiliency engineering. But how do we verify that the resiliency patterns defined by the architects actually work? Chaos engineering helps evaluate that: it tests whether those patterns are good enough to keep systems resilient to disasters.

Resiliency and Chaos Engineering

In this series, I will cover the following on Resiliency and Chaos Engineering.

  1. What were some of the notable outages in 2021? What does an outage mean for a company?
  2. Why do organizations fear failures? What causes failures?
  3. Is failure a bad thing? What is the view of pioneers / industry experts on failures?
  4. How to embrace failure and improve through resiliency?
  5. How to achieve continuous resiliency? Which architectural patterns should be followed to improve it?
  6. The firefighters and how they relate to chaos engineering
  7. Chaos Engineering — the art of breaking things purposefully.
  8. How does chaos engineering improve resiliency?
  9. Phases in chaos engineering
  10. Who can be a chaos engineer? What is the best way to start?
  11. Benefits of chaos engineering
  12. Microsoft’s initiatives on Advanced Resiliency
  13. Azure Chaos Studio — Microsoft’s chaos engineering tool.
  14. Azure Chaos Studio — Concepts and Demo
  15. Final thoughts

I will also briefly mention how top tech companies are embracing it to minimize outages and make their systems more resilient day by day. I will draw some inferences from the projects I am engaged in within this area. The articles in this series express my personal views and not those of my employer, Microsoft.

I would like to take this opportunity to wholeheartedly thank Mark Russinovich, CTO, Azure, and Adrian Hornsby, Principal Technologist, AWS, as their blogs and thoughts helped me learn so much in this area. I would also like to thank my team for supporting me in this journey.

Enough talk, let’s start working through the agenda. In this part, I will cover points 1 to 3.

  1. Notable outages in 2021 and the financial implications of outages

Despite cloud vendors, software engineers, and architects building systems and apps that adhere to best practices and bring resiliency into each phase of the system lifecycle, outages still occur, impacting customers and revenue. The following were notable outages in 2021:

Major Outage at Amazon Disrupts Businesses Across the US | Business News | US News

A popular ecommerce customer had an issue with order IDs that took its ecommerce site down for almost 6 hours a few weeks before Black Friday in 2021.

Outages occur now and then, but what we should focus on is how to recover from them quickly, and that is the whole purpose of this series.

The next question is: what are the financial implications of an outage?

Source: IDC and the Ponemon Institute

Whether it is an ecommerce giant or a cloud computing firm, an hour of outage or downtime of critical services hurts the customer experience and leads to loss of trust and revenue.

Obviously, no one likes outages. But there are reasons why organizations fear failures.

2. Why do organizations fear failures? What causes failures?

When an outage happens, many firms and engineers follow this life cycle: they first detect the failure, then understand and evaluate it, respond, resolve, recover, and finally confirm and close. But there is one more phase beyond these: BLAME PEOPLE.

You, You and You! — Source

It goes like this:

Detect → Understand & Evaluate → Respond, Resolve, Recover → Confirm & Close → Blame People

What is the reaction of people and organizations that experience fear?

It is to avoid it, naturally.

The second question is: what causes failures?

Often one team or individual is blamed for a failure, but such failures usually have multiple causes. They typically start small, compound over time, and one fine day result in a disaster.

Failures have multiple causes.

3. Is failure a bad thing? What is the view of pioneers / industry experts on failures?

Here are some views of industry experts on failures, especially in the world of distributed and complex systems.

We will never eliminate all such risks (outages, failures) but we are deeply focused on reducing both the frequency and the impact of service issues while being transparent with our customers, partners, and the broader industry — Mark Russinovich CTO and Technical Fellow, Microsoft Azure

“Everything fails all the time”

“Failures are a given and everything will eventually fail over time.” — Werner Vogels, CTO, Amazon.com

“Anything that can fail will fail” — Murphy’s law

“The complexity of these systems makes it impossible for them to run without multiple flaws being present.” — Richard Cook, How Complex Systems Fail.

This quote sums up point #3 precisely:

“You don’t choose the moment, the moment chooses you! You only choose how prepared you are when it does.”

Fire Chief Mike Burtch

Source: Bing

So an outage can occur at any time and for any reason, but our goal is to be prepared for it.

In summary, failures and outages can and will occur in complex systems, especially in the era of modern cloud computing and distributed systems. So the best approach is to embrace failures rather than avoid them.

But then the next and most important question is: how do we embrace failures? What should one follow?

There are many best practices one has to follow (Resiliency Engineering / Continuous Resilience practices along with Chaos Engineering) to overcome failures. In the upcoming parts of this series, I will explain them in detail, along with how they minimize failures, help a system recover quickly, and, most importantly, bring a big shift in an organization’s culture, resulting in happier and more successful teams.

Link to Part 2 is here

— Thanks & Stay tuned

Pradip

Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)

Pradip VS

Architect@Microsoft. I help & co-innovate with the customers in Generative AI, ML, Data Engineering, Analytics, Resiliency Engineering, Data Arch & Strategies.