Resiliency and Chaos Engineering — Part 8

Pradip VS
9 min read · Apr 3, 2022


In the earlier parts, I covered why failure is inevitable in the era of complex distributed systems and how resiliency engineering and chaos engineering improve a system's resiliency and minimize failures. Resiliency and chaos engineering is an ongoing effort, not a one-time testing practice. Kindly go through the earlier parts 1, 2, 3, 4, 5, 6 and 7 to understand it better.

In this part, I would like to share my thoughts on resiliency in multi-cloud and on why firms should initiate these programs at the earliest (with a dedicated Resiliency practice/CoE).

Conclusion to this series, and my thoughts on what I have observed in SRE while working with customers.

Of the many customers I have worked with, only a few have a dedicated Resiliency CoE set up. There are various reasons why many firms have not considered one; some of the reasons I have heard are:

  1. Lack of awareness of this stream.
  2. We have HA, redundancy and an AZ setup, so we are good with BC/DR and hence do not need an SRE practice.
  3. We do not have any mission-critical applications, so downtime is not an issue for us.
  4. We have a good health metrics collection system, so we do not need resiliency and chaos engineering practices.
  5. We would need to hunt for resources in the market, which is not an easy ask :)
  6. Our company is planning a multi-cloud strategy, and that should serve the purpose.

As we discussed in the previous parts, every firm in the distributed computing era needs such a practice, and it is better to adopt one at the earliest. There is no need to hunt for resources with specialized skills; a passionate software engineer can be a good person to start with.

Now the questions become: how does one adopt it, what tools are available in the market, and how does one start?

One of the ecommerce customers with whom I work very closely has created an SRE practice and did the following:

  1. They chose the client-facing applications (which are highly critical, like the ecommerce portal) as good candidates for resiliency and chaos testing. They did not take on all applications in one go, but rather split them into a. critical and customer facing, b. critical but non-customer facing, and c. non-critical, and worked with the application teams to perform chaos testing.
  2. They worked with individual application teams to come up with the following set of metrics to be captured. Each team has to work continuously and incrementally to ensure it captures all the information required to reduce its application's impact during outages. They also have a criterion that an application should be at level 3 or above before chaos engineering is run against it (where level 1 is novice, with no proper metrics captured and no DR playbook built, while level 5 is advanced, with a DR playbook, proper metrics defined and automation scripts built).
A sample adoption guide that helps one define the key metrics, with quantifiable data to use during an outage to make the application quickly available again.

3. Doing drills extensively in lower environments before progressing to the higher ones. For example, performing a Cosmos DB region failover or a Redis cache restart to see how the application behaves, before gaining the confidence to proceed to UAT and production (a sketch of such a drill follows this list).

4. The customer performed drills almost every week or every other week, all year round (not just before the holiday season).

5. It is advisable to choose a couple of tools to perform the chaos testing. Since one tool cannot cater to every chaos testing scenario, it is better to consider two or three; I cover tool selection in detail below.
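To make the drill in point 3 concrete, below is a minimal sketch of a Cosmos DB region failover drill in a lower environment, driven through the azure-mgmt-cosmosdb management SDK. The subscription, resource group and account names are hypothetical placeholders, and the exact method name and payload shape may differ across SDK versions, so treat this as an illustration rather than a drop-in script.

```python
# Hypothetical failover drill for a Cosmos DB account in a UAT environment.
# Assumes azure-identity and azure-mgmt-cosmosdb are installed; names are
# placeholders and the call shape may vary by SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "rg-ecommerce-uat"    # hypothetical resource group
ACCOUNT_NAME = "cosmos-orders-uat"     # hypothetical Cosmos DB account

client = CosmosDBManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Priority 0 is the write region. Swapping priorities forces a region
# failover that the application under test must absorb gracefully.
poller = client.database_accounts.begin_failover_priority_change(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    {
        "failover_policies": [
            {"location_name": "East US", "failover_priority": 0},
            {"location_name": "West US", "failover_priority": 1},
        ]
    },
)
poller.result()  # wait for the failover, then watch the application metrics
```

While the failover is in progress, the team watches the metrics defined in the adoption guide (error rates, recovery time, orders per minute) to judge whether the application degrades gracefully.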

Before getting into the tools, let's clear up some myths:

  1. Redundant solutions and availability zones alone do not make applications resilient.
  2. A good health metrics system (or enabling observability) alone will not solve resiliency issues.
  3. There will be some applications or systems in every organization that are highly critical and cannot afford downtime, so one has to do due diligence and apply resiliency principles to make sure failures are minimized.
  4. Let me also be brutally honest: doing resiliency and chaos engineering does not ensure 100% protection against failures. These engineering practices identify failure points and architectural gaps, and they build the mindset to handle issues (which is huge), but failures may and will still happen; the value is being prepared for them and solving problems more quickly.
  5. Chaos tools give engineering teams the power to think intuitively and fix the gaps. Having a few chaos testing tools without a strong engineering team will not solve your resiliency problems.
  6. Finally, embracing multi-cloud is not a solution to this. In fact, resiliency becomes harder to achieve in a multi-cloud model, which I cover below based on what I observed working with a few key customers.

Enough myth busting; below are some of the popular tools used by various customers.

Popular chaos tools and their logos.

While working with customers, I see a pattern where teams use more than one tool, for one reason: the cost of licenses. Say the customer with whom I'm working chose Gremlin and Chaos Toolkit. The reason is simple: since the company has a fixed budget, it cannot afford a costly licensed tool everywhere, so it uses a combination of tools to perform chaos engineering (highly critical departments use the costly chaos tools while other departments use cost-effective open-source tools).

While this is a good approach, one more thing we observed is that these tools do not fully meet a customer's chaos testing needs. For example, Gremlin and Chaos Toolkit can run tests like restarting pods, adding stress to VMs or blocking port 443, but they cannot simulate a 429 (request rate too large) in Cosmos DB. Even though some teams do a jugaad (workaround) and simulate 429s on the SDK/client side (a sketch of this follows below), it will not match the server-side observations, and hence the metrics will mislead. Similarly, no tool in the market can simulate an Azure Cosmos DB region failover. This is where cloud-native tools like Azure Chaos Studio come into the picture: Chaos Studio has most of the features provided by the market-leading chaos tools, like pod restarts, VM CPU pressure and DNS failures, as well as simulations specific to Azure PaaS services, like an Azure Cosmos DB region failover and other Cosmos DB failure scenarios such as 429s (some are in the pipeline).
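To illustrate that client-side jugaad, here is a minimal sketch of a wrapper that makes a fraction of Cosmos DB reads raise a synthetic 429, exercising the application's retry and backoff path. The wrapper class and failure rate are hypothetical, and the exception constructor follows my understanding of the azure-cosmos v4 SDK; as discussed, the fault never leaves the client process, so server-side throttling metrics will show nothing.

```python
import random

from azure.cosmos import exceptions  # azure-cosmos v4 SDK (assumed)

class Throttle429Injector:
    """Hypothetical wrapper that injects synthetic 429s on the client side.

    Because the fault is raised before any request leaves the process,
    server-side RU and throttling metrics will not reflect these errors,
    which is exactly why this approach can mislead.
    """

    def __init__(self, container, failure_rate=0.2):
        self._container = container        # a real ContainerProxy
        self._failure_rate = failure_rate  # fraction of calls to throttle

    def read_item(self, *args, **kwargs):
        if random.random() < self._failure_rate:
            # Mimics the shape of a real "request rate too large" error.
            raise exceptions.CosmosHttpResponseError(
                status_code=429, message="Simulated throttle (injected)"
            )
        return self._container.read_item(*args, **kwargs)
```

The application under test receives the injector in place of the real container, so its retry logic is exercised without any change to application code.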

My recommendation here is to choose tools that are both cost effective and, together, cover all the scenarios you need to test. Say a combination of Gremlin + Azure Chaos Studio + Chaos Toolkit will cover most of the scenarios in a cost-effective way (Azure Chaos Studio is a pay-per-use service like other PaaS services, so there is no upfront cost involved and one can easily plan experiments within a given budget).

Source: Chaos Engineering tools comparison (gremlin.com)

This is a list put together by Gremlin, to which I have added Azure Chaos Studio. As mentioned above, there are many tools for chaos testing, and the best one is determined by the need, the cost and the capabilities required. But one thing to note: if you are on Azure, none of the other tools provides the features that Azure Chaos Studio provides for chaos testing on Azure PaaS components.

Finally, coming to multi-cloud and the myth that it solves resiliency issues.

While embracing multi-cloud gives flexibility in choosing a vendor who can offer a better SLA or improved HA/DR than another, and reduces vendor lock-in, costs and capacity issues, it has its own set of challenges (this is a very good blog by James Serra that outlines the merits and demerits of multi-cloud: Single-cloud versus Multi-cloud | James Serra's Blog).

One thing worth discussing on top of what James covered in his blog is achieving resiliency in multi-cloud.

The customer with whom I engaged is planning to build a platform where some parts run in Azure while other parts run in another cloud, yet still access data from Azure Cosmos DB. This not only hurts performance, as James notes below, but also creates issues when adopting resiliency engineering practices.

The first and foremost problem in multi-cloud. Source: James Serra's blog

If one claims that deploying the same solution in two clouds improves availability and resiliency to region failures, then the same holds good when one deploys across multiple regions of a single cloud provider, redirecting customers to other regions rather than to another CSP. Regions are usually isolated from one another (be it power, network or cooling systems). Having the same solution in two clouds opens up other challenges, like cost and data sync issues, which are beyond the scope of this blog.

James Serra summarizes it very well.

Recently, when Azure had an outage in West US that impacted VMs, customers were asked to redirect their traffic to other regions like South Central US and East US. The ecommerce customer did mention that there was no impact to their orders per minute during this outage. Hence the solution to an outage (as resiliency cannot be achieved 100% in any one cloud) is not multi-cloud but multi-region within the same CSP; multi-cloud makes things harder. Below are some excerpts from analysts on multi-cloud as an answer to outages, written when a recent AWS outage happened (a minimal sketch of region redirection follows the links). This holds good whether you are on Azure, AWS or GCP.

AWS outage: Your response to AWS going down shouldn’t be multicloud | TechRepublic
Multi-Cloud is NOT the solution to the next AWS outage. | by Alexandre Couëdelo | Jan, 2022 | FAUN Publication
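To illustrate multi-region within a single CSP, here is a minimal sketch of priority-based region redirection: probe the regional endpoints in order and send traffic to the first healthy one. The endpoint URLs are made up, and in practice this job belongs to a managed layer such as Azure Front Door or Traffic Manager rather than hand-rolled application code.

```python
# Minimal sketch of priority-based regional failover. URLs are hypothetical;
# real deployments would normally delegate this to a managed traffic layer.
import requests

REGIONAL_ENDPOINTS = [
    "https://orders-westus.example.com/health",         # primary region
    "https://orders-southcentralus.example.com/health",
    "https://orders-eastus.example.com/health",
]

def pick_healthy_region(timeout_s=2.0):
    """Return the first regional endpoint that answers its health probe."""
    for url in REGIONAL_ENDPOINTS:
        try:
            if requests.get(url, timeout=timeout_s).status_code == 200:
                return url
        except requests.RequestException:
            continue  # region unreachable; fall through to the next priority
    raise RuntimeError("No healthy region available")
```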

So availability and reliability cannot be fully solved with multi-cloud, and there are many ways to achieve the same goals with a single cloud by bolstering resiliency and chaos engineering practices.

To conclude this series: failure is unavoidable in this distributed, complex computing world, and one has to embrace it rather than avoid it. By embracing I mean designing systems with failure at the heart of design and build practices, so as to minimize it. Resiliency and chaos engineering principles help systems recover from failures quickly. There are several myths and considerations in adopting these principles that architects and engineers should keep in mind. Finally, we saw that multi-cloud, availability zones, redundancy or a good monitoring system alone will not improve a system's reliability, just as having a few chaos testing tools alone cannot solve all resiliency problems. Collectively all of these help, but not individually. If you are yet to adopt these principles and practices in your organization and need help adopting them along with the tools, kindly reach out to us (Microsoft); we can help you adopt these practices with the right advice and tools. This is a continuous journey, not a destination.

This is a great book to read if you are interested in this area — Chaos Engineering (oreilly.com)

Thank you for reading all the parts patiently, and kindly leave your comments and feedback on how my posts can be improved. I will blog more on Azure Chaos Studio as it evolves and as we solve more problems for customers; that will be a separate set of articles. This is the end of this series. Stay tuned for more in the upcoming days.

Pradip
Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)
