Resiliency and Chaos Engineering — Part 7

Pradip VS
Mar 31, 2022


In this part, we will talk about Microsoft’s advanced resiliency programs and the core concepts of Azure Chaos Studio, with a few demos. Even though Azure Chaos Studio is in preview, we will cover why it is likely to become your one-stop tool for Chaos Engineering, especially if you are on Azure.

This part continues from parts 1 through 6. Kindly go through them for the broader context.


Microsoft Azure offers cloud computing in 60+ regions, which is more than any other cloud provider. However, with scale comes a big challenge: how do you ensure the regions are resilient to failures? By regions, I mean not just a whole region but also the individual components in it. For example, how do we build or advance resiliency in VMs? How do we monitor VM availability? What about the resiliency of Azure Active Directory (AAD), or failure prediction and mitigation?

Keeping the complexity of cloud computing in 60+ regions in mind, Microsoft has initiated multiple projects to advance the resiliency of various components. Some of the key ones are listed below:

  • Live Migration: improving VM resiliency (2018)
  • Project Tardigrade: improving VM resiliency through memory-preserving soft kernel reboots (2019)
  • AIOps vision: improving service quality using AI (2020)
  • Project Narya: an end-to-end failure prediction and mitigation service (2021)
  • Project Flash: advancing Azure VM availability monitoring (2022)
  • Well-Architected Framework (WAF)
  • Azure Chaos Studio, and many more

Our topic of interest for today is Azure Chaos Studio. The product documentation already covers it in detail, so I won’t repeat it here. Instead, I will cover the core concepts and a few demos with their outcomes. Expect more posts in this space as Azure Chaos Studio matures and becomes GA, covering new features, how to test various scenarios, and what we learn from running it in customers’ UAT/Prod environments.

Azure Chaos Studio

Azure Chaos Studio lets you perform Chaos Engineering experiments to inject faults into your service, and then monitor how the service responds to the disruptions.

Azure Chaos Studio in a Nutshell
Chaos Studio Core Functionality

Core concepts of Azure Chaos Studio

The four key concepts are:

A Chaos experiment consists of four key steps.
1. Chaos experiments (Selectors and Logic)

Selectors are groups of target resources that will have faults or other actions run against them.

Logic describes how and when to run faults. It consists of steps and branches: steps run one after another, and the branches within a step run in parallel.
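To make selectors and logic concrete, here is a minimal Python sketch that assembles an experiment definition as a plain dictionary. The field names follow my understanding of the Chaos Studio experiment JSON and should be checked against the current API reference. The resource ID, selector ID, step and branch names are hypothetical placeholders, and the fault URN and its parameter are assumptions.

```python
import json

# Hypothetical target ID; replace the <...> parts with a real onboarded target.
VM_TARGET_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/virtualMachines/<vm-name>"
    "/providers/Microsoft.Chaos/targets/Microsoft-VirtualMachine"
)


def build_experiment(selector_id: str, target_ids: list[str]) -> dict:
    """Assemble a one-step, one-branch experiment body (illustrative layout)."""
    # Selector: the group of targets that faults will run against.
    selector = {
        "type": "List",
        "id": selector_id,
        "targets": [{"type": "ChaosTarget", "id": t} for t in target_ids],
    }
    # Logic: a single fault action that refers back to the selector by ID.
    shutdown = {
        "type": "continuous",
        "name": "urn:csci:microsoft:virtualMachine:shutdown/1.0",  # assumed fault URN
        "selectorId": selector_id,
        "duration": "PT10M",  # ISO 8601 duration: 10 minutes
        "parameters": [{"key": "abruptShutdown", "value": "false"}],  # assumed parameter
    }
    steps = [
        {"name": "Step 1", "branches": [{"name": "Branch 1", "actions": [shutdown]}]}
    ]
    return {"properties": {"selectors": [selector], "steps": steps}}


experiment = build_experiment("Selector1", [VM_TARGET_ID])
print(json.dumps(experiment, indent=2))
```

The point of the layout is that an action never names a resource directly. It names a selector, so the same logic can be reused against a different group of targets.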

2. Faults and actions (Faults, Time delays)

Faults can be classified in two ways:

  • By how they are injected: agent-based (injected through an agent that must be installed on a virtual machine or virtual machine scale set) or service-direct (run directly against an Azure resource).
  • By how long they act: continuous (for example, applying CPU pressure for 15 minutes) or discrete (a one-time action such as restarting a service).

Time delays are actions that “wait” without impacting any resources. They are useful for pausing between faults so the system has time to be affected by the previous fault.
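As a rough illustration of the three kinds of action described above, the sketch below builds a continuous fault, a time delay and a discrete fault as dictionaries. The helper names are mine. The CPU pressure and timed delay URNs are what I believe the fault library uses, so treat them as assumptions, and the restart-service URN is purely hypothetical.

```python
TIMED_DELAY_URN = "urn:csci:microsoft:chaosStudio:timedDelay/1.0"  # assumed URN


def _params(params: dict) -> list[dict]:
    """Convert keyword arguments to a list of key/value pairs with string values."""
    return [{"key": k, "value": str(v)} for k, v in params.items()]


def continuous_fault(urn: str, selector_id: str, duration: str, **params) -> dict:
    """A fault that stays active for the given ISO 8601 duration."""
    return {
        "type": "continuous",
        "name": urn,
        "selectorId": selector_id,
        "duration": duration,
        "parameters": _params(params),
    }


def discrete_fault(urn: str, selector_id: str, **params) -> dict:
    """A one-time action, so it carries no duration."""
    return {
        "type": "discrete",
        "name": urn,
        "selectorId": selector_id,
        "parameters": _params(params),
    }


def time_delay(duration: str) -> dict:
    """A pause that touches no resources, so it needs no selector."""
    return {"type": "delay", "name": TIMED_DELAY_URN, "duration": duration}


actions = [
    continuous_fault(
        "urn:csci:microsoft:agent:cpuPressure/1.0",  # assumed URN
        "Selector1",
        "PT15M",
        pressureLevel=95,
    ),
    time_delay("PT5M"),
    # Hypothetical URN, used only to show the shape of a discrete action.
    discrete_fault("urn:csci:example:restartService/1.0", "Selector1", serviceName="my-service"),
]
for action in actions:
    print(action["type"], action["name"])
```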

3. Targets and capabilities

A target enables Chaos Studio to interact with a resource for a particular target type. A target type represents the method of injecting faults against a resource.

A capability enables Chaos Studio to run a particular fault against a resource, such as shutting down a virtual machine.
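Targets and capabilities are themselves Azure resources nested under the resource they apply to. The sketch below composes their resource IDs, assuming the `Microsoft.Chaos/targets/.../capabilities/...` layout as I understand it. The VM resource ID is a hypothetical placeholder, and the `Microsoft-VirtualMachine` and `Shutdown-1.0` names are examples to verify against the documentation.

```python
def target_id(resource_id: str, target_type: str) -> str:
    """ID of the Chaos Studio target that sits under an existing Azure resource."""
    return f"{resource_id.rstrip('/')}/providers/Microsoft.Chaos/targets/{target_type}"


def capability_id(resource_id: str, target_type: str, capability: str) -> str:
    """ID of one capability (one enabled fault) under that target."""
    return f"{target_id(resource_id, target_type)}/capabilities/{capability}"


# Hypothetical VM resource ID.
vm = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/virtualMachines/<vm-name>"
)

print(target_id(vm, "Microsoft-VirtualMachine"))
print(capability_id(vm, "Microsoft-VirtualMachine", "Shutdown-1.0"))
```

Read this way, enabling a target says "Chaos Studio may interact with this resource", and enabling a capability says "and it may run this specific fault".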

4. Permissions and security

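Chaos Studio must not be open for anyone to operate, because fault injection could be misused for malicious attacks. Only an experiment whose identity has the right permissions/roles on the target resources can run faults against them. A user-assigned or system-assigned managed identity (UAMI/SAMI) can be used for this.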

All the targets are listed, and one can enable service-direct or agent-based faults on a given target.

The faults in an experiment can be run in sequence or in parallel. The supported fault library is constantly being expanded with more faults across the entire Azure stack.
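To see what sequence versus parallel means for the length of an experiment, here is a small worked calculation in Python. It assumes that steps run one after another, that branches within a step run in parallel, that actions within a branch run in order, and that discrete faults take no time. The parser handles only simple `PT#H#M#S` durations, and the step layout is made up for illustration.

```python
import re
from datetime import timedelta

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text: str) -> timedelta:
    """Parse a simple ISO 8601 duration such as PT10M or PT1H30M."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def experiment_duration(steps: list[dict]) -> timedelta:
    """Sum over steps of the longest branch in each step."""
    total = timedelta()
    for step in steps:
        branch_times = [
            sum(
                (parse_duration(a["duration"]) for a in branch["actions"] if "duration" in a),
                timedelta(),
            )
            for branch in step["branches"]
        ]
        total += max(branch_times, default=timedelta())
    return total


steps = [
    {"branches": [
        {"actions": [{"duration": "PT10M"}]},                       # 10 min
        {"actions": [{"duration": "PT5M"}, {"duration": "PT2M"}]},  # 5 + 2 = 7 min
    ]},
    {"branches": [{"actions": [{"duration": "PT3M"}]}]},            # 3 min
]
print(experiment_duration(steps))  # max(10, 7) + 3 minutes
```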

Disruptions at three levels viz. Network, Resource and Configuration

Azure Chaos Studio experiments

Adding CPU pressure of 95% for 10 min on a given VM using an agent (stress-ng)
You can choose the VM where the pressure has to be applied.

The only prerequisite is that, to run an agent-based fault on a VM or VMSS, the agent has to be installed on that VM or VMSS. (AKS faults rely on Chaos Mesh being installed on the cluster instead.) No agent installation is required for service-direct faults.

Metrics from the VM confirm CPU pressure at 95% for 10 minutes.

Azure Chaos Studio applied CPU pressure of 95% for 10 min on the chosen VM through the agent (stress-ng), and the metrics from the VM confirm this.
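The same confirmation can be scripted rather than read off a chart. The sketch below assumes you have already exported per-minute average CPU percentages for the VM. The sample values are made up for illustration, and the 90% threshold is my choice to allow for averaging in the metric.

```python
def longest_run_at_or_above(samples: list[float], threshold: float) -> int:
    """Length of the longest streak of consecutive samples >= threshold."""
    best = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        best = max(best, current)
    return best


# Made-up per-minute CPU averages: idle, ten minutes of pressure, idle again.
cpu_percent = [12.0, 15.0] + [95.0] * 10 + [14.0, 11.0]

minutes_under_pressure = longest_run_at_or_above(cpu_percent, threshold=90.0)
print(f"Sustained pressure for {minutes_under_pressure} minutes")
```

A check like this can act as a pass/fail gate when the experiment is run unattended.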

Chaos experiments can be invoked through the REST API, so they can be automated (using Azure Logic Apps and other tools) to avoid manual intervention.
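For readers who prefer a script to a Logic App, here is a minimal sketch of the start call using only the Python standard library. It builds the request and does not send it. The URL shape and the default `api_version` reflect my understanding of the Azure Resource Manager endpoint and should be verified against the REST reference. The subscription, resource group, experiment name and token are hypothetical placeholders.

```python
import urllib.request

ARM_ENDPOINT = "https://management.azure.com"


def build_start_request(
    subscription_id: str,
    resource_group: str,
    experiment_name: str,
    bearer_token: str,
    api_version: str = "2023-11-01",  # assumption: check the version you should use
) -> urllib.request.Request:
    """Build (but do not send) the POST that starts a chaos experiment."""
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Chaos/experiments/{experiment_name}"
        f"/start?api-version={api_version}"
    )
    return urllib.request.Request(
        url,
        data=b"",  # the start call carries no body
        method="POST",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


request = build_start_request(
    "<subscription-id>", "<resource-group>", "cpu-pressure-demo", "<access-token>"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. In practice the token would come from whatever identity the caller runs under rather than being pasted in.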

Azure Logic Apps invoking the Chaos Experiment directly on a given schedule through a REST API endpoint.
Adding a suitable role in Access Control (IAM) to run chaos experiments on that resource

A fault can be injected into a resource only if the experiment has been granted a suitable role on it. The Chaos Studio documentation lists the roles required to run experiments on the various targets.

With an Azure subscription, you can try these demos yourself:

Create an experiment that uses a service-direct fault with Azure Chaos Studio | Microsoft Docs

Create an experiment that uses an agent-based fault with Azure Chaos Studio with the portal | Microsoft Docs

Create an experiment that uses an AKS Chaos Mesh fault using Azure Chaos Studio with the Azure portal | Microsoft Docs

A few things to consider at this point in time:

Chaos experiments can target resources in a different subscription than the experiment as long as the subscription is within the same Azure tenant.

Chaos experiments can target resources in a different region if the region is a supported region for Chaos Studio.

Azure Chaos Studio does NOT support Service Tags or Private Link.

The Azure Chaos Studio team has also published several videos, which I recommend you all watch.

To conclude, Microsoft is constantly upgrading the resiliency of its cloud infrastructure through various projects on various components. Azure Chaos Studio is a managed offering from Azure for performing chaos experiments on various resources. Kindly get started with Azure Chaos Studio and share your feedback in the comments on what features you are looking for and how we can improve the product. In the next part, I will close this series with my final thoughts and a look at whether multi-cloud solves resiliency problems. I will also start a new series on the experiments and learnings from running Azure Chaos Studio on various clients’ platforms. Till then, thank you for reading my thoughts and experience, and stay tuned for more…

Pradip
Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)
