Building Resilience using Azure Chaos Studio (AKS with SQL in VNET) — Part 1

Pradip VS
Published in Microsoft Azure
8 min read · Jan 12, 2024
Azure Chaos Studio: improve application resilience by injecting faults at various layers of the architecture and simulating chaos or outages. More at: Azure Chaos Studio - Chaos engineering experimentation | Microsoft Azure

This blog is co-authored with Vijayaraghavan Lakshmikanthan, Cloud Solutions Architect, Microsoft.

It is good to be back with Azure Chaos Studio. This time we demonstrate two use cases end to end, each built on an architecture that organizations typically use to deploy their key services.

We will run chaos experiments at various layers of the architecture to show how the application behaves in each experiment and what we can learn from it to improve the resiliency posture. One of the use cases can be read here.

Before we dive deep into the architecture, here is some good news: Azure Chaos Studio is GA! (Announced at Microsoft Ignite 2023.)

GA announcement — Azure Chaos Studio (microsoft.com)

You can check the regional availability and deploy your experiments! (Watch this space, as more regions will be added in the future.) See: Azure Products by Region | Microsoft Azure

Pricing of Azure Chaos Studio — Azure Chaos Studio — Pricing | Microsoft Azure

Azure Chaos Studio documentation — Azure Chaos Studio documentation — tutorials, API reference | Microsoft Learn

Azure Chaos Studio can help improve the resiliency of a system; for background, you can refer to this article: Resiliency and Chaos Engineering - Part 7 | by Pradip VS | Medium. Many more features are coming to Azure Chaos Studio, and with them you can test many scenarios and improve the resiliency of your systems. Beyond this, Microsoft has a grand vision for resiliency, and we are improving the resiliency of our massive data centers day by day through various initiatives and projects. Watch this blog for more: Advancing Reliability | Microsoft Azure Blog | Microsoft Azure

In this blog, we will walk through:

  • An introduction to the Chaos Studio experiments that we will be covering.
  • The environment and the Azure services that you need for these Chaos Studio experiments.
  • Creating the AKS cluster and the commands to run.
  • The application used to simulate the use cases, and its deployment.
  • The SQL queries used for these experiments.
  • Enabling the Chaos Studio targets.
  • A description and demo of each experiment, with outcomes and key takeaways.
  • How these experiments reveal gaps in the resiliency posture of applications, and some best practices to close them.
  • Wrap!

* All the programs and commands used in these experiments are on GitHub, and the URLs are shared in the appropriate sections of this article.

In this end-to-end series, let’s talk about the architecture below and its flow. (All the components in this architecture, including the chaos experiments, are hosted in the Azure Southeast Asia region.)

Chaos Studio Experiments

In this series we will cover five experiments specific to Azure Kubernetes Service and Virtual Network:

  1. Network Latency
  2. Network Disconnect
  3. NSG Security Rule
  4. AKS Chaos Mesh IO Chaos
  5. AKS Chaos Mesh Network Chaos

End-to-end flow of this architecture and our experiments at different layers.

Preparing the Subscription

To start, we need to register two resource providers to run some of the use cases:

  1. Microsoft.ContainerInstance
  2. Microsoft.Relay

The article "Resource providers and resource types - Azure Resource Manager | Microsoft Learn" explains how resource providers can be registered.
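If you prefer to script this step, the sketch below builds the `az provider register` command for each of the two namespaces listed above. It assumes the Azure CLI is installed and that you are already signed in; the helper names `build_register_commands` and `register_providers` are ours, for illustration only. By default it only prints the commands (a dry run) and executes nothing.

```python
import subprocess

# Resource provider namespaces named in this article.
REQUIRED_PROVIDERS = ["Microsoft.ContainerInstance", "Microsoft.Relay"]


def build_register_commands(providers, subscription_id=None):
    """Return one `az provider register` argument list per namespace."""
    commands = []
    for namespace in providers:
        argv = ["az", "provider", "register", "--namespace", namespace]
        if subscription_id:
            # Optional: target a specific subscription instead of the CLI default.
            argv += ["--subscription", subscription_id]
        commands.append(argv)
    return commands


def register_providers(providers, subscription_id=None, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in build_register_commands(providers, subscription_id):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    register_providers(REQUIRED_PROVIDERS)  # dry run: prints the commands only
```

Registration can take a few minutes to complete, so it is worth confirming the state of both providers in the portal before moving on.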

Setting up the Environment

You can use an existing environment or create a new one to try out these test cases. If you would like to run them in a separate environment, this post will help you set one up. We also provide a sample application that implements the use cases discussed in this post.

To get started we will need the following:

  1. Azure Network Security Group
  2. Azure Virtual Network
  3. Azure SQL Database
  4. Azure Container Registry
  5. Sample Application
  6. Azure Kubernetes Service
  7. Azure Virtual Machine

Azure Network Security Group

Set up a Network Security Group. Make sure that the Network Security Group permits the Inbound & Outbound Chaos Studio NSG Rules.

You also need an NSG rule that lets inbound traffic reach the application that runs on AKS.

Virtual Network

First, we will create a virtual network. Then, we will create a subnet for each of the following: Virtual Machine, Kubernetes Service and the Azure SQL DB. Connect the subnets to the Network Security Group that you created before.

We will also need two additional subnets called ChaosStudioContainerSubnet and ChaosStudioRelaySubnet.

A container subnet is used for the Chaos Studio containers that you deploy in your private network. These containers execute the faults that you define in your experiments, and they need to communicate with the target resources that are also in the private network. A container subnet requires an address space of at least /28.

You will also need to delegate the ChaosStudioContainerSubnet to the Microsoft.ContainerInstance/containerGroups service. See: Add or remove a subnet delegation in an Azure virtual network | Microsoft Learn

A relay subnet is used to forward communication from Chaos Studio to the containers inside the private network. This is necessary because Chaos Studio is a public service that cannot directly access the private network. A relay subnet also requires an address space of at least /28. When you are done, the subnets on the virtual network should look like this.
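Before creating the subnets, it can help to sanity-check the address plan. The sketch below uses Python's standard `ipaddress` module to confirm that every subnet sits inside the virtual network, that no two subnets overlap, and that both Chaos Studio subnets are /28 or larger. The address ranges and the names `vm-subnet`, `aks-subnet` and `sql-subnet` are hypothetical; only the two Chaos Studio subnet names come from this article.

```python
import ipaddress

# Hypothetical address plan; replace with your own ranges.
VNET = "10.0.0.0/16"
SUBNETS = {
    "vm-subnet": "10.0.0.0/24",
    "sql-subnet": "10.0.1.0/24",
    "ChaosStudioContainerSubnet": "10.0.2.0/28",
    "ChaosStudioRelaySubnet": "10.0.2.16/28",
    "aks-subnet": "10.0.4.0/22",
}
CHAOS_SUBNETS = ("ChaosStudioContainerSubnet", "ChaosStudioRelaySubnet")
MAX_CHAOS_PREFIX = 28  # a longer prefix (e.g. /29) would be smaller than /28


def validate_plan(vnet, subnets):
    """Return a list of problems found in the plan (empty when it is consistent)."""
    problems = []
    vnet_net = ipaddress.ip_network(vnet)
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every subnet must fall inside the virtual network's address space.
    for name, net in parsed.items():
        if not net.subnet_of(vnet_net):
            problems.append(f"{name} ({net}) is outside {vnet_net}")

    # Both Chaos Studio subnets must exist and be at least /28 in size.
    for name in CHAOS_SUBNETS:
        if name not in parsed:
            problems.append(f"{name} is missing")
        elif parsed[name].prefixlen > MAX_CHAOS_PREFIX:
            problems.append(f"{name} ({parsed[name]}) is smaller than /28")

    # No two subnets may share addresses.
    names = sorted(parsed)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if parsed[first].overlaps(parsed[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    for problem in validate_plan(VNET, SUBNETS):
        print(problem)
```

An empty result means the plan is internally consistent; it does not check anything against your actual Azure environment.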

Azure SQL Database

Create an Azure SQL DB. As you create it, make sure to set up a private endpoint, and pick one of the subnets that you made earlier for this purpose. In the additional settings, for ‘Use existing data’ select ‘Sample’, which fills the database with the AdventureWorksLT sample dataset. After the database is ready, you will need to add an extra table called ‘dbo.todo’. You can use this query to create the table:

CREATE TABLE dbo.todo (id INT PRIMARY KEY, description VARCHAR(255), details VARCHAR(4096), done BIT);

For detailed instructions, refer to https://learn.microsoft.com/en-us/azure/azure-sql/database/single-database-create-quickstart?view=azuresql&tabs=azure-portal

Azure Container Registry

Create an Azure Container Registry where you can upload the sample application. You will reference this Azure Container Registry when provisioning the Azure Kubernetes Service.

Sample Application

We have created a Java application that reads from and writes to the Azure SQL DB. The application does the following: (1) loads a subset of records, (2) loads all records, (3) adds records to the database, and (4) writes logs to the Azure storage disk. You can find the application here: GitHub - faizc/chaos-app: App to simulate scenarios using Chaos Studio. After modifying it for your environment, upload it to the Azure Container Registry that you set up earlier.

Azure Kubernetes Service

Next, set up a private AKS cluster with version 1.25.1, a node pool that autoscales between 2 and 5 nodes, a node size of Standard_D4as_v4, Ubuntu Linux as the OS, Azure CNI as the network profile, and Calico as the network policy.

In the ‘Integration’ section, choose the Azure Container Registry that you created earlier.

To run the chaos experiments, Chaos Mesh needs to be installed on the resource where we want to inject the fault (in our case, AKS). The steps to deploy Chaos Mesh in AKS can be found here.
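The same cluster settings can be expressed as a single `az aks create` command. The sketch below only assembles and prints that command, so nothing is created when you run it. The resource group, cluster name, registry name and subnet ID are placeholders, and `build_aks_create_command` is a helper name of ours; treat it as a starting point to adapt, not as the exact command used for this post.

```python
import shlex


def build_aks_create_command(resource_group, cluster_name, acr_name, vnet_subnet_id):
    """Assemble an `az aks create` argument list matching the settings in this post."""
    return [
        "az", "aks", "create",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--enable-private-cluster",          # private AKS cluster
        "--kubernetes-version", "1.25.1",    # version stated in this post
        "--node-vm-size", "Standard_D4as_v4",
        "--os-sku", "Ubuntu",
        "--enable-cluster-autoscaler",       # node pool scales between 2 and 5
        "--min-count", "2",
        "--max-count", "5",
        "--network-plugin", "azure",         # Azure CNI
        "--network-policy", "calico",
        "--vnet-subnet-id", vnet_subnet_id,  # the AKS subnet created earlier
        "--attach-acr", acr_name,            # same effect as the portal integration step
    ]


if __name__ == "__main__":
    # Placeholder values; replace with the names from your environment.
    command = build_aks_create_command(
        resource_group="my-resource-group",
        cluster_name="my-aks-cluster",
        acr_name="myregistry",
        vnet_subnet_id="<resource ID of the AKS subnet>",
    )
    print(shlex.join(command))
```

Building the command as a list keeps each flag next to the requirement it covers, which makes it easier to review against the settings described above.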

Azure Virtual Machine

Create a Virtual Machine to manage the private AKS cluster. You can use an existing VM as long as it has the network connectivity that allows you to manage the AKS cluster.

To connect remotely to this Virtual Machine, consider using the Azure Bastion service. This will require you to create another subnet. You can refer to this link for detailed instructions: Quickstart: Deploy Azure Bastion automatically - Basic SKU - Azure Bastion | Microsoft Learn

Enabling Chaos Studio Targets

Now that your environment is set up and ready, you must enable these resources as targets in Chaos Studio before we can start our experiments.

Navigate to Chaos Studio in the Azure portal and select the Targets blade, where all the resources that we have created are listed. Select the resources and choose ‘Enable Targets’. Depending on the type of resource selected, you will have the choice to ‘Enable service-direct targets (All resources)’ and/or ‘Enable agent-based targets (VM, VMSS)’.

Depending on the target, one can enable agent-based or service-direct faults.

This is because the faults are either agent-based or service-direct depending on the target type.

An agent-based fault requires the Chaos Studio agent to be installed on a virtual machine or virtual machine scale set. The agent is available for both Windows and Linux, but not all faults are available on both operating systems. For information on which faults are supported on each operating system, see Azure Chaos Studio fault and action library | Microsoft Learn.

Service-direct faults don’t require any agent. They run directly against an Azure resource.

For VM and VMSS (for AKS), enable both types of faults. For NSG and AKS, enable service-direct faults. The Targets view shows the status of each target as ‘Enabled’, ‘Not Enabled’ or ‘Not Applicable’.

The targets and experiments can be enabled or created through the portal as well as programmatically: Azure Chaos Studio REST API reference | Microsoft Learn, Use the REST APIs to manage Azure Chaos Studio experiments | Microsoft Learn.
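As a rough illustration of the programmatic route, the sketch below builds the request that enables a service-direct target on a resource. It only constructs the URL and body and sends nothing. The `api-version` value is an assumption that you should confirm against the REST API reference, the resource ID is a placeholder, and `build_enable_target_request` is our own helper name. Agent-based targets need additional identity settings in the body, and capabilities are enabled with separate calls; neither is shown here.

```python
import json

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2023-11-01"  # assumption: confirm against the REST API reference

# Service-direct target type names, keyed by a short label used only in this sketch.
SERVICE_DIRECT_TARGETS = {
    "vm": "Microsoft-VirtualMachine",
    "vmss": "Microsoft-VirtualMachineScaleSet",
    "nsg": "Microsoft-NetworkSecurityGroup",
    "aks": "Microsoft-AzureKubernetesServiceChaosMesh",
}


def build_enable_target_request(resource_id, kind):
    """Describe the PUT request that enables a service-direct target on a resource."""
    target_type = SERVICE_DIRECT_TARGETS[kind]
    url = (
        f"{ARM_ENDPOINT}{resource_id}"
        f"/providers/Microsoft.Chaos/targets/{target_type}"
        f"?api-version={API_VERSION}"
    )
    # Service-direct targets take an empty properties object.
    return {"method": "PUT", "url": url, "body": {"properties": {}}}


if __name__ == "__main__":
    # Placeholder resource ID; replace with the ID of your own NSG.
    nsg_id = (
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Network/networkSecurityGroups/<nsg-name>"
    )
    print(json.dumps(build_enable_target_request(nsg_id, "nsg"), indent=2))
```

The resulting request can then be sent with any authenticated ARM client, for example the `az rest` command.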

Creating the Chaos Experiments

Once the targets are all enabled, the next step is to create experiments.

We have built many experiments to demonstrate the capabilities of Azure Chaos Studio at every layer of the architecture, but we will take five of them and dive deep into each.

In the upcoming parts, we will show how each experiment is set up (the config), followed by a demo, the outcome and the key takeaways.

Azure Chaos Studio — Experiment Screen

Link to Part 2 is here — Building Resilience using Azure Chaos Studio (AKS with SQL in VNET) — Part 2 | by Pradip VS | Jan, 2024 | Medium

If you have any comments or feedback, please post them and we will be glad to address them.

-Pradip VS, Cloud Solution Architect, Microsoft

Architect@Microsoft. I help & co-innovate with the customers in Generative AI, ML, Data Engineering, Analytics, Resiliency Engineering, Data Arch & Strategies.