Building Resilience using Azure Chaos Studio (AKS with SQL in VNET) — Part 3

Pradip VS
Published in Microsoft Azure
4 min read · Jan 12, 2024
Depiction of chaos in Azure Kubernetes Services — oil paint style — Bing Image Creator

This blog is co-authored with Vijayaraghavan Lakshmikanthan, Cloud Solutions Architect — Microsoft

In the previous part (part 2), we saw how the network-based fault injections work.

In this part, we will cover two experiments that inject faults at the AKS layer, viz. AKS Chaos Mesh IO Chaos and AKS Chaos Mesh Network Chaos.

Under the hood, these experiments leverage Chaos Mesh, which needs to be installed on the resource where we want to inject the fault (in our case, AKS). The steps to deploy Chaos Mesh in AKS can be found here.

AKS Chaos Mesh IO Faults

The aim of this fault is to run an IO fault against your AKS cluster. This helps re-create AKS incidents caused by IO delays and read/write failures when you use IO system calls such as open, read, and write.

In our experiment, we write logs to an Azure Storage Disk and observe how the fault impacts that operation.

AKS Chaos Mesh IO Fault

The input to the fault is provided in the form of a jsonSpec, which defines the fault that will be injected into AKS. You can read more about the I/O fault here: Simulate File I/O Faults | Chaos Mesh (chaos-mesh.org)

Let us briefly look at the parameters here.

AKS IO Faults Parameters.
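To make the shape of the jsonSpec concrete, here is a minimal sketch in Python that builds a Chaos Mesh IOChaos latency spec and serializes it into the string value that the jsonSpec parameter carries. The field names (action, mode, selector, volumePath, path, delay, percent) follow the Chaos Mesh IOChaos documentation. The namespace, label, mount path, and delay are hypothetical values chosen for illustration, not the ones used in our experiment.

```python
import json


def build_io_latency_spec(namespace, app_label, volume_path, delay, percent=100):
    """Return a Chaos Mesh IOChaos spec that adds latency to file I/O calls."""
    return {
        "action": "latency",               # delay file system calls
        "mode": "all",                     # target every pod matched by the selector
        "selector": {
            "namespaces": [namespace],
            "labelSelectors": {"app": app_label},
        },
        "volumePath": volume_path,         # mount point of the volume in the container
        "path": volume_path + "/**/*",     # files under the mount that are affected
        "delay": delay,                    # latency added to each affected call
        "percent": percent,                # percentage of calls that receive the fault
    }


def to_json_spec_parameter(spec):
    """Wrap the spec as a key/value parameter; the value is a JSON string."""
    return {"key": "jsonSpec", "value": json.dumps(spec)}


if __name__ == "__main__":
    # Hypothetical inputs: adjust to the namespace, label, and mount of your app.
    spec = build_io_latency_spec("default", "chaos-app", "/mnt/logs", "500ms")
    print(json.dumps(to_json_spec_parameter(spec), indent=2))
```

Building the spec as a dictionary and serializing it at the end avoids hand-escaping quotes inside the jsonSpec string, which is an easy place to make mistakes.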

Once the experiment is kick-started, log writes to the disk are delayed by the configured latency. Latency in writing logs to disk, especially when the logs are huge, can cause real-life problems such as performance impact on the application, data recovery issues, throughput reduction, and an overall degraded customer experience.

To overcome this impact, here are some recommendations: perform asynchronous logging; rotate and compress the log files, which can help mitigate fragmentation; batch writes; use Write-Ahead Log (WAL) technology; and manage log file size (many small files can be managed more gracefully than a single big file).
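As an illustration of the first two recommendations, the sketch below uses only the Python standard library to combine asynchronous logging with size-based rotation. It is not taken from the chaos-app; the logger name, file path, and size limits are assumptions. The application thread only places records on an in-memory queue, while a background listener thread performs the disk writes, so added IO latency slows the listener rather than the request path.

```python
import logging
import logging.handlers
import os
import queue
import tempfile


def start_async_logging(log_path, name="chaos_app",
                        max_bytes=10 * 1024 * 1024, backups=5):
    """Return (logger, listener) that write to a rotating file via a queue.

    The logger name and size limits are illustrative defaults.
    """
    log_queue = queue.Queue(-1)  # unbounded in-memory queue

    # Rotate at max_bytes and keep a fixed number of older files.
    file_handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backups
    )
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )

    # The listener drains the queue on its own thread and does the disk IO.
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    return logger, listener


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "app.log")
    logger, listener = start_async_logging(path, name="demo")
    logger.info("order %s processed", "A-1")
    listener.stop()  # drains remaining records before returning
    print(open(path).read())
```

Calling listener.stop() during shutdown matters: it processes whatever is still in the queue, so records are not lost when the process exits. Compression and batching are not shown here and would need additional handling.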

AKS Chaos Mesh Network Faults

This fault runs a network fault against your Azure Kubernetes Service (AKS) cluster. It is useful for re-creating AKS incidents that result from network outages, delays, duplications, loss, and corruption.

AKS Chaos Mesh Network Fault

The input to the fault is also provided in the form of a jsonSpec, which defines the fault that will be injected into AKS. You can read more about the network fault here: Simulate Network Faults | Chaos Mesh (chaos-mesh.org)

Let us briefly look at the parameters here.

AKS Network Chaos Faults.
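For reference, a network delay spec can be sketched the same way. The example below builds a Chaos Mesh NetworkChaos spec with the delay action and serializes it for the jsonSpec parameter. The field names (action, mode, selector, delay.latency, delay.jitter, delay.correlation) follow the Chaos Mesh NetworkChaos documentation. The namespace and latency values are hypothetical and are not the ones from our experiment.

```python
import json


def build_network_delay_spec(namespace, latency, jitter="0ms", correlation="0"):
    """Return a Chaos Mesh NetworkChaos spec that delays pod network traffic."""
    return {
        "action": "delay",                   # add latency to network packets
        "mode": "all",                       # target every pod matched by the selector
        "selector": {"namespaces": [namespace]},
        "delay": {
            "latency": latency,              # base delay added to each packet
            "jitter": jitter,                # variation around the base delay
            "correlation": correlation,      # correlation with the previous delay, as a string
        },
    }


def to_json_spec_parameter(spec):
    """Wrap the spec as a key/value parameter; the value is a JSON string."""
    return {"key": "jsonSpec", "value": json.dumps(spec)}


if __name__ == "__main__":
    # Hypothetical inputs: 200ms of latency for all pods in the default namespace.
    spec = build_network_delay_spec("default", "200ms", jitter="20ms")
    print(json.dumps(to_json_spec_parameter(spec), indent=2))
```

Note that the selector here names no port or destination IP, which matches the behavior described in this section: every connection made by the selected pods is affected.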

When the experiment runs, it injects latency into AKS. The difference from the network latency fault covered earlier is that here the latency impacts all operations and is not bound to any port. It therefore impacts the app as a whole, not just specific operations such as the app talking to Azure SQL DB through port 1433 on a specific IP, which is what the network latency and disconnect faults did. This experiment helps simulate how a network fault inside AKS impacts the app talking to various downstream systems such as a DB or cache.

To improve resiliency in AKS against these network-related issues, especially for apps that are sensitive to latency, we recommend the following: use replica sets (provisioning multiple instances improves both resiliency and scalability; when hosting in Azure Kubernetes Service, you can declaratively configure redundant instances, or replica sets, in the Kubernetes manifest file), leverage accelerated networking if it is not already in use, use Proximity Placement Groups (PPG) to place Azure compute resources close to each other and reduce network latency, and use availability zones.
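As a rough sketch of the replica and availability zone recommendations, the Python snippet below generates a Kubernetes Deployment manifest with several replicas and a topology spread constraint across zones. The manifest fields are standard Kubernetes ones (apps/v1 Deployment, spec.replicas, topologySpreadConstraints with the topology.kubernetes.io/zone key). The app name and container image are hypothetical placeholders, and the snippet is not taken from the chaos-app repository.

```python
import json


def build_deployment(name, image, replicas=3):
    """Return a Deployment manifest with redundant, zone-spread replicas."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # redundant instances of the pod
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Prefer spreading replicas evenly across availability zones.
                    "topologySpreadConstraints": [
                        {
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            "whenUnsatisfiable": "ScheduleAnyway",
                            "labelSelector": {"matchLabels": labels},
                        }
                    ],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical app name and image, for illustration only.
    manifest = build_deployment("chaos-app", "example.azurecr.io/chaos-app:1.0")
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output can be saved to a file and applied with kubectl apply -f. ScheduleAnyway keeps the spread a preference rather than a hard requirement, so pods are still scheduled if a zone is unavailable.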

AKS network-related best practices can be read here.

You can find some of the best practices to improve the platform resiliency from here.

All AKS best practices are collated here.

We are concluding this series after discussing five experiments, our observations during these experiments, and the best practices to follow to improve platform resiliency. Following Azure best practices at each layer, starting in the very early phases (plan/design/build) of the product lifecycle, improves resiliency far more than addressing it at later stages. The following guide will help you build a better cloud application: Best practices in cloud applications - Azure Architecture Center | Microsoft Learn

Last but not least, we run many Azure Well-Architected Framework sessions with our customers and have improved resiliency, performance, security, and operational excellence with reduced cost.

We request you to share your scenarios and how you improved resiliency in your apps in the comments section.

We are happy to discuss Azure Chaos Studio and the various architectural patterns you have set up to improve resiliency.

URL of the app — faizc/chaos-app: App to simulate scenarios using Chaos Studio (github.com)

AKS commands used — akscommands/chaosapp-akscommands at main · VSPradip/akscommands (github.com)

— Pradip VS, Cloud Solution Architect — Microsoft.

