Building Resilience using Azure Chaos Studio (AKS with SQL in VNET) — Part 2

Pradip VS · Published in Microsoft Azure · Jan 12, 2024
Depiction of Chaos in Networks — Digital Art — Bing Image Creator

This blog is co-authored with Vijayaraghavan Lakshmikanthan, Cloud Solutions Architect — Microsoft

In the previous part (link), we discussed the environment setup. In this part, we cover the experiments we ran at various layers, along with the outcomes and learnings. We will start with Network Latency, proceed to Network Disconnect, and end with NSG Security Rule. A couple of AKS-related fault injections follow in the next part.

In this part we will deep dive into the specifics of the following three experiments. For the initial steps of creating an experiment and adding branches, refer to the article here, as these steps are similar for all experiments.

  1. Network Latency
  2. Network Disconnect
  3. NSG Security Rule

Network Latency

The aim of this fault is to increase the network latency between the application and the backend database, which in our case is Azure SQL DB.

In this experiment, let us review the parameters one by one.

Network Latency Experiment — Azure Chaos Studio Experiment page
Network Latency Experiment Parameters.
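
For orientation, the parameters shown above roughly map to an action definition in the experiment's JSON. The sketch below renders that shape as a Python dictionary; the fault name and parameter keys follow the Chaos Studio agent-based fault library as we understand it, and the values (port 1433, 10 seconds of latency, 10-minute duration) are illustrative for this setup, so verify everything against the JSON view in the portal before relying on it.

```python
# Illustrative shape of the network latency action, expressed as a Python dict.
# Fault name and parameter keys are our best understanding of the Chaos Studio
# fault library; confirm against your experiment's JSON view in the portal.
import json

network_latency_action = {
    "type": "continuous",
    "name": "urn:csci:microsoft:agent:networkLatency/1.1",  # agent-based fault
    "duration": "PT10M",                                     # run for 10 minutes
    "parameters": [
        {"key": "latencyInMilliseconds", "value": "10000"},  # ~10 s added latency
        {
            "key": "destinationFilters",
            # Delay only traffic towards the SQL endpoint on port 1433.
            # <sql-endpoint-ip> is a placeholder; check the exact filter fields
            # supported by your fault version.
            "value": json.dumps(
                [{"address": "<sql-endpoint-ip>", "portLow": 1433, "portHigh": 1433}]
            ),
        },
    ],
    "selectorId": "Selector1",  # points at the VM Scale Set target
}

print(json.dumps(network_latency_action, indent=2))
```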

Once the settings are done, select the target resource, which in this case will be the VM Scale Set instance where the fault will be injected.

You can choose the target from the list, or, if you want to choose targets dynamically, use a KQL query to select them at runtime.

Once the selections are made, save the experiment. You can add more branches or steps in the same experiment (for example, a delay of 10 minutes followed by another 10 minutes of network latency, to simulate a connection that is good for 10 minutes and then experiences latency on and off for the next 10 minutes).

The action is set. You can add more actions or more branches. You can also run faults in parallel; not every scenario demands a sequential run of experiments.

Before running the experiment, ensure the experiment is granted the right role (the necessary permissions) on the target resource(s). Refer to the following link for the roles based on the resource type: Supported resource types and role assignments for Chaos Studio | Microsoft Learn

Now, from the Overview tab, start the experiment and observe the application by connecting it to the database and loading tables.

The experiment works as expected: the page now takes 10,000+ milliseconds to load data from the SQL DB. The data loads eventually, but with a lot of latency.

With this experiment, we can test how the app behaves with different latencies and plan a failover to a secondary region, if needed, depending on the sensitivity of the application and the business impact of the latency.
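
One simple way to quantify this during the run is to time a representative query from the application host and compare it against a latency budget. A minimal sketch, assuming pyodbc, a placeholder connection string, and an illustrative 2-second budget:

```python
import time

import pyodbc

# Placeholder connection string for the primary Azure SQL DB.
PRIMARY = "Driver={ODBC Driver 18 for SQL Server};Server=<primary>;Database=<db>;..."
LATENCY_BUDGET_SECONDS = 2.0  # beyond this, alert or consider a failover


def timed_probe(conn_str: str) -> float:
    """Run a cheap query and return the observed round-trip time in seconds."""
    start = time.monotonic()
    conn = pyodbc.connect(conn_str, timeout=5)
    try:
        conn.cursor().execute("SELECT 1").fetchone()
    finally:
        conn.close()
    return time.monotonic() - start


elapsed = timed_probe(PRIMARY)
print(f"DB round trip: {elapsed * 1000:.0f} ms")
if elapsed > LATENCY_BUDGET_SECONDS:
    # Depending on the sensitivity of the app, raise an alert here or switch
    # reads to a secondary region / replica.
    print("Latency budget exceeded; consider failover or a degraded mode.")
```

Running the same probe before, during, and after the experiment gives you a simple before/after comparison to decide what latency budget the app can realistically tolerate.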

Similarly, you can target other ports or networks that your app uses (for example, connections to a cache or other databases) to determine the overall behavior and plan a fallback or alternate path.

Network Disconnect

In this experiment we completely block network traffic to a specified destination and port range.

The fault settings are similar to Network Latency. In the previous use case, we were able to load data from the SQL database after a delay; in this case, the application will not load any data from the SQL database, as this experiment disconnects the connection between the app and the SQL DB on port 1433.

At least one destinationFilter or inboundDestinationFilter array must be provided. Here we block outbound traffic, not inbound.

When the experiment is executed on the same VM Scale Set, the following behavior is observed: you will get the message "Error loading data from DB". The page itself still loads, because we are only blocking port 1433, but no data is retrieved or listed since communication with the database is blocked.

Here, if the details are critical, you can plan an alternate path, such as a fallback datastore that can serve the information until the main database recovers.
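
A minimal sketch of that idea, using a simple in-process dictionary as a stand-in for a real fallback store; the Orders query and the connection string are hypothetical:

```python
import pyodbc

_last_good_results: dict[str, list] = {}  # stand-in for a real fallback store


def read_orders(conn_str: str, query: str = "SELECT TOP 20 * FROM Orders"):
    """Read from the primary SQL DB, falling back to the last known copy."""
    try:
        conn = pyodbc.connect(conn_str, timeout=5)
        try:
            rows = conn.cursor().execute(query).fetchall()
        finally:
            conn.close()
        _last_good_results[query] = rows  # refresh the fallback copy
        return rows, "primary"
    except pyodbc.Error:
        # Port 1433 blocked or network disconnected: serve the cached copy
        # (possibly stale) instead of "Error loading data from DB".
        if query in _last_good_results:
            return _last_good_results[query], "fallback-cache"
        raise
```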

Consider a real-life example wherein an e-commerce website shows by when a product can be shipped. Customers may make a purchase decision based on that date. If connectivity to the database from which this information is retrieved is impacted, an alternate database must be able to provide an approximate delivery date instead of showing the delivery date as blank or an error message, as that results in a negative user experience. (Many customers purchase items with the delivery date in mind.)

A sample screenshot. If the database is not accessible due to port issues or a network disconnect on a specific port, use a backup database connected on a different port as the fallback to retrieve a delivery date range, so the customer sees an expected delivery window rather than a blank field or an error.
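
The same pattern at the feature level: if the exact delivery date cannot be looked up, show a conservative range instead of an error. The lookup callable and the 3-to-7-day window here are hypothetical stand-ins:

```python
from datetime import date, timedelta
from typing import Callable

# Hypothetical typical shipping window used when the exact lookup fails.
FALLBACK_MIN_DAYS, FALLBACK_MAX_DAYS = 3, 7


def delivery_estimate(product_id: str, lookup: Callable[[str], date]) -> str:
    """Return the exact delivery date, or a conservative range if the lookup fails."""
    try:
        return lookup(product_id).isoformat()  # e.g. a query against the primary DB
    except Exception:
        # Primary unreachable (port 1433 blocked, NSG rule, ...): show a range
        # from the fallback path rather than a blank field or an error message.
        low = date.today() + timedelta(days=FALLBACK_MIN_DAYS)
        high = date.today() + timedelta(days=FALLBACK_MAX_DAYS)
        return f"{low.isoformat()} to {high.isoformat()}"
```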

NSG Security Rule

This experiment will enable manipulation or rule creation in an existing Azure network security group (NSG) or set of Azure NSGs, assuming the rule definition is applicable across security groups.

Useful for:

  • Simulating an outage of a downstream or cross-region dependency/non-dependency.
  • Simulating an event that’s expected to trigger a logic to force a service failover.
  • Simulating an event that’s expected to trigger an action from a monitoring or state management service.
  • Using as an alternative for blocking or allowing network traffic where Chaos Agent can’t be deployed.

The difference between this experiment and the previous ones is that instead of injecting the fault at the virtual network level, we inject the fault within a Network Security Group. This lets you simulate the impact when a rule is misconfigured and accordingly develop a fallback plan to handle such events.

NSG Experiment Parameters.

When the experiment is run, you can observe that the site (app) will NOT open at all (even the login page will not load), or, if the application is already loaded, you will see the "connection lost, trying to reconnect" message from the application.

Page showing the "connection lost, trying to reconnect" error.
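
During the run, a quick way to confirm from the application host that the injected NSG rule is actually taking effect is a plain TCP probe towards the blocked destination. A minimal sketch; the host and port are placeholders for this setup:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholders: the SQL private endpoint and port used in this setup.
print("SQL reachable:", can_reach("<sql-endpoint-ip>", 1433))
```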

Observations and learnings

From this set of experiments, we learned that the network forms the backbone of every environment: any issue at the network layer impacts the entire system, which in turn can severely impact the user experience. Our recommendation is therefore twofold:

  1. Use availability zones wherever possible, so that your infra is deployed in two datacenters instead of just one; if one datacenter faces a crisis, your app can bring the data from the other DC without much impact on latency. Similarly, you can have one more region to fail over to in case of a region outage.
  2. Have a fallback plan (as depicted in the retail scenario above). If there is a network impact, say your network team has deployed a new feature and it is impacting some selective ports but not the entire network, then having a fallback option with another database on another port range helps avoid user impact and gives the app teams sufficient time to switch regions, or the engineering team time to roll back the feature.

We can plan more scenarios, like bringing the data from a cache or hiding the impacted service. For example, if the recommendation system is impacted due to a port issue, you can hide that section (it is not mission critical) or replace it with another service such as frequently purchased items or top deals for the day. Refer to this article.
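
A sketch of that kind of graceful degradation, assuming a hypothetical recommendations endpoint and a static top-deals list as the stand-in replacement:

```python
import requests

TOP_DEALS = ["Deal of the day", "Frequently purchased", "Bestsellers"]  # static fallback


def recommendations_section(user_id: str) -> dict:
    """Return personalised recommendations, or a generic section if the service is down."""
    try:
        resp = requests.get(
            f"https://recs.example.internal/users/{user_id}",  # hypothetical endpoint
            timeout=2,
        )
        resp.raise_for_status()
        return {"title": "Recommended for you", "items": resp.json()}
    except requests.RequestException:
        # Recommendation service unreachable (e.g. its port is blocked):
        # hide the personalised section and show top deals instead.
        return {"title": "Top deals for the day", "items": TOP_DEALS}
```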

If your app is critical, you can prepare for a region outage by having a similar setup in another region, with the following in place.

  1. Have a secondary Azure SQL DB with active geo-replication and RA-GRS. This not only gives high durability; you can also split the workload to another region if there are constant read requests coming from a set of users or apps (a sketch of routing read traffic follows this list). Active geo-replication uses the Always On availability group technology to asynchronously replicate the transaction log generated on the primary replica to all geo-replicas. This helps when you have to switch to another region immediately, as you can recover most of your data with minimal loss.
     Even though there is no fault that can be injected directly into Azure SQL DB, following best practices like using the Business Critical tier, tuning the database by optimizing queries, managing indexes, updating statistics, and frequently monitoring the system (regular maintenance) can improve the resiliency of Azure SQL.
  2. Similarly, you can have a caching layer with an active-active geo-replication setup (the Azure Cache for Redis Enterprise tier gives this feature), as discussed in the other use case (Link).
  3. Since the app deployed on AKS is stateless, one can switch to other regions, and enabling auto-scale helps it quickly scale to meet user demand.
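
As a concrete example of the read splitting mentioned in point 1, the application can route read-only queries to the readable geo-secondary and keep writes on the primary. A minimal sketch; the server names are placeholders, and whether you point reads at the geo-secondary directly or at a failover group read-only listener depends on your setup:

```python
import pyodbc

# Placeholder connection strings for the primary and the readable geo-secondary.
PRIMARY = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<primary-server>.database.windows.net,1433;Database=<db>;..."
)
SECONDARY = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<secondary-server>.database.windows.net,1433;Database=<db>;..."
)


def get_connection(read_only: bool) -> pyodbc.Connection:
    """Route read-only work to the geo-secondary, everything else to the primary."""
    return pyodbc.connect(SECONDARY if read_only else PRIMARY, timeout=5)


# Reads (reports, listings) hit the secondary region; writes stay on the primary.
reads = get_connection(read_only=True)
writes = get_connection(read_only=False)
```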

These are some of our observations and recommendations. In the next part, we will focus on similar experiments that run in the context of Azure Kubernetes Service (AKS), how they behave compared to these, and our observations and learnings.

Link to Part 3 is here — Building Resilience using Azure Chaos Studio (AKS with SQL in VNET) — Part 3 | by Pradip VS | Jan, 2024 | Medium

If you have any comments or feedback, please post them and we will be glad to address them.

-Pradip VS, Cloud Solution Architect, Microsoft.
