Building Resiliency using Azure Chaos Studio (AKS with Cosmos DB and Redis Cache)

Pradip VS · Published in Microsoft Azure · 11 min read · Dec 28, 2023
A digital art depicting Chaos in databases — Bing Image Creator

This blog is co-authored with Faiz Chachiya, Sr. Cloud Solutions Architect — Microsoft

It is good to be back with Azure Chaos Studio, and this time we are demonstrating two end-to-end use cases/architectures that organizations typically use for deploying their key services. We will run chaos experiments at various layers of the architecture to show how the application behaves under each experiment, and draw learnings from them to improve the resiliency posture. Another use case can be read here.

Before we do a deep dive into the architecture, let's hear some good news: "Azure Chaos Studio is GA!" (Announced at MS Ignite 2023.)

GA announcement — Azure Chaos Studio (microsoft.com)

You can check the regional availability and deploy your experiments! (Watch this space, as more regions will be added in the future.) — Azure Products by Region | Microsoft Azure

Pricing of Azure Chaos Studio — Azure Chaos Studio — Pricing | Microsoft Azure

Azure Chaos Studio documentation — Azure Chaos Studio documentation — tutorials, API reference | Microsoft Learn

Azure Chaos Studio can improve the resiliency of a system, as it offers various experiments to test your systems end to end. Refer to this article for more — Resiliency and Chaos Engineering — Part 7 | by Pradip VS | Medium. Many features are coming to Azure Chaos Studio, and they will let you test Azure services in ways that are not possible with other tools available in the market. Beyond this, Microsoft has a grand vision for resiliency, and we are improving the resiliency of our massive data centers day by day through various projects. Watch this blog for more — Advancing Reliability | Microsoft Azure Blog | Microsoft Azure

In the last set of articles (links), we described how the user experience is impacted when we induce chaos experiments at various layers of an architecture: an app deployed in AKS, hosted in a virtual network, that fetches data from Azure SQL DB and writes logs to disks.

In this article, we are going to focus on a similar architecture, but only on experiments induced on Azure Cosmos DB and Azure Cache for Redis. Here are the architecture, configuration, and flow:

  1. The app is hosted on AKS with the following configuration: version 1.25.1, node pools of 2 to 5 with autoscaling, size Standard_D4as_v4, private cluster enabled, OS Ubuntu Linux, network profile type Azure CNI, network policy Calico. The AKS cluster is hosted in a virtual network. The app calls the cache and Cosmos DB to fetch the data. GitHub link for the app — (to be published soon); AKS commands used — akscommands/chaosapp-akscommands at main · VSPradip/akscommands (github.com)
  2. The data is hosted in Azure Cosmos DB with the following configuration: a database with different containers, each container with its own throughput assigned. Here, we have set 100 to 1,000 RUs in autoscale mode. The data is replicated between West US 2 and East US 2 (single write region: WUS 2; EUS 2 is a read region and can be promoted to write if WUS 2 runs into any issues).
WUS 2 as the write region and EUS 2 as the read region.

3. Azure Cache for Redis is hosted in WUS 2 on the Premium P1 tier (6 GB cache, replication) with one shard (total size 6 GB). The data in the cache is hydrated by the app hosted in AKS, with the TTL set to 60 seconds (every 60s the cache data is refreshed with data from Cosmos DB through the app; you can tune the TTL based on the app's sensitivity to stale data by rigorous testing, arriving at a value that neither overwhelms the app/Cosmos DB nor serves stale data to users who need the latest). The cache is set up in only one region for the demo, but in reality it can be single-region or multi-region depending on the app's criticality.

Azure Cache for Redis with Premium Tier 6 GB — one shard

You can find a separate blog that covers building resiliency through Caching in detail from here.

4. We have hosted all of the components in a virtual network, and no systems other than the components within the subnet and the whitelisted services can access the app.

5. Below is the architecture we have set up for the demo. The flow: the apps deployed in AKS first reach out to the cache for data, and if it is not available, they fetch it directly from Azure Cosmos DB. The cache is constantly hydrated every 60s by the app (a minimal code sketch of this cache-aside flow follows the diagram below).

Architectural setup for our demo
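To make this flow concrete, below is a minimal cache-aside sketch in Python. The hostnames, keys, and database/container names are illustrative placeholders; this sketches the pattern rather than the demo app's actual code.

```python
# Minimal cache-aside sketch: try Redis first, fall back to Cosmos DB on a
# miss or cache outage, then rehydrate the cache with a 60s TTL.
# Hostnames, keys, and database/container names are placeholders.
import json

import redis
from azure.cosmos import CosmosClient

CACHE_TTL_SECONDS = 60  # matches the hydration interval described above

cache = redis.Redis(
    host="<your-cache>.redis.cache.windows.net",
    port=6380,  # Azure Cache for Redis exposes TLS on 6380
    ssl=True,
    password="<access-key>",
)

cosmos = CosmosClient("https://<your-account>.documents.azure.com:443/", "<key>")
container = cosmos.get_database_client("demo-db").get_container_client("items")

def get_item(item_id: str, partition_key: str) -> dict:
    cached = None
    try:
        cached = cache.get(f"item:{item_id}")
    except redis.RedisError:
        pass  # cache unavailable (e.g., shard reboot): fall through to Cosmos DB
    if cached:
        return json.loads(cached)

    # Cache miss or cache outage: read directly from Cosmos DB.
    item = container.read_item(item=item_id, partition_key=partition_key)

    try:  # best-effort rehydration; never fail the request because of the cache
        cache.setex(f"item:{item_id}", CACHE_TTL_SECONDS, json.dumps(item))
    except redis.RedisError:
        pass
    return item
```

Note how the cache read and the rehydration are both best-effort: when Redis is down, every request silently degrades to a direct Cosmos DB read, which is exactly the failure mode the experiments below exercise.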

Now come the scenarios we want to test with the experiments: What if both Redis and Cosmos DB are down? What if Redis is undergoing shard restarts? Can your secondary Redis and Cosmos DB (in another region) handle the load? Will cross-region calls impact the app latency? How do you gracefully move traffic to another region or DC when both Cosmos DB and Redis are down? How do you NOT overwhelm the database when the cache is down? How do you build resilient apps for such scenarios? These are the questions this article helps answer.

Note: Since we have already covered experiments like network latency, disconnects, NSG security rules, and AKS-based fault injections (links will be published soon), we will focus only on fault injection in Azure Cosmos DB and Azure Cache for Redis.

You can also run a couple of experiments on Azure Key Vault to see how the app behaves when AKV is down or access is denied. That is not covered in this experiment, but you can do it programmatically based on this article.

To build experiments from scratch, please refer here; the process is similar for all experiments. All experiments can be built programmatically; please refer to these articles, which explain how to build chaos experiments on Redis both through the portal and through the REST API.

REST API:

Injecting Fault in Azure Cache for Redis using Azure Chaos Studio through Rest API (Part 2) | by Pradip VS | Microsoft Azure | Medium

Portal:

Injecting Fault in Azure Cache for Redis using Azure Chaos Studio | by Pradip VS | Microsoft Azure | Medium

Now let us jump into the experiments.

Once the targets are enabled with service-direct faults, you can start building experiments that run against these services.

Faults enabled for Azure Cosmos DB and Azure Redis.

The first experiment restarts the Redis cache shards with reboot type "AllNodes" (the fault can target only the primary node, only the secondary, or all nodes). We can restart the shards in sequence or in parallel, depending on the number of shards and how the application is set up. Below is the setup of the fault; here we are restarting only shard 0. The target resource is set to the Azure Cache for Redis instance used in this app setup (a programmatic sketch of this fault follows).
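For reference, here is a rough sketch of creating this fault programmatically via the management REST API, following the approach from the articles linked above. The api-version, the fault URN, and the parameter names (rebootType, shardId) should be verified against the Chaos Studio fault library; subscription, resource group, and cache names are placeholders.

```python
# Sketch: create a Chaos Studio experiment that reboots shard 0 of a Redis
# cache on all nodes, via the ARM REST API. Verify api-version, fault URN,
# and parameter names against the Chaos Studio docs; IDs are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, EXP = "<subscription-id>", "<resource-group>", "redis-reboot-exp"
TARGET_ID = (f"/subscriptions/{SUB}/resourceGroups/{RG}/providers/"
             "Microsoft.Cache/Redis/<cache-name>/providers/"
             "Microsoft.Chaos/targets/Microsoft-AzureCacheForRedis")

experiment = {
    "location": "westus2",
    "identity": {"type": "SystemAssigned"},
    "properties": {
        "selectors": [{"type": "List", "id": "redis-selector",
                       "targets": [{"type": "ChaosTarget", "id": TARGET_ID}]}],
        "steps": [{"name": "step1", "branches": [{"name": "branch1", "actions": [{
            # Reboot is a discrete (one-shot) fault, so no duration is set.
            "type": "discrete",
            "name": "urn:csci:microsoft:azureCacheForRedis:reboot/1.0",
            "selectorId": "redis-selector",
            "parameters": [
                {"key": "rebootType", "value": "AllNodes"},
                {"key": "shardId", "value": "0"},
            ],
        }]}]}],
    },
}

token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default").token
url = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
       f"/providers/Microsoft.Chaos/experiments/{EXP}?api-version=2023-11-01")
resp = requests.put(url, json=experiment,
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
```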

The second experiment is a Cosmos DB failover, which takes the primary region of Cosmos DB offline and promotes the secondary (or whichever read region we define) to the write region. In the setup below, we fail over a Cosmos DB account from West US 2 (primary) to East US 2 (secondary); a sketch of the corresponding fault action follows.
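Plugged into the same experiment skeleton as the Redis fault above, the failover action would look roughly like this. The URN and the readRegion parameter name are taken from the fault library as we understand it; verify them against the current docs.

```python
# Sketch of the Cosmos DB failover action for the experiment skeleton above.
# It is a continuous fault: the specified read region is promoted to write
# for the fault duration, then Chaos Studio fails back. Verify the URN and
# parameter name against the Chaos Studio fault library.
cosmos_failover_action = {
    "type": "continuous",
    "name": "urn:csci:microsoft:cosmosDB:failover/1.0",
    "duration": "PT10M",              # WUS 2 stays offline for these 10 minutes
    "selectorId": "cosmos-selector",  # a selector pointing at the Cosmos DB target
    "parameters": [
        # Read region to promote to write during the fault (assumption: the
        # parameter is named "readRegion", per the fault library).
        {"key": "readRegion", "value": "East US 2"},
    ],
}
```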

Before running the experiments, ensure the experiment's identity is granted the right role on the target resource(s) (the necessary permissions have to be given). Refer to the given link for the roles based on the resource type — Supported resource types and role assignments for Chaos Studio | Microsoft Learn

Now both experiments are running. The Redis cache shard restart happens fairly quickly, while the Cosmos DB failover takes a couple of minutes.

When the experiment kicked off, we tried to hydrate the cache and got the message "Error Loading from Cache" (the screen shows previously loaded data). We have only one shard, in one region, so any problem with that shard makes the cache inaccessible for some time. Traffic then gets routed to the database to fetch the data, and if the hit rate is high, the database can be overwhelmed by the volume of read requests.

When the app tries to hydrate from the cache, one can notice a pop-up: error loading from cache.

So, we now try to load from Cosmos DB, as the cache failed. We can see the data is fetched, and it comes from the primary region, West US 2.

The data is fetched from Cosmos DB West US 2

Now the experiment is running,

and the failover is successful: East US 2 is promoted to the write region while West US 2 is taken offline (WUS 2 stays offline for the 10-minute experiment duration; once it completes, Cosmos DB performs internal checks and syncs before WUS 2 is brought back online and made primary).

Now when the application is run, the data is fetched from EUS 2, making cross-DC calls (the app is running from WUS 2).

Here are some of our observations and recommendations based on these experiments (this architecture is close to customers' production workload scenarios):

  1. We observe that customers often have only one shard, and mostly in one region. We also observe that they are on the Standard tier even for key applications, or on the Premium tier when they actually need active geo-replication.

To improve the resiliency of the applications and the consumer experience: for applications that are key/important, use the Premium tier over Standard, enable multiple shards (say, at least 2 or 3), and ensure your application switches to another shard in case one shard is down or rebooted (a client sketch follows below).
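With clustering enabled on the Premium tier, a cluster-aware client discovers the shard map and reroutes commands when a shard's primary is rebooted and its replica is promoted. A rough sketch using redis-py's RedisCluster follows; the hostname and key are placeholders, and you should verify client compatibility with your cache's clustering setup.

```python
# Sketch: a cluster-aware client for a clustered Premium cache. RedisCluster
# discovers which shard owns each key slot and reroutes commands when the
# topology changes (e.g., during a shard reboot). Placeholders throughout.
from redis.cluster import RedisCluster

rc = RedisCluster(
    host="<your-cache>.redis.cache.windows.net",
    port=6380,  # TLS port for Azure Cache for Redis
    ssl=True,
    password="<access-key>",
)
rc.set("greeting", "hello")  # routed to the shard owning this key's slot
print(rc.get("greeting"))
```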

For applications that are mission critical, our recommendation is to move to the Enterprise tier, as it supports "active" geo-replication; also plan multiple shards so the data is evenly distributed and there is a fallback option if one shard or region goes down. The service tiers and features of Azure Cache for Redis are listed on this site and will help you choose the right one for your app — refer to these articles — 1, 2, 3

2. While Cosmos DB is a very fast, distributed NoSQL database offering single-digit-millisecond latency, running directly against it when the Redis cache is down will not only overwhelm the database but also increase RU consumption, which adds significant cost and increases latency. If the allocated RUs are not sufficient, this may lead to 429s (throttling), since RUs are allocated based on expected need. Performance takes a further hit if all of this happens across DCs.
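One way to soften this failure mode is to treat 429s as a signal to back off. The azure-cosmos SDK already retries throttled requests internally; the sketch below adds an outer jittered-backoff guard as a last resort. Names are placeholders, and the retry-hint handling is our assumption of a reasonable pattern rather than an official recipe.

```python
# Sketch: guard direct Cosmos DB reads against 429s (throttling) when the
# cache is down. The SDK retries internally first; this is an outer guard.
import random
import time

from azure.cosmos import exceptions

def read_with_backoff(container, item_id, partition_key, attempts=5):
    """Point read with jittered backoff on 429s."""
    for attempt in range(attempts):
        try:
            return container.read_item(item=item_id, partition_key=partition_key)
        except exceptions.CosmosHttpResponseError as e:
            if e.status_code != 429 or attempt == attempts - 1:
                raise  # not throttling, or out of retries
            # Honor the server's retry hint when present, plus a little jitter.
            headers = getattr(e.response, "headers", None) or {}
            retry_ms = float(headers.get("x-ms-retry-after-ms", 1000))
            time.sleep(retry_ms / 1000 + random.uniform(0, 0.25))
```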

Some of our recommendations on Cosmos DB: use the integrated cache built into Cosmos DB, which not only helps reduce RU consumption but also improves latency and reduces the burden on the DB. — Azure Cosmos DB integrated cache | Microsoft Learn
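Using the integrated cache requires a dedicated gateway on the account and reads at session or eventual consistency. Here is a rough sketch of a point read through it in Python; the dedicated gateway endpoint format and the staleness keyword should be verified against the integrated cache docs.

```python
# Sketch: reading through the Cosmos DB integrated cache. The client connects
# to the account's dedicated gateway endpoint (the Python SDK uses gateway
# mode). Endpoint format and kwarg name are assumptions to verify.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<your-account>.sqlx.cosmos.azure.com/",  # dedicated gateway endpoint
    "<key>",
    consistency_level="Session",  # integrated cache serves session/eventual reads
)
container = client.get_database_client("demo-db").get_container_client("items")

# Accept up to 60s-old data from the cache; cache hits do not consume RUs.
item = container.read_item(
    item="<id>",
    partition_key="<pk>",
    max_integrated_cache_staleness_in_ms=60_000,
)
```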

Also, if your application demands availability over consistency, you can consider enabling multi-region writes (multi-master). While multi-master is priced higher, you get the benefit of users writing to the database in closest proximity rather than routing all writes to another region. This helps when there is a region outage: you can immediately switch to another region with very minimal data loss (the other region keeps writing instead of waiting to be promoted to a write region) — How to configure multi-region writes in Azure Cosmos DB | Microsoft Learn
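On the client side, once multi-region writes are enabled on the account itself, the SDK can be pointed at the nearest regions. A minimal sketch, with the account name and region list as placeholders:

```python
# Sketch: client configuration for an account with multi-region writes.
# The SDK then routes writes to the nearest listed region instead of a
# single write region. Account name and regions are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    "<key>",
    multiple_write_locations=True,                   # write to the nearest region
    preferred_locations=["West US 2", "East US 2"],  # read/failover order
)
```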

3. Finally, ensure your application is set up with a fully provisioned DR region that is ready to take over when the primary fails. You can use an active-active or active-passive region setup based on your app's criticality and serve traffic from both regions (having two regions not only improves traffic distribution but also lets you provision minimal resources in each region instead of one region with more capacity). While Azure has paired regions (e.g., WUS 2 and EUS 2 are paired by default, and depending on your settings your data is backed up in the paired region), it is still advisable to set up a DR region.

Azure Cosmos DB keeps adding features that help here. Per-region and per-partition autoscale (preview) allows differential RU allocation across regions (e.g., WUS 2 can have 600K RUs while EUS 2 has 200K RUs for the same DB/containers) — Per-region and per-partition autoscale (preview) — Azure Cosmos DB | Microsoft Learn. Data lifecycle management moves rarely used data to a cold/archive tier. With partition merge, when data is moved to a different tier or purged, Cosmos DB merges partitions instead of maintaining the same number, assuring more RUs per partition and better throughput; redistributing throughput across partitions also helps give a skewed/hot partition more throughput. Finally, using burst capacity in Cosmos DB wherever possible lets you draw on unused capacity to handle spikes. All of this means DR and better resiliency at lower cost.

The same applies to other Azure services: you can improve resiliency significantly by taking advantage of their new features, thereby improving the customer experience at lower cost.

These are some of our observations and recommendations, which significantly improved the resiliency of the Cosmos DB and Redis operations. There is a lot more, which we will document in the coming days based on the other set of experiments we conducted and how we improved the resiliency posture of some of the critical apps we worked on with customers.

Please share your scenarios and how you improved resiliency in your apps in the comments section.

Happy to discuss Azure Chaos Studio and the various architectural patterns you have set up to improve resiliency.

— Pradip VS, Cloud Solution Architect — Microsoft
