Injecting Fault in Azure Cache for Redis using Azure Chaos Studio

Pradip VS
Published in Microsoft Azure · 8 min read · Apr 20, 2022

This blog demonstrates how to inject faults into Azure Cache for Redis using Azure Chaos Studio.

I have covered caching elaborately in the Resiliency and Chaos Engineering Series — Part 4 and how it serves resiliency more than content acceleration. I have covered Azure Chaos Studio and some demos in Part 7 of the same series.

This post shows, step by step, how to inject faults into Azure Cache for Redis using Azure Chaos Studio and how to verify that the fault was actually injected. Note that Azure Chaos Studio can inject the fault at shard level: you can restart one shard at a time or several shards at once, and with the delay feature you can inject a fault into one shard and, after a delay, into the same or another shard. Azure Chaos Studio gives you the flexibility to inject faults in sequence or in parallel, with or without delays.

Let us jump into the scenario and demo.

Assume this scenario: applications use a cache, and the cache restarts suddenly. How will the applications behave? Will they go to the database directly, adding latency, or will the secondary cache node take over (and if so, how quickly) to avoid damage to the application and the user experience? To test this and other scenarios, Azure Chaos Studio provides a fault that restarts a Redis cache at shard level.

1. Create a Redis cache and decide the number of shards based on the application's needs. The shard details can be found either in the portal or through the az redis command below:

az redis show -n <<redis cache name>> -g <<resource group name>>

shard 0 and 1 are created for experiment purposes
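To confirm the shard layout from the command line without reading the full JSON, a minimal sketch along these lines may help. It assumes the `instances[]` entries returned by `az redis show` carry `shardId`, `isPrimary`, and `sslPort` fields (the monitoring script later in this post already relies on `instances[].sslPort`). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. The shardId / isPrimary field names are assumed from
# the `instances` array that `az redis show` returns for a clustered cache.

# Print one line per node: shard id, primary flag, TLS port (tab separated).
list_redis_nodes() {
  az redis show -n "$1" -g "$2" \
    --query "instances[].[shardId, isPrimary, sslPort]" -o tsv
}

# Count the distinct shard ids in that listing.
count_redis_shards() {
  list_redis_nodes "$1" "$2" | cut -f1 | sort -u | grep -c .
}
```

For example, `count_redis_shards <<redis cache name>> <<resource group name>>` should print 2 for the two-shard cache used in this demo.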

2. Now build an experiment in Azure Chaos Studio that restarts the above Redis cache instance on the primary node, the secondary node, or all nodes. Before that, you have to onboard the Azure Cache for Redis instance as a target in Azure Chaos Studio. Azure Chaos Studio is available in only a few regions (as it is in preview), so the target you onboard must be in a supported region (check the region availability of Chaos Studio at Azure Products by Region | Microsoft Azure). (Note: Azure Chaos Studio can inject faults into resources in other regions and in other subscriptions, but the region has to be supported.)

To onboard Azure Cache for Redis into Chaos Studio for fault injection, go to Azure Chaos Studio → Targets, select the Azure Cache for Redis instance you created, and choose Enable Targets → Service Direct. (Currently only Service Direct fault injection is supported for Redis, not agent-based.)

Onboarding a target into Azure Chaos Studio and enabling it with Service direct or Agent based
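The same onboarding can probably be scripted through the ARM REST API with `az rest`. The sketch below is not taken from the original walkthrough: the target type `Microsoft-AzureCacheForRedis`, the capability name `Reboot-1.0`, and the `api-version` value are assumptions to check against the current Chaos Studio documentation, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions: target type "Microsoft-AzureCacheForRedis", capability
# "Reboot-1.0", and the api-version below. Verify them before use.
CHAOS_API_VERSION="${CHAOS_API_VERSION:-2023-11-01}"

# Build the ARM URL of the Chaos Studio target nested under a Redis cache.
chaos_target_url() {
  local redis_id="$1"   # full ARM resource id of the cache, starting with "/"
  echo "https://management.azure.com${redis_id}/providers/Microsoft.Chaos/targets/Microsoft-AzureCacheForRedis"
}

# Enable the cache as a service-direct target, then enable the reboot capability.
enable_redis_chaos_target() {
  local url
  url="$(chaos_target_url "$1")"
  az rest --method put --url "${url}?api-version=${CHAOS_API_VERSION}" \
    --body '{"properties":{}}'
  az rest --method put \
    --url "${url}/capabilities/Reboot-1.0?api-version=${CHAOS_API_VERSION}" \
    --body '{"properties":{}}'
}
```

The resource id can be read with `az redis show -n <<redis cache name>> -g <<resource group name>> --query id -o tsv` and passed to `enable_redis_chaos_target`.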

In this example, as a first step, I have built a chaos experiment that:

restarts shard 0 on the primary node only, not on the secondary node, and
targets the Redis cache created in step 1.

Restart primary node shard 0
Target is set to the resource that you have created

3. Now go to the Azure Cache for Redis instance and, under Access Control (IAM), grant this chaos experiment the Redis Cache Contributor role.

Redis Cache Contributor for Chaos Experiment
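For those who prefer the CLI over the IAM blade, the role assignment could look roughly like this. It assumes the experiment uses a system-assigned managed identity whose object id is exposed as `identity.principalId` on the experiment resource. The function name and the `api-version` value are assumptions.

```shell
#!/usr/bin/env bash
# Assumption: the experiment has a system-assigned identity, readable as
# identity.principalId on the experiment resource.
CHAOS_API_VERSION="${CHAOS_API_VERSION:-2023-11-01}"

grant_experiment_redis_access() {
  local experiment_id="$1"  # ARM id of the chaos experiment
  local redis_id="$2"       # ARM id of the Redis cache
  local principal_id
  # Look up the object id of the experiment's managed identity.
  principal_id="$(az rest --method get \
    --url "https://management.azure.com${experiment_id}?api-version=${CHAOS_API_VERSION}" \
    --query identity.principalId -o tsv)"
  # Grant that identity the Redis Cache Contributor role on the cache only.
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Redis Cache Contributor" \
    --scope "$redis_id"
}
```

Scoping the assignment to the single cache keeps the experiment's permissions as narrow as the portal flow shown above.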

4. Run the chaos experiment and, once it succeeds, check the metrics to confirm that the Redis cache restarted.

Experiment ran successfully

We can use metrics such as operations per second per shard or server load to see whether the restart actually happened.
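The same metrics can also be pulled from the CLI instead of the portal charts. This is only a sketch: the metric names `serverLoad` and `operationsPerSecond` and the `ShardId` dimension are assumptions about how Azure Monitor exposes the Redis metrics, so confirm them in the portal's metric picker. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions: metric name "serverLoad" (or "operationsPerSecond") and the
# "ShardId" dimension. Confirm both in the portal before relying on them.
redis_shard_metric() {
  local redis_id="$1" metric="${2:-serverLoad}"
  # One-minute maxima over the default time window, split per shard.
  az monitor metrics list \
    --resource "$redis_id" \
    --metric "$metric" \
    --interval PT1M \
    --aggregation Maximum \
    --filter "ShardId eq '*'" \
    -o table
}
```

A dip in `redis_shard_metric "$redis_id" operationsPerSecond` around the experiment's start time would be consistent with the shard restart.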

Next, check how your application behaves until the restarted instance comes back. Note the latencies, whether the secondary took over, how quickly the restarted node came back and synced up, and how quickly the latencies returned to normal levels.
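One simple way to note latencies while the experiment runs is to time a representative request in a loop. The helper below is a hypothetical sketch, not part of the original setup. It assumes GNU `date` (for nanosecond timestamps) and times whatever command you hand it.

```shell
#!/usr/bin/env bash
# Run a command and print how long it took in milliseconds (GNU date assumed).
# The command's own output and exit status are discarded on purpose, so that
# failed requests during the restart are still timed.
time_ms() {
  local start end
  start="$(date +%s%N)"
  "$@" >/dev/null 2>&1 || true
  end="$(date +%s%N)"
  echo $(( (end - start) / 1000000 ))
}
```

For instance, calling `time_ms` every few seconds on a request that goes through the cache (your own health endpoint, or a redis-cli PING) gives a rough latency series to compare before, during, and after the restart.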

Now let me add a few more steps to the experiment. In addition to restarting shard 0, I'm adding two more steps:

a. Delay for 5 minutes (or wait for 5 minutes)
b. Restart shard 1

Now there will be three steps, and the restarts are invoked only on the primary node.

If your Redis cache already has multiple shards, you are set; otherwise, scale the cluster to add shards as needed.

Scaling up the Redis cluster size and now there will be two shards 0 & 1
Once the scale up is over, it is reflected in the portal or redis show command

5. Edit the chaos experiment and add the two steps mentioned above.

adding delay for 5 minutes and restarting shard 1 after that
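For reference, the three-step experiment built in the portal corresponds roughly to a JSON definition like the one this sketch prints. The fault URN, the `RebootType` / `ShardId` parameter keys, the selector id `RedisTarget`, and the overall JSON shape are assumptions based on my reading of the Chaos Studio fault library, so treat the output as a starting point and compare it with the JSON view of your own experiment.

```shell
#!/usr/bin/env bash
# Emit an experiment body: reboot shard 0, wait 5 minutes, reboot shard 1,
# primary nodes only. URNs, parameter keys and JSON shape are assumptions.
reboot_action() {   # $1 = shard id
  cat <<EOF
{"type":"discrete","name":"urn:csci:microsoft:azureCacheForRedis:reboot/1.0",
 "selectorId":"RedisTarget",
 "parameters":[{"key":"RebootType","value":"PrimaryNode"},{"key":"ShardId","value":"$1"}]}
EOF
}

step() {            # $1 = step name, $2 = action JSON
  printf '%s\n' "{\"name\":\"$1\",\"branches\":[{\"name\":\"Branch 1\",\"actions\":[$2]}]}"
}

experiment_body() { # $1 = location, $2 = ARM resource id of the Redis cache
  local delay='{"type":"delay","name":"urn:csci:microsoft:chaosStudio:timedDelay/1.0","duration":"PT5M"}'
  cat <<EOF
{"location":"$1","identity":{"type":"SystemAssigned"},
 "properties":{
  "selectors":[{"id":"RedisTarget","type":"List","targets":[{"type":"ChaosTarget",
    "id":"$2/providers/Microsoft.Chaos/targets/Microsoft-AzureCacheForRedis"}]}],
  "steps":[$(step "Step 1" "$(reboot_action 0)"),
           $(step "Step 2" "$delay"),
           $(step "Step 3" "$(reboot_action 1)")]}}
EOF
}
```

Running `experiment_body <location> <redis resource id> > experiment.json` writes the definition to a file. Putting each action in its own step makes them run in sequence; actions placed in parallel branches of one step would run at the same time.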

6. Run the experiment again and observe that shards 0 and 1 get restarted with a delay of 5 minutes between them. (Check the metrics.)

Experiment ran successfully

Various metrics show that the shards are restarted one after the other with a delay of 5 minutes.

There are many scenarios one can envision. If the cache reboots and recovers as expected, and the application isn't impacted by the reboot, then your system/architecture is already built to be resilient. But if the reboot caused temporary unavailability, latency, or some unexpected behaviour, then this experiment has revealed a gap in the resilience of the application.

This should prompt architects to think about the right approach to close that gap: maybe add shards, put the cache behind a load balancer, enable zone redundancy, or enable geo-replication if only standard replication was in place, among other options. The URLs below, and some of the URLs under the best practices, explain how to set up Azure Cache for Redis effectively, both to make applications resilient and to handle disasters efficiently.

High availability for Azure Cache for Redis | Microsoft Docs

Best practices for connection resilience — Azure Cache for Redis | Microsoft Docs

This chaos experiment / fault injection can be achieved using the portal, the REST API, or the CLI.
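As a rough illustration of the REST route, the sketch below wraps the two ARM calls involved in creating and starting an experiment with `az rest`. The function names are hypothetical and the `api-version` value is an assumption. The experiment definition is expected in a JSON file that you supply.

```shell
#!/usr/bin/env bash
# Assumption: api-version below. Check the Chaos Studio REST reference.
CHAOS_API_VERSION="${CHAOS_API_VERSION:-2023-11-01}"

# Create or update an experiment from a JSON definition stored in a file.
create_chaos_experiment() {
  local experiment_id="$1" body_file="$2"
  az rest --method put \
    --url "https://management.azure.com${experiment_id}?api-version=${CHAOS_API_VERSION}" \
    --body "@${body_file}"
}

# Start a run of an existing experiment.
start_chaos_experiment() {
  local experiment_id="$1"
  az rest --method post \
    --url "https://management.azure.com${experiment_id}/start?api-version=${CHAOS_API_VERSION}"
}
```

Here `experiment_id` is the full ARM id of the experiment, of the form `/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Chaos/experiments/<name>`.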

Next, if you want to track the health of Redis, you can use redis-cli.

If you have an Ubuntu system, all you have to do is install the Redis CLI:
https://redis.io/docs/getting-started/installation/install-redis-on-linux/#install-on-ubuntu

If you are a hardcore Windows person like me, install WSL and install the Redis CLI inside it:

Install Redis on Windows | Redis

If needed, install jq too using the command

sudo apt-get install jq

If a command you invoke in the CLI fails with a wrong username-password pair error, do the following.

Run which az and which jq to see which installation of each tool is being used.

which az and which jq commands

If it resolves to the Windows installation of the CLI rather than an Azure CLI installed inside WSL, install the Azure CLI for Linux: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-linux?pivots=apt

Once done, you can reverify and see if it uses Azure CLI
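The check can also be scripted. The sketch below assumes the usual WSL layout, where Windows drives are mounted under `/mnt/<drive letter>/`, so a tool resolving to such a path is the Windows build rather than the Linux one. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# True (exit 0) if the path lives on a Windows drive mounted into WSL.
is_windows_path() {
  case "$1" in
    /mnt/[a-z]/*) return 0 ;;   # e.g. /mnt/c/Program Files/...
    *)            return 1 ;;
  esac
}

# Report where a tool resolves and whether that looks like a Windows binary.
check_tool() {
  local path
  path="$(command -v "$1")" || { echo "$1: not installed"; return 0; }
  if is_windows_path "$path"; then
    echo "$1: $path (Windows binary, install the Linux version in WSL)"
  else
    echo "$1: $path"
  fi
}
```

Calling `check_tool az` and `check_tool jq` gives the same information as the `which` commands, with a hint when the Windows binary is being picked up.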

az login

az account show

az login lets you log in with your ID (if you have not done so already), and az account show displays the subscription the CLI currently points to. Check both before you run the commands below.

You can either run the commands below as-is at the prompt or save them as a shell script and invoke it.

#!/usr/bin/env bash

# Print a message prefixed with a UTC timestamp (nanosecond precision, GNU date).
log() {
  echo "[$(date -Ins -u)] $1"
}

# Ping every node of the cache (one TLS port per node) in an endless loop.
monitor_redis_nodes() {
  cache_name="$1"
  cache_rg="$2"
  cache_json="$(az redis show -n "$cache_name" -g "$cache_rg")"
  redis_host="$(echo "$cache_json" | jq -r '.hostName')"
  instance_ports="$(echo "$cache_json" | jq -r '.instances[].sslPort')"
  cache_pwd="$(az redis list-keys -n "$cache_name" -g "$cache_rg" --query primaryKey -o tsv)"

  while true; do
    while read -r port; do
      log "Pinging port $port..."
      # timeout exits with status 124 when it has to kill the command
      REDISCLI_AUTH="$cache_pwd" timeout 3 redis-cli --tls -h "$redis_host" -p "$port" PING
      if [[ $? == 124 ]]; then
        log "Command timed out"
      fi
    done <<< "$instance_ports"
    sleep 5
  done
}

monitor_redis_nodes svdchaosredis chaosrg

Put the above in a shell script file and invoke it (optionally you can add set -x on top of the above script if you want to debug)

vi chaosredis.sh
chmod +x chaosredis.sh
./chaosredis.sh

If you are running this from a PC whose IP address needs to be allowed through the cache's firewall, add that IP in the Firewall settings of the Azure Cache for Redis instance.
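The firewall rule can also be added from the CLI. The sketch below uses `az redis firewall-rules create` with a single-address range. The function name and the default rule name `chaosclient` are hypothetical.

```shell
#!/usr/bin/env bash
# Allow one client IP through the cache firewall (start and end of the range
# are the same address). The default rule name "chaosclient" is a placeholder.
allow_client_ip() {
  local cache_name="$1" cache_rg="$2" ip="$3" rule="${4:-chaosclient}"
  az redis firewall-rules create \
    --name "$cache_name" --resource-group "$cache_rg" \
    --rule-name "$rule" \
    --start-ip "$ip" --end-ip "$ip"
}
```

Call it as `allow_client_ip <<redis cache name>> <<resource group name>> <your public IP>`, and remove the rule again once the test is done.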

Run the command and observe the results

Before the chaos experiment started, all the pings got a response.

After the fault is injected (shards restarted), the following is observed: the pings time out, connections are refused, or other errors are returned.

Once the Redis is restarted and back to normal, we can see the pings are back to normal
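If the monitor output is saved to a file, a small helper can condense it into a summary. This sketch is hypothetical: it assumes the log contains the script's "Pinging port" lines and that a healthy node answers with a line consisting of exactly PONG.

```shell
#!/usr/bin/env bash
# Summarize a saved run of the monitor script: total pings, successful
# replies (lines that are exactly "PONG"), and the difference as failures.
summarize_ping_log() {
  local log_file="$1" pings pongs
  pings="$(grep -c 'Pinging port' "$log_file" || true)"
  pongs="$(grep -c '^PONG$' "$log_file" || true)"
  echo "pings=$pings ok=$pongs failed=$((pings - pongs))"
}
```

For example, run `./chaosredis.sh 2>&1 | tee ping.log` during the experiment, stop it with Ctrl+C afterwards, and then call `summarize_ping_log ping.log`.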

This can be automated as described in Part 7 of the Resiliency and Chaos Engineering blog series.

This concludes the blog on how to inject faults into Azure Cache for Redis through the Azure portal and monitor them using metrics in the portal as well as redis-cli commands. In the next part, I will cover how the entire experiment can be built using a REST API based approach.

I would like to sincerely thank my colleague Chris Rice for the guidance on Azure Cache for Redis and with the redis-cli scripts & setup, which were super useful in building this demo.

Thank you, and let me know if you have any specific questions.

Pradip VS

Cloud Solution Architect — Microsoft


Architect@Microsoft. I help & co-innovate with the customers in Generative AI, ML, Data Engineering, Analytics, Resiliency Engineering, Data Arch & Strategies.