Resiliency and Chaos Engineering — Part 2

5 min read · Mar 15, 2022

This part continues from Part 1. Please read the previous part first: it outlines the agenda and briefly covers failures, why they are unavoidable in the modern era of cloud computing and distributed systems, and the way forward.

One key thing to note: when we are dealing with smaller systems of up to a few tens of instances, 100% operational excellence is often the normal state and failure is an exceptional condition. However, in large-scale systems, the probabilities are such that 100% operational excellence is nearly impossible to achieve. Therefore, the normal state of operation is partial failure.
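
To make the probability argument concrete, here is a minimal sketch. It assumes every instance fails independently and has the same availability; the 99.9% figure and the function name `prob_all_healthy` are illustrative assumptions, not numbers from a real system.

```python
def prob_all_healthy(per_instance_availability: float, instance_count: int) -> float:
    """Probability that every instance is up at the same moment.

    Assumes independent failures and identical availability per instance.
    """
    if not 0.0 <= per_instance_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if instance_count < 0:
        raise ValueError("instance_count must be non-negative")
    return per_instance_availability ** instance_count


# Hypothetical fleet sizes, each instance assumed to be 99.9% available.
for n in (10, 100, 1000, 10000):
    chance = prob_all_healthy(0.999, n)
    print(f"{n:>6} instances: {chance:.4%} chance that all are healthy")
```

Under these assumptions, the chance that every instance is healthy at once is about 99% for 10 instances but only about 37% for 1,000, which is why partial failure becomes the normal state at scale.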

In this part I will talk about points 4 and 5.

4. How to embrace failure and improve it through resiliency?

5. How to achieve continuous resiliency? The architectural patterns to be followed in improving it.

Resiliency at the heart of every layer. Source: Link

In fact, point 5 is a big topic, so I will break it down into 2 to 3 articles before moving on to the subsequent points in the agenda.

Okay, let's continue from where we left off.

How do you trade the fear of failure for a growth opportunity and improvement?
The answer is to embrace failure and move towards continuous resilience.

Let us first look at the definition of Resiliency and talk about the patterns to improve it.

Resilience is the ability for a system to respond, absorb, adapt to, and eventually recover from unexpected conditions.

Continuous resilience is a philosophy, a mindset that embraces complexity, values continuous improvement, and understands that failures are inevitable. It is a way to anticipate failure, effectively monitor and respond to issues, and encourage learning.

Resilient systems embrace the idea that failures are typical, and that it is completely OK to run applications (non-life-critical applications) in what we call partially failing mode.

Building resilient architecture isn’t all about software. It starts at the infrastructure layer, progresses to the network and data, influences application design and extends to people and culture.

Resilient architectural patterns are classified into four categories:

Infrastructure Resiliency
Avoiding the butterfly effect
Caching as a resiliency pattern, rather than only a way to deliver content faster
Continuous health checks and monitoring of important metrics

Let us talk about Infrastructure Resiliency now.

As mentioned above, resiliency starts at the infrastructure layer before it is applied to the layers above. The following are some of the key patterns to follow for infrastructure resiliency:

1. Redundancy: duplication of components of a system in order to increase the overall availability of that system (e.g., multiple Availability Zones, a data replication strategy). The idea is to have 3 replicas wherever possible, be it Availability Zones or data replicas. The standard of 3 redundant systems comes from the availability formula for redundant components: when each component is about 99% available and fails independently, every redundant system added increases availability by about 2 nines and thus decreases downtime drastically.

The ecommerce giant follows this principle across the stack: their VMs use fault domains and update domains, and their NoSQL (Azure Cosmos DB) and RDBMS (Azure SQL DB) stores have replication enabled. All of these instances and data are housed in different zones, regions, or racks that do not share the same power systems or networking cables.
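
The redundancy claim can be checked with a small calculation. This sketch assumes the standard formula for parallel redundancy, A = 1 - (1 - a)^n, with independent failures and one surviving replica being enough to serve traffic. The 99% per-component availability and both function names are illustrative assumptions.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when one live copy suffices."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # The system is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


def yearly_downtime_minutes(availability: float) -> float:
    """Expected downtime per 365-day year, in minutes."""
    return (1.0 - availability) * 365 * 24 * 60


# Assumed: each component alone is 99% available ("two nines").
for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(f"{n} replica(s): {a:.6%} available, "
          f"about {yearly_downtime_minutes(a):.2f} minutes of downtime per year")
```

With these inputs, going from one to three replicas moves expected downtime from several days per year to well under a minute. Real systems rarely have fully independent failures, so treat the result as an upper bound on what redundancy alone can deliver.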

2. Infrastructure as Code (IaC): improves repeatability, avoids manual errors, and serves as version history and a means of knowledge sharing. IaC is also preferred when CI/CD and DevOps practices are adopted, as it makes deployments easier, faster, and less error-prone.

The ecommerce client does all their deployments using Terraform, and hence all their environments have a consistent version of the infrastructure deployed.

3. Stateless Application: applications must treat every client request independently of prior requests or sessions and should never store session information on local disk or in local memory. This is where a cache comes into play; it will be covered in the fourth section on this topic. The cache should be distributed so that it can fail over to a secondary region if the primary fails (redundancy).

The ecom giant uses MeghaCache (an implementation of memcache) and, in some cases, Azure Redis Cache to keep their applications stateless. The caches are replicated across data centers in US regions (East, South Central, West, etc.).
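
As a rough illustration of the stateless pattern, the sketch below keeps all session data in an external store so that any application instance can serve any request. `InMemorySessionStore` is a hypothetical stand-in for a distributed cache (it is not the MeghaCache or Redis API), and `CartService` is an invented example service.

```python
from typing import Dict


class InMemorySessionStore:
    """Hypothetical stand-in for a distributed cache shared by all instances."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, int]] = {}

    def get(self, key: str) -> Dict[str, int]:
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(key, {}))

    def set(self, key: str, value: Dict[str, int]) -> None:
        self._data[key] = dict(value)


class CartService:
    """One application instance. It holds no per-user state of its own."""

    def __init__(self, store: InMemorySessionStore) -> None:
        self._store = store

    def add_item(self, session_id: str, sku: str, quantity: int = 1) -> Dict[str, int]:
        cart = self._store.get(session_id)      # load state from the shared store
        cart[sku] = cart.get(sku, 0) + quantity
        self._store.set(session_id, cart)       # write it back before responding
        return cart


store = InMemorySessionStore()
instance_a = CartService(store)
instance_b = CartService(store)

# Two different instances handle consecutive requests from the same session.
instance_a.add_item("session-42", "sku-1")
cart = instance_b.add_item("session-42", "sku-2", 2)
print(cart)
```

Because neither instance keeps the cart locally, either one can be terminated or replaced without the user losing anything, which is what makes the other patterns in this list (auto scaling, immutable infrastructure) practical.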

4. Auto Scaling: automatically adjusting capacity to demand. This is a key pattern for optimizing resources and cost. It is far easier to achieve in the cloud than on premises. The cloud is designed for this, and architects should use this feature wherever possible.

Azure Cosmos DB and Azure Databricks are two key examples where our customers leverage autoscale efficiently: it absorbs spikes by scaling automatically, say during a sales event. During normal times, the service runs at the minimum scale, thereby saving customers significant costs in a way that on-premises or statically provisioned hardware cannot.
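
The idea of adjusting capacity to demand can be sketched with a generic proportional (target-tracking) rule. This is not how Azure Cosmos DB or Azure Databricks implement autoscale internally; the function name, the utilization percentages, and the min/max bounds are all assumptions made for illustration.

```python
def desired_instances(
    current_instances: int,
    current_utilization_pct: int,
    target_utilization_pct: int,
    min_instances: int,
    max_instances: int,
) -> int:
    """Scale the fleet so average utilization moves toward the target.

    Utilization values are whole percentages to keep the arithmetic exact.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    # Ceiling division: round up so the fleet is never undersized.
    raw = -((-current_instances * current_utilization_pct) // target_utilization_pct)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, raw))


# Hypothetical scenarios: a sales-event spike, then a quiet period.
print(desired_instances(4, 90, 60, min_instances=2, max_instances=20))
print(desired_instances(4, 10, 60, min_instances=2, max_instances=20))
```

The floor keeps a minimum footprint running during quiet periods, and the ceiling caps cost during spikes; those two bounds are where the cost savings described above come from.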

5. Immutable Infrastructure: replace rather than update. In many scenarios you can either update part of a system in place or spin up a new system and deploy the solution to it. When a system is upgraded in place, say with a new OS on the old hardware stack, the application may not work as expected, which can result in unnecessary triage. The ideal way is to spin up new VMs, for example with the new OS and the most compatible hardware, and test whether the app performs as expected. The ideal way to test this deployment is the canary deployment technique.

Canary Deployment for Immutable Infrastructure.

For one of our customers, we faced an issue where Cassandra deployed on Azure VMs showed high latencies. We found multiple root causes, among them that it was running on gen 4 hardware and that the CentOS version was outdated and no longer supported. Instead of upgrading the OS in place, we are moving completely to new hardware and the current stable version of CentOS to make the Cassandra databases more performant. Here we are following the immutable infrastructure pattern with the canary deployment technique to check how the change performs before rolling it out to the larger pool of Cassandra systems.
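
A canary rollout of replacement nodes can be sketched as a staged plan plus a gate that compares the new nodes against the existing ones. Everything here is hypothetical: the stage percentages, the 10% regression threshold, the latency samples, and the function names are assumptions for illustration, not the process used for the Cassandra migration.

```python
from statistics import median
from typing import List, Sequence


def rollout_plan(total_nodes: int, stages_pct: Sequence[int] = (5, 25, 100)) -> List[int]:
    """Cumulative number of nodes replaced with new VMs at each stage."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    # Ceiling division so that even the first stage replaces at least one node.
    return [-((-total_nodes * pct) // 100) for pct in stages_pct]


def canary_passes(
    baseline_latencies_ms: Sequence[float],
    canary_latencies_ms: Sequence[float],
    max_regression: float = 0.10,
) -> bool:
    """True if the canary's median latency is within the allowed regression."""
    if not baseline_latencies_ms or not canary_latencies_ms:
        raise ValueError("both latency samples must be non-empty")
    allowed = median(baseline_latencies_ms) * (1.0 + max_regression)
    return median(canary_latencies_ms) <= allowed


plan = rollout_plan(40)
print("nodes replaced per stage:", plan)

# Invented latency samples (ms) from old nodes and from the first new nodes.
old_nodes = [20.0, 22.0, 21.0]
new_nodes = [21.0, 23.0, 22.0]
if canary_passes(old_nodes, new_nodes):
    print("canary healthy: continue to the next stage")
else:
    print("canary regressed: stop and keep the old nodes serving traffic")
```

Because the old nodes are never modified, stopping the rollout simply means routing traffic back to them, which is the main operational benefit of replacing rather than updating.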

Finally,

6. Automation: wherever possible, we automate. This brings many benefits, such as avoiding manual effort and errors and saving time and cost.

This concludes the first pattern in resiliency architecture practices. Following these patterns improves infrastructure resiliency and the availability of systems.

In the next part, let us focus on Avoiding the butterfly effect (or the cascading failures).

Part 3 URL is Here.

Thanks & Stay tuned —

Pradip

Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)

Written by Pradip VS