Resiliency and Chaos Engineering — Part 4

Pradip VS
7 min read · Mar 18, 2022


This part covers caching as a pattern for resiliency, not just as a way to accelerate content delivery. I will cover the techniques and considerations to follow while using caching in an architecture.

This part is a continuation of parts 1, 2 and 3. Kindly go through them for broader context.

Caching is often used for accelerating content, but it serves the resiliency pillar very well. Source

“Research shows that 250ms will provide a competitive advantage to the fastest of two competing solutions.”
— Source: impatient-web-users-flee-slow-loading-sites

“Another analysis shows that with every 100ms increase in load time, sales dropped by 1%. A clear issue for ecommerce giants!”
— Source: Greg Linden, Data Scientist, Microsoft — A/B testing

While caching is often associated with accelerating content delivery, it is also a key architectural component from a resiliency standpoint.

There are many caching techniques — static, dynamic, page, in-memory, private, shared caching, etc. The recommendation is to use them wherever appropriate, but be cautious: the potential impacts of caching done incorrectly are covered below.

The following are the benefits of Caching from a resiliency standpoint:

  1. Protection from DDoS attacks — serving from the edge and accepting only well-formed connections avoids SYN floods and UDP reflection attacks.

Since DDoS attacks are mostly geographically isolated (close to the source), using a CDN can greatly improve the ability to continue serving traffic to end users during larger DDoS attacks.

2. Dynamic content caching — helps absorb thousands or millions of hits that would otherwise reach the database, preventing devastating effects on it.

3. Improved application scalability — moving some responsibilities to the cache helps applications take more requests and scale easily.

Points 2 and 3 capture the key purpose of a cache: when hits on the database are reduced, it is free to serve the important requests, which helps the application scale.

4. Reduced load on downstream services — sparing the database from repeated, identical requests.

5. Graceful degradation — cached data can be used to populate an application's user interface even if a dependency is temporarily unavailable (e.g., streaming sites like Netflix or Amazon Prime Video).

This is one of the important points covered under cascading failures (service degradation and fallbacks). If the Recommendation Movies microservice is unavailable, it is better to populate some content from the cache than to show a 404 error or pull down the whole site (a cascading failure), as in the example below.

Recommendation Movies row. Source: Amazon Prime — my personal subscription.

It is better to show something like this from the cache if the Recommendation Movies microservice is not available, substituting it with “Amazon Originals”.

Showing the Amazon Originals row when Recommendation Series is not available. Source: Amazon Prime — my personal subscription.

That is better than showing a 404 as below:

Showing a 404, or this error pulling down the whole of amazonprime.com, puts the site in a bad light and hurts the user experience. Source: Amazon Prime — my personal subscription.

So, use the cache even if the data is eventually consistent or stale — it helps with graceful degradation.
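The fallback idea from point 5 can be sketched in a few lines. This is a minimal, hypothetical example — `RecommendationClient`, `get_recommendations` and the default row are illustrative names, not the actual Amazon Prime implementation:

```python
class RecommendationClient:
    """Hypothetical stand-in for a Recommendation Movies microservice."""
    def __init__(self, healthy=True):
        self.healthy = healthy

    def fetch(self, user_id):
        if not self.healthy:
            raise ConnectionError("recommendation service unavailable")
        return ["Movie A", "Movie B"]

# Curated fallback row, e.g. "Amazon Originals", for when nothing is cached.
DEFAULT_ROW = ["Amazon Original 1", "Amazon Original 2"]

def get_recommendations(client, cache, user_id):
    """Serve live data when possible; degrade to stale cache, then a default row."""
    try:
        recs = client.fetch(user_id)
        cache[user_id] = recs          # refresh the cache on every success
        return recs
    except ConnectionError:
        # Stale data beats a 404: fall back to the last cached value,
        # or to a generic curated row if nothing is cached for this user.
        return cache.get(user_id, DEFAULT_ROW)
```

The point is that the error path returns *something renderable* rather than propagating the failure up to the page, which is what turns a dependency outage into graceful degradation instead of a cascading one.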

6. Lower overall solution cost — using cached data can reduce overall solution costs, especially for pay-per-request services.

All the benefits listed above not only contribute to improved performance but also improve the resiliency and availability of applications.

A key aspect to call out is that many of Microsoft's products and services are built with caching. Azure Cosmos DB now offers an integrated cache, which some customers have already leveraged and benefitted from. Similarly, Azure Synapse's adaptive caching based on NVMe SSD hardware provides significant performance improvements, Redis cache can be leveraged in many solutions, and enabling caching at the gateway and edge improves not only performance but also resiliency.

The de facto advice is to leverage caching wherever possible. However, caching has its own challenges, and if not applied properly it may impact resiliency and availability.

  1. Data staleness & inconsistency — data written to primary storage is not reflected in the cache immediately, which leads to eventual consistency. Understand how tolerant the application is of stale data before implementing caching.
  2. Cache expiration policy — set the TTL for cached objects based on the application's sensitivity to stale data (soft-TTL and hard-TTL).

One point to observe is that although stale content needs to eventually expire, from a resilience point of view it should also be served when the origin server is unavailable, even if the TTL has expired — providing resiliency, high availability and graceful degradation at times of peak load or origin failure. To cope with this requirement, some caching frameworks support a soft-TTL and a hard-TTL: cached content is refreshed once the soft-TTL expires, and if refreshing fails, the caching service continues serving the stale content until the hard-TTL expires. Source: Adrian Hornsby, Principal Technologist, Amazon
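The soft-TTL / hard-TTL behaviour described above can be sketched as follows. This is an illustrative, in-memory implementation (the class name and API are my own, not from any specific caching framework):

```python
import time

class SoftHardTTLCache:
    """Each entry carries two expiry horizons: after the soft-TTL we try to
    refresh from the origin but fall back to the stale value if the origin is
    down; after the hard-TTL the stale value is discarded unconditionally."""

    def __init__(self, soft_ttl, hard_ttl):
        assert soft_ttl <= hard_ttl
        self.soft_ttl, self.hard_ttl = soft_ttl, hard_ttl
        self._store = {}  # key -> (value, stored_at)

    def get(self, key, refresh):
        """refresh() reloads the value from the origin; it may raise."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age < self.soft_ttl:
                return value                      # fresh: serve directly
            if age < self.hard_ttl:
                try:
                    value = refresh()             # soft-expired: try origin
                    self._store[key] = (value, now)
                    return value
                except Exception:
                    return value                  # origin down: serve stale
            del self._store[key]                  # hard-expired: must reload
        value = refresh()                         # cache miss
        self._store[key] = (value, time.monotonic())
        return value
```

Between the two horizons the cache prefers freshness but degrades to staleness, which is exactly the resiliency trade-off the quote describes.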

3. Cache eviction — the cache size should be carefully chosen based on the application and its request pattern. If the size is not sufficient, the cache runs out of space and evictions cause more database hits, which overloads the database or limits the scalability of the application (as the database ends up serving requests the cache was supposed to handle).

4. Request coalescing — simultaneous cache misses should collapse into a single request to the downstream storage (aka waiting rooms), particularly during startup, restart or fast scale-up.

A note on request coalescing: when thousands of clients request the same piece of data at the same time and that request is a cache miss, a large number of simultaneous requests hits the downstream storage, which can lead to exhaustion. These situations often occur at startup, at restart, or during fast scale-up. To avoid this, it is better to collapse similar requests into one and issue that single request to the database than to let all the requests access the database simultaneously.

Caching patterns and best practices:

Cache Aside vs Inline Cache Pattern. Source

1. Cache-aside — treats the cache as a separate component from the system of record (SoR) — e.g., Redis, Memcached.

2. Inline cache — uses the cache as the primary component (the read-through and write-through patterns).

Inline caches are easy to maintain, but if one fails it is hard to compensate.

If an application wants to bypass the cache and write to the database directly (the write-around strategy), ensure the cache is populated with the most frequently read data and NOT the most frequently written data.
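In the cache-aside pattern the application itself owns the lookup/populate/invalidate logic, with the cache sitting beside the SoR rather than in front of it. A minimal sketch, using plain dicts as stand-ins for the cache (e.g., Redis) and the database (the function names are illustrative):

```python
def get_product(cache, db, product_id):
    value = cache.get(product_id)       # 1. try the cache first
    if value is not None:
        return value
    value = db[product_id]              # 2. on a miss, read the system of record
    cache[product_id] = value           # 3. populate the cache for next time
    return value

def update_product(cache, db, product_id, value):
    db[product_id] = value              # write to the system of record first...
    cache.pop(product_id, None)         # ...then invalidate rather than update,
                                        # so a concurrent stale read can't
                                        # overwrite the newer value
```

Note that reads populate the cache, so under this pattern the cache naturally fills with the most frequently *read* data — which is exactly the property the write-around advice above asks for.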

There are two other types of caching strategy that Microsoft recommends for distributed systems, and they are very commonly used by distributed applications:

  1. Private cache — the most basic type of cache is an in-memory store. It is held in the address space of a single process and accessed directly by the code running in that process. It is quick to access and effective for static data, but each instance stores its own copy of the data, and if the instance fails, its cache fails with it. A private cache is faster than a shared cache.

A key aspect of this cache is that different application instances hold different versions of the data in their caches. Therefore, the same query performed by these instances can return different results.
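This divergence is easy to demonstrate. In the hypothetical sketch below, two application instances each keep a private in-process cache over the same database; after an update, the same query gives two different answers:

```python
class AppInstance:
    """Hypothetical app instance with a private in-process cache over a shared DB."""
    def __init__(self, db):
        self.db = db
        self._cache = {}                # private: lives and dies with this process

    def read(self, key):
        if key not in self._cache:      # miss: read through to the database
            self._cache[key] = self.db[key]
        return self._cache[key]         # hit: never consults the database again

db = {"price": 100}
a, b = AppInstance(db), AppInstance(db)
a.read("price")                         # instance a caches 100
db["price"] = 120                       # the database is updated
b.read("price")                         # instance b caches 120
# a.read("price") still answers 100: same query, two different results
```

Until instance a's entry expires or is invalidated, the two instances disagree — which is why private caches need short TTLs (or a shared cache) whenever cross-instance consistency matters.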

Private Cache — Source

2. Shared cache — different application instances see the same view of the cached data. The benefit of the shared caching approach is the scalability it provides; however, it is slower than private caching.

Shared Cache — Source

Whether to use a shared cache, a private cache, or a combination of both depends on the application, and one has to weigh the advantages against the disadvantages before choosing a caching strategy.

Azure Cache for Redis is a popular caching service that can be accessed from any Azure application, whether it is implemented as a cloud service, a website, or inside an Azure virtual machine.

Azure Cache for Redis is a high-performance caching solution that provides availability, scalability and security.

Microsoft’s best practices & guidelines for caching are covered in detail here

To summarize, caching provides resiliency, and it is advisable to use it in your application wherever possible (edge, gateway, NoSQL store, DWH, etc.). It can be any type of caching — private, shared, inline or aside — chosen after carefully evaluating the application's sensitivity to stale data, the cache sizing for the given application, and the expiration and eviction policies.

In the next part, we will touch upon observability and how health metrics & systems play a key role in resiliency.

Thanks & Stay tuned….

Pradip

Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)



Architect@Microsoft. I help & co-innovate with the customers in Generative AI, ML, Data Engineering, Analytics, Resiliency Engineering, Data Arch & Strategies.