Caching in on Failure

Dan Turner
5 min read · Mar 29, 2019


What it means when your cache goes MIA

Photo by Nathan Dumlao on Unsplash

This is not my story, but it is one I’ve seen several times now. It’s worth telling because it happens time and time again, and it is an excellent example of deeper problems in software and systems engineering. Let’s follow a system’s growth over several years to explore the problem.

In the beginning, the system was built. Users were happy, on-call pagers were quiet, and for a time, things were good.

The birth of a service. Isn’t nature beautiful?

Word got around, and the system started to grow. Problems were solved, and users were happy. Slowly, though, new problems appeared as the system grew: some requests were slow, and the data store started to get expensive. On-call pagers started to get noisier. Eventually, an engineer realised that most of the problems could be solved by adding a cache.

A reasonable way to handle increased traffic

A cache is a quick solution to an age-old problem: getting data from the data store is too slow. But fear not! The cache keeps the most relevant data nearby, in memory, with the speed of your favourite hash table. The engineer adds a contingency: ‘IF the data is not available in the cache THEN check the data store.’ They know that the cache might not have the data needed, but they also know that the cache is a good first guess. A particularly diligent engineer might even set a timeout on the call to the cache!
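To make that read path concrete, here is a minimal sketch of the cache-aside logic in Python. The InMemoryCache and SlowDataStore classes are illustrative stand-ins, not anything from the system in this story:

```python
import time

class InMemoryCache:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)   # None means "cache miss"

    def set(self, key, value):
        self._data[key] = value

class SlowDataStore:
    def __init__(self, rows):
        self._rows = rows

    def get(self, key):
        time.sleep(0.05)             # pretend this is an expensive query
        return self._rows[key]

def read(key, cache, store):
    """IF the data is not available in the cache THEN check the data store."""
    value = cache.get(key)           # a diligent engineer would also bound this call with a timeout
    if value is not None:
        return value                 # fast path: cache hit
    value = store.get(key)           # slow path: fall through to the data store
    cache.set(key, value)            # repopulate so the next read is fast
    return value

cache = InMemoryCache()
store = SlowDataStore({"user:42": {"name": "Ada"}})
print(read("user:42", cache, store))  # slow: goes to the data store
print(read("user:42", cache, store))  # fast: served from the cache
```

Note that the fallback quietly assumes the data store can absorb whatever the cache cannot serve. That assumption is the seed of the trouble to come.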

The cache works well, and the system begins to scale once more. Users stop complaining and pagers go quiet. Management is happy because they can cut costs on that expensive data store. Profits begin to rise.

One day, the cache fails. The exact reason doesn’t matter; what matters is that the cache is either not responding or has no relevant data. The system must now serve every request directly from the data store, a data store that has not been scaled along with the rest of the system.

Who you gonna call?

The data store, flooded with unexpected traffic, goes down. The system frantically retries the failed requests, further increasing the load on the data store. Sometimes the system’s users desperately retry manually, trying to get their work done under the spectre of a looming deadline.
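The post doesn’t prescribe a fix at this point, but a common mitigation for this kind of retry storm (not named in the original story) is to bound the number of retries and back off with jitter. A minimal sketch in Python, assuming a hypothetical fetch_from_store callable that raises on failure:

```python
import random
import time

def fetch_with_backoff(fetch_from_store, key, max_attempts=4, base_delay_s=0.1):
    """Retry a failing read a bounded number of times, backing off with jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch_from_store(key)
        except Exception:
            if attempt == max_attempts - 1:
                raise                # give up and let the caller shed the request
            # Exponential backoff with full jitter spreads retries out in time,
            # instead of hammering the struggling data store in lock-step.
            time.sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
```

Backoff alone won’t save an under-provisioned data store, but it does stop the system from actively amplifying its own outage.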

Even if the root cause is fixed, the cache can’t warm up because the data store is too busy failing under load. The system has entered a fabled ‘meta-stable failure mode’. Recovery is costly, but the steps are well known:

  1. Turn away incoming traffic. In many businesses this means losing customers.
  2. Fix or work around the problem. This could take hours or days. Sometimes the only answer is to scale up the data store at great cost.
  3. Warm up the cache. There are two main strategies here: either use synthetic traffic or gradually ramp up incoming traffic (a rough sketch of the second follows this list).
  4. Write the postmortem.
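Here is one way the gradual-ramp strategy from step 3 might look. The ten-minute ramp and the serve/shed callables are illustrative assumptions, not details from the original incident:

```python
import random
import time

RAMP_DURATION_S = 600          # take ten minutes to reach 100% of traffic

ramp_started_at = time.time()

def admitted_fraction():
    """Fraction of live traffic we are willing to serve right now."""
    elapsed = time.time() - ramp_started_at
    return min(1.0, elapsed / RAMP_DURATION_S)

def handle(request, serve, shed):
    """Admit a growing slice of traffic; politely turn the rest away."""
    if random.random() <= admitted_fraction():
        return serve(request)  # admitted requests warm the cache as they go
    return shed(request)       # e.g. return 503 with a Retry-After header
```

The shed path is step 1 of the playbook in code form; the ramp slowly hands traffic back as the cache fills.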

The point of this post is not “caches are bad.” Far from it: caches are an essential tool in our arsenal. The obvious lessons are:

  1. Caches can help a system scale, but they should be treated with care.
  2. Meta-stable failure modes exist.

However, there is a deeper lesson about systems and software engineering here.

Systems send out two signals quite clearly: their operational expense and their ease of operation. Managers complain about the first, engineers complain about the second, and both complaints grow more acute as the system grows. A third factor, however, sends no signal at all: the system’s safety. There is no clear warning that the system is becoming unsafe.

The system edges closer to the boundary every day. People don’t really think about it because it’s an abstraction, a theoretical risk. Further, the boundary is fuzzy, more like a probability map than a line. One day you do something and everything is fine; the day after, someone else does the same thing and the whole system plunges off a cliff. Everyone looks around and asks, “How did we get here?”

It’s like walking up a staircase in your home that has a rotten support structure. Each step looks fine, then one day your foot goes through the fourth step and you break your ankle.

This is the Drift into Danger model: the system is gradually made unsafe under economic and operational pressures, and safety goes unacknowledged until there is an incident. Sometimes there are smaller incidents that are not fully investigated and not properly understood; the reasoning goes that the problem was minor, so the system must be safe. These incidents, if they exist, are a canary in the mine shaft. They should be treated as valuable feedback on the system’s safety.

There are a few ways to counter these issues:

  1. Proactively analyse the system to find problems. For example, build a table of dependencies (internal and external) and failure modes (fail-STOP, fail-SILENT, fail-SLOW, etc.) and analyse the likely outcomes (a toy example follows this list). Periodically re-check these to see if you’ve sprouted any new dependencies. Fix the problems you find.
  2. Push the system over the cliff with a ‘game day’ or continuously with Chaos Engineering. Fix the problems you find.
  3. Write postmortems for any alarm, even false alarms. Fix the problems you find.
  4. Read other teams’ or companies’ postmortems to find similarities. Fix the problems you find.
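As a concrete (and entirely invented) example of the dependency-and-failure-mode table from point 1, a toy version might look like this:

```python
# The dependencies and outcomes below are invented for illustration.
DEPENDENCIES = [
    # (dependency,    failure mode,  expected behaviour of our system)
    ("cache",         "fail-STOP",   "all reads fall through to the data store"),
    ("cache",         "fail-SLOW",   "request latency grows by the cache timeout"),
    ("data store",    "fail-STOP",   "reads not in the cache fail; writes fail"),
    ("data store",    "fail-SLOW",   "request threads pile up waiting on the store"),
    ("auth service",  "fail-SILENT", "UNKNOWN - needs a game day to find out"),
]

# Anything marked UNKNOWN is a problem to investigate before it finds you first.
for dependency, mode, outcome in DEPENDENCIES:
    flag = "!!" if "UNKNOWN" in outcome else "  "
    print(f"{flag} {dependency:<12} {mode:<12} {outcome}")
```

The value is not the script itself but the UNKNOWN rows it forces you to confront.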

Notice the pattern? No matter what you do, you must fix the problems.

All of this takes time and effort away from customer-facing (revenue-generating) features, so there is a natural resistance to it. Further, people get used to high levels of risk. A lucky success today is treated as a guarantee tomorrow. Deviance is normalised. Running at or close to the edge lets you out-perform competitors, so there’s a strong signal that ‘everything is fine’ when things are actually rotten underneath.

Every business faces these cost and difficulty pressures. The only way to resist them is to make reliability a top-line priority, give people the time to invest in their systems’ reliability, and fix the problems they find.
