Cache is the Root of All Evil

Avoiding Pitfalls of Common Caching Techniques

Vova Galchenko
Box Tech Blog
6 min readSep 4, 2020

--

Illustrated by Jeremy Nguyen /Art directed by Sarah Kislak

The practice of caching is about as effective at lowering latencies and load as it is at introducing nasty correctness problems. It is almost a law of nature that once you introduce a denormalization, it’s a matter of time before it diverges from the source of truth. The transient nature of caches makes problems very difficult to debug and clouds the matter in an extra layer of mystery. All this is to say that if you can live with the performance and load without caching, for the love of everything that’s good in the world, don’t add it. In some cases though, your clients can’t stomach the long latencies and neither can your system of record take the load, so you strike a deal with The Caching Devil (what’d you think that “d” in memcached stood for).

At Box we’ve had our share of run-ins with the beast and to tame it we’ve relied on many strategies well-known in the industry as well as some tricks we’re happy to contribute to the community’s tool belt. Since caching is most commonly used to optimize latency and load in read-heavy environments, for the purposes of this post, we’ll avoid the write-through cache variations and focus on caches that are populated upon read.

At a high level, reads look the value up in cache before reading it from the system of record, if necessary. The cache is populated on cache misses. Writes are responsible for invalidating stale cache values.

As the computer science adage aptly proclaims, cache invalidation is the hard part. Figuring out which cache keys are rendered stale by a given system of record mutation is often not trivial. Although this can be very tedious, it is, however, relatively easily reproducible and testable. On the other hand, concurrency-related cache consistency problems are a lot more subtle. Readers experienced with distributed systems will notice a couple of such problems that can occur in the caching system described above:

  • In case of high-volume read traffic, a write (and thus a cache value invalidation) can lead to a thundering herd of readers storming the system of record to reload the value into cache.
  • A concurrent read and write can cause a stale value to be stored in cache indefinitely. Consider the following sequence of operations, for example:
This serialization of steps yields a persistently stale value in cache: the reader writes a value it read before the write alters the system of record and invalidates affected cache values.

The canonical solution to both of the above concurrency issues was introduced by the famous 2013 Facebook paper entitled “Scaling Memcache at Facebook”. The concept of “leases” is introduced as sort of a per-cache-key lock preventing thundering herds and stale sets. It relies on two common cache system operations:

  • atomic_add(key, value): set the provided value for key if and only if the key has not already been set. Otherwise, the operation is failed. In Memcached this is implemented as add, while in Redis –SETNX.
  • atomic_check_and_set(key, expected_value, new_value): set the new_value for the provided key if and only if the key is currently associated with expected_value. In Memcached this is implemented as cas. Unfortunately (and surprisingly), Redis doesn’t have a command with such semantics, but this functionality gap can be closed trivially by a simple Lua script.

With these concepts in mind, our read operation implementation can be amended as follows:

Read implementation amended for thundering herd and stale set protection.

This approach allows your cache to effectively shield the system of record from thundering herds. In case of a cache miss only one lucky request will successfully be able to add the lease and interact with the source of truth, while others will be relegated to polling the lease until the lucky request populates it with the calculated value.

This mechanism also protects us from the race condition described above. Cache poisoning occurs when the system of record is mutated and cache is invalidated between the time when a reader fetches the data from the source of truth and the time when they put it in the cache. This model will prevent the reader from poisoning the cache because their atomic check-and-set will fail in case a writer changed the record underneath them.

Although for the time being he is puzzled, unfortunately, the cache devil does have more tricks up its sleeve. Consider a use-case where your data is consistently read frequently, but parts of it also undergo periodic bursts of frequent writes:

Reader 1 experiences a ridiculous amount of latency as it waits in vain for Reader 3 and Reader 2 to populate the cache key of interest.

A pathological condition may take place where during stretches of numerous writes, readers end up taking turns acquiring leases and querying the system of record only to have their lease cleared by a write. This in effect serializes reads from the source of truth but doesn’t deduplicate them, which ultimately causes very high read latencies and timeouts as readers wait to get their turn to fetch the value they need from the source of truth.

We faced this problem within our distributed relational data service at Box and thought of a couple of solutions related to this. The approach we ended up going with drew upon the insight that any read that’s waiting on a lease can safely use the value retrieved by the reader that holds the lease, even if the lease ultimately ends up getting cleared by a writer and the final atomic_check_and_set fails. Indeed, if a reader encountered another reader’s lease, the reader must have arrived before the writer cleared the cache value and thus before the write was acknowledged, so both readers can return the value retrieved by the lease holder without sacrificing read-after-write consistency. To take advantage of this insight, in addition to performing the atomic_check_and_set to attempt to store the value computed from the source of truth into the cache, the reader who acquired the lease will also stash away the value in a different location in the cache that can be discovered by readers waiting on the lease.

Using a flowchart to illustrate the algorithm becomes complex and hard to read, but below is a code snippet that does it. The snippet is written in super procedural Java aimed at clarity of high-level approach with no attention given to error handling, compile-time safety, maintainability, etc.

The devil has been stumped by this approach for a while now, as we’ve been using variants of this algorithm for millions of requests per second for the distributed relational data tier at Box. Being some of the devil’s most loyal customers, we hope this overview of our dealings with the beast helps in your struggle for performance and consistency.

If you’re interesting in joining us, check out our open opportunities.

--

--

Vova Galchenko
Box Tech Blog

I’m responsible for Core Application Infrastructure at Box