Invalidating Caches and Reacting to updates using Streams

Most of us rely on (mostly in-memory) cache hierarchies for performance, reliability and sometimes because doing away with them means our application or service just doesn’t work at all.

Most of our applications use memcached and other home-grown caching systems, and we necessarily need to keep the state on those systems in sync with the state of the truth/origin, which almost always is the latest ‘value’ stored in a persistent store.

The straight-forward way to do this is to identify all instances where your code updates the persistent stores, and then purge data from the caching layers, and sometimes do something else — like updating another persistent store, but doing that would require purging another value from some cache, and so on, so forth.

Indeed, that’s how we traditionally approached the problem. We ‘d try to remember to always do whatever’s needed when a value would change in our persistent stores. It works, except when it doesn’t.

Reasons for this strategy to fail include:

  • Easy to forget to match code blocks that update the persistent store with whatever is appropriate — expiring cache values, etc. Even missing this once can cause great problems, and good luck chasing code paths if you want to figure out the cause of the problem later.
  • If an application crashes after the persistent store is updated, and before the expiration logic gets to run to completion, you really won’t know about that. Maybe the caches were synchronised, maybe not — maybe in part. This is rare, but it can happen, and when it does, you are in for a lot of hair pulling.
  • Even if you correctly identify all code paths, and your application never crashes, you ‘d still likely have to remember to update the expiration logic in all kinds of places in your code, to match changes you made to your application and how the data you persisted relate to other data and operations. This is just not particularly elegant, all things considered.

A few days ago, we open sourced Tank, our high-performance distributed log system, very similar to Kafka and other such brokers. We chose to build our own for all kinds of valid reasons(we think), although Kafka would definitely have worked for us if we have chosen to use it instead. 
We built it first and foremost so that we could implement an alternative expiration scheme, because we had enough with caching issues, although we are now using Tank for a lot of different problems(when you gain access to a new technology or infrastructure service, you think of solutions and new ideas that just never occurred to you in the past. This has always been the case for us, with every new ‘toy’ we built, or new OSS system we incorporate in our infrastructure).

This is how it works now.
Our main persistent store is CloudDS, a very high performance distributed storage system, similar to Cassandra. We also use mySQL to a lesser extent for when we need the ACID guarantees of an RDBMS, and CloudDS/S3 (an S3 semantics and protocol compliant distributed object store — literally, an implementation of AWS S3) for storing objects, including files (many millions of them currently, each file can span TBs in size).

They can now be configured to emit an event(in practice, log to Tank) whenever either a row for a monitored ColumnFamily is updated (we log (keyspace, column family, primary key) to Tank) for CloudDS, whenever a monitored bucket is updated (we log (object/file path, uploaded file size) to Tank) and soon we’ll do the same for mySQL by monitoring its binary log.
UPDATE: we now log the individual column names updated as well for CloudDS rows updates, because that allows for fine grained expirations.

So whenever anything, any application, service, utility, updates CloudDS, whenever anything uploads or deletes files in buckets we monitor on CloudDS/S3, we log that event to Tank.

An external application consumes updates events from Tank continuously and reacts to those events. It purges values from caches, updates other data on persistent stores, executes book-keeping operations, whatever’s needed. For some files uploaded on CloudDS/S3 we monitor, we expire them from our CDN/edge nodes(though in practice the URLs encode a modification timestamp so we only need to do this for a few files/images).

There is just one place where we need to invalidate caches. All invalidation/update code has been stripped from our applications and it’s always fun deleting code.

The invalidation logic always executes, even if the application crashes right after it has updates a persistent store, because the update event is emitted by the persistent stores themselves and persist on Tank, and the invalidation application which consumes from the topic/partition that holds those events will eventually execute the invalidation logic one way or another.

It’s particularly elegant, very simple, and very, very powerful. This is not a novel idea of course — others have been doing it for a long while now. Facebook (see their TAO service), Netflix, Google, and others, albeit everyone seems to be doing it somewhat differently. The brilliant folks at Confluent have also extensively talked about it and built systems that facilitate this sort of design.

All told, we are very pleased with how this turned out. If you are dealing with elaborate cache hierarchies, and even if you don’t, consider this alternative (use of streams and reactive programming ) to whatever you are doing now. It will be worth it, and there it doesn’t really require much effort anyway.

A contrived example based on our implementation of the application that consumes from Tank and reacts to those updates is available on Github.

Like what you read? Give Mark Papadakis a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.