LazyLRU: Laughing All the Way to Production

Mike Hurwitz
Bluecore Engineering
7 min read · Jun 2, 2021

Understanding our access patterns led Bluecore to consider a more efficient caching strategy. The whole thing started as a fun little joke, but it proved to be enough of an improvement that we ended up using it.


We’ve written previously about our recommendation-serving infrastructure. Near the bottom of that article, Alexa slipped in a line about a least recently used (LRU) cache sitting between our central product cache and the rest of our application. It’s a small note in the overall architecture, but its impact was huge.

Who among us hasn’t started calling something by a silly name to make their friends laugh, only to find themselves explaining that thing to completely unrelated people years later because it’s just something you say now? I had a friend in high school we called Mad Dog. I didn’t even know his real name for years. Well, this is a story like that.

Access Patterns Are Everything

Bluecore’s flagship product sends billions of personalized email messages with ecommerce products in them. Most of the time, the set of distinct ecommerce products referenced in a campaign of emails is orders of magnitude smaller than the number of emails being sent. That means that a fetched product is likely to be needed many times in the next few minutes, but may not be seen again for hours or days.

Products are constantly changing (prices, merchandising categorization and, most importantly, stock status), so holding them in memory for more than a short time would mean either risking stale data or forcing expiration of cached items. Even with a one-minute expiration, at a 30kHz request rate we would save 1.8 million fetches per minute from the off-box cache. Memorystore is fast, but it’s no match for local memory. With our previously stated 2KB average product, compressed 5:1, we are still saving nearly 12MB/s of transfer and decompression, and, more importantly, about 62MB/s of deserialization, allocation, garbage collection, and so on.

Because products are so large, the number we can keep in memory at one time is limited. Using our 2KB estimate, we’d be able to hold ~500K products/GB. That’s not enough space to hold every product for every customer, but more than sufficient storage for all concurrent campaigns.
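If you want to check that arithmetic, here is a quick back-of-the-envelope version of it in Go. The request rate, product size, compression ratio, and TTL are the figures quoted above; everything else is just multiplication.

```go
package main

import "fmt"

func main() {
	const (
		reqPerSec    = 30_000  // ~30kHz request rate
		productBytes = 2 << 10 // ~2KB average serialized product
		compression  = 5       // ~5:1 compression on the wire
		ttlSeconds   = 60      // one-minute expiration
		gigabyte     = 1 << 30
	)

	// Fetches avoided per TTL window if every repeat request hits the local cache.
	fmt.Println("fetches saved/min:", reqPerSec*ttlSeconds) // 1.8M

	// Transfer + decompression avoided (compressed bytes per second).
	fmt.Printf("transfer saved: %.1f MB/s\n",
		float64(reqPerSec*productBytes/compression)/1e6) // ~12.3 MB/s

	// Deserialization, allocation, and GC avoided (uncompressed bytes per second).
	fmt.Printf("deserialization saved: %.1f MB/s\n",
		float64(reqPerSec*productBytes)/1e6) // ~61.4 MB/s (the post rounds to 62)

	// Capacity: how many 2KB products fit in a gigabyte of local memory.
	fmt.Println("products/GB:", gigabyte/productBytes) // ~524K
}
```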

We have learned enough to write down a basic set of requirements (sketched as a Go interface just after this list):

  • Concurrent access, because this is going to run in a multi-threaded server
  • Old items should not be returned, and should preferably be evicted from memory, because we don’t have a way to keep them fresh forever
  • The cache should not exceed a predefined size, to ensure we don’t run out of memory
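For concreteness, that contract can be sketched as a small Go interface. This is just the shape of the requirements, not the actual LazyLRU API; the names are illustrative.

```go
// Package cache holds a sketch of the contract those requirements imply.
// The names are illustrative only; this is not the actual LazyLRU API.
package cache

import "time"

// Cache must be safe for concurrent use, must never return an expired
// item, and must never hold more than the maximum number of entries it
// was configured with (enforcing that cap is up to the implementation).
type Cache interface {
	// Get returns the value for key and true, or nil and false if the
	// key is missing or its TTL has passed.
	Get(key string) (interface{}, bool)

	// Set stores value under key with a time-to-live, evicting the
	// least recently used entries if the cache is already at capacity.
	Set(key string, value interface{}, ttl time.Duration)
}
```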

Two of our three requirements are about how we remove data from the cache, so let’s start there.

Cache Replacement Policies


The broad term for deciding which elements to remove is cache replacement. The first problem is that imperfect knowledge of what will be needed next guarantees that any algorithm must fall short of the ideal. The second is that the bookkeeping many algorithms require makes them impractical.

For us, LRU replacement, combined with the relatively short expiration of our elements, feels like a pretty good guess at a policy. Perhaps as importantly, it’s easy to reason about. If we’re wrong, we’ve got a chance to understand why.

The implementation of a “true” LRU involves carrying data in some kind of map (tree, hash table, etc.) as well as a queue of things to be deleted. When an item is accessed (written or read), it is moved to the tail of the queue; when it is time to delete items, they are selected from the head of the queue. But that comes with a cost: every read modifies that queue. If you are using a linked list, pointers will need to be rewritten. If you are using a heap, sift operations will be necessary to maintain the heap invariant. If you actually change anything, these two structures (map and queue) will need to be accessed in a thread-safe manner, which usually means locking.
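As a deliberately minimal illustration of that classic design, here is a sketch in Go using a hash map plus container/list. This isn’t LazyLRU or any of the libraries mentioned later; the point is that even Get has to take the exclusive lock, because every read rewrites list pointers.

```go
package cache

import (
	"container/list"
	"sync"
)

type entry struct {
	key   string
	value interface{}
}

// TrueLRU is a textbook LRU: a map for lookups and a doubly linked list
// as the recency queue (front = most recently used, back = next to evict).
type TrueLRU struct {
	mu    sync.Mutex
	max   int
	items map[string]*list.Element
	queue *list.List
}

func NewTrueLRU(max int) *TrueLRU {
	return &TrueLRU{
		max:   max,
		items: make(map[string]*list.Element, max),
		queue: list.New(),
	}
}

func (c *TrueLRU) Get(key string) (interface{}, bool) {
	c.mu.Lock() // even reads need the exclusive lock
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.queue.MoveToFront(el) // every read reorders the queue
	return el.Value.(*entry).value, true
}

func (c *TrueLRU) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.queue.MoveToFront(el)
		return
	}
	c.items[key] = c.queue.PushFront(&entry{key, value})
	if c.queue.Len() > c.max {
		oldest := c.queue.Back() // least recently used
		c.queue.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```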

Don’t Roll Your Own, Unless You’re Joking

When we decided that an in-memory cache was needed, the first thought was to just take something off the shelf. However, it was also a Friday night and I thought it would be fun to try to implement this myself. Somewhere between a joke and a toy for the weekend, the LazyLRU was born. These are the components of the LazyLRU:

  • A sync.RWMutex to control access
  • A map to hold the data and serve reads
  • A container/heap priority queue to track recency for LRU eviction
  • A message envelope with an expiration holding the underlying data
  • A “reaper” thread to find and remove expired items regularly

The “lazy” part comes from how it handles the queue. Since we can guess that items near the back of the queue are not at any real risk, we can just leave them where we found them and not worry about the minor misordering in that part of the queue. Also, since there isn’t a write to the queue, we can have a shared lock for those reads, only taking out the exclusive lock when we need to pull an item out of line. We guessed that allowing for 100,000 objects in the cache would be more than enough space, but not so much that we would run out of memory. Under normal circumstances, we’d have room to spare. The cache is instrumented so that we could tell if we guessed wrong and adjust later, though we never had to.
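Put as code, the read path looks roughly like the sketch below. To be clear, this is a reconstruction for illustration, not the published LazyLRU source; the struct fields, the “is this item at risk?” test, and the promote helper are my assumptions.

```go
package cache

import (
	"sync"
	"time"
)

// item is the envelope wrapped around each value. These fields are
// assumptions made for this sketch, not the real LazyLRU struct.
type item struct {
	key     string
	value   interface{}
	expires time.Time
	index   int // position in the recency heap
}

// LazySketch shows only the read path of a LazyLRU-style cache.
type LazySketch struct {
	mu       sync.RWMutex
	maxItems int
	items    map[string]*item
	queue    []*item // maintained elsewhere as a min-heap on recency
}

// Get takes only the shared read lock on the happy path. If the cache is
// mostly empty, or the item sits far from the head of the queue, it is in
// no danger of eviction, so we leave it where we found it.
func (c *LazySketch) Get(key string) (interface{}, bool) {
	c.mu.RLock()
	it, ok := c.items[key]
	if !ok || time.Now().After(it.expires) {
		c.mu.RUnlock()
		return nil, false // missing or expired counts as a miss
	}
	value := it.value
	atRisk := len(c.queue) > c.maxItems/4 && it.index < len(c.queue)/4
	c.mu.RUnlock()

	if atRisk {
		// Only now do we pay for the exclusive lock and the heap sift.
		c.mu.Lock()
		c.promote(it)
		c.mu.Unlock()
	}
	return value, true
}

// promote would bump the item's recency and heap.Fix the queue; omitted here.
func (c *LazySketch) promote(it *item) { _ = it }
```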

The Reaper

Keeping that room free means removing expired items from the cache. It is a fairly common practice to let old items age out by reaching the head of the queue, or to check an expiration on each item as it is read. In my case, that would mean using more memory than we need and creating unnecessarily long sift operations on the heap. When a LazyLRU is created, it starts a background “reaper” routine that runs regularly. The reaper picks a random spot in the queue and goes looking for expired items; remember, because of the lazy reordering on read, expired items won’t all be concentrated at the head of the queue. To keep reaper cycles as short as possible, it reads 100 items under a (non-exclusive) read lock, only taking the write lock at the end of a cycle if it has at least one item to kill. Why 100? No good reason. Remember, this whole thing was supposed to be a toy.
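Again as a hedged sketch, continuing the types from the previous snippet (the key field and the evict helper are assumptions), the reaper loop is something like this:

```go
// reap continues the sketch above (it also needs "math/rand" imported).
// It scans a bounded window of the queue under the shared lock and only
// takes the exclusive lock when it actually found something to remove.
func (c *LazySketch) reap(interval time.Duration, scanSize int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		c.mu.RLock()
		if len(c.queue) == 0 {
			c.mu.RUnlock()
			continue
		}
		// Start at a random spot: thanks to the lazy reordering, expired
		// items are scattered through the queue, not piled up at the head.
		start := rand.Intn(len(c.queue))
		now := time.Now()
		var doomed []string
		for i := start; i < start+scanSize && i < len(c.queue); i++ {
			if now.After(c.queue[i].expires) {
				doomed = append(doomed, c.queue[i].key)
			}
		}
		c.mu.RUnlock()

		// Only pay for the write lock if there is something to kill.
		if len(doomed) > 0 {
			c.mu.Lock()
			for _, key := range doomed {
				c.evict(key)
			}
			c.mu.Unlock()
		}
	}
}

// evict is an assumed helper; in a real implementation it would also
// repair the heap, which this sketch omits.
func (c *LazySketch) evict(key string) { delete(c.items, key) }
```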

Not Quite the Funniest Thing Ever

My weekend foray into caching turned out better than expected. In fact, I kind of liked it. None of the solutions I found online met all the requirements quite so well:

  • hashicorp/golang-lru is the most mature and includes multiple implementations, but it doesn’t support expiration
  • golang/groupcache/lru is a nice, clean implementation, but both expiration and locking would need to be managed externally
  • hnlq715/golang-lru looks very promising, but it didn’t exist at the time of our story (March 2020). As it is based on the Hashicorp implementation, I assume it has similar performance characteristics

So despite my better judgement, we went forward with integrating LazyLRU. I was able to prove that it was good enough, but not necessarily that it was good.

Proving It

Access patterns are everything, so what access patterns best fit this model? This cache was designed to operate at high concurrency but low contention. It should be reasonably efficient when full, but it runs best when the live data does not exceed 25% of its capacity; that is the point where the cache starts to track the LRU and we start paying for the heap. Groups of keys should be hot for a period, then go cold and expire from the cache.

Benchmarking is complicated and a whole other post could be written about getting a reasonable benchmark for this code. Instead, I’ll just give a few data points in summary. Note that hashicorp/golang-lru does not support expiration natively. I wrote a thin wrapper that caches value-expiration structs and checks the freshness when reading to give it expiration powers.
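That wrapper looked something like the sketch below. This is a reconstruction rather than the exact benchmark code, but it shows the idea: store the deadline next to the value in hashicorp/golang-lru and treat anything stale as a miss.

```go
package cache

import (
	"time"

	lru "github.com/hashicorp/golang-lru"
)

// expiringValue pairs a cached value with its deadline.
type expiringValue struct {
	value   interface{}
	expires time.Time
}

// ExpiringLRU wraps hashicorp/golang-lru to add per-item expiration.
type ExpiringLRU struct {
	inner *lru.Cache
	ttl   time.Duration
}

func NewExpiringLRU(size int, ttl time.Duration) (*ExpiringLRU, error) {
	inner, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &ExpiringLRU{inner: inner, ttl: ttl}, nil
}

func (c *ExpiringLRU) Set(key string, value interface{}) {
	c.inner.Add(key, expiringValue{value: value, expires: time.Now().Add(c.ttl)})
}

func (c *ExpiringLRU) Get(key string) (interface{}, bool) {
	raw, ok := c.inner.Get(key)
	if !ok {
		return nil, false
	}
	ev := raw.(expiringValue)
	if time.Now().After(ev.expires) {
		// Stale: evict eagerly and treat as a miss.
		c.inner.Remove(key)
		return nil, false
	}
	return ev.value, true
}
```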

Test “A” is the most representative of our real circumstances, and it is where LazyLRU comes out on top: LazyLRU wins when the access pattern is closest to our real-world use case. Test results from many other configurations can be found here.

I do want to call out that all the implementations are delivering hundreds of thousands of requests per second under all circumstances, which is more than good enough! I’ve read the hashicorp code and it is very well done — that is part of why I am using them as the standard in my benchmark. The purpose of this experiment was to see if a custom-tailored strategy could do even better.

Want to try it out? Get it on GitHub.

Interested in taking on these growth challenges? We are hiring several roles in R&D!
