Migrating Cache Databases

Trishul Nagenalli
Abnormal Security Engineering Blog
Jun 24, 2021

At Abnormal, we recently switched the storage layer for our caching infrastructure from Memcached to Redis. While migrating between the two, we learned a number of lessons we wanted to share.

Refresher — What is caching?

Caching is a technique used to improve the efficiency of a system. Sometimes, we have to repeatedly perform expensive operations, like querying a database or downloading a file. Without caching, we would have to query or download the data each time we needed it, even if the data was not changing. If we can be confident the data is not changing, and we need to access it a lot, we cache the data by saving it somewhere much cheaper to access. This pattern is used in several places across Abnormal’s threat detection system to optimize performance.
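For example, a minimal cache-aside sketch in Python might look like this (the names here are illustrative, not our production code, and query_database stands in for any expensive call):

# A minimal cache-aside sketch. `query_database` is a hypothetical expensive call.
_cache = {}  # stand-in for a real cache server like Memcached or Redis

def get_user_profile(user_id):
    key = f"user_profile:{user_id}"
    if key in _cache:
        return _cache[key]             # cheap: served from the cache
    profile = query_database(user_id)  # expensive: hits the database
    _cache[key] = profile              # remember it for the next caller
    return profile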

Caching at Abnormal

When we built our caching system, we stored data in an in-memory database called Memcached, but we have since adopted a second in-memory database technology called Redis for some use cases where Memcached doesn’t make sense (for example, key-value lookups for feature hydration).

We soon realized we could unify our entire technology stack around Redis by migrating the cache infrastructure from Memcached to Redis too. At Abnormal, we are constantly improving our systems to stay ahead of attackers. We specifically value velocity in our pace of development, and maintaining two in-memory technologies slowed us down.

Our Cache Client

Let’s start by taking a look at our cache client code. We built a cache library to make caching convenient and consistent for developers across many teams. At the library’s core is the CacheClient — a thin wrapper around an underlying pymemcache client. Here is a stub distilling the important parts.

class CacheClient:
    def __init__(self, cluster, env, namespace, **kwargs):
        pass

    def get(self, key):
        pass

    def set(self, key, value):
        pass

A few notes on the constructor (a usage example follows the list):

  • cluster identifies which Memcached server to connect to.
  • env identifies the client’s environment (prod, dev, etc.)
  • namespace identifies a single logical cache on the server. A given namespace stores only one kind of data. Using namespaces allows us to host multiple distinct caches on the same cluster.
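To make these parameters concrete, here is a hypothetical usage sketch (the argument values and keys are illustrative only):

# Hypothetical usage of the CacheClient; values are illustrative.
client = CacheClient(cluster="cache-cluster-1", env="prod", namespace="vendor_auth_tokens")

client.set("vendor_api:tenant_42", "token-abc123")
token = client.get("vendor_api:tenant_42")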

This client was written in the very early days of Abnormal. As the need arose, we wrote tooling on top of this wrapper.

  • When we needed to limit the number of connections we were making, we wrote a connection pooling client.
  • To make caching really easy, we wrote an opinionated decorator that caches the results of any Python function it is applied to (sketched below).
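As a rough sketch of what such a decorator could look like (the structure below is illustrative, not our actual library code):

import functools
import pickle

def cached(client):
    """Illustrative decorator: cache a function's return value keyed on its arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Build a cache key from the function name and its arguments.
            key = f"{func.__name__}:{args!r}:{sorted(kwargs.items())!r}"
            cached_value = client.get(key)
            if cached_value is not None:
                return pickle.loads(cached_value)
            result = func(*args, **kwargs)
            client.set(key, pickle.dumps(result))
            return result
        return wrapper
    return decorator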

By exposing this bare CacheClient and additional tooling, we gave consumers considerable flexibility on how they would do the following:

  • Manage connections to the Memcached server
  • Construct the key for a specific cache value
  • Serialize the underlying cache value

The Migration Client

The flexibility of the CacheClient aided speed of development in the early days, but it also presented some challenges when moving to Redis. Our existing CacheClient allowed consumers to configure the underlying connection to Memcached. Our Redis implementation had to be a drop-in replacement for Memcached, meaning it had to handle the same parameters and configuration that were originally passed to an underlying Memcached-specific client. Since the CacheClient is library code, it was in use by many teams all over the codebase. To roll out the Redis client, we had to be confident it was production ready on all namespaces.

To put aside any doubts about edge cases, we created a “migration client”. This migration client created a connection to a secondary cache in “dark mode” alongside the primary cache. This “dark” launch meant that we would read from and write to our secondary cache but ignore the results coming from it. Think of it as dress-rehearsing the secondary cache.

During the migration, a Memcached client served as the primary cache and a Redis client served as the secondary cache. We also added a feature flag to swap the two when we felt the Redis client was ready.
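A rough sketch of the wiring, assuming a simple boolean feature flag (the class and argument names here are hypothetical):

class MigrationCacheClient:
    """Illustrative sketch: a primary cache serves traffic, a 'dark' secondary shadows it."""

    def __init__(self, memcached_client, redis_client, redis_is_primary=False):
        # The feature flag decides which backend serves real traffic.
        if redis_is_primary:
            self.primary_cache, self.secondary_cache = redis_client, memcached_client
        else:
            self.primary_cache, self.secondary_cache = memcached_client, redis_client

    def set(self, key, value):
        # Write to both so the dark cache stays warm; only the primary's result matters.
        self.primary_cache.set(key, value)
        self.secondary_cache.set(key, value)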

To ensure we did not encounter performance degradation by adding more work to every read, we offloaded data verification to an asynchronous worker thread. The migration client’s get code looked something like this.

def _verify_equality(self, key, primary_value, secondary_cache):
    secondary_value = secondary_cache.get(key)
    if primary_value == secondary_value:
        record_comparison(match=True)
    else:
        record_comparison(match=False)

def get(self, key):
    primary_value = self.primary_cache.get(key)
    # Run _verify_equality later on an asynchronous worker thread
    self.async_queue.add(self._verify_equality, key, primary_value, self.secondary_cache)
    return primary_value

Trouble

After migrating all but one namespace to Redis using the migration client, we ran into trouble. Running the migration client on the last namespace inadvertently caused a connection leak to build up overnight. When we hit our connection limit, the cache refused new connections and took down several services that relied on its availability. When we diagnosed the problem, we rolled back the change that added the migration client to the last namespace. We did not, however, fully understand why this particular namespace had created a connection leak. Let’s dive into the investigation.

Connections

Our first step was to understand where the connections were coming from. Our migration client was built so that no additional connections were made to Memcached. Yet, the graph of connections looked something like this.

Figure 1: Connections to Memcached. Before the drop, clients ran the migration client; after the drop, clients ran the plain Memcached client.

We were leaking 12,000 connections an hour on the migration client (ouch)! Once we reverted the change (at the drop), we were getting by with just a few hundred.

This last namespace cached tokens we used to authenticate with third-party services. Whenever we needed to fetch data from a third-party API, we would retrieve a cached authentication token from Memcached instead of re-authenticating. As we dove deeper, we realized that this service did not pool connections between requests for third-party data. Each time we needed to access a third-party service, a new connection to Memcached was created and destroyed.

In a connection pool, connections to Memcached would be shared by a certain group of consumers. When a consumer needed to talk to Memcached, it would borrow a connection from the pool, do its work, and return the connection to the pool. The size of the pool, and therefore the number of concurrent live connections, should be about the number of threads that need to access the Memcached server in parallel. All the other namespaces we had migrated used connection pooling — but this one did not. When a consumer needed to talk to Memcached, it would create and destroy its own dedicated connection.
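In rough pseudocode, pooled access looks something like this (a minimal sketch, not our actual pooling client; connect_to_memcached is a hypothetical factory):

import queue

class ConnectionPool:
    """Minimal illustrative pool: consumers borrow connections and return them."""

    def __init__(self, create_connection, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(create_connection())

    def borrow(self):
        return self._pool.get()   # blocks if every connection is already in use

    def release(self, conn):
        self._pool.put(conn)

# Usage: the number of live connections stays bounded by the pool size.
# pool = ConnectionPool(lambda: connect_to_memcached(host), size=8)
# conn = pool.borrow()
# try:
#     value = conn.get(key)
# finally:
#     pool.release(conn)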

We next discovered that the code to close the connection was missing. The connections were probably closed by the Python garbage collector when the request was complete. This code was written near the founding of the company and had been in production for over 2 years, but this potential issue had never manifested itself as a problem — until this migration.

Connections and Concurrency

When we inserted our migration client, the lifecycle of the Memcached Client changed significantly for this namespace.

Figure 2: connection lifecycle with Memcached client

Originally, the connection needed to be kept alive only for the time required to fetch the token from Memcached. Since this is an in-memory cache boasting sub-millisecond latency, this is not long at all. We assume garbage collection (GC) closed the connection quickly after this small amount of work was done.

Figure 3: connection lifecycle with Migration client

Once we switched to the migration client, the Memcached connection was kept alive after the read and placed on a queue to perform the asynchronous verification work. Most of those tasks were processed within tens of milliseconds, so this extra work kept the connection alive an order of magnitude longer. The SLA for these tasks was 10 seconds, so in the worst case we could keep a connection alive for several orders of magnitude longer! Placing the connection on the queue substantially raised the lower bound on the number of connections we needed in parallel.

Our data suggested, however, that the above explanation was incomplete. The extra time spent in the worker queue did not seem to add enough delay to explain the extraordinary leak we had. To fully understand what happened, we would need to understand how GC closed our connections. Adding multi-threading to this environment likely made it more difficult for GC to figure out when the connection was no longer in use. Our data suggested that some connections were indeed closing but we didn’t know exactly when or how.

We did not spend much time investigating how GC closed connections, however. Regardless of how it worked, our fix was to manage the connections ourselves, never relying on GC. We have finite resources to build and scale all aspects of our product. While looking into GC would be a very interesting investigation, we didn’t deem it a priority.

Lessons Learned

At Abnormal, we post-mortem every incident so we can constantly improve our own abilities and prevent similar mistakes. In our post-mortems, we try to go deeper than the surface level solution. Here, for example, a first-level solution is to close our connections. We must of course do that. We should also, however, ask if there is any deeper lesson to be learned. Was there something we could have done to avoid this problem altogether?

Always use connection pools. Libraries should be opinionated on shared resource management.

If our cache client library had pooled connections from the beginning, we wouldn’t have to worry too much about developers forgetting to close connections themselves later. The unopinionated flexibility we offered consumers of the cache library may have created more problems than benefits. It would have been better to write more strongly opinionated code around connections and offer escape hatches if consumers intentionally wanted to break from the default opinions of the library.

The redis-py library is written in this way. Each Redis client automatically creates a connection pool so consumers don’t have to worry about managing connections. The recommended usage is to instantiate a single Redis client per connection URL.
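For example, with redis-py a single shared client manages its own connection pool under the hood (the URL below is a placeholder):

import redis

# One client per connection URL; it owns an internal connection pool,
# so individual get/set calls borrow and return connections automatically.
client = redis.Redis.from_url("redis://my-redis-host:6379/0")

client.set("example_key", "example_value")
value = client.get("example_key")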

Thorough production readiness reviews are worth it.

It took a while to diagnose this issue because we had not set up an alarm for connection counts to Memcached. When using a new technology, it’s important to conduct a thorough production-readiness review. Readiness reviews uncover potential bottlenecks or points of failure in the underlying system that need to be monitored.

When we set up this Memcached instance, we were a very young company with light traffic. We were not close to any resource limit on this managed service. Moreover, this was cache infrastructure that we did not consider critical. We overlooked setting up an alarm back then. It would have been helpful if we had spent more time on the readiness review when we were young, or had later come back and conducted a more thorough examination of the cache system.

Don’t share live connections across threads — share information about the connection

Our migration client passed the connection object between threads via a queue. This split the responsibility for opening and closing a connection between different threads, adding unnecessary complexity. The connection was also kept alive while not in use. Our Memcached server could only support a finite number of connections, and keeping idle connections alive without reason pushed us toward that limit.

Instead of passing the connection object itself, we should have passed the information required to recreate the connection. The worker pool could then open and close a connection for only as long as it needed. This would also allow live production code to use a distinct connection pool from the asynchronous workers.
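A sketch of that approach, where the worker receives connection parameters rather than a live connection (the pooled-client helper here is hypothetical):

def _verify_equality(self, key, primary_value, secondary_cache_params):
    # The worker builds (or reuses) its own pooled client from plain parameters,
    # instead of holding a connection object that was opened on the request path.
    secondary_cache = get_or_create_pooled_client(**secondary_cache_params)  # hypothetical helper
    secondary_value = secondary_cache.get(key)
    record_comparison(match=(primary_value == secondary_value))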

We’ve got lots of interesting challenges like this ahead of us. With a rapidly growing feature set and customer base, we are trying to scale in every direction and learn a lot along the way. If you’d like to join us on our journey, yes, we are hiring!
