Taming ElastiCache with Auto-discovery at Scale

JunYoung Kwak | Tinder Tech Blog | Jun 21, 2019

Authors: Will Youngs (Software Engineer, Identity), Andy Phan (Software Engineer, Cloud Infrastructure), Jun-young Kwak (Senior Engineering Manager, Identity)

Data Caching at Tinder

Our backend infrastructure at Tinder relies on Redis-based caching to fulfill the requests generated by more than 2 billion daily uses of the Swipe® feature, and it hosts more than 30 billion matches across 190 countries globally. Most of our data operations are reads, which motivates the general data flow architecture of our backend microservices (see Figure 1).

Figure 1 Cache-aside Approach: A backend service checks for requested data in a Redis cache before falling back to a source-of-truth persistent database store (Dynamo, Postgres, Mongo, etc.) and then upserts the value to Redis on a cache miss.
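As a concrete illustration, a cache-aside read in one of our Java services might look roughly like the sketch below. The UserStore abstraction, key format, serialization, and TTL are assumptions for illustration, not our actual code.

```java
import java.util.Optional;
import redis.clients.jedis.JedisCluster;

/** Illustrative cache-aside read path; the UserStore abstraction, key format, and TTL are assumptions. */
public class CachedUserRepository {

    private static final int TTL_SECONDS = 3600;

    private final JedisCluster cache;          // Redis cluster client
    private final UserStore sourceOfTruth;     // wraps Dynamo/Postgres/Mongo, etc.

    public CachedUserRepository(JedisCluster cache, UserStore sourceOfTruth) {
        this.cache = cache;
        this.sourceOfTruth = sourceOfTruth;
    }

    public Optional<String> getUserJson(String userId) {
        String key = "user:" + userId;
        String cached = cache.get(key);                                // 1. check Redis first
        if (cached != null) {
            return Optional.of(cached);                                // cache hit
        }
        Optional<String> fromDb = sourceOfTruth.findJsonById(userId);  // 2. fall back to the source of truth
        fromDb.ifPresent(json -> cache.setex(key, TTL_SECONDS, json)); // 3. upsert into Redis on a cache miss
        return fromDb;
    }

    /** Minimal stand-in for the persistent store, assumed for this sketch. */
    public interface UserStore {
        Optional<String> findJsonById(String userId);
    }
}
```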

In the early days, Tinder ran Redis servers on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances. Our applications connected to these nodes via clients that implemented sharding by locally hashing keys to Redis instances (see Figure 2). This solution worked initially, but as Tinder’s popularity and request traffic grew, so did the number of EC2-hosted Redis instances and the administrative overhead of maintaining them. The burden of 1) changing Redis cluster topology (e.g., scaling our clusters) and 2) orchestrating failovers sapped the time and resources of our engineers and led us to look for alternatives.

Figure 2 Sharded Redis Configuration on EC2: We maintained a fixed Redis topology (including the number of shards, number of slaves, and instance size), and services accessed the cached data through a sharded Redis client that implemented this configuration.
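The client-side sharding itself amounted to hashing each key locally and routing it to a fixed Redis instance. A minimal sketch of that idea follows; the hash function and pool wiring here are assumptions, not our exact implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;
import redis.clients.jedis.JedisPool;

/** Sketch of client-side sharding across EC2-hosted Redis instances; details are illustrative. */
public class ShardedRedisRouter {

    private final List<JedisPool> shards; // one connection pool per EC2-hosted Redis instance

    public ShardedRedisRouter(List<JedisPool> shards) {
        this.shards = shards;
    }

    /** Hash the key locally and route it to a fixed shard. */
    public JedisPool shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return shards.get((int) (crc.getValue() % shards.size()));
    }
}
```

Because the number of shards and the instance mapping are fixed in this kind of configuration, any topology change (adding a shard, replacing a failed node) required coordinated client and deployment changes, which is exactly the administrative burden described above.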

After exploring options for more scalable data caching, we settled on AWS’s ElastiCache (https://aws.amazon.com/elasticache/). Cluster-enabled Redis (Redis engine version 3.2.10) on ElastiCache allows us to horizontally scale with great ease. To scale up a cluster, all we have to do is initiate a scaling event from the AWS developer console, and AWS takes care of data replication across any additional nodes as well as shard rebalancing for us.

Finally, AWS handles node maintenance during planned maintenance events with limited downtime and no intervention on our end. Excited to offload the infrastructural burden of Redis maintenance, we began migrating our backend services from EC2-hosted Redis instances to ElastiCache, and all seemed to be well until we noticed issues specific to some of our core Java applications directly downstream of ElastiCache.

Initial Java ElastiCache Client Implementation — Issues and Limitations

We researched ElastiCache extensively during the development of our original Java-based Redis client; however, we did not find an AWS-recommended Java library for Redis, or a widely adopted convention for managing connections to cluster-enabled ElastiCache¹. Consequently, our first attempt was somewhat naive and did not take full advantage of the benefits ElastiCache offers.

In its first iteration, our Java-based Redis client used jedis (https://github.com/xetorthio/jedis), the most popular Java Redis library, to connect to our clusters via their primary endpoints. This simple solution was functional and allowed us to replace our legacy EC2-hosted Redis nodes, but we noticed connectivity issues during our clusters’ first AWS maintenance events.
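Conceptually, that first client looked something like the sketch below: a jedis connection pool pointed at a cluster endpoint resolved once at startup and used for both reads and writes. The hostname is a placeholder and the exact wiring is an assumption.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

/** Rough shape of the first-iteration client; the endpoint and key usage are hypothetical. */
public class SimpleElastiCacheClient {

    // Placeholder for the ElastiCache endpoint resolved at application startup.
    private final JedisPool pool =
            new JedisPool(new JedisPoolConfig(), "my-cluster.example.cache.amazonaws.com", 6379);

    public String get(String key) {
        try (Jedis jedis = pool.getResource()) {
            return jedis.get(key);
        }
    }

    public void setex(String key, int ttlSeconds, String value) {
        try (Jedis jedis = pool.getResource()) {
            jedis.setex(key, ttlSeconds, value);
        }
    }
}
```

The important point is that the pool has no notion of the cluster's changing topology, which is what made failovers painful, as described next.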

A cluster maintenance event entails the patching of an underlying node, rendering it unusable for a short period of time. If a master node is affected, the maintenance event also involves failing over to one of its slaves, which is promoted to become the new master (see Figure 3).

Figure 3 Description of Redis cluster failover of a primary node: In the event of a maintenance event on an ElastiCache cluster’s master node, one of the master’s slave nodes gets promoted to become the new master and the old master gets demoted to become a new slave before getting patched or replaced by a new replica. (Source: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html).

Our jedis-based client seemed unable to appropriately respond to the topological swap of a master and slave node. We suspected that after a failover, our clients were still sending read and write requests to the old master node, even though it was inaccessible due to maintenance; if this was true, this misconfiguration following a failover could be causing the timeouts and exceptions that were breaking our backend services.

In addition to failover intolerance, our first Java Redis client generated inefficient CPU usage patterns across the nodes of our ElastiCache clusters, which was concerning. During maintenance events, the CPU utilization on our cluster’s slave nodes spiked, but at all other times it was at almost zero.

We reasoned that maintenance events would increase the CPU utilization across our cluster’s slave nodes as they resynced data from their masters. However, the lack of activity at all other times suggested that we were maintaining our clusters’ slave nodes for essentially no other use besides failovers.

Though our initial Java Redis client provided the core CRUD functionality necessary to replace our legacy EC2-hosted Redis nodes, it failed to take full advantage of all that ElastiCache offered us in terms of failover tolerance and node utilization.

ElastiCache should have allowed us to add additional shards and replicas to our cluster without app downtime. Instead, our client seemed to form static connections to the cluster at application startup and it was unable to dynamically handle changes to the topology of the underlying cluster. Any previously connected backend services would lose their ability to communicate with the connected ElastiCache cluster when it underwent a maintenance event. Our developers would then have to manually intervene, and redeploy the affected services to restore functionality.

Figure 4 Impact of Cluster maintenance event on production cache read traffic: Impact of an ElastiCache maintenance event causing an app outage. Fortunately, quick developer intervention and redeployment of the affected services kept the duration of the production outage and service degradation relatively short.

In the worst case, unplanned maintenance events (due to hardware malfunction etc.) could break connected services in the middle of the night and full recovery could take hours (see Figure 4). Our backend infrastructure critically relies on Redis-based data caching — so much so, in fact, that our failure to dynamically process the cluster failover of AWS maintenance events was at one point the leading cause of app-wide outages at Tinder.

Testing Infrastructure Setup

The first step we took to address the failover intolerance of our backend services was to replicate a production failover as closely as possible in a staging test environment. We wanted to be confident that any fixes we tested would behave similarly in production, so we set up an ElastiCache cluster with the same parameter settings, cluster size, shards, and replicas as our production cluster. Next, we set up a mock service that was configured similarly to our production services and simply wrote to and read from ElastiCache. Finally, we set up Prometheus (https://prometheus.io/) metrics and logging for our mock service to help expose connection-based issues and error rates.
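For reference, the metrics side of such a mock service can be as simple as a counter exposed over HTTP for Prometheus to scrape. The metric name, label, and port below are assumptions for illustration, not our actual configuration.

```java
import java.io.IOException;
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

/** Hypothetical metrics wiring for the mock service; metric name, label, and port are assumptions. */
public final class MockServiceMetrics {

    // Counts connection-related exceptions observed while talking to ElastiCache, by operation.
    static final Counter CACHE_ERRORS = Counter.build()
            .name("mock_service_elasticache_errors_total")
            .help("Connection-related exceptions observed while calling ElastiCache.")
            .labelNames("operation")
            .register();

    public static void main(String[] args) throws IOException {
        new HTTPServer(9090); // exposes /metrics for Prometheus to scrape
    }
}
```

The mock service's read and write paths then increment the counter in their catch blocks (for example, MockServiceMetrics.CACHE_ERRORS.labels("get").inc()), which makes connection error rates visible during simulated failovers.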

Analyzing the Root Problem

Before working on any fixes, we wanted to first reproduce our production issues in our staging environment. We load tested our old client at approximately the scale of peak production traffic using jmeter (https://jmeter.apache.org/) and initiated failovers via the AWS console. Despite our best efforts to replicate production connectivity issues using our original Redis client (with comparable traffic, amount of data, and cluster size), we could only duplicate some of the connection-based exceptions we saw during production failovers.

Again, the core issue behind our Redis client’s failover intolerance seemed to be an inability to adjust its cluster view to reflect the node IP changes that occur when a master and slave node swap roles. Essentially, we believed that our services were still trying to communicate with the old master node after it had been demoted.

To investigate, we analyzed the destination IP address of our mock service’s outgoing traffic to ElastiCache via tcpdump within one of the service’s Kubernetes pods.

Our service was able to establish a connection with ElastiCache following the failover, but tcpdump revealed that it was indeed still sending write requests to the old master’s IP address. This failure to update the client’s view of the cluster indicated that our service was treating the new slave node as if it was still the master (sending it write traffic even though slave nodes cannot accept it). This confirmed our suspicions regarding our initial client’s underlying problem and we set out to fix it.

A New Client

To address the issues we faced in our Java Redis client’s initial implementation, we set out to create a new one. This new client needed to maintain a local and dynamic view of the cluster that would refresh in response to cluster topology changes. Further, we wanted the new client to more fully utilize our clusters’ resources by performing read operations on cluster slave nodes. Finally, we set out to create a modular, robust, and extensible client that could be easily extended for use across all of Tinder’s Java-based backend services.

After some initial investigation, we found that jedis has no built-in mechanism for routing reads to replica nodes, so to enforce slave reads from our clusters we would have to explicitly maintain and dynamically manage a local topological view ourselves. Such a task would take significant development time, but could result in a performant client capable of reading from slave nodes with a self-contained, locally explicit view of the cluster.

After poring over the logs of several production maintenance events, we developed concerns regarding jedis’ ability to connect to an ElastiCache cluster directly following a failover (using jedis version 2.9.0). Application pods created shortly after a failover would attempt to establish a connection to our clusters via the instantiation of a jedis connection pool, but would fail and return with an exception: LOADING Redis is loading the dataset in memory. Jedis was unable to establish a connection to the cluster as the old master (new slave) synced with the new master, and loaded its data into memory.

Frankly, after this investigation, we were not entirely sure how difficult it would be, or how long it would take, to create a failover-safe client in jedis. In light of this uncertainty, as well as a large and imminent batch of AWS maintenance events, we shifted our focus to Lettuce (https://github.com/lettuce-io/lettuce-core), a higher-level Redis library with greater computational requirements but a self-contained topological view of the cluster.

We wrote a new Java-based clustered Redis client to handle the business logic of orchestrating Lettuce’s exposed functionality for the purpose of failover resilience. The Lettuce Java library exports a function to refresh its internal topological view, so we listened for general connection-related exceptions from our ElastiCache cluster and invoked this function in a rate-limited fashion.
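A minimal sketch of that orchestration, assuming Lettuce's RedisClusterClient and its reloadPartitions() topology refresh, might look like the following. The wrapper class, threshold, and wiring are assumptions, not our production code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import io.lettuce.core.RedisConnectionException;
import io.lettuce.core.cluster.RedisClusterClient;

/** Hypothetical wrapper illustrating a rate-limited topology refresh on connection errors. */
public class FailoverAwareClusterClient {

    private static final long MIN_REFRESH_INTERVAL_MS = TimeUnit.SECONDS.toMillis(30);

    private final RedisClusterClient clusterClient;
    private final AtomicLong lastRefreshMs = new AtomicLong(0);

    public FailoverAwareClusterClient(RedisClusterClient clusterClient) {
        this.clusterClient = clusterClient;
    }

    /** Invoked from the client's catch blocks when a connection-related exception surfaces. */
    public void onConnectionException(RedisConnectionException e) {
        long now = System.currentTimeMillis();
        long last = lastRefreshMs.get();
        // Rate limit: at most one forced topology reload per interval, even if exceptions flood in.
        if (now - last >= MIN_REFRESH_INTERVAL_MS && lastRefreshMs.compareAndSet(last, now)) {
            clusterClient.reloadPartitions(); // re-reads the cluster topology and refreshes the local view
        }
    }
}
```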

It was our hope that this client could adjust and reconnect to ElastiCache whenever the cluster configuration suddenly changed due to a maintenance event or the addition of a new replica or shard. Conveniently, the Lettuce Java library also offers a set of intuitive and configurable slave-read behaviors, which can simply be set as an option during initialization. We hoped we could use this setting to make more efficient use of our clusters’ resources.
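Putting those pieces together, initializing such a Lettuce connection might look like the sketch below. The refresh period, port, and the choice of SLAVE_PREFERRED (renamed REPLICA_PREFERRED in later Lettuce releases) are assumptions rather than our exact production settings.

```java
import java.time.Duration;

import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

/** Illustrative factory for a failover-tolerant Lettuce cluster connection. */
public final class LettuceClusterFactory {

    public static StatefulRedisClusterConnection<String, String> connect(String clusterEndpoint) {
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enableAllAdaptiveRefreshTriggers()            // refresh on redirects, reconnect attempts, etc.
                .enablePeriodicRefresh(Duration.ofMinutes(10)) // plus a periodic background refresh
                .build();

        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.Builder.redis(clusterEndpoint, 6379).build());
        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        StatefulRedisClusterConnection<String, String> connection = client.connect();
        connection.setReadFrom(ReadFrom.SLAVE_PREFERRED);      // prefer replicas for reads, fall back to master
        return connection;
    }
}
```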

Results

We tested our new Lettuce-based client in the staging environment the same way we analyzed the problems with our original client — we replicated production traffic levels with jmeter and triggered failovers via the AWS console. Exceptions began to flood the logs of our staging environment as the failover commenced, but unlike before, these exceptions dissipated within two or three minutes. Further, tcpdump from within our service’s pods revealed that the destination IP of write requests to our ElastiCache cluster actually changed to reflect the new state of the cluster post-failover. Encouraged by what we observed in our staging environment, we took our solution to production.

Our new Lettuce-based client was deployed with our production service right before a scheduled maintenance event. When the event happened, we momentarily saw some familiar connection-based exceptions and feared the worst, but they soon abated. Prometheus metrics revealed that our rate-limited topological refresh of the Lettuce library had been triggered, just as we had hoped. The new client seemed to have fixed our connection-based issues, but we noticed our service’s thread pool was still exhausted after upstream requests overwhelmed our service’s pods.

We let this situation ride until it became apparent that these exceptions were not resolving quickly, then increased the number of Kubernetes pods in our service’s deployment, which swiftly resolved the issue. We also adjusted our horizontal pod autoscaler settings to allow for more pods, and thereby more available threads, going forward.

Consequently, during our next maintenance event on our production ElastiCache Cluster, we achieved our first true auto-failover with zero developer intervention.

In short, we were delighted with our results. Our new Redis client had processed an ElastiCache maintenance event smoothly with virtually no downtime. Furthermore, read operations on our ElastiCache clusters were being shared between slave nodes, which decreased latency on calls to our backend services and reduced the computational load on the cluster’s master nodes (See Figures 5 & 6).

Figure 5 Shared Computational Load Across ElastiCache Slave Nodes on one of our production ElastiCache clusters: The slave nodes in our cluster now share the load of processing read operations (instead of sitting idle as they did before).

With the new Lettuce-based client, our backend service’s Kubernetes pods required around 10% more computational resources during peak hours than those using the old jedis-based client. We considered this increase acceptable, and quite reasonable in light of the tangible benefits the new client provided.

Takeaways

Our new Java Redis client has completely resolved our previous issues processing ElastiCache maintenance events. Maintenance events, which used to be the greatest cause of app downtime at Tinder, can now take place during the middle of the night without setting off alerts or waking up whoever is on call (see Figure 6). Even the addition of replicas or shards on the fly has not introduced issues; our client momentarily resets its view of the cluster and our services automatically recover.

Figure 6 Uninterrupted ElastiCache Operations: The smooth daily ebb and flow of our backend traffic. Our services have had no persistent issues communicating with our ElastiCache clusters during maintenance events (or at all) since we adopted our new Redis client.

The reception from other developers at Tinder has been overwhelmingly positive and we have since onboarded this new client across our critical backend services.

Going forward, we will consider creating an open-source, Lettuce-based Java Redis client that incorporates all of our learnings from this process and addresses the issues we faced integrating cluster-enabled ElastiCache Redis into our backend infrastructure.

¹AWS released an official Java memcached client that supports automatic discovery of cache nodes as they are added to or removed from an Amazon ElastiCache cluster, but there is currently no AWS-recommended solution for auto-discovery with Redis. (https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/ClientConfig.AutoDiscovery.html)
