Redis solved our high-volume caching challenges — with a little creativity
By Karthikeyan Ramar + Nelson Chu, Software Developers at Expedia
The hotel inventory data caching team at Expedia was tasked with building an instant caching solution that was scalable, reliable and highly available, both in the cloud and in the data center. This ensures travelers receive up-to-the-second pricing and inventory updates, with no surprises throughout their booking experience. We were excited about the work and the customer impact it would have. This article walks you through the obstacles we overcame, the limitations that guided our decisions, and how the solution shaped up.
Specifically, this article details the solution to the challenges listed below.
- Design a resilient and highly available data store solution using Redis
- Replicate data from cloud to data center
- Execute a data replication strategy for a massive, scalable fleet of Redis servers (currently 1,500)
We wanted it fast. We wanted it blazing fast.
We needed our solution to be resilient, scalable and speedy. Our target was to get data from suppliers to end users within a few seconds. To be more precise, we needed to collect the inventory updates hoteliers provide, store them in the master database, capture the changes, transform the normalized data model into a search-friendly one, cache it, and serve it to the pricing engine as fast as possible. This latency is referred to as cache latency throughout this article. We were on the lookout for real-time streaming and data store solutions that could handle close to 400 million hotel inventory updates per day.
Every component in our system had to enable us to deliver cache latency of a few seconds, in addition to meeting all other requirements, including scalability, reliability, resiliency, ease of operations and optimized costs.
This diagram represents a simplified version of our architecture. Below the diagram, we’ve included more details about our use of Redis and Kinesis.
Kinesis Data Queue
For queuing the data to process, both Kafka and Kinesis met our read/write throughput requirements. Kinesis was chosen for a few specific reasons:
- Since it is a managed AWS service, no maintenance operations were needed.
- Kafka needs a ZooKeeper service to store record offsets.
- Cost-wise, 90GB of data stored across 8 Kafka brokers was almost equivalent to 60 Kinesis shards managing the same load; 60 shards were what we needed to meet the application's read and write throughput.
- The Kinesis Client Library requires the client to maintain the queue offsets somewhere.
- Kinesis throughput is lower than Kafka's, but sufficient for our needs.
We just had to figure out how many shards were needed to achieve the desired read throughput.
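Shard sizing is mostly arithmetic against Kinesis's published per-shard limits (1 MB/s and 1,000 records/s on the write side). A minimal sketch of that calculation is below; the average record size and peak-to-average factor are illustrative assumptions, not figures from our system:

```java
// Back-of-the-envelope Kinesis shard sizing.
// The per-shard write limits (1 MB/s, 1,000 records/s) are AWS-documented;
// the record size and peak factor passed in main() are assumptions.
public class ShardSizing {
    public static int shardsNeeded(long recordsPerDay, int avgRecordBytes, double peakFactor) {
        double recordsPerSec = recordsPerDay / 86_400.0 * peakFactor;
        double writeMBps = recordsPerSec * avgRecordBytes / 1_000_000.0;
        int byThroughput = (int) Math.ceil(writeMBps / 1.0);      // 1 MB/s per shard
        int byRecordRate = (int) Math.ceil(recordsPerSec / 1_000.0); // 1,000 records/s per shard
        return Math.max(byThroughput, byRecordRate);
    }

    public static void main(String[] args) {
        // ~400M updates/day, assuming 2 KB records and a 3x peak-to-average ratio
        System.out.println(shardsNeeded(400_000_000L, 2_000, 3.0) + " shards (write side)");
    }
}
```

The read side then gets the same treatment against the 2 MB/s per-shard read limit (multiplied by the number of consumers), which is what ultimately drove our shard count.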
Data Processing (Caching Service)
We decided to stick to a simple Java main application to read and process every message from the queue and write it to a data store. Five instances of the app can process approximately 1 billion records in an hour during cold bootstrap.
Hosting our Redis Servers
After trying out some other caching solutions, we decided to see what we could get out of Redis. Redis is touted as a fast, open source, in-memory key-value data structure store. We ran a proof-of-concept hosting a couple of Redis instances for caching hotel information and the results were promising.
As you see in the diagram, two data stores are used: the internal data store uses Redis in cluster mode, and the external data store uses a fleet of Redis instances in standalone mode.
Let's dive deep into how the external data store is built with Redis in standalone mode.
Why Redis in Standalone Mode
- Lower latency and higher throughput — Redis in standalone mode supports pipelining, which further improves read latency and throughput.
- Standalone mode gives us more flexibility and control when replicating at massive scale.
The external data store must replicate data from AWS to our data center. We were bummed to find out that ElastiCache does not support replication out of AWS, which forced us to host our own Redis servers and experiment. We began by hosting Redis nodes in AWS to manage close to 350GB of data, with room for future growth. Since Redis is single-threaded, computing power is scaled by adding more instances as needed. This posed many challenges:
- How do we monitor the health of the system?
- How do we handle failover?
- How do we restore the state of the system and its data?
- And how do we do all of the above perfectly in the middle of the night?
We knew we needed automation in a big way if we wanted to keep inventory and pricing data highly available, as well as handle failure recovery super effectively. Since failures can happen anytime, we needed our solution to reduce manual work (people make mistakes) and minimize our need to have people on-call.
CloudFormation — just the way we like it!
Redis in standalone mode is used for the external data store. In standalone mode, the client must implement the data sharding algorithm for reads and writes. Based on data size and latency requirements, we zeroed in on 24 shards. How we automated system monitoring and recovery is detailed below.
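Client-side sharding in standalone mode boils down to mapping each key deterministically to one of the 24 shards. The article doesn't name the hash function used, so the sketch below illustrates the idea with a stable hash of the key modulo the shard count; the key format and CNAME scheme are hypothetical:

```java
// Hypothetical client-side sharding for 24 standalone Redis shards.
// String.hashCode() is specified by the JLS, so it is stable across JVMs;
// Math.floorMod keeps the result non-negative for negative hash codes.
public class KeySharder {
    static final int SHARDS = 24;

    static int shardFor(String key) {
        return Math.floorMod(key.hashCode(), SHARDS);
    }

    public static void main(String[] args) {
        String key = "hotel:12345:inventory"; // illustrative key format
        int shard = shardFor(key);
        // Each shard maps to a master CNAME, e.g. redis-shard-<n>.example.com (hypothetical)
        System.out.println("key " + key + " -> shard " + shard);
    }
}
```

Whatever function is chosen, it must stay fixed for the life of the deployment: changing it (or the shard count) remaps keys and effectively invalidates the cache.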
A CloudFormation template was created to spin up 48 standalone Redis instances: 24 masters and 24 slaves. The template also deploys a few "single responsibility" scripts alongside Redis to automate monitoring and recovery. The Auto Scaling group (ASG) invokes most of these scripts in response to events:
- The CloudFormation template creates a DynamoDB table to track the Redis nodes and their corresponding shard numbers.
- On creation, the Bookkeeper script determines the shard number for the new Redis node from the DynamoDB table entries and records an entry for it. DynamoDB atomic operations are used for synchronization. The first node to join a shard becomes the master, and a CNAME is created for it. The next Redis node to join that shard becomes a slave and finds its master via the CNAME.
- When a Redis node dies or fails, the ASG posts a message to an SNS topic, which triggers a "Cleaner" Lambda to remove the corresponding entry from the DynamoDB table. An email notification is also sent for monitoring.
- Wait, now who does the failover? Redis Sentinel.
- Redis Sentinel provides high availability for Redis. Sentinel promotes a slave to master and invokes a "fixer script" to reassign the master CNAME to the newly promoted node. Sentinel then re-points the remaining slaves of the deceased master to the new one. Of course, a notification email is sent.
- The ASG spins up a new Redis node. The Bookkeeper script determines its shard number, sets its role to slave, and points it at the master's CNAME.
- The new slave syncs data from the master.
- Our sleep goes undisturbed.
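The Sentinel behavior described above maps onto a handful of standard `sentinel.conf` directives. The shard name, hostname, quorum, and script paths below are hypothetical; the directive names are Redis Sentinel's own:

```
# Monitor the master of one shard (name, host, port and quorum are illustrative)
sentinel monitor shard-7 redis-shard-7-master.example.com 6379 2

# Consider the master down after 5s without a valid reply
sentinel down-after-milliseconds shard-7 5000
sentinel failover-timeout shard-7 60000

# On failover, run the "fixer script" that re-points the master CNAME
sentinel client-reconfig-script shard-7 /opt/redis/fixer-cname.sh

# Send the notification email on Sentinel events
sentinel notification-script shard-7 /opt/redis/notify-email.sh
```

One such `sentinel monitor` block is needed per shard, and Sentinel itself should run as a small quorum of instances so that failover decisions do not depend on a single process.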
The internal data store is simpler than the external one: it does not need to replicate data across AWS boundaries, nor serve a massive fleet of pricing engines.
With a published ElastiCache limit of 15 Redis instances in cluster mode, the read/write throughput was not sufficient to achieve the desired latency. We ended up hosting the Redis cluster ourselves, with 35 masters and 35 slaves managing 900GB of data. Failover is handled by the cluster itself, so there is no need for a Sentinel as in the external data store.
Replicating to 1500 Redis servers
As indicated in the diagram, external data is replicated to 1500 Redis instances in our data center. Here’s how we managed to replicate data to so many servers.
Layering is the key. Redis allows a node to be both a master and a slave at the same time. The number of slaves per master was determined by the load that data syncing puts on the master. Redis also has a configuration option that controls whether a slave can be promoted to master on failover. We use it to prevent any slave in the data center from being promoted to master, which ensures the caching service never ends up writing from AWS into our data center servers and keeps latency down.
With a 1:N master-slave relationship at the bottom-most "traffic serving" Redis layer, a full snapshot sync hits all N nodes at the same time, leaving none of them available to serve traffic.
With chain slaving (master => slave 1 => slave 2 => … => slave N), a full data sync takes only 1/Nth of the Redis capacity offline at any given time, so more capacity remains available to serve traffic. It does increase the data propagation time down the chain, but we found that a better trade-off than being unable to serve traffic at all.
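Both behaviors above are plain Redis replication settings. A sketch of the `redis.conf` for a mid-chain data center node follows; the hostnames are hypothetical, and the directive names match the pre-5.0 Redis releases this era used (Redis 5.0 renamed them to `replicaof` and `replica-priority`):

```
# Mid-chain node: replicate from the previous link in the chain,
# not directly from the master (hostname is illustrative)
slaveof redis-shard-7-slave-2.example.com 6379

# Priority 0 tells Sentinel this node must never be promoted to master,
# so all writes stay in AWS
slave-priority 0
```

Every data center node carries `slave-priority 0`; only the AWS-side slaves remain eligible for promotion during failover.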
Automate if you can.
Manual efforts are tedious and error-prone. We love to use proven infrastructure solutions to deploy, monitor and maintain our systems, but our requirements meant we sometimes could not rely on off-the-shelf automation. That opened a huge opportunity for us to experiment, learn, accomplish and share the knowledge. The more scenarios we automate, the more obligated we are to keep the tooling streamlined and easy to use.
Did we automate everything?
We did not automate everything, but we did so for most of our scenarios.
What happens if a master and its slave go down at the same time? Though the Redis nodes will be restored, manual work is needed on such rare occasions to restore the lost data.
We set out to find the right caching solution for our cloud implementation. Redis and AWS CloudFormation turned out to be our best options for resiliency, scalability and ease of maintenance. Our system continues to evolve with the business needs every day. We'd love to hear your questions and comments below.