Scale AWS ElastiCache

Playing with node types with minimal service impact

Published in Expedia Group Technology, Jul 30, 2019

As we migrated our online service to AWS, we also migrated our service’s cache layer from Redis to AWS’ ElastiCache for Redis, an in-memory data store built on open-source Redis and compatible with Redis APIs. We learned how to perform this migration with minimal impact to the service.

We managed to pre-populate the new cluster starting from a dump of the old one using CloudFormation (CF), but…

How Can We Improve Cache Throughput and/or Storage Capacity Without Service Degradation?

Looking at the Amazon ElastiCache features, we found:

Online resharding

Online resharding can be done without service degradation, but you can only perform the following changes:

Scale out: increase read and write capacity by adding shards
Scale in: reduce read and write capacity, and thereby costs, by removing shards
Rebalance: move the keyspaces among the shards
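
For reference, scaling out and scaling in both go through a single ElastiCache API operation. The following is a minimal sketch that only builds the request parameters, assuming boto3 is what eventually sends them; the replication group ID `my-cache` and the shard IDs are hypothetical.

```python
def resharding_request(replication_group_id, target_shards, shards_to_remove=None):
    """Build keyword arguments for an online resharding request."""
    request = {
        "ReplicationGroupId": replication_group_id,
        "NodeGroupCount": target_shards,
        # The API currently accepts only True for this parameter.
        "ApplyImmediately": True,
    }
    if shards_to_remove:
        # Scale in: ElastiCache must be told which shards to drop.
        request["NodeGroupsToRemove"] = list(shards_to_remove)
    return request


# Hypothetical values, for illustration only.
scale_out = resharding_request("my-cache", target_shards=4)
scale_in = resharding_request("my-cache", target_shards=2, shards_to_remove=["0003", "0004"])

# With boto3 installed and AWS credentials configured, the call would be:
#   boto3.client("elasticache").modify_replication_group_shard_configuration(**scale_out)
```

Note that nothing in this request changes the node type, only the number of shards.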

Unfortunately, none of these fit our use case: they all change the number of shards or the distribution of keys among them, not the node type.

Offline resharding

This kind of operation causes a service outage since it needs to destroy and then recreate the cluster to apply the changes. These are the changes that are permitted:

Scale up/down: change the node type in order to increase/decrease the memory capacity of each node
Change the Redis engine version (the supported versions are listed in the ElastiCache documentation)

Scaling up Is What We Are Looking for but…

We need to guarantee cache operation during the scaling process!

To scale up without causing a cache outage, we need to create a second CF stack, populate it with the data from the original stack, and then swap the cache endpoint that our service is using.
This requires separating the CF resource creation from the application deployment and duplicating the CF template that defines the cache resources. Once that is done, the procedure is:

Perform a snapshot of the original Redis cluster (C1) and take note of the snapshot name
Create a new stack for the new Redis cluster (C2). In the C2 CF template, set the new node type (in our use case, a bigger one) and the C1 snapshot name, so that C2 is created already populated with the data from C1
Once C2 is created, update the Redis endpoint in the service configuration and redeploy the service to swap the Redis cluster
Delete C1
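
The first two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the exact tooling we used: it assumes the cluster is declared as an `AWS::ElastiCache::ReplicationGroup` resource, that `client` is a boto3 ElastiCache client, and every name in it (`RedisCluster`, `c1-before-scale-up`, `redis-c2`, the node types) is hypothetical.

```python
import copy
import json
import time


def snapshot_and_wait(client, replication_group_id, snapshot_name, poll_seconds=15):
    """Step 1: snapshot C1 and block until the snapshot is available."""
    client.create_snapshot(
        ReplicationGroupId=replication_group_id,
        SnapshotName=snapshot_name,
    )
    while True:
        snapshots = client.describe_snapshots(SnapshotName=snapshot_name)["Snapshots"]
        if snapshots and snapshots[0]["SnapshotStatus"] == "available":
            return snapshot_name
        time.sleep(poll_seconds)


def derive_c2_template(c1_template, resource_name, node_type, snapshot_name):
    """Step 2: copy the C1 template, changing only the node type and the seed snapshot."""
    c2_template = copy.deepcopy(c1_template)
    properties = c2_template["Resources"][resource_name]["Properties"]
    properties["CacheNodeType"] = node_type
    properties["SnapshotName"] = snapshot_name
    return c2_template


# Hypothetical, heavily trimmed C1 template; a real one also carries subnet,
# security group, and shard settings, which are copied over unchanged.
c1_template = {
    "Resources": {
        "RedisCluster": {
            "Type": "AWS::ElastiCache::ReplicationGroup",
            "Properties": {
                "ReplicationGroupDescription": "service cache",
                "Engine": "redis",
                "CacheNodeType": "cache.m5.large",
            },
        }
    }
}

c2_template = derive_c2_template(
    c1_template, "RedisCluster", "cache.m5.xlarge", "c1-before-scale-up"
)
c2_template_body = json.dumps(c2_template, indent=2)

# With boto3 installed and AWS credentials configured, the steps would run as:
#   snapshot_and_wait(boto3.client("elasticache"), "c1-group-id", "c1-before-scale-up")
#   boto3.client("cloudformation").create_stack(
#       StackName="redis-c2", TemplateBody=c2_template_body)
```

If the C1 template hard-codes a replication group ID, C2 needs a different one, since both clusters exist side by side until C1 is deleted.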

Trade-off

To avoid degrading service performance, we need to leave C1 running while C2 is being created. As a result, none of the cache keys written to C1 between the snapshot creation and the endpoint swap will be present in C2.
The impact of this “data loss” depends mainly on the size of C1: the bigger the dataset, the longer it takes to create the snapshot and to build a new cluster from it, and so the wider the window in which writes are missed.
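
One way to gauge that impact before swapping the endpoint is to sample keys from C1 and check whether they exist in C2. This is a rough sketch, not something from our original procedure: it assumes both clients expose `scan_iter` and `exists` the way redis-py clients do, and the `limit` parameter is a hypothetical safeguard to bound the scan on a large cache.

```python
def count_missing_keys(source, target, limit=10_000):
    """Check up to `limit` keys from `source` (C1) and count those absent from `target` (C2).

    Returns a (checked, missing) tuple.
    """
    checked = 0
    missing = 0
    for key in source.scan_iter(count=1000):
        if checked >= limit:
            break
        checked += 1
        if not target.exists(key):
            missing += 1
    return checked, missing


# With redis-py installed, usage might look like this (hostnames are hypothetical):
#   c1 = redis.Redis(host="c1.example.internal")
#   c2 = redis.Redis(host="c2.example.internal")
#   checked, missing = count_missing_keys(c1, c2)
```

The result is only an approximation: keys can expire between the scan and the check, and a key that exists in both clusters may still hold a stale value in C2.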