How we migrated an AWS ElastiCache (Redis) cluster to another in production
I’ll keep it simple. In my current organisation, we have a critical service that uses Redis in the backend as its database (and not as a cache). It is so critical that we cannot afford any downtime. It receives concurrent read and write requests all the time. Write traffic ranges from 500 to 2000 req./sec (24*7), and read traffic is easily above 1000–1500 req./sec (24*7).
The Redis we use is AWS ElastiCache. When this service was launched in late 2017, the Redis cluster we assigned to it had 2 shards and 8 nodes in total (4 in each shard). The instance type of each node was m4.xlarge.
Recently, we have been doing many cost optimisations across all our services, such as moving from OpsWorks EC2 infra to AWS ECS. As part of this, we had to downgrade the above Redis cluster too.
By downgrade, I mean moving to a smaller instance type (we chose m5.large) and reducing the number of nodes in each shard.
Now the problem with AWS ElastiCache is that there is no straightforward way to do this. AWS ElastiCache comes with its own set of limitations, like:
- AWS restricts a few Redis commands in its environment: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/RestrictedCommands.html
- You cannot deploy your Redis cluster across multiple regions, although you can spread it across multiple AZs (Availability Zones) within the same region.
The straightforward way to migrate data from one cluster to another is to take a snapshot of the old Redis cluster and create the new cluster from that snapshot. This was clearly not an option for us: by the time we took the snapshot, created the new cluster and moved traffic to it, many writes would have landed on the existing cluster only. That would leave the new cluster with a lot of inconsistent data.
After thinking through different ways to migrate the data from one Redis cluster to another and move traffic to it, with a guarantee of zero downtime and zero inconsistency, we arrived at the following solution.
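To put a rough number on that, here is a back-of-the-envelope sketch. The write rates are the ones quoted earlier; the 10-minute window for snapshot, restore and traffic switch is purely an assumed figure for illustration, not something we measured.

```python
# Write rates quoted for the service (requests per second).
WRITE_RATES = (500, 2000)

# ASSUMPTION: snapshot + new cluster creation + traffic switch takes 10 minutes.
WINDOW_SECONDS = 10 * 60


def missed_writes(rate_per_sec, window_seconds):
    """Writes that reach only the old cluster while the snapshot is being restored."""
    return rate_per_sec * window_seconds


for rate in WRITE_RATES:
    print(f"{rate} writes/sec -> {missed_writes(rate, WINDOW_SECONDS)} writes missing from the snapshot")
```

Even with a much shorter window than the assumed one, the new cluster would start out behind by many thousands of writes.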
1. I created a new ElastiCache Redis cluster (2 shards with 4 m5.large nodes in total, i.e. 2 in each shard).
2. Our service has 4 APIs. One of them is a GET API (which ultimately reads from Redis). The remaining 3 APIs both read from and write to Redis (create/update/overwrite values). I changed the code so that whenever we write to the existing Redis, we then also write to the new Redis. Any error while writing to the new Redis must not affect the existing flow, and the extra write must not add to the latency. For this, I used goroutines (our service is written in Golang) to call the methods that write to the new Redis.
Errors are expected for update/overwrite calls, as the corresponding keys won’t exist yet in the new Redis cluster. In that case I was just logging the error.
3. After the above changes were deployed and stable, I ran a Python script that scans the existing keys in one shard of the existing Redis cluster and copies them (basically DUMP and RESTORE) to a shard of the new cluster (see the code and explanation for the script below). This also overwrites any key data already present in the new Redis (because of step 2). I had to run this once for each of the two shards.
4. After moving all the data to the new Redis cluster, I ran another Python script that compares the value of each key in both Redis clusters (shard by shard).
After the 3rd step, as writes were already going to both Redis clusters, the data reconciled successfully with 0 discrepancies, except for one ‘modified_time’. I’ll leave it to you to figure out why.
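5. Now, as the GET API was still reading from the existing Redis, I added killswitch-style logic: when enabled, it reads from the new Redis; otherwise it reads from the existing (old) Redis. Basically, a killswitch is nothing but a boolean flag in your config. So the code looks roughly like this (new_redis, old_redis and key are placeholder names):
killswitch_flag = getconfig.getKey(killswitch_key)
if killswitch_flag is True:
    value = new_redis.get(key)  # read from the new Redis
else:
    value = old_redis.get(key)  # read from the existing (old) Redis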
We deployed the code with this flag set to TRUE in the config. We monitored everything for a couple of days and waited for any client complaints about inconsistent data or latency issues. Luckily, we got none.
6. So at last, all I had to do was remove the extra code I had added for writes to the new Redis, along with the above killswitch logic. I also replaced the primary Redis endpoint in the configuration with the endpoint of the new Redis cluster we had created.
So in the end, all we had was the same service with a new Redis cluster, but with the same data.
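The dual-write from step 2 and the read switch from step 5 can be sketched together. Our service does this in Go with goroutines; the version below is only an illustration of the same idea in Python with a thread pool, not our production code. The class name DualWriteStore, its parameters and the logger name are all hypothetical, and the two clients are assumed to expose simple get/set methods.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("redis_migration")  # hypothetical logger name


class DualWriteStore:
    """Writes to the old store first, then mirrors the write to the new store
    in the background. Reads go to one store, chosen by a boolean flag."""

    def __init__(self, old_client, new_client, read_from_new=False, workers=4):
        self.old = old_client
        self.new = new_client
        self.read_from_new = read_from_new  # the "killswitch" flag
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def set(self, key, value):
        # Primary write: errors here propagate to the caller as before.
        self.old.set(key, value)
        # Mirror write: runs off the request path so it adds no latency.
        return self._pool.submit(self._mirror_set, key, value)

    def _mirror_set(self, key, value):
        try:
            self.new.set(key, value)
        except Exception:
            # Failures on the new store are only logged, never raised.
            log.exception("mirror write to new redis failed for key %r", key)

    def get(self, key):
        client = self.new if self.read_from_new else self.old
        return client.get(key)
```

The set method returns the future of the background write only so that a caller (or a test) can wait on it; the request path can simply ignore it. Flipping read_from_new back to False sends reads to the old cluster again without a code change.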
Script for migrating keys to new redis-cluster
In the above code, current is the SCAN cursor, with 0 as its initial value. In line 39, we execute the SCAN command with current and a batch_size of 1000 for the number of keys to scan. This returns 2 values: the next cursor and a batch of keys (around 1000; COUNT is a hint, not a hard limit). Once all keys have been scanned, SCAN returns the cursor 0, i.e. the same value we passed in the first time.
Also, I am using Python’s multiprocessing library for faster execution.
One important point to note is that the above script will fail if your old Redis cluster runs a higher version than the new Redis cluster. RESTORE will throw ERR DUMP payload version or checksum are wrong (see this for details: https://github.com/antirez/redis/issues/3348). The new Redis cluster has to be on the same or a higher version than the existing cluster.
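For reference, the SCAN, DUMP and RESTORE loop described above can be sketched as follows. This is a simplified, single-process illustration rather than the script we ran. The function name copy_keys is hypothetical, and src and dst are assumed to be redis-py style clients connected to one old shard and one new shard, exposing scan, dump, pttl and restore.

```python
def copy_keys(src, dst, batch_size=1000):
    """Copy every key from src to dst; returns the number of RESTORE calls made."""
    copied = 0
    cursor = 0  # SCAN starts at cursor 0 and is finished when it returns 0 again
    while True:
        cursor, keys = src.scan(cursor=cursor, count=batch_size)
        for key in keys:
            payload = src.dump(key)
            if payload is None:
                # Key was deleted between SCAN and DUMP; nothing to copy.
                continue
            ttl_ms = src.pttl(key)
            if ttl_ms < 0:
                # Negative PTTL means no expiry; RESTORE uses 0 for "no TTL".
                ttl_ms = 0
            # replace=True overwrites keys already created by the dual writes.
            dst.restore(key, ttl_ms, payload, replace=True)
            copied += 1
        if cursor == 0:
            break
    return copied
```

With redis-py, src and dst would be something like redis.Redis(host=..., port=6379) pointed at the primary node of each shard (the hosts are yours to fill in). SCAN can return the same key more than once, which is harmless here because RESTORE with replace simply overwrites it again.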
Script for checking inconsistency between 2 shards:
I got this script from the article below, which is itself well documented:
Fast and Efficient Parallelized Comparison of Redis Databases
The process of comparing two versions of a database is a fairly common practice, generally used for testing and…
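The core idea of the comparison can be sketched in a few lines. This is not the script from the article above, only a minimal, non-parallel illustration. The names find_mismatches and read_value are hypothetical, the clients are assumed to be redis-py style, and only string and hash keys are handled, which is an assumption about the data rather than a description of ours.

```python
def read_value(client, key):
    """Read a key in a type-aware way so values can be compared directly."""
    kind = client.type(key)
    if kind in (b"none", "none"):
        return None  # key does not exist on this side
    if kind in (b"string", "string"):
        return client.get(key)
    if kind in (b"hash", "hash"):
        return client.hgetall(key)
    raise ValueError(f"unsupported type {kind!r} for key {key!r}")


def find_mismatches(src, dst, batch_size=1000):
    """Return the keys whose value in dst differs from (or is missing versus) src."""
    mismatched = []
    cursor = 0
    while True:
        cursor, keys = src.scan(cursor=cursor, count=batch_size)
        for key in keys:
            if read_value(src, key) != read_value(dst, key):
                mismatched.append(key)
        if cursor == 0:
            break
    return mismatched
```

Values are compared by reading them rather than by comparing DUMP payloads, because the serialized payload can differ between Redis versions even when the data is the same. Note that this only walks the keys of src, so keys that exist only in dst would need a second pass in the other direction.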
A few points in the end:
- The above solution was inspired by this blog: http://elliot.land/post/migrating-data-between-redis-servers
- One good thing in our case was that all the data we had in Redis was persistent, i.e. keys with no TTL (or TTL = -1). That helped in ensuring 0 inconsistency after the migration.
- We did some load testing on the new Redis infra we were choosing before finalising it for the production environment.
- We did not do all the migration steps in 1–2 days. Instead, we completed them over a week, one by one, monitoring along the way.
- Excuse me for the image above, I just searched for migration on Unsplash and found only this one suitable. 😇
PS: In case you find any wrong or missing info, or you have a doubt about any point, you are welcome to add it in the comments. Thanks.