Multi-Tenant SaaS: How we migrated from on-premise Redis to Elasticache

Deepak Sreekumar
Published in SAFE Engineering · Oct 5, 2021

Like many successful enterprise software products, our SAFE platform started as an on-premise solution. We then started our journey towards the cloud, following a "Lift and Shift" (Re-hosting) approach as the first step. The migration was smooth and the product continued to work very well… but…

Re-hosting meant that we were essentially following a Multi-Instance architecture, running all of our 25+ microservices on separate EC2 machines for each of our customers. This is not a sustainable solution from either an engineering or a business perspective. The monthly AWS bill soon became a cause of concern, and we decided it was time to begin our journey towards complete multi-tenancy before it was too late. In this voyage, Redis was one of the first pieces we decided to tackle.

Our product was using a dockerized Redis instance for its caching needs. It was running along with the other microservices in the dedicated EC2 server for the customer. After evaluating the current and future caching requirements, we decided to replace this pattern with a single Elasticache deployment shared between the customers. Before we could make this shift, we had to make a couple of changes to the existing container.

Add authentication for the Redis container


But wait, why do you need to keep maintaining the Redis container and introduce authentication if you are planning to move towards Elasticache? Why don’t you let it have a quick and painless death?!

Well, we were not ready to say goodbye, yet….

But why?

  • Backward compatibility: We have multiple production deployments running in multi-instance mode, and the transition to the multi-tenant architecture cannot be achieved in a single step. Until we have completely separated the data layer and the compute layer of the product and tested it thoroughly, we cannot go ahead and migrate the existing customers to the new pattern. So both the Redis container and Elasticache should be supported for the time being.
  • Developer convenience: When we embarked on this journey, we decided to do it in a way that would cause the least inconvenience to our rockstar developers. Serverless and cloud-managed services are exciting and solve a lot of problems for us, but we know from experience that they can take a toll on developer productivity. Development is not as easy and fast as working on a Docker-based microservice. We wanted our teams to continue developing and debugging in their dev setups without feeling any difference — as much as possible.
  • Parity: Since the developers will continue to use Docker-based Redis, we have to reduce the chances of something breaking when a newly developed feature is deployed in a setup configured with Elasticache. So both systems should be compatible with each other in terms of functionality, authentication, data separation, etc.

Redis versions 6.0.0 and above support username and password-based authentication. Let the username be testuser and the password be password. To enable authentication, we first have to calculate the SHA-256 digest of the password:

echo -n "password" | sha256sum | head -c 64
> 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8

Then turn off the default user and add the username and the password digest in redis.conf as follows:

user default off nopass ~* &* +@all
user testuser on #5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8 ~* &* +@all

Here the second part of the line describes the ACL (Access Control List) rules. The rule here means that testuser has access to every possible key (~*) and Pub/Sub channel (&*), and can call every possible command (+@all).
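As a quick sanity check (assuming redis-cli 6.0 or newer is available in or alongside the container), the new user should now be able to authenticate:

redis-cli --user testuser --pass password ping
> PONG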

But in a multi-tenant setup, we can’t let a tenant access “All” keys. So now what?

Data isolation


In a multi-tenant instance, we need to make certain that a Tenant can only access the data that belongs to them. It’s not enough to handle this in the application logic. Developers can make mistakes and we need to make it impossible for one tenant to access the data of another tenant even if there is a bug in the code.

To achieve this in Redis, we need to create a separate key namespace for each tenant. This can be done by adding a dedicated key prefix (the Tenant ID) to every key belonging to a tenant. As in, the key company_id for the tenant tenant1 would become tenant1:company_id. Then we need to configure permissions in such a way that one tenant cannot access the keyspace of another.

Handling the key prefixing individually in the code everywhere we store or access a value in Redis is no small feat. And it would not be the right thing to do even if it were easy. Chances are high that a new developer will not have this "tribal knowledge" of how multi-tenancy is handled and will easily make mistakes.

Luckily, most Redis clients support Transparent Key Prefixing. In our codebase, which consists mostly of TypeScript, we have been using ioredis as the client library. With ioredis, we can easily define a keyPrefix for the Redis client and export it for all other files to use.
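A shared client module could look roughly like the sketch below. The environment variable names and connection details are illustrative, not our exact code:

// redisClient.ts: a minimal sketch of a shared ioredis client with a tenant key prefix
import Redis from "ioredis";

// TENANT_ID and REDIS_* variables are assumed to be injected per deployment
export const redisClient = new Redis({
  host: process.env.REDIS_HOST,
  port: Number(process.env.REDIS_PORT) || 6379,
  username: process.env.REDIS_USERNAME,
  password: process.env.REDIS_PASSWORD,
  // Every key read or written through this client is transparently
  // prefixed, e.g. company_id becomes tenant1:company_id
  keyPrefix: `${process.env.TENANT_ID}:`,
});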

This made the code changes straightforward and developers can keep working their magic without worrying about authentication or data separation. All was well, until….

Some of the functionality started to break with this Redis client! Certain methods were not able to find the keys they were looking for in Redis. After some debugging, we found out that Transparent Key Prefixing (TKP) does not automatically apply to some Redis methods like keys and scan, which take patterns as input rather than actual keys. TKP will also not remove the prefix from the key names returned by such methods. This is by design rather than a bug, but that offered little comfort as our use case remained unsolved. It meant that we needed to handle these cases ourselves in the exported client. Again, it should be done in such a way that the implementation is abstracted away from other developers. We can do that by extending the Redis class and overriding these methods to handle keyPrefix, as done for the keys method below.
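Roughly, the override prepends the prefix to the incoming pattern and strips it from the returned key names. This is a sketch using the same illustrative names as above, not the exact production code:

// prefixedRedis.ts: sketch of extending ioredis so pattern-based commands also honour the prefix
import Redis, { RedisOptions } from "ioredis";

export class PrefixedRedis extends Redis {
  private readonly prefix: string;

  constructor(options: RedisOptions & { keyPrefix: string }) {
    super(options);
    this.prefix = options.keyPrefix;
  }

  // keys() takes a pattern, so TKP does not apply automatically:
  // add the prefix to the pattern and remove it from the results
  async keys(pattern: string): Promise<string[]> {
    const prefixedKeys = await super.keys(`${this.prefix}${pattern}`);
    return prefixedKeys.map((key) => key.slice(this.prefix.length));
  }
}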

So we have a username, password, and keyPrefix . But what prevents one Tenant from accessing the key namespace of another Tenant? Nothing! Yet.

In one of the earlier snippets, we modified redis.conf to add a test user, and that user had permission to access all keys and perform all actions. Let's change that a bit. Instead of creating a user with access to every possible key (~*), let's create a user corresponding to the TENANT_ID, who has all access (+@all) only to the key namespace starting with the prefix TENANT_ID: (i.e. ~${TENANT_ID}:*).

user default off nopass ~* &* +@all
user ${TENANT_ID} on #${REDIS_PASSWORD_SHA} ~${TENANT_ID}:* &* +@all

Security Warning: Using +@all as the command category comes with certain risks. For example, +@all means that the user can list all of the key names in the keyspace (using the keys * command), including those which do not match the key prefix defined in the rule (even though the values of those keys cannot be accessed). Hence the key names should not contain any information that is sensitive in nature. The user can also perform dangerous operations such as flushall when the category is +@all. So the command category should be carefully chosen to be as restrictive as possible based on the application needs (refer to the doc for more details).
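For illustration only (and assuming the application does not need any command in the @dangerous category, which covers keys, flushall and similar), a tighter rule could look like this:

user ${TENANT_ID} on #${REDIS_PASSWORD_SHA} ~${TENANT_ID}:* &* +@all -@dangerous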

At this point, all of our microservices were using authentication and keyPrefix while accessing the Redis container. Thus our containerized Redis itself could serve as a multi-tenant cache if different Tenants were connected to it. Neat! Now it was time to move to Elasticache.

Migration to AWS Elasticache

AWS Elasticache offers a fully managed Redis service, which means the migration could be as simple as changing the connection strings, and one should be able to hit the ground running. It offers three types of deployments (single node, cluster mode disabled, and cluster mode enabled). Explaining these further is out of scope for this article; more details can be found here. After analyzing our use case, we decided to go with a single-shard (cluster mode disabled) deployment with 2 nodes (1 primary and 1 replica) and Multi-AZ enabled.

The instance was created in a private subnet and was assigned a security group that allows inbound connections on port 6379 (the default Redis port) from the security group attached to the EC2 instance where the tenant application runs.

Once deployed, find the Primary Endpoint from the cluster details.

Our application would be using the following config to connect

REDIS_HOST=master.<primaryendpoint>.amazonaws.com
REDIS_PORT=6379

Elasticache provides two methods of authentication:

  • User Group Access Control List
  • Redis Auth Default User

In our case, we would be having different users for different Tenants and defining their ACL based on the key prefix. Hence the right option for us is the User Group Access Control List.

  • While creating an Elasticache User Group, a default user must be selected. There will always be an AWS-managed default user, but this user cannot be modified or password protected. So it's recommended to create a new user with the User Name default from User Management.
  • Another user for the tenant can also be created from the same page, but this time the Access String should be tailored based on the Tenant ID (key prefix). As in, for the tenant ID tenant1001 the access string would be
on ~tenant1001:* +@all

Note: The ACL is slightly different from the one we saw with the Redis container, as the pub/sub channel rule (&*) is absent here.

  • Now we can create an Elasticache User Group from User Group Management. While creating it, we should select both the default user and the tenant user to add to the group. This group can be modified later as we onboard more tenants.
  • Once the group has been created, we can select the Redis cluster deployment and modify it to use User Group based authentication, selecting the user group we created. (The same steps can also be scripted, as sketched below.)
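For reference, the equivalent user, user group, and cluster association could be scripted with the AWS CLI roughly as follows. The IDs, names, and password here are hypothetical, and the flags should be checked against the current elasticache CLI reference:

# Hypothetical IDs and password, shown only to mirror the console steps above
aws elasticache create-user \
  --user-id tenant1001 \
  --user-name tenant1001 \
  --engine redis \
  --passwords "SomeStrongPassword123" \
  --access-string "on ~tenant1001:* +@all"

# The group must also include a user whose user name is "default" (created above)
aws elasticache create-user-group \
  --user-group-id saas-tenants \
  --engine redis \
  --user-ids new-default tenant1001

aws elasticache modify-replication-group \
  --replication-group-id <cluster-id> \
  --user-group-ids-to-add saas-tenants \
  --apply-immediately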

Almost there. Since we have enabled "Encryption in transit" and "Encryption at rest" for the cluster, we need to enable TLS in our Redis client. For that, let's modify our Redis client to add a tls: {} field to the Redis options when Elasticache is enabled.
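Continuing the illustrative client from earlier (the ELASTICACHE_ENABLED flag is an assumption, not our exact configuration), that could look like:

// redisClient.ts: sketch of conditionally enabling TLS for Elasticache
import Redis, { RedisOptions } from "ioredis";

const options: RedisOptions = {
  host: process.env.REDIS_HOST,
  port: Number(process.env.REDIS_PORT) || 6379,
  username: process.env.REDIS_USERNAME,
  password: process.env.REDIS_PASSWORD,
  keyPrefix: `${process.env.TENANT_ID}:`,
  // Encryption in transit requires a TLS connection to the cluster
  ...(process.env.ELASTICACHE_ENABLED === "true" ? { tls: {} } : {}),
};

export const redisClient = new Redis(options);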

That’s it! The application should be able to connect to Elasticache and use it instead of the Redis container.

Of course, manually creating the cluster and configuring the tenant users for every deployment is not sustainable. So all of our deployments and configurations are automated using AWS CDK, but that's a topic for another post!😊

Hope you enjoyed the article. Happy to answer if you have any questions. If you find it useful, please clap and follow us for more!
