
Horizontal Scaling of a Stateful Server with Redis Pub/Sub


Introduction

What happens when your app is gaining more and more traction and you need to quickly scale a seven-year-old monolith codebase?

After scaling vertically to the limits of your budget (or of AWS's capabilities), you might, as we did, hit a wall: a single machine just can't handle that much traffic.

The first answer that comes to mind is "move to a microservices architecture". With a seven-year-old codebase, that has to be done gradually, but a solution for the current load must be implemented in the meantime. This article describes that solution: how to scale an old stateful monolith horizontally.

In this post, I will walk you through a process I recently experienced while transforming a stateful service into a stateless, multi-instance service.

Key services in your app should always be scalable. You would not want your main logic to fail due to a temporary peak in traffic. In many cases, scaling horizontally (i.e. running multiple instances of a single service) is the best option for handling load. When your service is stateful, though, this is hard to accomplish (some would say "hard" is an understatement).

WalkMe gets billions of hits to its API within very short periods of time. This was not always the case. The old monolith used to handle the old load pretty well, but what worked when WalkMe was a small startup does not work now that it is a grown company worth $2B.

This is the story of scaling an old monolith by converting it into a stateless service using Redis pub/sub (and why we didn't choose tools like Amazon ElastiCache and Amazon SNS).

The Problem

The load on the server used to be around 22.5 million monthly requests, and the number increased from month to month. Within one year, we reached more than 30 million monthly requests (New Relic SLA charts attached).

App server requests processed

Accordingly, the response time increased, and the customer experience was damaged.

App server average response time

Moreover, and maybe the most painful part: when the single server had downtime, the application itself had downtime as well. We couldn't afford to stay in this situation.

Moving to microservices is not so easy

An obvious solution for such a problem is the microservices architecture. It is definitely a work in progress here at WalkMe.

In our case, we are talking about a monolith that is a seven-year-old project. It's not so easy to break down. We started the process gradually, by extracting pieces of code into microservices. You can hear about the process in this meetup talk I gave.

But until that process is finished, we needed a quick fix for the situation.

Step 1: Define the state

As mentioned above, the problem with running multiple instances of our server is the state. The first thing we did was define our state, so we would be able to extract it and make our service stateless.

Our state consisted of two cache mechanisms:

Self-hosted files cache: Self-hosting is a deployment option WalkMe offers in order to enhance the security of WalkMe on your app. Self-hosting allows you to keep all of your WalkMe content files on your servers. This removes the dependency on external servers for the app’s ongoing activity. These files are accessible to our users through the back-office system. Because downloading the files from S3 is very slow, we keep these files in cache for a better user experience when downloading.

Design templates file cache: Used as part of the "WalkMe Gallery". The WalkMe Gallery allows users to apply pre-made templates to balloons and Shoutouts. We keep the images and the necessary related files in the cache to decrease the gallery's loading time.

Figure 1: The WalkMe gallery. The data shown is taken from the Design Templates cache.

The purpose of both caches is to minimize the time it takes to fetch files from S3, which reduces load time for the user. For reference, it took almost two minutes to download ~1.5GB from S3, while reading the same data from local disk took ~20 seconds, around six times faster.

The caches are refreshed when the S3 files change, so the server has up-to-date data at all times.

While these caches improve our app's performance, they make our server stateful.

Table 1: A summary of the two caches and their refresh mechanisms
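To make this concrete, here is a minimal sketch of what such an in-memory file cache might look like. This is an illustration only, with hypothetical names, not our production code:

using System.Collections.Generic;
using System.Collections.Concurrent;

// A simplified in-memory file cache. This is the "state" that makes
// the server hard to scale horizontally.
public class InMemoryFileCache
{
    private readonly ConcurrentDictionary<string, byte[]> _files =
        new ConcurrentDictionary<string, byte[]>();

    // Serve a file from memory instead of downloading it from S3.
    public byte[] GetFile(string key) =>
        _files.TryGetValue(key, out var content) ? content : null;

    // Called when the files in S3 change: replace the cached entries.
    public void Refresh(IDictionary<string, byte[]> freshFiles)
    {
        _files.Clear();
        foreach (var pair in freshFiles)
            _files[pair.Key] = pair.Value;
    }
}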

What’s the problem then?

The servers that handle the cache can stand a certain amount of traffic, but since WalkMe has grown in size, a single server is unable to take the load. The obvious solution would be to use a load-balancing technique.

We would add more servers that hold the cache and balance the load between them.

While this is relatively easy to do (just spin up another .NET server that serves the data), it raises a different problem. Figure 2 illustrates how syncing the data between the servers when it changes (in our case, on a cache refresh) becomes a problem.

Figure 2: Our problem of moving to a multi-instance architecture

Figure 2 describes the process in the single-instance (gray area) and multi-instance (blue area) scenarios. In the single-instance scenario (left side), calling the refresh-cache function after updating S3 will activate it on the single instance that got the request. In the multi-instance scenario (right side), on calling the refresh-cache function, only one random server will refresh its own in-memory cache.

If a user requests a design template, it will be routed to the most available server by the load balancer. This server might not be the one that was updated via the refresh mechanism.

Without a syncing mechanism, the other servers will still hold the old cache.

Step 2: Research Caching Solutions

When we started, we first considered Amazon ElastiCache, a fully managed in-memory data store, compatible with Redis.

ElastiCache is a solution that lets you extract the cache from the server itself into a shared, distributed cache. Sounds perfect, right?

Figure 3: ElastiCache architecture (Diagram by Amazon)

Figure 3 shows the ElastiCache architecture. There are two (or more) servers behind a load balancer. The cache is not stored on the servers themselves but in third-party storage. Refreshing the cache is a non-issue here, since we have a single source for the data.

While this solution seems to solve both of our issues (load balancing and data sync) at the same time, some limitations made it not good enough for us:

File size

Our cache stores big files. Amazon ElastiCache for Redis, being a single-threaded server, is not designed for storing large objects. A single request will be fast enough; concurrent requests will be slow, because they are all processed by one thread. You can read more about Redis performance issues here.

An extra step on the way to the client

So far, the cache was stored on the server itself (disk/RAM). With this solution we are adding another step: the server would need to read from the cache over the network. This would cause an unwanted delay in retrieving the files.

Costs

ElastiCache pricing is very high when it comes to big files. Keeping the files on the server reduces costs dramatically.

Time Limits

After investigating, we estimated that adapting our servers to the new architecture would take too long. We needed a quicker win.

If the out-of-the-box solution is too expensive, we should look for an alternative. We thought of using a pub/sub messaging system as our solution. How can a pub/sub system help here? Read on!

What is Pub/Sub Messaging?

If you are familiar with the concept, you can skip to the next section.

Publish/subscribe messaging, or pub/sub messaging, is a form of asynchronous service-to-service communication used in serverless and microservices architectures. In a pub/sub model, any message published to a topic is immediately received by all of the subscribers to the topic. Pub/sub messaging can be used to enable event-driven architectures, or to decouple applications in order to increase performance, reliability and scalability.

[Quote from AWS blog]

Step 3: Pub/sub distribution system

Because extracting the cache using Amazon ElastiCache does not fit our case, we tackled the problem from a different angle.

Our original problem, as illustrated in Figure 2, was out-of-sync instances when refreshing the cache. Instead of trying to extract the cache out of the servers, we looked for a way to sync the cache in all the servers. The solution will look something like Figure 4.

Figure 4: The solution architecture to sync the servers. The pub/sub mechanism (the three question marks) will be responsible for passing syncing messages between the servers and making sure we have a thumbs-up everywhere

In this architecture, every instance can act as a subscriber and also as a publisher. All the instances will be subscribers, as they subscribe themselves to the relevant cache-refresh topics when they load (topic is the SNS term for a channel).

During a refresh phase, only one instance will get the refresh command. It will then publish the change via our pub/sub system. All servers, including the publisher, which also listens to the topic, will receive the message and update their cache. Hence, we will have multiple instances with the same cache.
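Here is a rough sketch of that flow in C#. The IPubSub interface is a hypothetical placeholder for whichever pub/sub technology we end up choosing:

using System;

// Hypothetical abstraction over the pub/sub system.
public interface IPubSub
{
    void Subscribe(string topic, Action<string> onMessage);
    void Publish(string topic, string message);
}

public class CacheSyncService
{
    private readonly IPubSub _pubSub;

    public CacheSyncService(IPubSub pubSub)
    {
        _pubSub = pubSub;
        // Every instance subscribes to the refresh topic on startup.
        _pubSub.Subscribe("refresh-cache", RefreshLocalCache);
    }

    // Runs only on the single instance that received the refresh request.
    public void OnRefreshRequested(string cacheName)
    {
        // Publishing notifies ALL instances, including this one.
        _pubSub.Publish("refresh-cache", cacheName);
    }

    private void RefreshLocalCache(string cacheName)
    {
        // Re-read the relevant files from S3 and replace the in-memory copies.
    }
}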

A comparison between what we had before the change and after the change is summarized in table 2.

Table 2: A summary of the planning phase. The "Before" column describes each mechanism before the change. The "After" column is the desired behavior of the corresponding mechanism.

All we need to do now is to find the right pub/sub technology to work with in order to sync our servers.

The first pub/sub technology we tested was Amazon SNS.

Take 1 — Amazon SNS

We were already using a lot of AWS products. That, combined with the fact that I already had experience with Amazon SNS, made it my first choice for a pub/sub system.

Figure 5: SNS architecture (Diagram by Amazon)

So, what exactly is SNS and how does it work?

Amazon Simple Notification Service (Amazon SNS) is a web service that coordinates and manages the delivery or sending of messages to subscribing endpoints or clients.

In Amazon SNS, there are two types of clients — publishers and subscribers — also referred to as producers and consumers. Publishers communicate asynchronously with subscribers by producing and sending a message to a topic, which is a logical access point and communication channel (see the cloud in figure 5).

Subscribers (there are various types of them, shown in the right part of figure 5) consume or receive the message or notification over one of the supported protocols when they are subscribed to the topic.

In my previous experience with SNS, I used it as a mail-distribution-list tool. In that case the subscribers were email addresses, and publishing to a topic sent the published message to all the subscribed addresses.

Now, I wanted to use SNS in a different way. The subscribers would be the server instances; more specifically, an instance endpoint that refreshes the cache. In addition, each subscriber would also act as a publisher whenever it was the first to refresh the cache.
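With the AWS SDK for .NET, the subscription attempt looked roughly like this. Note this is a sketch; the topic ARN and endpoint URL are placeholders:

using Amazon.SimpleNotificationService;

var sns = new AmazonSimpleNotificationServiceClient();

// Subscribe this instance's cache-refresh endpoint to the topic.
await sns.SubscribeAsync(
    "arn:aws:sns:us-east-1:123456789012:refresh-cache", // placeholder topic ARN
    "https",                                            // delivery protocol
    "https://10.0.1.23/api/refresh-cache");             // placeholder private endpoint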

Why didn’t we use Amazon SNS?

On calling my subscribe function, I got this error message:

Not authorized to subscribe internal endpoints (Service: AmazonSNS; Status Code: 403; Error Code: AuthorizationError; Request ID: 14bc8224-4b0d-5b9a-a61c-d24122fdf703)

I found this line in AWS documentation:

VPC endpoints don't allow you to subscribe an Amazon SNS topic to a private IP address.

That means that if I wanted to subscribe my instance endpoints to the SNS topic, I would have to turn the instances' IPs into public IPs. This is something we would never do, for security reasons. In our architecture, and as a best practice, only the load balancer has a public IP that is open to the internet.

So I moved on to my next try: Redis pub/sub.

Take 2 — Redis pub/sub

Redis, which stands for Remote Dictionary Server, is a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue.

[Quote from AWS documentation]

While we decided not to use a third-party cache, Redis, thanks to its abilities, can also be utilized as a pub/sub mechanism.

The idea is the same as the SNS concept (see the SNS diagram above in figure 5), but there are some differences:

  • There's no need to explicitly open a topic. Redis pub/sub creates a channel when a consumer subscribes to a channel that doesn't exist yet.
  • Redis pub/sub accepts a function as a subscriber, rather than an endpoint address as SNS does.
  • There's no need to unsubscribe from Redis pub/sub; unlike SNS, which keeps the endpoint as a subscriber until it's deleted, Redis removes the subscriber immediately when the server shuts down.

All those points made Redis pub/sub much easier to use than SNS.
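For illustration, here is a minimal sketch of the wiring using the client we ended up with (more on it in the implementation details below). The host and channel names are placeholders:

using System;
using StackExchange.Redis;

// Connect once and reuse the multiplexer for all operations.
var redis = ConnectionMultiplexer.Connect("my-redis-host:6379");
var subscriber = redis.GetSubscriber();

// Every instance subscribes on startup. The handler fires on every
// subscribed instance whenever a message hits the channel.
subscriber.Subscribe("refresh-cache", (channel, message) =>
{
    Console.WriteLine($"Refreshing local cache: {message}");
    // Re-read the relevant files from S3 here.
});

// The instance that received the refresh request publishes the change.
subscriber.Publish("refresh-cache", "design-templates");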

The complete solution

Now that we have our pub/sub system in place, we can finally say: mission accomplished. We can now add the pub/sub solution to our architecture to complete the puzzle.

Figure 6: The stateful monolith (left), the synchronization problem (middle) and the fully implemented solution (right)

Some Implementation Details

Here are a few practical tips for those of you who want to use Redis pub/sub. These can save a lot of valuable time.

First, I chose the StackExchange.Redis client to interact with Redis. After investigating the various clients for C#, this option was the best for several reasons:

  • Fewer limitations on connections / publish actions / subscribers
  • Open source and free to use
  • An active, "breathing" community with a lot of contributors
  • Thousands of monthly downloads
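One general practice worth mentioning, recommended by the StackExchange.Redis documentation rather than anything specific to our setup: ConnectionMultiplexer is thread-safe and expensive to create, so create it once per process and share it, for example through a lazy singleton:

using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // A single shared multiplexer per process ("my-redis-host" is a placeholder).
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect("my-redis-host:6379"));

    public static ConnectionMultiplexer Instance => LazyConnection.Value;
}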

The second tool I would like to recommend is a Redis GUI. To test Redis connections and subscriptions after everything is set up, I used an easy-to-use GUI: Redis Commander.

Download > npm install > node bin/redis-commander.js > Browse to localhost:8081

In the GUI, to connect to a server, insert the specific Redis server URL. Here are some popular CLI commands you can use for testing your environment:

Publish a message to a channel: PUBLISH channel_name message_content

Check how many subscribers exist for a channel: PUBSUB NUMSUB channel_name
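For example, a quick smoke test from two redis-cli sessions might look like this (the channel name is just an example):

SUBSCRIBE refresh-cache (in the first terminal, to listen on the channel)

PUBLISH refresh-cache design-templates (in the second terminal; returns the number of subscribers that received the message)

PUBSUB NUMSUB refresh-cache (verifies how many subscribers are listening on the channel)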

Summary

In this article I shared my journey from a single stateful server to multiple synced instances.

Our problem was that a single stateful server held a cache, and it needed to scale. Just adding more machines would not do, because the cache in each server would not be synced after a cache refresh/update (figure 2).

Our first attempt was to use a third-party cache (AWS ElastiCache). This wasn't the best option for our use case, mainly because it is not a good fit for large files, in both performance and pricing.

We tackled the issue from a different angle, by syncing the state across all servers using a pub/sub system. We saw the considerations in choosing Redis pub/sub over Amazon SNS, the main issue being that SNS does not allow private endpoint addresses.

Finally, we saw some implementation tips on using the technologies we chose, like the StackExchange.Redis client and Redis Commander.

Thanks to Yonatan Kra and Minor Ben-David.
