Controllable Scaling in Serverless Big Bang

Marin Radjenovic
8 min read · Nov 11, 2020


With the rise of microservice architectures, message-based communication has become common practice in system design. Tightly coupled monolithic applications are rarely the choice of today's developers or architects; they mostly live on as legacy systems, waiting to be discontinued at some point in the near future.

However, message-based communication between microservices brought new challenges: overusing messages and making the system far more complex than it needs to be, making one microservice dependent on a message that might never arrive, or defining generic message types with no exact purpose that invite misuse when a developer misunderstands them…

In this article, I will discuss the latency that can build up as a result of queueing messages, and share some ideas on how to mitigate it and avoid overwhelming source systems, all with minimal effort.

Where do we make mistakes?

I am sure all of you have faced a situation where some messages or events need to be processed faster than others. Just as in real life, some messages require higher priority when processed.

For instance, think of a hospital reception desk as a message router. People with urgent medical needs, such as a heart attack, require immediate attention compared with people showing common flu symptoms. By evaluating each person's condition, the receptionist routes patients to the appropriate hospital section in a prioritized way.
If you keep all patients in the same queue, some of them might die waiting :(

Similarly to patients in a hospital, some messages and events need quicker attention. Let's walk through a real-life scenario with a seemingly perfect architecture, but also with one shortcoming :)

Sluggish e-mail campaign solution

We had a great idea: build an email campaign tool out of a few already existing external applications. We used a CMS to create the HTML content for the email, integrated a CRM to personalize the content with user details such as first name and last name, and integrated a Segmentation DB to build the target audience.

The solution worked like an assembly line in a factory. Lambdas did the processing, while data was carried from lambda to lambda over Kinesis streams.

- Generate Template Lambda assembled the HTML template created in the CMS with a WYSIWYG editor.
- Prepare Targets Lambda retrieved the users from the segment belonging to a given campaign's targets.
- Populating Lambda pulled user details from the CRM and assembled the email data JSON.
- Email Sending Lambda replaced the placeholders in the email HTML templates, populated them with the received data, and sent the emails to the users.

All in all, a great example of reusing existing infrastructure to produce new value with a serverless paradigm.
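To make the last step concrete, here is a minimal sketch of what an Email Sending Lambda's core logic might look like when triggered by Kinesis. The function names, placeholder syntax (`{{FirstName}}`), and payload shape are my assumptions for illustration, not the original implementation:

```python
import base64
import json

def render_email(template: str, user: dict) -> str:
    # Replace placeholders such as {{FirstName}} with the user data
    # assembled by the Populating Lambda.
    html = template
    for key, value in user.items():
        html = html.replace("{{" + key + "}}", str(value))
    return html

def handler(event, context=None):
    # A Kinesis trigger delivers records base64-encoded under
    # record["kinesis"]["data"]; we assume the decoded payload is the
    # email-data JSON with "template" and "user" keys.
    emails = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        emails.append(render_email(payload["template"], payload["user"]))
    return emails
```

The real function would of course hand the rendered HTML to an email provider instead of returning it.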

However, some things didn’t go as we had hoped…

Facing a sad reality…

Everything worked perfectly for quite some time. We were able to send more than 5 million emails per day.
However, one day while having a few long-running global campaigns in parallel, marketers wanted to have an urgent one-day Valentine's day offer.
So they published it…
Unfortunately, the stats showed no deliveries for hours. No email had gone out for most of the day. Marketing managers were in a panic: more than half the day had already passed and not a single email had been sent.
When we checked the logs, everything was functioning perfectly; emails were coming out, just not for the Valentine's Day campaign.
We checked the campaign and everything seemed fine, except the iterator age on the Kinesis stream triggering the Email Sending Lambda. The iterator was 156,960,405 ms old, or almost 2 days. So if you published a campaign right then, the first emails would come out in 2 days :)
There was nothing smart we could do at that point. You may say "you could just re-shard it and scale it up", but Kinesis re-sharding is something we avoided for a good reason…
So we had to let the campaign fail and break so many hearts on Valentine's Day :(
Customers received the emails two days after Valentine's Day had passed.
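To put that iterator age in perspective, a quick conversion (plain arithmetic, no AWS calls):

```python
# Convert the observed Kinesis IteratorAge from milliseconds to hours/days.
iterator_age_ms = 156_960_405
hours = iterator_age_ms / 1000 / 3600
days = hours / 24
print(f"{hours:.1f} hours, about {days:.1f} days of backlog")
```

Any new record published behind that backlog waits roughly that long before a lambda ever sees it.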

Why didn’t we just scale everything up?

There is one easy answer to the situation above: "just scale up and the job is done…" Even though that is often the quickest and easiest way to overcome big problems, it can also be a ticket to disaster.

In the example above, the solution calls two external systems for information (the Segmentation DB and the CRM). If we scaled up our solution, the load would simply move from our solution to the external systems. We would not have solved the problem, just moved it to systems we depend on.

External systems might be legacy or on-prem, and might have limited capacity to scale. Under increased load, there is a good chance your solution causes congestion on the external systems' side. Users and other systems that integrate with those external systems may then experience latency or even unavailability, and since your solution depends on those systems, it will experience the latency or failures too. What you end up with is a cascading failure on a much larger scale.

In the end, the idea of scaling up could be just a shortcut to disaster.

If scaling up was a bad idea, what did we do then?

The plan was not to increase processing capacity but still to process urgent messages quicker. That led us to message prioritization in the queue.
Together with the marketers, we determined that there should be 3 priority levels when publishing campaigns (HI, MED, LOW).

Unfortunately, neither Kinesis nor SQS supports prioritization of messages. AWS suggests provisioning queues according to priority, so there should be as many queues as there are priorities. That way, we can process messages at a different pace per priority.
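The one-queue-per-priority idea can be sketched in a few lines of Python. The queue URLs below are placeholders, not real resources:

```python
# One queue per priority level; a router picks the destination by the
# message's priority attribute. URLs are hypothetical placeholders.
QUEUES = {
    "HI":  "https://sqs.us-east-2.amazonaws.com/123456789012/Msg-hi-queue",
    "MED": "https://sqs.us-east-2.amazonaws.com/123456789012/Msg-med-queue",
    "LOW": "https://sqs.us-east-2.amazonaws.com/123456789012/Msg-low-queue",
}

def route(priority: str) -> str:
    # Unknown priorities fall back to LOW, so they can never crowd out
    # urgent traffic on the HI queue.
    return QUEUES.get(priority, QUEUES["LOW"])
```

Because each queue has its own consumer capacity, a flood of LOW messages can no longer delay a HI message.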

So after a few days of thinking, we came up with the solution outlined below.

Certain facts had to be considered while making architectural decisions. I will try to explain them below.

How do priority queues eliminate the problem?

As mentioned before, neither Kinesis nor SQS can sort messages by priority. Therefore, the message queue needs to be split into 3 queues, or as many as you have priority levels.
But how do multiple queues eliminate the problem? If we consider that messages terminate at the lambda side and are executed as soon as they reach a lambda, then routing urgent messages to a queue with fewer messages in it means they execute sooner.

Why is there a need for both Kinesis and SQS?

Well, both services queue data, but they do it differently. Kinesis buffers data: each shard accepts up to 1,000 records per second, at a maximum of 1 MB of data per second in total (and a single record can be up to 1 MB). So virtually, in a day, you can put 86,400 MB, or about 86.4 GB, on one shard.
So yes, lots of data… but be careful: retention is only 24 hours by default, or up to 7 days if configured.
SQS, in comparison, accepts a virtually unlimited number of messages per second (FIFO queues are capped at 3,000 messages per second with batching), but at a maximum message size of 256 KB (or up to 2 GB using the Extended Client Library with S3).
Again, lots of data… and again, watch the retention, which can be configured from 1 minute up to 14 days (4 days by default).
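The per-shard daily capacity above is simple arithmetic, worth writing out since it bounds how big a backlog a single shard can even hold within its retention window:

```python
# Back-of-the-envelope daily write capacity of one Kinesis shard at its
# sustained limit of 1 MB per second.
SECONDS_PER_DAY = 24 * 3600          # 86,400 seconds
mb_per_day = 1 * SECONDS_PER_DAY     # 1 MB/s sustained for a whole day
gb_per_day = mb_per_day / 1000       # decimal GB
print(f"{mb_per_day} MB, about {gb_per_day} GB per shard per day")
```

Note this is the 1 MB/s aggregate limit doing the bounding; you hit it long before the 1,000 records/s limit if your records are large.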

From the above said, SQS and Kinesis seem very similar. What is the difference?

Well, the catch is in how they operate with Lambda. It is very important to know, in advance, the throughput limits between your lambdas and the source system you consume data from; too often you only find that out once the system is already in production.
Let's assume an external system that runs on-prem and has limited scalability. The virtually unlimited scalability of a serverless architecture can easily overwhelm it.
So which one should be used, and when? SQS scales based on queue depth, i.e. the number of messages currently in the queue. Lambda functions with an Amazon SQS trigger can scale up by 60 additional instances per minute, to a maximum of 1,000 concurrent invocations (depending on your soft limit). So in a little under 17 minutes you can have 1,000 concurrent lambdas hammering your external system, and at some point they may knock it down.
Kinesis, on the other hand, scales by splitting shards. Each shard triggers one concurrent lambda, which processes a configurable batch. This means you can tune exactly how much data the external system is asked to process.

Why SNS in front of Kinesis and SQS?

SNS plays a key role here, fanning out messages to the priority queues. Routing is done via SNS message filtering: based on an attribute attached to the message, SNS notifies only the matching subscribers. This is a great way to implement such business logic with a scalable serverless AWS service.

Below is an example message sent to SNS. The payload sits under the "Message" attribute, and the filtering attribute sits under "MessageAttributes".

{
  "Type": "Notification",
  "MessageId": "a1b2c34d-567e-8f90-g1h2-i345j67klmn8",
  "TopicArn": "arn:aws:sns:us-east-2:123456789012:HighPrio",
  "Message": "{ \"key1\":\"value2\", \"key2\":\"value2\"}",
  "Timestamp": "2019-11-03T23:28:01.631Z",
  "SignatureVersion": "4",
  "Signature": "signature",
  "UnsubscribeURL": "unsubscribe-url",
  "MessageAttributes": {
    "priority": {
      "Type": "String",
      "Value": "MED"
    }
  }
}
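To make the filtering concrete, here is a tiny simulation of how an SNS filter policy matches a message's attributes. This mimics only the exact-string matching case for illustration; it is not the SNS implementation, which supports richer operators:

```python
def matches(filter_policy: dict, message_attributes: dict) -> bool:
    # A subscription matches when, for every key in its filter policy,
    # the message carries that attribute with one of the allowed values.
    for key, allowed_values in filter_policy.items():
        attr = message_attributes.get(key)
        if attr is None or attr["Value"] not in allowed_values:
            return False
    return True

attrs = {"priority": {"Type": "String", "Value": "MED"}}
print(matches({"priority": ["MED"]}, attrs))  # MED subscription receives it
print(matches({"priority": ["LOW"]}, attrs))  # LOW subscription does not
```

Each queue subscribes with its own single-value policy, so every message lands in exactly one priority queue.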

As I like CloudFormation a lot, I will describe what needs to be defined in the template. You need the topic where messages will be published, called "MsgPrioTopic"; then the queues for the "MED" and "LOW" priorities; and finally subscriptions of the queues to the topic with the corresponding filter policies (MED and LOW).

MsgPrioTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: "MsgPrio-topic"

MsgMedQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: !Sub "${AWS::StackName}-Msg-med-queue"

MsgLowQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: !Sub "${AWS::StackName}-Msg-low-queue"

MsgMedTopicSnsSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    Endpoint: !GetAtt MsgMedQueue.Arn
    Protocol: sqs
    RawMessageDelivery: true
    TopicArn: !Ref MsgPrioTopic
    FilterPolicy:
      priority:
        - MED

MsgLowTopicSnsSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    Endpoint: !GetAtt MsgLowQueue.Arn
    Protocol: sqs
    RawMessageDelivery: true
    TopicArn: !Ref MsgPrioTopic
    FilterPolicy:
      priority:
        - LOW

EventBridge has far more integrations with AWS services, so why SNS?

Well, EventBridge is a great service, and in most cases I would prefer it over SNS. However, EventBridge is not always a silver bullet compared with SNS. What do the AWS docs say about SNS compared to EventBridge?

“Amazon SNS is recommended when you want to build an application that reacts to high throughput or low latency messages published by other applications”

So, EventBridge's typical latency is around half a second, while SNS's is about 30 ms. If you need to keep latency down, this is a fact to consider.

Verdict

Quite some time ago, I was stunned by a Pirelli advertisement featuring Carl Lewis: "Power is nothing without control." At first I thought, what the heck… and then it made a lot of sense. That is basically what I had in mind while designing this solution.
Serverless is a great way to do a lot with minimal effort, with virtually unlimited scaling power to process data. But you also have to think about the consequences that power has on your dependencies: systems that cannot scale to those levels may be quickly overwhelmed and choked. Unlimited power can easily turn into an unlimited disaster.



Marin Radjenovic

Cloud Architect. Developer. 2x Father. 7x AWS certified. AWS Community Builder. AWS UG Montenegro founder 🇲🇪. Working for Crayon