Hacking microservices performance using AWS SQS

Tushar Bansal
Tata 1mg Technology
Sep 11, 2020 · 5 min read

Introduction

At 1mg, a microservices architecture comes with a lot of benefits. But it also brings plenty of bottlenecks that affect performance, scalability, and reliability. And at the scale we are growing, we need to keep making our infrastructure better.

Problem Statement

Communication between multiple microservices can become a bottleneck in time-consuming tasks. If one microservice instance is under heavy load, it chokes the other microservice(s) waiting on it to complete a task. Those microservices are halted for that time, consuming resources unnecessarily and slowing down the whole infrastructure, which eventually affects the customer experience.

We faced exactly this problem in user communication. Whenever we had to communicate something to a user (via SMS, push, or email), we received cancelled errors that grew exponentially with our growth, mainly during periods with a high number of orders or while crons were running.

Solution

Solution: Well, one solution is always available: increase the number of instances and double the storage.
Problem: It comes with a cost that will gradually affect your business.

Solution: Use asynchronous programming.
Problem: It boosts the performance of your application, but it isn’t always reliable enough. (Though we do use it for other cases.)

So eventually we went with Amazon Simple Queue Service (AWS SQS). It comes with a lot of benefits, as it’s a fully managed, scalable message queuing service.

How SQS Works
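
In short, producers publish messages to a queue, consumers poll the queue, process each message, and then delete it; the queue buffers the load between the two sides so neither has to wait on the other. Here is a minimal sketch of that flow with boto3 (the region and queue URL are placeholders, not our actual setup):

```python
import boto3

# Placeholder region and queue URL, for illustration only.
sqs = boto3.client("sqs", region_name="ap-south-1")
queue_url = "https://sqs.ap-south-1.amazonaws.com/123456789012/my-queue"

# Producer side: publish a message to the queue.
sqs.send_message(QueueUrl=queue_url, MessageBody="Order Placed for order #1234")

# Consumer side: poll the queue, process the message, then delete it.
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for message in response.get("Messages", []):
    print("Processing:", message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```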

Benefits

  1. It’s very simple to decouple application components with AWS SQS so that they can run and fail independently.
  2. It makes our applications easily scalable.
  3. It’s easy to share sensitive data between microservices using server-side encryption.
  4. We can use different types of queues (FIFO and Standard) depending on the usage.

Bottlenecks

  1. The maximum message size is 256 KB. If you need more than that, you have to compress the payload or store it in S3 (up to 2 GB) and reference it from the SQS message (see the compression sketch after this list).
  2. With pay-per-use pricing and data transfer charges, overall cost can increase significantly at high scale, especially for large and bulk messages.
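
For illustration, here is a small sketch of the compression approach from point 1, assuming a JSON payload and a hypothetical bulk queue: gzip the body and base64-encode it so the message stays within the 256 KB limit.

```python
import base64
import gzip
import json

import boto3

sqs = boto3.client("sqs", region_name="ap-south-1")  # placeholder region
queue_url = "https://sqs.ap-south-1.amazonaws.com/123456789012/bulk-comms"  # placeholder

# An illustrative bulk payload that might otherwise approach the size limit.
payload = {"template": "daily_promo", "recipients": ["user-1", "user-2", "user-3"]}

# Compress and base64-encode the body before sending.
body = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8"))).decode("ascii")
sqs.send_message(QueueUrl=queue_url, MessageBody=body)

# The consumer reverses the encoding before processing.
original = json.loads(gzip.decompress(base64.b64decode(body)).decode("utf-8"))
```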

Types of AWS SQS

  • Standard Queues: These can be used in many scenarios as long as the order of messages doesn’t matter to you. For example, sending copies of invoices to customers.
Standard Queue Messaging
  • FIFO Queues: These are designed for tasks where the order of processing matters and the tasks are critical (see the sketch after this list).
FIFO Queue Messaging
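
To make the difference concrete, here is a hedged sketch using hypothetical queue URLs: a Standard queue needs only a message body, while a FIFO queue also takes a MessageGroupId (the ordering scope) and a MessageDeduplicationId (unless content-based deduplication is enabled).

```python
import boto3

sqs = boto3.client("sqs", region_name="ap-south-1")  # placeholder region

# Standard queue: high throughput, best-effort ordering.
standard_url = "https://sqs.ap-south-1.amazonaws.com/123456789012/invoice-copies"  # placeholder
sqs.send_message(QueueUrl=standard_url, MessageBody="Invoice copy for order #1234")

# FIFO queue: strict ordering within a message group, exactly-once processing.
fifo_url = "https://sqs.ap-south-1.amazonaws.com/123456789012/order-comms.fifo"  # placeholder
sqs.send_message(
    QueueUrl=fifo_url,
    MessageBody="Order Placed SMS for order #1234",
    MessageGroupId="order-1234",                 # messages for the same order stay in sequence
    MessageDeduplicationId="order-1234-placed",  # protects against duplicate sends
)
```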

Our Use Case

We use FIFO SQS for everyday communications, as we have to send them to users in a sequence. For example, a user should receive the ‘Order Placed’ SMS before the next ETA update or any other update regarding the order. SQS handles these easily without slowing the system down under the heavy load of sales and other daily promotions.

We also use Standard SQS for order tracking, ETA, and allocation, to sync updates with the user and the vendor POS, as the sequence of updates doesn’t matter here.

Planning, Improvements, and Implementation

For starters, we had to decide which communications should be sent through SQS. Why not all, you might ask? Priority communications like OTPs and payment reminders would have taken longer to reach users if they went through SQS first. And since they aren’t sent in bulk, they don’t have any significant impact on CPU or memory consumption anyway.

We already had different HTTP/TCP/Pub-Sub endpoints for communications in our service, and multiple other services already used them. Making changes in those services was not feasible, and replicating every logic change across multiple services would have been a tedious task. So we kept the existing endpoints and had them route each communication to SQS or to the direct path based on its priority.
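
Conceptually, the routing looks something like the sketch below. The names here (handle_communication, HIGH_PRIORITY, send_directly) are illustrative stand-ins, not our actual code:

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="ap-south-1")  # placeholder region
QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/communications"  # placeholder

HIGH_PRIORITY = {"otp", "payment_reminder"}  # illustrative set of priority communications


def send_directly(comm_type, payload):
    # Stand-in for the existing synchronous path (SMS/Push/Email provider call).
    print(f"Sending {comm_type} directly: {payload}")


def handle_communication(comm_type, payload):
    """Existing endpoint: route to the direct path or to SQS based on priority."""
    if comm_type in HIGH_PRIORITY:
        send_directly(comm_type, payload)
    else:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"type": comm_type, "payload": payload}),
        )
```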

There was a significant improvement in performance and far fewer cancelled errors. But there were still issues. The delay for some communications during the daily promotional cron was well over 15 minutes, and during that window we still saw significant cancelled errors. There were also other communications we should have sent through SQS but couldn’t, because the delay was more than 15 minutes most of the time.

So we analyzed the events again and calculated the maximum delay SQS would introduce, and the delay each event could tolerate, based on the volume, time, and type of communication. This time we created multiple queues based on priority: High (Direct), Medium (SQS A), and Low (SQS B). SQS B carries communications like daily promotions and vendor reports, where even a delay of 30 minutes wouldn’t harm our business. SQS A carries communications like order updates, where the maximum delay was around 5 minutes. And finally, high-priority communications stayed on the direct path and weren’t sent to SQS at all.

So finally, the same microservice that handles communication publishes to and subscribes from these queues. According to the priority, we also set a delay and the number of messages to be consumed in one go. And now, even during a sale or at peak time, our microservice handles communication easily.
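
A rough sketch of what that priority-wise consumption could look like; the tier names, queue URLs, wait times, and batch sizes below are assumptions for illustration, not our production values:

```python
import boto3

sqs = boto3.client("sqs", region_name="ap-south-1")  # placeholder region

# Illustrative tiers mapped to queues, long-poll waits, and batch sizes.
QUEUES = {
    "medium": {  # e.g. order updates (SQS A)
        "url": "https://sqs.ap-south-1.amazonaws.com/123456789012/comms-medium",
        "wait_seconds": 5,
        "batch_size": 10,
    },
    "low": {  # e.g. daily promotions, vendor reports (SQS B)
        "url": "https://sqs.ap-south-1.amazonaws.com/123456789012/comms-low",
        "wait_seconds": 20,
        "batch_size": 10,
    },
}


def process(body):
    # Stand-in for the actual communication handler.
    print("Sending communication:", body)


def drain(tier):
    cfg = QUEUES[tier]
    response = sqs.receive_message(
        QueueUrl=cfg["url"],
        MaxNumberOfMessages=cfg["batch_size"],  # how many messages to pull in one go
        WaitTimeSeconds=cfg["wait_seconds"],    # long polling
    )
    for message in response.get("Messages", []):
        process(message["Body"])
        sqs.delete_message(QueueUrl=cfg["url"], ReceiptHandle=message["ReceiptHandle"])
```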

Enough talking, let’s kick out the boredom and do some implementation.

Some important points during implementation

  1. Don’t delete the message from the queue before processing is complete. Otherwise, it will be lost if there’s an exception during processing.
  2. Set the visibility timeout according to your application’s processing time and use case.
  3. Use a Dead Letter Queue with FIFO AWS SQS, so messages don’t pile up if there is an issue with a message’s payload.
  4. Always use encryption for messages that contain sensitive data, so that it isn’t visible to any other user or to AWS (see the configuration sketch after this list).
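
Here is a hedged sketch of how such a queue could be configured with boto3, covering the visibility timeout, a dead letter queue, and server-side encryption; the queue names and values are illustrative:

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="ap-south-1")  # placeholder region

# Dead letter queue for messages that repeatedly fail processing.
dlq = sqs.create_queue(QueueName="comms-dlq.fifo", Attributes={"FifoQueue": "true"})
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main FIFO queue: visibility timeout sized to the processing time, server-side
# encryption enabled, and a redrive policy pointing at the DLQ.
sqs.create_queue(
    QueueName="comms.fifo",
    Attributes={
        "FifoQueue": "true",
        "VisibilityTimeout": "60",          # seconds; tune to your processing time
        "KmsMasterKeyId": "alias/aws/sqs",  # server-side encryption with the AWS-managed key
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```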

How we implemented it using Python

Things you’ll need

  1. Python 3
  2. Boto3
  3. AWS Account with Access Keys

Let’s Get Started

  • Open the AWS SQS Console and create a new queue with the default configuration.
  • Replace SQS_REGION, AWS_SECRET_ACCESS_KEY, and AWS_ACCESS_KEY_ID with the ones you created in Point 3 above.
  • My sample run program (sketched below).
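
A minimal sample program along those lines might look like this; the constants are placeholders you fill in from your own account, and the queue name is hypothetical:

```python
import boto3

# Fill these in with the values from your AWS account and the queue you created.
SQS_REGION = "ap-south-1"
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
QUEUE_NAME = "my-test-queue"  # the queue created in the console

sqs = boto3.client(
    "sqs",
    region_name=SQS_REGION,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

queue_url = sqs.get_queue_url(QueueName=QUEUE_NAME)["QueueUrl"]

# Send a test message.
sqs.send_message(QueueUrl=queue_url, MessageBody="Hello from SQS!")

# Receive it back (long polling for up to 10 seconds).
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for message in response.get("Messages", []):
    print("Received:", message["Body"])
    # Delete only after processing is complete (see point 1 above).
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```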

If you need to keep receiving messages constantly in an application, call the receive_message() method in an infinite loop, as in the sketch below.
Moreover, don’t forget to add exception handling and look at the additional parameters you can pass; you can find them in the AWS SQS docs.
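
For example, a long-polling worker loop with basic exception handling might look like this (the queue URL and region are placeholders):

```python
import time

import boto3
from botocore.exceptions import ClientError

sqs = boto3.client("sqs", region_name="ap-south-1")  # placeholder region
queue_url = "https://sqs.ap-south-1.amazonaws.com/123456789012/my-test-queue"  # placeholder

while True:
    try:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling keeps empty receives (and cost) down
        )
        for message in response.get("Messages", []):
            print("Processing:", message["Body"])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    except ClientError as exc:
        # Log and back off instead of crashing the worker.
        print("SQS error:", exc)
        time.sleep(5)
```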

Conclusion

Amazon SQS is one of the services that make our infrastructure better performing, more scalable, and more reliable, which is exactly what we need at the rate we are growing.

If you liked this blog, hit 👏 and share this article so other people can also learn from it.
