SQS Queue usage for decoupling, ordering and retry

Published in

heycar

9 min read · Oct 6, 2021

Introduction

A message queue is a powerful tool for asynchronous communication between services. Messages are sent to the queue by a Producer and stay in the queue until they are consumed and processed by a Consumer. Amazon SQS is a message queueing service from Amazon.

In this article we will cover:

How we improved our communication with 3rd party services by using a queue
What were the reasons to start using queues for our service
How queues can be used as a retry mechanism
How queues help in the decoupling of services

What will not be covered:

How to set up the queue itself; the Amazon docs describe that better than we could

In the end, we will share our conclusions on using queues, including the advantages and disadvantages they brought to the daily life of our developers.

Use case example and description without queue

There are many examples of how queues solve our problems. One good example is how our service interacts with a CRM system, because it covers all the problems we faced.

*Simple diagram of interaction with CRM system*

As you can see it’s a pretty simple use case:

User orders a vehicle on heycar
For each purchase, heycar checkout-service creates a CRM ticket to track order flow and help Customer Support team assist the user
During the user's journey through the vehicle purchase flow, we send requests to update the ticket in the CRM so that it displays the relevant status of the purchase

The technical part of it is pretty simple and common:

The user interacts with the heycar client app
The client app interacts with the backend service via a REST API
The backend service interacts with the REST API provided by the CRM system provider
The entire communication consists of synchronous HTTP calls

Original solution problems details

These are the issues we faced with the described solution, together with the ones we tried to predict and prevent.

The first two and most important issues which we faced were:

The CRM system API was often unavailable, and order creation failed whenever that happened.
The CRM system API response time was often too high and grew exponentially under high load. This was critical because the call to the CRM was synchronous and blocked the request-handling thread until a response was received.

Initially, we devised and analyzed three possible solutions to the issues we were having:

1. Change CRM system

This solution was rejected almost immediately because there was no guarantee that a new CRM provider would be more stable, and data migration could be quite expensive.

2. Spring Events

Spring Events look promising at first. But when we dived deeper into the topic, we discovered a few disadvantages that were quite crucial in our case.

Reprocessing and investigating failed events. When event processing fails, the event is usually lost, and there is no way for us to investigate or retry it. Of course, it's possible to implement some sort of failed-events listener and save the events to the DB, which would keep us from losing them. But that brings additional, unnecessary overhead to the service and increases the cost of support and maintenance.
The effect on service performance is hard to predict. Memory usage grows linearly with the number of events, and we would end up in a situation where vertical or horizontal scaling is the only possible solution, which means more expenses on service infrastructure.
Scalability in the cloud. By default the event publisher is blocked until the listeners have processed that particular event, which can cause scalability issues. Also, because Spring Events are part of the service itself, it's not possible to scale them separately from the service. If we need more computation power only for events, we have to add it to the whole service, whereas handling them in a standalone service would give us much more flexibility.

3. Introduce retry mechanism and create CRM ticket asynchronously

At first glance this solution looks good: with a simple Spring @Async annotation we wouldn't block the main execution thread, so even with a retry mechanism users would get a response on order creation pretty fast. Even if ticket creation fails after multiple retries, the order is still saved to the DB, and we can use that data to create the ticket.

Unfortunately, this solution has a bunch of disadvantages:

The retry mechanism creates additional, unnecessary load on our services
It's hard to track failed ticket creations
Someone has to check service logs and the DB manually and create tickets, which is not efficient
All scalability issues from the previous solution apply here as well
Using the @Async annotation, or any other kind of asynchronous function call, has issues with decoupling and can cause dependency madness among service components. In the end this can affect application context initialization and load time, code maintenance, and the complexity of implementing new features.

Example of potential issue

With such an approach, the application component that invokes CRMClient.createCrmTicket() requires additional dependencies on the components that provide order data, vehicle data, etc. (for example repository classes, the financing service web client, etc.).
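The coupling can be illustrated with a minimal sketch. Apart from CRMClient.createCrmTicket(), every name below (the repositories, the financing client, the ticket fields) is a hypothetical stand-in, not our actual code. The @Async annotation is only mentioned in a comment so that the snippet compiles without Spring.

```kotlin
// Hypothetical domain types, used only to illustrate the coupling.
data class Order(val id: String, val vehicleId: String)
data class Vehicle(val id: String, val model: String)
data class CrmTicket(val orderId: String, val vehicleModel: String, val financing: String)

// Hypothetical collaborators that provide the data a ticket needs.
interface OrderRepository { fun findById(id: String): Order }
interface VehicleRepository { fun findById(id: String): Vehicle }
interface FinancingClient { fun financingStatus(orderId: String): String }
interface CRMClient { fun createCrmTicket(ticket: CrmTicket) }

// The component that triggers ticket creation has to know about every data source.
class CrmTicketInvoker(
    private val orders: OrderRepository,
    private val vehicles: VehicleRepository,
    private val financing: FinancingClient,
    private val crmClient: CRMClient
) {
    // In a Spring service this method would carry @Async.
    fun createTicketFor(orderId: String) {
        val order = orders.findById(orderId)
        val vehicle = vehicles.findById(order.vehicleId)
        val ticket = CrmTicket(order.id, vehicle.model, financing.financingStatus(order.id))
        crmClient.createCrmTicket(ticket)
    }
}
```

Every new piece of data on the ticket adds one more constructor dependency to the invoking component, which is the kind of growth we wanted to avoid.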

Solution description with Queue

After some research and investigation, we figured out that by using queues we can solve the problems of our current solution and at the same time be protected from the potential issues of the alternative solutions described above. Below you can see the component diagram for the solution with a queue.

As you can see in the diagram, from now on the heycar checkout-service will not send a request to the CRM system immediately during order creation. Instead, it will produce a lightweight message with the id of the created order and send it to a standalone message broker, which takes care of message management.

At the same time, checkout-service plays the role of message consumer. Whenever a message appears in the queue, checkout-service consumes it asynchronously and, as a result of processing it, sends an HTTP request to the CRM system API. If message processing fails, the message is returned to the queue for the next retry. If processing is still not successful after the configured number of retries, the message is moved to the dead letter queue for investigation and manual retry.
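The redelivery flow described above can be sketched with a small in-memory model. This is not SQS and not our production code: InMemoryQueue, QueuedMessage and maxReceiveCount are hypothetical names, and the model simplifies the real behaviour (no visibility timeout, no polling), keeping only the retry-then-dead-letter logic.

```kotlin
import java.util.ArrayDeque

// A message carries only the order id plus a counter of delivery attempts.
data class QueuedMessage(val orderId: String, var receiveCount: Int = 0)

class InMemoryQueue(private val maxReceiveCount: Int) {
    private val pending = ArrayDeque<QueuedMessage>()
    val deadLetters = mutableListOf<QueuedMessage>()

    fun send(orderId: String) {
        pending.add(QueuedMessage(orderId))
    }

    // Delivers pending messages to the handler until the queue is empty.
    fun drain(handler: (String) -> Unit) {
        while (pending.isNotEmpty()) {
            val message = pending.poll()
            message.receiveCount++
            try {
                handler(message.orderId) // success: the message is gone for good
            } catch (e: Exception) {
                if (message.receiveCount >= maxReceiveCount) {
                    deadLetters.add(message) // retries exhausted: park it for investigation
                } else {
                    pending.add(message) // back to the queue for the next attempt
                }
            }
        }
    }
}
```

A handler that throws while the CRM is unavailable simply sees the same order id again later; the producer is never involved in the retry.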

What we gain from this approach:

The main execution thread is not blocked until the CRM system responds to the checkout-service request; all interaction with the CRM is asynchronous with respect to the main execution flow. This means no negative effect on response time for user requests, and computation resources are freed up sooner.
When the CRM system is not available or a request to it fails for any other reason, the message is returned to the queue and processed later. None of this creates additional load on checkout-service, because everything is managed by SQS
Dead letter queue: messages can't be lost, even after multiple failed retries, because they are stored in the DLQ for later manual processing.
If checkout-service is down for any reason, the messages waiting to be processed are not lost, as they would be with events, because they are stored in a standalone and independent queue.
Scalability: checkout-service and SQS can be scaled separately according to the needs of each service.
Fewer performance issues: checkout-service consumes and processes messages only when free resources are available.
Ordering: if message order is important, we can simply use a FIFO queue

Implementation details

Let’s consider some of the implementation details for a simple SpringBoot based web application.

First, we will start with the minimal application configuration required for messaging

1. Update the build.gradle file of your SpringBoot application with the following dependency (check here for up to date version)
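As a rough sketch, the dependency block might look like the following. The artifact coordinates and especially the version number are assumptions (a Spring Cloud AWS 2.3.x messaging starter); this call syntax is accepted by both the Groovy and the Kotlin Gradle DSL.

```kotlin
dependencies {
    // Spring Cloud AWS messaging starter: QueueMessagingTemplate, @SqsListener and the SQS client.
    // The version is an assumption; pick the release that matches your Spring Boot line.
    implementation("io.awspring.cloud:spring-cloud-starter-aws-messaging:2.3.2")
}
```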

These dependencies will provide a convenient way of interaction with AmazonSQS.

2. Now let’s update the application context with the configuration required for sending messages.

QueueMessagingTemplate: will be used later for sending messages
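A minimal sketch of such a configuration is shown below. The class name SqsProducerConfig is hypothetical, and the io.awspring.cloud package assumes Spring Cloud AWS 2.3.x (older releases ship the same class under org.springframework.cloud.aws).

```kotlin
import com.amazonaws.services.sqs.AmazonSQSAsync
import io.awspring.cloud.messaging.core.QueueMessagingTemplate
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

// proxyBeanMethods = false avoids CGLIB subclassing, so the class does not need to be open.
@Configuration(proxyBeanMethods = false)
class SqsProducerConfig {

    // The template wraps the SQS client and converts payloads into SQS messages.
    @Bean
    fun queueMessagingTemplate(amazonSqs: AmazonSQSAsync): QueueMessagingTemplate =
        QueueMessagingTemplate(amazonSqs)
}
```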

3. Now we will add the configuration required for message consuming and handling

AmazonSQSAsync: a simple client that provides asynchronous access to Amazon SQS

QueueMessageHandler: in the words of the documentation, it “Provides most of the logic required to discover handler methods at startup, find a matching handler method at runtime for a given message and invoke it.”

SimpleMessageListenerContainer: this allows us to create concurrent message consumers for the specified listeners

Also, we need to annotate the configuration class with the @EnableSqs annotation, which brings in the credentials provider, region provider, etc.
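One possible shape of the consumer configuration is sketched below, under several assumptions. It registers a SimpleMessageListenerContainerFactory rather than declaring the QueueMessageHandler and SimpleMessageListenerContainer beans by hand, on the understanding that @EnableSqs builds both from the factory. The class name, the aws.region property and the batch size of 10 are hypothetical, and the packages assume Spring Cloud AWS 2.3.x.

```kotlin
import com.amazonaws.services.sqs.AmazonSQSAsync
import com.amazonaws.services.sqs.AmazonSQSAsyncClientBuilder
import io.awspring.cloud.messaging.config.SimpleMessageListenerContainerFactory
import io.awspring.cloud.messaging.config.annotation.EnableSqs
import org.springframework.beans.factory.annotation.Value
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration(proxyBeanMethods = false)
@EnableSqs
class SqsConsumerConfig {

    // Asynchronous SQS client; credentials come from the default AWS provider chain.
    @Bean
    fun amazonSqs(@Value("\${aws.region}") region: String): AmazonSQSAsync =
        AmazonSQSAsyncClientBuilder.standard().withRegion(region).build()

    // Settings used for the listener container that polls the queue.
    @Bean
    fun simpleMessageListenerContainerFactory(amazonSqs: AmazonSQSAsync): SimpleMessageListenerContainerFactory =
        SimpleMessageListenerContainerFactory().apply {
            setAmazonSqs(amazonSqs)
            setMaxNumberOfMessages(10) // messages fetched per poll; SQS allows 1 to 10
        }
}
```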

4. AWS settings for the SQS

Let’s provide AWS settings for the SQS client

SpringBoot configuration part

application.yml part for configuration properties AwsSettings

Now that our application config is ready, let’s move to the next step.

Message Producer and Consumer implementation details

1. First, let’s add the QueueListener component, which is responsible for consuming messages

@SqsListener("\${aws.queueURL}", deletionPolicy = SqsMessageDeletionPolicy.ON_SUCCESS): this annotation subscribes our service as a message consumer to aws.queueURL. You can subscribe to more than one queue, as in this example: "\${aws.queueURL}", "\${aws.queueURL1}".

deletionPolicy: identifies when a message should be removed from the queue. In our case, if any exception is thrown during message processing, the message is returned to the queue for further attempts.

ApproximateReceiveCount: this header tracks how many times the message was consumed, that is, how many times the service tried to process it. In this case, if the value is greater than the configured value, we just drop the message from the queue, but the same value can be used to decide whether the message should be sent to the dead letter queue.

SentTimestamp: used to identify whether the message has expired.

message: the message payload itself

messageType: identifies the type of event the message was produced for and how the consumed message should be handled

messageConsumer: the service layer component responsible for executing the corresponding message handler based on the message type; in our case this is where the CRM client methods are called
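Putting the pieces above together, a listener could look roughly like this sketch. The MessageConsumer interface, the two limits and the messageType header name are assumptions made for illustration, and the io.awspring.cloud packages assume Spring Cloud AWS 2.3.x.

```kotlin
import io.awspring.cloud.messaging.listener.SqsMessageDeletionPolicy
import io.awspring.cloud.messaging.listener.annotation.SqsListener
import org.springframework.messaging.handler.annotation.Header
import org.springframework.stereotype.Component

// Hypothetical contract of the service layer component that dispatches by message type.
interface MessageConsumer {
    fun consume(messageType: String, message: String)
}

private const val MAX_RECEIVE_COUNT = 5            // hypothetical retry limit
private const val MAX_AGE_MILLIS = 60L * 60 * 1000 // hypothetical expiry: one hour

@Component
class QueueListener(private val messageConsumer: MessageConsumer) {

    @SqsListener("\${aws.queueURL}", deletionPolicy = SqsMessageDeletionPolicy.ON_SUCCESS)
    fun onMessage(
        message: String,
        @Header("messageType") messageType: String,
        @Header("ApproximateReceiveCount") receiveCount: String,
        @Header("SentTimestamp") sentTimestamp: String
    ) {
        // Returning normally counts as success, so ON_SUCCESS deletes the message.
        if (receiveCount.toInt() > MAX_RECEIVE_COUNT) return
        if (System.currentTimeMillis() - sentTimestamp.toLong() > MAX_AGE_MILLIS) return

        // Any exception thrown here leaves the message in the queue for another attempt.
        messageConsumer.consume(messageType, message)
    }
}
```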

2. Now let’s add the message Producer part

MessageService: a simple component that uses the messagingTemplate configured in the previous section to send messages to the required queue.

MessageProvider: a simple helper interface for better organization of the service architecture; it contains the message type and the message object itself

MessageProvider example:
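The following is a minimal sketch of how the producer side might fit together, not the original listing. The member names of MessageProvider, the TicketCreationMessage payload, the CREATE_CRM_TICKET value and the QueueSender stand-in are all assumptions. QueueSender mirrors the destination, payload and headers form of convertAndSend that QueueMessagingTemplate is assumed to offer, so the sketch compiles without Spring.

```kotlin
// Helper interface from the article; the member names are assumptions.
interface MessageProvider {
    val messageType: String
    val message: Any
}

// Lightweight payload that carries only the id of the created order.
data class TicketCreationMessage(val orderId: String)

class TicketCreationMessageProvider(orderId: String) : MessageProvider {
    override val messageType = "CREATE_CRM_TICKET" // hypothetical type value
    override val message: Any = TicketCreationMessage(orderId)
}

// Stand-in for the single messaging template operation the service needs.
fun interface QueueSender {
    fun convertAndSend(queue: String, payload: Any, headers: Map<String, Any>)
}

class MessageService(private val sender: QueueSender, private val queueUrl: String) {
    // The message type travels as a header so the listener can dispatch on it.
    fun send(provider: MessageProvider) {
        sender.convertAndSend(queueUrl, provider.message, mapOf("messageType" to provider.messageType))
    }
}
```

In a Spring application the QueueSender would simply delegate to the configured messaging template.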

That was the last part of the implementation, and now it’s ready to use.

To produce a message for a created order, simply add the following call to the order creation service method: messageService.send(TicketCreationMessageProvider(orderId))

Then that message will be consumed by QueueListener and processed by MessageConsumer

Possible improvements of described implementation

1. DeadLetterQueue

To avoid losing messages whose processing failed after the configured number of tries, we can simply send them to the dead letter queue by updating QueueListener with the following

Where messagesService.sendToDeadLetterQueue is the function that produces messages for the dead letter queue.
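The decision itself can be sketched in plain Kotlin as below. Only the sendToDeadLetterQueue name comes from the text above; its signature, the two interfaces and the RetryAwareHandler class are hypothetical, and receiveCount stands for the parsed ApproximateReceiveCount header.

```kotlin
// Hypothetical contracts; only the sendToDeadLetterQueue name comes from the article.
interface DeadLetterSender {
    fun sendToDeadLetterQueue(messageType: String, message: String)
}

interface MessageProcessor {
    fun consume(messageType: String, message: String)
}

class RetryAwareHandler(
    private val messagesService: DeadLetterSender,
    private val messageConsumer: MessageProcessor,
    private val maxReceiveCount: Int
) {
    fun handle(messageType: String, message: String, receiveCount: Int) {
        if (receiveCount > maxReceiveCount) {
            // Retries exhausted: forward to the DLQ and return normally, so that the
            // ON_SUCCESS deletion policy removes the message from the source queue.
            messagesService.sendToDeadLetterQueue(messageType, message)
            return
        }
        // An exception thrown here keeps the message in the queue for another attempt.
        messageConsumer.consume(messageType, message)
    }
}
```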

A DeadLetterQueue can be easily configured in Amazon SQS; unfortunately, infrastructure configuration is not a part of this article

2. Messages ordering: if message ordering is important, Amazon SQS supports FIFO queues

Conclusions

After implementing the described solution for CRM communication, the number of lost tickets decreased dramatically, and we no longer worry much about the availability of the CRM system, because we know that failed messages will be processed again and, in the worst case, will appear in the DeadLetterQueue.
So far we haven't had any performance issues with this approach either: our API response time doesn’t depend on the availability of the 3rd party service. In the chart below you can see the CRM system response time (purple line) compared with our API response time (blue line). The CRM response time is much higher than our API response time and has no effect on it, which is exactly what we wanted to achieve.

Last few words

We would recommend not bringing this approach into your implementation in the early stages of a project, when you need to prototype quickly and get to market quickly, because queue configuration, implementing each message listener, writing tests and developing in a local environment require a bit more effort than usual. Afterwards, when you have time for improvements, it solves a lot of issues and removes a lot of pain points.

Thanks to:

the Commerce team for its support on this development
Guillermo Campelo, Florentina Ardean, Graham Hunter for reviewing this article

Thank you for reading 😀

Paruir Dadaian