AWS SQS Deep Dive


Joud W. Awad · Published in Metalab Tech · Aug 24, 2024


Introduction

AWS offers a vast array of services, many of which have become essential components in almost every internet application. In today’s post, we’ll focus on Amazon SQS (Simple Queue Service), exploring how it works, the different types of SQS, the architecture behind each type, and the best patterns for using them effectively.

Amazon SQS is a queue-based service that acts as middleware between a producer and consumers (workers). In this setup, a producer sends a message to the queue, and the consumer (worker) retrieves and processes the message. Once the message is successfully processed, it is deleted from the queue.

SQS Flow

Producer Flow

There are numerous producers that can publish messages to Amazon SQS. These producers can be broadly categorized into two groups: AWS services and those utilizing the AWS SQS SDK through various programming languages.

AWS Services

Several AWS services can directly publish messages to SQS. Some of the most notable include:

  • Amazon SNS
  • AWS Lambda
  • Amazon S3
  • AWS IoT
  • Amazon EventBridge
  • AWS Step Functions
  • AWS CodePipeline
  • Amazon Elastic Transcoder

These are just a few of the well-known services that can interact directly with SQS. You can also combine multiple services to create more complex workflows. For example, you could configure CloudWatch Alarms to send notifications to an SNS topic, which can then route those notifications to an SQS queue. AWS provides a variety of options to control your data flow and system architecture.

AWS SDK

The AWS SDK is a comprehensive collection of tools, libraries, and documentation that allows developers to interact with AWS services using their preferred programming languages. It simplifies the integration of AWS services into applications by providing pre-built functions and classes for tasks such as sending data to S3, managing EC2 instances, interacting with SQS queues, and more.

The AWS SDK supports multiple programming languages. Below is a simple example in Node.js demonstrating how to publish a message to an SQS queue using the AWS SDK:

import { SendMessageCommand, SQSClient } from "@aws-sdk/client-sqs";

const client = new SQSClient({});
const SQS_QUEUE_URL = "queue_url";

export const main = async (sqsQueueUrl = SQS_QUEUE_URL) => {
  const command = new SendMessageCommand({
    QueueUrl: sqsQueueUrl,
    MessageAttributes: {
      Version: {
        DataType: "Number",
        StringValue: "1",
      },
    },
    MessageBody:
      "Information about current NY Times fiction bestseller for week of 12/11/2016.",
  });

  const response = await client.send(command);
  return response;
};

Now that we have covered the various methods and services that can publish messages to SQS, let’s start exploring the SQS service itself.

SQS Types

Amazon SQS supports two types of queues: Standard Queues and FIFO Queues. To fully understand the trade-offs between these types, it’s important to examine how each one works and its underlying implementation. We will focus on the following aspects for each type:

  1. Architecture
  2. Message Duplication
  3. Message Ordering
  4. Throughput and Performance
  5. Concurrency
  6. Use Cases
  7. Limitations

SQS Standard Queue

According to the AWS documentation, SQS Standard Queues support at-least-once delivery, and message ordering is not preserved:

  • At-Least-Once Delivery — A message is delivered at least once, but occasionally more than one copy of a message is delivered.
  • Best-Effort Ordering — Occasionally, messages are delivered in an order different from which they were sent.

Architecture

To understand at a high level how the SQS Standard Queue works, let us examine this diagram:

SQS standard queue overview

A producer publishes a message to the queue, and multiple consumers compete to consume it. Once the message is processed by a consumer, it is removed from the queue.
The AWS SQS SDK provides all the functionality the consumer needs to handle the message and delete it.

SQS messages can be consumed by many services. The most notable consumer that integrates well with SQS is AWS Lambda. A Lambda function can be configured with an event source mapping, a Lambda resource that reads items from stream- and queue-based services and invokes the function with batches of records. This moves the complexity of polling out of your function code (polling loops are an anti-pattern in the serverless world).

Message Duplication

Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability as the following diagram shows:

Redundancy in SQS Standard Queue

On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message.

If this occurs, the copy of the message isn’t deleted on the server that is unavailable, and you might get that message copy again when you receive messages. Design your applications to be idempotent (they should not be affected adversely when processing the same message more than once).

Every time you receive a message from a queue, you receive a receipt handle for that message. This handle is associated with the action of receiving the message, not with the message itself. To delete the message or to change the message visibility, you must provide the receipt handle (not the message ID). Thus, you must always receive a message before you can delete it (you can’t put a message into the queue and then recall it).

If you receive a message more than once, each time you receive it, you get a different receipt handle. You must provide the most recently received receipt handle when you request to delete the message (otherwise, the message might not be deleted).

Message Ordering

Occasionally, messages are delivered in an order different from which they were sent.

From the previous diagram, we can see that message 4 arrives before message 3, even though message 3 was published first. Your consumers need to be aware of this limitation and handle cases where ordering is important.

For example, if your queue holds the status of an order in an e-commerce shop, your consumers need logic that handles the order status PAID being consumed before the order status PROCESSING. How you handle this depends entirely on how your system works and what logic each status requires. Alternatively, SQS provides FIFO queues as a solution to this problem.

Throughput and Performance

AWS claims that SQS Standard Queue supports Unlimited Throughput — Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).

Concurrency

Concurrency is an important piece of working with a Standard queue. You have to make sure that you select the right concurrency settings; otherwise, your queue will have many Lambda consumers fighting over its messages.

The following screenshot shows the settings that you can specify when selecting SQS as a trigger for a Lambda function.

SQS as a trigger for lambda

Let us go over these settings one by one

  • Batch Size: Specifies how many records are sent to the consumer per batch. For example, if you set this to 10, then every time the Lambda function polls the queue, it will receive and process up to 10 messages, if that many are available.
  • Batch Window: Triggers your consumer by a time window instead: you specify the maximum time to gather messages before they are delivered to the consumer. If the batch size is met first, the batch window is ignored.
    For example, with Batch Size = 10 and Batch Window = 5 seconds: if 10 or more messages arrive in under 5 seconds, your consumer receives them immediately; if only 3 messages arrive but the 5-second window elapses, your consumer receives those 3 messages as well.
  • Maximum Concurrency: A critical option that you have to set carefully, especially for a standard queue, as it affects how your messages are processed. Setting it too high may cause your consumers to compete over messages and also consume your account's regional Lambda concurrency limit.
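The interplay between Batch Size and Batch Window can be sketched as a small flush predicate. This is illustrative only, not the real internals of Lambda's event source mapping:

```javascript
// Decide whether a buffered batch should be delivered to the consumer:
// flush when the batch size is reached, or when the batch window has
// elapsed with at least one buffered message.
function shouldFlush(bufferedCount, secondsWaiting, batchSize, batchWindowSeconds) {
  if (bufferedCount === 0) return false;       // nothing to deliver yet
  if (bufferedCount >= batchSize) return true; // batch size met: window ignored
  return secondsWaiting >= batchWindowSeconds; // window expired
}
```

With Batch Size = 10 and Batch Window = 5, ten messages in one second flush immediately, while three messages flush only after five seconds.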

Use Cases

Standard queues are for sending data between applications when throughput is important. They are suitable for various scenarios, as long as your application can handle messages that might arrive more than once or out of order. Examples include:

  • Decoupling live user requests from intensive background work — Users can upload media while the system resizes or encodes it in the background.
  • Allocating tasks to multiple worker nodes — For example, handling a high volume of credit card validation requests.
  • Batching messages for future processing — Scheduling multiple entries to be added to a database at a later time.

Limitations

After we have seen various aspects of SQS Standard queue, it is time to understand some limitations around it:

  • Delay queue: The default (minimum) delay for a queue is 0 seconds. The maximum is 15 minutes.
  • Long polling wait time: The maximum long polling wait time is 20 seconds.
  • Messages per queue (backlog): The number of messages that an Amazon SQS queue can store is unlimited.
  • Messages per queue (in flight): For most standard queues (depending on queue traffic and message backlog), there can be a maximum of approximately 120,000 in flight messages (received from a queue by a consumer, but not yet deleted from the queue). If this quota is reached:
    - When using short polling, Amazon SQS returns the OverLimit error message.
    - When using long polling, Amazon SQS returns no error message.
    To avoid reaching the quota, delete messages from the queue after they’re processed. You can also increase the number of queues you use to process your messages. To request a quota increase, submit a support request.

SQS FIFO

A FIFO queue provides first-in-first-out delivery: message ordering is preserved.

  • Exactly-Once Processing — A message is delivered once and remains available until a consumer processes and deletes it. Duplicates aren’t introduced into the queue.
  • First-In-First-Out Delivery — The order in which messages are sent and received is strictly preserved.

Architecture

To understand the architecture of the FIFO queue, we provide the following diagram:

SQS FIFO queue overview

SQS stores FIFO queue data in partitions. A partition is an allocation of storage for a queue that is automatically replicated across multiple Availability Zones within an AWS Region. You don’t manage partitions. Instead, AWS SQS handles partition management.

For FIFO queues, Amazon SQS modifies the number of partitions in a queue in the following situations:

  • If the current request rate approaches or exceeds what the existing partitions can support, additional partitions are allocated until the queue reaches the regional quota.
  • If the current partitions have low utilization, the number of partitions may be reduced.

Partition management occurs automatically in the background and is transparent to your applications. Your queue and messages are available at all times.

Message Routing

When adding a message to a FIFO queue, Amazon SQS uses the value of each message’s message group ID as input to an internal hash function. The output value from the hash function determines which partition stores the message.

The above diagram shows a queue that spans multiple partitions. The queue’s message group ID is based on item number. Amazon SQS uses its hash function to determine where to store a new item; in this case, it’s based on the hash value of the field MessageGroupID. Note that the items are stored in the same order in which they are added to the queue. Each item's location is determined by the hash value of its message group ID.

Amazon SQS is optimized for uniform distribution of items across a FIFO queue’s partitions, regardless of the number of partitions. AWS recommends that you use message group IDs that can have a large number of distinct values.

Message Duplication

Unlike standard queues, FIFO queues don’t introduce duplicate messages. FIFO queues help you avoid sending duplicates to a queue. If you retry the SendMessage action within the 5-minute deduplication interval, SQS doesn't introduce any duplicates into the queue.

To configure deduplication, you have two options:

  • Enable content-based deduplication: This instructs Amazon SQS to use a SHA-256 hash of the message body (but not the message attributes) to generate the message deduplication ID. This means that messages with the same content will be treated as duplicates.
  • Explicitly provide the message deduplication ID: In this case, the producer assigns a deduplication ID to each message independently, and the FIFO queue uses that ID to detect duplicates.

In addition, you can go further in configuring your deduplication settings: SQS FIFO allows you to select one of two “deduplication scopes”:

  1. Queue level: Deduplication applies to all messages in the queue, regardless of which group ID a message belongs to. This option is only available when “high throughput FIFO” is disabled.
  2. Message group level: Deduplication applies per group ID, meaning that if the same message appears under different group IDs, the FIFO queue will not detect the duplicate for you. This is the default option for high throughput FIFO queues.
Duplication Scope Per FIFO Queue

Message Ordering

Message ordering is guaranteed in a FIFO queue; however, the guarantee is at the group ID level, not the queue level. As long as messages share the same group ID, they will be processed in the same order they were inserted into that group.

SQS FIFO achieves this by allowing only a single consumer per group ID at a time; we discuss this in more depth in the “Concurrency” section.

Throughput and Performance

SQS FIFO supports high throughput. If you use batching, FIFO queues support up to 3,000 messages per second, per API method (SendMessageBatch, ReceiveMessage, or DeleteMessageBatch). The 3,000 messages per second represent 300 API calls, each with a batch of 10 messages. To request a quota increase, submit a support request. Without batching, FIFO queues support up to 300 API calls per second, per API method (SendMessage, ReceiveMessage, or DeleteMessage).

Keep in mind that SQS allows you to specify throughput on two levels:

  1. Queue Level
  2. Message group Level: used for the “High throughput FIFO queue”
Throughput Limit Scope Of FIFO Queue

Concurrency

Concurrency with FIFO queue is a big topic that we will try to break down into small pieces to understand how a consumer (lambda) will work with the FIFO behavior.

By using a FIFO (First-In-First-Out) queue with Lambda, you can ensure ordered processing of messages within each message group. The Lambda function will not run multiple instances for the same message group simultaneously, thereby maintaining the order. However, it can scale up to handle multiple message groups in parallel, ensuring efficient processing of your queue’s workload. The following points describe the behavior of Lambda functions when processing messages from an Amazon SQS FIFO queue with respect to message group IDs:

  • Single instance per message group: At any point in time, only one Lambda instance will be processing messages from a specific message group ID. This ensures that messages within the same group are processed in order, maintaining the integrity of the FIFO sequence.
  • Concurrent processing of different groups: Lambda can concurrently process messages from different message group IDs using multiple instances. This means that while one instance of the Lambda function is handling messages from one message group ID, other instances can simultaneously handle messages from other message group IDs, leveraging the concurrency capabilities of Lambda to process multiple groups in parallel.

For example, suppose your FIFO queue receives messages with the same message group ID, and your Lambda function has a high concurrency limit (up to 1000).

If a message from group ID ‘A’ is being processed and another message from group ID ‘A’ arrives, the second message will not trigger a new Lambda instance until the first message is fully processed.

However, if messages from group IDs ‘A’ and ‘B’ arrive, both messages can be processed concurrently by separate Lambda instances.

Please refer to the following diagram for more information about concurrency.

Concurrency At FIFO Queue
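The behavior above can be sketched with per-group promise chains standing in for Lambda instances. This illustrates the semantics (in-order within a group, concurrent across groups), not Lambda's actual scheduler:

```javascript
// Process messages so that each group runs strictly in order (one at a
// time), while different groups can make progress concurrently.
async function consumeFifo(messages, handler) {
  const chains = new Map(); // groupId -> tail of that group's promise chain
  for (const msg of messages) {
    const tail = chains.get(msg.groupId) ?? Promise.resolve();
    // Appending to the group's chain serializes handling within the group.
    chains.set(msg.groupId, tail.then(() => handler(msg)));
  }
  await Promise.all(chains.values());
}
```

With messages A1, B1, A2, the sketch guarantees A1 completes before A2 starts, while B1 is free to run alongside either.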

Use Cases

Send data between applications when the order of events is important, for example:

  1. E-commerce order management system where order is critical
  2. Integrating with third-party systems where events need to be processed in order
  3. Processing user-entered inputs in the order entered
  4. Communications and networking — Sending and receiving data and information in the same order
  5. Computer systems — Making sure that user-entered commands are run in the right order
  6. Educational institutes — Preventing a student from enrolling in a course before registering for an account
  7. Online ticketing system — Where tickets are distributed on a first come first serve basis

Limitations

  • Delay queue: The default (minimum) delay for a queue is 0 seconds. The maximum is 15 minutes.
  • Long polling wait time: The maximum long polling wait time is 20 seconds.
  • Message groups: There is no quota on the number of message groups within a FIFO queue.
  • Messages per queue (backlog): The number of messages that an Amazon SQS queue can store is unlimited.
  • Messages per queue (in flight): For FIFO queues, there can be a maximum of 20,000 in flight messages (received from a queue by a consumer, but not yet deleted from the queue). If you reach this quota, Amazon SQS returns no error messages.

High throughput FIFO queue

At the end of the FIFO section, I want to highlight a very important feature of FIFO queues called “high throughput FIFO queues”. When this option is activated, SQS efficiently manages high message throughput while maintaining strict message order, ensuring reliability and scalability for applications processing numerous messages. This option is ideal for scenarios demanding both high throughput and ordered message delivery.

Amazon SQS high throughput FIFO queues are not necessary in scenarios where strict message ordering is not crucial and where the volume of incoming messages is relatively low or sporadic. For instance, if you have a small-scale application that processes infrequent or non-sequential messages, the added complexity and cost associated with high throughput FIFO queues may not be justified. Additionally, if your application does not require the enhanced throughput capabilities provided by high throughput FIFO queues, opting for a standard Amazon SQS queue might be more cost-effective and simpler to manage.

You can configure the “High throughput” option from the SQS console.

Configuring High throughput FIFO queue

Currently, there is no single published number for how much a FIFO queue can handle in high throughput mode, as it differs from one region to another; we recommend checking the AWS documentation for more information.

SQS Configuration

Let us recap where we are right now. We have covered the available publishers to SQS and the types of queues AWS offers; now we want to take a deep dive into the configuration shared by both queue types.

First, let us look at the default configuration available to SQS, shown in the following image.

SQS Configuration Section

Going through these configurations:

  • Visibility timeout: The length of time that a message received from a queue (by one consumer) will not be visible to other message consumers. For example, if you are processing a large video, you would set this value high enough for your consumer to finish processing the record before another consumer can receive the same message.
  • Delivery delay: If your consumers need additional time to process messages, you can delay each new message coming to the queue. The delivery delay is the amount of time to delay the first delivery of each message added to the queue. Any messages that you send to the queue remain invisible to consumers for the duration of the delay period. The default (minimum) delay for a queue is 0 seconds. The maximum is 15 minutes.

For standard queues, the per-queue delay setting is not retroactive — changing the setting doesn’t affect the delay of messages already in the queue.

For FIFO queues, the per-queue delay setting is retroactive — changing the setting affects the delay of messages already in the queue.

  • Receive message wait time: The receive message wait time is the maximum amount of time that polling will wait for messages to become available to receive. The minimum value is zero seconds and the maximum value is 20 seconds.
    Long polling helps reduce the cost of using Amazon SQS by reducing the number of empty responses (when there are no messages available for a ReceiveMessage request) and false empty responses (when messages are available but aren’t included in a response). If a receive request collects the maximum number of messages, it returns immediately. It does not wait for the polling to time out.
    If you set the receive message wait time to zero, the receive requests use short polling.
    Amazon SQS offers short and long polling options for receiving messages from a queue. Consider your application’s requirements for responsiveness and cost efficiency when choosing between these two polling options:

Short polling (default) — The ReceiveMessage request queries a subset of servers (based on a weighted random distribution) to find available messages and sends an immediate response, even if no messages are found.

Long polling — ReceiveMessage queries all servers for messages, sending a response once at least one message is available, up to the specified maximum. An empty response is sent only if the polling wait time expires. This option can reduce the number of empty responses and potentially lower costs.

  • Message retention period: The message retention period is the amount of time that Amazon SQS retains a message that does not get deleted. Amazon SQS automatically deletes messages that have been in a queue for more than the maximum message retention period.
  • Maximum message size: The maximum message size for your queue.
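Put together, these settings correspond to queue attributes you would pass when creating or updating a queue (via the SDK's CreateQueueCommand or SetQueueAttributesCommand, which take string-valued attribute maps). The numbers below are illustrative, not recommendations:

```javascript
// Sketch of the configuration discussed above as an SQS attribute map.
// All values are strings, per the SQS API; the numbers are illustrative.
const queueAttributes = {
  VisibilityTimeout: "900",            // 15 min: time a received message stays hidden
  DelaySeconds: "0",                   // no delivery delay (the default)
  ReceiveMessageWaitTimeSeconds: "20", // 20 s: enables long polling
  MessageRetentionPeriod: "345600",    // 4 days (the SQS default)
  MaximumMessageSize: "262144",        // 256 KB
};
```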

SQS Encryption

Encryption is a fundamental aspect of all AWS services. When discussing encryption, it is important to distinguish between two primary types:

  • Encryption at Rest: Data is encrypted where it is stored.
  • Encryption in Transit: Data is encrypted while it is being transmitted.

AWS SQS provides Encryption in Transit by default, requiring no additional configuration. For Encryption at Rest, however, Amazon SQS allows you to enable or disable this feature. It is important to note that enabling encryption at rest may incur additional costs.

If you choose to enable encryption at rest, AWS SQS offers two options:

  1. Amazon SQS-managed keys (SSE-SQS): In this case, AWS automatically creates and manages the encryption keys on your behalf, requiring no manual intervention.
  2. AWS Key Management Service (SSE-KMS): This option allows you to manage your own encryption keys through the AWS KMS service, offering greater control over the encryption process, including key rotation.

The image below provides a visual overview of these encryption options.

Encryption Options
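As a sketch, the two encryption-at-rest options map to queue attributes like the following. The KMS key alias is a placeholder, not a real key:

```javascript
// SSE-SQS: encryption keys created and managed by SQS itself.
const sseSqs = {
  SqsManagedSseEnabled: "true",
};

// SSE-KMS: bring your own key via AWS KMS (the alias here is assumed).
const sseKms = {
  KmsMasterKeyId: "alias/my-sqs-key",  // placeholder key alias
  KmsDataKeyReusePeriodSeconds: "300", // how long SQS may reuse a data key
};
```

A shorter data key reuse period improves security posture at the cost of more KMS API calls (and therefore cost), which is part of why SSE-KMS can be more expensive than SSE-SQS.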

SQS Dead-Letter queue (DLQ)

In this section, I want to highlight the Dead-Letter Queue (DLQ) pattern used with Amazon SQS, particularly in conjunction with redrive policies. To begin, let’s clarify what a DLQ is.

Dead-letter queues (DLQ) exist alongside regular message queues. They act as temporary storage for erroneous and failed messages. DLQs prevent the source queue from overflowing with unprocessed messages.

For example, consider a software that has a regular message queue and a DLQ. The software uses the regular queue to hold messages it plans to send to a destination. If the receiver fails to respond or process the sent messages, the software moves them to the dead-letter queue.

There are two potential causes when messages move to the DLQ pipeline: erroneous message content and changes in the receiver’s system.

Erroneous message content

A message will move to a DLQ if the transmitted message is erroneous. Hardware, software, and network conditions might corrupt the sent data. For example, hardware interference slightly changes some of the information during transmission. The unexpected data corruption could cause the receiver to reject or ignore the message.

Changes in the receiver’s system

A message might also move to a DLQ if the receiving software has gone through changes that the sender is not aware of. For example, you could attempt to update a customer’s information by sending a message for CUST_ID_005. However, the receiver could fail to process the incoming message because it’s removed the customer from the system’s database.

How does a dead-letter queue work?

For the most part, a dead-letter queue (DLQ) works like a regular message queue. It stores erroneous messages until you process them to investigate the reason for the error.

Next, we discuss the redrive policy for DLQs and how messages move in and out of DLQs.

Software moves messages to a dead-letter queue by referring to the redrive policy. The redrive policy consists of rules that determine when the software should move messages into the dead-letter queue. Mainly by defining the maximum retry count, the redrive policy regulates how the source queue and dead-letter queue interact with each other.

For example, if your developer sets the maximum retry count to one, the system moves all unsuccessful deliveries to the DLQ after a single attempt. Yet some failed deliveries are caused only by temporary network overload or software issues and would succeed on retry, so this sends many recoverable messages to the DLQ. To get the right balance, developers tune the maximum retry count to ensure the software performs enough retries before moving messages to the DLQ.

DLQ Flow
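In SQS, the redrive policy is configured as a JSON string on the source queue (the RedrivePolicy queue attribute). Here is a sketch with a placeholder DLQ ARN:

```javascript
// Sketch of a redrive policy: after maxReceiveCount failed receives,
// SQS moves the message to the dead-letter queue. The ARN below is a
// placeholder, and the attribute value must be a JSON string.
const redrivePolicy = JSON.stringify({
  deadLetterTargetArn: "arn:aws:sqs:us-east-1:123456789012:my-dlq",
  maxReceiveCount: 5, // enough retries to ride out transient failures
});
```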

When should you use a dead-letter queue?

You can use a dead-letter queue (DLQ) if your system has the following issues.

Unordered queues

You can take advantage of DLQs when your applications don’t depend on ordering. While DLQs help you troubleshoot incorrect message transmission operations, you should continue to monitor your queues and resend failed messages.

FIFO queues

Message ordering is important in first-in, first-out (FIFO) queues. Every message must be processed before delivering the next message. You can use dead-letter queues with FIFO queues, but your DLQ implementation should be FIFO as well.

When should you NOT use a dead-letter queue?

You shouldn’t use a dead-letter queue (DLQ) with unordered queues when you want to be able to keep retrying the transmission of a message indefinitely. For example, don’t use a dead-letter queue if your program must wait for a dependent process to become active or available.

Similarly, you shouldn’t use a dead-letter queue with a first-in, first-out (FIFO) queue if you don’t want to break the exact order of messages or operations. For example, don’t use a dead-letter queue with instructions in an edit decision list (EDL) for a video editing suite. In this instance, by changing the order of edits, you change the context of subsequent edits.

SQS Message

Messages are the main component of SQS. A producer publishes a message to the queue, and SQS retains the message until a consumer reads, processes, and deletes it. Let us go over the different components of a message.

As discussed before, there are two types of SQS queues, and each has a slightly different interface for sending messages, though they share most attributes. Let us explore an example message and then discuss the usage of each attribute:

{
  // Remove the DelaySeconds parameter and value for FIFO queues
  DelaySeconds: 10,
  MessageAttributes: {
    Title: {
      DataType: "String",
      StringValue: "The Whistler",
    },
    Author: {
      DataType: "String",
      StringValue: "John Grisham",
    },
    WeeksOn: {
      DataType: "Number",
      StringValue: "6",
    },
    Version: {
      DataType: "Number",
      StringValue: "1",
    },
  },
  MessageSystemAttributes: {
    AWSTraceHeader: {
      DataType: "String",
      StringValue: "Root=1-5e8f59f6-abc123def456",
    },
  },
  MessageBody:
    "Information about current NY Times fiction bestseller for week of 12/11/2016.",
  MessageDeduplicationId: "TheWhistler", // Required for FIFO queues
  MessageGroupId: "Group1", // Required for FIFO queues
  QueueUrl: "SQS_QUEUE_URL",
}
  • DelaySeconds: The length of time, in seconds, for which to delay a specific message.
  • MessageAttributes: Each message attribute consists of a Name, Type, and Value. These values can be read by the consumer and used to pass information; a common example is a “Version” key that represents the version of the message so that schema changes can take effect.
  • MessageSystemAttributes: The message system attributes to send for AWS services. Each message system attribute consists of a Name, Type, and Value. We talk more about this in the next section.
  • MessageBody: The message to send; must be a string value. If you are sending a JSON object, make sure to stringify it before sending.
  • MessageDeduplicationId: A unique identifier for the message being sent; used in FIFO queues only, in order to detect duplicated messages.
  • MessageGroupId: Used in FIFO queues only; this is the key used by the hash function to route the message to the right partition.
  • QueueUrl: The URL of the Amazon SQS queue to which a message is sent.

Message metadata

You can use message attributes to attach custom metadata to Amazon SQS messages for your applications. You can use message system attributes to store metadata for other AWS services, such as AWS X-Ray.

Amazon SQS message attributes

Amazon SQS lets you include structured metadata (such as timestamps, geospatial data, signatures, and identifiers) with messages using message attributes. Each message can have up to 10 attributes. Message attributes are optional and separate from the message body (however, they are sent alongside it). Your consumer can use message attributes to handle a message in a particular way without having to process the message body first.

Message System Attributes

Whereas you can use message attributes to attach custom metadata to Amazon SQS messages for your applications, you can use message system attributes to store metadata for other AWS services, such as AWS X-Ray.

Message system attributes are structured exactly like message attributes, with the following exceptions:

  • Currently, the only supported message system attribute is AWSTraceHeader. Its type must be String and its value must be a correctly formatted AWS X-Ray trace header string.
  • The size of a message system attribute doesn’t count towards the total size of a message.

All components of a message attribute are included in the 256 KB message size restriction.

The Name, Type, Value, and the message body must not be empty or null.

SQS Patterns & Best Practices

Managing large messages

As mentioned previously, the maximum message size is 256 KB. You can use the Amazon SQS Extended Client Library for Java or the Amazon SQS Extended Client Library for Python to send larger messages, from 256 KB up to 2 GB. Both libraries save the message payload to an Amazon Simple Storage Service bucket and send a reference to the stored Amazon S3 object to the Amazon SQS queue.

For other programming languages, you can achieve the same behavior by leveraging the AWS services directly instead of using these libraries.
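A rough sketch of that claim-check pattern in Python: if the payload exceeds the SQS limit, store it in S3 and send a small pointer instead. The bucket, key prefix, and pointer shape below are my own assumptions, not a standard format:

```python
# Sketch of the S3-pointer pattern for payloads over 256 KB, for languages
# without an Extended Client Library. Bucket and prefix are placeholders.
import json
import uuid

MAX_INLINE_BYTES = 256 * 1024  # the SQS message size limit

def make_message_body(payload: bytes, s3_client, bucket: str) -> str:
    """Return an SQS-safe message body: inline if small, else an S3 pointer."""
    if len(payload) <= MAX_INLINE_BYTES:
        return payload.decode("utf-8")
    key = f"sqs-payloads/{uuid.uuid4()}"
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
    # The consumer recognizes this pointer shape and fetches the real
    # payload from S3 before processing.
    return json.dumps({"s3Bucket": bucket, "s3Key": key})
```

The consumer side mirrors this: parse the body, and if it is a pointer, call `get_object` on the referenced bucket and key before handing the payload to the handler.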

Event-based approach for large messages

Error handling and problematic messages

To capture all messages that can’t be processed, and to collect accurate CloudWatch metrics, configure a dead-letter queue (covered in a dedicated section earlier in this post).

  • The redrive policy redirects messages to a dead-letter queue after the source queue fails to process a message a specified number of times.
  • Using a dead-letter queue decreases the number of messages and reduces the possibility of exposing you to poison pill messages (messages that are received but can’t be processed).
  • Including a poison pill message in a queue can distort the ApproximateAgeOfOldestMessage CloudWatch metric by giving an incorrect age of the poison pill message. Configuring a dead-letter queue helps avoid false alarms when using this metric.
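A redrive policy is attached to the source queue as a queue attribute. A minimal boto3-style sketch (the ARN, queue URL, and retry count are placeholders):

```python
# Sketch: building the RedrivePolicy queue attribute that routes messages
# to a dead-letter queue after a given number of failed receives.
import json

def redrive_policy(dlq_arn: str, max_receives: int = 5) -> dict:
    """Queue attributes that move a message to the DLQ after max_receives tries."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receives),
        })
    }

# To apply it (requires boto3 and AWS credentials):
#   sqs = boto3.client("sqs")
#   sqs.set_queue_attributes(
#       QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
#       Attributes=redrive_policy("arn:aws:sqs:us-east-1:123456789012:my-dlq"),
#   )
```

Choosing `maxReceiveCount` is a trade-off: too low and transient failures land in the DLQ; too high and a poison pill message is retried many times before being captured.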

Message processing and timing

Setting the visibility timeout depends on how long it takes your application to process and delete a message. For example, if your application requires 10 seconds to process a message and you set the visibility timeout to 15 minutes, you must wait for a relatively long time to attempt to process the message again if the previous processing attempt fails. Alternatively, if your application requires 10 seconds to process a message but you set the visibility timeout to only 2 seconds, a duplicate message is received by another consumer while the original consumer is still working on the message.

To make sure that there is sufficient time to process messages, use one of the following strategies:

  • If you know (or can reasonably estimate) how long it takes to process a message, extend the message’s visibility timeout to the maximum time it takes to process and delete the message. For more information, see Configuring the Visibility Timeout.
  • If you don’t know how long it takes to process a message, create a heartbeat for your consumer process: Specify the initial visibility timeout (for example, 2 minutes) and then — as long as your consumer still works on the message — keep extending the visibility timeout by 2 minutes every minute.

Consumer Flow

The final component of our system is the consumer flow. While we covered many related topics in the previous section (SQS Deep Dive), there are a few additional points worth discussing before we conclude this post.

Today, most applications use AWS Lambda as the primary method for consuming messages from SQS due to its serverless nature. However, it’s important to note that there are other ways to configure SQS consumers, particularly for scenarios where message processing may take longer than the 15-minute maximum runtime allowed by Lambda.

While Lambda simplifies the consumption of SQS messages through Event Source Mapping, alternative consumers like EC2, ECS, or any application that supports the AWS SQS SDK can be configured to consume messages from SQS. This approach is especially useful for processing tasks that require extended time, such as video processing or large data transformations.

In a recent project, I utilized a JavaScript library to poll data from SQS within an application running on EC2. This allowed for greater control and flexibility in handling long-running processes.

Another architecture worth mentioning is the use of SQS in conjunction with ECS Fargate. This setup enables you to consume messages from SQS at scale, providing a robust solution for applications that require extensive processing power.

ECS Fargate Architecture For Handling SQS

Consumer Message Deleting

It is crucial to note that once a consumer finishes processing a message, it must use the SDK to explicitly delete the message from the queue. Failing to do so can result in the message being processed again by another consumer or remaining in the queue until its retention period expires.
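Putting receive, process, and the explicit delete together, a minimal long-polling consumer loop (as might run on EC2 or ECS) could look like the sketch below; the queue URL, handler, and batch count are placeholders:

```python
# Sketch: a long-polling SQS consumer loop that explicitly deletes each
# message after successful processing. Names are illustrative.
def poll_queue(sqs_client, queue_url, handler, max_batches=1):
    """Receive, process, and delete messages; return the number handled."""
    handled = 0
    for _ in range(max_batches):
        resp = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling reduces empty responses
        )
        for msg in resp.get("Messages", []):
            handler(msg["Body"])
            # Without this explicit delete, the message reappears after the
            # visibility timeout and is processed again by another consumer.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            handled += 1
    return handled
```

In production this loop would typically run forever (or until a shutdown signal), wrap `handler` in error handling, and rely on the queue’s redrive policy to capture messages that keep failing.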

In-Flight Messages

Another important concept to understand is in-flight messages.

An Amazon SQS message has three basic states:

  1. Sent to a queue by a producer.
  2. Received from the queue by a consumer.
  3. Deleted from the queue.

A message is considered to be stored after it is sent to a queue by a producer, but not yet received from the queue by a consumer (that is, between states 1 and 2). There is no quota to the number of stored messages. A message is considered to be in flight after it is received from a queue by a consumer, but not yet deleted from the queue (that is, between states 2 and 3). There is a quota to the number of in-flight messages.

In-flight messages

For most standard queues (depending on queue traffic and message backlog), there can be a maximum of approximately 120,000 in-flight messages (received from a queue by a consumer, but not yet deleted from the queue). If you reach this quota while using short polling, Amazon SQS returns the OverLimit error message. If you use long polling, Amazon SQS returns no error messages. To avoid reaching the quota, you should delete messages from the queue after they're processed.

For FIFO queues, there can be a maximum of 20,000 in-flight messages (received from a queue by a consumer, but not yet deleted from the queue). If you reach this quota, Amazon SQS returns no error messages.

When working with FIFO queues, DeleteMessage operations will fail if the request is received outside of the visibility timeout window. If the visibility timeout is 0 seconds, the message must be deleted within the same millisecond it was sent, or it is considered abandoned. This can cause Amazon SQS to include duplicate messages in the same response to a ReceiveMessage operation if the MaxNumberOfMessages parameter is greater than 1.

Conclusion

I hope this post has provided you with a solid understanding of the key aspects of the AWS SQS service. While there is much more to explore, we have covered the essential topics that you need to be familiar with before using SQS. Remember, it’s crucial to regularly consult the official AWS documentation as your primary resource, as it is continuously updated with the latest information.

Follow Me For More Content

If you made it this far and want to receive more content like this, make sure to follow me on Medium and on LinkedIn.


Joud W. Awad
Metalab Tech

Experienced Software Engineer, Solutions Architect with 9+ years in backend/frontend development, mobile apps, and DevOps