Adopting an Event-Driven Architecture: A comparative look at AWS messaging solutions — SQS

Mario Bittencourt
Published in SSENSE-TECH · 11 min read · Jun 25, 2021

At SSENSE, we have adopted microservices and have been leveraging an event-driven architecture for some time now. A key component of such architecture is the technology — or technologies — chosen to transport the messages generated so the microservices can communicate with each other.

If we consider AWS alone, there are several managed solutions that can be used. But are they all the same? The answer is obviously no, but choosing the best for your use case can be challenging.

This article is the first in a series that explores several of the messaging solutions available within AWS.

The structure will be as follows:

Part I — Simple Queue Service (SQS)

Part II — Simple Notification Service (SNS)

Part III — EventBridge

Part IV — Kinesis and comparing all the technologies

AWS supports managed versions of other solutions, such as RabbitMQ, ActiveMQ, and Kafka. I will not cover them because those are already message brokers and streaming services that exist outside the Amazon ecosystem with good documentation and well-established use cases.

Queue: Hello, Old Friend

Using a queue is perhaps the simplest form of decoupling between two applications, or two parts of the same application. A queue is a point-to-point connection in which one participant, the producer, adds messages to the queue, and another participant, the consumer, retrieves them one by one.

Figure 1. Point-to-Point communication using a queue.

Figure 1 illustrates this asynchronous communication: the producer does not depend on an acknowledgement from the consumer to continue operating, and the consumer can process messages at its own pace, or even stop, without impacting the producer.

The common expectation for a queue is that it operates in First In, First Out (FIFO) mode: messages are retrieved by the consumer in the same order they were added by the producer, oldest first.

Figure 2. Message ordering in a FIFO queue. The oldest messages are read first by the consumer.

Due to their point-to-point nature, queues are unidirectional. If you need to establish bidirectional communication between the two participants, you will require two queues, as seen in Figure 3.

Figure 3. Second queue establishing a reply channel.

With this basic definition in place, let’s see how SQS can help us.

Simple Queue Service

Basics

Amazon SQS, as the name indicates, provides a simple-to-use, powerful, and scalable queuing solution.

At its core, to send a message you have to understand just a few concepts:

  • Queue

The construct that holds the messages. When sending or receiving messages you specify the QueueUrl that is returned when the queue is created.

  • MessageBody

What you want to share from the producer with the consumer. SQS does not care what the content of the message is, nor does it provide any validation or schema for it.

That’s it. The code below illustrates a simple example.
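A minimal sketch of sending a message with the AWS SDK for JavaScript (v2); the queue URL and payload below are placeholders:

```javascript
// Hypothetical queue URL; use the one returned when your queue was created.
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders';

// SQS only accepts a string body, so structured data is serialized first.
const params = {
  QueueUrl: QUEUE_URL,
  MessageBody: JSON.stringify({ type: 'OrderPlaced', orderId: 1234 }),
};

// With the aws-sdk package and valid credentials, the send itself would be:
// const AWS = require('aws-sdk');
// const sqs = new AWS.SQS({ region: 'us-east-1' });
// sqs.sendMessage(params, (err, data) => console.log(err || data.MessageId));
```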

As shown in the previous example, complex structured data should be transformed into a string representation in order to be sent.

Once the message is acknowledged by AWS, you have the guarantee that it has been persisted and the consumer will be able to access it.

For the consumer to receive it you simply have to use the receiveMessage call, like the one in the snippet below.
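A minimal sketch of the receiving side with the same SDK; the queue URL is a placeholder:

```javascript
// Hypothetical queue URL; the request parameters are the interesting part.
const receiveParams = {
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders',
  MaxNumberOfMessages: 1, // up to 10 messages per call
  WaitTimeSeconds: 20,    // long polling, reduces empty responses
};

// With a real client, the receive would be:
// const sqs = new (require('aws-sdk')).SQS({ region: 'us-east-1' });
// sqs.receiveMessage(receiveParams, (err, data) => {
//   const message = (data.Messages || [])[0];
//   if (message) console.log(JSON.parse(message.Body));
// });
```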

If you execute the above code, you will see that calling it multiple times can return the same message. To avoid this behavior, we need to update the code.
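A sketch of the updated consumer: after processing, it deletes the message using the ReceiptHandle returned by receiveMessage. The client is passed in as a parameter, so this is an illustration rather than code tied to real credentials:

```javascript
// Receives one message, processes it, then deletes it so it is not redelivered.
// `sqs` is an AWS SDK v2 SQS client; `handler` is your processing function.
async function consumeOne(sqs, queueUrl, handler) {
  const data = await sqs
    .receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: 1 })
    .promise();
  const message = (data.Messages || [])[0];
  if (!message) return false; // queue was empty (or all messages hidden)

  await handler(JSON.parse(message.Body));

  // Without this call, the message becomes visible again after the
  // visibility timeout and would be received a second time.
  await sqs
    .deleteMessage({ QueueUrl: queueUrl, ReceiptHandle: message.ReceiptHandle })
    .promise();
  return true;
}
```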

The consumer has to delete the message from the queue to prevent the next receiveMessage call from accessing it.

To understand why, let’s look at the basic architecture of SQS and explore more details.

Basic Architecture

When a producer successfully sends a message, it gets persisted on multiple servers for redundancy.

Figure 4. When a message is successfully sent it is replicated in several different servers.

When requesting to receive a message, the consumer will receive it from one of those servers, and that starts a visibility timeout counter.

This counter, which by default is 30 seconds, is used to hide the message from other consumers. This gives the consumer time to process the message and to inform SQS that it no longer needs to hold it by calling deleteMessage.

Figure 5. Receiving a message hides it (for a while) but does not automatically delete it.

For those used to working with traditional message brokers such as RabbitMQ, this is equivalent to the acknowledgment of receiving a message. SQS, contrary to those brokers, does not have a way to automatically acknowledge the message, delegating this responsibility to the consumer.

The visibility timeout is very handy on many occasions, such as when you have multiple consumers running in parallel. Without it, you would run the risk of those concurrent consumers seeing the same message more than once.

Figure 6. Multiple consumers each receive a different message due to the visibility timeout.

However, be aware that you are not out of the woods when it comes to handling/preventing duplicate messages from being processed.

Figure 7. If the consumer fails to delete the message, it will become available and be consumed again.

Message Lifecycle

From the moment a message is persisted in a queue, it follows a life cycle that is influenced by how you configure the queue.

Delayed Delivery: Specifies a delay (up to 15 minutes) applied to a message’s visibility from the moment it is added, making it temporarily unavailable to any consumer that requests new messages from the queue.

Figure 8. A delayed delivery hides the message from all consumers from the moment it gets added to the queue.

Visibility Timeout: After a message has been delivered to a consumer, it becomes invisible to any other consumer until this time expires. It defaults to 30 seconds and can be set as high as 12 hours. This is similar to the Delayed Delivery option, but it only takes effect once the message is consumed.

Message Retention Period: The maximum duration a message can remain in a queue. The default value is 4 days, with a minimum of 1 minute and a maximum of 14 days. After this time expires, the message is deleted from the queue.

Maximum Receive Count: How many times a message can be delivered to a consumer before being removed from the queue. If there is a dead-letter queue configured, the message will be moved to it on the n+1 receive attempt.

Dead-letter Queue: Another queue that can be used to direct messages that fail to be processed because they exceeded the maximum receive count.
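The lifecycle settings above are all queue attributes. A minimal sketch of creating a queue that wires them together with the v2 JavaScript SDK; the queue name, values, and DLQ ARN are all placeholders:

```javascript
// All values below are illustrative; SQS attribute values are passed as strings.
const createParams = {
  QueueName: 'orders',
  Attributes: {
    DelaySeconds: '60',               // delayed delivery: hide new messages for 1 minute
    VisibilityTimeout: '120',         // give consumers 2 minutes before redelivery
    MessageRetentionPeriod: '345600', // keep messages for up to 4 days
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: 'arn:aws:sqs:us-east-1:123456789012:orders-dlq',
      maxReceiveCount: '3',           // move to the DLQ on the 4th receive
    }),
  },
};

// const sqs = new (require('aws-sdk')).SQS({ region: 'us-east-1' });
// sqs.createQueue(createParams, (err, data) => console.log(err || data.QueueUrl));
```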

This is valid for both the Standard and FIFO types of queues; we will cover the differences in the next section. See the full list here.

Standard

The default type of queue, known as Standard, offers high throughput but comes at the expense of features that may be considered sacred by those used to traditional message brokers:

  • Ordering guarantee

The Standard type does not guarantee strict ordering of messages. This means that it is possible, under load, for message n+1 to be received before message n.

If your application is sensitive to ordering, you will have to implement a solution internally, switch to the FIFO type, or choose a different messaging solution.

  • Exactly-once processing

As we saw in the basic architecture section, if there is a problem with the consumer and it fails to remove the message, or takes longer than the visibility timeout, you will have to deal with the same message being delivered again.

While no SQS queue type prevents this, there is another type of problem that is not solved by the Standard type, which is the lack of an idempotent producer.

Figure 9. A failure in the producer can lead to duplication of the message in the queue.

If your producer tries to send a message and the connection drops before it receives the reply from SQS, it will think the send failed, while the message was in fact sent and is available to consumers. In this case, the producer would try again, resulting in a duplicate message in the queue.

That is why Standard queues provide an at-least-once delivery guarantee. It is entirely up to your application code to handle duplication.

One feature only available in this type of queue is the capability to control the delivery delay on a per-message basis.
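As an illustration, a per-message delay is just an extra DelaySeconds parameter on an individual send; the queue URL and payload here are placeholders:

```javascript
// Hypothetical example: hide this one message for 5 minutes after sending.
const delayedParams = {
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders',
  MessageBody: JSON.stringify({ type: 'ReminderDue', orderId: 1234 }),
  DelaySeconds: 300, // per-message delay, up to 900 seconds
};

// sqs.sendMessage(delayedParams, ...) as in the earlier examples
```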

FIFO

FIFO queues are the solution when your application is sensitive to the ordering and duplication of messages associated with the Standard type. This comes, however, with a reduced throughput of 300 operations per second.

Before going into details of the exactly-once processing, I would like to clarify what throughput means in this context. Contrary to what you would expect, we do not have 300 messages per second, but instead 300 API calls per second. This means that SendMessage, ReceiveMessage, and DeleteMessage all compete with each other and count towards this limit. This should be factored in when deciding to use the FIFO type.

I will cover a way to improve the throughput for FIFO queues in the Increasing Throughput with FIFO Queues section.

The exactly-once processing guarantee works like this: when sending a message, you provide a value for the MessageDeduplicationId parameter or, if content-based deduplication is enabled on the queue, let SQS calculate one for you from a hash of the message body. This value is stored by the queue and compared on each new attempt to send a message. If it already exists, the message will be accepted by the queue but not persisted! This is true even if the first message has already been retrieved and deleted.

Figure 10. An idempotent producer with FIFO not duplicating the message.

While convenient, it is important to stress that this holds only for a 5-minute window. If the producer attempts to send the same message after this window, it is possible that the message will be delivered twice.
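As a sketch, a FIFO send with an explicit MessageDeduplicationId looks like the following; the queue URL and ids are placeholders, and MessageGroupId (covered later) is mandatory on FIFO queues:

```javascript
// Hypothetical FIFO send; note the queue name must end with `.fifo`.
const fifoParams = {
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo',
  MessageBody: JSON.stringify({ type: 'OrderPlaced', orderId: 1234 }),
  MessageGroupId: 'customer-42',        // mandatory on FIFO queues
  MessageDeduplicationId: 'order-1234', // retries within 5 minutes are dropped
};

// Sending fifoParams twice within 5 minutes persists only one copy:
// sqs.sendMessage(fifoParams).promise(); // persisted
// sqs.sendMessage(fifoParams).promise(); // accepted, but not persisted
```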

Increasing Throughput with FIFO Queues

Imagine you need strict ordering and exactly-once processing, and decide to use a FIFO queue. During the planning process, you realize the 300 API calls per second will present a problem, as you will produce more messages than this can accommodate.

A built-in solution for this limitation comes in the form of batch operations. SQS supports sending, receiving, and deleting messages in batches of up to 10. This can increase the throughput, at the expense of latency, to an upper limit of 3,000 transactions per second.
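A sketch of batching with sendMessageBatch; each entry carries its own body plus an Id that must be unique within the batch, and all names below are placeholders:

```javascript
// Build a batch of up to 10 entries; each Id must be unique within the batch.
const entries = [1234, 1235, 1236].map((orderId, index) => ({
  Id: String(index),
  MessageBody: JSON.stringify({ type: 'OrderPlaced', orderId }),
}));

const batchParams = {
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders',
  Entries: entries,
};

// One API call now carries three messages, so the same 300 calls/second
// budget can move up to 10x more messages:
// sqs.sendMessageBatch(batchParams).promise();
```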

To increase it even further, you can leverage the recently released support for high throughput FIFO queues. With this option enabled you can have up to 300 requests per second per message group. So if your application can choose a message group with enough cardinality, you can achieve a higher total throughput. The real capacity depends on how AWS partitions the queue, as seen here.

Reducing Latency with Competing Consumers and FIFO Queues

Imagine your application is sensitive to ordering, and the processing required for each message takes a long time. This would increase latency, as newer messages sent by the producer would wait a long time before being received by the consumer.

A simple solution would be to have multiple consumers, each receiving messages from the same queue.

Unfortunately, increasing the number of consumers alone would violate the ordering guarantee on the processing side. Anyone who has suffered from this probably knows the concept of consistent hashing. With Standard queues, you would have to pick the criteria to hash the messages (for example, same customer or same product) and assign one hash per queue.

Figure 11. Reducing latency with standard queues, consistent hashing, and competing consumers.

With FIFO queues, a similar feature is already built in for you in the form of the Message Group. Each message sent to a FIFO queue must have a MessageGroupId set. This value is used to guarantee, per message group, that ordering is respected when delivering messages to consumers. Therefore, if you select good criteria for the message group, you can scale the number of consumers and still have each one consume the messages from each group in the same order they were added.

Figure 12. Competing consumers with FIFO queues.

As seen in Figure 12, each consumer will handle a message group in the same order they were sent. To understand how the consumers get their batches, see this post.
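As a sketch of this pattern, a producer can derive the MessageGroupId from the customer, so per-customer ordering survives while different groups are processed in parallel; the helper function and all names here are hypothetical:

```javascript
// Hypothetical helper: derive the message group from the customer so that
// ordering is preserved per customer while consumers scale out.
function buildFifoSend(queueUrl, customerId, payload) {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(payload),
    MessageGroupId: `customer-${customerId}`,
    MessageDeduplicationId: `${customerId}-${payload.eventId}`,
  };
}

const queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo';
const p1 = buildFifoSend(queueUrl, 42, { eventId: 'e1' });
const p2 = buildFifoSend(queueUrl, 7, { eventId: 'e2' });
// p1 and p2 belong to different groups, so they can be delivered to different
// consumers in parallel without breaking per-group ordering.
```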

Sending more than 256KB

SQS limits the size of a message to 256KB. While this is enough for most applications, there are situations where we may need more.

A common solution is to store the actual content in a separate medium (S3, a database) and provide, as part of the message, a reference that allows the consumer to retrieve the actual content.

While the best medium may depend on your case and environment, I would recommend using:

  • S3

If the content is bigger than 16MB or the consumer is an application that belongs to the same domain as the producer.

In this case, you would pass the bucket and key used with S3 and the consumer will be able to access it normally.

  • Database + API

If the content is smaller than 16MB or the consumer is an application that does not belong to the same domain as the producer.

For content smaller than 400KB, DynamoDB would be a good option; DocumentDB if it is bigger than that.

You would expose this content via some read-only API and include its URL as part of the message. This way you can easily control access using any traditional API authentication/authorization scheme.

Figure 13. Using DynamoDB to store big payloads with an API to retrieve the information.

In all cases, it is recommended to set up some sort of expiration policy to avoid retaining the actual contents of the message for longer than needed. Each persistence medium mentioned offers a way to define a TTL that automatically evicts the content after expiration.
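The reference-passing approach above can be sketched as a “pointer” message; the bucket, key, and message shape are hypothetical:

```javascript
// Instead of the (large) payload, the message carries only a reference.
const pointerMessage = {
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders',
  MessageBody: JSON.stringify({
    type: 'OrderPlaced',
    payloadLocation: { bucket: 'orders-payloads', key: 'orders/1234.json' },
  }),
};

// Producer: s3.putObject({ Bucket, Key, Body }) first, then sqs.sendMessage(pointerMessage).
// Consumer: reads payloadLocation from the body and calls s3.getObject(...) to fetch it.
```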

What’s next

This demonstrates that there is already a lot you can achieve with SQS, even in more complex scenarios. At the same time, it is important to pay attention to the limitations associated with its features, such as the time-bound message deduplication. This will prevent hard-to-debug production issues that only manifest themselves under high load.

In the next article, I will present SNS, which can operate with SQS to provide even greater flexibility and functionality. Stay tuned!

Editorial reviews by Deanna Chow, Liela Touré & Pablo Martinez

Want to work with us? Click here to see all open positions at SSENSE!
