Batching for cloud queues: spend 10x less

Yaroslav Tykhonchuk
Published in YouScan
May 16, 2024 · 6 min read

Cash flow consists of two parts: money in and money out. While increasing earnings is a priority, cutting expenses without compromising product quality is also important.

This story is about how we reduced our bill for Azure Queues operations by more than ten times, saving around $45k per year at our scale. That might not sound like much, but in Ukraine, where our technical team is based, it is enough to hire a strong developer.

From $140 a day to $14 a day

Our story aims to inspire companies to monitor their soaring cloud expenses. We’re excited to share our own journey and the Azure Batch Queue open-source library we developed, as we didn’t find any alternatives in the market.

Who can benefit?

The library is valuable for organizations with significant queue workloads, typically exceeding millions of daily operations. Anything less would not be noticeable on your cloud bill.

Azure Queues in particular are very cheap: 10,000 operations cost $0.0004.

So how did we end up paying so much for queues? In the social listening industry, YouScan collects and analyzes thousands of social media and news sources, which results in billions of queue operations daily.

Idea

Processing one message involves three billed operations: write, read, and complete.

Batching multiple items together and sending them in one message reduces the number of operations (and costs) proportionally.

So now, when you send a queue message with 50 items, you pay for just 3 operations instead of 150 (50 writes, 50 reads, and 50 completes).
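
To make that concrete: at $0.0004 per 10,000 operations, a million items processed one by one cost 3,000,000 operations, or $0.12. Batched in groups of 50, the same million items need 20,000 messages × 3 = 60,000 operations, or about $0.0024. That is a fiftyfold reduction in operations before any other optimizations.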

Requirements

The idea is clear. Let’s outline the key requirements for our batch queue:

  • Ensure no item is lost;
  • Minimize duplicate processing;
  • Support large messages;
  • Handle poison messages effectively;
  • Easy switch from Azure.Storage.Queues.

Implementation

Before diving into our batch queue's unique requirements, let’s first discuss two crucial aspects that apply to both batch and single-item queues: supporting large messages and effectively handling poison messages.

Support large messages

The limit for an Azure queue message is 64 KB for XML and 48 KB for Base64 encoding. If the item we send is larger than allowed, we offload the content of the message to Azure Blob Storage and send only a reference to the data through the queue. So it doesn’t matter whether the client sends one huge item or a batch with hundreds of items in a single message: we will deliver it nonetheless.
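
Here is a minimal sketch of the offload pattern, using the standard Azure.Storage.Queues and Azure.Storage.Blobs clients. The method name and the "blob-ref:" convention are hypothetical, not the library's actual implementation:

using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Queues;

const int MaxQueuePayloadBytes = 48 * 1024; // Base64-encoded message limit

async Task SendAsync(QueueClient queue, BlobContainerClient blobs, byte[] payload)
{
    if (payload.Length <= MaxQueuePayloadBytes)
    {
        await queue.SendMessageAsync(Convert.ToBase64String(payload));
        return;
    }

    // Too large for a queue message: store the payload in a blob
    // and enqueue only a reference to it.
    var blobName = Guid.NewGuid().ToString();
    await blobs.UploadBlobAsync(blobName, new BinaryData(payload));
    await queue.SendMessageAsync($"blob-ref:{blobName}"); // consumer resolves this reference
}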

Handle poison messages effectively

Message processing can fail for various reasons: downtime of an external service, a non-backward-compatible contract change, or corrupted message data. When that happens, we send the message back to the queue, because in most cases the problem is transient, such as a timeout in another service. But we don’t allow poison messages to degrade pipeline throughput with never-ending retries: if a message fails a configured number of times, it is sent to a separate quarantine queue. An engineer can later inspect the quarantined messages and, after resolving the underlying issue, send them back for processing.
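
A sketch of the quarantine logic, based on the message's dequeue count (illustrative; the actual library code and the retry limit are assumptions):

using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

const int MaxAttempts = 5; // assumed retry limit

async Task HandleFailedMessageAsync(QueueClient source, QueueClient quarantine, QueueMessage msg)
{
    if (msg.DequeueCount >= MaxAttempts)
    {
        // Poison message: move it to the quarantine queue for manual inspection.
        await quarantine.SendMessageAsync(msg.Body);
        await source.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
    }
    // Otherwise, do nothing: the message becomes visible again after the
    // visibility timeout and will be retried automatically.
}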

The complete pipeline of the message queue: send → receive → process → complete, with failing messages retried and quarantined after repeated failures.

Batch queue implementation

New challenges arise when sending a batch of multiple records in a single message. How do we let consumers complete every record individually? How do we minimize duplicate processing? And most importantly, how do we ensure that we deliver all the items? Let’s explore how we address these challenges.

Never lose a single item

In typical single-item message scenarios, Azure Queue ensures at-least-once delivery through message persistence and visibility timeout.

Message Persistence: Messages in Azure Queue are stored durably until explicitly deleted. This ensures that messages are not lost even during system failures or restarts.

Visibility Timeout: After dequeuing, a message is temporarily hidden from other consumers for a specified visibility timeout. If a consumer fails to process the message within this time, it reappears in the queue for another attempt.
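
With the plain Azure.Storage.Queues client, those two mechanisms show up like this (a minimal sketch; the connection string, queue name, and processing step are placeholders):

using System;
using Azure.Storage.Queues;

var queueClient = new QueueClient("<connection-string>", "items"); // placeholder values

// A dequeued message stays hidden for the visibility timeout and reappears
// if it is not deleted in time, which gives at-least-once delivery.
var messages = await queueClient.ReceiveMessagesAsync(
    maxMessages: 32, visibilityTimeout: TimeSpan.FromMinutes(5));

foreach (var msg in messages.Value)
{
    // ... process the message here ...

    // Deleting the message is the billed "complete" operation.
    await queueClient.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
}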

However, when dealing with batch queues, the challenge is to guarantee at-least-once delivery for every individual item within the batch. We achieve this by completing each item independently. Consumers do not even know that the batch exists. They receive multiple items and process them separately.

Send items to processing separately without any knowledge of the batch

Under the hood, we maintain an in-memory structure to track the status of every item, and the batch is considered processed only when all of its items have either completed or failed. Once all items in a batch are processed successfully, we delete the original message from the queue. However, if any items fail, we send an updated message containing only the failed items back to the queue for reprocessing. Even if the system crashes before committing processed items, nothing is lost: the message with all of its records simply returns to the queue after the visibility timeout.
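
Here is a minimal sketch of that tracking idea; the type and member names are hypothetical, not the library's actual internals:

using System;
using System.Collections.Concurrent;
using System.Linq;

enum ItemStatus { Pending, Completed, Failed }

class BatchTracker
{
    readonly ConcurrentDictionary<Guid, ItemStatus> _items = new();
    readonly Action _deleteMessage;          // deletes the original queue message
    readonly Action<Guid[]> _requeueFailed;  // re-sends a message with only the failed items

    public BatchTracker(Guid[] itemIds, Action deleteMessage, Action<Guid[]> requeueFailed)
    {
        foreach (var id in itemIds)
            _items[id] = ItemStatus.Pending;
        _deleteMessage = deleteMessage;
        _requeueFailed = requeueFailed;
    }

    public void Complete(Guid id) => Set(id, ItemStatus.Completed);
    public void Fail(Guid id) => Set(id, ItemStatus.Failed);

    void Set(Guid id, ItemStatus status)
    {
        _items[id] = status;

        // The batch is finalized only when no item is still pending
        // (real code would also guard against finalizing twice).
        if (_items.Values.All(s => s != ItemStatus.Pending))
        {
            var failed = _items.Where(p => p.Value == ItemStatus.Failed)
                               .Select(p => p.Key)
                               .ToArray();

            if (failed.Length == 0)
                _deleteMessage();       // everything succeeded: complete the message
            else
                _requeueFailed(failed); // return only the failed items for retry
        }
    }
}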

The simplified consumer looks like this in code. Notice again that it has no knowledge of the batch.

// Receive returns individual items; the consumer never sees the batch boundary.
var items = await batchQueue.Receive();

foreach (var item in items)
{
    await sendToPipeline(item);
}

Later in the pipeline, when we need to mark an item as completed, we do the following:

item.Complete();

The same for failed items:

item.Fail();

Minimize duplicate processing

So far, we have discussed two possible states that every batch item can reach: completed or failed. However, in data flow processing, items can also become stuck for various reasons, such as when an external service hangs or is experiencing issues. With retries in place, execution can take a long time, and the completion of the entire batch can be delayed significantly, impacting the performance of other tasks and the overall system. Moreover, when the message's visibility timeout expires, the whole batch returns to the queue and must be processed again.

For this reason, we set a timer that triggers just before the message's visibility timeout expires. When it fires, we return the message to the queue with only the failed and still-unprocessed items, so successfully completed items never need to be reprocessed.
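
A sketch of that timer, with hypothetical names (RequeueUnfinishedItems stands in for whatever flushes failed and still-pending items back to the queue):

using System;
using System.Threading;

TimeSpan visibilityTimeout = TimeSpan.FromMinutes(5); // assumed configuration
TimeSpan safetyMargin = TimeSpan.FromSeconds(10);

// Fires once, just before the original message would become visible again.
var flushTimer = new Timer(
    callback: _ => tracker.RequeueUnfinishedItems(), // failed + still-pending items only
    state: null,
    dueTime: visibilityTimeout - safetyMargin,
    period: Timeout.InfiniteTimeSpan);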

Return failed and unprocessed items back to the queue.

Efficiency in processing

So far, so good; we send a bunch of items in a single message, and we make sure they are never lost or stuck. Let’s see what else we can improve.

Compression

In our system, we send data to queues in JSON format, and with tens or hundreds of items in a single queue message, the duplication of field names and values is significant. That’s why BatchQueue ships with a GZip compression serializer, which helped us significantly reduce message sizes.
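
For illustration, a GZip-compressing serializer can be as simple as this sketch using System.Text.Json (the library ships its own implementation):

using System.IO;
using System.IO.Compression;
using System.Text.Json;

static byte[] SerializeCompressed<T>(T batch)
{
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        // Repeated JSON field names across batch items compress very well.
        JsonSerializer.Serialize(gzip, batch);
    }
    return output.ToArray(); // gzip is flushed and closed at this point
}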

Recyclable memory

The other thing that can greatly improve application performance lies in the land of memory and GC. Billions of messages need billions of serializations that require extensive memory allocation and deallocation. This can lead to frequent garbage collection and memory fragmentation.

Microsoft’s Recyclable Memory Stream library addresses this issue by implementing a memory stream that recycles memory blocks instead of allocating new ones. It maintains a pool of memory blocks of varying sizes, and when a stream is disposed of, the memory block is returned to the pool for later reuse. This reduces the frequency of allocating new memory blocks, incurs far fewer gen 2 GCs, and reduces memory fragmentation.
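
Usage is essentially a drop-in replacement for new MemoryStream(). A minimal sketch, assuming JSON serialization as above (the manager must be a long-lived singleton for pooling to pay off):

using System.IO;
using System.Text.Json;
using Microsoft.IO;

class BatchSerializer
{
    // One manager per process; it owns the pooled memory blocks.
    static readonly RecyclableMemoryStreamManager Manager = new();

    public static byte[] Serialize<T>(T batch)
    {
        // GetStream rents pooled blocks instead of allocating a fresh buffer;
        // Dispose returns them to the pool.
        using MemoryStream stream = Manager.GetStream("batch-serialization");
        JsonSerializer.Serialize(stream, batch);
        return stream.ToArray(); // still copies; real code would stream the bytes out
    }
}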

Results

It’s crucial to continuously monitor infrastructure expenses and seize opportunities for optimization when they arise. Cost reduction projects offer clear goals and immediate measurability, making them highly rewarding endeavors.

We had a great time writing Azure Batch Queue and witnessed a tenfold decrease in queue costs. It has already transferred hundreds of billions of records and saved us some money. We hope more companies will start challenging their cloud expenses. Cheers!
