Scaling Applications Part 6 — Adding Resiliency in Messaging

Hughie Coles
5 min read · Jul 9, 2020



This is the 6th in a 7-part series on how to scale applications. The series will cover common scaling patterns, as well as their pros, cons, and caveats. *Note: This structure may change as the series goes along.

Setup

In the series so far, we’ve focused on software architectures that can help you scale your application.

This post is going to be slightly different. We won't be focusing on a way to structure your application. Instead, we'll be focusing on a tool that is very common in distributed systems and can help your application be more fault tolerant. That tool is called a queue.

Most developers are familiar with the Queue data structure. It is a First-In First-Out (FIFO) data structure in which items are added at the back and removed from the front. It ensures that items are removed in the same order that they are added.
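
As a quick refresher, here's that FIFO behavior with Python's built-in deque:

```python
from collections import deque

# A FIFO queue: items are added at the back and removed from the front.
messages = deque()

messages.append("first")
messages.append("second")
messages.append("third")

print(messages.popleft())  # "first": items come out in the order they went in
print(messages.popleft())  # "second"
```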

In this article, I’ll be covering uses for this data structure, as well as talking about applications called Message Queues that implement the same FIFO behavior, but which serve a larger purpose.

Problem

In a distributed system, applications can run on different machines (virtual or physical), or even in different processes on the same machine. This separation means we can't communicate between these parts via direct function calls. We need other methods of communication, such as pipes or network messages (TCP or HTTP). One thing all of these methods have in common is that they can fail.

Many common communication patterns are synchronous. If a service needs to talk to a remote service across a network, it sends an HTTP request, blocks while waiting for the response, and then moves on to its next task (usually processing that response). What happens if the remote service is down? What happens if it takes too long to respond and the call times out? What happens if there's simply an error on the receiving side?
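
To make those failure modes concrete, here's a sketch of a synchronous call using Python's requests library. The URL is a placeholder, not a real service:

```python
import requests

# A synchronous request/response call. The caller blocks until the remote
# service answers, and every failure mode surfaces as an exception.
try:
    response = requests.get("https://inventory.example.com/stock/42", timeout=2)
    response.raise_for_status()   # raises on 4xx/5xx from the receiving side
    data = response.json()        # happy path: process the response
except requests.exceptions.ConnectionError:
    print("remote service is down")
except requests.exceptions.Timeout:
    print("remote service took too long; the call timed out")
except requests.exceptions.HTTPError as err:
    print(f"error on the receiving side: {err}")
```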

Solution

The solution is to use queues as buffers on the write side (sender), the read side (receiver), or as an intermediary.

In the case of using a queue on the sending side, you use a kind of store-and-forward model. Instead of sending a message directly to the receiver, you queue it, then have your application or a background job pick messages off the queue and send them, re-queuing on failure. The advantage is that the process that would have sent the message can store it immediately and move on to its next task as if the message had already been sent, instead of blocking while the message is sent, processed, and responded to.
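
A minimal store-and-forward sketch, using an in-process Python queue (the endpoint is a placeholder):

```python
import queue
import threading
import requests

# The sending process enqueues and moves on; a background worker
# drains the queue and re-queues on failure.
outbox = queue.Queue()

def handle_user_action(payload):
    outbox.put(payload)  # store the message and return immediately

def forwarder():
    while True:
        message = outbox.get()
        try:
            requests.post("https://orders.example.com/orders",
                          json=message, timeout=5)
        except requests.exceptions.RequestException:
            outbox.put(message)  # send failed; re-queue for a later attempt

threading.Thread(target=forwarder, daemon=True).start()
```

Note that re-queuing at the back sacrifices strict ordering, and a real implementation would also cap retries or back off between attempts.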

In the case of queuing on the receiving side, you immediately queue the message upon receipt. A consuming process then processes the message, possibly sending a reply back to the original sender if required. This decouples the processing from the sender completely, and also acts as a sort of load balancer (which I'll touch on later).
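
A sketch of the receiving side, again with Python's in-process queue standing in for whatever storage you choose:

```python
import queue
import threading

# The transport layer only enqueues; a separate consumer thread
# does the actual processing.
inbox = queue.Queue()

def receive(message):
    """Called on message arrival; returns as soon as the message is stored."""
    inbox.put(message)

def consumer():
    while True:
        message = inbox.get()
        print(f"processing {message}")  # process; optionally reply to the sender
        inbox.task_done()

threading.Thread(target=consumer, daemon=True).start()
```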

Finally, for an intermediary, you would spin up a separate service or machine hosting a message queue to consume incoming messages and either push them to receivers or wait for receivers to poll for new messages. Examples of such services include RabbitMQ and Kafka. A message queue can distribute messages based on, among other things, information contained in the message or the sender's ID or location. This lets you distribute load evenly, or based on latency. An external queue also allows your messages to survive application failures, since the queue lives outside the applications and can be backed by disk storage.
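
As an illustration, here's roughly what publishing to RabbitMQ looks like with the pika client. The "work" queue name and localhost broker are assumptions for illustration:

```python
import pika

# Connect to a local RabbitMQ broker and declare a durable queue.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# durable=True asks the broker to persist the queue definition to disk.
channel.queue_declare(queue="work", durable=True)

channel.basic_publish(
    exchange="",
    routing_key="work",
    body=b"order #42",
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message too
)
connection.close()
```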

For more redundancy and increased control, these three techniques can be combined. For instance, you can queue on the sending side to protect against network failure, and also queue on the receiving side to protect against overloading the receiver, or to keep an error in processing the message from propagating back to the sender (consider the sender getting a 500 error).

Alternative Solutions

Retries

The only real alternative to queuing is to retry on failure, for example with a 3-try policy and exponential back-off. The problem is that if the failure is more than intermittent, you eventually have to let the attempt fail and move on, usually just logging the failure. This is fine if the message is sent from some sort of storage-backed context. However, if the message is the direct result of an action (e.g. a user click, an order submission, a payment), then you may lose the event entirely if you're doing things synchronously.
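
A sketch of that policy; `send` is a placeholder for whatever call actually delivers the message:

```python
import random
import time

# A 3-try policy with exponential back-off.
def send_with_retries(send, message, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return send(message)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the failure surface (and log it)
            # wait 0.5s, then 1s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```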

Database Storage

This is essentially the same technique with a different face. You store the message record with a timestamp, which preserves order and lets you fetch the oldest record first for processing. It also lets you use database technology that you're familiar with and are presumably already using.
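
A sketch of the idea with SQLite; any relational database works the same way:

```python
import sqlite3

# Using a database table as a queue: messages are stored with a timestamp,
# and the consumer always takes the oldest unprocessed row first.
db = sqlite3.connect("messages.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        body TEXT NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        processed INTEGER DEFAULT 0
    )
""")

def enqueue(body):
    db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
    db.commit()

def take_oldest():
    row = db.execute(
        "SELECT id, body FROM outbox WHERE processed = 0 "
        "ORDER BY created_at, id LIMIT 1"
    ).fetchone()
    if row is not None:
        db.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row[0],))
        db.commit()
    return row
```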

Pros

Queues are easy to get started with. They allow for a very natural way of thinking about the data being processed, like a pipeline. Generally, leaning into patterns we're already familiar with makes us more productive.

Queues have the added bonus of being able to balance load. If the receiver gets a spike in traffic, the messages can buffer in the queue, letting the receiver work through them at its own pace.

Cons

Although the queue itself is very natural to think about, you lose the synchronous flow that you, as a developer, are familiar with. Instead of waiting for a response and handling it on the spot, you're potentially sending the message after a delay, and you must process the response at an undetermined later time. You also have to store any context the call would have been made in, if that data is required to process the response.

Queuing also adds latency, and it is unsuitable if you need an immediate answer in order to proceed. For instance, in a payment system you may want confirmation that the payment succeeded before moving on to the shipping step. Whether this matters depends entirely on how you structure your application, as well as on your business requirements.

Summary

Queues are an essential tool in distributed systems, where network and other communication failures are common. They allow you to balance work, handle it asynchronously, and potentially have it survive application failures when the queue is backed by storage. This leads to a more robust, fault-tolerant application, which is increasingly important as you move toward distributed architectures such as SOA and microservices.


Hughie Coles

I’m an EM, full-stack developer, speaker, mentor, and writer. I blog about software development, software architecture, leadership, and culture. @hughiecoles