A Practical Guide to Queues, Streams, Jobs, and Workflows

Gábor Farkas
12 min read · Sep 11, 2024


Navigating the landscape of queue, stream, job, and workflow systems can be overwhelming. Often, it’s difficult to understand how a framework truly operates, as key information gets buried beneath buzzwords like “scalable” and “resilient”. These terms can overshadow the core functionality, leaving you unsure about what the system actually does. The fact that some terms are used differently across systems adds to the confusion, making evaluation even more challenging.

In this post, I aim to provide a clear map of these systems — what features you should be looking for, as well as some important considerations that may not come to mind immediately. However, this post won’t serve as a comprehensive decision guide, as the specific nuances of each system heavily impact how well they fit your use case.

I’m organizing these systems into the following categories:

  • Queues
  • Streams
  • Background Job / Task systems
  • Workflows

This article will not cover data pipeline frameworks like Apache Flink and Spark, nor actor model systems such as Orleans or Akka. For workflows, I’ll focus specifically on technical workflow coordination solutions rather than business process management tools.

Queues

Queues are probably the most widely known and commonly used systems. In terms of functionality, a publisher (or producer) places a message into the queue, which is later consumed by subscribers (or consumers).

  • Queue systems typically do not parse or alter the message data; the message is treated as a byte array. However, messages usually have header or envelope information and may also include user-defined attributes.
  • Consuming a message “destroys” it: the message is removed from the queue. Messages are not meant to be durably persisted, although some systems have built-in persistence to recover from node failures.
  • Queue systems are usually standalone processes or clusters.

Message Consumption

Queues offer multiple ways to route and deliver messages to their intended targets.

The simplest model involves a single publisher, a single FIFO queue, and a single consumer. These can operate as three distinct processes in different locations.

Queues generally support multiple subscribers on the same queue. In this scenario, consumers compete for messages, with each message being processed by only one consumer. This setup is often used for load distribution, and such a group of consumers is referred to as a “consumer group” (Kafka) or “subscription” (GCP Pub/Sub). NATS, on the other hand, uses subjects instead of queues, and a competing group of consumers is called a “queue group.”
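As a minimal sketch of the competing-consumers pattern, here is an in-memory stand-in built on Python’s standard library (the queue and the consumer names are illustrative, not any broker’s API): each message is delivered to exactly one member of the group.

```python
import queue
import threading

# In-memory stand-in for a broker queue: each message goes
# to exactly one of the competing consumers.
work = queue.Queue()
received = {"A": [], "B": []}
lock = threading.Lock()

def consumer(name):
    while True:
        msg = work.get()
        if msg is None:  # shutdown sentinel
            break
        with lock:
            received[name].append(msg)
        work.task_done()

threads = [threading.Thread(target=consumer, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()

for i in range(10):
    work.put(f"msg-{i}")
work.join()         # wait until every message has been consumed
for _ in threads:
    work.put(None)  # one sentinel per consumer
for t in threads:
    t.join()
```

After this runs, every message has been processed exactly once, split across the two consumers; how the split falls depends on scheduling, which is exactly the load-distribution behavior a competing-consumer group gives you.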

Messaging systems can also support ‘fan-out’ to distribute copies of certain messages to multiple consumers or consumer groups. This is useful when multiple systems need to act on the same message.
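Fan-out semantics can be illustrated with a toy model (all names here are hypothetical, not a real broker API): publishing places the message in every bound group’s queue, so each group sees every message, while consumers within a group would still compete.

```python
from collections import defaultdict

class FanOut:
    """Toy fan-out: every bound consumer group receives each message."""
    def __init__(self):
        self.groups = defaultdict(list)

    def bind(self, group):
        self.groups[group]  # create the group's queue if missing

    def publish(self, msg):
        for q in self.groups.values():
            q.append(msg)   # every group gets the message

fan = FanOut()
fan.bind("billing")
fan.bind("analytics")
fan.publish({"event": "order.created", "id": 42})
```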

Message Routing

Many queue systems support some form of message routing. A common approach is for publishers to send messages to topics, subjects, or exchanges rather than directly to queues. Predefined routing rules then determine which queues receive a copy of the message, typically based on message attributes. In NATS, for example, messages are published to a subject (a string defined ad hoc), and subscribers set rules that match specific subjects to their subscriptions.

Most systems allow you to declare queues and routing rules at runtime through API calls, though some may require queues to be predefined. Even when runtime declaration is available, you might choose to limit queue configuration to a deploy-time step, providing more centralized control.

[Figure: routing key-based routing in RabbitMQ]

Filtering

Certain systems also allow you to define filtering rules to reduce the number of messages delivered to consumers. Some client libraries offer client-side message filtering, while others support server-side filtering. For instance, RabbitMQ enables server-side filtering based on a specified header value. Server-side filtering can save both bandwidth and CPU resources on the consumer side. NATS offers similar functionality by allowing subscriptions to match subjects with wildcards.
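Wildcard subject matching of the kind NATS uses can be sketched in a few lines. This is an illustrative reimplementation of the matching rules (`*` matches exactly one dot-separated token, `>` matches one or more trailing tokens), not NATS client code:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style matching: '*' matches exactly one token,
    '>' matches one or more trailing tokens."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            return len(s) > i  # '>' must match at least one token
        if i >= len(s):
            return False
        if tok != "*" and tok != s[i]:
            return False
    return len(p) == len(s)

assert subject_matches("orders.*.created", "orders.eu.created")
assert subject_matches("orders.>", "orders.eu.created.v2")
assert not subject_matches("orders.*", "orders.eu.created")
```

With server-side evaluation of rules like these, messages a consumer is not interested in never leave the broker.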

Message Queue or Pub-Sub?

The distinction between message queues and publish-subscribe (Pub/Sub) models is often cited, but I believe it’s more of a semantic difference than a functional one. The way you think about the integration in question might shape how you use these terms, and different systems often define them in varied ways.

For example, GCP Pub/Sub behaves more like a message queue where subscriptions function as consumer groups, as each subscription can only listen to one topic.

NATS, on the other hand, provides more flexible routing with its structured string subjects, allowing subscriptions to match multiple subjects using wildcard expressions.

RabbitMQ primarily functions as a message queue but also offers routing capabilities at the point of message ingestion, enabling you to use it with Pub/Sub semantics as well.

Pull or Push

Two commonly discussed consumption models are Pull and Push. In the Pull model, the consumer initiates the retrieval of new messages, while in the Push model, the server sends messages to consumers as they are received.

In systems like Google Cloud Tasks or Pub/Sub, “Push” means that the queue system can directly call an HTTP endpoint to deliver messages. In contrast, with RabbitMQ, NATS, or Redis Stream, the consumer must first establish a TCP connection, after which the server pushes new messages through those connections.

In the Pull model, the client establishes the connection and requests new messages when it is ready to process them. This model is used by systems like Apache Kafka. Many systems also support batched message retrieval, where the consumer fetches a larger set of messages at once to process them efficiently. These batches come with a timeout, and if processing isn’t acknowledged within that time, the messages are returned to the queue, allowing other consumers to process them.
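The lease-and-redeliver mechanics behind batched pulls can be sketched as follows (a toy in-memory model, not any system’s actual API): pulled messages become invisible for a timeout window, and unacknowledged ones return to the queue for other consumers.

```python
import time

class LeaseQueue:
    """Toy pull queue with a visibility timeout: fetched messages are
    leased, and unacknowledged leases expire back onto the queue."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.ready = []    # (id, body) pairs awaiting delivery
        self.leased = {}   # id -> (body, lease expiry)
        self._next_id = 0

    def put(self, body):
        self.ready.append((self._next_id, body))
        self._next_id += 1

    def _expire(self, now):
        for mid, (body, exp) in list(self.leased.items()):
            if exp <= now:  # lease ran out: make it deliverable again
                del self.leased[mid]
                self.ready.append((mid, body))

    def pull(self, max_messages, now=None):
        now = time.monotonic() if now is None else now
        self._expire(now)
        batch, self.ready = self.ready[:max_messages], self.ready[max_messages:]
        for mid, body in batch:
            self.leased[mid] = (body, now + self.timeout)
        return batch

    def ack(self, mid):
        self.leased.pop(mid, None)

q = LeaseQueue(timeout=30)
q.put("a"); q.put("b")
batch = q.pull(10, now=0.0)         # consumer fetches a batch of two
q.ack(batch[0][0])                  # acks only the first message
redelivered = q.pull(10, now=31.0)  # after the timeout, "b" comes back
```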

Consistency

Consistency in messaging systems can encompass several different aspects, and various systems may offer slightly different approaches to meet the same requirements.

Acknowledgements are used on both the producer and consumer sides. On the producer side, the messaging system can acknowledge that it has received and stored a message. On the consumer side, consumers acknowledge that they have received and processed each message.

If an acknowledgement isn’t received in time, the producer or the messaging system might redeliver the message. Systems can offer different delivery guarantees, such as at-least-once, at-most-once, or exactly-once delivery. It’s a complex topic, so it’s essential to understand distributed system concepts like timeouts, retries, circuit breaking, and idempotency. There are several excellent books on these subjects worth exploring.
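Because at-least-once delivery implies duplicates, consumers are commonly made idempotent. A minimal sketch, assuming the message id can serve as an idempotency key (a real system would persist the processed-id set rather than keep it in memory):

```python
processed = set()  # ids of messages already handled

def handle(message_id, body, side_effects):
    """At-least-once delivery means duplicates can arrive; an
    idempotency key makes reprocessing a no-op."""
    if message_id in processed:
        return  # duplicate delivery: skip the side effect
    side_effects.append(body)
    processed.add(message_id)

effects = []
handle("m1", "charge $10", effects)
handle("m1", "charge $10", effects)  # redelivered duplicate
assert effects == ["charge $10"]     # the side effect ran once
```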

You should also be aware of a system’s persistence and durability capabilities. What happens to messages if all consumers are temporarily disconnected? Are messages retained? Will retained messages survive a system restart or node failure?

Different delivery mechanisms may also affect message ordering. Depending on your use case, it could be crucial for messages to arrive in the exact order they were published.

While not directly related to functional consistency, it’s equally important to evaluate the system’s clustering and scalability capabilities, as well as its throughput, latency characteristics, and fault tolerance to ensure they meet your requirements.

Streams

Functionally, streams can be seen as an extension of queues: message consumption does not delete the messages. Instead, they remain available for an extended period of time, usually controlled by time-to-live (TTL) or cardinality limits, and can be reprocessed by consumers if needed.

Common examples include Apache Kafka, Redis Stream, and NATS JetStream.

When a consumer connects to a server, it can reprocess all available messages, choose to receive only new messages, or start processing from a specific offset. Consumers typically aim to process each message only once. Stream servers provide mechanisms for consumers to mark messages as processed, preventing them from being redelivered to that particular consumer (or consumer group). These systems offer various ways to handle acknowledgements, so it’s important to understand how they align with your consistency and performance requirements.
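The offset model can be sketched with a toy append-only log (illustrative, not any stream server’s API): messages survive consumption, each consumer advances its own offset, and a length limit trims old entries.

```python
class Stream:
    """Toy append-only log: consumption does not delete messages;
    each consumer tracks its own offset, and a max length trims old entries."""
    def __init__(self, maxlen=None):
        self.log = []
        self.offsets = {}  # consumer name -> next index to read
        self.maxlen = maxlen
        self.base = 0      # offset of log[0] after trimming

    def append(self, msg):
        self.log.append(msg)
        if self.maxlen is not None and len(self.log) > self.maxlen:
            dropped = len(self.log) - self.maxlen
            self.log = self.log[dropped:]
            self.base += dropped

    def read(self, consumer, start=None):
        pos = self.offsets.get(consumer, self.base if start is None else start)
        pos = max(pos, self.base)
        msgs = self.log[pos - self.base:]
        self.offsets[consumer] = pos + len(msgs)  # "ack": advance the offset
        return msgs

s = Stream(maxlen=100)
for i in range(3):
    s.append(f"e{i}")
first = s.read("billing")          # reads everything from the start
s.append("e3")
second = s.read("billing")         # only messages after the stored offset
replay = s.read("audit", start=0)  # a second consumer replays from offset 0
```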

Functionally, streams can be used in place of message queues. However, they generally require more memory and disk space to operate.

Deciding whether a queue or stream system is better suited to your use case can be complex and depends on the specific features and trade-offs of the systems being evaluated.

Background Job / Task systems

Job or Task systems can be viewed as Queues specialized for a common use case: instead of letting some (often external) system know about an event, we want to get something done in the background.

Functionally, Job/Task systems specialize message queues in the following ways:

  • Message bodies contain descriptions of the work to be done — often a serialized data structure that outlines a message type and its parameters, or instructions for the work dispatcher, such as “invoke method X on object Y with parameters Z.”
  • In most cases, the messages are consumed by the same system that produces them. The workers must be able to interpret the message in a way that aligns with the publisher’s intent. However, it is technically possible to create a system where the workers (consumers) are systems entirely separate from the producers.

Message storage and handling are usually configurable. Some systems, like River or Neoq, interact directly with a database, while others, such as Celery, support underlying messaging systems like RabbitMQ, Redis, or SQS.

The systems I’ve checked come as libraries that you can include in your own application.

Typically, each message will have a message type (usually identified by a string) and a set of parameters. When implementing a processing server, you register handler methods for each type of message. The library ensures that the appropriate handler is invoked based on the message type and also manages the concurrency of the processing.
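The registration-and-dispatch mechanism can be sketched like this (the decorator and message format are invented for illustration, loosely in the spirit of these libraries, not any one library’s API):

```python
import json

handlers = {}

def handler(msg_type):
    """Register a function as the handler for a given message type string."""
    def register(fn):
        handlers[msg_type] = fn
        return fn
    return register

@handler("send_email")
def send_email(params, outbox):
    outbox.append(f"email to {params['to']}")

def dispatch(raw, outbox):
    """Deserialize a job message and invoke the matching handler."""
    msg = json.loads(raw)  # {"type": ..., "params": ...}
    handlers[msg["type"]](msg["params"], outbox)

outbox = []
dispatch('{"type": "send_email", "params": {"to": "a@example.com"}}', outbox)
```

A real library adds what this sketch omits: pulling the raw messages from the configured backend, retry policies, and bounded worker concurrency.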

// Code Example from Neoq
nq.Enqueue(ctx, &jobs.Job{
    Queue: "greetings",
    Payload: map[string]any{
        "message": "hello world",
    },
})

//-- start a worker:
nq.Start(ctx, handler.New("greetings", func(ctx context.Context) (err error) {
    j, _ := jobs.FromContext(ctx)
    log.Println("got job id:", j.ID, "message:", j.Payload["message"])
    return
}))

Other systems might allow you to use certain classes or lambda expressions for publishing tasks. For example, JobRunr supports lambda expressions, but these are also serialized into a JSON representation in the background (and the lambda expression itself can only contain a method invocation):

BackgroundJob.enqueue(() ->
    emailService.sendEmail(
        customerEmail,
        "hello@jobrunr.io",
        "Happy you joined us!",
        "the email body..."));

These systems often provide dashboards, either as code that you can integrate into your own server or as standalone applications.

While these libraries typically abstract away many details discussed in the context of queues, it’s important to understand how your chosen library handles these aspects, such as message acknowledgements during publishing and consumption.

For instance, the River queue interacts directly with a database for message storage and allows you to share the same transaction. This feature enables you to enqueue background work items within the same transaction as modifications to your business entities. This can be beneficial for systems with a relatively low volume of background tasks. However, for systems with a high number of background tasks, storing these tasks in the same database as your business entities can lead to capacity issues and other operational challenges.

All these engines assume that your workers use compatible task type declarations. As a developer, you must consider task compatibility, versioning, and strategies for progressive rollouts and rollbacks if needed.

Workflows

Workflows extend the functionality of tasks by grouping units of work, allowing them to share context and establish dependencies between each other.

There are three major categories of workflow engines:

  • Declarative
  • Imperative
  • Durable Functions

The key functional distinction lies in whether your workflow tasks are predefined, configurable items (such as executing a container command or calling an external HTTP endpoint) or custom methods that you write and host within your own system.

Declarative engines

Declarative workflow engines involve configuring a workflow type by providing a declaration of workflow steps. This is often done using YAML or XML. Some systems allow you to create the workflow definition through imperative code using a client library, but the resulting description is still used as a static declaration.

# Example from Tork
---
name: sample each job
tasks:
  - name: hello task
    image: ubuntu:mantic
    run: echo start of job

  - name: sample each task
    each:
      list: "{{ sequence(1,5) }}"
      task:
        name: output task item
        var: eachTask{{item.index}}
        image: ubuntu:mantic
        env:
          ITEM: "{{item.value}}"
        run: echo -n $ITEM > $TORK_OUTPUT

  - name: sample each task with custom var
    each:
      list: "{{ sequence(1,5) }}"
      var: "myitem"
      task:
        name: output task item
        var: eachTask{{myitem.index}}
        image: ubuntu:mantic
        env:
          ITEM: "{{myitem.value}}"
        run: echo -n $ITEM > $TORK_OUTPUT

  - name: bye task
    image: ubuntu:mantic
    run: echo end of job

These systems typically offer a predefined set of workflow steps. For example, Tork allows you to execute containers at each step, while DolphinScheduler, Google Cloud Workflows, and AWS Step Functions provide a broad range of possible steps, such as making HTTP calls, writing to a database, sending emails, etc.

They may also include a visual designer. They operate as standalone deployments, managing the execution environments for the workflows.

Workflows in these systems generally maintain state (context variables) to share data between steps and support basic flow control features, such as loops and conditionals.
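A toy interpreter shows how a declarative step list, shared context variables, conditionals, and loops fit together (the step schema here is invented for illustration, not any engine’s format):

```python
def run_workflow(steps, context):
    """Minimal declarative runner: each step reads and writes the shared
    context; 'when' gates a step, 'each' loops it over a list."""
    for step in steps:
        cond = step.get("when")
        if cond is not None and not cond(context):
            continue  # conditional: skip this step
        for item in step.get("each", [None]):
            context = step["run"](context, item)
    return context

steps = [
    {"run": lambda ctx, _: {**ctx, "total": 0}},
    {"each": [1, 2, 3],
     "run": lambda ctx, item: {**ctx, "total": ctx["total"] + item}},
    {"when": lambda ctx: ctx["total"] > 5,
     "run": lambda ctx, _: {**ctx, "big": True}},
]
result = run_workflow(steps, {})
```

In a real engine the `run` bodies would be the predefined step types (run a container, call an HTTP endpoint), and the context would be persisted between steps rather than passed in memory.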

Imperative engines

Imperative workflow engines function similarly to Job/Task engines: you declare workflows as methods within your codebase. Examples include nFlow and Hatchet. These systems come with a standalone coordinator service, and you start workers, created using their SDKs, that connect to the coordinator to poll for work. Like Job/Task engines, they rely on a configurable database to maintain state and manage the work queue.

The example below is from Hatchet:

@hatchet.workflow(on_events=["question:create"])
class BasicRagWorkflow:
    @hatchet.step()
    def start(self, context: Context):
        return {
            "status": "starting...",
        }

    @hatchet.step()
    def load_docs(self, context: Context):
        # Load the relevant documents
        return {
            "status": "docs loaded",
            "docs": text_content,
        }

    @hatchet.step(parents=["load_docs"])
    def reason_docs(self, ctx: Context):
        docs = ctx.step_output("load_docs")["docs"]
        # Reason about the relevant docs
        return {
            "status": "writing a response",
            "research": research,
        }

    @hatchet.step(parents=["reason_docs"])
    def generate_response(self, ctx: Context):
        research = ctx.step_output("reason_docs")["research"]
        # Generate the response
        return {
            "status": "complete",
            "message": message,
        }

The term “Imperative” might not be the most fitting, but it is commonly used. Although all the systems I’ve seen use imperative, runtime code to configure workflows, they could technically also use JSON or YAML configurations at deployment time. The key functional aspect is that they invoke your methods directly in your code, similar to a Job/Task system.

Durable functions

Temporal, Dapr, and Inngest take a different approach. Instead of the workflow “states” being individual methods, you write an imperative method that declares both the individual steps and the dependencies between them:

import { inngest } from "./client";

export default inngest.createFunction(
  { id: "import-product-images" },
  { event: "shop/product.imported" },
  async ({ event, step }) => {
    const s3Urls = await step.run("copy-images-to-s3", async () => {
      return copyAllImagesToS3(event.data.imageURLs);
    });
    await step.run("resize-images", async () => {
      await resizer.bulk({ urls: s3Urls, quality: 0.9, maxWidth: 1024 });
    });
  }
);

These systems perform a sort of “magic” in managing execution. The workflow declaration method starts like a normal method execution, being loaded into the CPU and processed. When a workflow step is called, the execution engine may suspend the method itself. Upon completion of the workflow task, the execution resumes from where it left off. To achieve this, the workflow system re-invokes your workflow declaration method, and when a previously executed step is called again, it returns instantly, ensuring that the method resumes from the same point and state.

This approach requires that your method’s execution be deterministic and stable.
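The replay trick can be sketched in a few lines (a toy model of the mechanism, not any engine’s API): step results are persisted, and when the workflow method is re-invoked, recorded steps return instantly, so execution effectively resumes at the first step that has not yet run.

```python
history = []  # persisted results of completed steps

def step(name, fn, calls):
    """Run fn once; on replay, return the recorded result instantly."""
    for n, result in history:
        if n == name:
            return result  # already completed: replay from history
    calls.append(name)     # the side effect happens only on the first run
    result = fn()
    history.append((name, result))
    return result

def workflow(calls):
    # The workflow method must be deterministic: same steps, same order.
    urls = step("copy-images", lambda: ["s3://a", "s3://b"], calls)
    return step("resize", lambda: [u + "?w=1024" for u in urls], calls)

calls = []
first = workflow(calls)   # both steps actually execute
second = workflow(calls)  # replay: recorded results, no re-execution
assert first == second
assert calls == ["copy-images", "resize"]  # each step ran exactly once
```

This is also why determinism matters: if the replayed method took a different branch or called steps in a different order, the recorded history would no longer line up with the code.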

Both Temporal and Inngest expose workflows as HTTP endpoints, with an external workflow controller deployment managing these endpoints and the workflow state. These services also offer subscription-based cloud services for the workflow controller.

Versioning and compatibility

Similar to Job/Task systems, proper versioning of workflow declarations and workflow steps is essential and falls under your responsibility.

A unique challenge with workflow engines is handling version migration for existing workflows. Ensure you understand how the workflow engine manages these migrations. Additionally, consider the capabilities for mass workflow cancellations or suspensions for troubleshooting, and review the engine’s reporting, logging, and monitoring features.

Summary

I hope this article has provided a useful overview of the functionality offered by various queue, workflow, and Job/Task systems. Understanding these systems’ features and capabilities can help you make informed decisions based on your specific needs.
