Utilizing Cloud Pub/Sub for Efficient Data Pipelines

A cost-effective solution for queuing messages in data pipelines on GCP

Alexander Goida
Nerd For Tech
Mar 27, 2024


We use Cloud Pub/Sub in our data pipelines to queue messages for processing. It provides an inexpensive, seamless integration with the GCP messaging system. These messages contain information about files for processing in the data lake and updates of data objects in the data warehouse.

Unlike some messaging solutions, Pub/Sub offers parallelization on a per-message basis, similar to a “Shared” subscription in Apache Pulsar. This article discusses creating Pub/Sub topics and subscribers, and some aspects of its usage for data pipelines.

How we use it

We use Pub/Sub to implement queues between the steps of our data pipelines. These pipelines have two major steps. The first step processes files that arrive at our data lake, while the second step processes data object updates, which are identified after processing new files.

To simplify the configuration of ingestion pipelines, we use one topic per integration pipeline. Every file that arrives at the data lake is republished to our topic, creating a queue of files. Our data lake is built on top of Google Cloud Storage, so we can subscribe functions to new-file events; these functions republish information about the files to another topic in a format that our jobs understand. Using a separate topic for incoming files allows us to reprocess existing files with our ETLs.

A simplified view of our data pipeline
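A minimal sketch of such a republishing function, assuming a 2nd-gen Cloud Function triggered by Cloud Storage object-finalize events; the project, topic, and function names are hypothetical, not our actual setup:

```python
import json

import functions_framework
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
TOPIC_PATH = publisher.topic_path("my-project", "incoming-files")


@functions_framework.cloud_event
def republish_new_file(cloud_event):
    """Triggered by a new object in the data lake bucket; republishes file info."""
    blob = cloud_event.data
    payload = {"bucket": blob["bucket"], "name": blob["name"]}
    publisher.publish(TOPIC_PATH, json.dumps(payload).encode("utf-8"))
```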

We have chosen to use a separate topic for every ETL that processes our files. Alternatively, this could be done with a single topic containing messages with different attributes, plus multiple subscriptions that filter on those attributes. We use that approach in the second step of our data pipelines for data object updates: our ETLs publish object updates to a dedicated topic, and each message carries attributes that identify the type of data inside it.

In both cases, messages do not contain the content we are processing. Instead, they carry brief information about the file name or object IDs for updates. Additionally, if the size of a message exceeds a certain limit (one megabyte), we compress it before sending.

This complexity is hidden inside our Python package that works with Pub/Sub. As a result, client code does not need to perform any special operations to send or receive data from topics. It simply sends a dictionary or data object inherited from Pydantic classes.
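The package internals are beyond the scope of this article, but a rough sketch of what such a publish helper might look like, assuming Pydantic v2, gzip compression, and a hypothetical 1 MB threshold:

```python
import gzip

from google.cloud import pubsub_v1
from pydantic import BaseModel

SIZE_LIMIT = 1_000_000  # assumed ~1 MB threshold before compressing


class ObjectUpdate(BaseModel):
    """Hypothetical payload model; the real models live in our internal package."""
    object_type: str
    ids: list[str]


def publish(project: str, topic: str, payload: ObjectUpdate) -> None:
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    data = payload.model_dump_json().encode("utf-8")  # Pydantic v2 serialization
    attributes = {"object_type": payload.object_type}
    if len(data) > SIZE_LIMIT:
        data = gzip.compress(data)
        attributes["encoding"] = "gzip"  # subscribers check this to decompress
    publisher.publish(topic_path, data, **attributes).result()
```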

What to know when using Pub/Sub

Service Types & Cost

There are two services: Pub/Sub and Pub/Sub Lite. The standard and Lite versions differ in many ways, one of which is the cost model. The standard version charges based on the volume of transmitted data, whereas with Pub/Sub Lite you purchase units to provision the expected throughput, irrespective of volume.

From here on, I will be talking about the standard version.

Receiving Messages

I suggest using streaming pull to receive messages from Pub/Sub (link). Our use case involves buffering messages in a queue and polling them periodically for processing, which fits well with this method.

Initially, we tried the synchronous pull method (link). However, it didn’t perform as expected: regardless of how many messages were queued or how we tuned the method parameters, it returned only a few messages per call.

One parameter, max_messages, supposedly sets the maximum number of messages returned per request. In reality, the returned count varied and often fell well short of that maximum.
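For reference, a bare-bones synchronous pull looks roughly like this with the Python client; the project and subscription names are placeholders:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 100},
        timeout=30,
    )
    # In practice, the number of received messages is often well below max_messages.
    ack_ids = [received.ack_id for received in response.received_messages]
    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```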

Our experiments showed that the streaming pull approach is more effective. Unlike synchronous polling, it yields more messages each call.

It’s important to note that a streaming pull process can receive a maximum of 1,000 messages at a time, a limit that can’t be changed.
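A minimal streaming pull sketch with the Python client, capping flow control at 1,000 messages to mirror that limit; the resource names and timeout are placeholders:

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Process the message payload here, then acknowledge it.
    print(f"Received {message.data!r}")
    message.ack()


# Cap outstanding messages at 1,000, matching the limit mentioned above.
flow_control = pubsub_v1.types.FlowControl(max_messages=1000)
streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)

with subscriber:
    try:
        streaming_pull_future.result(timeout=60)  # block for a polling window
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()  # wait for the shutdown to complete
```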

The Pub/Sub service manages message delivery in a smart way. I found a related thread on a community forum that might be of interest.

Extend the Ack Deadline of Messages

Ensure your messages have an appropriate acknowledgment deadline. If the deadline is set to one minute, but processing takes longer, all messages will be re-delivered before the process concludes. Hence, it’s crucial to understand your processing time and set the acknowledgment deadline longer than the longest job duration.

However, keep in mind that the deadline has an upper limit: it cannot exceed 10 minutes, which means a single job should not take longer than that before acknowledging its messages.
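The deadline can be set when the subscription is created. A sketch with the Python client, using the maximum allowed value of 600 seconds; project, topic, and subscription names are placeholders:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project, topic, and subscription names.
topic_path = subscriber.topic_path("my-project", "my-topic")
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "ack_deadline_seconds": 600,  # 10 minutes, the maximum allowed
        }
    )
```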

Subscriptions with filters

You don’t have to create a new topic for each object type. All messages can go into the same topic, with subscriptions filtering out what they need. The infrastructure takes care of the filtering, so your code doesn’t have to discard unwanted messages or return them to the queue.

To do this, add attributes to messages. Subscriptions use these to decide which messages they receive.

In our context, we use this for object updates. We publish messages containing lists of object IDs to the object-update topic; all messages have the same structure, and the object type is indicated by a message attribute. We then have many subscribers with different filters based on this attribute. For instance, one subscriber might read ‘currency’ messages, another ‘customers’ messages, and so on.
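A sketch of how such a filtered subscription could be created and fed with the Python client; the attribute name object_type and all resource names are illustrative, not our actual configuration:

```python
from google.cloud import pubsub_v1

PROJECT = "my-project"             # placeholder
TOPIC = "object-updates"           # placeholder
SUBSCRIPTION = "currency-updates"  # placeholder

# Create a subscription that only receives 'currency' messages.
subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(PROJECT, TOPIC)
subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "filter": 'attributes.object_type = "currency"',
        }
    )

# Publishers set the attribute that the filter matches on.
publisher = pubsub_v1.PublisherClient()
publisher.publish(
    publisher.topic_path(PROJECT, TOPIC),
    b'{"ids": ["cur-1", "cur-2"]}',
    object_type="currency",
).result()
```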

Balance between message size and processing time

For a pipeline with many processing steps, you need to balance data volume against processing time. When data piles up in a queue during spikes, a single batch can become quite large. Sending all of it in one message could take too long to process, possibly exceeding the allowed acknowledgment deadline and causing redelivery issues.

Instead of sending as much data as possible in one message, consider breaking the data into smaller chunks and sending multiple messages. This method lets each message carry less data for processing. Even if there are many messages, you can scale the system to handle more messages at the same time.

However, too many small messages can cause system overhead and management problems, which can reduce overall throughput.
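One possible way to implement this chunking, assuming messages that carry lists of object IDs; the chunk size and names are illustrative and should be tuned against your acknowledgment deadline:

```python
import json
from typing import Iterator

from google.cloud import pubsub_v1

CHUNK_SIZE = 500  # assumed batch size; tune it against your processing time


def chunked(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield fixed-size chunks from a list of object IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def publish_update_chunks(project: str, topic: str, object_type: str, ids: list[str]) -> None:
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    for chunk in chunked(ids, CHUNK_SIZE):
        data = json.dumps({"ids": chunk}).encode("utf-8")
        # Each message carries a small batch that can be processed quickly.
        publisher.publish(topic_path, data, object_type=object_type).result()
```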

Other Useful Things

Storing messages directly in Cloud Storage can be beneficial, especially for managing a dead-letter topic and setting up alerts on it. In that case, retention on the subscription can be minimal, since the data is persisted in Cloud Storage, which is also convenient for analysis.

It can also be useful to enable retention on the topic itself, in addition to the subscription. This feature comes in handy if you don’t have a subscription yet but still want to accumulate messages.

Lastly, a subscription can be of the push type. In this case, you specify an endpoint to which Pub/Sub delivers each message. This method can be useful for triggering event-driven actions.
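A sketch of creating such a push subscription with the Python client; the endpoint URL and resource names are placeholders:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder names and endpoint.
topic_path = subscriber.topic_path("my-project", "my-topic")
subscription_path = subscriber.subscription_path("my-project", "my-push-subscription")

push_config = pubsub_v1.types.PushConfig(push_endpoint="https://example.com/pubsub-handler")

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "push_config": push_config,
        }
    )
```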

Wrapping Up

Cloud Pub/Sub is a powerful and adaptable tool for managing data. It’s a cost-effective solution for organizing data pipelines. This service boasts numerous configuration options and features, too many to list in a single article. We view it as a cost-saving substitute for other components currently used for data transfer in our infrastructure. Of course, our use case is just one of many. If it resembles your scenario, feel free to reach out with any questions. I’m always interested in tackling new technical challenges.
