Building real-time data pipelines with Google Cloud Pub/Sub

Raju Ahmed Shetu
intelligentmachines
6 min read · Jan 26, 2020

Motivation

Whether we are building a distributed system or a real-time data pipeline for an application, a streaming platform is one of the first things we need. And when it comes to streaming platforms, Apache Kafka is the first name that comes to mind. But setting up Apache Kafka for production is no piece of cake. Is there a managed service for this? Fortunately, yes. Google Cloud Pub/Sub is a global messaging and event-ingestion platform that lets us do exactly that. Let's go over the basic vocabulary of a streaming platform and then implement one with Google Cloud Pub/Sub.

What is an event streaming platform?

Most of the data produced by our mobile apps and web platforms reaches our backend servers through HTTP/HTTPS requests. But when we build a distributed system, communication between services is hard to handle with plain HTTP requests because of their on-demand, request/response architecture.

If service A wants some data from services B and C, or needs to know the state of B and C, it has to make two HTTP requests to get it. At that moment B or C might be offline, in which case service A gets no data at all.

What if B and C instead produced an event whenever they received data, and service A subscribed to the events that B and C produce? Then A could update itself as soon as an event is fired by B or C.

So a streaming platform lets different sources produce events and lets other services or apps consume those events from it. It also maintains a queue for every service or app connected to it: if an event is produced while a consumer is down, the consumer will still receive that event when it comes back online, as long as the event is within the retention period.

Keywords to focus on

  • Topic: The name of the channel an event is fired on.
  • Producer: The component that actually fires an event. Each event carries a topic name and the data passed by the producer. In Pub/Sub, producers are known as publishers.
  • Consumer: Consumers act on events when they are fired. A consumer subscribes to a topic name, and whenever an event is fired on that topic, it receives the data associated with the event. In Pub/Sub, consumers are known as subscribers.
  • Consumer Group: When a topic carries high traffic, it is natural to deploy more consumers so that we can process the data faster. But if every consumer received every event, we would process the same data once per consumer. A consumer group ensures that when one consumer handles an event, the event is marked as done for that group, so no other consumer in the group acts on it. Thus each event is processed only once by each consumer group. In Pub/Sub, if more than one instance of a subscriber is live on the same subscription, Pub/Sub distributes chunks of the events across the instances so that no two instances receive duplicate events.
  • Subscription: To receive published events, a consumer/subscriber must create a subscription to a specific topic, so that when an event is fired on that topic, the instances attached to that subscription receive the event's data.
Courtesy: https://cloud.google.com/pubsub/docs/quickstart-py-mac
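To make the naming above concrete: Pub/Sub addresses every topic and subscription by a fully qualified resource name built from the project id. A tiny sketch (the project id, topic name and subscription name below are placeholders) shows the format:

```python
# Pub/Sub identifies topics and subscriptions by fully qualified
# resource names derived from the project id and a short name.
project_id = "your-project-id"          # placeholder project id
topic_name = "sample-call"              # topic used later in this article
subscription_name = "sample-call-sub"   # hypothetical subscription name

topic_path = "projects/{}/topics/{}".format(project_id, topic_name)
subscription_path = "projects/{}/subscriptions/{}".format(
    project_id, subscription_name
)

print(topic_path)         # projects/your-project-id/topics/sample-call
print(subscription_path)  # projects/your-project-id/subscriptions/sample-call-sub
```

The client library exposes helpers (topic_path, subscription_path) that build exactly these strings, so we rarely write them by hand.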

Let’s assume we have a Google Cloud project configured and ready to use.

Let’s do it !

  1. Create a service account with the Pub/Sub Admin role and download the service account key as a JSON file to our computer. The images follow:
Service Account Creation

2. Create a virtual environment, activate it and install google-cloud-pubsub with the following command:

$ virtualenv venv && source venv/bin/activate && pip install google-cloud-pubsub

3. Create a folder named pubsub and inside it make a package named publishers. Inside the publishers package create a file named pub.py. Here are the contents of the file:

pub.py

Here, lines 1-2 are the necessary imports, and lines 4-7 contain the project configuration: your-project-id should be the Google Cloud project id, and file_path_of_json the location of the JSON key from step 1 on our local computer. Lines 9-11 declare the publisher, project_path and topic_path. Lines 13-15 list all the topic names under the project path, and lines 17-19 create the topic if it is not already present. Lines 21-30 publish 10 messages with our publisher: each iteration creates a dict with the keys count and message holding a simple number and text, encodes it into a string, and publishes it as an event on the topic sample-call. Finally, upon finishing, it prints a simple message saying Published messages with custom attributes.
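Since pub.py appears above as an image, here is a rough sketch of what it might look like, reconstructed from the walkthrough (the exact line numbers won't match the screenshot; the project id, key path and the build_message helper are placeholders, and the calls target the 1.x google-cloud-pubsub client API installed in step 2):

```python
import json
import os

# Configuration -- replace with your own values from step 1.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "file_path_of_json"
project_id = "your-project-id"
topic_name = "sample-call"


def build_message(count):
    """Build one event payload: a dict with count and message, encoded as bytes."""
    payload = {"count": count, "message": "message number {}".format(count)}
    return json.dumps(payload).encode("utf-8")


def main():
    # Pub/Sub client imported here; requires `pip install google-cloud-pubsub`.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    project_path = publisher.project_path(project_id)
    topic_path = publisher.topic_path(project_id, topic_name)

    # Create the topic only if it is not already present under the project.
    existing = [topic.name for topic in publisher.list_topics(project_path)]
    if topic_path not in existing:
        publisher.create_topic(topic_path)

    # Publish 10 messages, waiting for each publish to complete.
    for count in range(1, 11):
        future = publisher.publish(topic_path, data=build_message(count))
        future.result()

    print("Published messages with custom attributes")


if __name__ == "__main__":
    main()
```

Note that publish() is asynchronous and returns a future; calling future.result() blocks until the message is actually accepted by Pub/Sub.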

4. And now here goes the subscriber’s code. Create a subscribers package under the pubsub directory and make a file named sub.py:

sub.py

Here, lines 1-2 and lines 4-7 have the same explanation as before. Please make sure you use the same topic name in both files, otherwise the subscriber will not work, as it won’t create a subscription to that topic_name. Lines 9-14 declare the subscriber, topic_path, project_path and subscription_path. Lines 16-18 gather the existing subscriptions, and lines 20-25 create our subscription if it is not among them. Lines 30-33 define our callback function: since the publisher sends dict-type data, we just extract it, print it and acknowledge the message. Lines 35-37 set up the subscription pull so that we can receive data from events. (There are two types of subscribers; more on that here.) And lastly, lines 42-45 wait indefinitely to receive results from the events.
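As with the publisher, sub.py is shown above as an image, so here is a rough sketch reconstructed from the walkthrough (line numbers won't match the screenshot; the subscription name and the decode_message helper are assumptions, again against the 1.x client API):

```python
import json
import os

# Configuration -- must use the same project and topic name as pub.py.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "file_path_of_json"
project_id = "your-project-id"
topic_name = "sample-call"
subscription_name = "sample-call-sub"  # hypothetical subscription name


def decode_message(data):
    """Decode the bytes payload published by pub.py back into a dict."""
    return json.loads(data.decode("utf-8"))


def callback(message):
    payload = decode_message(message.data)
    print("count={} message={}".format(payload["count"], payload["message"]))
    message.ack()  # acknowledge so Pub/Sub does not redeliver this event


def main():
    # Pub/Sub client imported here; requires `pip install google-cloud-pubsub`.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    project_path = subscriber.project_path(project_id)
    topic_path = subscriber.topic_path(project_id, topic_name)
    subscription_path = subscriber.subscription_path(project_id, subscription_name)

    # Create the subscription only if it does not already exist.
    existing = [s.name for s in subscriber.list_subscriptions(project_path)]
    if subscription_path not in existing:
        subscriber.create_subscription(subscription_path, topic_path)

    # Start the asynchronous pull and block indefinitely.
    future = subscriber.subscribe(subscription_path, callback=callback)
    try:
        future.result()
    except KeyboardInterrupt:
        future.cancel()


if __name__ == "__main__":
    main()
```

subscribe() runs the streaming pull on background threads and invokes callback once per delivered message; future.result() simply keeps the main thread alive.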

Moment of Truth

First, we will check whether our publisher and subscriber work at all. To do that, open two terminal windows and run publishers/pub.py and subscribers/sub.py separately. Here is my output:

Publisher output on left & subscriber output on right

Now let’s check whether the messages fan out across instances of the same subscription. To test that, close all the terminals and follow these instructions:

  1. Open a terminal window and run python subscribers/sub.py
  2. Open another terminal window and run again python subscribers/sub.py

At this stage you will have two instances of the same subscriber. Now open another terminal window and run python publishers/pub.py. Here is the output:

publisher’s output
first instance of subscriber’s output
second instance of subscriber’s output

Here we can see that Pub/Sub fans out the events across instances of the same subscription.

Last Words

So here goes the source code for this entire article. I have covered the basics of event streaming platforms; I will write more about them soon, inshallah!
