Building real-time data pipelines with Google Cloud Pub/Sub
Whenever we build a distributed system or a real-time data pipeline for an application, a streaming platform is the first thing we need. And when it comes to streaming platforms, Apache Kafka is the first name that comes to mind. But setting up Apache Kafka for production is no piece of cake. Is there a managed service for this? Fortunately, yes. Google Cloud Pub/Sub is a global messaging and event ingestion platform that lets us do exactly that. Let’s go over the basic vocabulary of a streaming platform and then implement one with Google Cloud Pub/Sub.
What is an event streaming platform?
Most of the data coming from our apps and web platforms reaches our backend servers through HTTP/HTTPS requests. In a distributed system, however, communication between services is hard to handle with HTTP requests because of the on-demand request/response architecture.
If service A wants some data from services B and C, or needs to know their state, it has to make two HTTP requests. At that moment B or C might be offline, in which case service A gets no data at all.
But what if B and C produced an event whenever they received data, and service A subscribed to the events that B and C produce? Then A could update itself as soon as an event is fired by B or C.
So a streaming platform enables different sources to produce events and lets other services or apps consume those events. It also maintains a queue for the services and apps connected to it: if an event is produced while a consumer is down, the consumer will still receive it when it comes back online, as long as the event is within the retention period.
Keywords to focus on
- Topic: The named channel to which an event is published.
- Producer: A producer is the component that actually fires an event. Each event carries a topic name and the data passed by the producer. In Pub/Sub, producers are known as publishers.
- Consumer: Consumers act on events when they are fired. A consumer generally subscribes to a topic name, so whenever an event is fired on that topic, it gets the data associated with the event. In Pub/Sub, consumers are known as subscribers.
- Consumer Group: When a topic has high traffic, it is natural to deploy more consumers so that we can process the data faster. But if all the consumers received the same event, they would all process the same data. A consumer group ensures that once one consumer in the group handles an event, it is marked as done for that group, so no other consumer in the same group acts on it. Thus each event is handled exactly once per consumer group. In Pub/Sub, if two or more instances of a subscriber share a subscription, Pub/Sub distributes the events among the instances so that no instance receives a duplicate.
- Subscription: To receive published events, a consumer/subscriber must create a subscription to a specific topic, so that when an event is fired, the instances holding that subscription get the event’s data.
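The keywords above can be illustrated with a tiny in-memory sketch. This is not Pub/Sub itself, just an illustration: the Broker class and its round-robin delivery are invented for this example.

```python
from collections import defaultdict


class Broker:
    """A toy broker: topics, consumer groups, one delivery per group."""

    def __init__(self):
        # topic -> consumer group -> list of subscriber callbacks
        self.groups = defaultdict(lambda: defaultdict(list))
        # round-robin position per (topic, group)
        self.cursor = defaultdict(int)

    def subscribe(self, topic, group, callback):
        self.groups[topic][group].append(callback)

    def publish(self, topic, data):
        # Every group sees every event, but only ONE member of each
        # group handles it (simple round-robin among the members).
        for group, members in self.groups[topic].items():
            i = self.cursor[(topic, group)] % len(members)
            self.cursor[(topic, group)] += 1
            members[i](data)


broker = Broker()
received_a, received_b = [], []
broker.subscribe("sample-call", "group-1", received_a.append)
broker.subscribe("sample-call", "group-1", received_b.append)
for n in range(4):
    broker.publish("sample-call", {"number": n})
print(received_a)  # [{'number': 0}, {'number': 2}]
print(received_b)  # [{'number': 1}, {'number': 3}]
```

The two members of group-1 split the four events between them, which is exactly the fan-out behaviour we will observe later with two instances of the same Pub/Sub subscription.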
Let’s assume we have a Google Cloud project configured and ready to use.
Let’s do it!
1. Create a service account with the Pub/Sub Admin role and download its key as a JSON file to our computer. The steps are shown in the images below:
2. Create a virtual environment, activate it, and install google-cloud-pubsub with the following command:
$ virtualenv venv && source venv/bin/activate && pip install google-cloud-pubsub
3. Create a folder named pubsub, and inside it create a package named publishers. Inside the publishers package, create a file named pub.py. Here are the contents of the file:
The first couple of lines are the necessary imports, followed by the project configuration: your-project-id should be the Google Cloud project id, and file_path_of_json is the location of the JSON key from step 1 on our local computer. We then declare the publisher client, the project path, and the topic path. Next, we list all the topic names under the project path and create our topic if it is not already present. Finally, we publish 10 messages: each one is a dict with a simple text message and a number, encoded into a string and published as an event on the sample-call topic. Once everything is published, the script prints a simple message saying Published messages with custom attributes.
4. And now the subscriber’s code. Create a subscribers package under the pubsub directory and add a file named sub.py:
The imports and configuration have the same explanations as in the publisher. Please make sure the topic name is identical in both files, otherwise the subscriber will not work, since its subscription would be attached to a different topic. We then declare the subscriber client along with the topic_path, project_path, and subscription_path, gather the existing subscriptions, and create our subscription if it is not among them. Next comes the callback function: since the publisher sends dict-type data, we simply extract it, print it, and acknowledge the message. Finally, we start the subscription’s pull so that we can receive data from events, and wait indefinitely for results. (Pub/Sub supports two kinds of subscription delivery, pull and push; see the Pub/Sub documentation for more.)
Moment of Truth
First, let’s check whether our publisher and subscriber work at all. Open two terminal windows and run publishers/pub.py and subscribers/sub.py separately. Here is my output:
Now let’s check whether the messages fan out across instances of the same subscription. To test that, close all the terminals and follow these steps:
- Open a terminal window and run python subscribers/sub.py
- Open another terminal window and run python subscribers/sub.py again
At this stage you will have two instances of the same subscriber. Now open a third terminal window and run python publishers/pub.py. Here is the output:
Here we can see that Pub/Sub fans out the events across the instances of the same subscriber.
And here is the source code for this entire article. I have gone through pretty basic event streaming concepts here; I will write more about them soon, inshallah!