Designing Data Processing Pipeline on Google Cloud Platform (GCP) — Part I

An architectural overview of processing big data using GCP services

--

They say, “with great data comes great responsibility”. At zeotap, dealing with data is a daily chore, and you need the right tools to do it well. We process terabytes of data, consisting of billions of user profiles, every day. Processing data at this scale requires robust data ingestion pipelines. In this blog, we describe how to build a data ingestion pipeline that supports both streaming and batch workloads using managed GCP services, along with their pros and cons.

Use Case

In some of our use cases, we process both batch and streaming data. We aimed to make this data available to brands by connecting it to our internal data silos (or our third-party data assets), slicing and dicing it, and transforming it into a 360-degree customer view. Our challenge was to design a single data platform capable of handling streaming and batch workloads together while giving us the flexibility to switch the data processing logic dynamically. All of this had to be achieved with minimal operational/DevOps effort, so we started exploring managed GCP services to build our pipeline.

Every data ingestion requires a data processing pipeline as its backbone. A data processing pipeline is fundamentally an Extract-Transform-Load (ETL) process: we read data from a source, apply certain transformations, and store it in a sink. For the purposes of this article, we will provision GCP resources using the Google Cloud APIs. For readers who are already familiar with the various GCP services, this is what our architecture will look like in the end …

Fig 1.1: Data Pipeline Architecture

Let’s go through the details of each component in the pipeline and the problems we faced while using them.

Problem 1: Persisting Streaming Data

A data stream is a set of events generated from different data sources at irregular intervals, with possible sudden bursts. The first challenge with such a data source is to give it temporary persistence. The GCP component we chose for this is Cloud Pub/Sub.

PubSub is GCP’s fully managed messaging service and can be understood as an alternative to RabbitMQ or Kafka. So, in layman’s terms, it’s a queue.

In most streaming scenarios, the incoming traffic streams through an HTTP endpoint powered by a routing backend. We persist all of this traffic in PubSub, from where it can be consumed subsequently. It provided us with the following benefits,

  1. PubSub can retain messages for up to 7 days, so in the case of a downstream consumer failure we get a persistence guarantee and the traffic can be replayed.
  2. Traffic can be routed to multiple downstream consumers, with support for custom routing behavior just like RabbitMQ (a consumer sketch follows the publishing example below).

Publishing events to a PubSub topic (read: queue) is as simple as the code snippet given below,

'''
Set GOOGLE_APPLICATION_CREDENTIALS to point to a service account key file before running this.
Run `pip install google-cloud-pubsub` to install the client library.
'''
from google.cloud import pubsub_v1

# TODO: replace with your own values
project_id = "your-gcp-project-id"   # Your Google Cloud project ID
topic_name = "your-pubsub-topic"     # Your Pub/Sub topic name

# Initiate the client and build the fully qualified topic path
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

# Send the data to PubSub (the payload must be bytes)
future = publisher.publish(topic_path, data=b"Test PubSub Message")

# Prints a server-generated message ID (unique within the topic) on success
print(future.result())
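
On the consuming side, each subscription attached to the topic receives its own copy of the traffic, which is what enables the fan-out and replay behavior described above. Below is a minimal consumer sketch; it assumes a subscription already exists and uses illustrative placeholder names:

'''
Requires an existing subscription on the topic.
Run `pip install google-cloud-pubsub` to install the client library.
'''
from google.cloud import pubsub_v1

# TODO: replace with your own values (illustrative placeholders)
project_id = "your-gcp-project-id"
subscription_name = "your-pubsub-subscription"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_name)

def callback(message):
    # Process the event, then acknowledge it so it is not redelivered
    print("Received:", message.data)
    message.ack()

# Streaming pull runs on background threads until the future is cancelled
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=30)
except Exception:
    streaming_pull_future.cancel()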

Our pipeline till this point looks like this,

Fig 1.2: Stream-to-PubSub processing

Problem 2: Streaming To Batch Conversion

Processing streaming data in real time requires at least some infrastructure to be up and running at all times. In one of our major use cases, we decided to merge our streaming workload with the batch workload by converting the data stream into chunks and giving it permanent persistence. Once persisted, the problem inherently becomes a batch ingestion problem, where data can be consumed and replayed at will. The second component we chose for this is Cloud Dataflow.

Dataflow is GCP’s fully managed service for streaming analytics, based on the Apache Beam programming model, which supports a wide variety of ETL use cases. You can visit the list of templated use cases here. We used a custom version of the PubSub-To-CloudStorage-Text template (built in-house).

As the documentation states, Apache Beam is an open-source model for defining both parallel streaming and batch processing pipelines with simplified mechanics at big data scale. In this model, a pipeline is defined as a sequence of steps to be executed in a program written with the Beam SDK. The program can then be run on a highly scalable processing backend of choice; some of the popular options are Google Cloud Dataflow, Apache Spark, and Apache Flink (full list). The model abstracts away low-level tasks such as distributed processing, coordination, task queuing, and disk/memory management, allowing the developer to concentrate solely on the logic behind the pipeline.

Let’s take a quick look at the code for defining an Apache Beam pipeline,

Fig 1.3: Sample Apache Beam pipeline

The pipeline here defines 3 steps of processing.

Step 1: Read the input events from PubSub.

Step 2: Using the event timestamp attached by PubSub to each event, group the events into fixed-size intervals.

a) To understand the concepts of event-time, windowing, and watermarking in-depth, please refer to the official Apache Beam documentation.

Step 3: Write each group to the GCS bucket once the window duration is over.

Once run, all the low-level details of executing this pipeline in parallel and at scale will be taken care of by the Dataflow processing backend.
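
For illustration, here is a minimal sketch of such a pipeline written with the Apache Beam Python SDK. It is not the exact template we used in production; the topic, bucket path, and five-minute window size are illustrative assumptions:

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Illustrative resource names
INPUT_TOPIC = "projects/your-gcp-project-id/topics/your-pubsub-topic"
OUTPUT_DIR = "gs://your-bucket/batched-events/"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Step 1: Read the raw events from the PubSub topic
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        # Step 2: Group events into fixed five-minute windows using the
        # event timestamp attached by PubSub
        | "FixedWindows" >> beam.WindowInto(FixedWindows(5 * 60))
        # Step 3: Write each closed window to the GCS bucket as text files
        | "WriteToGCS" >> fileio.WriteToFiles(path=OUTPUT_DIR, shards=1)
    )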

To learn how to run the pipeline demonstrated above, or how to write your own Dataflow pipeline (either completely from scratch or by reusing the source code of the predefined templates), please refer to the Template Source Code section of the documentation given below,

Among other benefits of using Dataflow, these were the major ones we observed,

  1. No concerns about the availability of PubSub consumers, as the service is fully managed: a new worker instance is spawned immediately if the previous one goes down.
  2. On-demand horizontal autoscaling based on workload, with support for customizing the worker instance type and the maximum number of workers (these options are sketched right after this list).
  3. Apache Beam’s built-in support for windowing streaming data to convert it into batches.
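
For illustration, here is how those scaling knobs can be passed as pipeline options when launching a job with the Beam Python SDK; the project, region, bucket, and sizing values below are assumptions, not our production configuration:

from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative values; adjust to your project and workload
options = PipelineOptions(
    runner="DataflowRunner",
    project="your-gcp-project-id",
    region="europe-west1",
    temp_location="gs://your-bucket/tmp",
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale workers with the backlog
    max_num_workers=10,                        # upper bound for autoscaling
    machine_type="n1-standard-2",              # worker instance type
)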

The only con we observed (along with a possible workaround),

  1. It is not possible to automatically scale Dataflow workers down to 0 for streaming workloads based on the load. That means there will always be at least one compute instance up and running to read from PubSub, costing a bare minimum of $50 per month (as per current GCP pricing for 1 instance with minimum compute power).
  2. A possible workaround to this problem is to programmatically stop and relaunch Dataflow jobs on a need basis; a minimal sketch follows.
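
A sketch of that workaround using the Dataflow REST API via the google-api-python-client library, which can cancel (or drain) a running job so it can be relaunched on demand; the project, region, and job IDs are illustrative:

# Run `pip install google-api-python-client` and set application default credentials first
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
dataflow.projects().locations().jobs().update(
    projectId="your-gcp-project-id",
    location="europe-west1",
    jobId="your-dataflow-job-id",
    # Use "JOB_STATE_DRAINED" instead to finish in-flight work before stopping
    body={"requestedState": "JOB_STATE_CANCELLED"},
).execute()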

On the Google Cloud console, the Dataflow job looks like this,

Fig 1.4: Dataflow job on GCP console

Dataflow can be configured to group data into logical chunks, or windows. After grouping the individual events of an unbounded collection by timestamp, these batches can be written to a Google Cloud Storage (GCS) bucket.

GCS is a managed object store service provided by GCP; consider it an alternative to Amazon’s S3. This gave our data permanent persistence, and from here on all batch-processing concepts can be applied.
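
For example, once Dataflow has written the windowed batches to GCS, any downstream batch job can list and read them. A minimal sketch using the google-cloud-storage client, with an illustrative bucket and prefix:

# Run `pip install google-cloud-storage` to install the client library
from google.cloud import storage

client = storage.Client()

# List the files written under a given prefix and read their contents
for blob in client.list_blobs("your-bucket", prefix="batched-events/"):
    text = blob.download_as_text()
    print(blob.name, "contains", len(text.splitlines()), "events")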

Continuing on, our pipeline till this point looks like this,

Fig 1.5: PubSub-to-GCS processing using Dataflow

In this part of the blog, we saw how we converted our high-velocity streaming data into batch data using managed GCP services, without developing anything from scratch. In Part II of this blog, we will see how we can slice and dice this data and make it available for final consumption.

Security Considerations:

  1. Setting the Dataflow pipeline’s “usePublicIps” flag to true has severe security implications: it creates Compute Engine instances in the worker pool with public IPs exposed to the internet, which leaves them open to DDoS attacks.
  2. Hence, it is recommended to create a private subnet in the parent GCP project and set the “usePublicIps” flag to false while creating the pipeline. This causes the Dataflow service to fall back to the private subnet and use private IPs by default (a sketch of these options follows).
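
A minimal sketch of this setup using the Beam Python SDK, where use_public_ips is the Python SDK counterpart of the template’s usePublicIps parameter; the project, region, bucket, and subnetwork names are illustrative:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-gcp-project-id",
    region="europe-west1",
    temp_location="gs://your-bucket/tmp",
    streaming=True,
    use_public_ips=False,  # workers receive only private IPs
    subnetwork="regions/europe-west1/subnetworks/your-private-subnet",
)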

Additional References:

For more elaborate examples of publishing messages to PubSub with exception handling in different programming languages, please refer to the official documentation below.

Keep reading and playing with data! To be continued …

ABOUT THE AUTHOR

Shubham Patil is the Lead Software Engineer managing some of the core consumer products at zeotap. His core areas of expertise include designing and developing large-scale distributed backend systems. For more, visit his LinkedIn profile.
