MLOps: A Practical Guide to Building a Streaming Data Pipeline on GCP

Gabriel Bonanato
11 min read · Jan 18, 2024


Machine Learning Operations (MLOps) serves as the backbone of Data Science, allowing data to flow seamlessly from its inception to the deployment of machine learning models. This process involves making data available to algorithms and also ensuring continuous monitoring of data and model quality.

In this quest to explore the world of MLOps, I encountered concepts like Continuous Integration/Continuous Deployment (CI/CD), drift monitoring, and feature stores. But while I found countless tutorials and videos on MLOps, I yearned for a hands-on experience — a simple yet insightful journey to put some core concepts to the test (without breaking the bank). It was time to get my hands dirty!

Photo by Sabine van Straaten on Unsplash

That’s when I decided to create a streamlined streaming data pipeline tutorial. My goal was to grasp the fundamentals of MLOps and bridge the gap between theoretical knowledge and real-world production scenarios. I wanted a tutorial that was not only straightforward but also time-efficient, requiring minimal resources.

Join me as we develop a simple streaming data pipeline. This tutorial is designed not only to solidify your understanding of MLOps but also to connect it to real-world scenarios for deploying data science models. If you’re eager to join this adventure, where data seamlessly flows from its source to end users, this is the perfect starting point.

1. What’s a Streaming Data Pipeline?

In real-world scenarios, data can be processed in many different ways, depending on business particularities. Two of the most common are batch processing and continuous flow. Batch data is sent in packages at intervals, while continuous data is regularly fed into the pipeline, similar to a stream.

A streaming data pipeline, then, is a series of steps that allows data to continuously flow from one place to another, making it useful along the way: for example, as input for machine learning models and as output delivered to end users.

2. The Goal

This article aims to provide enough information to develop a simple continuous data stream into a SQL database, operate on it, and write outputs back to the database - all within the Google Cloud Platform (GCP) environment. The choice of GCP was purely personal, and other cloud providers like AWS or Oracle Cloud could certainly get the job done.

Since it relies on the GCP environment, this article covers tools and services specific to that platform. Even if you are used to other cloud providers, rest assured: you will certainly find some transferable knowledge here!

3. Out of Scope

Accounting for the complexities of building an optimized MLOps pipeline, implementing data security protocols, or developing actual machine learning algorithms is out of the scope of this tutorial. Instead, a simple addition operation will stand in for machine learning or other data manipulation tasks.

Every step is simplified for the sake of comprehension and to instill the core concepts of MLOps.

4. How to execute MLOps on GCP

Prepare for an immersive journey into MLOps using GCP. This section is divided into four digestible chunks, designed as a detailed tutorial.

1- Basic Setup: The basic setup will be our starting point to lay the foundations and configure your machine to interact with GCP.

2- Fundamental concepts and definitions: to ensure clarity, this step establishes the fundamental concepts and definitions crucial for our pipeline creation.

3- Connecting the dots: using those definitions and basic concepts, we will establish how they connect to build the data pipeline.

4- Deployment on GCP: finally, we will bring everything together to deploy the data pipeline on GCP. The prompt commands and necessary Python scripts are also in this section. Let the fun begin!

4.1 Basic Setup

The first step is to set up a GCP account, and yes, it is necessary to register a personal credit card for that. In my case, I wasn’t charged a single dollar, because this pipeline is really simple and Google gives you more than enough credits just for creating a GCP account. Just remember to delete everything afterwards, to be safe.

Next, for this tutorial it is necessary to install the Google Cloud SDK (Software Development Kit). It can be downloaded from Google’s official page: https://cloud.google.com/sdk/docs/install-sdk. This is an important tool that will be used to integrate with GCP services, simplify resource management, use APIs, and interact with the GCP infrastructure in general. More details can be found on the SDK download page.

Last but not least, for this tutorial it is also important to have Python running on your machine.

4.2 Fundamental Concepts and Definitions

Here are the concepts and some relevant definitions needed to better understand the data pipeline.

  • Data: in this tutorial our data will be hardcoded values simulating a simplified industrial database, where a field sensor sends three pieces of information to the database: a timestamp (timestamp format), an equipment identification (string format), and the measured value (float format).
  • BigQuery: our SQL database, which stores tabular data. Two distinct tables are used: one to store the simulated input data and another for the “machine learning” algorithm’s output.
  • Cloud Function: a cloud instance that runs our Python code in response to triggering events.
  • Pub/Sub: GCP’s messaging service, which receives and delivers information. Beyond that, it also decouples components, so that writing new data and creating outputs can operate independently.
  • Topic/Subscription: concepts within the Pub/Sub architecture — topics receive data (something like where you post your message), and subscriptions define the recipients. Together they organize the data flow, ensuring efficient communication.
  • Cloud Scheduler: a time-triggered counterpart to Cloud Functions, which orchestrates the messaging system and initiates the execution of the cloud functions.

DISCLAIMER: these are not formal definitions, but rather minimal concepts within the scope of this project. I encourage you to learn more about each of these concepts on Google’s pages and in other articles (links below).

4.3 Connecting the dots

Now we have the basic setup and all the necessary concepts to make it happen. This section focuses on how these concepts become connected tools that create the pipeline.

Keep in mind that names between quotation marks were chosen by the author and are not built-in GCP functionalities. Naming was done to keep things as self-explanatory as possible.

  • One Cloud Scheduler: serving as the master trigger, the Cloud Scheduler simulates the signal to send data to our database. It activates every hour, publishing to the Pub/Sub topic “write_data_to_bq-topic”.
  • “write_data_to_bq-topic” and “write_data_to_bq-sub”: acting as a communication bridge, the Pub/Sub topic “write_data_to_bq-topic” delivers each message to the subscription “write_data_to_bq-sub”.
  • Cloud Function “write_input_data”: subscribed to “write_data_to_bq-sub”, this function triggers every time it receives an activation signal. Its primary role is to write data to our BigQuery InputTable, simulating data input.
  • “ml_simulation-topic” and “ml_simulation-topic-subscription”: upon completion of the “write_input_data” execution, the topic “ml_simulation-topic” publishes a message to the “ml_simulation-topic-subscription”.
  • Cloud Function “ml_simulation_function”: subscribed to “ml_simulation-topic-subscription”, this function triggers upon receiving a message. It queries the input table, executes our simulated “machine learning” algorithm, and writes the outputs into the other BigQuery table.
Data stream pipeline architecture (image by author)

4.4 Deployment on GCP

Most of the deployment work is done on the command prompt to avoid relying on the GCP web interface. Using the web interface would require lots of screenshots and would probably soon become outdated as the platform is updated. But be aware that most of this part (if not all of it) could be done directly in the GCP console.

To make things easier, I will be referring to the GitHub repository I made for this project to walk through the process. The Python code used will also be printed here.

Some variable, function, and file names refer to this repository and matter when deploying the GCP functions, so keep that in mind if you’d like to follow along.

First things first, it’s necessary to log in to GCP. For that, open the command prompt and run the gcloud authentication command:

gcloud auth login

Then create the GCP project. Take note of the project name and project ID, because they will be needed in the scripts later on.

gcloud projects create PROJECT_ID --name="PROJECT_NAME" # Replace with the desired project id and name

After that, it is good practice to go to the IAM & Admin section and make sure the service account has the Editor/Owner role. This allows it to create functions, manage BigQuery tables, and so on.

Next, it is important to enable the following APIs: BigQuery, Cloud Pub/Sub, Cloud Functions, Cloud Scheduler, and Cloud Build. This allows us to deploy and interact with our GCP architecture from the command line.

gcloud services enable bigquery.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable cloudscheduler.googleapis.com
gcloud services enable cloudbuild.googleapis.com

Now it is time to create the BigQuery dataset and tables. This can be done by running the following commands:

bq mk IndustrialDataset  # creates dataset
bq mk --table ^
--schema timestamp:TIMESTAMP,tag:STRING,value:FLOAT ^
IndustrialDataset.InputTable # creates InputTable with schema
bq mk --table ^
--schema timestamp:TIMESTAMP,tag:STRING,value:FLOAT,prediction:FLOAT ^
IndustrialDataset.TablePrediction # creates TablePrediction with schema
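If you prefer to stay in Python instead of using the bq command-line tool, the same dataset and tables can also be created with the google-cloud-bigquery client. The snippet below is just an optional sketch, assuming the same names and schemas used in the commands above:

# Optional Python alternative to the "bq mk" commands above
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_ID")  # replace with your project ID

# Create the dataset
client.create_dataset("IndustrialDataset", exists_ok=True)

# Create InputTable with the sensor schema
input_schema = [
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    bigquery.SchemaField("tag", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
client.create_table(
    bigquery.Table(f"{client.project}.IndustrialDataset.InputTable", schema=input_schema),
    exists_ok=True,
)

# Create TablePrediction with the extra "prediction" column
prediction_schema = input_schema + [bigquery.SchemaField("prediction", "FLOAT")]
client.create_table(
    bigquery.Table(f"{client.project}.IndustrialDataset.TablePrediction", schema=prediction_schema),
    exists_ok=True,
)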

Now, to create the Pub/Sub topics and subscriptions, run the following commands on the command prompt:

gcloud pubsub topics create write_data_to_bq-topic  # Creates topic
gcloud pubsub subscriptions create write_data_to_bq-sub ^  # Creates subscription
--topic=write_data_to_bq-topic # Select topic to subscription
gcloud pubsub topics create ml_simulation-topic  # Creates topic
gcloud pubsub subscriptions create ml_simulation-topic-subscription ^  # Creates subscription
--topic=ml_simulation-topic # Selects topic to subscription
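Likewise, if you would rather script this step, a minimal Python sketch using the google-cloud-pubsub client could create the same topics and subscriptions (same names as in the gcloud commands above):

# Optional Python alternative to the gcloud pubsub commands above
from google.cloud import pubsub_v1

project_id = "PROJECT_ID"  # replace with your project ID
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

for topic, subscription in [
    ("write_data_to_bq-topic", "write_data_to_bq-sub"),
    ("ml_simulation-topic", "ml_simulation-topic-subscription"),
]:
    topic_path = publisher.topic_path(project_id, topic)
    publisher.create_topic(name=topic_path)  # creates the topic

    sub_path = subscriber.subscription_path(project_id, subscription)
    subscriber.create_subscription(name=sub_path, topic=topic_path)  # attaches the subscription to it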

Next, it is necessary to create three Python scripts: one to simulate data input to BigQuery, one to configure the Cloud Scheduler, and the last one to run the simulated machine learning algorithm and write its output back to BigQuery.

For the deployment to GCP, be aware that the two Python scripts for the cloud functions must be in separate directories, and each “.py” file must be named “main.py”. Beyond that, each directory must also have a requirements.txt file containing the necessary Python packages/libraries. For this example, the requirements should contain the following GCP packages:

google-api-core==2.15.0
google-auth==2.25.2
google-cloud-bigquery==3.14.1
google-cloud-core==2.4.1
google-cloud-scheduler==2.12.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
google-cloud-pubsub==2.16.0
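For reference, a possible layout is sketched below; the directory names are arbitrary choices of mine, and only the main.py and requirements.txt names are mandatory:

write_input_data_function/    # first Cloud Function directory (name is up to you)
    main.py                   # the write_input_data code shown below
    requirements.txt          # the GCP packages listed above
ml_simulation_function/       # second Cloud Function directory (name is up to you)
    main.py                   # the ml_simulation_function code shown further below
    requirements.txt          # the GCP packages listed above
scheduler_config.py           # Cloud Scheduler setup script, run locally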

The following code configures the cloud function for writing data to BigQuery:

# Function to simulate data input to BigQuery

import datetime
from google.cloud import bigquery, pubsub


def write_input_data(event, context):
    clock = datetime.datetime.now()

    # Your Pub/Sub topic for ML simulation
    project_id = "apt-entropy-410721"  # Your project ID
    topic_name = "ml_simulation-topic"  # Your topic name

    # Create a BigQuery client
    bq_client = bigquery.Client(project=project_id)

    # Define the schema for the BigQuery table
    schema = [
        bigquery.SchemaField("timestamp", "TIMESTAMP"),
        bigquery.SchemaField("tag", "STRING"),
        bigquery.SchemaField("value", "FLOAT"),  # matches the FLOAT column created earlier
    ]

    # Prepare data to be inserted into BigQuery
    rows_to_insert = [
        {
            "timestamp": clock,  # Timestamp
            "tag": '227FC001.PV',  # Identification
            "value": 207,  # Constant value for simulation purposes
        }
    ]

    # Create a reference to the BigQuery table
    table_ref = bq_client.dataset("IndustrialDataset").table("InputTable")
    table = bq_client.get_table(table_ref)

    # Insert data into BigQuery
    errors = bq_client.insert_rows(table, rows_to_insert, selected_fields=schema)

    if errors == []:
        print('New rows have been added to BigQuery')
    else:
        print(f'Encountered errors while inserting rows to BigQuery: {errors}')

    # Publish a minimal message to the ml_simulation-topic (Pub/Sub requires a non-empty payload)
    pubsub_client = pubsub.PublisherClient()
    topic_path = pubsub_client.topic_path(project_id, topic_name)

    # Random number as the message data
    random_numeric_data = 42
    print(f'random number: {random_numeric_data}')

    # Convert the number to bytes before publishing
    data_to_publish = str(random_numeric_data).encode('utf-8')

    pubsub_client.publish(topic_path, data=data_to_publish)
    print(f'Message published to ml_simulation-topic with numeric data: {data_to_publish}')
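Before deploying, you can optionally smoke-test the function locally. This is just a sketch and assumes you have authenticated with application-default credentials (for example via gcloud auth application-default login) so the BigQuery and Pub/Sub clients can reach your project:

# Hypothetical local smoke test, run from the same directory as main.py
from main import write_input_data

if __name__ == "__main__":
    # The function ignores the Pub/Sub event payload and context,
    # so empty placeholders are enough for a local run.
    write_input_data(event={}, context=None)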

To deploy this script, navigate on the command prompt to the directory where it was created and deploy the write_input_data function as follows:

gcloud functions deploy write_input_data ^  # Deploys function 'write_input_data'
--runtime python310 ^ # Defines Python 3.10 as the function runtime
--trigger-event providers/cloud.pubsub/eventTypes/topic.publish ^ # Sets the trigger to a message published to a topic
--trigger-resource write_data_to_bq-topic ^ # Defines 'write_data_to_bq-topic' as the trigger topic
--allow-unauthenticated # Allows unauthenticated access to the function for ease of use

For the Cloud Scheduler, create another Python file with the following script:

# scheduler_config.py

from google.cloud import scheduler


def configure_scheduler():
    # Schedule the tasks
    scheduler_client = scheduler.CloudSchedulerClient()
    parent = "projects/apt-entropy-410721/locations/us-central1"  # Replace with your project and location

    # Schedule the data insertion task
    topic_name = "projects/apt-entropy-410721/topics/write_data_to_bq-topic"  # Replace with your topic name
    job = {
        "name": f"{parent}/jobs/data-insertion-job",
        "schedule": "0 */1 * * *",
        "pubsub_target": {
            "topic_name": topic_name,
            "data": b'{"action": "write_input_data"}',
        },
    }

    data_insertion_job = scheduler_client.create_job(parent=parent, job=job)


if __name__ == "__main__":
    configure_scheduler()

Above, the Cloud Scheduler job was configured to run every hour. You can change this frequency by replacing the “0 */1 * * *” value in the job dictionary. The frequency is expressed in cron syntax: for example, “*/5 * * * *” would run the job every five minutes (more on cron syntax in the references below).

Then create your Cloud Scheduler job by running the previous Python script:

python scheduler_config.py  # Runs function to deploy cloud scheduler.
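To confirm that the job was actually created, one option is to list the jobs with the same client library. The sketch below assumes the project and location used in scheduler_config.py:

# Quick check that the Cloud Scheduler job exists (project/location assumed from scheduler_config.py)
from google.cloud import scheduler

client = scheduler.CloudSchedulerClient()
parent = "projects/apt-entropy-410721/locations/us-central1"  # replace with your project and location

for job in client.list_jobs(parent=parent):
    print(job.name, job.schedule)  # expect .../jobs/data-insertion-job with schedule "0 */1 * * *"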

The next step is to change to a new directory and create the following Python code for simulating the data operations:

from google.cloud import bigquery


def ml_simulation_function(event, context):
    # Extract the latest values from InputTable
    bq_client = bigquery.Client()
    dataset_id = 'apt-entropy-410721.IndustrialDataset'  # Insert your 'project_id.dataset_id'
    input_table_id = 'InputTable'  # Insert your table id
    prediction_table_id = 'TablePrediction'  # Insert your table id

    # Query the latest row from InputTable
    query_latest = f"SELECT * FROM `{dataset_id}.{input_table_id}` ORDER BY timestamp DESC LIMIT 1"
    query_job = bq_client.query(query_latest)
    latest_row = next(query_job.result())

    # Simulate machine learning (adds a constant in this example, for the sake of simplification)
    simulated_result = latest_row['value'] + 10

    # Get a reference to TablePrediction
    prediction_table = bq_client.get_table(f'{dataset_id}.{prediction_table_id}')  # Insert your table address

    # Create a new row with the required fields
    new_row = {
        'timestamp': latest_row['timestamp'],
        'tag': latest_row['tag'],
        'value': latest_row['value'],
        'prediction': simulated_result,
    }

    # Insert the new row into TablePrediction
    errors = bq_client.insert_rows(prediction_table, [new_row])

    if errors == []:
        print('Prediction row added to TablePrediction')
    else:
        print(f'Encountered errors while inserting rows to BigQuery: {errors}')

Note that in this example, instead of an actual machine learning algorithm, an addition operation is performed to represent data manipulation, for the sake of simplification.

The final step is to change to the directory where the ml_simulation_function script was created and run the following command to deploy it:

gcloud functions deploy ml_simulation_function ^  # Deploys function 'ml_simulation_function'
--runtime python310 ^ # Defines Python 3.10 as the function runtime
--trigger-event providers/cloud.pubsub/eventTypes/topic.publish ^ # Sets the trigger to a message published to a topic
--trigger-resource ml_simulation-topic ^ # Defines 'ml_simulation-topic' as the trigger topic
--allow-unauthenticated # Allows unauthenticated access to the function for ease of use
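Once both functions are deployed and the scheduler has fired at least once, it is worth checking that outputs are landing in TablePrediction. Below is a minimal verification sketch, assuming the project and table names used throughout this tutorial:

# Quick verification that predictions are being written
from google.cloud import bigquery

client = bigquery.Client(project="apt-entropy-410721")  # replace with your project ID
query = """
    SELECT timestamp, tag, value, prediction
    FROM `IndustrialDataset.TablePrediction`
    ORDER BY timestamp DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))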

5. Conclusions

It can be very tempting to just run through the code and deploy everything, but I can assure you that you will learn a lot more by absorbing the concepts and trying to develop another prototype that makes sense for your reality.

With this tutorial, you are now capable of implementing the backbone of your pipeline, generating and operating on data. Maybe it’s time to build on top of this and start adding other concepts such as data security, CI/CD, and containerization.

It is also important to note that there isn’t just one way to build a streaming data pipeline. In fact, there are many ways a similar pipeline could be developed within the GCP framework, not to mention with other cloud providers.

I hope you found this story inspiring or didactic in some way, and that it helps you get a better understanding of data pipelines and Machine Learning Operations. To me it was a very fun experience, and it drove me to keep exploring the Data Science universe!

6. References

https://cloud.google.com/bigquery?hl=pt_br#cloud-data-warehouse-to-power-your-data-driven-innovation

https://cloud.google.com/functions?hl=pt_br

https://cloud.google.com/pubsub/docs/overview?hl=pt-br

https://cloud.google.com/scheduler?hl=pt-br

https://medium.com/@suhasthakral/cron-job-explained-in-the-simple-way-example-of-offset-also-included-353ccb118c04
