Building a Data Platform on GCP
How to build a modern data platform with high availability, low latency, and scalability on the free tier of Google Cloud Platform.
This article is Part One of a multipart series consisting of:
- Building a Data Platform on GCP
- GCP Infrastructure & Authentication
- Google Cloud Pub/Sub Messaging Service
- Containerization using Docker
- Google Cloud Run Jobs & Scheduler
- Google BigQuery Cloud Database
- Google Cloud Analytics
Parts 1 through 6 include batch files that automate all of the CLI steps discussed in each article, so you can automate your deployment with a versioned system. These batch files are available in the GitHub repository.
What You Will Learn
- How to install and use the Google Cloud CLI & SDK
- GCP infrastructure & authentication
- Google projects & billing
- User-managed service accounts & impersonation
- How to test and minimize assigned roles for Google services
- Google Cloud Pub/Sub messaging using Python
- Containerization using Docker
- Google Artifact Registry and repositories
- Pushing a Docker image to a Google Artifact Registry repository
- Google Cloud Run Jobs & Scheduler Jobs
- Google BigQuery cloud database
- How to query data from a BigQuery cloud database table and analyze it using Python and Pandas
- How to script Python PIP, Docker, the Google CLI, and the BigQuery CLI from within a Windows batch file
- How to run Python, Docker, Google CLI, and BigQuery all from the same Windows command window.
What I Learned
- The Google CLI is a powerful tool that allows you to do just about everything available in the Google Console GUI. By putting the commands into scripts, you can automate deployments and most importantly, version them.
- The Google CLI documentation is seriously lacking. For example, nearly every command for a service (Pub/Sub, Artifact Registry, Cloud Run Jobs, Cloud Scheduler Jobs, etc.) requires the --region or --location argument, but that is not indicated everywhere. And why do some service commands require "--region" and others "--location" when the value is exactly the same? The poor documentation for Cloud Scheduler Jobs cost me 1.5 days of exhaustive trial and error to figure out the required inputs.
- The documentation and examples for the Google SDK for Python have serious errors as well, but once you employ the library correctly, it works well.
- The services themselves such as Pub/Sub, Artifact Registry, Cloud Run Jobs, Scheduler Jobs, and BigQuery are powerful and robust.
- The Google authentication system is painful to understand and implement in a production-viable way. Much has been written about Application Default Credentials (ADC) and the Google Cloud SDK, but I couldn't find any working examples from start to finish. I spent weeks of trial and error figuring out how ADC should work for production; I hope this article takes that pain out of working with Google ADC for someone. The detailed steps I provide make it easy to configure ADC, assign minimal roles, test them, and enable the required APIs.
- Docker and the Google Cloud Platform were completely new to me.
- BigQuery is a powerful and easy-to-use cloud database system (DBMS). A CLI (command-line tool) is available, and the Google SDK provides a Python library for interacting with the database.
- I was disappointed with Google Looker Studio from an analytics capability perspective. I expected more.
- I was pleased to be able to easily and directly query a BigQuery database table, extract data, and write it to a Pandas dataframe (see the sketch after this list). This allowed me to perform analysis on the time-series data using familiar tools in a Python environment.
- You can run Python, Docker, Google CLI, and BigQuery all from the same Windows command window once your system is properly configured.
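To make the Pandas point concrete, below is a minimal sketch of a BigQuery-to-Pandas query, assuming the project, dataset, and table names from the Project Constants section at the end of this article (substitute your own):

# Query a BigQuery table directly into a Pandas dataframe.
from google.cloud import bigquery

# Project ID from the Project Constants section; substitute your own.
client = bigquery.Client(project="data-platform-v0-0")

sql = """
    SELECT *
    FROM `data-platform-v0-0.ds_data_platform.tbl_pubsub`
    LIMIT 1000
"""

# to_dataframe() needs the db-dtypes package; google-cloud-bigquery-storage
# speeds up the download. Both are installed later in this article.
df = client.query(sql).to_dataframe()
print(df.describe())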
Background
In August 2024 I read the article "Building a Data Platform in 2024" by Dave Melillo on Medium. At the same time, a friend told me that she was guiding two sister companies through their digital transformation journey. I thought this would be a great time for me to explore building a modern data platform using the insights from Dave Melillo's article, which lays out his vision of that modern data platform.
Project Functional Specification
As always, before starting the project I created a list of functional requirements for the "data platform".
- SaaS. A managed, cloud-based (SaaS) data platform with high availability, low latency, and scalability.
- Free. Any component used for the data platform must have a free tier duration of at least one year.
- EaC. Both configuration and execution must be done using the Everything as Code (EaC) concept. Code can be versioned and tracked more easily than screenshots of a UI. Furthermore, everything supplier-specific in the code should be abstracted so that you can choose a new supplier or tool later without excessive pain.
- Raw Data. Incoming raw data arrives from multiple regions worldwide, as both batch and streaming. The regions must include the U.S., Asia, and Europe. The data packet must include a mix of basic data types (datetime, string, integer, float), and the number of data values (channels) must be adjustable for testing at various sizes. Metadata about each streaming data source and the data content should be defined and accessible by the data platform, eliminating the need to send it with each data packet. For data analysis purposes, some of the streaming raw data channels should have an inherent correlation between them that can be measured over time, others no correlation, and some a cyclical pattern (a sketch of such a packet follows this list). The streaming rate should be adjustable down to 100 ms.
- Integration. Must accept raw data as streaming, batch, or event-triggered, and must follow the EaC concept.
- Storage. The data storage must be cloud based and can be a data lake or data warehouse.
- Transformation. Handle a variety of data types, including conversions, through the orchestration process.
- Presentation. An interface to the data platform (the "Data Platform Interface") must exist to provide business users with easy access to the data. Ideally this should be an API driven by code, but it may be a web-based user interface (UI).
- Analysis. Demonstrate how data science can be used to make predictions and recommendations based on data available through the Data Platform Interface.
- Cost Model. Utilize cost-modeling tools, or develop a cost model, for any paid services, with the capability to estimate the cost of the service at a different scale.
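To make the Raw Data requirement concrete, below is a minimal sketch of a packet generator. The field names are hypothetical, not the final schema; every third channel tracks a shared base signal (correlated), the next group follows a sine wave (cyclical), and the rest are pure noise (uncorrelated):

import datetime
import math
import random

def make_packet(num_channels=8, step=0):
    """Build one streaming data packet with a mix of basic data types
    and an adjustable number of float channels."""
    base = random.gauss(0.0, 1.0)  # shared signal for the correlated channels
    channels = {}
    for i in range(num_channels):
        if i % 3 == 0:    # correlated: tracks the shared base signal
            value = base + random.gauss(0.0, 0.1)
        elif i % 3 == 1:  # cyclical: sine wave plus a little noise
            value = math.sin(step / 10.0) + random.gauss(0.0, 0.05)
        else:             # uncorrelated: pure noise
            value = random.gauss(0.0, 1.0)
        channels[f"ch_{i:02d}"] = round(value, 4)
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),  # datetime
        "source": "sim-us-east",  # string
        "sequence": step,         # integer
        **channels,               # floats
    }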
Implementation Overview
The streaming source data will be generated by a Python script running on a cloud virtual machine (VM) or compute service. It will publish the data to a cloud message service.
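Per the EaC requirement above, this publisher script can be written against a minimal, supplier-agnostic interface so the message service can be swapped later without touching the streaming loop. A hedged sketch follows; MessageBus and stream are hypothetical names, and make_packet is the generator sketched earlier:

import json
import time
from typing import Protocol

class MessageBus(Protocol):
    """Anything that can publish raw bytes (Pub/Sub, Kafka, etc.)."""
    def publish(self, payload: bytes) -> None: ...

def stream(bus: MessageBus, make_packet, interval_s: float = 0.1) -> None:
    """Publish packets forever at an adjustable rate (down to 100 ms)."""
    step = 0
    while True:
        bus.publish(json.dumps(make_packet(step=step)).encode("utf-8"))
        step += 1
        time.sleep(interval_s)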
Another Python script running on a cloud virtual machine (VM) or compute service will fetch messages from the cloud message service, extract the data, clean it, transform it, and then write it to cloud storage.
The interface to the data platform must be available as a CLI (command line interface) and GUI (graphical user interface).
Analysis of the data in storage will be demonstrated by a Python script using Pandas.
First Trial v0.0
I started with Amazon Web Services (AWS) and ran a Python script on an AWS EC2 instance running in Africa that streamed messages to an Upstash Kafka server. The AWS EC2 instance was a virtual machine running Linux.
The first significant thing I learned from this AWS experience was that while I could create AWS S3 buckets programmatically in Python, I could not create the required credentials in code. As a result, I ended up using the AWS UI a lot to create the required credentials for my AWS S3 bucket and EC2 instance. It would have been necessary to repeat this process multiple times in order to create publishers for multiple regions (unless I used an image).
Although I selected a free tier for the AWS EC2 instance, consisting of a virtual machine running AWS Linux, I was still charged $0.21 USD for the 8.9 hours I had the instance running. That amounts to about $17/month if it were running full time. Not exactly free.
Snowflake
I took a serious look at building the data platform on Snowflake, but the short 30-day free trial dissuaded me. Snowflake has a lot of services to explore (which is why I was interested in it), but I didn't think even a person in my situation (retired) would be able to try them all in 30 days. I looked into what the cost would be after 30 days, and even for my very small-scale project, the monthly price confirmed my decision not to pursue Snowflake.
If you as a SaaS supplier really want to expand your customer base, offer an unlimited, or at least a year-long, free trial. Your sales growth should come from customers that have made a commitment to employing your service in production, NOT from the evaluation of the suitability of your service.
Google Cloud Platform v1.x
The next attempt was to create a data platform using services and resources from the Google Cloud Platform (GCP). Google has a generous free tier with no time limits, only usage limits. You can see all that is available by visiting https://cloud.google.com/free.
The major components of a GCP based data platform will consist of:
- Source Data. Raw streaming data generated by a Python script that publishes the data to Google Pub/Sub. The Python script will run in Google Cloud Run Jobs and be scheduled by Google Cloud Scheduler Jobs.
- Integration. A Python script running in Google Cloud Run Jobs and triggered every two minutes by Google Cloud Scheduler Jobs will subscribe to the messages in Google Pub/Sub, extract the data, clean it, and then write it to Google BigQuery (a publish/pull sketch follows this list).
- Storage. Google BigQuery database
- Presentation. Pandas pulling data from Google BigQuery
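Before the detailed walkthroughs in later articles, here is a hedged sketch of the publish and pull steps using the google-cloud-pubsub package (installed below) and the IDs from the Project Constants section. It assumes the project, topic, and subscription already exist:

import json
from google.cloud import pubsub_v1

PROJECT_ID = "data-platform-v0-0"
TOPIC_ID = "streaming_data_packet_topic"
SUBSCRIPTION_ID = "streaming_data_packet_subscription"

# Publish one packet to the topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
future = publisher.publish(topic_path, json.dumps({"sequence": 1}).encode("utf-8"))
print("published message id:", future.result())

# Synchronously pull and acknowledge a batch of messages.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
for msg in response.received_messages:
    print(msg.message.data.decode("utf-8"))
if response.received_messages:
    ack_ids = [m.ack_id for m in response.received_messages]
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})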
Installation & Configuration
You need to install Python, the Google Cloud CLI & SDK, and Docker in order to continue with the articles that follow. These installation details are specific to the Windows OS, but much of it applies to Linux and macOS. Docker requires Windows 10 or 11 64-bit (see the Docker installation requirements); the Google Cloud CLI works on Windows 8.1 and later and Windows Server 2012 and later.
Article Code Block Conventions
Note that whenever you see a code block like those listed below, a command preceded by the $ sign should be executed within a Windows command prompt window. Do not include the $ symbol in front of the command. A comment, denoted by a hash (#), will also be shown before most commands; do not enter the hash or the comment text in the Windows command prompt window. Some examples:
# Get the Google CLI version
$ gcloud version
# Get the Docker version
$ docker --version
# Get the active Python version
$ py -V
Install Python
I had the choice of using a REST API or the Python SDK for scripting. I chose the Python SDK.
In order to use the Python scripts I am providing on GitHub, you need to install Python v3.12.x and the Python library installer PIP.
Execute the commands below from a Windows command prompt window to confirm Python 3.12 is installed, and to install/upgrade PIP.
# Show what Python versions are installed and where
$ py --list-paths
# Install PIP
$ py -m ensurepip --upgrade
For more details about the Python installation and using Python, visit my website SavvyCodeSolutions or my article “Python Beginner Guide — Part 1”.
Google Cloud
You must have a Google Cloud user account with billing enabled. By following these articles you will be using the Google Free Tier services, but you still need to have a payment option configured. Log in to or create a Google user account, then configure billing from the Google Cloud Console.
Google Cloud CLI & SDK
Download the Google Cloud CLI Installer. When you run the installer, note the destination folder, unselect the option "Bundled Python" since you already have Python installed, and make sure the option "Cloud tools for PowerShell" is selected. The option "Cloud SDK Core Libraries and Tools" will be selected, and you cannot change that option.
The final installation screen will give you options for adding shortcuts, starting the Google Cloud SDK Shell, and running “gcloud init”. Don’t enable the two options for starting the Google Cloud SDK shell or running “gcloud init”.
If you experience any issues with the installation, visit Install the Google Cloud CLI. I had issues installing it into a VM running a new install of Windows 10 Home. What resolved my problem was to restart the machine, go into the folder C:\Users\[username]\AppData\Local\Google\Cloud SDK\google-cloud-sdk\, and run the install.bat file. I also had to manually add the C:\Users\[username]\AppData\Local\Google\Cloud SDK\google-cloud-sdk\bin\ folder to my path (Settings -> Environment Variables -> Path).
Note that when you install the Google Cloud CLI, the Google BigQuery command-line tool is also installed. The use of Google BigQuery and the command-line tool is discussed in the subsequent article “Google BigQuery Cloud Database”.
Find the Windows icon for “Google Cloud SDK Shell” and run it. A CMD window will open up. This is your Google Cloud CLI (command line interface), also known as gcloud. Verify it is working properly by entering the command:
$ gcloud version
When the ..\AppData\Local\Google\Cloud SDK\google-cloud-sdk\bin\ folder is in your Windows user path (PATH environment variable), you can execute a Google CLI (gcloud) command from any folder. This is the default installation configuration.
Docker
If you don’t already have Docker installed, see the article “How To Install Docker on Windows? A Step-by-Step Guide” and the Docker installation page for assistance.
On the Docker installation page, you need to choose "Docker Desktop for Windows" and either the "x86_64" or the "ARM (Beta)" option. To determine whether your system is x86_64 or ARM, open a Windows command prompt window (CMD) and type "systeminfo". Look for the item "System Type". If it is "x64-based PC", choose the "x86_64" Docker install option. If it is "x86", you have a 32-bit system and cannot use Docker. If the system type is "ARM64", install the "ARM (Beta)" Docker option.
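If you already have Python installed, the following one-liner is a convenient cross-check (my shortcut, not Docker's documented method); it prints AMD64 on an x86_64 system and ARM64 on an ARM system:

# Print the machine architecture (AMD64 = x86_64, ARM64 = ARM)
$ py -c "import platform; print(platform.machine())"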
Verify your Docker installation by executing the following in a Windows command prompt window (CMD):
$ docker --version
$ docker run hello-world
The default installation adds C:\Program Files\Docker\Docker\resources\bin\ to the user’s Windows PATH environment variable. This allows you to run a Docker CLI command from any folder while in a Windows command prompt window.
QuickStart — gcp_part1.bat
Everything that follows in this article beyond the installation of Python, the Google Cloud CLI / SDK, and Docker can be done automatically by a Windows batch file named “gcp_part1.bat” included in the GitHub repository. Download that repository and put the contents into a folder named “gcp” off of your user Documents folder.
IMPORTANT: You must edit the key/value pairs in the file gcp_constants.bat. The defaults are okay, except the following two must be edited to match your Google settings:
- GCP_USER=username@gmail.com
- GCP_BILLING_ACCOUNT=0X0X0X-0X0X0X-0X0X0X
Open up a Windows command prompt window, navigate to that folder, and then execute the batch file “gcp_part1.bat” by simply typing in “gcp_part1” followed by the enter key.
$ gcp_part1
Alternatively, you can execute each of the Python and Google Cloud CLI (gcloud) commands in that Windows command window manually, as described in the next sections. The benefit of the details that follow is that each command is explained in context.
Create a Python Virtual Environment
From the Windows CMD window, create a folder for the Python virtual environment named "gcp" (for Google Cloud Platform) and change the working directory to that folder.
# Change to the user's Documents folder
$ cd /D %USERPROFILE%\documents
# Create a folder for Python and virtual environments and change to it
$ mkdir Python\venv\gcp
$ cd Python\venv\gcp
Download the files from GitHub and place them into the “gcp” folder. From the Windows command prompt window that you should still have open, run the batch file named “gcp_part1”.
$ gcp_part1
The batch file will execute the commands listed below in the Windows command prompt window. You may run them manually if you prefer.
# Navigate to the parent folder "venv" of the folder "gcp"
$ cd ..
# Make sure PIP is installed / upgraded to the latest for Python v3.12
$ py -3.12 -m pip install --upgrade pip
# Create a virtual environment named "gcp"
$ py -3.12 -m venv gcp
# Change the working directory to "gcp"
$ cd gcp
# Activate the Python virtual environment
$ scripts\activate
# Show the current Python version
$ py -V
# Use PIP to install the following Python packages:
$ py -m pip install --upgrade google-cloud-pubsub
$ py -m pip install --upgrade google-cloud-bigquery
$ py -m pip install --upgrade google-cloud-bigquery-storage
$ py -m pip install --upgrade numpy
$ py -m pip install --upgrade pandas
$ py -m pip install --upgrade db-dtypes
$ py -m pip install --upgrade matplotlib
# List the installed Python packages
$ py -3.12 -m pip list
# Create the file requirements.txt that lists all PIP installed packages
$ py -m pip freeze > requirements.txt
# Deactivate the Python virtual environment
$ scripts\deactivate
# Display the Python installation paths
$ py --list-paths
# Create a permanent environment variable CLOUDSDK_PYTHON for the
# Google CLI & SDK.
$ setx CLOUDSDK_PYTHON "EDIT_THIS_WITH_THE_PYTHON312_PATH"
# Use the "exit" command to close the Windows command window.
$ exit
Open a Windows command prompt window again and execute the following command to confirm that the environment variable CLOUDSDK_PYTHON is configured.
# Display the value of CLOUDSDK_PYTHON
$ set CLOUDSDK_PYTHON
Make sure the Windows command prompt window current folder is “..\venv\gcp”.
Find the Windows icon for "Google Cloud SDK Shell" and run it again to open the Google Cloud CLI (gcloud) CMD window. Every subsequent command referenced in the articles that follow that begins with "$ gcloud" should be executed in this window. Do not include the $ symbol with any gcloud commands.
Configure CLOUDSDK_PYTHON
In the Google Cloud SDK Shell CMD window, set the environment variable CLOUDSDK_PYTHON to point to the Python v3.12 executable. You can find that location by executing the command:
$ py --list-paths
Next, set CLOUDSDK_PYTHON to point to that executable using the Windows SETX command as shown below.
$ setx CLOUDSDK_PYTHON %USERPROFILE%\AppData\Local\Programs\Python\Python312\python.exe
The command SETX will permanently add the environment variable CLOUDSDK_PYTHON to the user’s Windows OS environment.
SETX CLOUDSDK_PYTHON Alternative
If you don't want to create a permanent user Windows OS environment variable, you can instead set CLOUDSDK_PYTHON in the batch file named "cloud_env.bat", located in the folder shown before the command prompt in the Google Cloud CLI window. The folder location should be similar to:
C:\Users\[username]\AppData\Local\Google\Cloud SDK
The "cloud_env.bat" file is executed when you run the Google Cloud SDK Shell, and it should look similar to what is shown below. I modified my file slightly as shown and suggest you do the same:
ECHO OFF
CLS
SET PATH=%USERPROFILE%\AppData\Local\Google\Cloud SDK\google-cloud-sdk\bin;%PATH%;
rem The SET CLOUDSDK_PYTHON below must be configured for the Python executable version you wish to use.
SET CLOUDSDK_PYTHON=%USERPROFILE%\AppData\Local\Programs\Python\Python312\python.exe
rem Make sure GOOGLE_APPLICATION_CREDENTIALS are NOT set
SET GOOGLE_APPLICATION_CREDENTIALS=
cd %USERPROFILE%\AppData\Local\Google\Cloud SDK
ECHO Welcome to the Google Cloud CLI! Run "gcloud -h" to get the list of available commands.
ECHO ---
ECHO CLOUDSDK_PYTHON = %CLOUDSDK_PYTHON%
ECHO GOOGLE_APPLICATION_CREDENTIALS = %GOOGLE_APPLICATION_CREDENTIALS%
ECHO ---
ECHO ON
gcloud config list
One Windows Command Prompt Window
When your Windows environment is properly configured as I have recommended, you can run Windows CMD, Python, Python PIP, Docker, Google CLI, and BigQuery command-line commands in the same Windows command prompt window. The keys to this are:
- Install Python, the Google CLI / SDK, and Docker as stated.
- For Python, activate the virtual environment from the virtual environment folder (named "gcp" in my examples), and then call "py" (py.exe) to execute python.exe.
- Create a permanent Windows OS environment variable for the user environment for CLOUDSDK_PYTHON that points to the Python v3.12 executable.
Conclusion
You now have Python, PIP, the Google CLI, and the Google SDK installed. Your OS should be configured to allow you to run Python, PIP, the Google CLI, and the Google BigQuery command-line tool from any Windows command prompt.
In the next article “GCP Infrastructure & Authentication”, I will show you how to configure Google authentication so that a Python script will run locally, in a local Docker container, and on the Google Cloud Platform in a VM using a user-managed service account and impersonation.
The next two topics in this article provide reference information for the subsequent articles.
Project Constants
Below are the various constants that will be referenced in the articles that follow. Adding a semantic versioning reference to the PROJECT_ID (v0-0) allows you to easily delete a complete project and create another new project.
Python version: 3.12.x
PROJECT_ID: data-platform-v0-0
GCP Region: us-east4
User Account: username@gmail.com
Service Account: svc-act-pubsub@data-platform-v0-0.iam.gserviceaccount.com
(the prefix for the above is "svc-act-pubsub")
TOPIC_ID: streaming_data_packet_topic
SUBSCRIPTION_ID: streaming_data_packet_subscription
Google Artifact Registry repository: repo-data-platform
Google Cloud Run Jobs:
data-platform-pub-run-job
data-platform-sub-run-job
BigQuery:
DATASET_ID: ds_data_platform
TABLE_ID: tbl_pubsub
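In keeping with the EaC concept, these constants could also be collected into a small Python module (a hypothetical config.py, not part of the repository) so the scripts in later articles share one source of truth:

# config.py - project-wide constants (hypothetical module)
PROJECT_ID = "data-platform-v0-0"
REGION = "us-east4"
SERVICE_ACCOUNT = "svc-act-pubsub@data-platform-v0-0.iam.gserviceaccount.com"
TOPIC_ID = "streaming_data_packet_topic"
SUBSCRIPTION_ID = "streaming_data_packet_subscription"
REPOSITORY = "repo-data-platform"
DATASET_ID = "ds_data_platform"
TABLE_ID = "tbl_pubsub"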
Related Links
Google Cloud Resource Hierarchy
Google Cloud APIs, Project REST API
GCP geographic management of data
Google Cloud setup checklist
Terminology
API — Application Programming Interface.
AWS — Amazon Web Services provides cloud services such as storage (S3) and computing (EC2).