We created an e2e serverless data pipeline using a scheduled service that gets data from OpenWeatherMap and stores it into BigQuery. The data can be explored directly on BigQuery or Data Studio.
The importance of data to organizations, large and small, is well known. Gartner says that by 2022, 90% of corporate strategies will consider data a critical enterprise asset. Besides, the concept of serverless is becoming one of the cornerstones of modern computing. Indeed, to paraphrase Dylan Stamat, CEO of Iron.io, one of the most important reasons companies are moving to serverless is its nature of cost and usage optimization (they want to spend less and do more!).
Motivating your heart
Judging by the title of this quick article/tutorial, you are here maybe because:
- you love working with new technologies related to data,
- it is just your job to work with it,
- or perhaps you are part of a secret society that believes in the power of the Datum God.
Whichever the option, we care about data. A similar pitch could be made for serverless. Furthermore, combined they form the perfect duo for the new data engineer’s tool belt: excellent dev skills, without the worry of managing and operating servers (see: AWS talking about serverless).
What will you read here?
Keeping your time in mind, let’s get to the point. This article is a how-to guide in which we are going to build a serverless data pipeline that regularly obtains data from an API/service and loads it into an analytics engine, ready to be explored by business users or data scientists.
Our tech stack is composed of some serverless services of Google Cloud Platform:
- Cloud Functions: Subscribed to our execution topic on Pub/Sub, our code will obtain data from an API/service and load it into BigQuery.
- Cloud Pub/Sub: Our execution topic that will trigger our Cloud Function.
- Cloud Scheduler: This service will allow us to schedule the execution of our Cloud Function by publishing to the Pub/Sub topic.
- BigQuery: The data obtained by the Cloud Function will be loaded here. It is our analytics engine, allowing us to run simple or complex queries and data processes.
- Data Studio: Our reporting tool for the analyst or data explorer; it will consume data from BigQuery. Although it is not part of GCP (it is a SaaS app), I love it ♥️!
Along with GCP, the role of API/service will be played by the current weather API from OpenWeatherMap. I chose it because of its free plan, and also because it is one of the most popular APIs on that thing called The Internet. Moreover, I picked Santiago, Chile, as the geographic point to get data from. However, keep in mind that this is just an example; you may need a service that suits your own job to gather data.
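For reference, the current weather endpoint takes a city query and an API key. A minimal sketch of building the request URL in Node.js (the key is a placeholder; `units=metric` is my own illustrative choice, not required):

```javascript
// Build the OpenWeatherMap current-weather request URL.
// 'YOUR_API_KEY' is a placeholder; generate your own key on openweathermap.org.
function currentWeatherUrl(city, apiKey) {
  const base = 'https://api.openweathermap.org/data/2.5/weather';
  return `${base}?q=${encodeURIComponent(city)}&appid=${apiKey}&units=metric`;
}

console.log(currentWeatherUrl('Santiago,CL', 'YOUR_API_KEY'));
```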
The next image outlines our reference architecture.
Hands-on (ur ♥️)
This section will empower your data engineer’s tool belt.
Prerequisites for a good trip
Before starting, you will need the following:
- A GCP project with a linked billing account. Don’t worry about this; if you specify a gentle frequency of execution, the entire pipeline will cost zero.
- Installed and initialized the Google Cloud SDK.
- An App Engine app activated in your project. Why?
- Enabled the Cloud Functions, Cloud Scheduler, and App Engine APIs.
- An API key from OpenWeatherMap.
- A Unix-based command-line interpreter. Remember that you can use Cloud Shell on GCP!
Step 1: Clone the repo
You have to clone the repo that contains the function code. Also, feel free to stop by and say hello!
git clone https://github.com/jovald/gcp-serverless-data-pipeline.git
Step 2: Setting up environment variables
I have taken the GCP specialization courses on Coursera, which introduced me to the habit of using env variables while working with the command-line tools of the GCP SDK. Hence, let’s set some variables:
PROJECT_ID: Your project ID; you can find it on the GCP Console. Example:
# Take care of the id
export PROJECT_ID=<your-project-id>
TOPIC_NAME: This is the topic name for Pub/Sub. I set something like:
export TOPIC_NAME=weather-topic
JOB_NAME: This is the job name to be executed on Cloud Scheduler. I set something like:
export JOB_NAME=weather-job
FUNCTION_NAME: This is the name of the function on index.js; in this particular scenario, the name is loadDataIntoBigQuery, so…
export FUNCTION_NAME=loadDataIntoBigQuery
SCHEDULE_TIME: This is the frequency of execution, be gentle here. For instance, you can set it as:
export SCHEDULE_TIME="every 1 hour"
You can read more about this in the docs of Cloud Scheduler.
OPEN_WEATHER_MAP_API_KEY: Sign up on OpenWeather, and generate an API key here. The key takes some time to become active. Meanwhile, you can try calling the API through Insomnia (Debug APIs like a human, not a robot). Then set it:
export OPEN_WEATHER_MAP_API_KEY=<your-api-key>
BQ_DATASET: This is the BigQuery dataset name. I set something like:
export BQ_DATASET=weather_dataset
BQ_TABLE: This is the BigQuery table name. I set something like (very repetitive):
export BQ_TABLE=weather_table
Step 3: Activate the GCP project
Setting the active project lets you run the next commands without specifying the project each time:
gcloud config set project $PROJECT_ID
Step 4: Create the Cloud Pub/Sub topic
Let’s create a Pub/Sub topic called TOPIC_NAME:
gcloud pubsub topics create $TOPIC_NAME
Step 5: Create the Cloud Scheduler job
This command will create a Cloud Scheduler job, named JOB_NAME, that will send a message through the Pub/Sub topic TOPIC_NAME every SCHEDULE_TIME (frequency).
gcloud scheduler jobs create pubsub $JOB_NAME --schedule="$SCHEDULE_TIME" --topic=$TOPIC_NAME --message-body="execute"
Step 6: Create the BigQuery dataset
bq mk $BQ_DATASET
Step 7: Create the BigQuery table
The BQ_TABLE will contain our data:
bq mk --table $PROJECT_ID:$BQ_DATASET.$BQ_TABLE
Step 8: Deploy the Kraken! Sorry… our function
Finally! The glue of our pipeline: our function. You can read the next command as:
“Ok, Google, deploy the function called FUNCTION_NAME that will be triggered by the topic TOPIC_NAME and don’t forget to set up some env variables for this lovely service.”
gcloud functions deploy $FUNCTION_NAME --trigger-topic $TOPIC_NAME --runtime nodejs10 --set-env-vars OPEN_WEATHER_MAP_API_KEY=$OPEN_WEATHER_MAP_API_KEY,BQ_DATASET=$BQ_DATASET,BQ_TABLE=$BQ_TABLE
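The variables set here with --set-env-vars are read inside the function via process.env. A minimal sketch of how that could look (the validation helper is illustrative, not the repo’s actual code):

```javascript
// Illustrative helper: read the configuration passed via --set-env-vars.
// Failing fast on a missing variable makes the problem obvious in the logs.
function getConfig(env) {
  const required = ['OPEN_WEATHER_MAP_API_KEY', 'BQ_DATASET', 'BQ_TABLE'];
  for (const name of required) {
    if (!env[name]) throw new Error(`Missing env variable: ${name}`);
  }
  return {
    apiKey: env.OPEN_WEATHER_MAP_API_KEY,
    dataset: env.BQ_DATASET,
    table: env.BQ_TABLE,
  };
}

// Inside the deployed function you would call: getConfig(process.env)
```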
Our service is running in a bold and robust way
There are some crucial parts to name on the function code:
- We will use the temporary space of Cloud Functions as our staging zone. A more scalable approach could use Cloud Storage as an intermediate space for the data.
- We are setting local variables from the env variables passed in the deployment command.
- We are appending data to BigQuery in JSON format.
- You always have to delete the files in the temporary space. “Files that you write consume memory available… and sometimes persist between invocations. Failing to delete these files explicitly may eventually lead to an out-of-memory error.” Delete them!
How to end this pipeline as a real queen or king of data?
This particular stage depends totally on the data or insights you want to obtain. Felipe Hoffa illustrates different use cases and ideas using BigQuery; you can read his articles on Medium!
Query your table
Two options (there are clearly more).
First, remember the env variables? They are still useful. If you run the next command, a query will count all the records in your table. If you completed the steps above correctly, you will see at least one record.
bq query --nouse_legacy_sql "SELECT COUNT(*) FROM $BQ_DATASET.$BQ_TABLE"
Second, BigQuery on the GCP Console is also an enjoyable way to explore and analyze your data.
Data Studio, the grand finale
Day by day, Google’s technological ecosystem grows. This project is a small but concise proof of how complete an end-to-end data solution built on this ecosystem can be.
I built a report on Data Studio, and it was a great and fast experience. Look at that: just 20–30 minutes of learning by doing, and it is connected directly to BigQuery! A big wow.
What about making the deployment process with fewer commands? The next article will be shorter (promise), with a similar focus but smarter, using Terraform.
What inspired this article
- Simple serverless data pipeline on Google Cloud Platform
- Streaming data from Cloud Storage into BigQuery using Cloud Functions