Welcome to Part 3 of this series. Previously we configured a Pub/Sub topic to receive information from our IoT devices and set up a BigQuery table to store that information. We also used Dataflow to ingest, transform, and store that data in our BigQuery table.

In this part we will use Google Cloud Functions to do the ingestion, transformation, and storing to BigQuery.

Google Cloud Functions is a lightweight compute solution for developers to create single-purpose, stand-alone functions that respond to Cloud events without the need to manage a server or runtime environment.

Let’s dig right in. We will be using the Node.js 10 runtime (beta) for our function in this article, but you could write it in any of the supported languages.
For this example we will also be using the new Functions Framework from the Google Cloud team.

We will be using a background function, because we want the function to be invoked indirectly in response to messages arriving on our Cloud Pub/Sub topic.
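A minimal sketch of what such a background function looks like on the Node.js 10 runtime (the handler name matches the one we deploy later; the payload handling is simplified):

// index.js -- a Pub/Sub-triggered background function
exports.deviceSignalsHandler = async (message, context) => {
  // Pub/Sub delivers the payload base64-encoded in message.data
  const payload = JSON.parse(Buffer.from(message.data, 'base64').toString());

  // context.eventId uniquely identifies this Pub/Sub message
  console.log(`Processing event ${context.eventId}`, payload);

  // ...transform the payload and store it in BigQuery (covered below)...
};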

The big difference compared to the Dataflow approach from Part 2 is that Cloud Functions scales to zero when there is no traffic, so you only pay for what you use; Dataflow currently doesn’t support scaling to zero.

To keep this article short, clone the series repository from GitHub (https://github.com/jerryjj/iot-pipelines-series) and install the requirements as described in the part3/README file.

Let’s go through the important parts.

In src/signals.js we define the transformation logic applied to the data before we store it in BigQuery.
As you can see, it is exactly the same logic we used previously in our Dataflow example.
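As a rough sketch of the idea (the field names here are purely illustrative, not necessarily the ones used in the repository), the transformation is simply a function that maps an incoming signal to a BigQuery row:

// signals.js -- illustrative shape of the transform step
function transformSignal(payload) {
  return {
    device_id: payload.d,                                 // device identifier
    timestamp: new Date(payload.t * 1000).toISOString(),  // epoch seconds -> ISO timestamp
    latitude: payload.lat,
    longitude: payload.lng,
    battery_percentage: payload.btr,
  };
}

module.exports = { transformSignal };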

In the other files we handle parsing the messages coming from Pub/Sub and writing the transformed data to BigQuery safely.

Something to note in our src/lib/handlers.js file is that if we get an error while writing to BigQuery, we throw it, which triggers the function to be retried.

And to protect ourselves from accidental duplicates of the same Pub/Sub message ending up in BigQuery, we use Pub/Sub’s unique eventId as our write ID for the BigQuery insert.
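Putting those two ideas together, here is a minimal sketch of what the write step could look like, assuming the @google-cloud/bigquery client (the names are illustrative; see src/lib/handlers.js for the actual implementation):

// lib/handlers.js -- sketch of an idempotent, retry-friendly BigQuery write
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery({ projectId: process.env.BQ_PROJECT_ID });

async function storeRow(row, eventId) {
  const table = bigquery
    .dataset(process.env.BQ_DATASET_ID)
    .table(process.env.BQ_TABLE_ID);
  try {
    // raw: true lets us pass insertId explicitly; BigQuery uses it to
    // de-duplicate rows if the same Pub/Sub message is delivered twice
    await table.insert([{ insertId: eventId, json: row }], { raw: true });
  } catch (err) {
    console.error('BigQuery insert failed', err);
    throw err; // re-throwing makes Cloud Functions retry the invocation
  }
}

module.exports = { storeRow };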

To learn more about retries, error cases, and designing idempotent Cloud Functions, I suggest these good articles: Retrying Background Functions & Building Idempotent Functions

Now we need to deploy our function to our GCP project, but first let’s again initialise some of the environment variables required for this. We reuse some of the values defined in the earlier parts of this series.

export GCP_PROJECT_ID=YOUR_PROJECT_ID
export GCP_REGION=europe-west1
export PS_TOPIC_ID=device-signals
export BQ_DATASET_ID=devices
export BQ_TABLE_ID=signals

We also need to enable a new API for our project again:

ENABLE_APIS=(
"cloudfunctions.googleapis.com"
)
gcloud services enable --project=$GCP_PROJECT_ID ${ENABLE_APIS[@]}

Next we need to create a Service Account which our function will use when executed. This isn’t strictly necessary, but it is a security best practice: without a custom Service Account, your function runs with the project-wide Editor role, which is usually not what you want. A custom Service Account also allows you to separate your functions and data storage into different GCP projects.

KEY_NAME="pipeline-handlers-sa"gcloud iam service-accounts create $KEY_NAME \
--project=$GCP_PROJECT_ID \
--display-name $KEY_NAME
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
--project=$GCP_PROJECT_ID \
--member serviceAccount:$KEY_NAME@$GCP_PROJECT_ID.iam.gserviceaccount.com \
--role roles/bigquery.dataEditor

Now we are ready to deploy our function. To do so, execute the following command:

gcloud functions deploy deviceSignalsHandler \
--project $GCP_PROJECT_ID \
--runtime nodejs10 \
--region $GCP_REGION \
--service-account pipeline-handlers-sa@$GCP_PROJECT_ID.iam.gserviceaccount.com \
--set-env-vars BQ_PROJECT_ID=$GCP_PROJECT_ID,BQ_DATASET_ID=devices,BQ_TABLE_ID=signals \
--trigger-topic $PS_TOPIC_ID \
--memory=128MB \
--retry

After the deployment has finished, let’s take a look at it in the GCP Console.

And again, in a local terminal, let’s start the simulator to send test data to our Pub/Sub topic:

DEVICE_COUNT=10 node src/index.js

Now let’s open the logs view for the function and start listening for updates.

Back in the GCP Console we should start seeing messages coming in, being processed, and stored to BigQuery successfully.

You can verify that the data really is in BigQuery by running the same query we used in the previous article.

You can also see that if you stop the simulator and keep monitoring your function, the instance count scales back to 0. This is great news: you don’t have to pay for idle functions when you are not receiving any data.

Now, what would happen if our function weren’t deployed while our devices were still sending information?

The answer is that your signals would be lost!

The reason is that Cloud Pub/Sub doesn’t deliver old messages to new subscriptions, and because Cloud Functions creates a new subscription for the function when it is deployed, nothing was listening for your incoming signals before that point.

However, updating your function does not have the same effect, so you can safely update your functions even while data is still streaming in.

There you have it. You now have a Cloud Functions pipeline processing data from your devices all the way to your BigQuery table.

Remember to delete your function so it won’t keep ingesting the messages in the next part of this article series. To do so, stop your simulator and run the following command:

gcloud functions delete deviceSignalsHandler \
--project $GCP_PROJECT_ID \
--region $GCP_REGION

Cloud Functions is priced according to how long your function runs, how many times it’s invoked, and how many resources you provision for the function. If your function makes outbound network requests, there are also additional data transfer fees.

If we imagine a constant load of 10 messages per second, 24 hours a day, 30 days a month (about 26 million invocations), this pipeline would incur an approximate cost of 71 USD/month.

This concludes the third part of this series. In the next part we will discuss how to deploy this same flow using Google Cloud Run.

Thanks again for reading and stay tuned for Part 4.
