IoT Data Pipelines in GCP, multiple ways — Part 1

Jerry Jalava
Google Cloud - Community
8 min read · Oct 9, 2019

Welcome to an introductory series to IoT Data pipelines in Google Cloud Platform.

My goal is to show you some of the different ways to get data from your device and analyse it.

In this series I assume that you already have a basic knowledge of Google Cloud Platform and its tooling. I will also assume that you already have a Google Cloud project with a billing account attached, or have credits available.

As an example we will be using a simplified version of an electric scooter with an IoT device in it.

But when it comes to building these data pipelines in real life, there are multiple things to consider, such as:

  • IoT device constraints / functionality
  • Number of Devices
  • Number of sensors, volume of output, and output intervals
  • Other cost factors

Basically, you have to consider what kind of device and data you are dealing with, what you want to accomplish in the long run, and how much it can cost.

The choices might be different when you have tens or hundreds of devices that send data periodically and aren’t too noisy (the amount of data per minute is low), compared to when you have thousands or even hundreds of thousands of devices that all send you loads of information every second.

In this example series we will focus on a use case where our IoT device sends us a few triggered signals plus periodic position and battery data: the simplest things you need in order to follow the device’s journey.

Things we will not cover in this series (at least for now) are Cloud IoT, Dataprep and Bigtable.

Without further ado, let’s start by preparing our project and setting up some of the common services we will be using among all the different variations.

We will prepare almost everything from the command line using GCP’s command-line tool, gcloud. We will also rely on a handful of environment variables along the way.

All of the examples we run during this series can be found in the support repository: https://github.com/jerryjj/iot-pipelines-series

Let’s start by defining some of the common environment variables we will be relying on

export GCP_PROJECT_ID=[use-your-own-project-id-here]
export GCP_REGION=europe-west1 # Select a region suitable to you

Now let’s enable some APIs we are going to need

ENABLE_APIS=(
"pubsub.googleapis.com" \
"bigquery-json.googleapis.com"
)
gcloud services enable --project=$GCP_PROJECT_ID ${ENABLE_APIS[@]}
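
If you want to double-check which services are now turned on, you can list the enabled services for the project:

gcloud services list --enabled --project=$GCP_PROJECT_ID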

Here we tell GCP which APIs and products we want to use in our project.

Let’s have a look at what they are:

Cloud Pub/Sub

Data ingestion is the foundation for analytics and machine learning, whether you are building stream, batch, or unified pipelines. Cloud Pub/Sub provides a simple and reliable staging location for your event data on its journey towards processing, storage, and analysis.

BigQuery

BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built in.

So what we need to do next is to prepare our Cloud Pub/Sub topic. We will use this topic to publish the data from our IoT devices.

export PS_TOPIC_ID=device-signals

gcloud pubsub topics create $PS_TOPIC_ID \
--project $GCP_PROJECT_ID
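
To verify that the topic was created, you can describe it:

gcloud pubsub topics describe $PS_TOPIC_ID \
--project $GCP_PROJECT_ID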

Now, in these examples we are simplifying things; in real life you might have multiple topics for different events and different handlers for all of those events. For now we are focusing on a simpler use case.

So now we are ready to start receiving data into our pipeline, but we are still missing a few core things before we can move forward:

  • What kind of data will we receive?
  • How should we process that data?
  • Where should we store that data?

So let’s start answering some of those questions.

Firstly we define the data that our IoT device will be sending to our Pub/Sub topic.

Again, to simplify, we will be sending JSON from our device to our topic.

The payloads will have the following properties:

  • “d”: Unique device ID
  • “t”: Epoch timestamp of when the event occurred
  • “i”: Unique ride ID
  • “e”: Name of the event/signal for this particular payload
  • “b”: Device’s battery level as a percentage
  • “l”: Device’s GPS coordinates
  • “p”: Flag telling us whether the device is powered on or not

Here are samples of the different signals we will receive:

{ "d": "0001", "t": "1566465608", "i": "abcd123", "e": "poweron", "b": 100, "l": "61.123456,24.123456", "p": 1 }{ "d": "0001", "t": "1566465638", "i": "abcd123", "e": "gps", "b": 90, "l": "61.323456,24.323456", "p": 1 }{ "d": "0001", "t": "1566465668", "i": "abcd123", "e": "poweroff", "b": 80, "l": "61.423456,24.423456", "p": 0 }{ "d": "0001", "t": "1566465698", "i": null, "e": "gps", "b": 79, "l": "61.423456,24.423456", "p": 0 }{ "d": "0002", "t": "1566465728", "i": null, "e": "alarm", "b": 50, "l": "61.423456,24.423456", "p": 0 }

What we can see from here is that we have 4 different signals we need to be able to handle:

  • PowerOn signals are triggered every time a scooter ride is started.
  • PowerOff signals are triggered every time a scooter ride is ended.
  • GPS update signals are triggered frequently when the device is powered on and less frequently when it is idle.
  • Alarm events are triggered when the device is powered off but its accelerometer is activated (the device is moved or has fallen over).

The device’s current battery level and latest location are included in each message for simplicity in this case. A unique ride ID is also generated for each PowerOn signal and is sent with every signal until a PowerOff event occurs.

Let’s try to send some dummy data and subscribe to the topic.
First, subscribe to the topic by running the following command:

gcloud pubsub subscriptions create test_sub \
--topic $PS_TOPIC_ID \
--project $GCP_PROJECT_ID

Then run the following command to send a dummy event to the topic:

gcloud pubsub topics publish $PS_TOPIC_ID \
--project $GCP_PROJECT_ID \
--message 'hello, pub/sub'
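
You can also publish one of the device signal payloads we defined earlier, for example the PowerOn sample:

gcloud pubsub topics publish $PS_TOPIC_ID \
--project $GCP_PROJECT_ID \
--message '{ "d": "0001", "t": "1566465608", "i": "abcd123", "e": "poweron", "b": 100, "l": "61.123456,24.123456", "p": 1 }'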

Then let’s pull the data from our subscription:

gcloud pubsub subscriptions pull test_sub \
--project $GCP_PROJECT_ID \
--auto-ack

Finally, let’s remove our test subscription:

gcloud pubsub subscriptions delete test_sub \
--project $GCP_PROJECT_ID

So now we know what kind of data to expect from the devices in our Pub/Sub topic, but we still need to figure out how we are going to process and store that data. Before heading to the processing parts, let’s configure our storage.

Setting up BigQuery to be our data warehouse

For BigQuery we need to define our dataset, our table, and the schema for that table.

Let’s start by defining some environment variables again:

export BQ_DATASET_ID=devices
export BQ_TABLE_ID=signals

Then let’s define the schema for our table by creating the following file, table-schema.json.
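
Treat this as a sketch: the column names and types follow what we use later in this article, and the REQUIRED/NULLABLE modes are just one reasonable choice you can adjust to your needs:

[
  { "name": "device_id", "type": "STRING", "mode": "REQUIRED" },
  { "name": "ride_id", "type": "STRING", "mode": "NULLABLE" },
  { "name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED" },
  { "name": "event_name", "type": "STRING", "mode": "REQUIRED" },
  { "name": "latitude", "type": "FLOAT", "mode": "NULLABLE" },
  { "name": "longitude", "type": "FLOAT", "mode": "NULLABLE" },
  { "name": "battery_percentage", "type": "INTEGER", "mode": "NULLABLE" },
  { "name": "power_on_status", "type": "BOOLEAN", "mode": "NULLABLE" }
]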

Now we can create the Dataset and the Table

# Create Dataset in EU region
bq --project_id=$GCP_PROJECT_ID mk --data_location EU $BQ_DATASET_ID
# Create our table using the defined schema
bq --project_id=$GCP_PROJECT_ID mk \
--description="devices signal data" \
--time_partitioning_type=DAY \
--time_partitioning_field=timestamp \
--table $BQ_DATASET_ID.$BQ_TABLE_ID table-schema.json
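
To sanity-check the result, you can print the table’s schema back out:

bq --project_id=$GCP_PROJECT_ID show --schema --format=prettyjson $BQ_DATASET_ID.$BQ_TABLE_ID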

A few things to note here: we defined the table schema in JSON format and gave our fields more descriptive names. As you can see, we also included some fields that are not in our incoming data, `latitude` and `longitude`. We will get to them in the next part of the series.

We also defined the table to be partitioned by day, which will give us benefits in the long run: queries that filter on the partitioning column only scan the partitions they need, which keeps costs and query times down as the table grows.
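
For example, once data is flowing in, a query along these lines (just a sketch against our devices.signals table) restricts the scan to a single day’s partition instead of the whole table:

SELECT device_id, event_name, timestamp
FROM devices.signals
WHERE timestamp >= TIMESTAMP("2019-10-07")
AND timestamp < TIMESTAMP("2019-10-08")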

NOTE: In this article series we predefine our table schema as we know our input.
There is a way to achieve the same things without having any control over the incoming data structure, but that deserves an article of its own.

Now, before we move on to processing our data and storing it in BigQuery, let’s test that we can answer some of the questions we will want to ask of our data later on.

To do that, we initialise some dummy data that fits our table’s schema and run some test queries against it.

Common SQL for the following BigQuery snippets
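
Here is a sketch of that common SQL, assuming CTE names that match the queries below (dummy_data for the raw rows, plus starts, ends and location_updates filtered by event name) and reusing the same dummy rows as the Geo Viz example further down:

WITH dummy_data AS (
SELECT * FROM unnest(
array<struct<device_id string, ride_id string, timestamp TIMESTAMP, event_name string, latitude float64, longitude float64, battery_percentage int64, power_on_status bool>>[
("0001", "123456", TIMESTAMP("2019-10-07 13:28:30.000 UTC"), "poweron", 60.1696993, 24.9294322, 88, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:29:00.000 UTC"), "gps", 60.16962, 24.9288, 86, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:29:30.000 UTC"), "gps", 60.16958, 24.92813, 84, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:30:00.000 UTC"), "gps", 60.16969, 24.92074, 82, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:30:30.000 UTC"), "poweroff", 60.1680235, 24.9222142, 81, false),
("0002", "123457", TIMESTAMP("2019-10-07 13:29:00.000 UTC"), "poweron", 60.1796993, 24.9394322, 20, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:29:30.000 UTC"), "gps", 60.17962, 24.9388, 18, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:30:00.000 UTC"), "gps", 60.17958, 24.93813, 14, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:30:30.000 UTC"), "gps", 60.17969, 24.93074, 10, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:32:00.000 UTC"), "poweroff", 60.1780235, 24.9322142, 4, false)
]
)
),
starts AS (
SELECT device_id, ride_id AS rid, timestamp, latitude AS lat, longitude AS lng, battery_percentage AS bttr
FROM dummy_data WHERE event_name = "poweron"
),
ends AS (
SELECT device_id, ride_id AS rid, timestamp, latitude AS lat, longitude AS lng, battery_percentage AS bttr
FROM dummy_data WHERE event_name = "poweroff"
),
location_updates AS (
SELECT device_id, ride_id AS rid, timestamp, latitude AS lat, longitude AS lng, battery_percentage AS bttr
FROM dummy_data WHERE event_name = "gps"
)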

So we want to get answers to at least the following questions at any point in time:

  • How many completed rides have our devices been on?
  • What kind of journeys have the devices been on (with a location trace)?
  • How many devices have a low battery?
  • How many unique devices are there?
  • etc.

Next we will write some queries to answer these questions. I will not go too deep into explaining the BigQuery query language at this point, but as you will see, it is just standard SQL.

We are running the queries from our project’s BigQuery dashboard so we can examine the results.

To run these queries yourself, copy-paste the common SQL above, attach the following query after it, and then run the combined query from the dashboard.

Question 1: How many completed rides do we have?

SELECT
ANY_VALUE(starts.device_id) AS device,
starts.rid,
MIN(starts.timestamp) AS start_time,
ANY_VALUE(starts.lat) AS start_lat,
ANY_VALUE(starts.lng) AS start_lng,
MAX(ends.timestamp) AS end_time,
ANY_VALUE(ends.lat) AS end_lat,
ANY_VALUE(ends.lng) AS end_lng,
TIMESTAMP_DIFF(MAX(ends.timestamp), MIN(starts.timestamp), SECOND) AS duration_secs,
(MAX(starts.bttr) - MIN(ends.bttr)) AS battery_usage_percent
FROM starts, ends
LEFT JOIN location_updates ON location_updates.rid = starts.rid
WHERE
ends.rid = starts.rid
AND location_updates.rid = starts.rid
GROUP BY rid

Here are the results of the above query

Question 2: What is the battery level of a certain device, and when was it last seen?

SELECT
device_id, timestamp AS last_seen, battery_percentage
FROM dummy_data
WHERE
device_id = '0001'
ORDER BY timestamp DESC
LIMIT 1

Question 3: How many unique devices have we received a signal from in a certain month (October)?

,params AS (
SELECT
'2019-10' AS search_month
)
SELECT
COUNT(DISTINCT(device_id)) AS devices
FROM params, dummy_data
WHERE
FORMAT_TIMESTAMP('%Y-%m', timestamp) = params.search_month

To test a simple visualisation, let’s use a tool called BigQuery Geo Viz. Open a new browser tab to https://bigquerygeoviz.appspot.com/ and paste in the following query:

WITH dummy_data AS (
SELECT * FROM unnest(
array<struct<device_id string, ride_id string, timestamp TIMESTAMP, event_name string, latitude float64, longitude float64, battery_percentage int64, power_on_status bool>>[
("0001", "123456", TIMESTAMP("2019-10-07 13:28:30.000 UTC"), "poweron", 60.1696993, 24.9294322, 88, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:29:00.000 UTC"), "gps", 60.16962, 24.9288, 86, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:29:30.000 UTC"), "gps", 60.16958, 24.92813, 84, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:30:00.000 UTC"), "gps", 60.16969, 24.92074, 82, true),
("0001", "123456", TIMESTAMP("2019-10-07 13:30:30.000 UTC"), "poweroff", 60.1680235, 24.9222142, 81, false),
("0002", "123457", TIMESTAMP("2019-10-07 13:29:00.000 UTC"), "poweron", 60.1796993, 24.9394322, 20, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:29:30.000 UTC"), "gps", 60.17962, 24.9388, 18, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:30:00.000 UTC"), "gps", 60.17958, 24.93813, 14, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:30:30.000 UTC"), "gps", 60.17969, 24.93074, 10, true),
("0002", "123457", TIMESTAMP("2019-10-07 13:32:00.000 UTC"), "poweroff", 60.1780235, 24.9322142, 4, false)
]
)
)
SELECT
FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", timestamp) as ts, event_name, device_id, ST_GeogPoint(longitude, latitude) AS wkt
FROM dummy_data
WHERE
ride_id = "123456"
ORDER BY timestamp DESC

Then click the “Run” and “See Results” buttons and you will be redirected to review the results. The defaults should be fine (“Geometry column” should have “wkt” as its value), so go ahead and click “Add styles”. Here we want to distinguish our waypoints with different colours so we know what the points on the map mean. To do this, change the “fillColor” settings to match something like this:

Coloured visualisation of a single ride

Now we have a visualisation of a single ride. How cool is that?

This concludes the first part of this series. Next up, we start to actually ingest the data, transform it, and store it in our BigQuery table.

Thank you for reading and here is the link to Part 2.
