Running SQL queries on Google Pub/Sub (streaming) data

David · DoiT International · May 28, 2020

Would you like to query the data flowing through Google Pub/Sub with plain SQL? How about creating these queries directly from the Google BigQuery UI, without writing any custom Java or Python Dataflow SDK code and without worrying about deployment?

Step right up and try Google’s new feature for Dataflow: Dataflow SQL.

Setup

Let’s go through some steps to create a very simple Dataflow SQL pipeline.

To get started we’ll need to create a few things:

  • A user account with appropriate IAM permissions
  • A BigQuery dataset to create tables in
  • A BigQuery table to store data in
  • A Google Cloud Pub/Sub topic for data ingestion

Create a dataset:
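For reference, the dataset can be created with the bq command-line tool; the dataset name below is a placeholder:

    # Create a BigQuery dataset to hold the output table (name is a placeholder).
    bq mk --dataset dataflow_sql_dataset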

Create a table in that dataset:
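Based on the rest of the walkthrough, the destination table is called test and receives a single field named “a”, so a minimal sketch of the table creation would be:

    # Create the destination table; the single STRING column "a" is an
    # assumption based on the field the pipeline writes later.
    bq mk --table dataflow_sql_dataset.test a:STRING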

Create a Pub/Sub topic:
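Topic creation is a one-liner with gcloud (the topic name test-topic is a placeholder):

    # Create the Pub/Sub topic that will receive the incoming JSON messages.
    gcloud pubsub topics create test-topic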

Now that we have everything created, go ahead and switch over to the new Dataflow SQL engine.

To switch over, go to Query settings:

Select Cloud Dataflow engine:

Observe the new resources available:

In order to query the Google Cloud Pub/Sub topic we created, we need to assign it a schema. Creating one is simple enough: three string fields, plus the event timestamp that Google Cloud Pub/Sub adds to every message by default.

Create a schema for the topic:
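The schema is assigned through the BigQuery UI here, but roughly the same thing can be done from the command line by writing the schema to a file and updating the topic’s Data Catalog entry. The field names b and c, the project ID, and the topic name below are placeholders:

    # Three STRING payload fields plus the event_timestamp that Pub/Sub
    # attaches to every message (field names b and c are placeholders).
    cat > topic_schema.yaml <<'EOF'
    - column: event_timestamp
      description: Pub/Sub event timestamp
      mode: REQUIRED
      type: TIMESTAMP
    - column: a
      description: First payload field
      mode: NULLABLE
      type: STRING
    - column: b
      description: Second payload field
      mode: NULLABLE
      type: STRING
    - column: c
      description: Third payload field
      mode: NULLABLE
      type: STRING
    EOF

    # Attach the schema to the topic's Data Catalog entry.
    gcloud data-catalog entries update \
      --lookup-entry='pubsub.topic.`my-project`.`test-topic`' \
      --schema-from-file=topic_schema.yaml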

Now it is time to create a Dataflow SQL job.

Job Creation

Create a simple SQL statement, which will run as a Dataflow job.

Click “Create Cloud Dataflow job”. Now we can specify some output locations. This is a super helpful feature that lets us write the results out to a destination as they are processed, giving us the full power of an ETL process.
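Everything in this walkthrough happens in the BigQuery UI, but for reference, roughly the same pipeline can be launched in a single gcloud command. The job name, region, project, and topic below are placeholders, the dataset and table come from the earlier steps, and the query itself simply keeps field “a”:

    # Submit the SQL statement as a streaming Dataflow job, writing results
    # into the BigQuery table created earlier (names are placeholders).
    gcloud dataflow sql query \
      'SELECT t.a FROM pubsub.topic.`my-project`.`test-topic` AS t' \
      --job-name=pubsub-sql-test \
      --region=us-central1 \
      --bigquery-dataset=dataflow_sql_dataset \
      --bigquery-table=test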

Let's have a look at the Dataflow DAG we created from SQL. Go to job history in BigQuery:

Click on Job ID and then the DAG can be seen:

It is a very simple one: it applies the single transform and then writes into a BigQuery table via streaming inserts. Simple yet powerful, an ideal Dataflow job.

Play with our streaming pipeline

From here we can test our newly created pipeline. Let’s put some dummy JSON data into our topic and see what our pipeline does.
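A single test message can be published with gcloud; the payload keys mirror the schema fields, and the values are arbitrary:

    # Publish one dummy JSON message to the topic.
    gcloud pubsub topics publish test-topic \
      --message='{"a": "hello", "b": "world", "c": "!"}'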

The pipeline we created should take this JSON data, and place only the JSON field “a” into our table test.
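Once the Dataflow job has spun up its workers (this can take a few minutes), the result can be checked with a quick query against the destination table:

    # Inspect what the pipeline wrote; only field "a" from each message
    # should have been kept.
    bq query --use_legacy_sql=false \
      'SELECT * FROM dataflow_sql_dataset.test LIMIT 10'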

Success! We have filtered incoming data from Google Cloud Pub/Sub into a table in BigQuery without ever having to leave BigQuery — while writing it all in SQL.

This illustrates how easy it is to set up and deploy Dataflow pipelines with Standard SQL syntax. We have only run a very simple query, but more complex ones are possible.

Originally published at https://fluxengine.ltd.

Consultant, Data Overlord, Physicist, and Director of Flux Engine LTD.