Pub/Sub streaming into BigQuery just got a whole lot easier . . .

This new functionality will simplify a lot of data pipelines

Lee Doolan
Qodea Google Cloud Tech Blog
5 min read · Aug 1, 2022


No more Dataflow . . . ??

I was delighted on Thursday morning to wake up to a busy CTS Data chat space discussing this article, posted by Google Cloud Product Manager Qiqi Wu, about new functionality added to Cloud Pub/Sub.

Essentially it means there is no longer any need for a Dataflow job, Cloud Function, or any other middle-layer service to stream data from Pub/Sub topics into BigQuery.

This is excellent news, but like most new features it needs a closer look.

These are some of the things my colleagues at CTS were discussing . . .

Are there any limitations of this functionality? How does this work with existing tables, or varying schemas? Will this save our customers money? How reliable and performant will it be?

Overview

Without repeating the excellent article too much: you can now set up a direct stream into BigQuery from a Pub/Sub subscription.

From here you can easily select the target project, dataset and table where Pub/Sub will deliver messages.

Source: https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics
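If you'd rather script this than click through the console, a minimal sketch using the google-cloud-pubsub Python client might look something like the below. The project, topic, subscription and table names are just placeholders, and the target table must already exist (more on that in a second).

    # pip install google-cloud-pubsub
    from google.cloud import pubsub_v1

    project_id = "my-project"               # placeholder
    topic_id = "orders-topic"               # placeholder
    subscription_id = "orders-to-bigquery"  # placeholder

    subscriber = pubsub_v1.SubscriberClient()
    topic_path = subscriber.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    # Point the subscription straight at an existing BigQuery table.
    bigquery_config = pubsub_v1.types.BigQueryConfig(
        table=f"{project_id}.my_dataset.orders_raw",  # must already exist
        write_metadata=True,  # also write subscription name, message ID, publish time
    )

    with subscriber:
        subscription = subscriber.create_subscription(
            request={
                "name": subscription_path,
                "topic": topic_path,
                "bigquery_config": bigquery_config,
            }
        )
        print(f"Created BigQuery subscription: {subscription.name}")

The key bit is simply attaching a BigQueryConfig to an otherwise ordinary subscription; the same can be done in the console or with gcloud.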

Some important things to consider . . .

  • The BigQuery table must already exist: You must ensure your target table exists and is configured correctly to receive the incoming stream (see the table-creation sketch further below).
  • Use topic schema / Drop unknown fields: Two options that together determine how the content of each message is written to the table, and what happens to message fields that don’t have corresponding columns in the target table.

Do we try to map the Pub/Sub message fields to columns in the BigQuery table, or dump the whole message into a column called ‘data’?

If matching, we need to ensure the names and data types are in sync. See this guide on schema compatibility, which will help with this.

  • Write metadata: An option to add extra metadata columns to the stream, such as the subscription name, message ID and publish time, as below.

Including the subscription name in the stream is perfect for scenarios where you need multiple subscriptions streaming into a single table, while still remaining aware of each message’s source.

Source: https://cloud.google.com/pubsub/docs/bigquery
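To make the ‘table must exist’ and ‘write metadata’ points a little more concrete, here's a rough sketch of pre-creating a landing table with the google-cloud-bigquery Python client. The dataset and table names are placeholders, and the STRING types chosen for the attributes and data columns are just one of the layouts the documentation allows, so do check the Pub/Sub BigQuery docs for your exact case.

    # pip install google-cloud-bigquery
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    # Landing table for a subscription that does NOT use the topic schema:
    # the payload goes into 'data' and, with write_metadata enabled,
    # Pub/Sub fills the metadata columns below.
    schema = [
        bigquery.SchemaField("subscription_name", "STRING"),
        bigquery.SchemaField("message_id", "STRING"),
        bigquery.SchemaField("publish_time", "TIMESTAMP"),
        bigquery.SchemaField("attributes", "STRING"),
        bigquery.SchemaField("data", "STRING"),
    ]

    table = bigquery.Table("my-project.my_dataset.orders_raw", schema=schema)
    client.create_table(table, exists_ok=True)  # create it before the subscription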

Limitations & Ideas

This addition is great and will really simplify Pub/Sub to BigQuery streaming, but there are some limitations and workarounds to be aware of.

Only use where no transformations are required before landing data in BigQuery
Qiqi Wu’s article indicates that in scenarios where preload transformations are required, such as PII masking, we should still opt for Dataflow. But as a CTS colleague suggests below, there could still be considerable benefit in using this new functionality, skipping Dataflow and making use of existing Google Cloud & BigQuery capabilities.

Really impressive! That could simplify lot of ETLs pipelines I think.

Personally, I would still use this new feature and just dump the raw data (with PII) in a BQ raw table (with a lot of security restrictions on top of it), and have a BQ transformation to remove/mask the PII. So, still no need to have a Dataflow job

Sourygna Luangsay, Cloud Architect @ CTS

I fully agree with this sentiment, and this approach would still follow the raw → clean → curated layered data design, albeit using BigQuery as part of that raw staging area.
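As a sketch of what that BigQuery-only transformation could look like, the snippet below hashes a hypothetical email column while promoting data from a raw to a clean dataset. All the table and column names are invented for illustration.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    # Promote raw -> clean entirely inside BigQuery, masking PII on the way.
    masking_query = """
        CREATE OR REPLACE TABLE `my-project.clean.orders` AS
        SELECT
          order_id,
          amount,
          TO_HEX(SHA256(email)) AS email_hash,  -- mask the PII column
          created_at
        FROM `my-project.raw.orders`
    """
    client.query(masking_query).result()  # wait for the job to complete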

It’s difficult to use where schemas are inconsistent / unknown
In an ideal scenario we’d know the exact format and columns of our incoming data, but even if we don’t, we have the flexibility and tools available to manage this downstream.

We have the option to load messages with inconsistent or unknown schemas into a single BigQuery table, using the aptly named ‘data’ column, and then use the myriad of BigQuery functions to clean, parse and extract attributes from the message body. These could include JSON extraction or regexp functions, extracting into the columns we are expecting, or even the new JSON column data type.
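For example, something like the snippet below could pull a couple of expected fields out of the raw ‘data’ column, assuming the payloads are JSON stored as STRINGs (the column can also be BYTES or JSON). Field and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    # Parse expected fields out of the raw message body.
    parse_query = """
        SELECT
          JSON_EXTRACT_SCALAR(data, '$.order_id') AS order_id,
          SAFE_CAST(JSON_EXTRACT_SCALAR(data, '$.amount') AS FLOAT64) AS amount,
          publish_time
        FROM `my-project.my_dataset.orders_raw`
    """
    for row in client.query(parse_query).result():
        print(row.order_id, row.amount, row.publish_time)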

Do more of the T in ELT using dbt / Dataform
Just a final note: once the data is loaded into BigQuery, even in the raw format highlighted above, we can start to look at using tools such as dbt & Dataform.

These tools are perfect for transforming data already loaded into your BigQuery platform, making it easy to ‘rerun’ data from these raw data stores as and when we learn more about the incoming schemas.

Advantages

This functionality is a great step forward from both a development and a cost perspective, for developers and customers alike.

  • Simpler and easier development: No separate pipeline needs to be developed to ingest data into BigQuery from a Pub/Sub topic. It’s just a simple subscription configuration.
  • Lower costs: No Dataflow requirement means less up-front development and no ongoing Dataflow running and maintenance costs. This could be a fairly significant saving.

From the article at https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics:

“you no longer need to pay for data ingestion into BigQuery when using this new direct method. You only pay for the Pub/Sub you use. Ingestion from Pub/Sub’s BigQuery subscription into BigQuery costs $50/TiB based on read (subscribe throughput) from the subscription.”

From the documentation at https://cloud.google.com/pubsub/docs/bigquery:

“Minimizes monitoring. BigQuery subscriptions are part of the multi-tenant Pub/Sub service and do not require you to run separate monitoring jobs.”

Wrapping Up

I’ve always liked the idea of ingesting data into BigQuery as soon as possible, even before analysts and data scientists know it’s coming. It’s part of being a proactive Data Engineer, working with the tech teams and data producers early in the development cycle.

It really helps any future BI and Analytics requirement gathering with real data already sat there to interrogate. This streaming functionality is a great step towards that.

Also, how easy is it now to just configure a subscription for every topic you have and dump everything into a single BigQuery table!
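Just for fun, here's a rough sketch of that idea: list every topic in a project and give each one a BigQuery subscription pointing at the same landing table. As before, the project, dataset and table names are placeholders, and the landing table would need to exist first with metadata-friendly columns.

    from google.cloud import pubsub_v1

    project_id = "my-project"  # placeholder
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()

    # One BigQuery subscription per topic, all landing in a single table;
    # write_metadata records which subscription each row arrived through.
    with subscriber:
        for topic in publisher.list_topics(request={"project": f"projects/{project_id}"}):
            topic_id = topic.name.split("/")[-1]
            subscriber.create_subscription(
                request={
                    "name": subscriber.subscription_path(project_id, f"{topic_id}-to-bq"),
                    "topic": topic.name,
                    "bigquery_config": pubsub_v1.types.BigQueryConfig(
                        table=f"{project_id}.landing.all_messages",
                        write_metadata=True,
                    ),
                }
            )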

I hope this is useful, and please feel free to drop me a line for ideas on future posts.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!



Lee Doolan
Qodea Google Cloud Tech Blog

Cloud Data Warehouse Architect & Data Engineer | UK Based | https://www.linkedin.com/in/leedoolan77 | Thoughts are my own and not of my employer