BigQuery Utilities for Apache Beam

Cory Tucker · Published in Windfall
3 min read · Dec 29, 2018

At Windfall, our infrastructure is entirely hosted on the Google Cloud Platform. Being a data company, we utilize Cloud Storage and BigQuery extensively in our day-to-day operations. Another tool that we have come to lean on heavily is Google Cloud Dataflow (aka Apache Beam).

We’ve been using Dataflow since the early days of the company, even before it was open sourced as Apache Beam. We use it for many purposes, ranging from relatively straightforward ETL processes that need to scale horizontally, to extremely complicated build processes with many different types of inputs and outputs, and even for training and testing predictive models.

Almost all of the Dataflow jobs we run involve either reading from or writing to BigQuery at some point in the process. One pain point we found was that the out-of-the-box BigQuery APIs require a lot of repeated data transformations and manual table schema management.

Over time we have developed a collection of utilities to make it easier to work with these two tools that we rely on so heavily within the organization. Recently, we made these utilities public and open sourced them in their own GitHub repository.

Transforming BigQuery TableRow Objects

If you are using BigQuery as a data source, then you’re likely using the BigQueryIO class to read the data. The problem then becomes transforming the data from BigQuery into a more useful Java object — no one wants to go passing TableRow objects through all their PTransforms.

We made a really simple PTransform that will convert from TableRow objects into any Java object you like: MapTableRow.

Converting TableRows to something more usable
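As a sketch of the idea — note that a plain `Map<String, Object>` stands in here for BigQuery’s `TableRow`, and the conversion function is the kind of lambda you would hand to `MapTableRow` (its exact factory-method signature is an assumption, not the library’s documented API):

```java
import java.util.Map;

// Sketch: convert a BigQuery row (a plain Map<String, Object> standing in
// for com.google.api.services.bigquery.model.TableRow) into a typed Java
// object, so downstream PTransforms can work with POJOs instead of raw rows.
public class MapTableRowSketch {

    // The POJO we want instead of raw TableRow objects.
    public static class User {
        public final String name;
        public final long score;
        public User(String name, long score) {
            this.name = name;
            this.score = score;
        }
    }

    // The conversion function you would supply to MapTableRow.
    public static User fromRow(Map<String, Object> row) {
        // BigQuery commonly surfaces INTEGER columns as strings in a
        // TableRow, so parse explicitly rather than casting.
        return new User(
            (String) row.get("name"),
            Long.parseLong(String.valueOf(row.get("score"))));
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("name", "ada", "score", "42");
        User u = fromRow(row);
        if (!u.name.equals("ada") || u.score != 42) {
            throw new AssertionError("conversion failed");
        }
        System.out.println(u.name + " " + u.score);
    }
}
```

In a real pipeline the conversion function is applied once, right after the `BigQueryIO` read step, so everything downstream sees only your own types.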

Writing Object Graphs to BigQuery

From my experience, the biggest pain when using BigQuery as a data sink (write destination) is declaring — and maintaining — the table schema definitions.

You added a field? … go change the schema

You changed that field type? … go change the schema

You removed a field? … go change the schema

You forgot to change the schema? … wait potentially hours for your job to get to that step and then fail.

We solved this problem by creating an annotation you can add directly to the Java classes you want to write to BigQuery. Place the annotation on any of the fields or methods and they will automatically get added to the TableSchema.

By using the BigQueryColumn annotation, you won’t need to worry about maintaining your TableSchema definitions any more when writing to BigQuery.
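The annotation name `BigQueryColumn` comes from the library, but the attribute shown and the schema-building code below are an illustrative sketch of the reflection-based approach, not the library’s actual implementation (which produces a real `TableSchema`; here a simple name-to-type map stands in):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: mark fields that should become BigQuery columns, then derive the
// table schema by reflection instead of maintaining it by hand.
public class SchemaSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.METHOD})
    public @interface BigQueryColumn {
        // Optional column-name override (hypothetical attribute).
        String name() default "";
    }

    public static class User {
        @BigQueryColumn String name;
        @BigQueryColumn(name = "total_score") long score;
        String internalOnly; // not annotated, so excluded from the schema
    }

    // Map Java field types to BigQuery column types (simplified).
    static String bqType(Class<?> t) {
        if (t == long.class || t == int.class) return "INTEGER";
        if (t == double.class || t == float.class) return "FLOAT";
        if (t == boolean.class) return "BOOLEAN";
        return "STRING";
    }

    // Build a column-name -> column-type map from the annotated fields.
    public static Map<String, String> buildSchema(Class<?> clazz) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Field f : clazz.getDeclaredFields()) {
            BigQueryColumn col = f.getAnnotation(BigQueryColumn.class);
            if (col == null) continue;
            String name = col.name().isEmpty() ? f.getName() : col.name();
            schema.put(name, bqType(f.getType()));
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(buildSchema(User.class));
    }
}
```

Because the schema is derived from the class at pipeline-construction time, adding, renaming, or removing an annotated field updates the schema automatically — which is exactly the maintenance burden the annotation removes.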

Take a look at the project GitHub page for additional features, more information, and examples. Suggestions for improvements and PRs are most welcome!
