Historical load into an ingestion time partitioned BigQuery table using Dataflow

Sushil Kumar
Published in Analytics Vidhya
6 min read · Dec 7, 2019

1. Introduction

In my current organization, we rely heavily on Google Cloud’s services to run our high-scale data pipelines. We use PubSub to ingest close to 120k events per second and store them in BigQuery for analytical and reporting needs. We ingest click stream data from our mobile apps and website and make that data available for querying via BigQuery with as low latency as possible. Click stream data is the best indicator of user behavior and is vital for several downstream processes.

We receive JSON events from our mobile and web SDKs, and historically we used to store complex data types (arrays and nested objects) as strings in BigQuery columns. Owing to this, all the downstream processes had to parse these column values, either within BigQuery or in the application, to fetch the values they were interested in. This caused a lot of duplication of code and effort among the different teams that relied on this data. Storing these complex fields as strings also meant there was no validation on the values inside them, which made it a data quality issue as well.

Recently we decided to leverage BigQuery’s RECORD and REPEATED types to store these values and retain the correct shape of the data. We tweaked our Dataflow pipelines to start parsing the complex types and push them to BigQuery. Since the new schema was not compatible with the old tables, we had to create new downstream tables with the new schema.

After implementing this change, the challenge was to batch load the historical data into these newer tables. All of our tables in BigQuery are ingestion time partitioned, and batch loading historical data into the newer tables would mean either losing our partition information or loading the data partition by partition using table decorators. We couldn’t afford to mess with the partition time, as many of our downstream processes are sensitive to data being in a certain partition, hence we decided to take the other route.

While I was implementing this one-off batch pipeline, I came across a nifty little API in Beam called SerializableFunction, which lets us return a custom table name based on the records in the current window. Using this feature meant I could load the entire historical data in one go without looping over partitions.

In this post I’m going to share how to use SerializableFunction to load data into the correct partitions of a BigQuery table.

2. Setup

For the purpose of this post I’m going to set up two tables (1 historical and 1 new). The historical table will have JSON data stored as a string, while the new table will have RECORD and REPEATED type columns. Both of these tables will be ingestion time partitioned.

Historical Table

We’ll be using the BigQuery console to create the table. You can use the bq command-line tool or the API to create the table as well.

Historical Table creation

Our historical table has the following columns:

id: Unique ID to identify the customer.

attributes: Attributes about the customer, e.g. name, age etc.

client_timestamp: Timestamp when the event happened on the device.

We’ll use client_timestamp to identify the correct partition in the new table.
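If you prefer to create the table programmatically instead of through the console, a minimal sketch with the google-cloud-bigquery Java client could look like the following. The class name is mine, and the dataset name test_dataset matches the queries used later in this post.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;

public class CreateHistoricalTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Schema of the historical table: complex attributes kept as a plain STRING
    Schema schema = Schema.of(
        Field.of("id", LegacySQLTypeName.STRING),
        Field.of("attributes", LegacySQLTypeName.STRING),
        Field.of("client_timestamp", LegacySQLTypeName.TIMESTAMP));

    // Day-based ingestion time partitioning (no partition column specified)
    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
        .setSchema(schema)
        .setTimePartitioning(TimePartitioning.of(TimePartitioning.Type.DAY))
        .build();

    bigquery.create(TableInfo.of(TableId.of("test_dataset", "historical_table"), definition));
    System.out.println("historical_table created");
  }
}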

Next we’ll insert some historical data into this table. Currently, inserting data using table decorators is not supported in the UI, so we’ll write a small Python script to insert data into our historical table.

Install the BigQuery client library using pip.

pip install google-cloud-bigquery

Then execute the following Python script to load 2 records into two different partitions.

Run the script and you should see “New Rows have been added.” in the output.
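If you’d rather do the insert through the API directly, here is a rough Java-client equivalent of that script. This is only a sketch: the class name, sample values and partition dates are my own. The $ decorator appended to the table ID is what routes each row into a specific daily partition.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.HashMap;
import java.util.Map;

public class InsertHistoricalRows {

  static void insertInto(BigQuery bigquery, String partition, String id,
                         String attributes, String clientTimestamp) {
    // The $ decorator targets a specific daily partition of the ingestion time partitioned table
    TableId tableId = TableId.of("test_dataset", "historical_table$" + partition);

    Map<String, Object> row = new HashMap<>();
    row.put("id", id);
    row.put("attributes", attributes);            // complex fields still stored as a JSON string
    row.put("client_timestamp", clientTimestamp);

    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(tableId).addRow(row).build());
    System.out.println(response.hasErrors()
        ? response.getInsertErrors() : "New Rows have been added.");
  }

  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Sample rows, one per partition (values are illustrative)
    insertInto(bigquery, "20191130", "1", "{\"name\": \"Alice\", \"age\": 30}", "2019-11-30 10:00:00");
    insertInto(bigquery, "20191201", "2", "{\"name\": \"Bob\", \"age\": 25}", "2019-12-01 11:00:00");
  }
}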

Once inserted, double-check that the data has landed in the correct partitions by executing the following SQL from the console.

SELECT * FROM `just-landing-231706.test_dataset.historical_table` WHERE DATE(_PARTITIONTIME) = '2019-11-30'

SELECT * FROM `just-landing-231706.test_dataset.historical_table` WHERE DATE(_PARTITIONTIME) = '2019-12-01'

You should see 1 row returned by each query. This confirms that our data has been inserted into the correct partitions.

It might take some time for your rows to show up in the correct partition. Streaming inserts are first put into a buffer before being moved to the correct partition.

New Table

Let us create our new table now.

New table with RECORD type column

The new table has the attributes column as a RECORD type with name and age as fields.
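Creating this table through the API differs from the historical one only in the schema. A sketch with the Java client, assuming the same dataset as before:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;

public class CreateNewTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // attributes is now a RECORD with typed sub-fields instead of a JSON string
    Field attributes = Field.of("attributes", LegacySQLTypeName.RECORD,
        Field.of("name", LegacySQLTypeName.STRING),
        Field.of("age", LegacySQLTypeName.INTEGER));

    Schema schema = Schema.of(
        Field.of("id", LegacySQLTypeName.STRING),
        attributes,
        Field.of("client_timestamp", LegacySQLTypeName.TIMESTAMP));

    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
        .setSchema(schema)
        .setTimePartitioning(TimePartitioning.of(TimePartitioning.Type.DAY))
        .build();

    bigquery.create(TableInfo.of(TableId.of("test_dataset", "new_table"), definition));
    System.out.println("new_table created");
  }
}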

3. Dataflow Job

With the table setup out of the way, let’s write our Dataflow job to load data from historical_table into new_table.

If you want to set up your system for Dataflow development and job submission, please refer to my earlier blog post.

You can clone the Dataflow job from my repo. I’ll walk you through the pieces of code one by one. Do keep in mind to change the project and dataset variables to your values.

The main section that you need to pay attention to is the anonymous implementation of the SerializableFunction interface passed to the BigQueryIO.write().to() call.

We need to override the apply() method of SerializableFunction and return a TableDestination object. What we do in this function is fetch the client_timestamp from the TableRow object and build the tableName using the table decorator $ and the correct partition value. So a sample value of tableName would be projectID:dataset.table_name$partition_date. Using this table name we return an instance of TableDestination.
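In code, that function looks roughly like the snippet below. This is a sketch rather than the exact code from the repo: the class name, the project placeholder, and the timestamp parsing are assumptions you’d adapt to your own data.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Routes every row to the partition derived from its client_timestamp
class PartitionFromClientTimestamp
    implements SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> {

  @Override
  public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    TableRow row = input.getValue();

    // Assuming client_timestamp arrives as an ISO-like string ("2019-11-30 10:00:00");
    // adapt this parsing to whatever representation your read step actually produces.
    String timestamp = (String) row.get("client_timestamp");
    String partition = timestamp.substring(0, 10).replace("-", ""); // yyyyMMdd

    // projectID:dataset.table_name$partition_date
    String tableName = "<YOUR-PROJECT-ID>:test_dataset.new_table$" + partition;
    return new TableDestination(tableName, "Historical row routed by client_timestamp");
  }
}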

Another important piece of code is the call to withFormatFunction. This function also takes a SerializableFunction as input, but instead of a TableDestination it returns a TableRow.

Here we fetch all the fields of historical_table, parse the JSON in the attributes field, and create an output table row with name and age as fields of a RECORD.
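A sketch of that format function, plus where the two functions plug into the write, is shown below. Again this is an approximation of the repo’s code: the class name is mine, org.json is an assumption (use whichever JSON library your pipeline already has), and historicalRows stands for the PCollection<TableRow> read from historical_table.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.json.JSONObject;

// Turns a historical row (attributes as a JSON string) into the new shape (attributes as a RECORD)
class ToNestedTableRow implements SerializableFunction<TableRow, TableRow> {

  @Override
  public TableRow apply(TableRow input) {
    JSONObject attrs = new JSONObject((String) input.get("attributes"));

    TableRow attributes = new TableRow()
        .set("name", attrs.optString("name"))
        .set("age", attrs.optInt("age"));

    return new TableRow()
        .set("id", input.get("id"))
        .set("client_timestamp", input.get("client_timestamp"))
        .set("attributes", attributes);
  }
}

// Wiring it together inside the pipeline (historicalRows is the PCollection<TableRow>
// read from historical_table):
//
// historicalRows.apply("WriteToNewTable", BigQueryIO.<TableRow>write()
//     .to(new PartitionFromClientTimestamp())
//     .withFormatFunction(new ToNestedTableRow())
//     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
//     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));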

Run this program by passing in the following program arguments.

--tempLocation=gs://<YOUR-BUCKET>/temp/ --project=<YOUR-PROJECT-ID>

The program should successfully execute.

Try querying new_table with a specific _PARTITIONTIME. You should see both rows in their correct partitions.

Record in Partition-1
Record in Partition-2

And there you have it. We successfully uploaded historical data into the correct partitions even though our table was ingestion time partitioned.

4. Conclusion

SerializableFunction is a powerful construct in BigQueryIO and is useful when you wish to route your rows to different BigQuery tables. However, one thing to note is that all of those tables need to have the same schema. If you wish to write rows into different tables with different schemas, there is a different construct named DynamicDestinations. With DynamicDestinations you don’t just return the table name but also a schema for the table. You can read more about them here, and I’ll try to cover them in an upcoming post as we use them quite heavily in our data pipelines.

If you find any error in my code or have any questions or suggestions, please feel free to drop a comment below.

Till then, Happy Coding! :)
