Batch Dataflow Pipeline

In the second part of this series, we will initialize a sample PostgreSQL database that will act as our data source from which we will read rows and migrate them to BigQuery.

If you haven’t read Part I yet, you can find it here.

Data Source — Chinook Database

Throughout this series, I’ll be using the Chinook database, more specifically, the PostgreSQL version, as the data source. You can find database scripts for different database vendors here.

Below is the ERD (Entity Relationship Diagram) of the tables that we will be interacting with.

BigQuery Database

The following are the dimension and fact tables we will create under BigQuery.

Dimensions

- d_tracks
- d_customers
- d_dates

Facts

- f_invoices

As you might have noticed, there are fewer tables under BigQuery than under PostgreSQL, and the column naming convention has changed from PascalCase to snake_case. This isn’t a hard rule; I’m intentionally doing it to demonstrate the more complex transformation logic that will be executed inside our batch pipelines.
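To make the PascalCase-to-snake_case renaming concrete, here is a small, self-contained helper. This is only an illustrative sketch (the class and method names are mine, not from the series’ code):

```java
public class NamingConventions {

    /** Converts a PascalCase column name (e.g. "AlbumId") to snake_case ("album_id"). */
    public static String toSnakeCase(String pascal) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pascal.length(); i++) {
            char c = pascal.charAt(i);
            if (Character.isUpperCase(c)) {
                // Insert an underscore before every uppercase letter except the first
                if (i > 0) sb.append('_');
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, Chinook’s `AlbumId` becomes `album_id` and `InvoiceLineId` becomes `invoice_line_id`.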

Note: The d_dates dimension I’ll be using is based on this gist. To import it into BigQuery, simply follow these steps:

- Run the SQL script against any PostgreSQL database
- Export the table data as CSV
- Upload the CSV to Google Cloud Storage
- Create a new BigQuery table from the uploaded CSV (BigQuery can auto-detect the schema for you)

If you don’t have PostgreSQL installed locally, you can download the CSV file here.
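The steps above can be sketched on the command line as follows. The database name, table name, bucket, and dataset are placeholders; substitute your own (and note that steps 3 and 4 require an authenticated Google Cloud SDK):

```shell
# 1. Build the date dimension in any local PostgreSQL database
psql -d scratch -f date_dimension.sql

# 2. Export the table data as CSV (table name assumed to be d_dates)
psql -d scratch -c "\copy d_dates TO 'd_dates.csv' WITH (FORMAT csv, HEADER true)"

# 3. Upload the CSV to a Cloud Storage bucket you own
gsutil cp d_dates.csv gs://your-bucket/d_dates.csv

# 4. Create the BigQuery table, letting BigQuery auto-detect the schema
bq load --autodetect --skip_leading_rows=1 --source_format=CSV \
    your_dataset.d_dates gs://your-bucket/d_dates.csv
```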

BigQuery ERD

Note: There’s no notion of foreign keys in BigQuery

Batch Pipeline

Now that our data source is ready, and we have a clear idea of what the data will look like under BigQuery, let’s see some code.

Let’s go through the required Java libraries and the utility classes that will be used inside our pipeline.

Java Libraries

build.gradle
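The original build file isn’t reproduced here; the fragment below sketches the kind of dependencies such a pipeline typically needs (artifact versions are illustrative, so pin them to whatever your project actually uses):

```groovy
dependencies {
    // Apache Beam core SDK and the Dataflow runner
    implementation 'org.apache.beam:beam-sdks-java-core:2.46.0'
    implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.46.0'

    // I/O connectors: BigQuery (GCP) and JDBC for reading from PostgreSQL
    implementation 'org.apache.beam:beam-sdks-java-io-google-cloud-platform:2.46.0'
    implementation 'org.apache.beam:beam-sdks-java-io-jdbc:2.46.0'

    // PostgreSQL JDBC driver
    implementation 'org.postgresql:postgresql:42.6.0'
}
```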

AppOptions

A custom DataflowPipelineOptions interface. Here we define pipeline options specific to our app (e.g., PostgreSQL host, username, etc.).

AppOptions.java
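Since the gist isn’t embedded here, the sketch below shows what such an options interface might look like in Beam; the option names are illustrative, not the exact gist contents:

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.Validation;

/** Custom pipeline options, populated from --postgresHost=... style flags. */
public interface AppOptions extends DataflowPipelineOptions {

    @Description("PostgreSQL host")
    @Validation.Required
    String getPostgresHost();
    void setPostgresHost(String value);

    @Description("PostgreSQL database name")
    String getPostgresDatabase();
    void setPostgresDatabase(String value);

    @Description("PostgreSQL username")
    String getPostgresUsername();
    void setPostgresUsername(String value);

    @Description("PostgreSQL password")
    String getPostgresPassword();
    void setPostgresPassword(String value);
}
```

Beam generates the implementation at runtime, so the interface only declares getter/setter pairs.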

TableSchemaBuilder

BigQuery table schema builder. This class abstracts the methods used to define a BigQuery table schema.

TableSchemaBuilder.java
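A hedged sketch of what this builder might look like (method names are my own; the gist’s API may differ), using the `TableSchema`/`TableFieldSchema` model classes from the BigQuery client library:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.ArrayList;
import java.util.List;

/** Fluent builder that hides the verbose TableFieldSchema plumbing. */
public class TableSchemaBuilder {

    private final List<TableFieldSchema> fields = new ArrayList<>();

    public static TableSchemaBuilder create() {
        return new TableSchemaBuilder();
    }

    public TableSchemaBuilder addField(String name, String type, String mode) {
        fields.add(new TableFieldSchema().setName(name).setType(type).setMode(mode));
        return this;
    }

    public TableSchemaBuilder required(String name, String type) {
        return addField(name, type, "REQUIRED");
    }

    public TableSchemaBuilder nullable(String name, String type) {
        return addField(name, type, "NULLABLE");
    }

    public TableSchema build() {
        return new TableSchema().setFields(fields);
    }
}
```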

TableRowMapper

Maps a HashMap to a BigQuery TableRow.

TableRowMapper.java
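A minimal sketch of such a mapper, assuming each PostgreSQL row arrives as a `HashMap` of column name to value (the class shape is illustrative):

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.transforms.SerializableFunction;

/** Copies every column/value pair from the HashMap into a BigQuery TableRow. */
public class TableRowMapper
        implements SerializableFunction<HashMap<String, Object>, TableRow> {

    @Override
    public TableRow apply(HashMap<String, Object> input) {
        TableRow row = new TableRow();
        for (Map.Entry<String, Object> entry : input.entrySet()) {
            row.set(entry.getKey(), entry.getValue());
        }
        return row;
    }
}
```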

AlbumTableSchema

A class describing the BigQuery table schema that will store Albums.

AlbumTableSchema.java
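As a sketch, the schema would cover Chinook’s Album columns (AlbumId, Title, ArtistId), renamed to snake_case per the convention above; the exact class layout is an assumption:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;

/** BigQuery schema for the albums table. */
public class AlbumTableSchema {

    public static TableSchema schema() {
        return new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("album_id").setType("INTEGER").setMode("REQUIRED"),
            new TableFieldSchema().setName("title").setType("STRING").setMode("REQUIRED"),
            new TableFieldSchema().setName("artist_id").setType("INTEGER").setMode("REQUIRED")));
    }
}
```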

TableRowCoder

A Coder for the HashMap representation of a PostgreSQL table row, so Beam can serialize these rows between pipeline stages.

TableRowCoder.java
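One way such a coder might be sketched is by delegating to Java serialization via Beam’s `SerializableCoder` (an assumption on my part; it works only if every column value is `Serializable`):

```java
import com.google.api.services.bigquery.model.TableRow; // unused here; rows are plain HashMaps
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.SerializableCoder;

/** Encodes/decodes a HashMap-based PostgreSQL row for shuffling between stages. */
public class TableRowCoder extends CustomCoder<HashMap<String, Object>> {

    // HashMap implements Serializable, so Java serialization is a workable fallback.
    private static final SerializableCoder<HashMap> DELEGATE =
            SerializableCoder.of(HashMap.class);

    public static TableRowCoder of() {
        return new TableRowCoder();
    }

    @Override
    public void encode(HashMap<String, Object> value, OutputStream outStream)
            throws IOException {
        DELEGATE.encode(value, outStream);
    }

    @Override
    @SuppressWarnings("unchecked")
    public HashMap<String, Object> decode(InputStream inStream) throws IOException {
        return (HashMap<String, Object>) DELEGATE.decode(inStream);
    }
}
```

Java serialization is not the most compact encoding; a hand-rolled coder per column type would be faster, but this keeps the sketch short.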

In the next part of this series, we will build our first batch pipeline to backfill BigQuery albums table using data stored under PostgreSQL.
