Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 2: Setting up BigQuery Transfer Service and Scheduled Query.

Muhammad Izzuddin
TheLorry Data, Tech & Product
Jan 2, 2021


Welcome to Part 2 of Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery. You can check out Part 1, which covers setting up the batch extraction with AWS Data Pipeline, before you begin this part.

As described in Part 1, our approach for this batch data pipeline is ELT: we Extract data from AWS RDS, Load it into BigQuery, and Transform it inside BigQuery using SQL.

The key point to note in the ELT process is that raw data is transformed inside the data warehouse without the need for a staging server; your data warehouse now contains both raw data and transformed data. We prefer our pipelines this way at TheLorry. The Analytics Setup Guidebook is a great resource if you’re interested in a deeper dive into these and other core data analytics/engineering concepts.

Let’s get started with setting up our pipeline for BigQuery. To build the pipeline inside BigQuery, there are two main steps:

  1. Transfer data from AWS S3 into BigQuery (on a daily basis).
  2. Schedule a daily query that joins the different tables (fact and dimension tables) and appends the result to a single denormalized table.

Before you begin, make sure that you have access to BigQuery. You can refer to the BigQuery documentation regarding access control.

In terms of cost, BigQuery is pretty generous: you get around 10 GB of data storage and 1 TB of on-demand ad-hoc queries per month for free. You can refer to this pricing structure for comprehensive details on how BigQuery charges.

1. BigQuery Transfer Service

Let's focus on building our batch data pipeline inside GCP’s BigQuery.

Inside the BigQuery console, click ‘Transfers’.

Click ‘Create Transfer’.

Choose Amazon S3 as your data source.

Set the name of your transfer config and then set the schedule option. At the time of writing, the BigQuery Transfer Service UI only allows daily extraction (no hourly or minute-level transfers are available yet).

Select the destination dataset ID, which you can create manually inside the BigQuery UI, then select your ‘Destination table’. Note: you need to create the destination table beforehand with a schema and field formats that match the tables you want to extract from S3. Any schema or field-format mismatch will cause the transfer to fail.
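
As a rough illustration of such a destination table, assuming a hypothetical orders export with just a few columns (your column names and types must mirror the files exported from RDS), the DDL could look like this:

  CREATE TABLE `your_project.raw_dataset.orders` (
    order_id INT64,
    customer_id INT64,
    amount NUMERIC,
    created_at TIMESTAMP
  );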

Key in your S3 data URL inside ‘Amazon S3 URI’.
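
For example, a URI along these lines (bucket and path are hypothetical) would pick up every exported file under that prefix; a single wildcard is supported:

  s3://your-rds-export-bucket/orders/*.csv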

Insert your ‘AWS access key ID’ and ‘Secret access key’ (you can refer to this guide on how to retrieve them).

You can configure the number of errors allowed for your transfer under the transfer options. In our case, we wanted all the data, so we set ‘Number of errors allowed’ to 0.

For transfer failure notifications, you can either set an email notification or be notified through a Pub/Sub topic. Once finished, click ‘Save’.

Here is a sample of the BigQuery Transfer Service dashboard after you create the transfer.

If you face any problem with your data transfer, you can check the issue from the logs panel. Here is an example of a problematic transfer.

To diagnose the problem further, you can open Cloud Shell and key in ‘bq show -j <your job id>’. It will show you detailed error information.

If you set ‘Number of errors allowed’ or ‘Ignore unknown values’ in the Transfer Service configuration, you can also see how many problematic rows were skipped inside the transfer log.

2. Scheduled SQL Query

We will create a scheduled query that joins table1 and table2 to create the final denormalized table. Attached here is a sample of the query.
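
A minimal sketch of such a join, using the hypothetical fact and dimension columns from the earlier example (not the actual TheLorry schema), might look like this:

  SELECT
    f.order_id,
    f.customer_id,
    f.amount,
    f.created_at,
    d.customer_name,
    d.customer_city
  FROM `your_project.raw_dataset.table1` AS f   -- fact table loaded from S3
  JOIN `your_project.raw_dataset.table2` AS d   -- dimension table loaded from S3
    ON f.customer_id = d.customer_id
  -- optional: restrict the join to the newly loaded day before appending
  WHERE DATE(f.created_at) = CURRENT_DATE();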

Save this query, click ‘Scheduled Query’, and then click ‘Create new scheduled query’.

Name the scheduled query. Choose the schedule options (again, at the time of writing, BigQuery scheduled queries only go down to a daily frequency). Choose ‘Scheduled start time’ to set when your scheduled query starts, and choose ‘End never’ if you want it to run indefinitely.

Set the destination table for the scheduled query and choose ‘Append to table’, so that the newly extracted data from S3 is appended to the final denormalized_table each day. You can also partition the table on your DateTime column to improve query performance; you can read about BigQuery partitioned tables here.
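
As an illustrative sketch (reusing the same hypothetical columns), the partitioned destination table could be created like this:

  CREATE TABLE `your_project.analytics_dataset.denormalized_table` (
    order_id INT64,
    customer_id INT64,
    amount NUMERIC,
    created_at TIMESTAMP,
    customer_name STRING,
    customer_city STRING
  )
  PARTITION BY DATE(created_at);   -- daily partitions on the event timestamp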

Again, you can choose failure notifications through email or a Pub/Sub topic.

Click ‘Save’. The joined, processed data will be generated on a daily basis and appended to your final denormalized table. From here onward, you can query the denormalized table directly for ad-hoc analytics or connect it to third-party BI tools.
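
For example, an ad-hoc query against the hypothetical denormalized table above might look like:

  SELECT
    customer_city,
    SUM(amount) AS total_amount
  FROM `your_project.analytics_dataset.denormalized_table`
  WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY customer_city
  ORDER BY total_amount DESC;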

Congratulations, you have just created an end-to-end batch data pipeline from AWS RDS to Google BigQuery.

Lessons Learned

I learned quite a few lessons while setting up this pipeline, and the current pipeline has some limitations. For example, at the time of writing, the BigQuery UI does not allow scheduled data transfers more frequent than once a day.

Our daily scheduled BigQuery query also has its own flaw: the join will still run even if data is missing from any of the tables loaded by the Transfer Service, which may affect the accuracy of our final denormalized table. This can be overcome with a tool that manages scheduling dependencies, such as Airflow or Google Cloud Composer.

We have reached the end of this article; I hope you have benefited from my experience of building this ELT pipeline. Do comment if you feel the workflow can be improved. Thanks for reading :)
