Micro-batching with Apache Beam and BigQuery

Asa Harland
Sep 17, 2018 · 3 min read

The Challenge

In the previous article we looked at using load jobs to import flat files into BigQuery. However, there are limits we have to play by when loading data into BigQuery; in particular, we are limited to 1,000 load jobs per table per day. This article looks at one way of navigating this limit while still importing your data in a timely fashion.

The Solution

To overcome this limit, we will use Apache Beam running on Cloud Dataflow to periodically check our Cloud Storage bucket for new files and then stream the data into BigQuery. The diagram below outlines how this looks:
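In words: new files land in a Cloud Storage bucket, a long-running Dataflow pipeline built with Apache Beam notices them, reads them line by line, and streams the rows into BigQuery.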

The way it gets around the load job limit is by using streaming inserts, which let you write your files into BigQuery line by line. The downside of this approach is cost: load jobs are free, whereas streaming inserts cost $0.01 per 200 MB of streamed data. However, the benefits of this solution are that:

  • Your data is loaded in a timely manner
  • You can scale up to 100,000 writes per second, and this is only a soft limit
  • BigQuery can query data stored in the streaming buffer.
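As a rough example of what that pricing means: at $0.01 per 200 MB, streaming works out to about $0.05 per GB, so micro-batching 100 GB a day would cost around $5 a day in streaming inserts, versus nothing for load jobs.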

Let’s take a look at running the example:

The Code

First, download the code locally by cloning the following repository to your machine:

The main code we are interested in is located in StreamingFilePipeline.java:

What this snippet does is watch for new files ending in .csv being added to our Cloud Storage bucket. Because we want to watch for files indefinitely, we use Watch.Growth.never() so our pipeline never expires.
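The repository has the full pipeline; below is only a minimal sketch of the same idea, with the bucket path, CSV layout, column names, and output table as placeholders you would swap for your own:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

public class StreamingFileSketch {

  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Poll the bucket for new .csv files every 30 seconds. Watch.Growth.never()
        // means we never stop watching, which makes this an unbounded source.
        .apply("ReadNewFiles",
            TextIO.read()
                .from("gs://BUCKET_NAME/*.csv")
                .watchForNewFiles(Duration.standardSeconds(30), Watch.Growth.never()))

        // Turn each CSV line into a BigQuery TableRow. The column names here are
        // placeholders -- use whatever your table's schema actually defines.
        .apply("CsvToTableRow",
            MapElements.via(
                new SimpleFunction<String, TableRow>() {
                  @Override
                  public TableRow apply(String line) {
                    String[] fields = line.split(",");
                    return new TableRow()
                        .set("transaction_id", fields[0])
                        .set("amount", fields[1]);
                  }
                }))

        // Because the input is unbounded, BigQueryIO writes via streaming inserts
        // rather than load jobs, which is exactly what lets us dodge the quota.
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("YOUR_PROJECT:finance.transactions")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}
```

The two pieces that matter are watchForNewFiles(...) combined with Watch.Growth.never(), which turns TextIO into an unbounded source, and the fact that BigQueryIO writes an unbounded input using streaming inserts rather than load jobs.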

1. Follow the Cloud Dataflow quickstart guide to ensure your Google Cloud Platform account is configured to run this example.

2. Create a Cloud Storage Bucket, replacing BUCKET_NAME with your desired bucket name:
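Assuming you use the gsutil CLI, creating the bucket is a one-liner (the region below is just an example; pick one close to where your Dataflow job will run):

```
gsutil mb -l us-central1 gs://BUCKET_NAME
```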

3. Create a BigQuery dataset called ‘finance’ and a table called ‘transactions’ with the following schema:
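Whatever schema you settle on, the bq CLI can create both the dataset and the table; the two columns below are only placeholders matching the pipeline sketch above:

```
# Hypothetical schema -- replace with the columns your CSV files actually contain.
bq mk finance
bq mk --table finance.transactions transaction_id:STRING,amount:FLOAT
```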

4. Run the following command from the root of the project, replacing the placeholder values with your own.
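As a rough template (not the repository's exact command), a Beam pipeline is typically launched on Dataflow with Maven along these lines; the fully qualified main class, project ID, and bucket are placeholders, and the pipeline may declare extra options of its own (such as which bucket to watch):

```
mvn compile exec:java \
  -Dexec.mainClass=YOUR_PACKAGE.StreamingFilePipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=YOUR_PROJECT_ID \
    --tempLocation=gs://BUCKET_NAME/tmp \
    --streaming=true"
```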

If successfully deployed, you should see the following pipeline graph in Cloud Dataflow:

5. Upload ‘sample_1.csv’, located in the root of the repo, to the Cloud Storage bucket you created in step 2:

6. Upload ‘sample_2.csv’, located in the root of the repo, to the Cloud Storage bucket you created in step 2:
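Both uploads are plain object copies; with gsutil, and assuming you are in the repository root, they look something like this:

```
gsutil cp sample_1.csv gs://BUCKET_NAME/
gsutil cp sample_2.csv gs://BUCKET_NAME/
```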

7. If everything is set up correctly, you should see the data in your BigQuery table:

Snippet of data loaded into BigQuery via Cloud Dataflow
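You can also check from the command line; a quick standard SQL query against the table created in step 3 will return rows even while they are still sitting in the streaming buffer:

```
bq query --use_legacy_sql=false 'SELECT * FROM finance.transactions LIMIT 10'
```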

With this deployed, we now have a repeatable and robust process for loading micro-batched data into BigQuery.

Let me know if you have any questions :)