Loading Data Into BigQuery With Cloud Functions

When building your data warehouse in BigQuery, you will likely have to load data from flat files, often on a repeated schedule. This article walks through how you can achieve this using Cloud Storage, Cloud Functions and BigQuery to create a repeatable, low-maintenance solution.

Let’s start by taking a look at Cloud Functions.

So what is Google Cloud Functions?

Cloud Functions is an event-driven serverless compute platform. In practical terms, this means Cloud Functions completely abstracts away the underlying infrastructure so you can focus on writing code that responds to events. In addition to not having to provision a server, Cloud Functions scales automatically to meet the demand of your users.

Cloud Functions can be triggered by HTTP requests, Cloud Storage events or even Cloud Pub/Sub messages, to name a few. Furthermore, you only pay when your function is triggered, which makes it a cost-effective component to include in your solutions.
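
To give a feel for the programming model, here is a minimal sketch of a function that reacts to a Cloud Storage event. The entry point name (onFileUploaded) is just an example rather than anything from this article's repo, and the exact signature depends on the Node.js runtime you deploy to:

// Minimal sketch: a background function fired by a Cloud Storage event.
// 'file' describes the object that triggered the event.
exports.onFileUploaded = (file, context) => {
  console.log(`Received gs://${file.bucket}/${file.name}`);
};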

Loading data using Cloud Functions

Let’s assume that we receive a CSV file every hour into our Cloud Storage bucket and we want to load this data into BigQuery. Traditionally, we might have tackled this task by scheduling a cron job to periodically check the bucket and trigger a BigQuery load job whenever it finds new files.

However, we can create a more event-driven solution by using Cloud Functions to trigger a BigQuery load job whenever a new file is added to our Cloud Storage bucket:

Loading data into BigQuery from Cloud Storage using a Cloud Function

By following this approach we can load our data into BigQuery in a more timely and lightweight fashion, without the overhead of running and managing a cron job because, let’s be honest, who wants to manage another cron job?

On to the code

First, download the code locally by cloning the following repository to your machine:

git clone https://github.com/asaharland/functions-bq-load-job.git

All of the code for this example is located in index.js. Open index.js and look at line 26:

const jobMetadata = {
    skipLeadingRows: 1,
    writeDisposition: 'WRITE_APPEND'
};

This metadata simply tells our BigQuery load job how to behave when loading our data (a slightly fuller example follows the list below):

  • skipLeadingRows: The number of rows to skip before our actual data starts. In our case, we want to skip the header row, so we specify one row.
  • writeDisposition: This tells BigQuery how the data should be written to our table. In this instance, we are telling BigQuery to append any new data to the data already stored in the table.
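
Those two options are all this example needs, but the same metadata object accepts other load-job settings too. For instance, a sketch (not taken from the repo) that is explicit about the file format and schema handling might look like this:

const jobMetadata = {
    sourceFormat: 'CSV',               // the format of the file being loaded
    skipLeadingRows: 1,                // ignore the header row
    writeDisposition: 'WRITE_APPEND',  // append to the existing table
    autodetect: false                  // rely on the table's existing schema
};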

The load job is created using this short piece of code:

bigquery
  .dataset(datasetId)
  .table(tableId)
  .load(storage.bucket(file.bucket).file(file.name), jobMetadata)
  .catch(err => {
    console.error('ERROR:', err);
  });

This simple segment of code tells BigQuery where we are loading our data from and where we want to write our data to. That’s pretty much it in terms of the code!
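
For context, that snippet sits inside the function that Cloud Functions actually invokes. A condensed sketch of how index.js hangs together, written against the current Node.js client libraries rather than copied verbatim from the repo, looks something like this:

const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

const datasetId = 'finance';
const tableId = 'transactions';

const jobMetadata = {
    skipLeadingRows: 1,
    writeDisposition: 'WRITE_APPEND'
};

// Entry point: runs every time a file is finalized in the bucket.
exports.loadFile = (file, context) => {
  return bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(file.bucket).file(file.name), jobMetadata)
    .catch(err => {
      console.error('ERROR:', err);
    });
};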

Deploying our example

1. Follow the Cloud Functions quickstart guide to set up Cloud Functions for your project.

2. Create a Cloud Storage bucket, replacing BUCKET_NAME with your desired bucket name:

gsutil mb gs://BUCKET_NAME

3. Run the following gcloud command, replacing BUCKET_NAME with the name of your bucket from the previous step, to deploy your function and associate it with a Cloud Storage trigger:

gcloud functions deploy loadFile --trigger-resource gs://BUCKET_NAME --trigger-event google.storage.object.finalize

4. Create a BigQuery dataset called ‘finance’ and a table called ‘transactions’ with the following schema:

user_id:STRING,amount:FLOAT
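
You can create these through the BigQuery UI or the bq command-line tool. If you would rather script this step, a sketch using the same Node.js client library might look like this:

const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

// Sketch: create the 'finance' dataset and 'transactions' table programmatically.
async function createFinanceTable() {
  const [dataset] = await bigquery.createDataset('finance');
  await dataset.createTable('transactions', {
    schema: 'user_id:STRING,amount:FLOAT'
  });
}

createFinanceTable().catch(console.error);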

5. Upload ‘sample.csv’, located in the root of the repo, to the Cloud Storage bucket you created in step 2:

gsutil cp sample.csv gs://BUCKET_NAME

6. If the function deployed successfully, you should see logs for its execution in Stackdriver Logging once you have uploaded the file to Cloud Storage.

Also, you should see the data in your BigQuery table:

Snippet of data loaded into BigQuery via a Cloud Function.
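
If you prefer to verify the load from code rather than the BigQuery UI, a quick query with the Node.js client (again, just a sketch) does the job:

const { BigQuery } = require('@google-cloud/bigquery');

// Sketch: preview a few of the loaded rows.
async function previewTransactions() {
  const bigquery = new BigQuery();
  const [rows] = await bigquery.query(
    'SELECT user_id, amount FROM `finance.transactions` LIMIT 10'
  );
  rows.forEach(row => console.log(row.user_id, row.amount));
}

previewTransactions().catch(console.error);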

With this deployed, we now have a repeatable and robust process for loading data into BigQuery. In the next article we will look at how we can handle micro-batching in BigQuery.

Let me know if you have any questions!