Dynamically Duplicating A BigQuery DataSet’s Tables…

…Using CloudScheduler, Pub/Sub, CloudFunctions and a little bit of Node.js

Phil Goerdt

--

StackOverflow Gods, why have you forsaken me?

There is so much to love about GCP, and one thing in particular is the flexibility of the platform for developers. Writing in Python? No big deal. Working in Go? No problem. Developing something in Node.js? Easy.

However, sometimes you find that the task you’re trying to accomplish isn’t exactly covered by the library you’re using. Or by the GCP console. And it can’t be found on StackOverflow or GitHub. And thus, you’re required to figure it out on your own (and stop asking questions once it finally works).

The crux of the issue at hand: how can I copy all of the tables from one BigQuery dataset into another, and how can I schedule that copy to happen on a regular basis? With some blood, sweat, and maybe tears, we can definitely figure this out together.

Why do we want to do this?

I believe that fresh data is always better to develop against than stale data. In the past, we’ve been constrained in a variety of ways that prevented us from using production-like data, or from keeping close data parity with the applications we are developing for. This creates all kinds of issues, so my thought was “Why continue in the old ways if we can fundamentally change how we work by working in the cloud?”

Thinking through that, it only made sense to be able to refresh data from production, test, or whatever sandbox we needed into another location for development, testing or data science. It also made sense to do that on a reasonable schedule (say, every day, or even on demand) as opposed to every few weeks or months. And so here we are, talking about how to accomplish this on GCP.

(This is a sequel to a blog I wrote about creating better data workflows. You can check that out here.)

How can we get this done?

To start, there are probably several ways to accomplish this task, especially if you count using another language such as Python. In an attempt to keep this short, let’s just stick to what I did. That isn’t to say that you couldn’t adapt this solution to other languages, and if you do, I’d love to hear about it!

The basic premise is pretty simple: write some code and then use Cloud Scheduler, Pub/Sub and Cloud Functions to hold it all together. How that actually works looks a little something like this…

  1. Cloud Scheduler is configured to run at the interval we prefer (once a day in this case)
  2. Cloud Scheduler runs and publishes a message to the PubSub topic we create and designate
  3. Cloud Functions picks up the event in the topic, and then runs the Node.js code we have written to copy all of the tables in BigQuery
  4. Everyone is happy!
It’s like they were designed to work together or something…

BigQuery: Select all, copy and paste!

I’m using Node.js for this solution, so make sure you’re using the Node.js client library for BigQuery if you’re following along at home.

There are some great examples provided by the folks at Google for copying a single, hardcoded or user-input table from one dataset to another. But how do you go about copying an entire dataset’s worth of tables? Voila! What I came up with is below.

'use strict';

//what are the start and end points?
var project_id = '<your project id here>';
var src_dataset = '<the source dataset name here>';
var dest_dataset = '<the destination dataset name here>';

//show the people what we're working with...
console.log('Input parameters:');
console.log('Project ID: ' + project_id);
console.log('Source Dataset: ' + src_dataset);
console.log('Destination Dataset: ' + dest_dataset);
console.log('Getting source tables...');

//create a function to get tables and metadata
//this function is from Google. Thanks Google!
async function listTables(datasetId, projectId) {
  // [START bigquery_list_tables]
  // Imports the Google Cloud client library
  const {BigQuery} = require('@google-cloud/bigquery');

  // Creates a client
  const bigquery = new BigQuery({projectId});

  // Lists all tables in the dataset
  const [tables] = await bigquery.dataset(datasetId).getTables();
  return tables;
  // [END bigquery_list_tables]
}

//create a function that will copy a table from
//one dataset to another
//this function is also from Google. Thanks Google!
async function copyTable(
  srcDatasetId,
  srcTableId,
  destDatasetId,
  destTableId,
  projectId
) {
  // [START bigquery_copy_table]
  // Imports the Google Cloud client library
  const {BigQuery} = require('@google-cloud/bigquery');

  // Creates a client
  const bigquery = new BigQuery({projectId});

  // Create the destination table if needed, and overwrite it if it exists
  const metadata = {
    createDisposition: 'CREATE_IF_NEEDED',
    writeDisposition: 'WRITE_TRUNCATE',
  };

  // Copies the table contents into another table
  const [job] = await bigquery
    .dataset(srcDatasetId)
    .table(srcTableId)
    .copy(bigquery.dataset(destDatasetId).table(destTableId), metadata);

  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
  // [END bigquery_copy_table]
}

console.log('Attempting to copy...');

async function copy_all() {
  //get the tables
  const tables = await listTables(src_dataset, project_id);

  //copy each table, keeping the same table name in the destination
  await Promise.all(
    tables.map(table =>
      copyTable(src_dataset, table.id, dest_dataset, table.id, project_id)
    )
  );
}

//run the function
copy_all();
//have views in your dataset?
//get ready to see some errors in the output.
//have external tables in the dataset too?
//see above.

So, what is this function doing? It takes the project id, the source dataset, and the destination dataset that you input, finds all of the tables in the source, and then copies each table into the destination. Pretty neat!

One thing to note is that it only copies table objects. It will not work on views or external tables that are in your dataset. Let’s stop to think about why that is for a minute. Tables are actual data stored within BigQuery, and that data is what we are copying from one dataset to the other. External tables (Sheets, Cloud Storage, etc.) live outside of the dataset that BigQuery is holding, so it makes sense that those cannot be copied here. Likewise with views, which are really just saved queries over the data in the dataset.

Heads up: if you have either of those object types, you will see errors in the console. Don’t fret though: the job will still complete, and any tables present will be copied. I don’t mind the errors because the copy job still gets done, but if they bother you, you can filter the tables object down to entries whose type is 'TABLE' before copying to clear that up.
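Here is a minimal sketch of what that filter might look like. I’m assuming the table type is exposed on each table’s metadata by the client library (it comes back in the tables.list response), so treat the exact property name as something to verify against your library version:

async function copy_only_tables() {
  //get everything in the source dataset
  const tables = await listTables(src_dataset, project_id);

  //keep only plain tables; views and external tables are skipped
  const plainTables = tables.filter(table => table.metadata.type === 'TABLE');

  //copy each remaining table, keeping the same name in the destination
  await Promise.all(
    plainTables.map(table =>
      copyTable(src_dataset, table.id, dest_dataset, table.id, project_id)
    )
  );
}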

This also opens the door for the question of “how do we handle those other object types?” It’s a good question to ask, and I would think that we could stitch something together that would take committed view logic in the Cloud Repository and deploy it with other BigQuery commands. Maybe there should be a sequel to this blog on that topic? Comment below if you’d be interested in seeing that.

Time to get func-y

Now that we have our code ready to go, let’s configure a Cloud Function that will handle executing it for us. As a reminder, Cloud Functions are a serverless way to run code based on events. Let’s walk through how to configure the function in this use case. First, navigate to Cloud Functions within the GCP Console, and then click on “Create Function”.

All fun and games so far…

You can see that I have set some of the options for my function here. I allocated the smallest amount of memory for this function (128MB); select whatever you think is appropriate in your use case, but I would recommend starting small and increasing if you feel the performance is lackluster. The most important things to note here are the Trigger and Topic fields, which tell the Function when to execute. Make sure you select Cloud Pub/Sub in this example, and then select the topic you wish to use. Don’t worry if you don’t have one yet — you can create one in the drop down menu if needed.

Once you’ve configured those options, you will see an inline editor where we can paste the function logic we created above. We can’t just copy and paste our logic in as-is, though; Functions has a specific format that it prefers. Let’s break that down.

exports.myFirstFunction = (event, context, callback) => {
  const pubsubMessage = event.data;

  // <function logic goes here>

  callback();
};

Basically what is happening here is that we are exporting a function that Cloud Functions will call with three arguments. Those are:

  1. The event: the Pub/Sub message that arrives on the topic we created.
  2. The context: metadata about the event, such as its ID and timestamp.
  3. The callback: a function we call to tell Cloud Functions that we’re done running.

(Please note that the context argument is new to Node.js 8 in Cloud Functions! You do not need this to run a Node.js 6 function.)

Now that we’ve reviewed what is going on in the inline editor, paste the completed logic into the space between the context and callback lines. Once you’ve done that, be sure to update the text box underneath the editor (“Function to execute”) so it matches the function we have created and exported (myFirstFunction in this example).
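For reference, here is a minimal sketch of how the finished handler might look, assuming the listTables, copyTable and copy_all functions from earlier are pasted at module scope above the export, and the top-level copy_all() call is dropped so the copy only runs when the function is triggered (myFirstFunction is just the example name from above):

exports.myFirstFunction = async (event, context, callback) => {
  //the Pub/Sub message published by Cloud Scheduler; we don't need its contents here
  const pubsubMessage = event.data;

  //run the copy logic we wrote earlier
  await copy_all();

  callback();
};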

We’re almost there! Above the inline editor box, there is a tab that reads package.json . Click on that, and then paste in the following dependencies. This allows the function to import the modules and packages we need at runtime. Note that this only works for publicly available modules; if your function depends on private modules, you will need to upload your code and its dependencies as a .zip file instead!

{
  "dependencies": {
    "@google-cloud/pubsub": "^0.18.0",
    "@google-cloud/bigquery": "^2.0.6"
  }
}

Nice work! Go ahead and click on “Create”! Once you’ve created the function, you can test it by navigating to the “Testing” tab and clicking on the “Test the function” button.
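If the Testing tab asks for a triggering event, something shaped like a Pub/Sub message should do the trick; I’m assuming the default event format here, and the data value is just a base64-encoded placeholder that our function never reads:

{
  "data": "aGVsbG8gd29ybGQ="
}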

Are we scheduling or automating?

Answer: Yes.

It’s always been painful to schedule and execute data refreshes between production and development environments, and that pain is partly why those environments end up with stale data. Thankfully, we don’t have to live in that world anymore, and now that we have a function that can do the refresh for us, let’s schedule it.

This seems like the perfect time to get our hands dirty with Cloud Scheduler. If you’re unfamiliar with Scheduler, it is a service that fires an event or calls a target on a schedule you define. This is extremely handy, and it also means that the days of keeping a VM or App Engine instance running cron just to do this are over. Let’s get started.

From the GCP console, go ahead and navigate to Cloud Scheduler. Then click on “Create job”. This lands you on the Scheduler’s config page, which is pretty self-explanatory. Here, we put in a description, a frequency (in unix-cron format, e.g. 0 6 * * * for 6 AM every day) and a timezone. We also set the Target, which is the Pub/Sub topic we created earlier. Finally, we put a payload into the job to pass along to the Cloud Function. Once you’ve finished configuring the Scheduler job, click on “Create”.
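Our function ignores that payload entirely, but if you ever want to act on it (say, to pass the dataset names in from Scheduler rather than hardcoding them), a minimal sketch of decoding it looks like this, assuming the Node.js 6-style signature used above where event.data carries the base64-encoded message:

exports.myFirstFunction = (event, context, callback) => {
  //decode the payload that Cloud Scheduler published to the topic
  const payload = Buffer.from(event.data, 'base64').toString();
  console.log('Scheduler payload: ' + payload);

  callback();
};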

After the job has been created (which may take a minute or two), you can test the job. Click on the “Run Now” button to test!

Once you’ve tested, you should be able to view the logs in each service if you want to follow the sequence of events. Most importantly, you should be able to go directly to BigQuery and see the tables replicated from one dataset to the next!

All’s well that ends well

This is just the tip of the iceberg when it comes to Cloud Scheduler and Functions, but the important thing is that together they make for a neat solution whenever data needs to be replicated between datasets in BigQuery. Whether we need fresh data for ML models, separate datasets for data engineers and report developers, or simply refreshed environments, it’s all possible when we apply cloud thinking to old problems.

Phil Goerdt is the founder of Erteso, a consultancy that focuses on cloud and data solutions, particularly for GCP. He is a GCP Certified Professional Cloud Architect and a GCP Certified Professional Data Engineer. You can contact him at phil.goerdt@erteso.com.
