How to transfer BigQuery tables between locations with Cloud Dataflow
Update 28.10.2018: I’ve changed the solution so that it now automatically detects and reads the schema of the source table. This means you don’t need to bother specifying the schema in the YAML config anymore. However, it doesn’t support complex schemas with nested fields yet.
Update 05.11.2018: I’ve added the ability to automatically create the target dataset in BigQuery in the correct location/region.
A few weeks ago this article popped up on the Google Cloud blog. It describes a solution that uses Cloud Composer (Google’s fully-managed Apache Airflow service) to copy tables in BigQuery between different locations.
You see, BigQuery currently has a bit of a funky limitation that doesn’t allow you to (easily) copy tables between different locations. For example, if you have a dataset located in the EU and you want to copy a table from that dataset to another dataset that’s, say, in the US, it will spit out the following error when you try to do it:
As a result of this, users need to implement gnarly workarounds until it’s natively supported (cough-cough, wink-wink, nudge-nudge my BigQuery PM friends).
Aside from a few exceptions — which you’re better off ignoring anyway — the rules for copying tables in BigQuery are:
- You cannot directly copy a table from location X to location Y. For example, you can’t copy a table from the US to EU, or Tokyo to EU — you get the idea.
- Instead, you must first export the table to a GCS bucket. That bucket must also be in the same location as the table. Oh dear.
- You’re still not done. Nope, sorry. Not even close. You then need to copy that export into another GCS bucket that is in the same location as the BigQuery dataset that you’re trying to copy to. Then you need to load it from there. Is your head starting to hurt just like mine yet? Don’t worry, it will pass folks.
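To make the pain concrete, here’s what that three-step dance looks like with the standard CLI tools. All bucket, project, and table names below are hypothetical, and the commands are printed rather than executed unless you set RUN=1:

```shell
# Sketch of the manual copy dance between locations (hypothetical names).
# By default we only echo the commands; set RUN=1 to actually execute them.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

# 1. Export the table to a GCS bucket in the SAME location as the source dataset.
run bq extract --destination_format=AVRO \
  'myproject:eu_dataset.mytable' \
  'gs://my-eu-export-bucket/mytable-*.avro'

# 2. Copy the export into a bucket co-located with the TARGET dataset.
run gsutil -m cp 'gs://my-eu-export-bucket/mytable-*.avro' \
  'gs://my-us-load-bucket/'

# 3. Load from the co-located bucket into the target dataset.
run bq load --source_format=AVRO \
  'myproject:us_dataset.mytable' \
  'gs://my-us-load-bucket/mytable-*.avro'
```

Three commands, two buckets, and a pile of temporary files to clean up afterwards. You can see why people reach for orchestration tools.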
Recapping the Cloud Composer solution
Let’s quickly recap what the proposed solution using Cloud Composer in the original article looks like, shall we?
- Fire up a Cloud Composer environment. Currently, there’s a minimum of three nodes required in the cluster, and it also takes about 20 minutes for the lights to come on. Yikes!
- Manually create both a source bucket (for the export) and target bucket (for the load) in GCS — making sure to use the same locations as the BigQuery datasets.
- Create a target dataset in BigQuery where the table(s) will be copied to.
- Run an Airflow job that reads a list of tables to copy from a CSV configuration file, exports the BigQuery table(s) to the GCS source bucket, copies the contents from that bucket to the target bucket, and finally loads the data into the target dataset, thereby “copying” the table(s) between locations.
While this solution will indeed work and scale, I couldn’t help thinking that a somewhat neater (and cheaper?) solution might be lurking in the GCP shadows somewhere. Airflow is a great tool, and a great candidate for solving this type of problem, but having to spin up a big cluster just for copying some BigQuery tables didn’t sit quite right with me.
So, I went hunting for something else.
Reaching for the shiny serverless toys
I like serverless. I like ephemeral stuff too. My sysadmin skills are awful, so the less time I need to spend on infrastructure the better it is for humanity. Trust me. Using shiny serverless toys also makes me look much smarter than I actually am, and buys me a little more time before I’m unveiled as a computering fraud.
I’ve been using Cloud Dataflow for many years now. Mainly for shovelling large amounts of data around on GCP, and loading it into BigQuery. It integrates seamlessly with all the other tools on the GCP stack, and most importantly, is completely serverless.
So, I concocted an idea to use Cloud Dataflow to do the copying of tables across different locations, and then orchestrate it all using Cloud Build — which by the way is becoming one of my favourite tools on the stack these days. It’s just awesome.
An end-to-end, serverless solution that scales to zero. High five!
Partnering up Cloud Dataflow and Cloud Build
Here’s what the Cloud Dataflow and Cloud Build disco dance looks like at a high-level. Note: the solution allows for a list of tables to be configured, but for brevity’s sake, I’ll walk through just copying one table:
- I wrote a simple Cloud Dataflow job (a pipeline) that can read an even simpler YAML config file. That config file tells the pipeline which table to copy.
- While bootstrapping itself, the pipeline automatically detects the location of the source table and the target dataset by talking to BigQuery. It then automagically creates the necessary GCS buckets (one for export and one for load) in the correct locations. It also detects the schema of the source table so you don’t need to specify it, and works out whether it should create the target dataset in BigQuery. Flippin’ sweet!
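That bootstrapping step boils down to a few calls against the BigQuery and GCS client libraries. Here’s a simplified sketch of the idea using the google-cloud Java clients — all project, dataset, and bucket names are placeholders, and the real code in the repo derives them from the YAML config:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetId;
import com.google.cloud.bigquery.DatasetInfo;
import com.google.cloud.storage.BucketInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class Bootstrap {
  public static void main(String[] args) {
    BigQuery bq = BigQueryOptions.getDefaultInstance().getService();
    Storage gcs = StorageOptions.getDefaultInstance().getService();

    // 1. Look up the source table's dataset to discover its location (e.g. "EU").
    Dataset source = bq.getDataset(DatasetId.of("myproject", "eu_dataset"));
    String sourceLocation = source.getLocation();

    // 2. Create the export bucket co-located with the source dataset.
    gcs.create(BucketInfo.newBuilder("my-eu-export-bucket")
        .setLocation(sourceLocation)
        .build());

    // 3. Create the target dataset in the desired location if it doesn't exist.
    DatasetId targetId = DatasetId.of("myproject", "us_dataset");
    if (bq.getDataset(targetId) == null) {
      bq.create(DatasetInfo.newBuilder(targetId).setLocation("US").build());
    }
  }
}
```

The load bucket gets created the same way, just keyed off the target dataset’s location instead.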
- Then the pipeline spins up a (batch) Cloud Dataflow job to perform the copy. Under the hood, Cloud Dataflow isn’t actually copying the table at all. It’s reading it from BigQuery by calling BigQuery’s export API, and then loading it back in again using a different GCS bucket. But this is all handled for you and abstracted away. I like it when things are hidden from me.
- Cloud Dataflow does its thing, shrugs its shoulders, shouts “meh, is that all you got sucker?”, and proceeds to tear down all the resources it used. It also cleans up all files that were copied to GCS during the process. Nice and neat.
- To orchestrate it all, there’s a simple Cloud Build job that runs the Cloud Dataflow pipeline (a Java app). It’s a simple one-liner.
My (bad) code
We just looked at the high-level solution, so now let’s roll up our sleeves and take a look at the code in more detail. First up is the basic YAML config. You simply define a list of tables to copy in the format
[PROJECT]:[DATASET].[TABLE]. There are also some other options you can configure, like the number of workers, worker type, etc.
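Something along these lines — note that the exact key names here are illustrative (check the repo for the real ones), but the shape is the same:

```yaml
# Tables to copy, in [PROJECT]:[DATASET].[TABLE] format.
tables:
  - "myproject:eu_dataset.mytable"
  - "myproject:eu_dataset.othertable"

# Optional Dataflow tuning knobs.
options:
  numWorkers: 3
  workerMachineType: "n1-standard-1"
```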
Next up, a trivial pipeline using the BigQuery, GCS and Cloud Dataflow APIs. When working with Dataflow I like to use the Java SDK because it’s more mature and has more features than the Python one. Well, that, and because my Python skills are very dodgy at the best of times.
The key points from this snippet of Java code:
- Before dropping into the Cloud Dataflow job itself, the code will automatically check whether it needs to create the GCS buckets or not. It will also try to auto-detect the schema, and even create the target dataset in BigQuery too.
- The Cloud Dataflow pipeline itself consists of two stages: read from BigQuery, and write to BigQuery. That’s really it.
withCustomGcsTempLocation() is probably the most important piece of the puzzle. It overrides the default bucket that Cloud Dataflow uses for loading into BigQuery. Remember, the load has to come from a bucket in the same location as the target BigQuery dataset.
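Stripped of all the bootstrapping, the heart of the pipeline really is just the two stages. Here’s a simplified sketch using the Beam Java SDK — the table names, bucket, and one-field schema are placeholders (the real code derives them at runtime), so treat this as a shape, not the finished implementation:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

public class CopyTablePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder schema — the real pipeline auto-detects it from the source.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("id").setType("INTEGER")));

    p.apply("Read from source table",
            BigQueryIO.readTableRows().from("myproject:eu_dataset.mytable"))
     .apply("Write to target table",
            BigQueryIO.writeTableRows()
                .to("myproject:us_dataset.mytable")
                .withSchema(schema)
                // Stage the load files in a bucket that is in the SAME
                // location as the target dataset — this is the magic bit.
                .withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(
                    "gs://my-us-load-bucket/tmp"))
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();
  }
}
```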
Finally, Cloud Build ties it all together. The first step of the CI/CD pipeline is to check out the code, then run the Java app using a Gradle container.
Easy as 1–2–3.
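The cloudbuild.yaml itself is tiny — something like this (the Gradle image tag and task name are illustrative, and note that when you trigger via gcloud builds submit, the source is uploaded for you, so no explicit checkout step is needed):

```yaml
steps:
  # Source is already in the workspace; just run the pipeline via Gradle.
  - name: 'gradle:jdk8'
    entrypoint: 'gradle'
    args: ['run']
timeout: '3600s'
```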
To trigger the copying process, all we need to do is run:
gcloud builds submit --config=cloudbuild.yaml
…and then sit back and enjoy a well-deserved cup of tea. Now remember, the example YAML config I’ve shown has three tables configured to copy (each one is 125M rows and about 10GB), so we should see three lovely little Cloud Dataflow jobs kick off:
- US to EU (multi-regional)
- EU to US (multi-regional)
- EU to Tokyo (regional)
Firstly, you can find all the source code here. Feel free to tear it apart.
Secondly, this article is not suggesting that using Cloud Dataflow is a better approach than using the original Cloud Composer approach. As with everything, there are tradeoffs.
It’s just offering folks yet another solution to choose from, and you should pick which one suits your needs best. I like it because it’s serverless and scales to zero. I also like the fact that Dataflow will clean up all the files and buckets when it’s finished. You also get monitoring and error handling out of the box using Stackdriver.
Finally, you could just use a bash script to do all this, but that’s no fun now is it? However, my good friend Felipe Hoffa promised me that he’d have a crack at implementing it in bash. Of course, the easiest and neatest solution would be if BigQuery just allowed you to copy tables between locations without any restrictions! 😎