How to backup a BigQuery table (or dataset) to Google Cloud Storage and restore from it

String together calls to the bq command-line utility

Note: this article is obsolete. BigQuery now supports Table Snapshots.

BigQuery is fully managed and takes care of managing redundant backups in storage. It also supports “time-travel”, the ability to query a snapshot as of a day ago. So, if an ETL job goes bad and you want to revert back to yesterday’s data, you can simply do:

CREATE OR REPLACE TABLE dataset.table_restored
AS
SELECT *
FROM dataset.table
FOR SYSTEM TIME AS OF
TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)

However, time travel is restricted to 7 days. There are situations (playback, regulatory compliance, etc.) when you might need to restore a table as it existed 30 days ago or 1 year ago.

Python scripts to backup and restore

For your convenience, I’ve put together a pair of Python programs for backing up and restoring BigQuery tables and datasets. Get them from this GitHub repository: https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup

Here’s how you use the scripts:

To backup a table to GCS

./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup

The script saves a schema.json, a tabledef.json, and extracted data in AVRO format to GCS.

You can also backup all the tables in a data set:

./bq_backup.py --input dataset --output gs://BUCKET/backup

Restore tables one-by-one by specifying a destination data set

./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset

How the Python scripts work

The scripts use the BigQuery command-line utility bq to put together a backup and restore solution.

bq_backup.py invokes the following commands:

bq show --schema dataset.table. # schema.json
bq --format=json show dataset.table. # tbldef.json
bq extract --destination_format=AVRO \
dataset.table gs://.../data_*.avro # AVRO files

It saves the JSON files to Google Cloud Storage alongside the AVRO files.

bq_restore.py uses the table definition to find characteristics such as whether the table is time-partitioned, range-partitioned, or clustered and then invokes the command:

bq load --source_format=AVRO \
--time_partitioning_expiration ... \
--time_partitioning_field ... \
--time_partitioning_type ... \
--clustering_fields ... \
--schema ... \
todataset.table_name \
gs://.../data_*.avro

In the case of views, the scripts store and restore the view definition. The view definition is part of the table definition JSON, and to restore it, the script simply needs to invoke:

bq mk --view query --nouse_legacy_sql todataset.table_name

Enjoy!

1. Do you need to backup to Google Cloud Storage?

After I published this article, Antoine Cas responded on Twitter:

He’s absolutely right. This article assumes that you want a backup in the form of files on Cloud Storage. This might be because your compliance auditor wants the data to be exported out to some specific location, or it might be because you want the backups to be processable by other tools.

If you do not need the backup to be in the form of files, a much simpler way to backup your BigQuery table is use bq cp to create/restore the backup:

# backup
date=...
bq mk dataset_${date}
bq cp dataset.table dataset_${date}.table
# restore
bq cp dataset_20200301.table dataset_restore.table

2. Are the backup files compressed?

Yes! The data is stored in Avro format, and the Avro format employs compression. The two JSON files (table definition and schema) are not compressed, but those are relatively tiny.

As an example, I backed up a BigQuery table with 400 million rows that took 11.1 GB in BigQuery. When storing it as Avro on GCS, the same table was saved as 47 files each of which was 174.2 MB, so 8.2 GB. The two JSON files occupied a total of 1.3 MB, essentially just roundoff. This makes sense because the BigQuery storage is optimized for interactive querying whereas the Avro format is not.

3. Can you actually roundtrip from BigQuery to Avro and back?

BigQuery can export most primitive types and nested and repeated fields into Avro. For full details on the Avro representation, please see the documentation. The backup tool specifies use_avro_logical_types, so DATE and TIME are stored as date and time-micros respectively.

That said, you should verify that the backup/restore works on your table.

--

--

--

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Recommended from Medium

What Is EBS?

Ideas for coders

Day #2 of Python

Data Cleansing made Simple using SimpleData Management

Using the RaspberryPi Camera with TerrariumPI and Web-Streaming your Reptile!

The Job Market Being A Self-Taught Developer

Change data capture

CS373 Fall 2020: Alexy Correa

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Lak Lakshmanan

Lak Lakshmanan

Operating Executive at a technology investment firm; articles are personal observations and not investment advice.

More from Medium

Connecting Steampipe with Google BigQuery

Deployment Topologies for Data Fusion with Shared VPCs

Secure Google Cloud SQL Instances using Private IP: Gotchas & troubleshooting

Optimize costs in BigQuery — 9 solutions