How to back up a BigQuery table (or dataset) to Google Cloud Storage and restore from it

String together calls to the bq command-line utility

Lak Lakshmanan
Mar 18, 2020 · 3 min read

BigQuery is fully managed and takes care of maintaining redundant backups in storage. It also supports “time travel”: the ability to query a snapshot as of, say, a day ago. So, if an ETL job goes bad and you want to revert to yesterday’s data, you can simply do:

CREATE OR REPLACE TABLE dataset.table_restored
AS
SELECT *
FROM dataset.table
FOR SYSTEM TIME AS OF
TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)

However, time travel is restricted to 7 days. There are situations (playback, regulatory compliance, etc.) when you might need to restore a table as it existed 30 days ago or 1 year ago.

For your convenience, I’ve put together a pair of Python programs for backing up and restoring BigQuery tables and datasets. Get them from this GitHub repository: https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup

Here’s how you use the scripts:

To back up a table to GCS

./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup

The script saves a schema.json, a tabledef.json, and extracted data in AVRO format to GCS.

You can also back up all the tables in a dataset:

./bq_backup.py --input dataset --output gs://BUCKET/backup
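Backing up a whole dataset presumably just enumerates its tables and backs up each one. Here is a minimal sketch of that enumeration, assuming the JSON output of `bq ls` mirrors the REST API’s tables.list response (the helper name is mine, not the script’s):

```python
import json

def table_names_from_ls(ls_output):
    """Parse `bq ls --format=json dataset` output into a list of table IDs.

    Assumes each entry carries a tableReference, as in the REST API's
    tables.list response.
    """
    return [entry["tableReference"]["tableId"] for entry in json.loads(ls_output)]

# Example: a trimmed-down listing with two tables.
sample = '''[
  {"tableReference": {"projectId": "p", "datasetId": "ds", "tableId": "events"},
   "type": "TABLE"},
  {"tableReference": {"projectId": "p", "datasetId": "ds", "tableId": "users"},
   "type": "TABLE"}
]'''
print(table_names_from_ls(sample))  # ['events', 'users']
```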

Restore tables one by one by specifying a destination dataset

./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset

The scripts use the BigQuery command-line utility bq to put together a backup and restore solution.

bq_backup.py invokes the following commands:

bq show --schema dataset.table           # schema.json
bq --format=json show dataset.table      # tbldef.json
bq extract --destination_format=AVRO \
dataset.table gs://.../data_*.avro # AVRO files

It saves the JSON files to Google Cloud Storage alongside the AVRO files.
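In Python, these three calls can be strung together with subprocess. The sketch below uses my own function names, not the script’s actual code:

```python
import subprocess

def backup_commands(table, gcs_prefix):
    """Build the three bq invocations used to back up one table."""
    return [
        ["bq", "show", "--schema", table],                    # schema.json
        ["bq", "--format=json", "show", table],               # tbldef.json
        ["bq", "extract", "--destination_format=AVRO",
         table, f"{gcs_prefix}/data_*.avro"],                 # AVRO files
    ]

def run_backup(table, gcs_prefix):
    """Run the backup commands in order, failing fast on any error."""
    for cmd in backup_commands(table, gcs_prefix):
        subprocess.run(cmd, check=True)

print(backup_commands("dataset.table", "gs://BUCKET/backup")[0])
# ['bq', 'show', '--schema', 'dataset.table']
```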

bq_restore.py uses the table definition to find characteristics such as whether the table is time-partitioned, range-partitioned, or clustered and then invokes the command:

bq load --source_format=AVRO \
--time_partitioning_expiration ... \
--time_partitioning_field ... \
--time_partitioning_type ... \
--clustering_fields ... \
--schema ... \
todataset.table_name \
gs://.../data_*.avro
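Translating the saved table definition into those flags might look like the sketch below. The field names follow the BigQuery REST API’s table resource (which is what `bq --format=json show` emits); note that `expirationMs` is in milliseconds while the bq flag takes seconds:

```python
def load_flags(tabledef):
    """Translate a table-definition dict (the tables.get resource shape)
    into bq load partitioning/clustering flags."""
    flags = []
    tp = tabledef.get("timePartitioning")
    if tp:
        flags += ["--time_partitioning_type", tp.get("type", "DAY")]
        if "field" in tp:
            flags += ["--time_partitioning_field", tp["field"]]
        if "expirationMs" in tp:
            # The REST resource is in milliseconds; the flag takes seconds.
            flags += ["--time_partitioning_expiration",
                      str(int(tp["expirationMs"]) // 1000)]
    clustering = tabledef.get("clustering")
    if clustering:
        flags += ["--clustering_fields", ",".join(clustering["fields"])]
    return flags

sample_def = {"timePartitioning": {"type": "DAY", "field": "ts"},
              "clustering": {"fields": ["user_id", "country"]}}
print(load_flags(sample_def))
```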

In the case of views, the scripts store and restore the view definition. The view definition is part of the table definition JSON, and to restore it, the script simply needs to invoke:

bq mk --view "query" --nouse_legacy_sql todataset.table_name
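Distinguishing a view from a table comes down to inspecting the saved table definition. In the REST resource, views carry type == "VIEW" and the SQL text under view.query; a sketch of that dispatch (my naming, not the script’s):

```python
def restore_command(tabledef, dest_table):
    """Choose between bq mk --view and bq load based on the table type.

    Assumes the tables.get resource shape: views carry type == "VIEW"
    and the SQL under view.query.
    """
    if tabledef.get("type") == "VIEW":
        return ["bq", "mk", "--nouse_legacy_sql",
                "--view", tabledef["view"]["query"], dest_table]
    # For ordinary tables, partitioning/clustering/schema flags and the
    # AVRO file paths would be appended here.
    return ["bq", "load", "--source_format=AVRO", dest_table]

view_def = {"type": "VIEW",
            "view": {"query": "SELECT 1 AS x", "useLegacySql": False}}
print(restore_command(view_def, "destdataset.myview"))
```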

Enjoy!

After I published this article, Antoine Cas responded on Twitter, pointing out that there is a much simpler option if all you need is a copy within BigQuery.

He’s absolutely right. This article assumes that you want a backup in the form of files on Cloud Storage. This might be because your compliance auditor wants the data exported to some specific location, or it might be because you want the backups to be processable by other tools.

If you do not need the backup to be in the form of files, a much simpler way to back up your BigQuery table is to use bq cp to create/restore the backup:

# backup
date=...
bq mk dataset_${date}
bq cp dataset.table dataset_${date}.table
# restore
bq cp dataset_20200301.table dataset_restore.table
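A small helper makes the dated-snapshot naming scriptable. The dataset_YYYYMMDD scheme follows the example above; the function names are my own:

```python
import datetime

def snapshot_commands(dataset, table, day=None):
    """Build the bq mk / bq cp pair for a dated snapshot (dataset_YYYYMMDD)."""
    day = day or datetime.date.today()
    snap = f"{dataset}_{day.strftime('%Y%m%d')}"
    return [
        ["bq", "mk", snap],
        ["bq", "cp", f"{dataset}.{table}", f"{snap}.{table}"],
    ]

cmds = snapshot_commands("dataset", "table", datetime.date(2020, 3, 1))
print(cmds[1])  # ['bq', 'cp', 'dataset.table', 'dataset_20200301.table']
```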

Another question that came up: is the backup compressed? Yes! The data is stored in Avro format, and the Avro format employs compression. The two JSON files (table definition and schema) are not compressed, but those are relatively tiny.

As an example, I backed up a BigQuery table with 400 million rows that took 11.1 GB in BigQuery. Stored as Avro on GCS, the same table came to 47 files of 174.2 MB each, about 8.2 GB in total. The two JSON files occupied a total of 1.3 MB, essentially just roundoff. This makes sense: BigQuery storage is optimized for interactive querying, whereas the Avro format is not.

BigQuery can export most primitive types and nested and repeated fields into Avro. For full details on the Avro representation, please see the documentation. The backup tool specifies use_avro_logical_types, so DATE and TIME are stored as date and time-micros respectively.

That said, you should verify that the backup/restore works on your table.
