Automate BigQuery Code Backup With Python Framework Tutorial

Joey Holmesmeyer
5 min read · Jul 15, 2023


You’ve created lots of code in a BigQuery project, but you don’t have a backup of the table definitions, views, functions, or procedures. Use this Python framework to create a file for every entity in your database.

Joey Holmesmeyer — Founder of Small Business Intelligence and Google Cloud Evangelist

The Use Case

You are developing in Google BigQuery, and you have spent a lot of time setting up table structures and writing stored procedures to load those entities and build analytics products.

The best practice would be to store all of your code in a repository like GitHub, but you or your team may not have that luxury. So the question is: how can you create a backup copy of all your and your team’s hard work?

In the article below I will walk you through a Python program I wrote that leverages the information schema in BigQuery to create a local file directory containing a SQL file for every table, view, stored procedure, and function in your BigQuery project.

Getting Started — Prerequisites

If you are a data or analytics engineer working in BigQuery, the steps outlined below are worth implementing and keeping in your toolkit, for this use case and beyond.

  1. Download VS Code (or an IDE of your choice), a free and widely used code editor that, once mastered, will change the way you work
  2. Install Python 3, another free download that will end up supercharging your workflow
  3. Download and install the Google Cloud CLI; this package is critical for developing in Python from VS Code, as it lets you authenticate and execute queries and commands against BigQuery from your IDE and local command line (see the quick connectivity check after this list)
  4. Gain familiarity with the BigQuery information schema. Lots of engineering work isn’t very sexy; it involves tedious tasks, and using metadata from the information schema will not only give you visibility into the work you and your team have completed at scale, but also enable lots of automation
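Before moving on, it is worth confirming that your local environment can actually talk to BigQuery. The exact setup varies, but in most cases you will need the BigQuery client library installed (for example, pip install google-cloud-bigquery pandas db-dtypes) and application default credentials configured, typically via gcloud auth application-default login. The snippet below is a minimal sanity check, assuming those pieces are already in place; the project ID is a placeholder.

from google.cloud import bigquery

# Minimal connectivity check. Assumes application default credentials are
# configured; "myProject" is a placeholder for your own project ID.
client = bigquery.Client(project="myProject")

# A trivial query that succeeds only if auth and project access are working.
rows = list(client.query("SELECT 1 AS ok").result())
print(f"Connected to project {client.project}, query returned {rows[0]['ok']}")

If this prints without errors, you are ready to build the extractor.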

Building the Program — Information Schema SQL Query

The SQL query below retrieves a list of all the entities currently in the specified BigQuery dataset, along with the DDL for each:

-- Query to fetch all the entities and their respective DDLs

WITH entity AS
(
  SELECT
    table_catalog,
    table_schema,
    table_name,
    table_type,
    ddl
  FROM
    `myProject.myDataset.INFORMATION_SCHEMA.TABLES`

  UNION ALL

  SELECT
    routine_catalog,
    routine_schema,
    routine_name,
    routine_type,
    ddl
  FROM
    `myProject.myDataset.INFORMATION_SCHEMA.ROUTINES`
)
SELECT
  table_catalog AS entity_catalog,
  table_schema AS entity_schema,
  table_name AS entity_name,
  table_type AS entity_type,
  ddl AS entity_ddl
FROM entity

In this query we select from the INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.ROUTINES views and stack them with UNION ALL on their common fields. The final SELECT statement simply renames the columns to more generic names, since they will contain both routine (stored procedure and function) and table/view details.

Building the Program — Python Code

The Python code below will create a file for each entity in your BigQuery project!

Here’s how to execute it:

  • The template can only take you so far: read the code below and make updates such as the directory the files should be written to and the BigQuery project you want to extract from (this is defined in the SQL)
  • Open VS Code
  • Create a new file named ‘bq_ddl_extractor.py’, paste the code below into it, and hit save
  • Make sure that your Google Cloud CLI is connected to your BigQuery project and can be used from the VS Code terminal (pro tip: ask GPT to help you do this if you don’t know how)
  • Hit the play button in the upper left-hand corner of VS Code while in the newly created Python file
  • The output of the program is a parent folder named backup, and within that folder you will find the following directory structure: Dataset/Entity Type/File.sql
import os
import re # Regular expressions library
from google.cloud import bigquery


# Instantiate a BigQuery client
bq = bigquery.Client()


# Query to fetch all the entities and their respective DDLS
query = """
WITH entity AS
(
  SELECT
    table_catalog,
    table_schema,
    table_name,
    table_type,
    ddl
  FROM
    `myProject.myDataset.INFORMATION_SCHEMA.TABLES`

  UNION ALL

  SELECT
    routine_catalog,
    routine_schema,
    routine_name,
    routine_type,
    ddl
  FROM
    `myProject.myDataset.INFORMATION_SCHEMA.ROUTINES`
)
SELECT
  table_catalog AS entity_catalog,
  table_schema AS entity_schema,
  table_name AS entity_name,
  table_type AS entity_type,
  ddl AS entity_ddl
FROM entity
"""


# Fetch the data
data = bq.query(query).to_dataframe()


# Iterate over each row
for _, row in data.iterrows():
    # Create directories for each dataset and entity type if they do not exist
    dir_name = os.path.join('sproc_generator/backup', row['entity_schema'], row['entity_type'])
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)

    # Create the full path for the file
    file_path = os.path.join(dir_name, f"{row['entity_name']}.sql")

    # Write the DDL to a .sql file
    with open(file_path, 'w') as f:
        # Collapse consecutive newlines (blank lines) into a single newline
        cleaned_ddl = re.sub(r"\n\s*\n", "\n", row['entity_ddl'])
        f.write(cleaned_ddl)

    # Print the path of the created file
    print(f"File created: {file_path}")

print("Backup completed successfully.")

Explanation of the Python Code:

The Python code we use in this article follows a systematic process to fetch the DDLs for each entity in your BigQuery project, and then writes these to SQL files organized by dataset and entity type. Let’s take a closer look at the steps:

Step 1: Import Required Libraries

import os
import re
from google.cloud import bigquery

This script uses three libraries:

  • The os module provides functions for interacting with the operating system, such as creating directories and building file paths.
  • The re module lets us use regular expressions, a powerful tool for string manipulation.
  • google.cloud.bigquery is the client library for Google BigQuery that allows Python code to interact with the BigQuery service.

Step 2: Create BigQuery Client

bq = bigquery.Client()

This line of code initializes a BigQuery client that will handle the interaction with the BigQuery service.
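If your credentials have access to more than one project, or you simply want to be explicit, the client also accepts a project argument. A small variation, with a placeholder project ID:

# Pin the client to a specific project instead of relying on the default.
# "myProject" is a placeholder for your own project ID.
bq = bigquery.Client(project="myProject")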

Step 3: Define the SQL Query

The SQL query fetches the required data from the information schema in BigQuery. This includes information about tables and routines (stored procedures and functions), including their DDLs.

Step 4: Execute the Query and Fetch Data

data = bq.query(query).to_dataframe()

Here, the query method of the BigQuery client executes the SQL query, and the to_dataframe method converts the result into a pandas DataFrame.
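Note that to_dataframe relies on pandas (and the db-dtypes package) being installed alongside the BigQuery client. If you would rather avoid that dependency, a rough alternative is to iterate the query result directly; Row objects support dictionary-style access by column name:

# Sketch of a pandas-free alternative: iterate the query results directly.
for row in bq.query(query).result():
    print(row['entity_name'], row['entity_type'])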

Step 5: Iterating Over Rows and Writing to SQL Files

The following loop iterates over each row in the fetched DataFrame. For each row, it:

  • Creates a directory path based on the entity’s schema and type.
  • Checks if this directory already exists, and if not, creates it.
  • Constructs a file path for the .sql file and writes the DDL (cleaned of consecutive newlines) into the file.
for _, row in data.iterrows():
    dir_name = os.path.join('sproc_generator/backup', row['entity_schema'], row['entity_type'])
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)

    file_path = os.path.join(dir_name, f"{row['entity_name']}.sql")

    with open(file_path, 'w') as f:
        cleaned_ddl = re.sub(r"\n\s*\n", "\n", row['entity_ddl'])
        f.write(cleaned_ddl)

    print(f"File created: {file_path}")
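As a small refinement, the exists check and the makedirs call can be collapsed into a single call; os.makedirs accepts an exist_ok flag that makes it a no-op when the directory already exists:

# Equivalent to the exists check above, but in one call.
os.makedirs(dir_name, exist_ok=True)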

Step 6: Backup Completion Message

After the loop has processed all the rows in the DataFrame, the program prints a success message:

print("Backup completed successfully.")

This way, the user will know that the script has finished running and that all entity DDLs have been backed up.

In conclusion, this script uses the capabilities of Python, BigQuery’s information schema, and the local file system to create an organized backup of your BigQuery project. The combination of these tools makes it a powerful and flexible solution for this task.
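A closing practical note: because the backup is just plain .sql files containing CREATE statements, restoring an entity is essentially a matter of running its file again. The sketch below is a minimal, hypothetical example using the same client; the path is a placeholder based on the directory structure the program creates, and it assumes the entity no longer exists (the extracted DDL is a plain CREATE statement, so replaying it over an existing entity will fail).

# Hypothetical restore sketch: replay one backed-up DDL file against BigQuery.
# The path is a placeholder; point it at a file the backup program created.
restore_path = 'sproc_generator/backup/myDataset/BASE TABLE/my_table.sql'

with open(restore_path) as f:
    ddl = f.read()

# Submit the CREATE statement; .result() waits for the job to complete.
bq.query(ddl).result()
print(f"Restored entity from: {restore_path}")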

Thanks for reading!

Hope you enjoyed this tutorial and are able to put this concept to use on your next or current BigQuery project. If you have any questions or comments, I would love to hear from you, so please drop a comment below! If you think this is interesting but would like our team to implement it, we’d love to help with that too; book 15 minutes on our calendar to discuss!

Joey Holmesmeyer — Founder @ Smallbusinessintelligence.co
