Auto-Generate BigQuery Schema from a Pandas DataFrame
Save yourself time and headaches creating BigQuery schemas from your DataFrame with this one simple function
Intro
If you’ve ever used Python to load data into BigQuery, you’ll know that one of the most annoying parts is dealing with the table’s schema.
You try to let the load job auto-detect the schema and get an error, or you have to go through and manually specify every schema definition and supply it along with your load command.
Either way gets annoying, so I’ve written a neat function that takes the task away from us and provides us with the exact output we need for the load job.
The beauty of this function is that it automatically handles STRUCT and ARRAY records in our DataFrame.
A STRUCT is when a field in our DataFrame is actually a dictionary, e.g. a Person field with Age and Name as two values within it.
An ARRAY is if we have a field that is a list of values, i.e. REPEATED in BigQuery’s terminology.
You can even have a repeated nested field, e.g. a field called Movies which is a list of all of the movies somebody has watched, where each movie has a Name and Year attribute.
Code
Let’s start by creating some fake data which covers some different scenarios that we need to handle.
DATA = {
    "name": "Danilo",
    "age": 32,
    "date_joined": "2020-11-05",
    "location": {"country": "United Kingdom", "city": "London"},
    "years_active": [2020, 2021, 2022],
    "favourite_movies": [
        {"name": "Memento", "year": "2000"},
        {"name": "Se7en", "year": "1995"},
        {"name": "Memento", "year": "2000"},
    ],
}
We have a string field, a string which we can turn into a date object, a nested field, a list, and a list of nested fields.
The first step is for us to convert this to a DataFrame object and then specify that date_joined should be a datetime value.
df = pd.DataFrame([DATA])
df["date_joined"] = pd.to_datetime(df["date_joined"])
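As a quick sanity check (a sketch using the same data as above), we can look at each column’s numpy dtype.kind code, which is what the schema generation will key off:

```python
import pandas as pd

df = pd.DataFrame([{
    "name": "Danilo",
    "age": 32,
    "date_joined": "2020-11-05",
    "location": {"country": "United Kingdom", "city": "London"},
    "years_active": [2020, 2021, 2022],
}])
df["date_joined"] = pd.to_datetime(df["date_joined"])

# Single-character numpy kind codes: 'O' = object, 'i' = integer, 'M' = datetime
print(df.dtypes.apply(lambda d: d.kind).to_dict())
# → {'name': 'O', 'age': 'i', 'date_joined': 'M', 'location': 'O', 'years_active': 'O'}
```

Note that dicts and lists both show up as plain object ('O') columns, which is why the function needs its own checks for nested and repeated fields.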
Now, with this, we want to create the BigQuery schema definition, so here’s the code for the function generate_bigquery_schema():
The first thing we do is specify a TYPE_MAPPING: this takes the underlying numpy.dtype.kind value (read about it here) and maps it to the type that should be stored in BigQuery. For datetime values, we prefer TIMESTAMP over DATETIME because the former comes with timezone information (if unspecified, it will be stored as UTC).
We initialise an empty list called schema (this is where we will store the list of SchemaField values). Then we start iterating through the columns and dtypes within our DataFrame. We set mode to REPEATED if the value is a list; otherwise we use NULLABLE.
Note: there is one additional value that can be used for mode, and that is REQUIRED. In this function, we assume no fields are required, but if we wanted to extend the function, we could accept another argument, required_fields, and use it to set REQUIRED where needed.
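As a sketch of that extension (pick_mode and required_fields are hypothetical names, not part of the original function), the mode selection could become:

```python
def pick_mode(value, column, required_fields=()):
    # Hypothetical helper: REPEATED wins for lists, then REQUIRED for any
    # column the caller listed in required_fields, NULLABLE otherwise
    if isinstance(value, list):
        return "REPEATED"
    if column in required_fields:
        return "REQUIRED"
    return "NULLABLE"


print(pick_mode([2020, 2021], "years_active"))                 # → REPEATED
print(pick_mode("Danilo", "name", required_fields=("name",)))  # → REQUIRED
print(pick_mode(32, "age"))                                    # → NULLABLE
```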
The next condition tests for nested fields (dict) or repeated nested fields (list of dict); in either case, we convert the value into a DataFrame using pd.json_normalize and recursively run the generate_bigquery_schema function. The result gets stored in the fields variable, or, in the else condition, we set the variable to an empty tuple.
Note: pd.json_normalize takes nested data and normalises it into a flat format where each field in the nested structure becomes its own column in the DataFrame.
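For instance, running it over a record shaped like our example data:

```python
import pandas as pd

record = {"name": "Danilo", "location": {"country": "United Kingdom", "city": "London"}}
flat = pd.json_normalize(record)

# Nested keys become dot-separated column names
print(list(flat.columns))             # → ['name', 'location.country', 'location.city']
print(flat.iloc[0]["location.city"])  # → London
```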
With the fields variable populated, we set the type value to RECORD in the case where there are nested fields; otherwise we use the TYPE_MAPPING to set the type of the field.
Finally, we append the finished SchemaField definition to the schema variable, and once all of the fields are completed, we return the schema.
Full Working Script
Now that we’ve run through how the generate_bigquery_schema function works, here’s a full working script that uses it to create a table in a BigQuery project under the schema dataset.table.
Once we’ve created the schema, we:
- Convert our DataFrame to JSON
- Initialise a BigQuery client
- Create a LoadJobConfig with our schema
- Use client.load_table_from_file to load the data into BigQuery
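The script itself is missing from this copy, so here is a sketch of those four steps. It assumes google-cloud-bigquery is installed and credentials are available in the environment; "your-project.dataset.table" stands in for a real table ID, and schema_utils is a hypothetical module holding the generate_bigquery_schema function from earlier. It is untested here since running it needs a real GCP project:

```python
import io

import pandas as pd
from google.cloud import bigquery

# Hypothetical module containing the function described above
from schema_utils import generate_bigquery_schema

DATA = {
    "name": "Danilo",
    "age": 32,
    "date_joined": "2020-11-05",
    "location": {"country": "United Kingdom", "city": "London"},
    "years_active": [2020, 2021, 2022],
}

df = pd.DataFrame([DATA])
df["date_joined"] = pd.to_datetime(df["date_joined"])
schema = generate_bigquery_schema(df)

# 1. Convert our DataFrame to newline-delimited JSON, held in memory
json_data = df.to_json(orient="records", lines=True, date_format="iso")

# 2. Initialise a BigQuery client (reads credentials from the environment)
client = bigquery.Client()

# 3. Create a LoadJobConfig with our schema
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# 4. Load the data into BigQuery ("your-project.dataset.table" is a placeholder)
load_job = client.load_table_from_file(
    io.BytesIO(json_data.encode("utf-8")),
    "your-project.dataset.table",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```

Because the schema is passed explicitly in the LoadJobConfig, BigQuery skips auto-detection entirely, which avoids the errors mentioned in the intro.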