Auto-Generate BigQuery Schema from a Pandas DataFrame
Save yourself time and headaches creating BigQuery schemas from your DataFrame with this one simple function
Intro
If you’ve ever used Python to load data into BigQuery, you’ll know that one of the most annoying parts is dealing with the table’s schema.
You try to let the load job auto-detect the schema and get an error, or you have to go through and manually specify every schema definition and supply it along with your load command.
Either way gets annoying, so I’ve written a neat function that takes the task away from us and provides us with the exact output we need for the load job.
The beauty of this function is that it automatically handles STRUCT and ARRAY records in our DataFrame.
A STRUCT is when a field in our DataFrame is actually a dictionary, e.g. a Person field with Age and Name as two values within it.
An ARRAY is if we have a field that is a list of values, i.e. REPEATED in BigQuery’s terminology.
You can even have a repeated nested field, e.g. a field called Movies which is a list of all of the movies somebody has watched, where each movie has a Name and Year attribute.
Code
Let’s start by creating some fake data which covers some different scenarios that we need to handle.
DATA = {
    "name": "Danilo",
    "age": 32,
    "date_joined": "2020-11-05",
    "location": {"country": "United Kingdom", "city": "London"},
    "years_active": [2020, 2021, 2022],
    "favourite_movies": [
        {"name": "Memento", "year": "2000"},
        {"name": "Se7en", "year": "1995"},
        {"name": "Memento", "year": "2000"},
    ],
}
We have a string field, a string which we can turn into a date object, a nested field, a list, and a list of nested fields.
The first step is for us to convert this to a DataFrame object and then specify that date_joined should be a datetime value.
df = pd.DataFrame([DATA])
df["date_joined"] = pd.to_datetime(df["date_joined"])
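As a quick sanity check (a sketch using the same data as above), we can look at each column’s numpy dtype.kind code, which is what the schema generation will key off:

```python
import pandas as pd

df = pd.DataFrame([{
    "name": "Danilo",
    "age": 32,
    "date_joined": "2020-11-05",
    "location": {"country": "United Kingdom", "city": "London"},
    "years_active": [2020, 2021, 2022],
}])
df["date_joined"] = pd.to_datetime(df["date_joined"])

# Single-character numpy kind codes: 'O' = object, 'i' = integer, 'M' = datetime
print(df.dtypes.apply(lambda d: d.kind).to_dict())
# → {'name': 'O', 'age': 'i', 'date_joined': 'M', 'location': 'O', 'years_active': 'O'}
```

Note that dicts and lists both show up as plain object ('O') columns, which is why the function needs its own checks for nested and repeated fields.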
Now, with this, we want to create the BigQuery schema definition, so here’s the code for the function generate_bigquery_schema():
The first thing we do is specify a TYPE_MAPPING: this takes the underlying numpy.dtype.kind value (read about it here) and maps it to the type that should be stored in BigQuery. For datetime values, we prefer TIMESTAMP over DATETIME because the former comes with timezone information (if unspecified, it will be stored as UTC).
We initialise an empty list called schema (this is where we will store the list of SchemaField values). Then we start iterating through the columns and dtypes within our DataFrame. We set mode to REPEATED if the value is a list; otherwise we use NULLABLE.
Note: there is one additional value that can be used for mode, and that is REQUIRED. In this function, we assume no fields are required, but if we wanted to extend the function, we could accept another argument, required_fields, and use it to set REQUIRED where needed.
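As a sketch of that extension (pick_mode and required_fields are hypothetical names, not part of the original function), the mode selection could become:

```python
def pick_mode(value, column, required_fields=()):
    # Hypothetical helper: REPEATED wins for lists, then REQUIRED for any
    # column the caller listed in required_fields, NULLABLE otherwise
    if isinstance(value, list):
        return "REPEATED"
    if column in required_fields:
        return "REQUIRED"
    return "NULLABLE"


print(pick_mode([2020, 2021], "years_active"))                 # → REPEATED
print(pick_mode("Danilo", "name", required_fields=("name",)))  # → REQUIRED
print(pick_mode(32, "age"))                                    # → NULLABLE
```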
The next condition tests for nested fields (dict) or repeated nested fields (list of dict); in either case, we convert the value into a DataFrame using pd.json_normalize and recursively run the generate_bigquery_schema function. The result gets stored in the fields variable, or, in the else condition, we set the variable to an empty tuple.
Note: pd.json_normalize takes nested data and normalises it into a flat format where each field in the nested structure becomes its own column in the DataFrame.
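For instance, running it over a record shaped like our example data:

```python
import pandas as pd

record = {"name": "Danilo", "location": {"country": "United Kingdom", "city": "London"}}
flat = pd.json_normalize(record)

# Nested keys become dot-separated column names
print(list(flat.columns))             # → ['name', 'location.country', 'location.city']
print(flat.iloc[0]["location.city"])  # → London
```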
With the fields variable populated, we set the type value to RECORD in the case where there are nested fields; otherwise we use the TYPE_MAPPING to set the type of the field.
Finally, we append the finished SchemaField definition to the schema variable, and once all of the fields are completed, we return the schema.
Full Working Script
Now that we’ve run through how the generate_bigquery_schema function works, here’s a full working script that uses it to create a table in a BigQuery project under the schema dataset.table.
Once we’ve created the schema, we:
- Convert our DataFrame to JSON
- Initialise a BigQuery client
- Create a LoadJobConfig with our schema
- Use client.load_table_from_file to load the data into BigQuery
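The script itself is missing from this copy, so here is a sketch of those four steps. It assumes google-cloud-bigquery is installed and credentials are available in the environment; "your-project.dataset.table" stands in for a real table ID, and schema_utils is a hypothetical module holding the generate_bigquery_schema function from earlier. It is untested here since running it needs a real GCP project:

```python
import io

import pandas as pd
from google.cloud import bigquery

# Hypothetical module containing the function described above
from schema_utils import generate_bigquery_schema

DATA = {
    "name": "Danilo",
    "age": 32,
    "date_joined": "2020-11-05",
    "location": {"country": "United Kingdom", "city": "London"},
    "years_active": [2020, 2021, 2022],
}

df = pd.DataFrame([DATA])
df["date_joined"] = pd.to_datetime(df["date_joined"])
schema = generate_bigquery_schema(df)

# 1. Convert our DataFrame to newline-delimited JSON, held in memory
json_data = df.to_json(orient="records", lines=True, date_format="iso")

# 2. Initialise a BigQuery client (reads credentials from the environment)
client = bigquery.Client()

# 3. Create a LoadJobConfig with our schema
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# 4. Load the data into BigQuery ("your-project.dataset.table" is a placeholder)
load_job = client.load_table_from_file(
    io.BytesIO(json_data.encode("utf-8")),
    "your-project.dataset.table",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```

Because the schema is passed explicitly in the LoadJobConfig, BigQuery skips auto-detection entirely, which avoids the errors mentioned in the intro.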