BigQuery’s schema auto-detection does not always work the way we want it to, so we built our own.

Andy Yan · 99.co · Nov 15, 2022
Red Fort, New Delhi, 2018

Introduction

The story typically goes like this: you discover BigQuery’s schema auto-detection functionality. Elated that you no longer need to define your BigQuery schema manually, you migrate your workflow to use schema auto-detection. Everything works perfectly fine… until it doesn’t. New data that comes in is incompatible with existing tables. You re-read the documentation and realise that BigQuery only selects a random file in the data source and scans up to the first 500 rows of data as a representative sample to infer the column types. Now you are wondering if you need to go back to specifying the schema manually to save yourself from reacting to such incidents as they crop up unexpectedly.

If you are experiencing this, you are not alone. Our team went through the same motions in our initial days of using BigQuery. Due to design constraints, we don’t usually have the luxury of getting the schema of our source data directly. We normally work with microservices that serve us data in JSON payloads, so inferring the appropriate schema as we consume the data is the only way. Based on our requirements, we created a handy library that generates a schema able to accommodate all the data types present, so that the data can be loaded into the table without a hitch.

Library in Action

To illustrate the shortfall of BigQuery’s schema auto-detection, I have prepared this sample data:

https://gist.github.com/zshaoz/835741ad5de0dd0aae748133bfb47774

The sample data consists of 50,000 records: in 49,999 of them every column has the same data type, while 1 record (see the last line) has a different data type in one column.
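Concretely, the records look something like this (the column name "int_mix_float" appears in the generated schema later; the "id" column and the values are made up for illustration):

{"id": 1, "int_mix_float": 3}
{"id": 2, "int_mix_float": 8}
...
{"id": 50000, "int_mix_float": 7.5}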

Using bq load with autodetect enabled:

bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON test_dataset.dummy_table sample_data_1.json

The result:

Running `bq load` with `--autodetect` enabled

The auto-detection mechanism fails to take into account the one record with a differing data type, so the sample data fails to load.

Using our own library, bq-schema-generator:

First, load the sample data, then generate the BigQuery schema:
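The library’s exact API isn’t shown in this post, so what follows is a minimal Python sketch of the core idea rather than the library itself: scan every record (not just a sample), infer a BigQuery type per column, and widen the type whenever records disagree. Nested fields are omitted for brevity, and all names here are assumptions.

import json

def infer_type(value):
    # bool must be checked before int: in Python, True is an int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if value is None:
        return None  # nulls carry no type information
    return "STRING"

def merge_types(a, b):
    if b is None:
        return a
    if a is None or a == b:
        return b
    if {a, b} == {"INTEGER", "FLOAT"}:
        return "FLOAT"  # widen so both kinds of values fit
    return "STRING"     # fall back to the most permissive type

def generate_schema(path):
    column_types = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            for column, value in json.loads(line).items():
                column_types[column] = merge_types(
                    column_types.get(column), infer_type(value)
                )
    return [
        {"name": name, "type": type_ or "STRING", "mode": "NULLABLE"}
        for name, type_ in column_types.items()
    ]

with open("generated_schema.json", "w") as out:
    json.dump(generate_schema("sample_data_1.json"), out, indent=2)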

Inspect the generated schema file:
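Based on the illustrative records above, the generated file would look roughly like this (BigQuery’s standard JSON schema format; "id" is the hypothetical column from the earlier illustration):

[
  {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "int_mix_float", "type": "FLOAT", "mode": "NULLABLE"}
]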

As you can see, the schema generator takes into account the one record that has a different data type and picks a column type that can handle both data types. For the column "int_mix_float", the column type FLOAT is chosen so that it can store both integer and float values.

Then, run bq load with the generated schema:
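Assuming the schema from the previous step was saved to generated_schema.json (an assumed filename), the invocation looks like this:

bq load --source_format=NEWLINE_DELIMITED_JSON --schema=generated_schema.json test_dataset.dummy_table sample_data_1.json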

Success!

On top of that, the library can recursively parse data with deeply nested fields, e.g. data that has a list of dictionaries inside a list of dictionaries. This has really saved us the nightmare of manually specifying them one by one!
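In BigQuery terms, a nested dictionary maps to a RECORD column and a list maps to REPEATED mode. As a rough illustration with made-up field names, a record like

{"listing": {"agents": [{"name": "Jane", "deals": [{"price": 500000.0}]}]}}

would produce a schema along these lines:

[
  {"name": "listing", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "agents", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "name", "type": "STRING", "mode": "NULLABLE"},
      {"name": "deals", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "price", "type": "FLOAT", "mode": "NULLABLE"}
      ]}
    ]}
  ]}
]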

Sharing is Caring

We have made the library public for everyone to use. Feel free to fork it or suggest any improvements.

Note: This library is still in its nascent stage. Since it was created according to our own requirements, there may be some use cases that it does not yet cover. Here are some caveats:

  • Columns containing only integer values are always defaulted to FLOAT, since future numeric values could turn out to be floats (see the example after this list)
  • Columns with a mix of boolean and float values are defaulted to BOOLEAN. Ideally, they should be defaulted to STRING.
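For example, under the first caveat, a column that only ever contains integers (column name made up):

{"bedrooms": 2}
{"bedrooms": 3}

still comes out in the generated schema as:

{"name": "bedrooms", "type": "FLOAT", "mode": "NULLABLE"}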

P.S. Credits also go to Goh Jia Wei and Daryl Lim for contributing to this project.
