Reading NULLABLE fields with BigQueryIO in Apache Beam

Israel Herraiz
Google Cloud - Community
3 min read · Jun 21, 2020

When reading data from BigQuery with BigQueryIO in an Apache Beam pipeline written in Java, you can read the records as TableRow (convenient, but with lower performance) or as GenericRecord (better performance, but some extra work may be required in your code too).

In a previous post, I showed that when using GenericRecord, the BigQuery types are more or less easily mapped to Java types — with the exception of NUMERIC, which requires some additional work.

But that mapping only holds for BigQuery columns that are required. Nullable columns need some additional parsing to extract their values from the records.

When creating the schema for a table in BigQuery, we can choose among three different modes for a column: nullable, required and repeated.

The default mode is nullable. A nullable field may contain a value of the type of the column, or, well, null. Easy.

However, when reading the data with BigQueryIO as GenericRecord, Avro schemas don't have nullable types. Instead, Avro has a null type, in addition to the other types. So to specify that a field may contain null values, you declare several types for the same field; in Avro, that's called a union. For instance, if your field named myfield contains strings and is nullable, the Avro schema would look like this:

{"name": "myfield", "type": ["null", "string"]}

In a GenericRecord, the type of myfield would be UNION, and we need to traverse the types of that field to recover its actual value (either null or a string).
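To make the union concrete, here is a small sketch (assuming the Avro library is on the classpath, as it is in any Beam pipeline using BigQueryIO) that parses the schema above and inspects the field:

```java
import org.apache.avro.Schema;

public class NullableSchemaDemo {
    public static void main(String[] args) {
        // A record wrapper is needed because Schema.Parser parses
        // complete schemas, not bare field declarations.
        String json = "{\"type\": \"record\", \"name\": \"Row\", \"fields\": ["
                + "{\"name\": \"myfield\", \"type\": [\"null\", \"string\"]}]}";
        Schema rowSchema = new Schema.Parser().parse(json);

        Schema fieldSchema = rowSchema.getField("myfield").schema();
        System.out.println(fieldSchema.getType());   // prints UNION
        System.out.println(fieldSchema.getTypes());  // the two branches: null and string
    }
}
```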

So let's assume we have the following two variables (all code snippets are in Java):

String fieldName;  // the name of the column
GenericRecord rec; // the values in the row to be parsed

We can recover the schema and the type of that field:

Schema fieldSchema = rec.getSchema().getField(fieldName).schema();
Schema.Type type = fieldSchema.getType();

And now is when we should check whether the type is UNION:

if (type == Schema.Type.UNION) {
  // We have a NULLABLE field
  ...

We could also have checked the size of fieldSchema.getTypes(), which will be 2 for nullable fields (null plus the actual type). Beware, though, that getTypes() is only valid on union schemas; calling it on a required field's schema throws an exception, so it is safer to check the type first.
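Both checks can be combined in a small helper; for instance (isNullable is a hypothetical name for illustration, not part of the Beam or Avro APIs):

```java
import org.apache.avro.Schema;

public class SchemaChecks {
    /**
     * Returns true when the field schema is a [null, T] union,
     * i.e. the Avro representation of a NULLABLE BigQuery column.
     * Checking the type first avoids calling getTypes() on a
     * non-union schema, which would throw.
     */
    static boolean isNullable(Schema fieldSchema) {
        return fieldSchema.getType() == Schema.Type.UNION
                && fieldSchema.getTypes().size() == 2;
    }
}
```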

After we have checked that we have a NULLABLE field, we can recover the types, try to identify the NULL type (among the two types contained in the field), and use the other type to parse the value:

List<Schema> types = fieldSchema.getTypes();
// Let's see which one is the NULL and let's keep the other
if (types.get(0).getType() == Schema.Type.NULL) {
  type = types.get(1).getType();
} else {
  type = types.get(0).getType();
}

And after that, we can parse the value as if it were just another required value (that is, one with a single type). Just beware that the field may still contain null if the original cell in the table was NULL.

See the full snippet of code:
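The gist embedded in the original post is not reproduced here; the following sketch puts the steps above together in a single helper. The method name extractValue and the handling of the non-string cases are my own illustration, not part of any API:

```java
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class NullableFieldParser {
    /**
     * Extracts the value of a column from a GenericRecord read with
     * BigQueryIO, unwrapping the [null, T] union used for NULLABLE
     * columns. Returns null when the column held a NULL value.
     * (extractValue is a hypothetical helper name for this sketch.)
     */
    static Object extractValue(String fieldName, GenericRecord rec) {
        Schema fieldSchema = rec.getSchema().getField(fieldName).schema();
        Schema.Type type = fieldSchema.getType();

        if (type == Schema.Type.UNION) {
            // NULLABLE column: keep the non-null branch of the union
            List<Schema> types = fieldSchema.getTypes();
            type = (types.get(0).getType() == Schema.Type.NULL)
                    ? types.get(1).getType()
                    : types.get(0).getType();
        }

        Object value = rec.get(fieldName);
        if (value == null) {
            return null;  // the original cell was NULL
        }
        switch (type) {
            case STRING:
                // Avro may hand strings back as Utf8; convert explicitly
                return value.toString();
            // ... handle the other BigQuery types here (e.g. LONG,
            // DOUBLE, and BYTES for NUMERIC, as in the previous post)
            default:
                return value;
        }
    }
}
```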

When reading from BigQuery, just don't give up on using GenericRecord. Sometimes you will have to do some additional parsing, as in this example with NULLABLE fields or when you are using NUMERIC fields, but the extra performance you are going to get is worth a couple of extra lines of code.

For reading NULLABLE fields, remember that the type will be a UNION of two types: the actual type of the field and null. And beware of nulls when reading values!
