Reading NUMERIC fields with BigQueryIO in Apache Beam

Israel Herraiz
Google Cloud - Community
4 min read · Jun 15, 2020
Reading NUMERIC fields from BigQuery using Apache Beam's BigQueryIO

Among all the number types in BigQuery, NUMERIC offers the highest precision (38 digits) and scale (9 digits), which makes it the preferred option for applications that require very high precision. Handling NUMERIC values in BigQuery is straightforward, but not so much in an Apache Beam pipeline that reads data from BigQuery. In this post, we will see how we can read the NUMERIC type from BigQuery with BigQueryIO, so we can use decimal values with the highest possible precision in our pipeline.

When reading data from BigQuery using Apache Beam and BigQueryIO, there are two main ways to read the records: read(SerializableFunction), which transforms an Avro GenericRecord into a custom class, or readTableRows, which is easier to use but offers lower performance.

So read offers the optimal performance, but it comes at a developer cost: you need to parse each field into the right Java type.

Not so much of a hassle, is it? Well, let's look at the conversion table from the documentation of BigQueryIO:

Type conversions when reading Avro records from BigQuery

All the types seem quite straightforward (BOOLEAN maps to boolean, FLOAT64 to double, etc.).

But, wait a minute, what's the Java type for NUMERIC? It is a ByteBuffer…

How can we actually transform that into a number in Java?

Let's see how we can parse Avro fields and map them to Java types in Beam.

Imagine that we are reading from BigQuery, and we want to map a row in a table to our custom class SomeCustomObject. We can do so in BigQueryIO with something like the following:
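The snippet below is a minimal sketch of that idea, not the original gist: the table reference is made up, and it assumes that SomeCustomObject is Serializable and exposes a (hypothetical) fromGenericRecord factory method.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.values.PCollection;

// pipeline is an existing Pipeline instance
PCollection<SomeCustomObject> rows =
    pipeline.apply("ReadFromBigQuery",
        BigQueryIO.<SomeCustomObject>read((SchemaAndRecord schemaAndRecord) -> {
              // Each row arrives as an Avro GenericRecord plus its table schema
              GenericRecord record = schemaAndRecord.getRecord();
              return SomeCustomObject.fromGenericRecord(record);
            })
            .from("my-project:my_dataset.my_table")  // hypothetical table
            .withCoder(SerializableCoder.of(SomeCustomObject.class)));
```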

We get the row value in a GenericRecord, and we now need to parse its contents into the right types. Using the table above, this is a trivial task for most of the basic types. For instance, for a String, a Long or a Double:
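A quick sketch of what that parsing might look like (the column names are hypothetical, as explained below):

```java
// fieldName1 holds STRING values, fieldName2 INT64 values and fieldName3 FLOAT64 values
String s = record.get("fieldName1").toString();  // Avro strings arrive as Utf8, so call toString()
Long l = (Long) record.get("fieldName2");
Double d = (Double) record.get("fieldName3");
```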

The code above assumes that fieldName1 is a column that contains Strings, fieldName2 a column that contains Longs, and fieldName3 a column that contains Doubles. In a production environment, it would be wise to actually check the type of each field. We will see below how to do that.

How do we do that transform if the column in BigQuery contains a NUMERIC?

That transform is more complex. In short, we need to map the array of bytes to a java.math.BigDecimal, because other number types do not have enough precision to hold a NUMERIC. But that mapping to BigDecimal requires several steps.

Let's do things right from the beginning, and first check that the type is BYTES. When working with GenericRecords, I try to always write a function with a switch that selects the right parsing code for each Avro type. In the code below, we only include the code for a NUMERIC object. Later in this post, I include a link with the full code for more types in the same switch.
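Here is a sketch of such a function, assuming the fields are REQUIRED (NULLABLE columns wrap the type in a union, which we leave aside here); a STRING case is included next to BYTES just for illustration:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

static Object parseField(GenericRecord record, String fieldName) {
  Schema fieldSchema = record.getSchema().getField(fieldName).schema();
  switch (fieldSchema.getType()) {
    case STRING:
      // Avro strings arrive as Utf8; convert them to a Java String
      return record.get(fieldName).toString();
    case BYTES:
      LogicalType logicalType = fieldSchema.getLogicalType();
      if (logicalType instanceof LogicalTypes.Decimal) {
        // NUMERIC: the scale lives in the Avro schema, not in the value
        int scale = ((LogicalTypes.Decimal) logicalType).getScale();
        ByteBuffer buffer = (ByteBuffer) record.get(fieldName);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        // The bytes are the big-endian two's complement unscaled value
        return new BigDecimal(new BigInteger(bytes), scale);
      }
      // A plain BYTES column: return the raw buffer
      return record.get(fieldName);
    default:
      throw new IllegalArgumentException(
          "Type not handled in this example: " + fieldSchema.getType());
  }
}
```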

After we make sure that the type is BYTES, we check whether the field has a logical type or not. What is that? The Avro type BYTES is actually used for two BigQuery types: BYTES and NUMERIC. The type NUMERIC has an additional logical type (decimal), which contains the precision and scale of the number encoded in the Avro GenericRecord.

We need to find out the scale in order to be able to parse the array of bytes as a number. That's what reading the logical type in the snippet above gives us; the scale is actually part of the schema of the Avro GenericRecord, not of the value.

The value in the GenericRecord is encoded as a ByteBuffer. We then take the bytes in that buffer and parse them as a BigDecimal, using the scale that we just recovered from the schema.

How do we combine all of the above to read data from BigQuery in Beam? I have added a gist with a full pipeline example. In this example, the input is a table with just two fields: key is a column of type STRING, and value is a column of type NUMERIC. The example contains a generic function to parse different types of data from a GenericRecord. We need to know the type of the field we are reading in order to use the function (see an example of how to use that function to read the key and value columns). The function is slightly more complex than the code shown in the rest of the snippets, because BigQuery columns that are NULLABLE are actually represented as a union in Avro. But NULLABLE fields will be the topic of another post! (Note: the post on NULLABLE fields has since been published.)
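As a rough sketch of how the pieces fit together for that key/value table (the table reference is made up, and parseField is the function sketched earlier, here used for REQUIRED fields only):

```java
import java.math.BigDecimal;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.BigDecimalCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// pipeline is an existing Pipeline instance
PCollection<KV<String, BigDecimal>> keysAndValues =
    pipeline.apply("ReadKeysAndValues",
        BigQueryIO.<KV<String, BigDecimal>>read((SchemaAndRecord schemaAndRecord) -> {
              GenericRecord record = schemaAndRecord.getRecord();
              String key = (String) parseField(record, "key");              // STRING column
              BigDecimal value = (BigDecimal) parseField(record, "value");  // NUMERIC column
              return KV.of(key, value);
            })
            .from("my-project:my_dataset.my_table")  // hypothetical table
            .withCoder(KvCoder.of(StringUtf8Coder.of(), BigDecimalCoder.of())));
```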

So if you have a column of type NUMERIC, don't be frustrated trying to read it from your Apache Beam pipeline. You can still use GenericRecord and squeeze out some more performance than with TableRows; you just need to extract the scale of the number from the schema and use it to construct a BigDecimal object.
