BigQuery: deltas to latest

Published in

Google Cloud - Community

2 min readMay 13, 2020

The Challenge

A colleague presented me with a challenge:

I have a series of updates with only the changed fields set. NULLs indicate no change. I have the original records. How do I handle this in BigQuery?

Putting this together into a BigQuery script will be for another day, but for now the focus will be using BigQuery ARRAY_AGG for processing the changes.

UPDATE — a additional pattern has been added to BigQuery: deltas to latest — all history.

The Solution

Test data

The OriginalData will capture the current non-null state of the all of the fields. In this example, name is the primary key and id is an incrementing INT64 that represents the “latest” and best value.

OriginalData AS (SELECT * FROM UNNEST([
  STRUCT('nameA' AS name, 0 AS id, 20 AS age, 'superA' AS type),
  STRUCT('nameB' AS name, 0 AS id, 21 AS age, 'superB' AS type),
  STRUCT('nameC' AS name, 0 AS id, 22 AS age, 'superC' AS type),
  STRUCT('nameD' AS name, 0 AS id, 23 AS age, 'superD' AS type)
]))

The UpdatedData represents the incoming deltas. Name and id are always set, but age and type may be NULL

UpdatedData AS (SELECT * FROM UNNEST([
  STRUCT('nameA' AS name,
    3 AS id,
    10 AS age,
  CAST(NULL AS STRING) AS type),
      STRUCT('nameA' AS name,
    2 AS id,
    23 AS age,
    'superA-2' AS type),
  STRUCT('nameA' AS name,
    4 AS id,
    CAST(NULL AS INT64) AS age,
    CAST(NULL AS STRING) AS type)
]))

Query

Putting it together into a single runnable query.

The data is UNION’d together so we have the changes and the original and group’d by the name.

The ARRAY_AGG is using on a field-by-field basis does the following:

Arrange all of the non-null fields into an ARRAY
Sort them by ID descending (so the value in the first position is the most greatest ID)
Pull out the offset at 0

All together in a query:

WITH
  OriginalData AS (SELECT * FROM UNNEST([
    STRUCT('nameA' AS name, 0 AS id, 20 AS age, 'superA' AS type),
    STRUCT('nameB' AS name, 0 AS id, 21 AS age, 'superB' AS type),
    STRUCT('nameC' AS name, 0 AS id, 22 AS age, 'superC' AS type),
    STRUCT('nameD' AS name, 0 AS id, 23 AS age, 'superD' AS type)
  ])),
  UpdatedData AS (SELECT * FROM UNNEST([
    STRUCT('nameA' AS name,
      3 AS id,
      10 AS age,
      CAST(NULL AS STRING) AS type),
    STRUCT('nameA' AS name,
      2 AS id,
      23 AS age,
      'superA-2' AS type),
    STRUCT('nameA' AS name,
      4 AS id,
      CAST(NULL AS INT64) AS age,
      CAST(NULL AS STRING) AS type)
  ]))
SELECT
  name,
  MAX(id) AS id,
  ARRAY_AGG(age IGNORE NULLS ORDER BY id DESC)[OFFSET(0)] AS age,
  ARRAY_AGG(type IGNORE NULLS ORDER BY id DESC)[OFFSET(0)] AS type
FROM
(
  SELECT * FROM OriginalData
  UNION ALL
  SELECT * FROM UpdatedData
)
GROUP BY
  name;

Conclusion

This shows how ARRAY_AGG can be used on a field-by-field basis in order to find the most-recent non-NULL value.

Using BigQuery scripting, this can be incorporated into a script that applies deltas to a snapshot and updated the snapshot. That will be for another day.