BigQuery: delta to latest — all history

Published in

Google Cloud - Community

2 min readJul 16, 2020

The Challenge

Building on BigQuery: delta to latest, rather than taking multiple changes and finding the latest value we want to capture the changes over time.

Rather than using ARRAY_AGG this will use LAST_VALUE as an analytic function.

The Solution

Test Data

First we will create a set of the test data called testing.sparsedata. This will have a unique key id, a timestamp ts, and columns c1 through c9 that contain random data or NULL.

CREATE OR REPLACE TABLE `testing.sparsedata` AS
SELECT
  CAST(FLOOR(RAND()*1000) AS INT64) AS id,
  TIMESTAMP_ADD(
    CURRENT_TIMESTAMP(),
    INTERVAL CAST(FLOOR(RAND()*1000) AS INT64) SECOND) AS ts,
  IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c1,
  IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c2,
  IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c3,
  IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c4,
  IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c5,
  IF(RAND()>0.75,
    CONCAT("id=", CAST(RAND() AS STRING)),
    NULL) AS c6,
  IF(RAND()>0.75,
    CONCAT("id=", CAST(RAND() AS STRING)),
    NULL) AS c7,
  IF(RAND()>0.75,
    CONCAT("id=", CAST(RAND() AS STRING)),
    NULL) AS c8,
  IF(RAND()>0.75,
    CONCAT("id=", CAST(RAND() AS STRING)),
    NULL) AS c9
FROM
  UNNEST(GENERATE_ARRAY(1, 1000000)) AS d

Query

Turning that data into a series of records where for each column c1 through c9, it has the most recent non-null data can be done with LAST_VALUE and IGNORE NULLs analytic functions.

In this case, a named window is also used to simplify writing the SQL but also potentially an optimization, i.e., the window may be calculated only once.

The query is as follows:

SELECT
  id,
  ts,
  LAST_VALUE(c1 IGNORE NULLS) OVER (sparse_data) AS c1,
  LAST_VALUE(c2 IGNORE NULLS) OVER (sparse_data) AS c2,
  LAST_VALUE(c3 IGNORE NULLS) OVER (sparse_data) AS c3,
  LAST_VALUE(c4 IGNORE NULLS) OVER (sparse_data) AS c4,
  LAST_VALUE(c5 IGNORE NULLS) OVER (sparse_data) AS c5,
  LAST_VALUE(c6 IGNORE NULLS) OVER (sparse_data) AS c6,
  LAST_VALUE(c7 IGNORE NULLS) OVER (sparse_data) AS c7,
  LAST_VALUE(c8 IGNORE NULLS) OVER (sparse_data) AS c8,
  LAST_VALUE(c9 IGNORE NULLS) OVER (sparse_data) AS c9
FROM
  `testing.sparsedata`
WINDOW sparse_data AS (
  PARTITION BY id
  ORDER BY ts ASC
  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Conclusion

Using analytic functions a similar affect can be achieved as ARRAY_AGG, including filling in sparse data.