BigQuery: delta to latest — all history
The Challenge
Building on BigQuery: delta to latest, rather than taking multiple changes and finding the latest value we want to capture the changes over time.
Rather than using ARRAY_AGG this will use LAST_VALUE as an analytic function.
The Solution
Test Data
First we will create a set of the test data called testing.sparsedata. This will have a unique key id, a timestamp ts, and columns c1 through c9 that contain random data or NULL.
CREATE OR REPLACE TABLE `testing.sparsedata` AS
SELECT
CAST(FLOOR(RAND()*1000) AS INT64) AS id,
TIMESTAMP_ADD(
CURRENT_TIMESTAMP(),
INTERVAL CAST(FLOOR(RAND()*1000) AS INT64) SECOND) AS ts,
IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c1,
IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c2,
IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c3,
IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c4,
IF(RAND()>0.75, CAST(RAND() AS STRING), NULL) AS c5,
IF(RAND()>0.75,
CONCAT("id=", CAST(RAND() AS STRING)),
NULL) AS c6,
IF(RAND()>0.75,
CONCAT("id=", CAST(RAND() AS STRING)),
NULL) AS c7,
IF(RAND()>0.75,
CONCAT("id=", CAST(RAND() AS STRING)),
NULL) AS c8,
IF(RAND()>0.75,
CONCAT("id=", CAST(RAND() AS STRING)),
NULL) AS c9
FROM
UNNEST(GENERATE_ARRAY(1, 1000000)) AS d
Query
Turning that data into a series of records where for each column c1 through c9, it has the most recent non-null data can be done with LAST_VALUE and IGNORE NULLs analytic functions.
In this case, a named window is also used to simplify writing the SQL but also potentially an optimization, i.e., the window may be calculated only once.
The query is as follows:
SELECT
id,
ts,
LAST_VALUE(c1 IGNORE NULLS) OVER (sparse_data) AS c1,
LAST_VALUE(c2 IGNORE NULLS) OVER (sparse_data) AS c2,
LAST_VALUE(c3 IGNORE NULLS) OVER (sparse_data) AS c3,
LAST_VALUE(c4 IGNORE NULLS) OVER (sparse_data) AS c4,
LAST_VALUE(c5 IGNORE NULLS) OVER (sparse_data) AS c5,
LAST_VALUE(c6 IGNORE NULLS) OVER (sparse_data) AS c6,
LAST_VALUE(c7 IGNORE NULLS) OVER (sparse_data) AS c7,
LAST_VALUE(c8 IGNORE NULLS) OVER (sparse_data) AS c8,
LAST_VALUE(c9 IGNORE NULLS) OVER (sparse_data) AS c9
FROM
`testing.sparsedata`
WINDOW sparse_data AS (
PARTITION BY id
ORDER BY ts ASC
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
Conclusion
Using analytic functions a similar affect can be achieved as ARRAY_AGG, including filling in sparse data.