Time series analytics with BigQuery

Techniques for tumbles, fills, and linear interpolation

Patrick Dunn
Google Cloud - Community
7 min read · Nov 13, 2020


(Update 12/01/2020: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the tumble function can be invoked with bqutil.fn.ts_tumble())

As BigQuery is increasingly used as a store for real-time analytic data such as web events and IoT telemetry, many users are asking for time series functions similar to those found in special-purpose products like OpenTSDB. While BigQuery does not currently have dedicated time series functions, a few simple functions and techniques let you perform time series analytics directly within BigQuery.

For the purposes of demonstration, the samples in this guide use the public San Francisco 311 dataset, performing time series analysis using the created_date and supervisor_district for the timestamp and series key, respectively. These examples can be applied to common time series problems like log analytics or IoT monitoring.

This guide requires you to create new UDFs and assumes you will be creating them in a dataset named ‘timeseries’.

Creating and grouping by a tumble function

Sometimes called a time bucket or time slice, a tumble represents a non-overlapping time window that data can fall into. This differs from the TIMESTAMP_TRUNC() function because it can aggregate on a window of any number of seconds.

This function is fairly simple: it rounds down by the tumble size, allowing values to be aligned and aggregated into the same time group.
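A minimal sketch of such a function, created in the ‘timeseries’ dataset (the community ts_tumble implementation may differ in detail):

```sql
-- Rounds a timestamp down to the start of its tumble window.
-- tumble_seconds is the window size, e.g. 900 for 15 minutes.
CREATE OR REPLACE FUNCTION timeseries.tumble(val TIMESTAMP, tumble_seconds INT64) AS (
  TIMESTAMP_SECONDS(DIV(UNIX_SECONDS(val), tumble_seconds) * tumble_seconds)
);
```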

Testing the function shows the tumble column rounding down to 15-minute (900-second) intervals.
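An illustrative test query against the public 311 dataset (table and column names as published in the public dataset):

```sql
SELECT created_date,
       timeseries.tumble(created_date, 900) AS tumble
FROM `bigquery-public-data.san_francisco_311.311_service_requests`
LIMIT 5;
```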

The rounded down tumbling window can be used to aggregate data for that time segment, e.g. the count of events that occurred during that time window.
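For example, counting events per district in each 15-minute window (the date filter here is only an assumption to keep the scan small):

```sql
SELECT timeseries.tumble(created_date, 900) AS tumble,
       supervisor_district,
       COUNT(*) AS cnt
FROM `bigquery-public-data.san_francisco_311.311_service_requests`
WHERE created_date >= '2020-01-01' AND created_date < '2020-01-02'
GROUP BY tumble, supervisor_district
ORDER BY tumble, supervisor_district;
```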

Handling missing values

Looking at the query results, there is a problem: time ranges that did not have any events are missing entirely from the output. This can be particularly problematic in IoT scenarios where an important metric is calculated from two or more sensor values.

The most common techniques for handling missing values are to fill the time window with a default value such as 0, to carry forward the nearest preceding valid value, or to estimate a value using linear interpolation.

Generate timestamp candidates

In order to create synthetic rows in BigQuery, we rely on the UNNEST operator, which converts an array of values into rows, and the GENERATE_TIMESTAMP_ARRAY function, which creates the array of candidate values to fill in the blanks. The synthetic rows can then be left outer joined to the original time series aggregation to fill in the gaps.

To simplify the task, we will create a gen_ts_candidates(..) function that returns an array of struct values, each containing a timestamp and a series key.
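One possible shape for this function; the exact signature shown here is an assumption:

```sql
-- Returns one (series_key, tumble) candidate for every window in the range.
CREATE OR REPLACE FUNCTION timeseries.gen_ts_candidates(
    series_keys ARRAY<STRING>,
    min_ts TIMESTAMP,
    max_ts TIMESTAMP,
    tumble_seconds INT64)
RETURNS ARRAY<STRUCT<series_key STRING, tumble TIMESTAMP>> AS ((
  SELECT ARRAY_AGG(STRUCT(key AS series_key, ts AS tumble))
  FROM UNNEST(series_keys) AS key
  CROSS JOIN UNNEST(GENERATE_TIMESTAMP_ARRAY(
      min_ts, max_ts, INTERVAL tumble_seconds SECOND)) AS ts
));
```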

Filling the record gaps with a join

The SQL below demonstrates a technique for filling in the missing rows.

  1. The named subquery ‘requests’ represents the original aggregated time window query.
  2. The named subquery ‘args’ builds the arguments for the gen_ts_candidates function. Alternatively, you can pass your arguments directly in the query.
  3. The generated candidate time series are then left outer joined to the original table to fill the missing rows.
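Put together, the technique looks roughly like this (the date filter and the CAST of supervisor_district are assumptions to keep the example self-contained):

```sql
WITH requests AS (
  SELECT timeseries.tumble(created_date, 900) AS tumble,
         CAST(supervisor_district AS STRING) AS series_key,
         COUNT(*) AS cnt
  FROM `bigquery-public-data.san_francisco_311.311_service_requests`
  WHERE created_date >= '2020-01-01' AND created_date < '2020-01-02'
  GROUP BY tumble, series_key
),
args AS (
  SELECT ARRAY_AGG(DISTINCT series_key) AS series_keys,
         MIN(tumble) AS min_ts,
         MAX(tumble) AS max_ts
  FROM requests
)
SELECT c.series_key, c.tumble, r.cnt
FROM args,
     UNNEST(timeseries.gen_ts_candidates(
         args.series_keys, args.min_ts, args.max_ts, 900)) AS c
LEFT OUTER JOIN requests r
  ON r.series_key = c.series_key AND r.tumble = c.tumble
ORDER BY c.series_key, c.tumble;
```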

Filling with last value or a default

Once there is a record for each time window, the LAST_VALUE window function or a default value can be used to fill in the missing value for those new records. The LAST_VALUE function takes the last non-null value within the window specification, which includes the current row's value, while the default value replaces the current value with a constant if it is null.

Building on the prior example, you can now see those empty windows filled with valid values. This example also orders by the series key, supervisor district, to make the LAST_VALUE functionality more apparent.
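A sketch of both variants, assuming the gap-filled rows from the previous join are available in a named subquery called filled with columns (series_key, tumble, cnt):

```sql
SELECT series_key, tumble,
       -- carry forward the last non-null count; the default frame
       -- (UNBOUNDED PRECEDING to CURRENT ROW) includes the current row
       LAST_VALUE(cnt IGNORE NULLS) OVER (
         PARTITION BY series_key ORDER BY tumble) AS cnt_last,
       -- or replace a null with a constant default
       IFNULL(cnt, 0) AS cnt_default
FROM filled
ORDER BY series_key, tumble;
```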

Linear interpolation between valid values

Linear interpolation is a simple concept whose math will be immediately recognizable. The function finds the slope between the preceding and following values and uses the straight-line equation (y = mx + b) to determine the value at the current position. The trick is in providing the relative coordinates of the values being used. A BigQuery struct can be used as a complex type containing the (x, y) coordinates; the number of timestamp seconds can be used for the relative position on the x-axis.

Create the linear_interpolation function and a convenience function, interpolate_ts_val_or_fill(), to translate timestamps and handle error cases.
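A sketch of the two functions; the signatures and null handling here are assumptions:

```sql
-- y = y1 + (x - x1) * (y2 - y1) / (x2 - x1), with (x, y) packaged as structs
CREATE OR REPLACE FUNCTION timeseries.linear_interpolation(
    pos INT64,
    prev STRUCT<x INT64, y FLOAT64>,
    next STRUCT<x INT64, y FLOAT64>) AS (
  prev.y + (pos - prev.x) * ((next.y - prev.y) / (next.x - prev.x))
);

-- Translates timestamps to epoch seconds and falls back to a neighboring
-- value when one side of the interpolation is missing.
CREATE OR REPLACE FUNCTION timeseries.interpolate_ts_val_or_fill(
    ts TIMESTAMP,
    val FLOAT64,
    prev STRUCT<x TIMESTAMP, y FLOAT64>,
    next STRUCT<x TIMESTAMP, y FLOAT64>) AS (
  CASE
    WHEN val IS NOT NULL THEN val
    WHEN prev IS NULL OR next IS NULL THEN COALESCE(prev.y, next.y)
    ELSE timeseries.linear_interpolation(
        UNIX_SECONDS(ts),
        STRUCT(UNIX_SECONDS(prev.x) AS x, prev.y AS y),
        STRUCT(UNIX_SECONDS(next.x) AS x, next.y AS y))
  END
);
```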

Using structs for the coordinates is important because they allow multiple values to be packaged in a way that can be picked up by the LAST_VALUE and FIRST_VALUE window functions.

Rewriting the full SQL used to interpolate time series gaps, the notable changes are marked in bold.
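A sketch of the interpolation step, again assuming the gap-filled rows are available in a named subquery filled with columns (series_key, tumble, cnt):

```sql
SELECT series_key, tumble,
       timeseries.interpolate_ts_val_or_fill(
         tumble,
         CAST(cnt AS FLOAT64),
         -- last known (x, y) coordinate strictly before this row
         LAST_VALUE(
           IF(cnt IS NULL, NULL, STRUCT(tumble AS x, CAST(cnt AS FLOAT64) AS y))
           IGNORE NULLS) OVER (
             PARTITION BY series_key ORDER BY tumble
             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
         -- next known (x, y) coordinate strictly after this row
         FIRST_VALUE(
           IF(cnt IS NULL, NULL, STRUCT(tumble AS x, CAST(cnt AS FLOAT64) AS y))
           IGNORE NULLS) OVER (
             PARTITION BY series_key ORDER BY tumble
             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
       ) AS cnt_interpolated
FROM filled
ORDER BY series_key, tumble;
```

Packaging the coordinate as a struct lets a single LAST_VALUE or FIRST_VALUE call carry both the timestamp and the metric through the window frame.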

Linear interpolation clearly seen between rows 5 and 9

Comparing two series

At first glance, even with the functions in place, this is too much SQL to line up and compare two series. A view can make the querying more manageable.

Now a simple join can be used to compare between two series.
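For example, assuming the interpolation query has been wrapped in a view named timeseries.sf311_interpolated with columns (series_key, tumble, cnt_interpolated), two districts can be compared with a self-join:

```sql
SELECT a.tumble,
       a.cnt_interpolated - b.cnt_interpolated AS diff
FROM timeseries.sf311_interpolated a
JOIN timeseries.sf311_interpolated b
  USING (tumble)
WHERE a.series_key = '6'
  AND b.series_key = '11'
ORDER BY a.tumble;
```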

Difference between the interpolated values of series 6 and 11

Additional resources

In a future post I plan on exploring sliding windows and alternative interpolation functions.
