Time series analytics with BigQuery

Techniques for tumbles, fills, and linear interpolation

Patrick Dunn
Nov 13, 2020 · 7 min read

(Update 12/01/2020: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the tumble function can be invoked with bqutil.fn.ts_tumble())

As BigQuery is increasingly used as a store for real time analytic data such as web events and IoT; many users are asking for time series functions, similar to those found in a special purpose product like OpenTSDB. While BigQuery does not currently have dedicated time series functions, there are a few simple functions and techniques you can use to perform time series analytics directly within BigQuery.

For the purposes of demonstration, the samples in this guide use the public San Francisco 311 dataset, performing time series analysis using the created_date and supervisor_district for the timestamp and series key, respectively. These examples can be applied to common time series problems like log analytics or IoT monitoring.

This guide requires you to create new UDFs and assumes you will be creating them in a dataset named ‘timeseries’.

Creating and grouping by a tumble function

Sometimes called a time bucket or time slice, a tumble represents a non-overlapping time window the data can fall into. This differs from a timestamp_trunc() function because it can aggregate on a window of any number of seconds.

This function is fairly simple, it rounds down by the tumble size, allowing values to be aligned and aggregated into the same time group.

CREATE OR REPLACE FUNCTION timeseries.tumble_interval(
val TIMESTAMP, tumble_seconds INT64)
AS (
timestamp_seconds(div(UNIX_SECONDS(val), tumble_seconds) * tumble_seconds))
);

Testing the function, shows the tumble column which rounds down to 15 minute (900 seconds) intervals.

SELECT
unique_key,
supervisor_district,
created_date,
timeseries.tumble_interval(created_date, 900) tumble
FROM `bigquery-public-data.san_francisco.311_service_requests`
WHERE supervisor_district in (6, 11)
ORDER BY supervisor_district, tumble desc

The rounded down tumbling window can be used to aggregate data for that time segment, e.g. the count of events that occurred during that time window.

SELECT
supervisor_district,
timeseries.tumble_interval(created_date, 900) tumble,
count(*) as count
FROM `bigquery-public-data.san_francisco.311_service_requests`
WHERE supervisor_district in (6, 11)
GROUP BY supervisor_district, tumble
order by tumble desc

Handling missing values

Looking at the query results, there is a problem. The time ranges that did not have any events, show up as completely missing records. This can be particularly problematic in IoT scenarios where an important metric is calculated between two or more sensor values.

The most common techniques for handling missing values are to: fill the time window with a default value like ‘0’, the nearest preceding valid value, or plot an estimated value using linear interpolation.

Generate timestamp candidates

In order to create synthetic rows in BigQuery we are going to rely on the UNNEST operation which converts an array of values into rows and the generate timestamp array function to create the array of candidate values to fill in the blanks. The synthetic rows can then be left outer joined onto the original time series aggregation to fill in blanks.

To simplify the task we will create the gen_ts_candidates(..) function that returns an array of struct values containing timestamp and series key.

CREATE OR REPLACE FUNCTION
timeseries.gen_ts_candidates(keys ARRAY<STRING>, tumble_seconds INT64, min_ts TIMESTAMP, max_ts Timestamp)
AS ((
SELECT ARRAY_AGG(x)
FROM (
SELECT series_key, tumble_val
FROM UNNEST(
GENERATE_TIMESTAMP_ARRAY(
timeseries.tumble_interval(min_ts, tumble_seconds),
timeseries.tumble_interval(max_ts, tumble_seconds),
INTERVAL tumble_seconds SECOND
)
) AS tumble_val
CROSS JOIN UNNEST(keys) AS series_key
) x
));

Filling the record gaps with a join

The complex SQL shown below shows a technique for filling values.

  1. The named subquery requests represents the original aggregated time window query.
  2. The named subquery args creates arguments for the gen_ts_candidates function. Alternatively you can directly input your args in the query.
  3. The generated candidate time series are then left outer joined to the original table to fill the missing rows.
WITH requests as (
SELECT
cast(supervisor_district as STRING) supervisor_district,
timeseries.tumble_interval(created_date, 900) as tumble,
count(*) as count
FROM `bigquery-public-data.san_francisco.311_service_requests` a
WHERE supervisor_district in (6, 11)
AND CAST(created_date as DATE) = '2018-01-18'
GROUP BY supervisor_district, tumble
),
args AS (
SELECT
array_agg(DISTINCT supervisor_district) as key,
min(tumble) as min_ts,
max(tumble) as max_ts
FROM requests
)
SELECT *
FROM UNNEST(
(SELECT
timeseries.gen_ts_candidates(key, 900, min_ts, max_ts)
FROM args)
) a
LEFT OUTER JOIN requests b
ON a.series_key = b.supervisor_district
AND a.tumble_val = b.tumble
ORDER BY tumble_val DESC

Filling with last value or a default

Once there is a record for each time window, the LAST_VALUE window function or a default value can be used to fill in the missing value for those new records. The LAST_VALUE function will take the last non-null value for the window specification, which includes the rows current value; while the default value will replace the current value with a constant, if it is null.

Building on the prior example, you can now see those empty windows filled with valid values. This example also orders by the series key, supervisor district, to make the LAST_VALUE functionality more apparent.

...
SELECT
series_key as supervisor_district,
tumble_val as tumble,
LAST_VALUE(count IGNORE NULLS)
OVER (PARTITION BY series_key
ORDER BY tumble_val ASC
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS last_value,
coalesce(count, 0) AS def,

count as unfilled
FROM UNNEST(
(SELECT
timeseries.gen_ts_candidates(key, 900, min_ts, max_ts)
FROM args)
) a
LEFT OUTER JOIN requests b
ON a.series_key = b.supervisor_district
AND a.tumble_val = b.tumble
ORDER BY supervisor_district, tumble_val DESC

Linear interpolation between valid values

Linear interpolation is a simple concept whose math will be immediately recognizable. The function finds the slope between preceding and following values and you use a straight line equation (y=mx + b) to determine the value of the current position. The trick is in providing the relative coordinates of the values being used. A BigQuery struct can be used as a complex type containing the (x,y) coordinates; the number of timestamp seconds can be used for relative position on the x-axis.

Create the linear_interpolation function and a convenience function, interpolate_ts_val_or_fill() to translate timestamps and handle error cases.

CREATE OR REPLACE function
timeseries.linear_interpolate(pos INT64,
prev STRUCT<x INT64,y FLOAT64>,
next STRUCT<x INT64,y FLOAT64>)
AS (
CASE
WHEN prev IS NULL OR next IS NULL THEN null
ELSE
--y = m * x + b
(next.y - prev.y)/(next.x - prev.x) * (pos - prev.x) + prev.y
END
);
CREATE OR REPLACE FUNCTION
timeseries.interpolate_ts_val_or_fill(value FLOAT64,
pos TIMESTAMP,
prev STRUCT<x TIMESTAMP, y FLOAT64>,
next STRUCT<x TIMESTAMP, y FLOAT64>,
def FLOAT64)
AS (
CASE
WHEN value IS NOT NULL THEN value
WHEN prev IS NULL OR next IS NULL THEN def
ELSE
timeseries.linear_interpolate(
unix_seconds(pos),
STRUCT(unix_seconds(prev.x) AS x, prev.y AS y),
STRUCT(unix_seconds(next.x) AS x, next.y AS y)
)
END
);

Using structs for the coordinates is important because they allow multiple values to be packaged in a way that can be picked up by the LAST_VALUE and FIRST_VALUE window functions.

Rewriting the full SQL used to interpolate times series gaps; the notable changes are marked in bold.

WITH requests as (
SELECT *,
IF(count is null, null,
STRUCT(tumble as x, count as y)
) AS coord
FROM (

SELECT
cast(supervisor_district as STRING) supervisor_district,
timeseries.tumble_interval(created_date, 900) as tumble,
count(*) as count
FROM `bigquery-public-data.san_francisco.311_service_requests` a
WHERE supervisor_district in (6, 11)
AND CAST(created_date as DATE) = '2018-01-18'
AND created_date < '2018-01-18 08:30:00'
GROUP BY supervisor_district, tumble
)
),
args AS (
SELECT
array_agg(DISTINCT supervisor_district) as key,
min(tumble) as min_ts,
max(tumble) as max_ts
FROM requests
)
SELECT
series_key as supervisor_district,
tumble_val as tumble,
timeseries.interpolate_ts_val_or_fill(
count,
tumble_val,
LAST_VALUE(coord IGNORE NULLS)
OVER (PARTITION BY series_key
ORDER BY unix_seconds(tumble_val) ASC
RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
FIRST_VALUE(coord IGNORE NULLS)
OVER (PARTITION BY series_key
ORDER BY unix_seconds(tumble_val) ASC
RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING),
0
) as intrp,

count as unfilled
FROM UNNEST(
(SELECT
timeseries.gen_ts_candidates(key, 900, min_ts, max_ts)
FROM args)
) a
LEFT OUTER JOIN requests b
ON a.series_key = b.supervisor_district
AND a.tumble_val = b.tumble
ORDER BY supervisor_district, tumble_val DESC

Comparing two series

At first glance, even with functions, this is too much SQL to line up and compare two series. A view can make the querying more manageable.

CREATE OR REPLACE VIEW timeseries.311_req_series
AS (
WITH requests AS (
SELECT *,
IF(count, null, null,
STRUCT(tumble as x, count as y)
) AS coord
...
);

Now a simple join can be used to compare between two series.

SELECT
ABS(a.intrp - b.intrp) AS difference,
*
FROM timeseries.311_req_series a
LEFT OUTER JOIN (
SELECT * FROM timeseries.311_req_series
WHERE supervisor_district = '6'
) b
ON a.tumble = b.tumble
WHERE a.supervisor_district = '11';

Additional resources

In a future post I plan on exploring sliding windows and alternative interpolation functions.

Google Cloud - Community

Google Cloud community articles and blogs

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store