Time series analytics with BigQuery

Techniques for tumbles, fills, and linear interpolation

Patrick Dunn
Google Cloud - Community
7 min read · Nov 13, 2020


(Update 12/01/2020: The user defined functions in this article have been added to the BigQuery community UDFs and slightly reworked. As an example, the tumble function can be invoked with bqutil.fn.ts_tumble())

As BigQuery is increasingly used as a store for real-time analytic data such as web events and IoT telemetry, many users are asking for time series functions similar to those found in special-purpose products like OpenTSDB. While BigQuery does not currently have dedicated time series functions, a few simple functions and techniques let you perform time series analytics directly within BigQuery.

For the purposes of demonstration, the samples in this guide use the public San Francisco 311 dataset, performing time series analysis using the created_date and supervisor_district for the timestamp and series key, respectively. These examples can be applied to common time series problems like log analytics or IoT monitoring.

This guide requires you to create new UDFs and assumes you will be creating them in a dataset named ‘timeseries’.

Creating and grouping by a tumble function

Sometimes called a time bucket or time slice, a tumble represents a non-overlapping time window that data can fall into. This differs from the TIMESTAMP_TRUNC() function because it can aggregate on a window of any number of seconds.

This function is fairly simple: it rounds down by the tumble size, allowing values to be aligned and aggregated into the same time group.
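A minimal sketch of such a function, created in the ‘timeseries’ dataset (the community ts_tumble implementation may differ in detail):

```sql
-- Rounds a timestamp down to the start of its tumble window.
-- tumble_seconds is the window size, e.g. 900 for 15 minutes.
CREATE OR REPLACE FUNCTION timeseries.tumble(val TIMESTAMP, tumble_seconds INT64) AS (
  TIMESTAMP_SECONDS(DIV(UNIX_SECONDS(val), tumble_seconds) * tumble_seconds)
);
```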

Testing the function shows the tumble column rounding down to 15-minute (900-second) intervals.
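An illustrative test query against the public 311 dataset (table and column names as published in the public dataset):

```sql
SELECT created_date,
       timeseries.tumble(created_date, 900) AS tumble
FROM `bigquery-public-data.san_francisco_311.311_service_requests`
LIMIT 5;
```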

The rounded down tumbling window can be used to aggregate data for that time segment, e.g. the count of events that occurred during that time window.
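For example, counting events per district in each 15-minute window (the date filter here is only an assumption to keep the scan small):

```sql
SELECT timeseries.tumble(created_date, 900) AS tumble,
       supervisor_district,
       COUNT(*) AS cnt
FROM `bigquery-public-data.san_francisco_311.311_service_requests`
WHERE created_date >= '2020-01-01' AND created_date < '2020-01-02'
GROUP BY tumble, supervisor_district
ORDER BY tumble, supervisor_district;
```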

Handling missing values

Looking at the query results, there is a problem: time ranges that did not have any events are missing entirely from the output. This can be particularly problematic in IoT scenarios where an important metric is calculated from two or more sensor values.

The most common techniques for handling missing values are to fill the time window with a default value such as 0, to carry forward the nearest preceding valid value, or to estimate a value using linear interpolation.

Generate timestamp candidates

In order to create synthetic rows in BigQuery, we rely on the UNNEST operator, which converts an array of values into rows, and the GENERATE_TIMESTAMP_ARRAY function, which creates the array of candidate values to fill in the blanks. The synthetic rows can then be left outer joined to the original time series aggregation to fill in the gaps.

To simplify the task, we will create a gen_ts_candidates(..) function that returns an array of struct values, each containing a timestamp and a series key.
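One possible shape for this function; the exact signature shown here is an assumption:

```sql
-- Returns one (series_key, tumble) candidate for every window in the range.
CREATE OR REPLACE FUNCTION timeseries.gen_ts_candidates(
    series_keys ARRAY<STRING>,
    min_ts TIMESTAMP,
    max_ts TIMESTAMP,
    tumble_seconds INT64)
RETURNS ARRAY<STRUCT<series_key STRING, tumble TIMESTAMP>> AS ((
  SELECT ARRAY_AGG(STRUCT(key AS series_key, ts AS tumble))
  FROM UNNEST(series_keys) AS key
  CROSS JOIN UNNEST(GENERATE_TIMESTAMP_ARRAY(
      min_ts, max_ts, INTERVAL tumble_seconds SECOND)) AS ts
));
```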

Filling the record gaps with a join

The SQL below demonstrates a technique for filling in the missing rows.

  1. The named subquery ‘requests’ represents the original aggregated time window query.
  2. The named subquery ‘args’ builds the arguments for the gen_ts_candidates function. Alternatively, you can pass your arguments directly in the query.
  3. The generated candidate time series are then left outer joined to the original table to fill the missing rows.
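Put together, the technique looks roughly like this (the date filter and the CAST of supervisor_district are assumptions to keep the example self-contained):

```sql
WITH requests AS (
  SELECT timeseries.tumble(created_date, 900) AS tumble,
         CAST(supervisor_district AS STRING) AS series_key,
         COUNT(*) AS cnt
  FROM `bigquery-public-data.san_francisco_311.311_service_requests`
  WHERE created_date >= '2020-01-01' AND created_date < '2020-01-02'
  GROUP BY tumble, series_key
),
args AS (
  SELECT ARRAY_AGG(DISTINCT series_key) AS series_keys,
         MIN(tumble) AS min_ts,
         MAX(tumble) AS max_ts
  FROM requests
)
SELECT c.series_key, c.tumble, r.cnt
FROM args,
     UNNEST(timeseries.gen_ts_candidates(
         args.series_keys, args.min_ts, args.max_ts, 900)) AS c
LEFT OUTER JOIN requests r
  ON r.series_key = c.series_key AND r.tumble = c.tumble
ORDER BY c.series_key, c.tumble;
```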

Filling with last value or a default

Once there is a record for each time window, the LAST_VALUE window function or a default value can be used to fill in the missing value for those new records. The LAST_VALUE function takes the last non-null value within the window specification, which includes the current row's value, while the default value replaces the current value with a constant if it is null.

Building on the prior example, you can now see those empty windows filled with valid values. This example also orders by the series key, supervisor district, to make the LAST_VALUE functionality more apparent.
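A sketch of both variants, assuming the gap-filled rows from the previous join are available in a named subquery called filled with columns (series_key, tumble, cnt):

```sql
SELECT series_key, tumble,
       -- carry forward the last non-null count; the default frame
       -- (UNBOUNDED PRECEDING to CURRENT ROW) includes the current row
       LAST_VALUE(cnt IGNORE NULLS) OVER (
         PARTITION BY series_key ORDER BY tumble) AS cnt_last,
       -- or replace a null with a constant default
       IFNULL(cnt, 0) AS cnt_default
FROM filled
ORDER BY series_key, tumble;
```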

Linear interpolation between valid values

Linear interpolation is a simple concept whose math will be immediately recognizable. The function finds the slope between the preceding and following values and uses the straight-line equation (y = mx + b) to determine the value at the current position. The trick is in providing the relative coordinates of the values being used. A BigQuery struct can be used as a complex type containing the (x, y) coordinates; the number of timestamp seconds can be used for the relative position on the x-axis.

Create the linear_interpolation function and a convenience function, interpolate_ts_val_or_fill(), to translate timestamps and handle error cases.
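A sketch of the two functions; the signatures and null handling here are assumptions:

```sql
-- y = y1 + (x - x1) * (y2 - y1) / (x2 - x1), with (x, y) packaged as structs
CREATE OR REPLACE FUNCTION timeseries.linear_interpolation(
    pos INT64,
    prev STRUCT<x INT64, y FLOAT64>,
    next STRUCT<x INT64, y FLOAT64>) AS (
  prev.y + (pos - prev.x) * ((next.y - prev.y) / (next.x - prev.x))
);

-- Translates timestamps to epoch seconds and falls back to a neighboring
-- value when one side of the interpolation is missing.
CREATE OR REPLACE FUNCTION timeseries.interpolate_ts_val_or_fill(
    ts TIMESTAMP,
    val FLOAT64,
    prev STRUCT<x TIMESTAMP, y FLOAT64>,
    next STRUCT<x TIMESTAMP, y FLOAT64>) AS (
  CASE
    WHEN val IS NOT NULL THEN val
    WHEN prev IS NULL OR next IS NULL THEN COALESCE(prev.y, next.y)
    ELSE timeseries.linear_interpolation(
        UNIX_SECONDS(ts),
        STRUCT(UNIX_SECONDS(prev.x) AS x, prev.y AS y),
        STRUCT(UNIX_SECONDS(next.x) AS x, next.y AS y))
  END
);
```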

Using structs for the coordinates is important because they allow multiple values to be packaged in a way that can be picked up by the LAST_VALUE and FIRST_VALUE window functions.

Rewriting the full SQL used to interpolate time series gaps, the notable changes are marked in bold.
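A sketch of the interpolation step, again assuming the gap-filled rows are available in a named subquery filled with columns (series_key, tumble, cnt):

```sql
SELECT series_key, tumble,
       timeseries.interpolate_ts_val_or_fill(
         tumble,
         CAST(cnt AS FLOAT64),
         -- last known (x, y) coordinate strictly before this row
         LAST_VALUE(
           IF(cnt IS NULL, NULL, STRUCT(tumble AS x, CAST(cnt AS FLOAT64) AS y))
           IGNORE NULLS) OVER (
             PARTITION BY series_key ORDER BY tumble
             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
         -- next known (x, y) coordinate strictly after this row
         FIRST_VALUE(
           IF(cnt IS NULL, NULL, STRUCT(tumble AS x, CAST(cnt AS FLOAT64) AS y))
           IGNORE NULLS) OVER (
             PARTITION BY series_key ORDER BY tumble
             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
       ) AS cnt_interpolated
FROM filled
ORDER BY series_key, tumble;
```

Packaging the coordinate as a struct lets a single LAST_VALUE or FIRST_VALUE call carry both the timestamp and the metric through the window frame.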

Linear interpolation clearly seen between rows 5 and 9

Comparing two series

At first glance, even with the functions in place, this is too much SQL to line up and compare two series. A view can make the querying more manageable.

Now a simple join can be used to compare between two series.
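For example, assuming the interpolation query has been wrapped in a view named timeseries.sf311_interpolated with columns (series_key, tumble, cnt_interpolated), two districts can be compared with a self-join:

```sql
SELECT a.tumble,
       a.cnt_interpolated - b.cnt_interpolated AS diff
FROM timeseries.sf311_interpolated a
JOIN timeseries.sf311_interpolated b
  USING (tumble)
WHERE a.series_key = '6'
  AND b.series_key = '11'
ORDER BY a.tumble;
```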

Difference between the interpolated values of series 6 and 11

Additional resources

In a future post I plan on exploring sliding windows and alternative interpolation functions.
