Automate a simple Time Series prediction model in 10 minutes with BigQuery

Jérémy Payan
Published in Cityscoot XP blog · 11 min read · Oct 17, 2023

Introduction

It is common to make demand forecasts in order to optimize operations (resources, inventory management, etc.) or to adapt marketing, promotion, and pricing strategies.

In this article, we are going to predict the daily number of rides of the Chicago taxi network for the next month. The objective is to show that we can experiment with, create, and automate a simple time series model in a few minutes using SQL, with no external infrastructure needed, thanks to BigQuery's built-in capabilities.

This is a quick flyover / experiment: we are not going to dive deep into model tuning, automation, observability, or visualization considerations.

To illustrate this work, we are going to train a model on data up to roughly two months in the past (until the end of July 2023) and evaluate it on the following month (August 2023), as illustrated in this figure:

Error measured between forecasts and real data over one month

We are going to: 1. build a simple time series prediction model with BigQuery, 2. automate the work, and 3. expose it as a dashboard.

source : https://cloud.google.com/blog/products/data-analytics/get-started-with-data-analytics-demand-forecasting-with-ml-models?hl=en

1. Time Series Prediction

Training and evaluation data

Chicago taxi rides data is available in a BigQuery public dataset; the table we are going to use is here. The schema description is:

Source dataset's schema

First, we need to aggregate rides by day to build our time series (we also compute the daily revenues):

SELECT
  TIMESTAMP_TRUNC(trip_end_timestamp, DAY) AS day,
  ROUND(SUM(fare), 1) AS daily_revenues,
  COUNT(DISTINCT unique_key) AS total_rides
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY day
ORDER BY day DESC;

After a quick overview, we see that the fares have a lot of NULL values over specific periods (from May 2020 to October 2021, for instance).

Rides on the left and daily revenues on the right

We use rides (rather than revenues) for our predictions because, given those gaps, it is more convenient for this article.

What's more, data freshness seems to be monthly (here we want to cut the last 2 records, which are outliers).

Rides by day

We filter these records out and, in particular, place ourselves 38 days in the past, because it was convenient for evaluating the model on August and all the data was relevant at the time of writing.

As a reminder, throughout this article feel free to adapt the time periods to your use case; here it is just an illustration.

To filter data according to our example time period:

SELECT
  TIMESTAMP_TRUNC(trip_end_timestamp, DAY) AS day,
  ROUND(SUM(fare), 1) AS daily_revenues,
  COUNT(DISTINCT unique_key) AS total_rides
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE CAST(trip_end_timestamp AS DATE) < DATE_SUB(CURRENT_DATE(), INTERVAL 38 DAY)
  AND CAST(trip_end_timestamp AS DATE) > "2010-01-01"
GROUP BY day
ORDER BY day DESC;

We also remove data that is too old, to drop spurious observations (e.g., records with timestamps that underflowed to the early 1900s).

We then add a 30-day lag on top (hence INTERVAL 68 DAY) because, to evaluate the model, we choose to train it on all the data up to one month in the past.

We finally create a BigQuery dataset demand_prediction in our GCP project, and here is the script to create the table with the above training data:

CREATE OR REPLACE TABLE demand_prediction.training_data_prev_rides_by_day
PARTITION BY day
OPTIONS (require_partition_filter = FALSE)
AS (
  SELECT
    CAST(TIMESTAMP_TRUNC(trip_end_timestamp, DAY) AS DATE) AS day,
    ROUND(SUM(fare), 1) AS daily_revenues,
    COUNT(DISTINCT unique_key) AS total_rides
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE CAST(trip_end_timestamp AS DATE) < DATE_SUB(CURRENT_DATE(), INTERVAL 68 DAY)
  GROUP BY day
);

As you can see, we partition the table on day to limit the volume of data to scan.

And voilà, we have a training dataset ready for time series modeling.

Training dataset until end of July
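Optionally, we can run a quick sanity check on the training table before modeling (a minimal sketch; the column names come from the table we just created):

SELECT
  MIN(day) AS first_day,
  MAX(day) AS last_day,
  COUNT(*) AS nb_days,
  -- if nb_days is lower than this, the series has gaps (ARIMA_PLUS tolerates some missing points)
  DATE_DIFF(MAX(day), MIN(day), DAY) + 1 AS expected_nb_days
FROM demand_prediction.training_data_prev_rides_by_day;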

For model evaluation, we create a similar table with the next month's data:

CREATE OR REPLACE TABLE demand_prediction.evaluation_data_prev_rides_by_day
PARTITION BY day
OPTIONS (require_partition_filter = FALSE)
AS (
  SELECT
    CAST(TIMESTAMP_TRUNC(trip_end_timestamp, DAY) AS DATE) AS day,
    ROUND(SUM(fare), 1) AS daily_revenues,
    COUNT(DISTINCT unique_key) AS total_rides
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE CAST(trip_end_timestamp AS DATE) >= DATE_SUB(CURRENT_DATE(), INTERVAL 68 DAY)
    AND CAST(trip_end_timestamp AS DATE) < DATE_SUB(CURRENT_DATE(), INTERVAL 38 DAY)
  GROUP BY day
);

Training a time series model using BigQuery ML

BigQuery ML overview

BigQuery ML brings machine learning capabilities directly into your data warehouse. It is convenient for saving time because:

  1. you train and access models in BigQuery using SQL, a language familiar to data analysts;

  2. you can train a model on the fly, without moving data or building an external service:

  • complexity is reduced because fewer tools are required;
  • speed to production is increased because moving and formatting large amounts of data isn't required.

Several model types are supported by BigQuery ML. You can find the list here.

Time series with ARIMA

To perform time series predictions, BigQuery ML supports the ARIMA model. It automatically handles anomalies, seasonality, and holidays.

More specifically, we are going to use ARIMA_PLUS, which includes support for the DECOMPOSE_TIME_SERIES training option and the table-valued functions ML.ARIMA_EVALUATE and ML.EXPLAIN_FORECAST.

To train our model on the training data, we use this query:

CREATE OR REPLACE MODEL demand_prediction.arima_model
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'ts',
  TIME_SERIES_DATA_COL = 'total_rides',
  HOLIDAY_REGION = 'US',
  HORIZON = 90
) AS (
  SELECT
    CAST(day AS TIMESTAMP) AS ts,
    total_rides
  FROM demand_prediction.training_data_prev_rides_by_day
);

Here we train the model to make predictions for the next 90 days (HORIZON=90). You can adapt this to your use case.

We can also specify:

  • the data frequency / seasonality: DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } (see the sketch after this list);
  • specific holidays when needed. You can combine a custom holiday statement with the HOLIDAY_REGION option to handle several different custom holiday scenarios, as described in the BigQuery documentation.
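For illustration, here is a sketch of the same training query with these options made explicit (the model name arima_model_with_options is just a variant invented for this example; DAILY matches our series, and we keep the US holiday calendar since the data comes from Chicago):

CREATE OR REPLACE MODEL demand_prediction.arima_model_with_options
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'ts',
  TIME_SERIES_DATA_COL = 'total_rides',
  DATA_FREQUENCY = 'DAILY',  -- skip auto-detection: we know the series is daily
  HOLIDAY_REGION = 'US',     -- model US holiday effects
  HORIZON = 90
) AS (
  SELECT CAST(day AS TIMESTAMP) AS ts, total_rides
  FROM demand_prediction.training_data_prev_rides_by_day
);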

To see the holidays included when choosing the 'FR' region:

SELECT *
FROM `bigquery-public-data.ml_datasets.holidays_and_events_for_forecasting`
WHERE region = 'FR';

Included holidays for France (`FR` region)

The objective of this article is not to optimize the model, so we are not going to dive deep into how it works and what the best set of parameters is. However, in a few words, ARIMA stands for AutoRegressive Integrated Moving Average, and here are the main things to keep in mind:

  1. AutoRegressive (AR): Think of this as the memory of the past. ARIMA looks at how the current value in your data relates to previous values.
  2. Integrated (I): This part deals with making the data stationary. Stationary means that the statistical properties of the data don’t change over time.
  3. Moving Average (MA): Now, think of this as the short-term memory. It looks at how random fluctuations in the past affect the current value.

ARIMA combines these three components to make predictions. It remembers the past (AR), makes the data suitable for analysis by filtering out noise (I), and considers short-term fluctuations (MA) to make an educated guess about what happens next.

By adjusting the parameters of ARIMA, you can create a model that is good at predicting different types of time series data.
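For reference, an ARIMA(p, d, q) model can be written compactly with the lag operator $L$ (this is the standard textbook form, not BigQuery-specific notation; BigQuery ML estimates the coefficients for you):

$$\Big(1 - \sum_{i=1}^{p} \phi_i L^i\Big)\,(1 - L)^d\, y_t = \Big(1 + \sum_{j=1}^{q} \theta_j L^j\Big)\,\varepsilon_t$$

where the $\phi_i$ are the AR coefficients (memory of the past), $d$ is the number of differencing steps that make the series stationary, the $\theta_j$ are the MA coefficients (short-term memory), and $\varepsilon_t$ is white noise.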

Evaluating Time series metrics

Using ML.ARIMA_EVALUATE, we can easily inspect the ARIMA model:

SELECT *
FROM ML.ARIMA_EVALUATE(MODEL `demand_prediction.arima_model`,
       STRUCT(FALSE AS show_all_candidate_models));

It returns the evaluation metrics attached to our model (weekly and yearly seasonal periods, ARIMA parameters, etc.).

Set of parameters of our model

  • The first four columns (non_seasonal_{p,d,q} and has_drift) define the ARIMA model: p drives the AutoRegressive part, d the Integrated part, and q the Moving Average part. You can see more here if you are interested in the maths behind it. The has_drift parameter indicates whether a drift component is included, which accounts for a constant linear trend over time in your time series data.
  • The next three columns (log_likelihood, AIC, and variance) are used to automatically find the best ARIMA model. As a rule of thumb, the lower the AIC score, the better.
  • The seasonal_periods detected, which can be DAILY, WEEKLY, MONTHLY, QUARTERLY, or YEARLY.

More information here.
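To go further, you can list every candidate model explored by the auto-ARIMA search (not just the winner) by flipping the show_all_candidate_models flag and ranking by AIC:

SELECT
  non_seasonal_p, non_seasonal_d, non_seasonal_q, has_drift, AIC
FROM ML.ARIMA_EVALUATE(MODEL `demand_prediction.arima_model`,
       STRUCT(TRUE AS show_all_candidate_models))
ORDER BY AIC ASC;  -- lower AIC first: the top row is the model BigQuery kept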

You can also see the candidate models by navigating to the Evaluation tab of the model.

Evaluation tab of our ARIMA model

By the way, for fine-tuning, you can inspect the fitted ARIMA coefficients (and, if needed, fix the model order yourself via the AUTO_ARIMA and NON_SEASONAL_ORDER options). Seeing them is really straightforward:

SELECT *
FROM ML.ARIMA_COEFFICIENTS(MODEL `demand_prediction.arima_model`);

Measuring model performance

ML.EVALUATE is also useful to assess our freshly trained model by comparing its forecasts with the evaluation data.

For our use case, we compare our estimates with the real data for August 2023 (the evaluation dataset created earlier). To do that, here is the query:

SELECT *
FROM ML.EVALUATE(
  MODEL `demand_prediction.arima_model`,
  (
    SELECT
      CAST(day AS TIMESTAMP) AS ts,
      total_rides
    FROM `demand_prediction.evaluation_data_prev_rides_by_day`
  ),
  STRUCT(TRUE AS perform_aggregation, 30 AS horizon));

We get performance metrics like MAE, MSE, and RMSE.

For our simple test, we get a mean absolute error of ~1,200 rides: on average, the daily predictions for the next month were off by ~1,200 rides from the actual values, which represents about 7% relative error.

Error between real data and forecasts on August
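If you want to recompute these errors by hand (for custom metrics, for example), a minimal sketch is to join the forecasts with the evaluation table; the numbers should closely match what ML.EVALUATE reports:

WITH forecasts AS (
  SELECT DATE(forecast_timestamp) AS day, forecast_value
  FROM ML.FORECAST(MODEL demand_prediction.arima_model,
         STRUCT(30 AS horizon, 0.9 AS confidence_level))
)
SELECT
  AVG(ABS(e.total_rides - f.forecast_value)) AS mean_absolute_error,
  100 * AVG(ABS(e.total_rides - f.forecast_value) / e.total_rides) AS mape_pct
FROM demand_prediction.evaluation_data_prev_rides_by_day AS e
JOIN forecasts AS f USING (day);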

Forecasts

Making predictions

We now have a first, reasonably robust model, and we would like to make forecasts for the next month (August 2023) with a 90% confidence interval. This means we want to be fairly confident that the actual demand will fall within a certain range, with a 10% chance it falls outside that range. To do that, we use the ML.FORECAST() function and its confidence_level parameter:

DECLARE HORIZON STRING DEFAULT "30";           # number of values to forecast
DECLARE CONFIDENCE_LEVEL STRING DEFAULT "0.9"; # attached confidence interval

EXECUTE IMMEDIATE FORMAT("""
  SELECT
    DATE(forecast_timestamp) AS date,
    CAST(forecast_value AS INT64) AS forecast_rides,
    CAST(prediction_interval_lower_bound AS INT64) AS forecast_rides_lower_bound,
    CAST(prediction_interval_upper_bound AS INT64) AS forecast_rides_upper_bound
  FROM ML.FORECAST(MODEL demand_prediction.arima_model,
         STRUCT(%s AS horizon, %s AS confidence_level))
""", HORIZON, CONFIDENCE_LEVEL);

We directly get the forecast results:

Daily forecasts with confidence interval
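As mentioned earlier, ML.EXPLAIN_FORECAST goes one step further and returns the decomposition of each forecast point (trend, seasonal, and holiday components), which helps understand what drives the predictions:

SELECT *
FROM ML.EXPLAIN_FORECAST(MODEL demand_prediction.arima_model,
       STRUCT(30 AS horizon, 0.9 AS confidence_level));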

Visualization

We can visualize this quickly, and we see that the forecasts match the real data pretty well. To do that, we first have to prepare our data for visualization:

WITH forecasts AS (
  SELECT
    DATE(forecast_timestamp) AS date,
    CAST(forecast_value AS INT64) AS forecast_rides,
    CAST(prediction_interval_lower_bound AS INT64) AS forecast_rides_lower_bound,
    CAST(prediction_interval_upper_bound AS INT64) AS forecast_rides_upper_bound
  FROM ML.FORECAST(MODEL demand_prediction.arima_model,
         STRUCT(30 AS horizon, 0.9 AS confidence_level))
),

real_rides AS (
  SELECT day, total_rides
  FROM demand_prediction.training_data_prev_rides_by_day
  UNION ALL
  SELECT day, total_rides
  FROM demand_prediction.evaluation_data_prev_rides_by_day
)

SELECT day, total_rides, forecast_rides, forecast_rides_lower_bound,
       forecast_rides_upper_bound
FROM real_rides
LEFT JOIN forecasts
  ON real_rides.day = forecasts.date
ORDER BY day DESC;

With this result, we can easily visualize it in Looker Studio, for example:

Forecasts in grey and real data in purple. The two other curves (pink and green) are respectively the lower and upper bounds of our confidence interval.

2. Automating the work

To automate this, we would like to, every month:

  • re-train our time series model and evaluate it on the last month's data;
  • make forecasts (using all the data) for the very next month.

This is an example; the objective here is to show a simple way to automate this kind of work. We are going to use scheduled queries for it.

Re-train our time series model to re-evaluate

Given that there is a fairly large volume of data to scan each time, and that it is convenient to use our day partition, I suggest updating the training and evaluation data with a scheduled query instead of using a view.

We also use a scheduled query for the model.
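For instance, the scheduled query for the model can be a single multi-statement script that refreshes the training table and then re-trains on it, so both always stay in sync (a sketch based on the statements above, simplified to the rides column; the evaluation table refresh works the same way):

-- 1. Refresh the training data (all history up to ~2 months in the past)
CREATE OR REPLACE TABLE demand_prediction.training_data_prev_rides_by_day
PARTITION BY day AS (
  SELECT
    CAST(TIMESTAMP_TRUNC(trip_end_timestamp, DAY) AS DATE) AS day,
    COUNT(DISTINCT unique_key) AS total_rides
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE CAST(trip_end_timestamp AS DATE) < DATE_SUB(CURRENT_DATE(), INTERVAL 68 DAY)
  GROUP BY day
);

-- 2. Re-train the model on the fresh data
CREATE OR REPLACE MODEL demand_prediction.arima_model
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'ts',
  TIME_SERIES_DATA_COL = 'total_rides',
  HOLIDAY_REGION = 'US',
  HORIZON = 90
) AS (
  SELECT CAST(day AS TIMESTAMP) AS ts, total_rides
  FROM demand_prediction.training_data_prev_rides_by_day
);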

To schedule these queries, we can use the Schedule button directly in the UI:

You can schedule a query directly on the UI

Then we have to input our requirements (e.g., repeats Monthly) and select “Schedule”.

For this example, we schedule it every 1st day of the month at 6 am UTC.

We can now monitor our scheduled queries on the BigQuery Scheduled Queries page.

Scheduled queries UI

This will automatically re-train our model and allow us to evaluate it on the last month's data.

Making forecasts for the next month

We need to get the full time series data and make predictions into the future. We can create a view with all the data, called full_data_prev_rides_by_day:

CREATE OR REPLACE VIEW demand_prediction.full_data_prev_rides_by_day AS (
  SELECT day, total_rides
  FROM demand_prediction.training_data_prev_rides_by_day
  UNION ALL
  SELECT day, total_rides
  FROM demand_prediction.evaluation_data_prev_rides_by_day
);

View with all daily rides (the data source of our experimentation)

We then use our trained model to make forecasts over the next two months (we could also train our model on all the data and adjust the evaluation accordingly). In this case, performance is similar, so we are going to automate the forecasts with our pre-trained model:

Scheduling forecasts with confidence interval on all data
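Concretely, the scheduled forecast query can materialize the predictions into a results table, so the dashboard reads a plain table instead of calling ML.FORECAST on the fly (a sketch; the table name forecasted_rides is our choice for this example):

CREATE OR REPLACE TABLE demand_prediction.forecasted_rides AS (
  SELECT
    DATE(forecast_timestamp) AS date,
    CAST(forecast_value AS INT64) AS forecast_rides,
    CAST(prediction_interval_lower_bound AS INT64) AS forecast_rides_lower_bound,
    CAST(prediction_interval_upper_bound AS INT64) AS forecast_rides_upper_bound
  FROM ML.FORECAST(MODEL demand_prediction.arima_model,
         STRUCT(60 AS horizon, 0.9 AS confidence_level))
);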

On a monthly basis, we are now able to regularly:

  • re-train a model and evaluate its performance on the last month;
  • make forecasts for the next month.

3. Packaging the work in one dashboard

All the technical pieces are in place, even if there is a lot we could do to improve them! The point is that we can easily use BigQuery ML to experiment with and automate a first time series model in a few minutes.

In order to make it actionable and usable, we quickly packaged this work into a simple dashboard illustrating model performance on the last month and the future forecasts.

I suggest building a view with the analytical data:

WITH forecasts AS (
  SELECT
    DATE(forecast_timestamp) AS date,
    CAST(forecast_value AS INT64) AS forecast_rides,
    CAST(prediction_interval_lower_bound AS INT64) AS forecast_rides_lower_bound,
    CAST(prediction_interval_upper_bound AS INT64) AS forecast_rides_upper_bound
  FROM ML.FORECAST(MODEL demand_prediction.arima_model,
         STRUCT(60 AS horizon, 0.8 AS confidence_level))
),

real_rides AS (
  SELECT day, total_rides
  FROM demand_prediction.training_data_prev_rides_by_day
  UNION ALL
  SELECT day, total_rides
  FROM demand_prediction.evaluation_data_prev_rides_by_day
)

SELECT forecasts.date, total_rides, forecast_rides, forecast_rides_lower_bound,
       forecast_rides_upper_bound
FROM forecasts
LEFT JOIN real_rides
  ON real_rides.day = forecasts.date
ORDER BY forecasts.date DESC;

The output is a Looker Studio dashboard with two views.

First a review of model performance on the last month.

First view with an evaluation of rides forecast model on the last month

And secondly, the forecasts for the next month.

Second view with daily forecasts for the next month
