BigQuery: Demystifying ARIMA_PLUS vs ARIMA_PLUS_XREG for Time Series Forecasting
Recently I was asked to do a time series forecasting POC for one of my customers in the Retail Industry that was exploring Google Cloud tools, and as many of you have probably seen, BigQuery provides powerful tools for time series analysis, and two of its forecasting functions can sometimes cause confusion: ARIMA_PLUS and ARIMA_PLUS_XREG. Both utilize ARIMA (Autoregressive Integrated Moving Average) models, but with a key difference in handling explanatory variables. It is critical to understand when to use one vs the other and before I show you how to know which one is better for your use case, let’s delve into what makes these 2 models distinct and how to choose the right one for your forecasting needs.
ARIMA_PLUS: A Self-Contained Forecasting Machine for Univariate Use Cases
ARIMA_PLUS is a versatile function that builds forecasts based solely on the historical data you provide. It follows a three-step process:
- Model Identification: It analyzes the time series to identify patterns and seasonality.
- Model Fitting: Based on the identified patterns, it estimates the parameters of an ARIMA model (p,d,q values).
- Forecasting: It leverages the fitted model to predict future values.
ARIMA_PLUS is a solid choice when you want to forecast a time series without the influence of external factors. It’s particularly useful for:
- Short-term forecasting: When you need to predict values for the immediate future based solely on past trends.
- Data with minimal external influences: When the time series you’re analyzing isn’t significantly affected by external variables.
ARIMA_PLUS_XREG: When External Variables Come into Play for Multivariate Use Cases
ARIMA_PLUS_XREG introduces the power of explanatory variables into the forecasting mix. It incorporates these additional data points alongside the historical time series to build a more comprehensive model.
Here’s how it works:
- Model Identification: Similar to ARIMA_PLUS, it analyzes the time series and explanatory variables.
- Model Fitting: It estimates the parameters of an ARIMA model that considers both the time series and the explanatory variables.
- Forecasting: It uses the fitted model to generate forecasts that take into account the impact of the explanatory variables.
ARIMA_PLUS_XREG shines when you have relevant external data that can influence your forecasts.
Keeping it simple, if you have variables you suspect have a correlation with your target metric, start with ARIMA_PLUS_XREG and then compare it with ARIMA_PLUS to see which one performs better.
One way to find if there’s correlation between the variables in your dataset is by using pearson correlation (like I did) in BQ to understand their behavior across your categorical columns. Pearson correlation is ideal for continuous data with a suspected linear relationship. In my use case, one of the target metrics was total impressions a web site gets at a given date and the results looked as follows.
At a timeseries_identifier (concatenation of all the categorical columns) level:
At a timeseries level (day):
Since we now know impressions, clicks and web visits show from moderate to high correlation in the majority of our observations (this kind of analysis is also helpful to spot any outliers in our data), we can proceed by including them in our XREG model.
CREATE OR REPLACE MODEL
`mydataset.Model_Impressions_XREG` OPTIONS (model_type = 'ARIMA_PLUS_XREG',
time_series_timestamp_col = 'date',
time_series_data_col = 'total_impressions',
auto_arima = TRUE,
clean_spikes_and_dips = FALSE,
adjust_step_changes = FALSE,
data_frequency = 'AUTO_FREQUENCY') AS
SELECT
date AS date,
SUM(impr_to_forecast) AS total_impressions,
SUM(clicks_to_forecast) AS total_clicks,
SUM(webvisits_to_forecast) AS total_webvisits
FROM
`mydataset.mydata`
WHERE
date BETWEEN DATE('2023-06-07')
AND DATE('2024-02-29')
GROUP BY
date;
One super important thing to keep in mind after you train the model, when using ARIMA_PLUS_XREG model for getting evaluation metrics and for getting predictions, make sure you provide the external regressors as well, which as you might’ve already guessed by now, those are going to be unknown for future data and therefore, you’ll have to use ARIMA_PLUS to predict the external variables first and use those as input for the ARIMA_PLUS_XREG model.
Note how I’m including 2 other ARIMA_PLUS models (trained with the same data and time periods as the ARIMA_PLUS_XREG model) to get the external regressors and include those outputs as inputs for our just trained model:
SELECT
*
FROM
ML.evaluate(MODEL `mydataset.Model_Impressions_XREG`,
(
SELECT
date,
SUM(total_impressions) AS total_impressions,
SUM(total_clicks) AS total_clicks,
SUM(total_webvisits) AS total_webvisits
FROM (
SELECT
date,
SUM(impr_to_forecast) AS total_impressions,
NULL AS total_clicks,
NULL AS total_webvisits
FROM
`mydataset.mydata`
WHERE
date > DATE('2024-02-29')
AND date <= DATE(DATE_ADD('2024-02-29', INTERVAL 20 day))
GROUP BY
date
UNION ALL
SELECT
date,
NULL AS total_impressions,
SUM(total_clicks) AS total_clicks,
SUM(total_webvisits) AS total_webvisits
FROM (
SELECT
EXTRACT(DATE
FROM
forecast_timestamp) AS date,
CAST(forecast_value AS int64) AS total_clicks,
NULL AS total_webvisits
FROM
ML.FORECAST(MODEL `mydataset.Model_Clicks`,
STRUCT(20 AS horizon,
0.8 AS confidence_level))
UNION ALL
SELECT
EXTRACT(DATE
FROM
forecast_timestamp) AS date,
NULL AS total_clicks,
CAST(forecast_value AS int64) AS total_webvisits
FROM
ML.FORECAST(MODEL `mydataset.Model_WebVisits`,
STRUCT(20 AS horizon,
0.8 AS confidence_level)))
GROUP BY
date )
GROUP BY
date),
STRUCT( TRUE AS perform_aggregation,
20 AS horizon));
In my case, this is what it looks like when getting evaluation metrics (comparing model’s results with factual data from March):
I compared the previous results with an ARIMA_PLUS model for the same target metric (impressions) and this is what I got:
We can see that the mean_absolute_percentage_error is lower in the ARIMA_PLUS_XREG model and therefore we’ll get better predictions out of it.
In Conclusion
BigQuery’s ARIMA forecasting functions provide flexibility for various time series analysis needs. By understanding the strengths of ARIMA_PLUS and ARIMA_PLUS_XREG, you can select the right tool to generate accurate and insightful forecasts, empowering better decision-making for your business.
Remember: Data exploration is key! Before diving into forecasting, visualize your time series and identify potential explanatory variables. This initial step will guide you towards the most suitable BigQuery function for your specific use case.