ARIMA: Forecast Large Time Series Datasets with RAPIDS cuML

Learn how to use the ARIMA model to accelerate time series forecasting.

Louis Sugy
RAPIDS AI
5 min read · Nov 12, 2020


Tarot cards
It has never been easier to predict the future

Forecasting is essential to plan efficiently for the future, e.g., for scheduling stock or personnel. The AutoRegressive Integrated Moving Average (ARIMA) model and its derivatives are among the most widely used tools for time series forecasting (along with Exponential Smoothing methods).

You can now accelerate these workflows with the GPU-accelerated ARIMA implementation available in RAPIDS cuML. While the work is still in progress, it already supports seasonal models and delivers speedups over statsmodels, a popular solution for Python users (and the backend used by sktime).

In this post, we will introduce ARIMA, help you get started, and provide a performance comparison with other popular implementations.

What is ARIMA and What Can It Do?

A time series is a sequence of data points taken at regularly spaced intervals in time. Time series forecasting is the process of predicting future data points, given some historical data. For a comprehensive introduction to time series forecasting, we recommend the online textbook Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos. We will refer to some sections of this book in the following paragraphs.

The ARIMA model is a quantitative forecasting method: it assumes that patterns and trends in the past data will continue into the future. More precisely, it captures the autocorrelations in the data. A non-seasonal ARIMA(p,d,q) model comprises three components:

  • The Auto-Regressive (AR) part is a linear combination of p prior values of the variable.
  • The Moving Average (MA) part is a linear combination of q prior prediction errors (not to be confused with moving averages, which are a different concept in time series analysis).
  • The model is called Integrated (I) because the data is differenced d times to fulfill a stationarity requirement.
Δ^d y_t = μ + ϕ_1 Δ^d y_{t-1} + … + ϕ_p Δ^d y_{t-p} + θ_1 ϵ_{t-1} + … + θ_q ϵ_{t-q} + ϵ_t
Expression of an ARIMA(p,d,q) model. We denote y the series of observations, ϕ and θ the parameters for the AR and MA parts of the model respectively, ϵ the prediction errors, μ the intercept parameter, and Δ the differencing operator.
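To make this recursion concrete, here is a toy simulation of an ARIMA(1,1,1) process (this snippet is not from the original post; the parameter values are arbitrary):

```python
import numpy as np

# Toy ARIMA(1,1,1) simulation: build the differenced series Δy from the
# AR and MA terms, then integrate it back (d=1). Parameters are arbitrary.
rng = np.random.default_rng(42)
n, mu, phi1, theta1 = 200, 0.1, 0.6, 0.3
eps = rng.normal(size=n)          # prediction errors ϵ

dy = np.zeros(n)                  # the differenced series Δy
for t in range(1, n):
    dy[t] = mu + phi1 * dy[t - 1] + theta1 * eps[t - 1] + eps[t]

y = np.cumsum(dy)                 # undo the differencing to recover y
```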

Although the formulation above can work well on some datasets, one key factor of accurate time series forecasting is to take into account seasonal patterns. Most time series show repetitive patterns with a fixed period every s time steps (s=24 for hourly data, s=7 for daily data with a weekly pattern, etc.). The seasonal ARIMA model (sometimes called SARIMA) extends the previous definition with additional seasonal terms and the possibility of applying seasonal differencing. A seasonal ARIMA model is denoted ARIMA(p,d,q)(P,D,Q)_s, where P and Q are the numbers of periods included in the seasonal Auto-Regressive and Moving Average parts of the model, and D is the number of seasonal differences.
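As a quick illustration (not from the original post), seasonal differencing with period s=7 is simply the difference between an observation and the one taken 7 steps earlier, and it can be combined with ordinary differencing:

```python
import pandas as pd

# Placeholder daily series; in practice this would be your observations
y = pd.Series(range(30), dtype="float64")

seasonal_diff = y.diff(7)      # one seasonal difference (D=1, s=7): y_t - y_{t-7}
both = seasonal_diff.diff(1)   # followed by one ordinary difference (d=1)
```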

No model is guaranteed to perform well on any given time series, but methods like Exponential Smoothing and ARIMA are more versatile than naive methods, as illustrated in the figure below.

Comparison of forecasts computed with 4 different methods.
Comparison of the forecasts for the daily number of views on the Wikipedia page “Artificial Intelligence” with 4 different methods: seasonal last value (also known as seasonal naive), seasonal mean value, Holt-Winters (which is also supported in cuML), and a seasonal ARIMA model

Forecasting with cuML is Easy

In this section, we will quickly walk you through using RAPIDS cuML to compute forecasts. The ARIMA model is almost a drop-in replacement for statsmodels’ SARIMAX.

Let us start with statsmodels. We have a file data.csv whose first row contains the variable names and whose following rows contain the observations. We want to output forecasts in the same format, computed with an ARIMA(1,1,1)(1,1,1)_7 model. SARIMAX takes only a single series at a time, so we need to use a loop:
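Below is a minimal sketch of that loop (the original post embeds a gist; the output file name forecasts.csv and the 14-step horizon are assumptions for illustration):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("data.csv")   # one column per series
nsteps = 14                    # forecast horizon (assumed)

forecasts = {}
for name in df.columns:
    # SARIMAX handles one series at a time, hence the loop
    model = SARIMAX(df[name], order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
    res = model.fit(disp=False)
    forecasts[name] = res.forecast(steps=nsteps).values

pd.DataFrame(forecasts).to_csv("forecasts.csv", index=False)
```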

With cuML, we can simply pass the full data to the ARIMA model:
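A sketch of the equivalent cuML code, under the same assumptions about data.csv and the forecast horizon as above:

```python
import cudf
from cuml.tsa.arima import ARIMA

df = cudf.read_csv("data.csv")   # one column per series
nsteps = 14                      # forecast horizon (assumed)

# The whole batch of series is fitted in a single call, no loop needed
model = ARIMA(df, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
model.fit()

fc = model.forecast(nsteps)      # one column of forecasts per input series
cudf.DataFrame(fc, columns=df.columns).to_csv("forecasts.csv", index=False)
```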

To learn more about how to use cuML ARIMA, the available features, and to see it in action, you can read the dedicated introductory notebook. For exhaustive documentation, you can refer to the cuML API docs.

Performance Comparison with Statsmodels

There are many implementations of ARIMA, but we believe that for Python users the most relevant one to compare against is the statsmodels module, which is widely used either directly or through wrappers such as pmdarima and sktime.

For our benchmark, we chose a dataset of daily views of Wikipedia articles and media from the Kaggle competition Web Traffic Time Series Forecasting, because it is a large dataset and the series show seasonal patterns with a period s=7 (i.e., weekly patterns). We pre-processed the dataset to keep only the Wikipedia articles for all-access/all-agents (35k series) that have no missing data (26.5k series).

We chose an ARIMA(1,1,1)(1,1,1)_7 model for its versatility. For simplicity, we are using a single model for the whole batch of series (we will talk in a later blog post about our AutoARIMA model, which finds the best order to use for each series). We measured the time it takes to fit the model parameters to the dataset for batch sizes from 1 to 10,000. Processing a batch of time series with statsmodels requires a for loop to fit each series individually. It is trivial to parallelize such a loop using independent workers. We tested statsmodels with 40 workers, using Dask, to saturate the CPU and provide a more balanced comparison against cuML.
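For reference, here is a rough sketch of how such a loop can be distributed over CPU workers with Dask (the fit_one helper and the returned log-likelihood are illustrative; this is not the exact benchmark script):

```python
from dask.distributed import Client
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_one(series):
    # Fit one ARIMA(1,1,1)(1,1,1)_7 model and return its log-likelihood
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
    return model.fit(disp=False).llf

# df: a pandas DataFrame with one column per series (as in the earlier sketch)
client = Client(n_workers=40)                     # one process per worker
futures = [client.submit(fit_one, df[col]) for col in df.columns]
log_likelihoods = client.gather(futures)          # blocks until all fits finish
```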

To measure the forecast accuracy, we used the seasonal Mean Absolute Scaled Error (MASE) metric. This measure is scale-invariant, so we can visualize the results over the whole batch with a box plot.
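A minimal sketch of how the seasonal MASE can be computed for one series (this helper is illustrative, not the exact benchmark code): the mean absolute forecast error is scaled by the in-sample error of the seasonal naive method.

```python
import numpy as np

def seasonal_mase(y_train, y_true, y_pred, s=7):
    # Scale: mean absolute error of the seasonal naive forecast on the training data
    scale = np.mean(np.abs(y_train[s:] - y_train[:-s]))
    # Scaled mean absolute error of the forecasts
    return np.mean(np.abs(y_true - y_pred)) / scale
```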

Comparison of the fitting time and accuracy of statsmodels SARIMAX and cuML ARIMA.
Results of our benchmarks. For the CPU measurements, we used a server with a dual 20-core Intel(R) Xeon(R) E5-2698 v4 at 2.2 GHz. For the GPU measurements, we used an NVIDIA V100 GPU.

Results and Key Takeaways

cuML ARIMA is meant to work with many series simultaneously. It is fast for large batches of input data: processing 1,000 series takes only 12 times longer than processing a single series. This is because our implementation takes advantage of the parallel structure of GPUs to process many time series concurrently. The drawback is that processing a single time series is currently slower than with statsmodels. For a large batch size (10,000 series), cuML is 5 times faster on this dataset than the go-to Python solution running on 40 CPU cores.

Our implementation is still in development, and the focus so far has been on adding new features. Other implementations, such as R’s forecast package, can be faster when run in parallel on many cores. Performance is expected to keep improving, but this first release is already more than fast enough for general use.

Known Limitations and Future Work

We plan to improve cuML’s ARIMA package in multiple ways. The first is performance: a few upcoming optimizations are listed in a GitHub issue. We also plan to add support for more functionality, such as exogenous regressors and robustness to missing observations.

If you are interested in using ARIMA, please leave a comment to let us know. It will help us to prioritize the development of this model. If you have any feedback regarding the features, the API, or the performance, please reach out to us on GitHub or Slack. You can also ask questions on Stack Overflow.


Louis Sugy
RAPIDS AI

Louis is an AI Developer Technology Engineer at NVIDIA. His work is focused on accelerating algorithms on GPUs using CUDA.