Forecasting time series data
This blog post takes you through the steps of building a time series model. It follows the instructions documented in a notebook called Use statsmodels to forecast time series data, which is available on the IBM Watson Studio Gallery. The notebook uses Python 3.6 and the statsmodels package. The data used is Consumer Prices, originally sourced from the International Labour Organization. It measures the Consumer Price Index of different countries over a period of time. This example focuses specifically on the Consumer Price Index in the United States from 1969–2008.
The Consumer Price Index (CPI) is defined as a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Forecasting this value is useful because it’s a valuable economic indicator used to predict the rate of inflation. It also affects decision making pertaining to income payments.
In this example, you will learn how to visualize and modify time series data with the necessary packages. You will analyze the data to draw conclusions about its stationarity and about the optimal parameters for its model. Once you’ve selected a model with tuned parameters, you will be able to forecast the time series.
Learning goals
- Create a time series object
- Explore data: check the stationarity of the time series using seasonal decomposition and the Dickey-Fuller test
- Prepare data: stationarize the series
- Optimize the ARIMA parameters and create the model: use ACF and PACF plots to identify candidate parameters, then perform a grid search for the ARIMA model
- Train the model
- Test the model using forecasting
Create a time series object
Time series data is easiest to work with once you convert it into a time series object. After you import your data as a dataframe, set the Year column as the dataframe’s index, since each date is unique.
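As a minimal sketch of that conversion, assuming pandas is available: the Year column and the us_consumer_prices name come from this post, while the Value column name and the numbers are made-up placeholders rather than real CPI figures.

```python
import pandas as pd

# Made-up placeholder rows (not real CPI values); the notebook loads the UN data set instead.
raw = pd.DataFrame({
    "Year": [1969, 1970, 1971, 1972],
    "Value": [10.0, 11.0, 12.5, 13.0],  # "Value" is an assumed column name
})

# Turn the Year column into datetimes and use it as the index.
raw["Year"] = pd.to_datetime(raw["Year"].astype(str), format="%Y")
us_consumer_prices = raw.set_index("Year")["Value"].sort_index()

print(us_consumer_prices.index)
```

With a datetime index in place, you can slice by year label, for example us_consumer_prices.loc[:"1970"].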
Stationarity
Before you start creating your model, you need to check whether the time series data is stationary. This means that the mean, variance, and autocovariance of the data are constant and do not depend on time. There are a couple of ways to do this:
- Seasonal decomposition: You can decompose the data to observe its trend and seasonality. Here, you’ll notice that the time series data has an observable trend, which means it is dependent on time. It is therefore not stationary.
- Dickey-Fuller test: The null hypothesis of this test is that the time series is non-stationary (that is, it is dependent on time). When you perform this test, you’ll see that the null hypothesis cannot be rejected because the Test Statistic is larger than the Critical Values. Therefore, you have to keep the assumption that the series is not stationary.
Prepare the data
Before you build the ARIMA model to forecast this time series data, you must prepare your data. To do that, you’ll need to learn a little about the model you are building. ARIMA (AutoRegressive Integrated Moving Average) has three important parameters that must be optimized:
- p: This parameter is associated with the Auto-Regressive component of the ARIMA model. One way to estimate this value is to use the PACF (Partial Autocorrelation Function).
- d: This parameter represents the Integration component of the model and is determined by differencing the time series data. The number of times the data is differenced in order to become stationary determines the value of this parameter.
- q: This is the parameter that is related to the Moving Average part of the ARIMA model. This can be estimated using the ACF (Autocorrelation Function).
Stationarize the time series data
First, you’ll need to compute the order of difference, d, by determining the number of differences required to make the time series data stationary. Differencing calculates the difference between consecutive observations, which removes trend. You can perform the Dickey-Fuller test after each round of differencing to determine when the differenced series is stationary, i.e., when the Test Statistic is significantly smaller than the Critical Values.
Model selection
You’ll notice that the time series data is stationary once the data has been differenced twice, so the order of difference d is 2.
To find estimates of the remaining parameters, q and p, you can plot the Autocorrelation and Partial Autocorrelation Functions, respectively.
In the Autocorrelation Function, the most significant lag outside the confidence interval is lag 2. Therefore, an estimate for q would be 2.
Similarly, using the Partial Autocorrelation Function, you can estimate p to be 2.
To obtain the optimized parameters, you can perform a grid search using the auto_arima function from the pmdarima package. Before building the model, it is best practice to split the data set into training and testing sets. In this example, the training set contains all values from 1969–2000.
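# Perform a grid search for the ARIMA model on the training data (1969 to 2000).
from pmdarima import auto_arima

stepwise_model = auto_arima(
    us_consumer_prices[:'2000'],
    start_p=0, start_q=0,
    seasonal=False, d=2,
    trace=True, error_action='ignore',
    suppress_warnings=True, stepwise=True,
)
# A lower AIC indicates a better trade-off between fit and complexity.
print(stepwise_model.aic())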
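Then, you fit the model using the optimized parameters, that is, the combination with the lowest AIC value (the Akaike Information Criterion, which scores a model by balancing goodness of fit against complexity; lower is better).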
The auto_arima function calculated that the optimal values for p and q are 0 and 1, respectively, with an AIC value of approximately 71.6. Once you’ve trained and fit the model, it’s time to forecast the data.
Forecasting
Forecast the years covered by your test set (2001–2008). You can then compare the forecasted values with the original Consumer Price Index values from the data set and observe whether the original values fall within the forecast’s 95% confidence interval.
Note that the Consumer Price Index values predicted using the ARIMA model built in this example are very close to the original values from the data set.
Create your own time series forecast!
Follow the step-by-step instructions and use the Python code snippets in the time series notebook on the IBM Watson Studio Gallery. Here’s how you can get started with your notebook:
- If using Watson Studio Cloud, create an account if you don’t have one already. If you want to use Watson Studio Desktop, here is the link to download it.
- Create a new project in Watson Studio Cloud, Watson Studio Desktop, or Watson Studio Local.
- Navigate to the time series notebook and click the Add to project button (located at the upper-right corner of the page) to copy the notebook into your new project.
You can now successfully build a time series model and use it to forecast future data. Give it a try!
Data Sources
UNData: Consumer prices, general indices (2000=100). (2010). Retrieved from http://data.un.org/.
Consumer Price Index: U.S. Bureau Of Labor Statistics. Retrieved from https://www.bls.gov/cpi/.