Time Series Forecast : A basic introduction using Python.
Time series data is an important source for information and strategy used in various businesses. From a conventional finance industry to education industry, they play a major role in understanding a lot of details on specific factors with respect to time. I recently learnt the importance of Time series data in the telecommunication industry and wanted to brush up on my time series analysis and forecasting information. So I decided to work through a simple example using python and I have explained all the details in this blog.
Time series forecasting is basically the machine learning modeling for Time Series data (years, days, hours…etc.)for predicting future values using Time Series modeling .This helps if your data in serially correlated.
Loading and Handling Time Series in Pandas.
I will be using python in jupyter notebook. Pandas in python has libraries that are specific to handling time series object .You can check out this documentation for more details .
Let’s start with the Preliminaries
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
To load the data- I have provided the link to my GitHub where the dataset and the code is available. I will be using the AirPassenger dataset from. Are you ready?
data = pd.read_csv('AirPassengers.csv')
print '\n Data Types:'
Looking at the output .
The data contains a particular month and number of passengers travelling in that month .The data type here is object (month) Let’s convert it into a Time series object and use the Month column as our index.
from datetime import datetime
#check datatype of index
You can see that now the data type is ‘datetime64[ns]’.Now let’s just make it into a series rather than a data frame ( this would make it easier for the blog explanation )
#convert to time series:
ts = data['#Passengers']
Let’s explore the various properties date-time based index:
This is a very important concept in Time Series Analysis. In order to apply a time series model, it is important for the Time series to be stationary; in other words all its statistical properties (mean,variance) remain constant over time. This is done basically because if you take a certain behavior over time, it is important that this behavior is same in the future in order for us to forecast the series. There are a lot of statistical theories to explore stationary series than non-stationary series. (Thus we can bring the fight to our home ground!)
In practice we can assume the series to be stationary if it has constant statistical properties over time and these properties can be:
• constant mean
• constant variance
• an auto co-variance that does not depend on time.
These details can be easily retrieved using stat commands in python.
The best way to understand you stationarity in a Time Series is by eye-balling the plot:
It’s clear from the plot that there is an overall increase in the trend,with some seasonality in it.
I have written a function for it as I will be using it quite often in this Time series explanation. But before we get to that,let me explain all the concepts in the function.
Plotting Rolling Statistics :The function will plot the moving mean or moving Standard Deviation. This is still visual method
NOTE: moving mean and moving standard deviation — At any instant ‘t’, we take the mean/std of the last year which in this case is 12 months)
Dickey-fuller Test :This is one of the statistical tests for checking stationarity. First we consider the null hypothesis: the time series is non- stationary. The result from the rest will contain the test statistic and critical value for different confidence levels. The idea is to have Test statistics less than critical value, in this case we can reject the null hypothesis and say that this Time series is indeed stationary (the force is strong with this one !!)
- Standard deviation (instead of variance)
- Plot original series
- Plot mean
- Plot std
- Plot Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller
#Determing rolling statistics
rolmean = pd.rolling_mean(timeseries, window=12)
rolstd = pd.rolling_std(timeseries, window=12)
#Plot rolling statistics:
plt.plot(rolmean, color='red', label='Rolling Mean')
plt.plot(rolstd, color='black', label = 'Rolling Std')
plt.title('Rolling Mean & Standard Deviation')
#Perform Dickey-Fuller test:
print 'Results of Dickey-Fuller Test:'
dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest.items():
dfoutput['Critical Value (%s)'%key] = value
Now let’s parse our time series data into this function:
This is not stationary because :
• mean is increasing even though the std is small.
• Test stat is > critical value.
• Note: the signed values are compared and the absolute values.
MAKING THE TIME SERIES STATIONARY
There are two major factors that make a time series non-stationary. They are:
• Trend: non-constant mean
• Seasonality: Variation at specific time-frames
The basic idea is to model the trend and seasonality in this series, so we can remove it and make the series stationary. Then we can go ahead and apply statistical forecasting to the stationary series. And finally we can convert the forecasted values into original by applying the trend and seasonality constrains back to those that we previously separated.
Let’s start by working on the trend piece.
The first step is to reduce the trend using transformation, as we can see here that there is a strong positive trend. These transformation can be log, sq-rt, cube root etc . Basically it penalizes larger values more than the smaller. In this case we will use the logarithmic transformation.
There is some noise in realizing the forward trend here. There are some methods to model these trends and then remove them from the series. Some of the common ones are:
• Smoothing: using rolling/moving average
• Aggression: by taking the mean for a certain time period (year/month)
I will be using Smoothing here.
In smoothing we usually take the past few instances (rolling estimates) We will discuss two methods under smoothing- Moving average and Exponentially weighted moving average.
Moving average -
First take x consecutive values and this depends on the frequency if it is 1 year we take 12 values. Lucky for us that Pandas has a function for rolling estimate (“alright alright alright” -Matthew McConaughey!)
Now subtract the rolling mean from the original series.
The reason there are null values is because we take the average of first 12 so 11 values are null. We can also see that in the visual representation. Thus it is dropped for further analysis. Now let’s parse it to the function to check for stationarity.
We notice two things:
• The rolling values are varying slightly but there is no specific trend.
• The test statistics is smaller than the 5 % critical values. That tells us that we are 95% confident that this series is stationary.
In this example we can easily take a time period (12 months for a year), but there are situations where the time period range is more complex like stock price etc. So we use the exponentially weighted moving average (there are other weighted moving averages but for starters, lets use this). The previous values are assigned with a decay factor. Pandas again comes to the rescue with some awesome functions for it, like:
the parameter (halflife) is assumed 12, but it really depends on the domain. Let’s check stationarity now:
It is stationary because:
• Rolling values have less variations in mean and standard deviation in magnitude.
• the test statistic is smaller than 1% of the critical value. So we can say we are almost 99% confident that this is stationary.
Seasonality (along with Trend)
Previously we saw just trend part of the time series, now we will see both trend and seasonality. Most Time series have trends along with seasonality. There are two common methods to remove trend and seasonality, they are:
• Differencing: by taking difference using time lag
• Decomposition: model both trend and seasonality, then remove them
Here we first take the difference of the value at a particular time with that of the previous time. Now let’s do it in Pandas.
Looks ok to me but let’s parse it using our stationary testing function
It is stationary because:
• the mean and std variations have small variations with time.
• test statistic is less than 10% of the critical values, so we can be 90 % confident that this is stationary.
Here we model both the trend and the seasonality, then the remaining part of the time series is returned. Guess what? Yup, we have some awesome function for it. Let’s check it out:
Remove the trend and seasonality from the Time series and now we can use the residual values. Let’s check stationarity.
This is stationary because:
• test statistic is lower than 1% critical values.
• the mean and std variations have small variations with time.
Forecasting a Time Series
Now that we have made the Time series stationary, let’s make models on the time series using differencing because it is easy to add the error , trend and seasonality back into predicted values .
We will use statistical modelling method called ARIMA to forecast the data where there are dependencies in the values.
Auto Regressive Integrated Moving Average(ARIMA) — It is like a liner regression equation where the predictors depend on parameters (p,d,q) of the ARIMA model .
Let me explain these dependent parameters:
• p : This is the number of AR (Auto-Regressive) terms . Example — if p is 3 the predictor for y(t) will be y(t-1),y(t-2),y(t-3).
• q : This is the number of MA (Moving-Average) terms . Example — if p is 3 the predictor for y(t) will be y(t-1),y(t-2),y(t-3).
• d :This is the number of differences or the number of non-seasonal differences .
Now let’s check out on how we can figure out what value of p and q to use. We use two popular plotting techniques; they are:
• Autocorrelation Function (ACF): It just measures the correlation between two consecutive (lagged version). example at lag 4, ACF will compare series at time instance t1…t2 with series at instance t1–4…t2–4
• Partial Autocorrelation Function (PACF): is used to measure the degree of association between y(t) and y(t-p).
The two dotted lines on either sides of 0 are the confidence intervals. These can be used to determine the ‘p’ and ‘q’ values as:
• p: The first time where the PACF crosses the upper confidence interval, here its close to 2. hence p = 2.
• q: The first time where the ACF crosses the upper confidence interval, here its close to 2. hence p = 2.
Now, using this make 3 different ARIMA models considering individual as well as combined effects. I will also print the RSS for each. Please note that here RSS is for the values of residuals and not actual series.
• ARIMA =1.0292
ARIMA has the best RSS values.
FINAL STEP: BRINGING THIS BACK TO THE ORIGINAL SCALE
• First get the predicted values and store it as series. You will notice the first month is missing because we took a lag of 1(shift).
• Now convert differencing to log scale: find the cumulative sum and add it to a new series with a base value (here the first-month value of the log series).
• Next -take the exponent of the series from above (anti-log) which will be the predicted value — the time series forecast model.
- Now plot the predicted values with the original.
• Find the RMSE
The result can be further refined to get a better model. The scope of the blog was to quickly introduce Time Series Forecasting. Hope you guys enjoyed the blog, there a lot more details with respect Time series analysis and forecasting. This is a good place to understanding the theory behind the practical techniques .