Hands-On Time-Series Forecasting in Python
“The goal of forecasting is not to predict the future but to tell you what you need to know to take meaningful action in the present” ~ Paul Saffo
Time series forecasting is an important concept in statistics as well as in data science. In this article, I'll cover an in-depth, hands-on exercise in time series analysis.
Before starting with the Python code, let's discuss some very basic points of time series analysis. For those who are new to it, I've attached some links in the references to get a complete overview of time series analysis.
As we know, statistical data is of 3 types:
- Time series data: A set of observations on the values that a variable takes at different times.
- Cross-sectional data: Data of one or more variables, collected at the same point in time.
- Pooled data: A combination of time series data and cross-sectional data.
Thus, the study of one variable for a period of time is called time series analysis.
The objective of time series analysis is to understand the process which is generating the series and forecast future values of a variable under study.
There are 4 components of a Time Series:
- Trend (T): This is the general, long term tendency of the data to increase or decrease with time.
- Seasonal (S): It is the non-random component that repeats itself at regular intervals.
- Cyclic (C): Oscillatory movements with a period of more than one year are cyclic variations.
- Random (R): Any movement of the series that can't be accounted for by the above three components is attributed to random variation. These are unforeseen events, famines, etc.
We usually decompose a time series into its different components, as we might be interested in one component in particular.
Mathematical Model of Time Series
- Additive: Y = T + S + C + R
- Multiplicative: Y = T * S * C * R
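As a toy illustration of the two models (the numbers below are made up for the sketch, not taken from any dataset):

```python
# Toy sketch of the additive vs multiplicative models, using made-up
# component arrays (C is omitted and R set to zero for clarity).
import numpy as np

t = np.arange(12)                                       # twelve monthly observations
trend = 100 + 2 * t                                     # T: steady upward tendency
seasonal = 10 * np.sin(2 * np.pi * t / 12)              # S: repeats every 12 steps
random = np.zeros(12)                                   # R: zeroed out for clarity

additive = trend + seasonal + random                    # Y = T + S + R

seasonal_factor = 1 + 0.1 * np.sin(2 * np.pi * t / 12)  # S as a factor near 1
multiplicative = trend * seasonal_factor                # Y = T * S
```

In the additive model the seasonal swing stays the same size as the trend grows; in the multiplicative model it scales with the level of the series.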
To read more on these models, click here
Once, we decompose a Time series, based on the type of mathematical model, we can separate various components. Later in this article, I’ll show how to do this using Python.
ARIMA and ARMA are 2 common models used in the analysis and forecasting of Time Series data.
ARMA: AutoRegressive Moving Average, denoted ARMA(p, q).
In the AR model, the current value is expressed as a linear function of a finite number of previous values and a random shock.
In the MA model, the current value is expressed as a linear function of the current and finite number of previous shocks.
Thus, ARMA is a combination of both of these.
To get the value of p (lag order), we use PACF (Partial Autocorrelation Function) and for q (order of Moving average), we use ACF (Autocorrelation Function).
ARIMA: AutoRegressive Integrated Moving Average, denoted ARIMA(p, d, q).
This model has both the AR and MA parts, along with an 'I' part. Here, 'integrated' refers to differencing; its order is denoted by d and is called the order of differencing. Differencing is done to make a series stationary.
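A minimal sketch of what the 'I' part does, on a toy series with a pure linear trend:

```python
# Minimal sketch: first-order differencing (d = 1) removes a linear trend.
import numpy as np

y = 5 + 3 * np.arange(10)   # toy series with a pure linear trend: 5, 8, 11, ...
diff1 = np.diff(y)          # y_t - y_{t-1}
print(diff1)                # every value is 3: the trend is gone
```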
To read more about these models, go to the ‘References’ section of this article and open the 2nd link.
Now, let’s start to code.
Dataset Information: This dataset has been provided by Cheenta for their course Data Science Projects, which is offered for free. They're building a data science community where everyone can grow and learn for free. If you want to get added to the community, click here.
(P.S.: This is not a promotional post or comment. Anyone can join this community free of cost, interact with members and do these projects. So, if you're looking to do projects and learn from others, give it a try!)
Description of dataset: This dataset shows traffic to and from the airport as a percentage of the traffic volume during the baseline period. The baseline period used for computing this metric is from 1st February to 15th March 2020.
Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# raw string (r'...') so the backslashes in the Windows path aren't treated as escapes
data = pd.read_excel(r'F:\Cheenta\Project1\covid_impact_on_airport_traffic.xlsx')
There are 4 countries, 23 states, 27 cities and 28 airports in the dataset. Data has been collected for 262 distinct days, from 16/03/20 to 02/12/20.
Centroid and Geography are POINT and POLYGON structures, telling us that they are geographical locations.
Also, from the counts, we can see that no feature has any missing values.
Removing features that aren’t important
'Geography' is a POLYGON feature, meaning it describes a shape, presumably the outline of the airport. Since we don't require it for our analysis, we'll drop it.
The same reasoning applies to 'Centroid', which probably gives the latitude and longitude of the centre of the airport. Since we have no use for that either, we'll drop it too.
ISO_3166_2 is a unique code for each state. We won't need it for time series analysis.
AggregationMethod is always 'Daily', so it provides no information. We can remove it.
No information is provided about the version, so we'll leave it out of our analysis.
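The dropping step could then look like the sketch below; the exact column names (e.g. 'Centroid', 'Version') are my assumption of how they appear in the file and may differ, which is why the code only drops the ones it finds. The toy frame stands in for the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; the column names are assumptions.
data = pd.DataFrame({
    'Date': ['2020-03-16'], 'PercentOfBaseline': [64],
    'Geography': ['POLYGON(...)'], 'Centroid': ['POINT(...)'],
    'ISO_3166_2': ['CL-RM'], 'AggregationMethod': ['Daily'], 'Version': [1.0],
})

cols_to_drop = ['Geography', 'Centroid', 'ISO_3166_2', 'AggregationMethod', 'Version']
# Drop only the columns that actually exist, in case the names differ.
data = data.drop(columns=[c for c in cols_to_drop if c in data.columns])
print(list(data.columns))  # ['Date', 'PercentOfBaseline']
```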
# start a fresh figure for each plot so the counts don't overlap on one axis
plt.figure()
fig1 = sns.countplot(x='AirportName', data=data, palette='rainbow_r')
fig1.set_title("Count for various Airports")
plt.figure()
fig2 = sns.countplot(x='City', data=data, palette='viridis')
fig2.set_title("Count for various City")
plt.figure()
fig3 = sns.countplot(x='State', data=data, palette='cividis')
fig3.set_title("Count for various State")
plt.figure()
fig4 = sns.countplot(x='Country', data=data, palette='summer')
fig4.set_title("Count for various Country")
Checking our inferences from the univariate plots
fig5 = sns.pairplot(data,hue='Country',height=5,palette='husl',aspect=1)
plt.title("Distribution of Percent of Baseline for different countries")
plt.legend(loc = 'upper right',bbox_to_anchor=(1.2, 0., 0.5, 0.5))
Analysis for Chile
# Data on Chile comes from a single state, with one city and one airport
# (Santiago International Airport).
data_chile = data[data['Country'] == 'Chile']  # country label as it appears in the data
data_chile = data_chile.sort_values(by="Date")
data_chile.set_index('Date', inplace=True)
data_chile = data_chile.drop(columns=['AirportName', 'City', 'State', 'Country'])
data_chile.head()
data_chile['PercentOfBaseline'].plot()
plt.title("Plot for PercentOfBaseline Vs Time for Chile")
plt.show()
from statsmodels.tsa.stattools import adfuller
print('Results of Dickey-Fuller Test:')
dftest = adfuller(data_chile['PercentOfBaseline'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():   # dftest[4] holds the critical values
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the TS is non-stationary. If the ‘Test Statistic’ is less than the ‘Critical Value’, we can reject the null hypothesis and say that the series is stationary.
We can conclude that our data is not stationary; hence, we need to make it stationary, because models like ARMA and ARIMA assume a stationary series.
# KPSS test
from statsmodels.tsa.stattools import kpss
print('Results of KPSS Test:')
kpsstest = kpss(data_chile['PercentOfBaseline'], regression='c')
kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
for key, value in kpsstest[3].items():   # kpsstest[3] holds the critical values
    kpss_output['Critical Value (%s)' % key] = value
print(kpss_output)
The authors of the KPSS test defined the null hypothesis as the process being trend stationary, against an alternative hypothesis of a unit root series. The KPSS test also suggests that our series is NOT stationary.
For more information, refer to reference 4 of this article.
data_chile['diff'] = data_chile['PercentOfBaseline'] - data_chile['PercentOfBaseline'].shift(1)
data_chile['diff'].plot()
plt.title("Plot for differenced PercentOfBaseline Vs Time for Chile")
plt.show()
from statsmodels.tsa.seasonal import seasonal_decompose
# 'freq' was renamed to 'period' in recent statsmodels versions
decomposition = seasonal_decompose(x=data_chile['PercentOfBaseline'].dropna(), model='multiplicative', period=9)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(data_chile['diff'].dropna())
plt.show()  # q = 1 or q = 0
The value of q is taken as the last significant lag, after which the ACF values fall within the confidence limits (blue shading), i.e. become insignificant. Here, q = 1.
plot_pacf(data_chile['diff'].dropna())
plt.show()  # p = 3 or 5
Similarly, the value of p comes out to be 3.
df = pd.DataFrame(data_chile['diff'])
# Note: ARMA was removed in statsmodels 0.13+; on newer versions use
# statsmodels.tsa.arima.model.ARIMA with order=(p, 0, q) instead
from statsmodels.tsa.arima_model import ARMA
# Create Training and Test
train = df.iloc[:212]
test = df.iloc[212:]
# Build Model
model = ARMA(train, order=(6,0))
fitted = model.fit()
fc, se, conf = fitted.forecast(25, alpha=0.05)  # 95% confidence
# Make as pandas series
fc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)
# Plot
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series, color='g', alpha=.05)
plt.title('Forecast vs Actuals')
The values of p and q only give us a starting point. We have to run several models with orders close to those values and keep the best one. The best model is the one with the lowest AIC and highest log-likelihood, and all of its coefficients must be significant.
Tip: You can get the same model using ARIMA(6,1,0) on Chile's PercentOfBaseline. Since I had already differenced the series, I didn't use ARIMA here.
Analysis for the USA
data_US = data[data['Country']=='United States of America (the)']
df1 = pd.DataFrame(data_US.groupby('Date',as_index=True)['PercentOfBaseline'].mean())
Here, we take the mean because the USA has 16 airports in the dataset. So, we average the PercentOfBaseline values across airports for each date and model that series.
df1['PercentOfBaseline'].plot()
plt.title("Plot of USA's average PercentOfBaseline Vs Time")
plt.show()
According to the ADF test, this series is already stationary, so we'll model it directly, with no need for differencing.
The model will be best represented by ARMA(1,2).
The graph looks like this:
I’ve shown the main steps in the analysis in sequential order. To see all the steps of the analysis, please see my repository at this link.
Also, for practice, you can do a similar analysis for Australia and Canada.
Hope you liked the article. Feel free to post comments.
For more such articles on data science, follow me on medium.