Hands-On Time-Series Forecasting in Python

“The goal of forecasting is not to predict the future but to tell you what you need to know to take meaningful action in the present” ~ Paul Saffo

Anjali Pal
Nerd For Tech
9 min read · Jun 15, 2021


Time series forecasting is an important concept in statistics as well as data science. In this article, I’ll walk through an in-depth, hands-on exercise in time series analysis.

Before starting with the Python code, let’s discuss some very basic points of Time Series Analysis. For those who are new to it, I’ve attached some links in references to get the complete overview of Time Series Analysis.

As we know, statistical data is of 3 types:

  1. Time series data: A set of observations on the values that a variable takes at different times.
  2. Cross-sectional data: Data of one or more variables, collected at the same point in time.
  3. Pooled data: A combination of time series data and cross-sectional data.

Thus, the study of one variable for a period of time is called time series analysis.

The objective of time series analysis is to understand the process which is generating the series and forecast future values of a variable under study.

There are 4 components of a Time Series:

  1. Trend (T): This is the general, long term tendency of the data to increase or decrease with time.
  2. Seasonal (S): It is the non-random component that repeats itself at regular intervals.
  3. Cyclic (C): Oscillatory movements with a period of more than 1 year are cyclic variations.
  4. Random (R): Any movement of the series that the above 3 components can’t account for is attributed to random variation, for example unforeseen events such as famines.

We often decompose a time series into these components because we may be interested in one of them in particular.

Mathematical Model of Time Series

  1. Additive: the series is given by T + S + C + R
  2. Multiplicative: the series is given by T × S × C × R

To read more on these models, click here

Once we decompose a time series according to the chosen mathematical model, we can separate its components. Later in this article, I’ll show how to do this using Python.

ARIMA and ARMA are 2 common models used in the analysis and forecasting of Time Series data.

ARMA

AutoRegressive Moving Average. It is denoted by ARMA(p,q).

In the AR model, the current value is expressed as a linear function of a finite number of previous values and a random shock.

In the MA model, the current value is expressed as a linear function of the current and finite number of previous shocks.

Thus, ARMA is a combination of both of these.
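To make the two definitions concrete, here is a small sketch that simulates an ARMA(1,1) series with NumPy. The function name `simulate_arma11` and the coefficient values are hypothetical, chosen only for illustration.

```python
import numpy as np

def simulate_arma11(phi, theta, n, seed=0):
    """Simulate y[t] = phi*y[t-1] + e[t] + theta*e[t-1] with standard normal shocks e."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(size=n)
    y = np.zeros(n)
    y[0] = shocks[0]
    for t in range(1, n):
        ar_part = phi * y[t - 1]                    # AR: depends on the previous value
        ma_part = shocks[t] + theta * shocks[t - 1]  # MA: current and previous shocks
        y[t] = ar_part + ma_part
    return y, shocks

y, shocks = simulate_arma11(phi=0.5, theta=0.3, n=100)
```

Setting `theta=0` gives a pure AR(1) series, and setting `phi=0` gives a pure MA(1) series.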

To get the value of p (lag order), we use PACF (Partial Autocorrelation Function) and for q (order of Moving average), we use ACF (Autocorrelation Function).

ARIMA

AutoRegressive Integrated Moving Average. It is denoted by ARIMA(p,d,q).

This model has both AR and MA parts, along with an ‘I’ part. Here, the ‘integrated’ part refers to differencing. Its order is denoted by d and is called the ‘order of differencing’. Differencing is done to make a series stationary.
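Differencing itself is a one-liner in pandas. The toy numbers below are made up purely to show what d=1 and d=2 do to a series.

```python
import pandas as pd

series = pd.Series([10, 12, 15, 19, 24])

# d = 1: difference between each value and the previous one
diff1 = series.diff().dropna()   # 2, 3, 4, 5

# d = 2: difference the differenced series once more
diff2 = diff1.diff().dropna()    # 1, 1, 1
```

Here the original values keep growing, the first difference still grows, and the second difference is constant.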

To read more about these models, go to the ‘References’ section of this article and open the 2nd link.

Now, let’s start to code.

Dataset Information: This dataset has been provided by Cheenta for their course Data Science Projects, which is offered for free. They are building a data science community for everyone to grow and learn for free. If you want to get added to the community, click here.

(P.S.: This is not a promotional post or comment. Anyone can join this community free of cost, interact with members and do these projects. So, if you’re looking to do projects and learn from others, give it a try!)

Importing necessary libraries

Getting Data

Knowing data

There are 4 countries, 23 states, 27 cities and 28 airports in the dataset. Data has been collected for 262 distinct days, from 16/03/20 to 02/12/20.

Centroid and Geography are POINT and POLYGON structures, telling us that they are geographical locations.

Also, from the counts, we can see that no feature has any missing values.
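A sketch of how such a summary might be computed with pandas is below. The three-row CSV is a hypothetical stand-in for the real file, and the column names `Date` and `AirportName` are assumptions about how the dataset labels those fields.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has many more rows and columns
csv_text = """Date,AirportName,City,State,Country,PercentOfBaseline
2020-03-16,Airport A,City A,State A,Country A,64
2020-03-17,Airport A,City A,State A,Country A,71
2020-03-16,Airport B,City B,State B,Country B,58
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Number of distinct countries, states, cities, airports and days
distinct_counts = df[["Country", "State", "City", "AirportName", "Date"]].nunique()

# Missing values per column (all zeros means nothing is missing)
missing_counts = df.isna().sum()

# First and last day covered by the data
date_range = (df["Date"].min(), df["Date"].max())
```

On the real data, `df.info()` and `df.describe(include="all")` give a similar overview in one call.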

Removing features that aren’t important

‘Geography’ is a polygon feature, meaning it describes a shape. So, we can conclude that Geography tells us the shape of an airport. Since we don’t require it for our analysis, we’ll drop it.

The same reasoning applies to ‘Centroid’. Centroid probably tells us the latitude and longitude of the centre of the airport. Since we don’t have any use for that, we’ll drop it too.

ISO_3166_2 is a code that identifies each state (ISO 3166-2 is the standard for subdivision codes). We won’t need it for time series analysis.

AggregationMethod is always ‘Daily’, so it doesn’t provide any information. We can remove it.

No information on the version is provided. So, we’ll leave it out of our analysis.
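Dropping these features could look like the sketch below. The one-row frame is hypothetical, and the exact spellings `Version`, `Date` and `AirportName` are assumptions; the other column names are the ones mentioned above.

```python
import pandas as pd

# Hypothetical one-row frame with the same kinds of columns as the dataset
df = pd.DataFrame({
    "AggregationMethod": ["Daily"],
    "Date": ["2020-03-16"],
    "Version": [1.0],
    "AirportName": ["Airport A"],
    "PercentOfBaseline": [64],
    "Centroid": ["POINT(0 0)"],
    "City": ["City A"],
    "State": ["State A"],
    "ISO_3166_2": ["XX-YY"],
    "Country": ["Country A"],
    "Geography": ["POLYGON((0 0, 0 1, 1 1, 0 0))"],
})

# Features judged not useful for the time series analysis
unused = ["Geography", "Centroid", "ISO_3166_2", "AggregationMethod", "Version"]
df = df.drop(columns=unused)
```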

Univariate Analysis

This shows that there are around 250 data points for each airport except Santiago International Airport and Edmonton International.
This shows that all cities have roughly equal counts in the data except New York. The most likely reason is that it has more airports.
Here, all states have equal counts in the data except Alberta, Quebec, California and New York. Again, the most likely reason is the number of airports. We’ll come back to this in the in-depth analysis of countries.
The US has the most data points, followed by Canada. This is probably because the US and Canada have more airports than Australia and Chile.

Checking our inferences of Univariate plots

Table to see airports in a country
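One way to build such a table is a `groupby` on the country column. The rows below are hypothetical placeholders and `AirportName` is an assumed column name.

```python
import pandas as pd

# Hypothetical rows: Country A has two airports, Country B has one
df = pd.DataFrame({
    "Country": ["Country A", "Country A", "Country A", "Country B"],
    "AirportName": ["Airport A1", "Airport A1", "Airport A2", "Airport B1"],
})

# Number of rows per country (what the count plot shows)
rows_per_country = df["Country"].value_counts()

# Number of distinct airports per country (the likely explanation for the row counts)
airports_per_country = (
    df.groupby("Country")["AirportName"].nunique().sort_values(ascending=False)
)

# The airport names themselves, listed per country
airport_table = df.groupby("Country")["AirportName"].unique()
```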

Bivariate Analysis

Analysis for Chile

Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the time series is non-stationary. If the ‘Test Statistic’ is less than (more negative than) the ‘Critical Value’, we can reject the null hypothesis and say that the series is stationary.

We can conclude that our data is not stationary. We need to make it stationary because ARMA-type models assume a stationary series.

The authors of the KPSS test defined the null hypothesis as the process being trend-stationary, against the alternative hypothesis of a unit root. Note that this is the opposite of the Dickey-Fuller setup, where the null is non-stationarity. The KPSS test also suggests that our series is NOT stationary.

For more information, refer to reference 4 of this article.

The value of q is read from the ACF plot: it is the last significant lag, after which the ACF values fall within the limits (blue shading) and become insignificant. Here, q=1.

Similarly, the value of p, read from the PACF plot, comes out to be 3.

The values of p and q are only a starting point. We have to fit several models with orders close to those values and pick the best one. The best model is the one with the lowest AIC and the highest log-likelihood, and whose coefficients are significant.

Tip: You can get the same model using ARIMA(6,1,0) on Chile’s PercentOfBaseline. Since I have previously differenced the series, I didn’t use ARIMA here.

Analysis for the USA

In the USA, we have 16 airports, so we’ll take the mean of all their values for each day and model that averaged series.
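Averaging across airports can be sketched with a `groupby` on the date. The rows and country labels below are hypothetical, and `Date` and `AirportName` are assumed column names.

```python
import pandas as pd

# Hypothetical rows: Country A has two airports observed on two days
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-03-16", "2020-03-16", "2020-03-17", "2020-03-17", "2020-03-16"]
    ),
    "Country": ["Country A", "Country A", "Country A", "Country A", "Country B"],
    "AirportName": ["A1", "A2", "A1", "A2", "B1"],
    "PercentOfBaseline": [60, 80, 50, 70, 90],
})

# One value per day: the mean over all airports of the chosen country
country_mean = (
    df[df["Country"] == "Country A"]
    .groupby("Date")["PercentOfBaseline"]
    .mean()
    .sort_index()
)
```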

According to the ADF test, this series is stationary. So, we’ll model it directly, with no need for differencing.

The model will be best represented by ARMA(1,2).

The graph looks like this:

I’ve shown the main steps in the analysis in sequential order. To see all the steps of the analysis, please see my repository at this link.

Also, for practice, you can do a similar analysis for Australia and Canada.

Hope you liked the article. Feel free to post comments.

For more such articles on data science, follow me on Medium.

Anjali Pal
Nerd For Tech

A data science enthusiast who believes that “It is a capital mistake to theorize before one has data”- Sherlock Holmes. Visit me at https://anjali001.github.io