Time Series Forecasting With ARIMA Model in Python for Temperature Prediction

Nachiketa Hebbar
Published in The Startup
8 min read · Sep 18, 2020

Time Series forecasting is one of the most in-demand techniques of data science, be it in stock trading, predicting business sales or weather forecasting. It is clearly a very handy skill to have and I am gonna equip you with just that by the end of this article.

In this tutorial, we are gonna build an ARIMA model (don’t worry if you do not know exactly how it works yet) to predict the future temperature values of a particular city using Python. The GitHub link for the code and data set can be found at the end of this blog. I have also attached my YouTube video at the end, in case you are interested in a video explanation. So without wasting any time, let’s get started.

Reading Your Data

The first step in any time series project is to read your data and see what it looks like. The following code snippet demonstrates how to do that.

import pandas as pd
df=pd.read_csv('/content/MaunaLoaDailyTemps.csv',index_col='DATE' ,parse_dates=True)
df=df.dropna()
print('Shape of data',df.shape)
df.head()
df

The code is pretty straightforward. We read the data using pd.read_csv, and passing parse_dates=True makes sure that pandas treats the index as date values rather than strings.
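If you want to confirm that the dates really were parsed, a quick check like the one here can help. This is only a sketch: the three-row CSV is a made-up stand-in for the real file, and only the column names DATE and AvgTemp come from the snippet above.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "DATE,AvgTemp\n"
    "2014-01-01,40.0\n"
    "2014-01-02,41.0\n"
    "2014-01-03,42.0\n"
)

sample = pd.read_csv(sample_csv, index_col='DATE', parse_dates=True)

# A DatetimeIndex means pandas parsed the dates instead of keeping strings.
print(type(sample.index).__name__)  # DatetimeIndex
print(sample.index.min(), sample.index.max())
```

If the index type comes out as a plain Index of strings instead, the date column was not parsed and date-based plotting and slicing will not behave as expected.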

Next we drop any missing values and print the shape of the data. df.head() prints the first 5 rows of the dataset. Here is the output you should see for this:

Plot Your data

The next step is to plot your data. This gives you an idea of whether the data is stationary or not. For those who don’t know what stationarity means, let me give you a gist of it. Although I have made several videos on this topic, it all boils down to this:

Any time series that we want to model this way needs to be stationary. Stationary means that its statistical properties are more or less constant with time. Makes sense, right? How else are you supposed to make predictions if the statistical properties are varying with time? These are the properties that any stationary series will have:

  1. Constant Mean
  2. Constant Variance (there can be variations, but the variations shouldn’t be irregular)
  3. No seasonality (no repeating patterns in the data set)
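One informal way to eyeball the first two properties is to look at a rolling mean and rolling standard deviation. The sketch here uses synthetic data rather than the temperature file, and the helper name rolling_summary and the 30-day window are my own choices for illustration, not a standard test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2014-01-01', periods=365, freq='D')

# Synthetic stand-ins: noise around 45, and the same noise plus an upward trend.
stationary = pd.Series(45 + rng.normal(0, 3, size=365), index=dates)
trending = stationary + np.linspace(0, 20, 365)

def rolling_summary(series, window=30):
    """Rolling mean and standard deviation over the given window."""
    roll = series.rolling(window)
    summary = pd.DataFrame({'rolling_mean': roll.mean(),
                            'rolling_std': roll.std()})
    return summary.dropna()

flat = rolling_summary(stationary)
drift = rolling_summary(trending)

# How far the rolling mean wanders over the year for each series.
flat_range = flat['rolling_mean'].max() - flat['rolling_mean'].min()
drift_range = drift['rolling_mean'].max() - drift['rolling_mean'].min()
print(flat_range, drift_range)
```

The rolling mean of the trending series wanders much further than that of the flat one, which is the kind of varying mean that breaks stationarity.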

So the first step is to check for stationarity. If your data set is not stationary, you’ll have to convert it to a stationary series. Now before you start worrying about all of this, relax! We have a standard, easy test to check for stationarity called the ADF (Augmented Dickey-Fuller) test. But before showing that, let’s plot the data first.

Since I am only interested in predicting the average temperature, that is the only column I will be plotting.

df['AvgTemp'].plot(figsize=(12,5))

Checking For Stationarity

Right off the bat, we can see that the series seems to have a roughly constant mean of around 45, and the fluctuations also seem to be more or less the same. However, to be sure whether the data is stationary or not, we run a statistical test using the following code:

from statsmodels.tsa.stattools import adfuller

def adf_test(dataset):
    # Run the Augmented Dickey-Fuller test, picking the lag length by AIC
    dftest = adfuller(dataset, autolag='AIC')
    print("1. ADF : ", dftest[0])
    print("2. P-Value : ", dftest[1])
    print("3. Num Of Lags : ", dftest[2])
    print("4. Num Of Observations Used For ADF Regression:", dftest[3])
    print("5. Critical Values :")
    for key, val in dftest[4].items():
        print("\t", key, ": ", val)

adf_test(df['AvgTemp'])

You will get the output as follows:

You don’t need to worry about all the complex statistics. To interpret the test results, you only need to look at the p value. And you use the following simple method:

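If p < 0.05: the data is stationary

If p > 0.05: the data is not stationary

It’s not a hard and fast rule, but stationary data should have a small p value. (The test starts from the assumption that the series is not stationary, and a small p value is evidence against that assumption.) A larger p value could indicate the presence of certain trends (varying mean) or seasonality as well.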
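The rule of thumb above can be written as a tiny helper. This is just a sketch: the function name interpret_adf and the wording of the verdicts are my own, and the 0.05 cutoff is the conventional threshold used above, not a law of nature.

```python
def interpret_adf(p_value, alpha=0.05):
    """Turn an ADF p-value into a plain-language verdict.

    The ADF null hypothesis is that the series has a unit root
    (is non-stationary), so a small p-value counts against it.
    """
    if p_value < alpha:
        return "stationary (unit root rejected)"
    return "not shown to be stationary (unit root not rejected)"

print(interpret_adf(0.001))
print(interpret_adf(0.40))
```

You would pass it the second element of the adfuller result, for example interpret_adf(dftest[1]).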

Finally, Decide your ARIMA Model

Now although I have made several YouTube videos on this topic, if you do not fully understand what an ARIMA model is, allow me to present an easy overview:

ARIMA is composed of 3 terms (Auto-Regression + Integrated + Moving-Average)

  1. Auto-Regression:

This basically means that you are using the previous values of the time series in order to predict the future. How many past values you use determines the order of the AR model. Here’s what an AR(1) model looks like:

Y(t) = Some_Constant*Y(t-1) + Another_Constant + Error(t)

Simple enough, right?
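To make the AR(1) equation concrete, here is a small simulation of it. Treat it as a sketch: the function name simulate_ar1 is made up, and the values 0.8 and 9.0 are arbitrary picks chosen so that the series settles around 45.

```python
import numpy as np

def simulate_ar1(some_constant, another_constant, n_steps,
                 noise_std=1.0, seed=0):
    """Generate Y(t) = some_constant*Y(t-1) + another_constant + Error(t)."""
    rng = np.random.default_rng(seed)
    y = np.empty(n_steps)
    # Start at the long-run mean, another_constant / (1 - some_constant),
    # which is valid when abs(some_constant) < 1.
    y[0] = another_constant / (1 - some_constant)
    for t in range(1, n_steps):
        error_t = rng.normal(0, noise_std)
        y[t] = some_constant * y[t - 1] + another_constant + error_t
    return y

series = simulate_ar1(some_constant=0.8, another_constant=9.0, n_steps=500)
print(series[:5])
print(series.mean())  # hovers near 9 / (1 - 0.8) = 45
```

Each value leans on the one before it, which is why an AR series looks smoother than pure noise.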

2. Integrated:

So, remember our talk on stationarity, and how it’s extremely important? Well, if your data set is not stationary, you most often need to perform some sort of differencing operation to make it stationary. If you subtract the previous value from each value once, that’s differencing of order 1, and so on. Here’s an example of that:

Forgive my bad drawing. But as you can see, the series Y(t) was not stationary because of an increasing trend resulting in a varying mean. We simply subtract the previous value from each value and voila! It becomes stationary. Depending on your data, you might have to repeat the differencing to get second-order differencing, third-order and so on.
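Here is the same idea with numbers instead of a drawing. The two toy series are made up for illustration: one has a straight-line trend, the other a curved one that needs differencing twice.

```python
import pandas as pd

# A straight-line trend: one round of differencing flattens it.
linear = pd.Series([10, 12, 14, 16, 18, 20])
first_diff = linear.diff().dropna()
print(first_diff.tolist())   # [2.0, 2.0, 2.0, 2.0, 2.0]

# A curved (quadratic) trend: it takes two rounds to become constant.
curved = pd.Series([0, 1, 4, 9, 16, 25])
second_diff = curved.diff().diff().dropna()
print(second_diff.tolist())  # [2.0, 2.0, 2.0, 2.0]
```

Note that every round of differencing costs you one observation at the start of the series, which is why dropna() is applied.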

3. Moving Average:

This basically means that you are using previous errors to make the future prediction. Also makes sense, right? By seeing how wrong you were in your previous predictions, you take that into account to make a better prediction. And just like in an AR model, the number of previous errors (also called the number of lags) you use determines the order of the model.

Here’s what the MA(1) equation looks like:
Y(t) = Mean + Some_Constant*Error(t-1) + Error(t)
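The MA(1) equation can be simulated in a couple of lines as well. Again this is only a sketch: simulate_ma1 is a made-up name, and the mean of 45 and coefficient of 0.5 are arbitrary illustration values.

```python
import numpy as np

def simulate_ma1(mean, some_constant, n_steps, noise_std=1.0, seed=0):
    """Generate Y(t) = mean + some_constant*Error(t-1) + Error(t)."""
    rng = np.random.default_rng(seed)
    # One extra draw so the first step also has a previous error.
    errors = rng.normal(0, noise_std, size=n_steps + 1)
    previous_errors = errors[:-1]
    current_errors = errors[1:]
    return mean + some_constant * previous_errors + current_errors

ma_series = simulate_ma1(mean=45.0, some_constant=0.5, n_steps=500)
print(ma_series[:5])
print(ma_series.mean())  # stays close to the chosen mean of 45
```

Unlike the AR case, a shock here only affects the current value and the next one, and then it is gone.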

So our main job is to decide the order of the AR, I and MA parts, which are denoted by (p, d, q) respectively.

And before you start worrying, let me tell you that everything is gonna be done automatically. The pmdarima library comes to our rescue! It does the job of figuring out the order of the ARIMA model all by itself. Here’s what the code snippet looks like:

from pmdarima import auto_arima
stepwise_fit = auto_arima(df['AvgTemp'], trace=True,
                          suppress_warnings=True)

(Make sure to install the pmdarima library first using pip install pmdarima)

The code is pretty self-explanatory. We simply supply our data to the auto_arima function. The function uses something called the AIC score to judge how good a model of a particular order is. It simply tries to minimize the AIC score, and here’s what the output looks like:

Model performance for different combination of orders

We can see that the best ARIMA model seems to be of the order (1,0,5), with the minimum AIC score of 8294.785. With this knowledge we can finally proceed to train and fit the model to start making predictions!
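If you are curious what the AIC score actually is, it is 2k - 2*ln(L), where k is the number of fitted parameters and L is the model's likelihood. The sketch here shows the "pick the minimum" logic on its own; the log-likelihoods and parameter counts are invented toy numbers, not values from the temperature data or from pmdarima.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Made-up (log-likelihood, parameter count) pairs for three candidate orders.
toy_fits = {
    (1, 0, 0): (-100.0, 3),
    (2, 0, 2): (-95.0, 6),
    (3, 0, 3): (-94.5, 9),
}

scores = {order: aic(ll, k) for order, (ll, k) in toy_fits.items()}
best_order = min(scores, key=scores.get)
print(scores)
print(best_order)  # (2, 0, 2)
```

Notice that the biggest model fits slightly better but still loses, because AIC charges 2 points for every extra parameter.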

Split Your Dataset

Before we actually train the model, we have to split the data set into a training and a testing section. We do this because we first train the model on the training data and keep the testing section hidden from it. Once the model is ready, we ask it to make predictions on the test data and see how well it performs.

The following code snippet illustrates how to do that:

print(df.shape)
train=df.iloc[:-30]
test=df.iloc[-30:]
print(train.shape,test.shape)

So as you can probably tell, we are reserving the last 30 days of the data as the testing section. You can see the shapes of the actual data, and of the training and testing sections, in the output.

Shape of training and testing section
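The split above can be wrapped in a small helper if you want to try different test sizes. This is a sketch on a made-up 100-day frame; split_last_n is a hypothetical name, and the point it illustrates is that a time series split keeps the rows in order instead of shuffling them.

```python
import numpy as np
import pandas as pd

def split_last_n(frame, n_test=30):
    """Hold out the final n_test rows as the test set (n_test must be >= 1)."""
    return frame.iloc[:-n_test], frame.iloc[-n_test:]

# Toy data: 100 consecutive days with placeholder values.
dates = pd.date_range('2018-01-01', periods=100, freq='D')
toy = pd.DataFrame({'AvgTemp': np.arange(100.0)}, index=dates)

train_toy, test_toy = split_last_n(toy)
print(train_toy.shape, test_toy.shape)               # (70, 1) (30, 1)
print(train_toy.index.max() < test_toy.index.min())  # True
```

The last check confirms that every training date comes before every test date, so the model never gets to peek at the future.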

Finally, We get to the Juicy Stuff!

Surprisingly, creating the ARIMA model is actually one of the easiest steps once you have done all the prerequisite steps. It’s as simple as shown in the code snippet below:

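# The older statsmodels.tsa.arima_model module has been removed from
# recent statsmodels releases, so we import from statsmodels.tsa.arima.model
from statsmodels.tsa.arima.model import ARIMA
model=ARIMA(train['AvgTemp'],order=(1,0,5))
model=model.fit()
model.summary()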

As you can see, we simply create the ARIMA model, supply it our data set and mention the order we want. You will be able to see the summary of the model in your output as well.

Model Summary

You can see a whole lot of information about your model over here. You will also be able to see the coefficients of each AR and MA term. These are nothing but the values of the variables that you saw in the previous AR/MA model equations, which were labelled as ‘Some_Constant’. Generally, a higher magnitude of a coefficient means that the term has a larger impact on the output.

Check How Good Your Model Is

Here’s where our test data comes in. We first make predictions for temperature over the test period. Then we plot them to see how our predictions compare to the actual data.

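start=len(train)
end=len(train)+len(test)-1
# typ='levels' is not needed here: with d=0 there is no differencing,
# so the predictions are already in the original units
pred=model.predict(start=start,end=end).rename('ARIMA Predictions')
pred.plot(legend=True)
test['AvgTemp'].plot(legend=True)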

To actually make predictions, we need to use the model.predict function and tell it the starting and ending index in which we want to make the predictions.

Since we want to start making predictions where the training data ends, that is what I have written in the start variable. We want to stop making predictions when the data set ends, which explains the end variable. If you want to make future predictions as well, you can just change the start and end variables to the indexes you want. Your output plot should look like this:

Test values vs Predictions Plot

As you can see, the predictions do a pretty good job of matching the actual trend, although there is a certain acceptable lag.
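For the future predictions mentioned earlier, the only fiddly part is working out the start and end indexes and the dates they correspond to. Here is a sketch of that arithmetic. It assumes the model has been refit on the whole data set, so that the index equal to the number of observations is the first day after the data ends; future_window is a made-up helper, and the 31 observations and the 2020-01-31 end date are toy values.

```python
import pandas as pd

def future_window(n_observations, last_date, n_future=30):
    """Start/end positions and calendar dates for n_future days past the data."""
    start = n_observations               # first position after the last row
    end = n_observations + n_future - 1  # predict() treats end as inclusive
    future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1),
                                 periods=n_future, freq='D')
    return start, end, future_dates

start_f, end_f, dates_f = future_window(n_observations=31,
                                        last_date=pd.Timestamp('2020-01-31'))
print(start_f, end_f)  # 31 60
print(dates_f[0])      # 2020-02-01 00:00:00
```

You would then pass start_f and end_f to model.predict and, if you like, attach dates_f as the index of the resulting predictions.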

Check your Accuracy Metric

To actually ascertain how good or bad your model is we find the root mean squared error for it. The following code snippet shows that:

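from sklearn.metrics import mean_squared_error
from math import sqrt
# Mean of the test set, to compare against the size of the error
print(test['AvgTemp'].mean())
# mean_squared_error expects (actual values, predicted values)
rmse=sqrt(mean_squared_error(test['AvgTemp'],pred))
print(rmse)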

First we check the mean value of the test set, which comes out to be about 45. The root mean squared error for this particular model should come to around 2.3. What you should care about is that your root mean squared error is much smaller than the mean value of the test set. In this case we can see the average error is gonna be roughly 2.3/45*100 = 5.1% of the actual value.
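To see what the RMSE is doing under the hood, here it is computed by hand on a toy example. The four actual and predicted values are invented so that the arithmetic is easy to follow; they are not taken from the temperature data.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values: every prediction misses by exactly 2 degrees.
actual_toy = [44.0, 46.0, 45.0, 45.0]
predicted_toy = [46.0, 44.0, 47.0, 43.0]

error = rmse(actual_toy, predicted_toy)
relative_error = error / np.mean(actual_toy) * 100
print(error)           # 2.0
print(relative_error)  # about 4.4 (percent of the mean of 45)
```

The same percentage calculation is what gives the 5.1% figure above: the error divided by the mean of the test set, times 100.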

So with that your ARIMA model is ready to go! In future blogs I am gonna talk about different models and how you can increase the accuracy of the model further.

If you are interested in the video explanation of the same, head over to my YouTube channel for more such content! You can find the GitHub link for the code and data set here: https://github.com/nachi-hebbar/ARIMA-Temperature_Forecasting

Feel free to connect with me on LinkedIn as well!
