# Time Series Analysis with Python

A time series is a sequence of data points indexed in time order, normally with equally spaced points in time. Examples of time series are stock prices, monthly returns, a company's sales and so forth. A time series can be seen as data with a target variable (price, returns, amount of sales…) and one feature only: time. Indeed, the idea of time series analysis is that we can extract interesting information by analyzing the behavior of a given variable over time. Then, if relevant findings emerge, it will ideally be possible to predict the trend of our variable in the future.

Before starting the analysis, it is worth spending some lines on the models we are going to employ. They belong to the class of Autoregressive Moving Average, or ARMA, models. The idea of ARMA models is that, given a time series *Y*, the value of *Y* at time *t* might depend on its past values (this is the autoregressive component) and on its past error terms, which are supposed to be white noise (this is the moving average component).
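One common way to write an ARMA(*p*, *q*) model explicitly, with *c* a constant, *φ* and *θ* the autoregressive and moving average coefficients, and *ε* the white-noise error terms, is:

`Y(t) = c + φ(1)Y(t-1) + … + φ(p)Y(t-p) + ε(t) + θ(1)ε(t-1) + … + θ(q)ε(t-q)`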

In general, every statistical model is estimated following a defined method (for example, linear regression is estimated with the Ordinary Least Squares method). In our case, ARMA models are estimated with Maximum Likelihood Estimation (MLE). The underlying idea is very intuitive: the MLE of a vector of parameters, given a sample, is the value for which that sample is most likely to be observed.
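As a minimal sketch of this idea (separate from the beer example, and assuming NumPy and SciPy are installed), the snippet below recovers the parameters of a normal sample by minimizing the negative log-likelihood; the optimum coincides with the sample mean and standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

# Draw a sample from a normal distribution with known parameters
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)

def neg_log_likelihood(params, data):
    # Negative log-likelihood of i.i.d. normal data
    mu, sigma = params
    n = len(data)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((data - mu)**2) / (2 * sigma**2)

# Minimizing the negative log-likelihood = maximizing the likelihood
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(sample,),
                  bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = result.x
```

ARMA likelihoods are more involved (the error terms are not observed directly), but statsmodels performs the same kind of numerical maximization under the hood.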

The fitting process of the model will return the estimated coefficients. However, we need to know the order (*p,q*) of that model. To do that, the first step of time series analysis (and I’d add, of any data analysis task) is studying and getting familiar with our data, inspecting them and formulating some first assumptions.

For this purpose, I'm going to use a free dataset available on Kaggle, which contains the monthly beer production in Australia:

```python
import pandas as pd

df = pd.read_csv('monthly-beer-production-in-austr.csv')
df.head()
```

Let’s have a look at the behavior of our data:

```python
df['datetime'] = pd.to_datetime(df['Month'])
df = df.set_index('datetime')
df.drop(['Month'], axis=1, inplace=True)

df.plot()
```

As you can see, there was a clear upward trend between 1959 and 1979; then production seemed to plateau and slowly decrease up to 1985. From there, it jumped upward again, with some dips too, so it is more difficult to identify a clear pattern.

Now, every time series can be decomposed into four main components:

- **Level**: the average value of the series
- **Trend**: the 'direction' of that value, which can be increasing or decreasing
- **Seasonality**: a repeating short-term cycle in the series
- **Noise**: random variation in the series which cannot be explained by the available information

Note that all series have level and noise, while trend and seasonality are optional. To verify the presence of these optional elements, it is useful to split the series into these components, and we can do so by using some Python tools:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(df, model='additive')
result.plot()
```

Here, I assumed that the model is additive. It means that the value of our variable is given by the summation of the components above:

`Y(t)=level + trend + seasonality + noise`

Alternatively, there are multiplicative models, which assume that the value is given by the product of those components rather than their sum.

Now a further concept needs to be introduced: stationarity.

Ideally, we want to work with series that are stationary (that is, with constant mean and variance over time) in order to bypass the main concern of time series: the impossibility of observing, at a fixed time *t*, all the possible realizations of the target variable. Indeed, we only observe one path of realizations over time of any given variable.

There are different ways to make a time series stationary. One of these is simply taking its first difference, after subtracting the seasonal component:

```python
import numpy as np
import matplotlib.pyplot as plt

# Subtract the seasonal component, then take the first difference
x = df['Monthly beer production'] - result.seasonal
diff = np.diff(x)

plt.plot(diff)
```

It now looks more stationary. These interventions we made (getting rid of seasonality and taking the first difference) can be implemented directly into the model, building a so-called SARIMA (Seasonal Autoregressive Integrated Moving Average) model. Indeed, the "integrated" component considers the *d*th difference of the series, while the "seasonal" component takes seasonality into account.

So, as I said before, the last step before training the model is deciding the values of *p*, *q* and, in this specific case, *d* (the order of differencing needed to make our series stationary). A good way to proceed is inspecting the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of our series:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(diff, lags=10)
plot_pacf(diff, lags=10)
```

These functions tell us whether time series values exhibit some kind of correlation with their past values and, if so, up to which time lag. These serial correlations can be computed either including the indirect effects of the intermediate lags between *t* and *t-n* (ACF) or considering only the direct correlation between the values at *t* and *t-n* (PACF).

How can one interpret these two plots? The basic recipe is the following: if the PACF cuts off sharply after lag *p* while the ACF tails off gradually, an AR(*p*) component is suggested; if the ACF cuts off after lag *q* while the PACF tails off, an MA(*q*) component is suggested; if both tail off, a mixed ARMA model is likely the best choice.

Generally speaking, it is always a good practice to train more than one model and compare them according to given criteria (I will dwell on this concept later on). Luckily, there is a function in Python which will make this comparison in a few seconds, returning the best combination of parameters (again, we still have to define what “best combination” means).

Before employing this function, let’s train a very simple ARIMA model by choosing low values of *p*, *q* and *d*.

```python
# Note: the old statsmodels.tsa.arima_model module is deprecated;
# recent statsmodels versions expose ARIMA here instead
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(diff, order=(1, 2, 0))
model_fit = model.fit()
print(model_fit.summary())
```

The very first operation once the model is trained is inspecting its residuals. For our model to be accurate enough, it is fundamental that residuals are stationary, without a specific trend: if there were a trend, it would mean there is still a pattern the model didn't capture, that is, some available information that has not been used. If residuals are not stationary, our model needs more (or a different combination of) parameters.

Let’s examine our residuals:

```python
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
```

They might look stationary, but we need to test it. To do that, I will employ the Ljung-Box test, whose hypotheses are:

- H0: The data are independently distributed
- H1: The data are not independently distributed (hence, they exhibit serial correlation)

Let’s examine the output of this test run up to lag 10:

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

acorr_ljungbox(residuals, lags=10)
```

In recent statsmodels versions, `acorr_ljungbox` returns a DataFrame whose `lb_pvalue` column contains the *p-values* of the Ljung-Box test (older versions returned them as the second array of a tuple). Since all *p-values* are below 0.05, we can reject the null hypothesis of no autocorrelation between the series and each of its first 10 lags at a >95% confidence level. In other words, the test rejected our assumption of absence of correlation: our model was not able to capture all the available information, which is understandable, considering we set its parameters arbitrarily.

Now, as anticipated, let's find the best model using the *auto_arima()* function. It returns the best ARIMA model according to either the AIC, HQIC or BIC value. These (reported in the model summary) are three different information criteria: they bypass the risk of excessive complexity that would emerge if we relied only on the MLE (which tends to increase the number of parameters to maximize the joint probability of occurrence). Indeed, these criteria penalize the likelihood as the number of parameters increases, following the "law of parsimony" that became a fundamental principle of statistical theory (to quote William of Ockham (1287–1347), "best models are those which are simple and fit the data well").

So let's first divide our time series into a train and a test set. Here, I'm going to use the original, non-stationary dataset again, so that you can see how *auto_arima()* will try to handle it with a specific model.

```python
train = df[:int(0.7 * len(df))]
test = df[int(0.7 * len(df)):]

train['Monthly beer production'].plot()
test['Monthly beer production'].plot()
```

Now let's run *auto_arima()*. To do so, you need to install the *pmdarima* package (formerly known as *pyramid*) via `pip install pmdarima` on your console.

```python
from pmdarima.arima import auto_arima

model = auto_arima(train, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train)
```

As you can see, the best ARIMA model is the one with order (1, 1, 1). Now it's time to make some forecasts:

```python
import matplotlib.pyplot as plt

forecast = model.predict(n_periods=len(test))
forecast = pd.DataFrame(forecast, index=test.index, columns=['Prediction'])

# Plot the predictions against the validation set
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(forecast, label='Prediction')
plt.legend()
plt.show()
```

As you can see, the model was able to capture (and exaggerate a bit) the slightly upward trend starting again after 1985. Evidently, our model interpreted the sudden decrease in beer production as temporary, forecasting an upward trend in the long run.

Finally, we can evaluate our results using the Root Mean Squared Error (RMSE) metric:

```python
from math import sqrt
from sklearn.metrics import mean_squared_error

rms = sqrt(mean_squared_error(test, forecast))
print('Root Mean Squared Error: {}'.format(rms))
```

To sum up, Python offers several tools for time series analysis, and you can decide whether to rely on *auto_arima()* or to find your best combination of parameters yourself with the aid of information criteria.