Interpreting ACF or Auto-correlation plot
A time series can be linearly related to a lagged version of itself.
What is an ACF plot?
A time series is a sequence of measurements of the same variable(s) made over time, usually at evenly spaced intervals (for example, monthly or yearly). The correlation coefficient between values of the series that are k steps apart is called the autocorrelation at lag k, and the autocorrelation function (ACF) collects these coefficients over a range of lags. In other words,
>Autocorrelation represents the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
>Autocorrelation measures the relationship between a variable’s current value and its past values.
>An autocorrelation of +1 represents a perfect positive correlation, while an autocorrelation of −1 represents a perfect negative correlation.
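As a quick illustration (a toy series, not this article's dataset), pandas can compute the lag-k autocorrelation directly with `Series.autocorr`:

```python
import numpy as np
import pandas as pd

# toy monthly-style series with a 12-step cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(120)
s = pd.Series(np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(t.size))

print(s.autocorr(lag=12))  # strongly positive: the series repeats every 12 steps
print(s.autocorr(lag=6))   # strongly negative: half a cycle out of phase
```

The signs match the intuition above: at a lag equal to the cycle length the series lines up with itself, and at half the cycle length it is mirrored.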
Why is it useful?
- Helps us uncover hidden patterns in the data and select an appropriate forecasting method.
- Helps identify seasonality in our time series data.
- Analyzing the autocorrelation function (ACF) and the partial autocorrelation function (PACF) together is necessary for selecting an appropriate ARIMA model for time series forecasting.
Does the ACF make any assumptions?
Weak stationarity, meaning no systematic change in the mean or variance and no systematic fluctuations.
So before computing the ACF it is advisable to remove any trend present in the data and to make sure the series is stationary.
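First-differencing (which `diff()` does later in this post) is one common way to remove a trend. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# series with a linear upward trend: its mean changes systematically,
# so it is not stationary
rng = np.random.default_rng(2)
y = pd.Series(np.arange(200, dtype=float) + rng.standard_normal(200))

# first difference: y_t - y_{t-1}
diffed = y.diff().dropna()

# the original halves have very different means; the differenced
# series hovers around a constant mean (roughly the slope, 1)
print(y.iloc[:100].mean(), y.iloc[100:].mean())
print(diffed.iloc[:100].mean(), diffed.iloc[100:].mean())
```

After differencing, the mean no longer drifts with time, which is the kind of stationarity the ACF assumes.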
Want to try it out on a real dataset?
import pandas as pd
import matplotlib.pyplot as plt

def parser(x):
    # custom date parser for the file's timestamp format; adjust as needed
    return pd.to_datetime(x)

data = pd.read_csv('data.csv',
                   engine='python',
                   parse_dates=[0],
                   index_col='Time',
                   date_parser=parser)

# keep only observations from 2008 onwards
st_date = pd.to_datetime("2008-01-01")
data = data[st_date:]
The plot of the data looks like this:
Now, before computing the ACF, let's remove the trend and see how the series looks:
# ACF prep: remove the trend by first-differencing the value column
data["diff"] = data.iloc[:, 0].diff()

ax = data.plot()
ax.legend(ncol=5,
          loc='upper center',
          bbox_to_anchor=(0.5, 1.0),
          bbox_transform=plt.gcf().transFigure)

# mark each year boundary for reference
for yr in range(2008, 2018):
    ax.axvline(pd.to_datetime(str(yr) + "-01-01"), color="red", linestyle="--", alpha=0.2)
Now let’s apply the ACF:
from statsmodels.graphics.tsaplots import plot_acf

# differencing leaves a NaN in the first row; replace it with 0
data.loc[data.index[0], "diff"] = 0

plot_acf(data["diff"])
plt.show()
Can you see the seasonality present?
Notice how the coefficient is high at lags 3, 6, 9, and 12. In monthly terms, March, June, September, and December show high positive correlations, while January, February, and April show negative correlations that fade at larger lags. We focus on the points that lie outside the blue shaded region, as those are statistically significant.
Important note: make sure your data contains no NA values, otherwise the ACF computation will fail.
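Differencing is one way such NAs sneak in: `diff()` always leaves a NaN in the first position, which is why the snippet above sets it to 0 (dropping it works just as well). A tiny illustration:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0])
d = s.diff()
print(d.isna().sum())        # 1 (the first element has no predecessor)
print(d.dropna().tolist())   # [2.0, -1.0, 2.0]
print(d.fillna(0).tolist())  # [0.0, 2.0, -1.0, 2.0]
```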
Can we look at the trend and seasonality separately to dive deep into the data?
Yes, let's decompose the data. I am going to use the statsmodels API for this, but one can use NumPy and pandas as well to separate the three parts of a time series: trend, seasonality, and residual.
from statsmodels.tsa.seasonal import seasonal_decompose

# decompose the original value column (not the diff column)
# into trend + seasonal + residual components
res = seasonal_decompose(data.iloc[:, 0], model="additive", period=30)

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(15, 8))
res.trend.plot(ax=ax1, ylabel="trend")
res.seasonal.plot(ax=ax2, ylabel="seasonality")
res.resid.plot(ax=ax3, ylabel="residual")
plt.show()
Notice how I chose additive instead of multiplicative, since the amplitude of the seasonal swings does not grow over time.
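When the seasonal amplitude does grow with the level, a common trick is to take logarithms first: a multiplicative structure (trend × seasonal) becomes additive, log(trend) + log(seasonal). A toy sketch:

```python
import numpy as np

# multiplicative toy series: the seasonal swing scales with the trend level
t = np.arange(1.0, 121.0)
y = t * (1.0 + 0.3 * np.sin(2 * np.pi * t / 12))

# on the log scale the seasonal part has constant amplitude
seasonal_log = np.log(y) - np.log(t)

# the raw swing grows over time, but the log-scale swing stays the same
print(y[:24].std(), y[-24:].std())
print(seasonal_log[:24].std(), seasonal_log[-24:].std())
```

If a log transform (or a similar variance-stabilizing transform) flattens the amplitude like this, the additive model becomes appropriate again.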
Now, if we run the same ACF plot on the res.seasonal component returned by the API, we get the same coefficients as before.
I hope this is helpful. Time series analysis can be confusing and time-consuming, so it's imperative to have the fundamental concepts clear. I myself am still learning, so before you go, do leave a comment or your valuable feedback. :)