Predicting Annual Water Usage in Baltimore using ARIMA
Time Series Forecasting Walkthrough
This is a project walkthrough from the book Introduction to Time Series Forecasting with Python by Jason Brownlee. He does a great job introducing the basic topics in univariate time series and does a walkthrough at the end summarizing everything we learned.
The Annual Water Usage in Baltimore dataset has 79 data points, each representing yearly water usage in Baltimore from 1885 to 1963. The first thing we should do is set some data points aside for testing our model. Ideally we want to split the data into three sets: training, development (validation), and test. This is the methodology that Andrew Ng uses, and we will follow it.
We will use 35 data points for the training set, 34 for the development set, and 10 for the test set. This is important because we want to simulate a real production scenario where we do not know what the new data (the test set) will look like.
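The split above can be sketched as follows. A synthetic series stands in for the real data here (the trend and noise parameters are made up; the real values would be loaded from the dataset's CSV):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 79 annual observations, indexed by year.
years = pd.RangeIndex(1885, 1964)  # 1885-1963 inclusive
rng = np.random.default_rng(0)
series = pd.Series(400 + 3 * np.arange(79) + rng.normal(0, 20, 79), index=years)

# Chronological split: 35 train, 34 development, 10 test.
# Never shuffle a time series -- the temporal order is the whole point.
train, dev, test = series.iloc[:35], series.iloc[35:69], series.iloc[69:]
print(len(train), len(dev), len(test))  # 35 34 10
```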
This line graph shows that there is an upward trend. Seasonality is hard to tell from this graph; there does not seem to be a pattern that repeats at regular intervals. There are a few anomalies, such as large spikes around 1890, 1912, and 1941. There is also a dip around 1927 and a possible one after 1950.
The distribution of the water usage looks almost like a bell curve (normally distributed), with an extra lump around 575–650. This might indicate that a transformation is needed to make it more normally distributed.
The box plot shows the distribution of water usage grouped by decade (the first box plot shows data from 1885–1895, the second from 1895–1905, and so on). Based on this graph, we can see that the median water usage (the green line in each box) trends upward, but not linearly. Also, the variability of the usage (the height of the boxes) is high during the 1915 and 1945 decades. There are some outliers (the empty circles) in the first two decades. Last but not least, the 1935 decade seems to have lower average consumption than expected; it does not follow the upward trend of the previous decades.
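The decade grouping behind that box plot can be sketched with a pandas `groupby`. This sketch buckets years into calendar decades (a slightly different bucketing than the 1885-aligned spans above) and uses the same synthetic stand-in series rather than the real data:

```python
import numpy as np
import pandas as pd

# Synthetic annual series indexed by year (values are made up for illustration).
years = pd.RangeIndex(1885, 1964)
rng = np.random.default_rng(0)
series = pd.Series(400 + 3 * np.arange(79) + rng.normal(0, 20, 79), index=years)

# Bucket each year into its calendar decade; each group would feed one box
# in the box plot (the green line in a box is the group's median).
decade = (series.index // 10) * 10
medians = series.groupby(decade).median()
print(medians)
```

Passing the grouped values to `matplotlib`'s `boxplot` (one array per decade) reproduces the figure described above.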
Augmented Dickey-Fuller Statistic
Before modeling we need to make sure the time series is stationary. One way to check for stationarity is to apply an Augmented Dickey-Fuller (ADF) test. This is a hypothesis test where the null hypothesis states that the time series is not stationary and the alternative hypothesis states that it is stationary.
Running the ADF test on the time series gives us the result below:
The ADF statistic is ~ -2.267, which is larger than the 5% critical value of -2.906. This means we fail to reject the null hypothesis, so we cannot conclude that the time series is stationary, and we need to apply differencing.
Differencing is when we subtract the previous value from each value. This is a useful technique because it removes some of the trend and seasonality from the time series.
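On a tiny made-up series, first-order differencing looks like this:

```python
import numpy as np

values = np.array([10.0, 12.0, 15.0, 14.0, 18.0])

# First-order difference: value[t] - value[t-1].
# The result is one element shorter than the input.
diff = np.diff(values)
print(diff)  # [ 2.  3. -1.  4.]
```

With a pandas Series, `series.diff().dropna()` does the same thing while keeping the time index.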
Just for fun, we can run another ADF test after differencing. Our new ADF statistic (after differencing) is -6.127, which is well below the 5% critical value and even below the 1% critical value of -3.534. We can now reject the null hypothesis, which means the differenced time series is stationary.
The next analysis to do is to run an Autocorrelation Function (ACF) and a Partial Autocorrelation Function (PACF) on the time series. The ACF is used to check how strong the relationship is between a value and its values at previous time steps. The ACF plot shows the relationship between a value and its values from the previous 20 lags (time steps). The blue highlight shows the 95% confidence interval; lags whose values fall outside of it are significant. In this case there are 4 significant lags, excluding lag 0. Lag 0 is the value's correlation with itself, so of course that relationship is very strong. The next plot is the PACF. This plot summarizes the relationship between a value and its lagged values with the effects of the intervening lags removed.
Now that we have done our data analysis, we can start modeling. The most popular model for time series is ARIMA, an acronym for AutoRegressive Integrated Moving Average. Running this model is very simple. The main parameters in ARIMA are p, d, and q: p is the number of lag observations (the autoregressive order), d is the number of times the data is differenced, and q is the size of the moving average window.
How do we know what those values should be? Well, we already did the work. The analysis above determines the values of p, d, and q: d comes from the ADF statistic and the differencing we applied, while p and q can be read from the correlation plots (conventionally, the PACF plot suggests p and the ACF plot suggests q). Based on our analysis above, we start with p=4, d=1, q=1. These are not the final values of p, d, q, but they give us a good place to start. We plug those values into the ARIMA model and apply a walk-forward method to forecast the next value.
And that is just the baseline ARIMA. There are a few more things we can do to improve the model, such as power transforming the time series, applying a grid search to find the optimal ARIMA parameters (p, d, q), and applying a bias correction to our forecast. The code on GitHub takes us through these steps.