Forecasting Hard Coral Coverage with ARIMA
What’s the future of the world’s coral reefs?
In February 2020, scientists at the University of Hawaii at Manoa released a study addressing this very question. The models they developed forecast a 70–90% worldwide loss of coral by 2040. Even more alarming, they projected that “few to zero suitable coral habitats will remain” by the year 2100.
So, the future of coral doesn’t look great.
Interested in seeing these numbers firsthand, today we will develop our own forecasts of hard coral cover in the Caribbean. After restructuring the data, we’ll fit one of the best-known time series models: ARIMA. ARIMA is popular because it is simple and built specifically for time series data. Once we have a working model, we’ll develop a forecast.
Let’s jump right in.
Why do you need to aggregate?
We’ll start by taking a look at our raw data. In this case, we are going to be building a univariate model, so we’re only concerned with hard coral percent cover (i.e. the estimated percentage of hard coral on the sea floor).
In the figure to the left, we have plotted the daily average of hard coral over time (blue). The y-axis shows the percent cover and the x-axis shows the date, ranging from 1997–2019. We also plotted a weighted linear regression line (red) to depict the overall trend.
Ok, seems straightforward.
But if we try to interpret these data, we see a blue mess with a negative trend line; there appears to be little systematic movement. Moreover, according to the regression line, hard coral only decreased by around 4% over the past 22 years. That’s pretty hard to believe. So, as creative and skilled data scientists, let’s try some manipulations and see if we can develop a clearer picture.
First, it’s important to know how the data are organized. Unlike most time series datasets, these data were sampled at different locations around the Caribbean with few subsequent draws at the same site. Moreover, they are not equally sampled over time. So, to account for the above points, let’s aggregate the data by time.
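The aggregation step can be sketched with pandas resampling. This is a minimal illustration rather than the exact pipeline used here; the column names (`date`, `pct_cover`) and the toy values are assumptions.

```python
# A minimal sketch of aggregating irregularly sampled survey data by time.
# The "date" and "pct_cover" column names are hypothetical stand-ins for
# the real Reef Check fields.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["1997-06-01", "1997-06-15", "1998-03-10", "1998-09-20"]),
    "pct_cover": [28.0, 26.0, 25.0, 24.0],
})

series = df.set_index("date")["pct_cover"]
monthly = series.resample("M").mean().dropna()    # monthly average
quarterly = series.resample("Q").mean().dropna()  # quarterly average
annual = series.resample("Y").mean().dropna()     # annual average
print(annual)
```

The `dropna()` calls discard the months or quarters with no surveys at all, which is what makes the unequal sampling over time workable.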
In the above figures, you can see the effect of averaging the data by monthly, quarterly, and yearly timeframes, respectively from left to right. As you “zoom out,” you reduce the variability in the data. Encompassed in this variability is both signal (good) and noise (bad). So, while the figure on the right shows the clearest trend, we’ve probably thrown out a lot of useful data.
To cover both ends of this tradeoff, we will fit our models to both the monthly and annually aggregated data.
Great, aggregation is done. On to our next manipulation: differencing.
Why do you need to difference?
As a practical matter, most time series models assume something called stationarity. In short, strict stationarity means that each data point is pulled from the same theoretical probability distribution. But, because we cannot know what this population distribution looks like, we assume weak stationarity and develop proxies for consistency in the data, namely a constant mean, variance, and covariance over time.
If you recall the aggregation plots above, we saw a trend, indicating the mean is not constant over time. Furthermore, while the variance is harder to eyeball, there appears to be less spread in the data from 2006 to 2012. A Dickey-Fuller unit root test confirmed these observations: the monthly and annually aggregated datasets are not stationary.
So, to make the data usable for the ARIMA model, we will perform differencing, which simply involves subtracting the prior value from each value.
As you can see in the plots above, the y-axis values change completely from plot to plot. The leftmost plot, our un-differenced data, shows percent cover ranging from 12% to 28%. In the middle plot we are now working with the first difference, which shows the year-over-year change ranging from −6% to +6%. The rightmost plot shows the second difference (the change in the year-over-year change), with a y-axis range roughly double that of our first-differenced plot.
Not only does the y-axis change but the red trend line appears to flatten and the variance becomes more consistent over time. Lovely. This is what we wanted.
To double check, we again use the unit root test and find that annual data with a difference of 2 passes the stationarity test.
So, after performing similar steps for the monthly data, our datasets are now good to go. Ready to model?
Tuning the ARIMA Model
ARIMA is a three-part model that combines autoregressive (AR), integrated (I), and moving average (MA) components. First, the AR component looks back at prior values in our data and uses them to fit the current value. Second, the I component simply means the data are differenced, the same concept we discussed above. And third, the MA component looks back at prior errors in our fit and uses them to predict the current value.
It’s not necessary to understand exactly how this works, but if you’re a knowledge-loving person, this YouTube playlist does a great job of explaining ARIMA models (it’s also probably the best YouTube tutorial I’ve ever seen).
In its most basic form, ARIMA has three tuning parameters:
- p: how many prior values we use to fit the current value (i.e. for annual data, how many prior years of data should have an impact on the current year’s value).
- d: how many times we difference.
- q: how many prior errors we use to fit the current value.
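For readers who want the notation, an ARIMA(p, d, q) model can be written with the backshift operator B (where B y_t = y_{t-1}) as:

```latex
% Difference the observed series y_t a total of d times to get w_t,
% then fit an ARMA(p, q) model to the differenced series.
w_t = (1 - B)^d \, y_t
w_t = c + \sum_{i=1}^{p} \phi_i \, w_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

Here the φ’s are the AR coefficients, the θ’s are the MA coefficients, and ε_t is white noise.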
Note that we’ve already found d, so we just need to find p and q. To do this, we will use autocorrelation plots, which are shown below in Figures 8–9.
But why are there two plots? I thought there was only one variable: hard coral. That’s an outstanding point. If you’re a visual learner, check this out.
Either way, here’s a short explanation. The AutoCorrelation Function (ACF) plot on the left shows the correlation between hard coral now and hard coral at prior time periods, in this case years. The x-axis shows the number of years back we’re looking, and the y-axis shows the correlation between current hard coral and hard coral at the lagged time. The Partial AutoCorrelation Function (PACF) plot is very similar; however, it also adjusts for the correlations of the values between the current time period and our lag by holding them constant. That’s why PACF is partial: it removes the intermediate correlations, whereas ACF shows them all.
In practice, we use the PACF plot to determine how many prior values matter for the AutoRegressive (AR) component, and the ACF plot to determine the Moving Average (MA) component of the model. These values are called orders.
As you can see, the ACF plot has significant lags at 0 and 2, as indicated by a correlation greater than our significance threshold (blue dotted line). Note that a lag of 0 is simply the same time period, so a correlation of 1.0 makes sense. Applying the same process to the PACF plot, we can see a significant lag at 2. This leaves us with p=2 (from the PACF) and q=2 (from the ACF). So, our annual ARIMA(p, d, q) would take the form ARIMA(2,2,2).
Wow, that was a lot of setup, but now we’re well-equipped for fitting the data.
Training ARIMA
Here we will be looking at 3 different models, namely:
- ARIMA(2,2,2) with annual aggregation (developed above).
- ARIMA(3,1,1) with monthly aggregation.
- ARIMA(6,1,1) with monthly aggregation.
To evaluate each model’s performance, we will use the squared correlation between the model’s fitted values and the true values; the r-squared closest to 1.0 wins.
Monthly Aggregation
Our ARIMA(3,1,1) model doesn’t perform great, exhibiting an r-squared value of 0.156. We then run our ARIMA(6,1,1) model and get a very similar r-squared of 0.170. Unfortunately, it doesn’t look like we’re doing a great job of fitting.
As shown above in Figures 10–11, our poor fit is reflected by the lack of systematic trend between our fitted values (x-axis) and true values (y-axis). It also seems like our range of fitted values is far smaller than our true range; we’re off by about 15 percentage points at the high and low ends. So, not only is the fit bad, the scaling of the fit is bad as well.
Annual Aggregation
Luckily, annual aggregation comes in to save the day with an r-squared of 0.689, indicating a very good fit (given the data).
As shown in the plot to the left, we have the true values in blue as compared to the fitted values in yellow. The fit looks pretty good, although it seems like the model is capturing the trend a little too late. Moreover, we don’t see the extreme scaling difference observed in the monthly models; instead, the fitted values are more extreme than the true data, as shown in 2003 and 2014. That being said, this appears to be a reasonable fit.
So, why did annual aggregation perform so much better? Well, it appears that our annual aggregation was able to smooth out the noise, allowing the model to accurately decipher trends. That being said, you can’t help but wonder what information was also thrown out by averaging.
In hopes of improving fits for both the monthly and annual models, the following variations were tested:
- Adding a seasonal component with orders (3,0,1) and (6,0,1). Note this was only done for monthly data, but worsened the fit in both cases.
- Fitting with a predictor dummy variable: [1, 2, …, n-1, n]. This helped the fit significantly.
- Testing different p, d, q values (+/- 1), which worsened the fit.
Ok, fairly confident that we’ve maximized our training fit, let’s move on to forecasting.
Forecasting With ARIMA
Here we are going to predict the next 5 years of hard coral coverage using our annual ARIMA(2,2,2) model with the above dummy predictor.
As indicated by the light blue trend line (in the dark shaded region), the forecasted trend in percent cover is negative but not precipitous; the predicted value for 2025 is 13.46% cover. Moreover, the error bands are so large that it’s impossible to make precise conclusions.
That being said, the error bands do provide interpretive power. The dark blue, dark grey, and light grey bands correspond to a 67%, 90%, and 95% confidence interval respectively. This means, for instance, that we are 67% confident that 5 years from now our percent cover will be between 5.99 and 20.95.
To further interpret these confidence intervals: according to our model’s 95% confidence interval, we are 97.5% certain that we will not lose all hard coral in the Caribbean over the next 4 years. However, note that on the 5th year, this statement does not hold. Now looking at the upper side of our 95% confidence interval, we are 97.5% certain that in the next 5 years we will not surpass 29% cover, roughly corresponding to our 22-year high observed in 2003.
Well that was interesting, but a bit underwhelming. To produce a more precise forecast, we’ll have to try other techniques (hint: there will be a part 2).
Conclusion
Three major takeaways from today’s analysis:
- The Reef Check measures of hard coral do not provide much autoregressive signal until they are aggregated annually. That being said, we are exploring other aggregation methods, for instance our prior post.
- Using univariate ARIMA, we can get a training r-squared of around 0.7 which is pretty good for such noisy data — just think about all the factors that can impact hard coral.
- However, our forecasts have extremely wide prediction intervals, meaning there’s a lot of uncertainty.
In following posts, we will add predictor variables to our time series model. We will also try other models to see if they produce better results. If you have ideas on where to go, leave a comment or reach out here.
Sources
- Cryer, J. D., & Chan, K. (2011). Time series analysis with applications in R. New York: Springer.
- Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and practice. Melbourne: OTexts.
- Warming, acidic oceans may nearly eliminate coral reef habitats by 2100. (n.d.). Retrieved September 02, 2020, from https://news.agu.org/press-release/warming-acidic-oceans-may-nearly-eliminate-coral-reef-habitats-by-2100/
The data were collected by Reef Check, a coral conservation non-profit that trains volunteer divers to collect marine data. With 1576 unique entries for the Caribbean ranging from 1997–05–24 to 2019–08–24, there were plenty of data points for a time series analysis. However, sampling varies greatly across both location and time period. Aggregating over time addresses the temporal variation, but the differences in location still posed analysis problems. We largely ignored these; an analysis of whether sampling location has a significant impact is required before drawing firm conclusions.
Here is the code.
Note: these are my findings. If you would like to contact me, leave a message here. All criticisms are welcome.