Time Series in Nature: Predicting the Water Level of a Wetland

Jade Adams
5 min read · Dec 8, 2021


The UNESCO Heritage Upo Wetlands, the subject of the data science paper.

Predicting water levels has become a key focus for policymakers in the 21st century as climates grow more unstable and demand for water rises. Governments face difficult tradeoffs balancing the agricultural, environmental, industrial, and municipal uses of water, and data science is playing an increasing role in the sector. Here I review a paper that surveys some of the latest techniques and finds an ingenious way to predict water levels in the Upo wetland, one of the largest wetlands in South Korea and a UNESCO-heritage protected site. Four researchers wrote a very readable paper on applying advanced machine learning methods to model the short-term water level of a wetland using exogenous variables; the full paper is linked at the bottom of this article. Below, I summarize the clever approaches to time series modelling they used.

The Data

The study’s data comes from a sensor installed in 2009, which recorded water levels every 10–15 minutes from 2009 to 2015. The readings were aggregated to a daily average, which served as the endogenous (dependent) variable.
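The aggregation step is straightforward in pandas. Below is a minimal sketch using synthetic stand-in readings (the column name and values are illustrative, not the study's actual data):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the sensor: one reading every 15 minutes for 10 days.
rng = pd.date_range("2009-01-01", periods=4 * 24 * 10, freq="15min")
readings = pd.DataFrame(
    {"water_level": np.random.default_rng(0).normal(3.0, 0.1, len(rng))},
    index=rng,
)

# Collapse the sub-hourly readings into a daily average.
daily = readings.resample("D").mean()
print(daily.head())
```

Resampling to a daily mean smooths out sub-daily noise and puts the target on the same cadence as the daily weather variables.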

The 8 exogenous (independent) variables were the daily average, maximum, and minimum temperatures; daily precipitation; daily average and maximum wind speed; and the water levels of the two connected bodies of water, the Shindang drainage pump and the Mokpo embankment, depicted in the figure below.

To predict one to three days in advance, each variable was lagged by one, two, and three days, giving three lagged versions of each of the 8 exogenous variables — 24 predictor variables in total.
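Building lagged predictors like this is a one-liner per lag with `shift`. A minimal sketch with two illustrative columns standing in for the eight exogenous variables (names are mine, not the paper's):

```python
import pandas as pd

# Toy frame with two of the eight exogenous variables (names are illustrative).
df = pd.DataFrame({
    "precip": [0.0, 5.2, 1.1, 0.0, 12.4, 0.3],
    "mokpo_level": [2.1, 2.3, 2.6, 2.5, 3.0, 2.9],
})

# For each variable, build lags of 1, 2, and 3 days.
# With all 8 variables this yields 8 x 3 = 24 predictors.
lagged = pd.concat(
    {f"{col}_lag{k}": df[col].shift(k) for col in df.columns for k in (1, 2, 3)},
    axis=1,
)
lagged = lagged.dropna()  # the first 3 rows lack a complete set of lags
print(lagged.shape)
```

Dropping the first few rows is the usual price of lagging: a row can only be used once all of its lagged values exist.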

Evaluation Metrics

They used 4 evaluation statistics to compare the machine learning methods: the correlation coefficient (CC), the Nash–Sutcliffe efficiency (NSE), the root mean square error (RMSE), and the persistence index. Each has different strengths. The formulas are pictured below.

The evaluation metrics applied to all tested models.

The Correlation Coefficient measures the similarity of relative movement: the strength of the linear relationship between actual and predicted values on a scale of -1 to +1. The NSE compares the variance of the prediction errors to the variance of the observed values: values between 0 and 1 indicate an increasingly good fit, while negative values mean the model predicts worse than simply using the observed mean. RMSE, exactly as the name suggests, is the square root of the mean of the squared errors, so it reports a typical error in the original units. These measures give a broad sense of model accuracy. The Persistence Index is a unique element of analysis I hadn’t heard of before: it benchmarks the model against a naive “no change” forecast, which makes it sensitive to the phase and amplitude error of a model.
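To make the four statistics concrete, here is a small sketch of each one in NumPy. This follows the standard definitions of these metrics (using yesterday's observation as the naive benchmark for the persistence index), not the authors' exact code:

```python
import numpy as np

def metrics(obs, pred):
    """Return (CC, NSE, RMSE, PI) for observed vs. predicted series."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    cc = np.corrcoef(obs, pred)[0, 1]                # correlation coefficient
    rmse = np.sqrt(np.mean((obs - pred) ** 2))       # root mean square error
    # NSE: 1 minus error variance relative to variance around the observed mean.
    nse = 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    # Persistence Index: same shape as NSE, but the benchmark is the naive
    # "no change" forecast (yesterday's observed value).
    naive = obs[:-1]
    pi = 1 - np.sum((obs[1:] - pred[1:]) ** 2) / np.sum((obs[1:] - naive) ** 2)
    return cc, nse, rmse, pi
```

A positive persistence index means the model beats the naive forecast, which is a surprisingly hard bar for slow-moving series like water levels.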

Of most importance to policymakers is predicting water levels at the extremes, so in addition to the rest of the dataset, the authors compared the models on 4 instances of peak water levels between 2013 and 2014, shown below. Isolating these periods for testing was incredibly useful for comparing model performance in less common circumstances.

Modelling

Because this is a short-term prediction problem — predicting water levels only 1–3 days ahead, with no seasonality to capture — a broad range of machine learning models was available; traditional time series models like ARIMA, SARIMAX, or Exponential Smoothing are not strictly necessary here. They trained and tested an Artificial Neural Network, Decision Trees, Random Forests, and Support Vector Regression. By comparing model scores on the 4 evaluation statistics across both the normal time periods and the 4 peak periods, they were able to holistically choose the model that performed best.

They took the time to tune parameters for each type of model. For the ANN, they scaled inputs to the range 0 to 1 for best performance and searched for the best number of nodes. For the Decision Trees, they did their own ‘pruning’ (lowering the number of branches in order to reduce overfitting) by running cross-validation with a complexity cost function and comparing the cross-validated scores on complexity cost and RMSE. For their Random Forest, they chose the top 8 predictive variables and grew 500 trees.
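Cross-validated pruning with a complexity cost is available out of the box in scikit-learn via `ccp_alpha`. This is a sketch of the general technique on synthetic data, not the authors' exact procedure:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the 8-variable wetland dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# Candidate complexity-cost values come from the full tree's pruning path;
# each alpha corresponds to a progressively more pruned tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a pruned tree at each candidate alpha (grid thinned for speed)
# and keep the alpha with the best mean CV score.
scores = {
    alpha: cross_val_score(
        DecisionTreeRegressor(ccp_alpha=alpha, random_state=0), X, y,
        cv=5, scoring="neg_root_mean_squared_error",
    ).mean()
    for alpha in path.ccp_alphas[::10]
}
best_alpha = max(scores, key=scores.get)
```

Larger `ccp_alpha` values prune more aggressively; cross-validation picks the tradeoff between tree size and out-of-sample RMSE, exactly the comparison the authors describe.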

Lastly, for Support Vector Regression they used a radial basis function kernel and tuned the corresponding parameters. In simple terms, Support Vector Regression is a machine learning method that finds a function in a (possibly high-dimensional) feature space that fits the data to within a tolerance margin, penalizing only the points that fall outside it; the kernel lets it capture nonlinear relationships. Sklearn has an SVR class for those interested in playing around with it in regression. I also found this article with a valuable tutorial for those who want to look into it more.
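A minimal example of sklearn's SVR class with an RBF kernel, on a toy one-dimensional problem (the data and parameter values are illustrative, not the paper's):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy regression problem: noisy sine wave.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# RBF-kernel SVR; C controls the penalty for points outside the margin,
# and epsilon sets the width of the no-penalty tube around the function.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```

Scaling the inputs first matters: the RBF kernel measures distances between points, so unscaled features on different units would dominate the kernel.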

In the end, the random forest model was the best. It was actually the only model with a positive Persistence Index; however, all the models performed acceptably on the rest of the evaluation statistics. They found the most important variables for prediction were the previous day’s precipitation, the previous day’s water level at Mokpo (depicted in the map at the top, in case you forgot), and the water level at Mokpo three days earlier.

Conclusion

This was an interesting study with real relevance to my own work on time series, specifically its use of multiple evaluation statistics that look beyond the error terms. The Persistence Index in particular, since it tracks phase and amplitude, is strongly suited for testing time series models on seasonal data.

For some of my own work on time series, please visit my GitHub project (link here) to see my latest work on the benefits of time series machine learning for California reservoirs.

Research Link

https://www.mdpi.com/2073-4441/12/1/93/pdf


Jade Adams

She/Her. UCLA graduate and data scientist based in NYC. Passionate about social science research and all things trans and queer.