Can Machine Learning Be Used in Long-Term Reservoir Level Prediction?

Jade Adams
5 min read · Feb 13, 2022

My capstone project on time series prediction at major California reservoirs demonstrates potential opportunities for water policy-makers.

The reservoir I grew up by, Folsom Lake, has faced major instability in our era of Climate Crisis.

It’s been over a month since I presented my capstone and graduated from the Flatiron School’s full-time Data Science Immersive Program. I’m proud of my capstone and I learned a lot on the way. To see the notebook in full, I’ve linked it here in my GitHub.

I grew up in California by a key reservoir, Folsom Lake, understanding the importance of protecting our water resources. As our climate has destabilized, California faces increasingly intense droughts and inconsistent rainfall. Managing our water is key to the survival of California's economy and ecosystems. (Although it's important to highlight how reservoirs can ironically harm long-term water availability; see the latest research and the current efforts by the indigenous Klamath nation in Northern California to dismantle a major dam and restore their river and land.)

My Capstone Idea

I researched past machine learning work on water levels and found that it is commonly used for predicting and preventing flood-driven dam failures (such as the 2017 Oroville Dam spillway crisis), but that long-term prediction research is scarce. For my capstone project, I sought to model reservoir levels across California a year in advance. My goal was to see whether machine learning could pick up on the seasonality and ebb and flow of reservoirs. I also wanted to incorporate exogenous variables like precipitation and temperature to gauge their impact.

Using time series methods and a Seasonal Auto-Regressive Integrated Moving-Average (SARIMA) model, I was able to fit a model to Folsom Lake that was over 90% accurate. I expanded the model to other reservoirs in California and got mixed results from reservoir to reservoir. From this, I concluded that human operation of reservoir outflow can be a major factor in reservoir storage. In the future, I want to model inflow instead, as it shows much clearer seasonality.

The Data

By capturing snowpack and rainfall data for each watershed, I was able to group the reservoirs by watershed in my modeling.

I used reservoir level data from the US Bureau of Reclamation's Reclamation Information Sharing Environment (RISE) website, along with evaporation and precipitation data at each lake, which the site also provides. I also pulled precipitation data for the Department of Water Resources' North and South Sierra regions from the California Data Exchange Center website. With all this data, I built a time series from 1990 through 2020, with exogenous variables and one endogenous variable: reservoir level. One important note about water data: Water Years run from October to September. See the end of this post for links to the data sources.
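Because the October–September Water Year straddles two calendar years, time series tooling needs a small adjustment when grouping by year. As a rough sketch (with a made-up daily series standing in for the real RISE data), labeling each observation with its water year in pandas might look like this:

```python
import pandas as pd

# Hypothetical daily storage series; the real data came from the USBR RISE
# API and the California Data Exchange Center.
dates = pd.date_range("1990-10-01", "2020-09-30", freq="D")
levels = pd.Series(range(len(dates)), index=dates, name="storage_af")

# A water year runs October-September and is named for the calendar year it
# ends in, so October-December observations belong to the NEXT year.
water_year = dates.year + (dates.month >= 10).astype(int)
df = pd.DataFrame({"storage_af": levels.values,
                   "water_year": water_year}, index=dates)
```

With this column in place, `df.groupby("water_year")` aggregates correctly across the October boundary instead of splitting each wet season in half.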

Modeling

Using the pmdarima, statsmodels, and scikit-learn libraries, I tested a variety of time-series machine-learning models on the Folsom Lake data. I first ran a SARIMA model without any of the exogenous variables. After grid searching for the best parameters, the fitted model came within 91% of the mean level of the 2021 test water year and matched 92% of its ebb and flow. The SARIMA model ended up as my final model for Folsom Lake.

I tried adding exogenous variables to the SARIMA model, making it a SARIMAX model (X for exogenous), but grid searching proved too computationally intensive. There is a lack of public libraries that provide such capability; I would have had to create my own classes and methods. (A side note: if anyone knows of libraries for optimizing time series models with exogenous variables, please let me know!)

I expanded my model to the other major reservoirs of the Sacramento and San Joaquin watersheds and compared the predicted mean level and ebb and flow to the actuals, seen below. For each reservoir, I used individually grid-searched SARIMA parameters.

Conclusion & Lessons Learned

For most of the other reservoirs, the scores were lower than at Folsom Lake. This probably has to do with variance in outflow operation, which water policymakers decide daily, and in annual rainfall. In conclusion, I assess that time series models can be very useful; however, modeling inflows would probably be more accurate than modeling water levels, given how outflow policies change over time.

This was also my first major solo project where, instead of starting with a dataset and then asking questions, I asked questions and then found the data needed to answer them. I learned the challenges of getting the right data and framing the questions correctly. I had to go back and re-query API data multiple times because my questions changed, so I learned it's important to pull a broad dataset, keeping in mind that your goals may change, and then narrow it down in your notebook.

In particular, as the project went along, I realized how dependent reservoir levels can be on human policy decisions. In the future, I want to model inflow. I also want to incorporate long-term climate data, such as whether it's an El Niño year, a key predictor of rainfall in the state. I would also be curious to see the accuracy of modeling short-term water level changes in response to storms, like a paper I recently reviewed. Year-out modeling, while good at predicting the seasonal ups and downs of a reservoir, can struggle to capture the actual water level.

All in all, this project opened up many more questions about how machine learning can be used by water policymakers.

Data Sources

California Data Exchange Center — http://cdec4gov.water.ca.gov/dynamicapp/QueryWY

US Bureau of Reclamation — https://data.usbr.gov/catalog/2304
