An Example of Forecasting Water Levels Using an LSTM (Case study in the Everglades)

Gerald A. Corzo
Published in Hydroinformatics
Mar 19, 2024 · 14 min read


A practical forecasting exercise is a useful way to get started with Long Short-Term Memory (LSTM) networks. A well-known problem in hydrology is the forecasting of water levels. In the Everglades, this information has many interesting uses, and public data is available from the Everglades Depth Estimation Network (EDEN). The objective of this exercise is to develop a model that predicts future water levels at a monitoring station within a wetland ecosystem.

The beauty of deep learning, particularly in the context of LSTM networks, lies in its capability to learn and discern patterns within time series data autonomously, bypassing the need for extensive feature engineering. A practical approach is the following:

  1. Downloading the data: To initiate your project, access the EDEN network’s repository (here). For this example, we focus on the “3A11” station from the Everglades Depth Estimation Network (EDEN). Download the rainfall and water level files and place them in the same folder as your code.
  2. Preparing Your Dataset: The journey from raw data to actionable insights involves several key steps:
  • Dividing the Data: Categorise your dataset into training, validation, and testing segments. This crucial step ensures your model learns effectively, validates its learning, and is tested on unseen data.
  • Scaling Your Data: Implement data normalisation to scale the features, a preparatory step that significantly impacts the model’s performance by ensuring all input features contribute equally to the learning process.
  • Preparing inputs for TensorFlow: Format your data to be digestible by TensorFlow, adhering to its requirements for model training.

3. Setting Up Your Model: Architect your LSTM model and select optimization parameters. This stage is where you define the structure of your neural network, tailor it to understand the nuances of water level forecasting, and choose the parameters that guide its learning process.

4. Testing and Visualizing Results: Once trained, evaluate your model’s performance on the test set. Visualization plays a pivotal role here, offering insights into the model’s predictions versus actual outcomes and shedding light on areas for improvement.

By the end of this post, you’ll have a clear blueprint for employing LSTM networks in forecasting water levels, empowering you with the knowledge to tackle similar challenges in environmental data analysis.

Step 1: Gathering Data for Water Level Forecasting

To kickstart our journey in forecasting water levels using LSTM networks, the first step involves procuring the necessary data. For this, we turn to the Everglades Depth Estimation Network (EDEN), a treasure trove of environmental data from the Everglades National Park. EDEN provides accessible, detailed time series data from various monitoring stations across the wetland, offering invaluable insights into its aquatic dynamics.

For our analysis, we focus on the “3A11” monitoring station. This particular station stands out due to its comprehensive dataset, free of missing values, making it an ideal candidate for our forecasting model. You can download the data directly from the EDEN website.

Once you’ve downloaded the rainfall and water level datasets, the next step involves loading and preparing this data for analysis. Here’s how you can do it using Python:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Loading the data
rdf = pd.read_csv('eden_rainfall.csv', parse_dates=["date"]) # Rainfall data
wdf = pd.read_csv('eden_waterlevel_feet_NAVD88.csv', parse_dates=["date"]) # Water level data

# Transforming the data for ease of use
RfT = rdf.pivot(index='date', columns='gage', values='rainfall_inches')
WlT = wdf.pivot(index='date', columns='gage', values='water_level_feet_NAVD88')

''' Format of the DataFrames for rainfall and water levels
0 date object
1 gage object
2 rainfall_inches float64
------------------------------------------------
0 date object
1 gage object
2 water_level_feet_NAVD88 float64
'''

In this code snippet, we load the rainfall (rdf) and water level (wdf) data, ensuring that dates are correctly parsed. The data is then transformed using the pivot method to restructure it into a more analysis-friendly format, where each row corresponds to a date and each column to a different monitoring station. This transformation facilitates direct access to the time series data of our station of interest, "3A11".

# Extracting data for the "3A11" monitoring station
rainfall_3A11 = RfT['3A11'] # Rainfall data at "3A11"
water_level_3A11 = WlT['3A11'] # Water level data at "3A11"

By focusing on the “3A11” station, we reduce the problem to a single location. This keeps the more in-depth analysis and the predictions that follow simple and focused.
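
As a quick sanity check, and to confirm the earlier claim that “3A11” has no gaps, it helps to inspect the extracted series before going further. A minimal check could look like this:

# Quick sanity check on the extracted series
print(water_level_3A11.describe())
print("Missing rainfall values:", rainfall_3A11.isna().sum())
print("Missing water level values:", water_level_3A11.isna().sum())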

Visualizing the Challenge: Rainfall vs. Water Levels

Before we start making predictions, it is worth looking at the data. This step not only confirms that the data was loaded and preprocessed properly, it also gives us a first view of the problem we need to solve: to make accurate predictions, we need to understand how rainfall drives water levels.

Below, rainfall and water levels are plotted on the same time axis, with the rainfall axis inverted. This convention is common in hydrology: the rain appears to fall from the top of the chart, the way it falls in reality, which makes it easy to connect rainfall events with the resulting changes in water level.

# Plotting
fig, ax1 = plt.subplots(figsize=(12, 6))

color = 'tab:blue'
ax1.set_xlabel('Date')
ax1.set_ylabel('Rainfall (inches)', color=color)
ax1.plot(rainfall_3A11.index, -rainfall_3A11, color=color) # Inverse plot for rainfall
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx() # Instantiate a second axes that shares the same x-axis
color = 'tab:red'
ax2.set_ylabel('Water Level (feet)', color=color)
ax2.plot(water_level_3A11.index, water_level_3A11, color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout() # To ensure no overlap of y-axis labels
plt.title('Rainfall and Water Level for Gauge 3A11')
plt.show()

[Figure: rainfall (inverted) and water level preview for gauge 3A11]

Step 2: Preparing Your Dataset

Normalisation and Splitting

Now that we have visualised the data, the next important step is to normalise it. Normalisation matters a great deal for LSTM models: it ensures that features with different orders of magnitude do not dominate training. By bringing all features into the same range, we help the model learn more effectively.

We use Min-Max scaling, a widely used method, to rescale the data into a fixed range, usually 0 to 1. This makes training more stable and puts the features on a comparable footing. Using the MinMaxScaler from sklearn.preprocessing, here’s how we do it:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Initialize the scaler
scaler = MinMaxScaler()

# Combine and scale the data
# (Note: for simplicity the scaler is fitted on the full series here;
# a stricter approach fits it on the training split only, to avoid
# leaking information from the validation/test periods.)
combined_data = pd.concat([rainfall_3A11, water_level_3A11], axis=1)
scaled_data = scaler.fit_transform(combined_data)

Scaling is not just about adjusting the data values; it’s about setting a solid foundation for our model to understand and learn from the data efficiently.
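
One practical consequence of scaling: everything the model learns and predicts lives in the 0–1 range, so to report results in feet the transformation must be undone. A minimal sketch, assuming preds_scaled is a hypothetical (n, 1) array of scaled water level predictions:

# Map scaled predictions back to feet ('preds_scaled' is an assumed
# placeholder array of scaled water level values with shape (n, 1)).
# MinMaxScaler.inverse_transform expects all the columns it was fitted
# on, so we pad the rainfall column with zeros and keep only column 1.
padded = np.hstack([np.zeros_like(preds_scaled), preds_scaled])
preds_feet = scaler.inverse_transform(padded)[:, 1]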

Dividing the Data: Train, Validate, Test

A model is only as good as the data it learns from and, importantly, as the unseen data it gets tested on. Thus, splitting our dataset into training, validation, and testing segments is crucial:

  • Training Data (70%): This is the dataset our model will learn from.
  • Validation Data (20%): This dataset helps us tune the model’s hyperparameters and prevent overfitting.
  • Testing Data (10%): Reserved as unseen data, this set is used to evaluate the model’s performance, offering insights into how it might perform in the real world.

Here’s the code snippet for splitting the scaled data:

# Split the scaled data
total_samples = len(scaled_data)
train_end = int(total_samples * 0.7)
validation_end = int(total_samples * 0.9)

train_data = scaled_data[:train_end]
validation_data = scaled_data[train_end:validation_end]
test_data = scaled_data[validation_end:]

Making Sure the Splits Are Consistent: A Statistical Check

Before we start building models, one step deserves our full attention: checking that our datasets are statistically consistent. Why does this matter? Imagine training a model on data that looks very different from the data used for validation and testing. Such differences can prevent the model from converging to a good fit. Worse, it could mean testing the model on conditions outside the range it was trained on, which can lead to unreliable predictions.

To avoid these problems, we verify that the statistical features of the training, validation, and testing sets are similar. A simple but informative analysis focuses on four key metrics: the mean, the maximum, the minimum, and the standard deviation. These measures tell us whether our datasets are aligned or whether the split needs to be changed.

This is the code:


# Convert arrays back to DataFrames for easier manipulation and interpretation
columns = ['Rainfall', 'WaterLevel'] # Adjust as per your actual data columns
train_df = pd.DataFrame(train_data, columns=columns)
validation_df = pd.DataFrame(validation_data, columns=columns)
test_df = pd.DataFrame(test_data, columns=columns)

# Calculate statistical properties for each dataset
stats_train = train_df.describe().loc[['mean', 'std', 'min', 'max']]
stats_validation = validation_df.describe().loc[['mean', 'std', 'min', 'max']]
stats_test = test_df.describe().loc[['mean', 'std', 'min', 'max']]

# Compare the statistics
print("Training Data Stats:\n", stats_train, "\n")
print("Validation Data Stats:\n", stats_validation, "\n")
print("Testing Data Stats:\n", stats_test, "\n")

[Output: comparison of training, validation, and testing basic statistics]

Shaping Data for the LSTM Model

One thing that sets LSTM (Long Short-Term Memory) networks apart is that they need data organised in a specific way. LSTMs operate on sequences, which is exactly what makes them well suited to time series prediction. It also means our data must be reshaped into pairs of inputs and outputs, where each input is a window into the past.

Making Patterns for Prediction

To leverage TensorFlow and LSTM models, we must convert our time series data into sequences. Each sequence will contain a set of features from previous time steps, and the target will be the water level at the next time step. This setup mimics the process of learning from the past to predict the future, a core principle in time series forecasting.

Here’s how we achieve this transformation:

def create_sequences(input_data, n_steps, n_ahead=1):
    X, y = [], []
    for i in range(len(input_data) - n_steps - n_ahead + 1):
        # Define the end index for the input sequence
        end_ix = i + n_steps
        # Input: all features (rainfall and water level) over the window
        seq_x = input_data[i:end_ix, :]
        # Target: the water level (column 1), n_ahead steps after the window
        seq_y = input_data[end_ix + n_ahead - 1, 1]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

In this function:

  • input_data is your dataset, structured so that the last column represents the water level (WL) we aim to forecast.
  • n_steps is the window size, or how many past observations we consider in each sequence.
  • n_ahead allows for flexibility in forecasting more than one step ahead, though we'll focus on predicting just the next step (n_ahead=1).

Applying the Sequence Creation Function

With our function ready, we can now prepare our training, validation, and testing datasets:


n_steps = 20 # Length of the input sequences (e.g., 20 days)

# For forecasting 1 time step ahead
X_train, y_train = create_sequences(train_data, n_steps=n_steps, n_ahead=1)
X_validation, y_validation = create_sequences(validation_data, n_steps=n_steps, n_ahead=1)
X_test, y_test = create_sequences(test_data, n_steps=n_steps, n_ahead=1)

This approach ensures that each input sequence in X_train, X_validation, and X_test is paired with the correct target value in y_train, y_validation, and y_test, respectively. It's a meticulous preparation that sets the stage for our LSTM model to learn the dynamics of water levels influenced by past conditions.
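
A quick way to confirm the sequences were built as intended is to inspect the resulting array shapes; for the training split we would expect (number of samples, 20, 2) for X_train and (number of samples,) for y_train:

# Verify the shapes of the prepared sequences
print("X_train:", X_train.shape) # expected: (samples, 20, 2)
print("y_train:", y_train.shape) # expected: (samples,)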

Step 3: Setting Up the LSTM Model

With the data preprocessed, we now set up the LSTM model, the central step of the forecasting workflow. This is not just a matter of assembling layers; it is also about tuning the model so that it learns from our sequences and makes accurate predictions without overfitting.

Defining the Model Architecture

The architecture of our LSTM model is straightforward yet powerful. It consists of a single LSTM layer followed by a dense layer that consolidates the LSTM outputs into our forecast. Here’s a closer look at the setup:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Define the LSTM model
model = Sequential([
    LSTM(50, activation='relu', input_shape=(20, 2)), # '2' is the number of features per time step
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')

In this configuration:

  • The LSTM layer is designed with 50 units (or neurons), each learning different aspects of the data during the training process. The activation='relu' parameter introduces non-linearity, helping the model capture complex patterns.
  • The input_shape=(20, 2) specifies that each input sequence consists of 20 time steps, each with 2 features, aligning perfectly with our prepared data.
  • The Dense layer at the end is crucial for condensing the LSTM output into a single predictive value — our next time step's water level.
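
Before training, it can be useful to print a summary of the architecture to confirm the layer shapes and the number of trainable parameters:

# Inspect the architecture and parameter counts
model.summary()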

Optimizing and Calibrating the Model

With our model defined, the next step involves training it on our sequences, simultaneously monitoring its performance on the validation set to avoid overfitting. This process is encapsulated in the model.fit function:

history = model.fit(
    X_train, y_train,
    epochs=20,
    validation_data=(X_validation, y_validation)
)

During this training phase, the model iteratively adjusts its weights, seeking to minimize the loss — the difference between its predictions and the actual water levels. By setting epochs=20, we're specifying that this adjustment process should repeat 20 times, providing ample opportunity for the model to learn.

Checking Results

As the model trains, you’ll witness a live report of its progress. Each epoch reveals the model’s current loss on the training data and its performance on the validation set.

[Output: per-epoch training and validation loss during calibration]

The Power of Visualization

Visualising the training and validation loss is one of the simplest and most useful ways to judge how the model behaves during training. The two curves show how well the model is learning from the training data and how well that learning carries over to unseen data:


# Assuming 'history' is the result of your model.fit() call

# Plot the training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Training History')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

In this plot:

  • The Training Loss curve reflects how the model’s prediction error decreases on the training dataset as it learns with each epoch.
  • The Validation Loss curve shows how the model performs on a dataset it hasn’t seen during training, offering insights into its generalization ability.

[Figure: training and validation loss over the calibration epochs]

Interpreting the Graph

A few key observations can be drawn from this visualization:

  • Rapid Decrease in Loss: Initially, both training and validation loss should decrease rapidly, indicating that the model is learning effectively.
  • Convergence of Loss Curves: Ideally, we’d like both curves to converge to a low value, signifying that the model is not only learning well but also generalizing well to unseen data.
  • Divergence of Curves: If the training loss continues to decrease while the validation loss starts to increase, it’s a classic sign of overfitting. The model is memorizing the training data, harming its ability to generalize.

Lessons from the Learning Curve

Visualizing the model’s training history is not merely about confirming that it learns — it’s about understanding how it learns. It guides us in fine-tuning the model, whether by adjusting the number of epochs, experimenting with different architectures, or implementing techniques to combat overfitting, such as dropout or regularization.

This plot will help you visually assess:

  • Overfitting: If the training loss continues to decrease while the validation loss begins to increase, the model may be overfitting to the training data.
  • Underfitting: If both training and validation loss remain high, the model may be underfitting and not learning the underlying patterns in the data well.
  • Learning Rate: A very “jagged” loss graph might suggest the learning rate is too high, whereas a very slow, drawn-out decrease in loss might suggest the learning rate is too low.
  • Optimal Epochs: This helps identify the point of diminishing returns, where the model stops improving significantly with more epochs.

If you notice any issues, like overfitting or underfitting, consider:

  • Adjusting the model’s complexity (adding more layers or units, or perhaps reducing them).
  • Implementing regularization techniques (like Dropout).
  • Tuning hyperparameters (like the learning rate).
  • Experimenting with different optimizers.
  • Enhancing your dataset (more data, feature engineering).

Remember, the goal is to achieve a balance where both the training and validation losses are minimized, and the gap between them is narrow, indicating that the model generalizes well to unseen data.
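
As one concrete illustration of these remedies, here is a minimal sketch (not part of the model trained above) combining a Dropout layer with Keras’s EarlyStopping callback, two common first steps against overfitting:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# A regularized variant of the model (illustrative sketch)
model_reg = Sequential([
    LSTM(50, activation='relu', input_shape=(20, 2)),
    Dropout(0.2), # randomly drop 20% of units during training
    Dense(1)
])
model_reg.compile(optimizer='adam', loss='mse')

# Stop training once the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history_reg = model_reg.fit(
    X_train, y_train,
    epochs=100, # a generous ceiling; EarlyStopping decides the real endpoint
    validation_data=(X_validation, y_validation),
    callbacks=[early_stop]
)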

Step 4: Evaluating Performance

Once we have trained our LSTM model, the next important step is to evaluate how well it performs. The loss and other numeric metrics tell us a lot, but we need to go deeper to really understand the quality of the predictions.

# Loss (MSE) on the held-out test set
test_loss = model.evaluate(X_test, y_test)

Putting Unseen Data to the Test

The real test of our model is how well it predicts water levels from data it has never seen before. This is where the testing set comes in, since it stands in for the real world.
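
A minimal sketch for generating the predictions on each split; the plotting calls later in this section assume these arrays exist:

# Generate predictions for each split (these feed the plots below)
train_pred = model.predict(X_train)
validation_pred = model.predict(X_validation)
test_pred = model.predict(X_test)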

Visualizing Predictions vs. Reality

A comprehensive evaluation of model performance involves comparing its predictions against actual values and scrutinizing the errors made. Such an analysis not only highlights the model’s forecasting accuracy but also uncovers patterns in its errors, guiding future improvements.
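
Alongside the plots, two numeric summaries widely used in hydrology are the root mean squared error (RMSE) and the Nash-Sutcliffe efficiency (NSE). A minimal sketch on the test split (values are in scaled units here):

# RMSE and Nash-Sutcliffe efficiency on the test set (scaled units)
obs = y_test.flatten()
sim = test_pred.flatten()

rmse = np.sqrt(np.mean((obs - sim) ** 2))
nse = 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)
print(f"Test RMSE: {rmse:.4f}, NSE: {nse:.4f}")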

Here’s a specialized function designed to visualise both the predicted and true data, along with the errors.

def plot_predictions_with_error(true_data, predicted_data, title):
    plt.figure(figsize=(10, 12))

    # Flatten arrays to ensure compatibility with matplotlib
    true_data_flatten = true_data.flatten()
    predicted_data_flatten = predicted_data.flatten()

    # Calculate error
    error = true_data_flatten - predicted_data_flatten
    positive_error_std = np.std(error[error >= 0])
    negative_error_std = np.std(error[error < 0])

    # Plot true data vs. predicted data
    ax1 = plt.subplot(2, 1, 1) # 2 rows, 1 column, 1st subplot
    ax1.plot(true_data_flatten, label='True Data')
    ax1.plot(predicted_data_flatten, label='Predicted Data', alpha=0.7)
    ax1.set_title(title)
    ax1.set_ylabel('Value')
    ax1.legend()

    # Second panel: the error series and its spread. (This panel completes
    # the function, which was truncated here; the error statistics computed
    # above were otherwise never shown.)
    ax2 = plt.subplot(2, 1, 2, sharex=ax1)
    ax2.plot(error, label='Error (True - Predicted)', color='tab:green')
    ax2.axhline(positive_error_std, color='gray', linestyle='--', label='+/- error std')
    ax2.axhline(-negative_error_std, color='gray', linestyle='--')
    ax2.set_xlabel('Time Step')
    ax2.set_ylabel('Error')
    ax2.legend()

    plt.tight_layout()
    plt.show()

The function above can then be called for each data split:

plot_predictions_with_error(y_train, train_pred, 'Training Data: True vs Predicted')
plot_predictions_with_error(y_validation, validation_pred, 'Validation Data: True vs Predicted')
plot_predictions_with_error(y_test, test_pred, 'Test Data: True vs Predicted')

[Figures: true vs. predicted values with errors, for the training, validation, and testing sets]

Looking Ahead: Finding Your Way in Forecasting

Having walked through water level forecasting with LSTM networks, it is worth remembering that this is only the beginning: environmental data modelling is a vast field with much more to offer.

Opening Up New Areas: Lead Time and Complexity

Extending the lead time of our forecasts is a natural next step from the present model. Predicting further into the future is harder, but it is far more valuable for planning and decision-making. Longer lead times also mean dealing with more uncertainty and complexity. Taking this step calls for fine-tuning the models, more careful feature engineering, and perhaps more advanced architectures. The create_sequences function above already anticipates this through its n_ahead parameter, as sketched below.
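
For example, a 7-step-ahead variant only requires changing one argument (a sketch using the function defined earlier):

# Forecasting 7 time steps ahead instead of 1 (illustrative)
X_train7, y_train7 = create_sequences(train_data, n_steps=20, n_ahead=7)
X_test7, y_test7 = create_sequences(test_data, n_steps=20, n_ahead=7)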

Synergy in Numbers: Using More Than One Gauge

As we dig deeper, it becomes clear how much value lies in combining information. Adding more than one gauge to the model, much like gathering several points of view, can make predictions more reliable and accurate. By drawing on the combined signal of many monitoring sites, we can detect larger patterns, learn more about regional dynamics, and produce more accurate forecasts; a sketch of the idea follows.
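
In practice this mostly means widening the feature matrix and the model’s input shape. A minimal sketch, where “3A12” is a hypothetical placeholder for a second EDEN gauge name:

# Combine features from two gauges ('3A12' is a hypothetical station name)
multi_data = pd.concat([RfT['3A11'], WlT['3A11'], RfT['3A12'], WlT['3A12']], axis=1)
scaled_multi = MinMaxScaler().fit_transform(multi_data)

# The LSTM input shape grows with the number of features (now 4), e.g.:
# LSTM(50, activation='relu', input_shape=(20, 4))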

The Coming Together of Space and Time: Bridging Dimensions

Further ahead lie models that work across both time and space. Convolutional LSTM (ConvLSTM) networks combine temporal and spatial learning in a single architecture. With this more advanced approach we can include data from neighbouring locations, giving a fuller picture of how environmental drivers interact. Adding the spatial dimension gives the model more depth, letting it represent dynamics that change over both time and place.
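
As a taste of what that looks like in Keras, here is a minimal ConvLSTM sketch; the grid size (a 10 × 10 grid with 2 variables over 20 time steps) is purely illustrative, not derived from the EDEN data:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Flatten, Dense

# Illustrative spatio-temporal model: 20 time steps over a 10x10 grid
# with 2 variables per cell (dimensions are assumptions for the sketch)
model_st = Sequential([
    ConvLSTM2D(filters=16, kernel_size=(3, 3), activation='relu',
               input_shape=(20, 10, 10, 2)),
    Flatten(),
    Dense(1)
])
model_st.compile(optimizer='adam', loss='mse')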


Gerald A. Corzo
Hydroinformatics

Associate professor at IHE Delft in the Netherlands. His research focuses on machine learning applications for water resources systems.