Towards urban flood susceptibility mapping using machine and deep learning models (part 3): Random forest model

Omar Seleem
Published in Hydroinformatics · Dec 15, 2022

In the last article, we prepared a dataset to map urban flood susceptibility using point-based models such as random forest (RF), support vector machine (SVM) and artificial neural network (ANN). This article shows how to develop the models and use the trained model to map urban flood susceptibility. This series of articles summarizes and explains (with Python code) the paper “Towards urban flood susceptibility mapping using data-driven models in Berlin, Germany”, published in Geomatics, Natural Hazards and Risk. The complete Jupyter Notebook, sample data for flooded and non-flooded locations, and the predictive features used in the paper are available here.

Deep learning is a subset of machine learning that uses neural networks to mimic the learning process of the human brain. The literature is rich with papers showing deep learning models outperforming traditional machine learning models. However, deep learning models have shown their superiority mainly in fields where large amounts of data are available, while machine learning models are preferable for small datasets.

It is challenging to collect a reliable flood inventory to map flood susceptibility (e.g., Termeh et al., 2018 collected 53 flooded locations in an area of 5,737 km²; Choubin et al., 2019: 51 locations in 126 km²; Zhao et al., 2020: 216 locations in 131 km²). Recent studies have shown that machine learning models outperform deep learning models on such small, tabular datasets (Grinsztajn et al., 2022; Shwartz-Ziv and Armon, 2022). Therefore, machine learning models are a natural choice for point-based flood susceptibility mapping.

Random forest

The random forest model consists of several individual decision trees. It draws several sub-samples from the input dataset (bootstrapping) and fits a decision tree to each sub-sample. The final prediction is obtained by a majority vote over all the decision trees (see figure below).

Random forest model
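To make the idea concrete, here is a minimal sketch of bagging and majority voting built from plain scikit-learn decision trees (the toy data and variable names are only for illustration; the actual model below uses RandomForestClassifier directly):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data standing in for the flood dataset (illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(10):
    # bootstrap sub-sample: draw rows with replacement
    idx = rng.integers(0, len(X_toy), len(X_toy))
    tree = DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

# each tree votes; the majority class is the forest prediction
votes = np.stack([t.predict(X_toy) for t in trees])
majority_vote = (votes.mean(axis=0) >= 0.5).astype(int)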

Now we read the dataset, check for missing values and drop them, then have a look at the correlation between the predictive features.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import geopandas as gpd

# Read the shapefile or pickle which we created in the last article
df=gpd.read_file("points_data.shp")
# df=pd.read_pickle("points_data.pkl") # in case of pickle
df.head()

# check that there are no missing values in the dataset
print(df.isnull().sum())
# df = df.dropna() # use this to remove rows with missing values

# Understand the data
# Here we can see that we have a balanced dataset (equal number of flooded and non-flooded locations)
sns.countplot(x="Label", data=df) # 0 = not flooded, 1 = flooded


# show the correlation matrix for the predictive features
# (drop the geometry column as corr() works on numeric columns only)
corrMatrix = df.drop(columns="geometry").corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrMatrix, annot=True, linewidths=.5, ax=ax)

Your data frame should look like the figure below.

The prepared dataset from the last article. The values in the table are normalized (between 0 and 1) because I used the same dataset with an artificial neural network. Note that normalization is not necessary for random forests.
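If your own dataset is not normalized yet, here is a minimal min-max scaling sketch (the column names are taken from the dataframe above; this step only matters for the ANN, not for the random forest):

# scale every predictive feature to the range [0, 1]; Label and geometry are left untouched
from sklearn.preprocessing import MinMaxScaler

feature_cols = [c for c in df.columns if c not in ("Label", "geometry")]
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])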

The dataset needs to be split into dependent and independent variables. The dependent variable is the variable to be predicted (column name = Label), while the independent variables are the predictive features. The values in the Label column are 1 for flooded locations and 0 for non-flooded locations. The geometry column holds the longitude and latitude of the points and is included automatically because the dataset was created from a point shapefile.

Now, we will create two variables (X and Y). Y contains the label (dependent variable) and X contains the predictive features (independent variables). Then, we will split the dataset into training (60%), validation (20%) and testing (20%) subsets. The training dataset is used to train the model, while the validation dataset is used for hyperparameter tuning, i.e. finding the parameter combination that optimizes the model performance. Finally, the testing dataset is used to evaluate the model performance, so we test the model on data that was not included in the training and validation processes.

#Define the dependent variable that needs to be predicted (labels)
Y = df["Label"].values

#Define the independent variables. Let's also drop geometry and label
X = df.drop(labels = ["Label", "geometry"], axis=1)
features_list = list(X.columns) #List features so we can rank their importance later

#Split data into train (60 %), validate (20 %) and test (20%) to verify accuracy after fitting the model.
# training data is used to train the model
# validation data is used for hyperparameter tuning
# testing data is used to test the model

from sklearn.model_selection import train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, Y, test_size=0.2,shuffle=True, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25,shuffle=True, random_state=42)
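A quick sanity check that the split is indeed roughly 60/20/20 (using the variables created above):

# print the size and share of each subset
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(name, len(part), f"{len(part) / len(X):.0%}")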

Now we can train the random forest model. The model can be used for both classification and regression problems. However, flood susceptibility mapping is a classification problem, as mentioned before. Therefore, we will use the RandomForestClassifier from the scikit-learn Python module.

#RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state = 42) # I am using the default values of the parameters.

# Train the model on training data
model.fit(X_train, y_train)

# make prediction for the test dataset.
prediction = model.predict(X_test)

# The prediction values are either 1 (Flooded) or 0 (Non-Flooded)
prediction

# The AUC is considered one of the best performance indices
# We can plot the ROC curve and calculate the AUC
from sklearn.metrics import RocCurveDisplay # plot_roc_curve was removed in recent scikit-learn versions

ax = plt.gca()
model_disp = RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax, alpha=0.8)
plt.show()
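The AUC value (together with other common scores) can also be computed directly from the predictions; a short sketch using the test set from above:

from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

# probability of the positive class (flooded) for the test points
prob_test = model.predict_proba(X_test)[:, 1]

print("AUC:", roc_auc_score(y_test, prob_test))
print("Accuracy:", accuracy_score(y_test, prediction))
print("Confusion matrix:\n", confusion_matrix(y_test, prediction))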

Please see this article for hyperparameter tuning; a minimal sketch using the validation set follows after the feature importance code below. The random forest model has a built-in feature importance measure, which is implemented in the scikit-learn Python module. Hence, we can estimate which predictive features influence the model prediction the most.

# Estimate the feature importance
feature_imp = pd.Series(model.feature_importances_, index=features_list).sort_values(ascending=False)
print(feature_imp)

# Plot the feature importance
feature_imp.plot.bar()
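Before settling on the default parameters, you can use the validation set for hyperparameter tuning, as mentioned above. Here is a minimal sketch of a manual grid search over two common random forest parameters (the candidate values are only examples, not the ones used in the paper):

from sklearn.metrics import roc_auc_score

best_auc, best_params = 0, None
for n_estimators in [100, 300, 500]:
    for max_depth in [None, 5, 10]:
        candidate = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        candidate.fit(X_train, y_train)
        # evaluate on the validation set, not on the test set
        auc = roc_auc_score(y_val, candidate.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_auc, best_params = auc, (n_estimators, max_depth)

print("Best validation AUC:", best_auc, "with (n_estimators, max_depth) =", best_params)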

Once you are satisfied with the model performance, you can use the model to map flood susceptibility for your whole study area. We used the trained model to map flood susceptibility in Berlin. First, we need a point shapefile covering the whole study area, as we did for the training dataset in the last article.

# Read shapefile for the whole study area
df_SA=gpd.read_file("Study_area.shp")
df_SA.head() # make sure that the dataset has the same column arrangement as the training dataset

X_SA= df_SA.drop(labels = ["geometry"], axis=1) # we need to remove all the columns except the predictive features
X_SA.head()

prediction_SA = model.predict(X_SA) # predict if the location is flooded (1) or not flooded (0)

# In order to map the flood susceptibility we need to calculate the probability of being flooded
prediction_prob=model.predict_proba(X_SA) # This function returns an array
# each row has two values [probability of being not flooded, probability of being flooded]

# We only need the probability of being flooded
# We add the value corresponding to each point as a new column

df_SA['FSM']= prediction_prob[:,1]

Now we have a point shapefile that contains the flood susceptibility of each location. We need to convert it to a raster. There are many options: we can do this step in ArcMap or QGIS, or we can continue in Python.

# Save the dataframe to a shapefile in case of converting the points to raster using QGIS or ArcMap
df_SA.to_file("FSM.shp")

# Converting the point shapefile to raster.
# We will use the model prediction (column FSM in df_SA) to make a raster
from geocube.api.core import make_geocube
import rasterio as rio

out_grid= make_geocube(vector_data=df_SA, measurements=["FSM"], resolution=(-1, 1)) #for most crs negative comes first in resolution
out_grid["FSM"].rio.to_raster("Flood_susceptibility.tif")

Your final product should be a map like the one below.

References

Choubin B, Moradi E, Golshan M, Adamowski J, Sajedi-Hosseini F, Mosavi A. 2019. An ensemble prediction of flood susceptibility using multivariate discriminant analysis, classification and regression trees, and support vector machines. Sci Total Environ. 651(Pt 2):2087–2096.

Grinsztajn, L., Oyallon, E., & Varoquaux, G. 2022. Why do tree-based models still outperform deep learning on tabular data? arXiv preprint arXiv:2207.08815.

Shwartz-Ziv, R., & Armon, A. 2022. Tabular data: Deep learning is not all you need. Information Fusion, 81, 84–90.

Termeh SVR, Kornejady A, Pourghasemi HR, Keesstra S. 2018. Flood susceptibility mapping using novel ensembles of adaptive neuro fuzzy inference system and metaheuristic algorithms. Sci Total Environ. 615:438–451.

Zhao G, Pang B, Xu Z, Peng D, Zuo D. 2020. Urban flood susceptibility assessment based on convolutional neural networks. J Hydrol. 590:125235.
