Nepal Earthquake Damage Grade Modeling

Dxzhang
Dec 7, 2021

In 2015, the capital of Nepal experienced a 7.8 magnitude earthquake. The earthquake took many lives, destroyed thousands of buildings, and caused an estimated $10 billion USD in damage. Using multiple models, we were able to correctly predict the damage grade of 72.8% of the buildings in our dataset and to extract the features that best explain a building's damage grade. Building on this work, we put together a plan detailing how it can be used to mitigate damage from future earthquakes.

Introduction

In this post, we will walk through the methods and steps we took to model and understand the damage that building structures sustained during the devastating earthquake that struck Nepal in 2015. The 7.8 magnitude earthquake, which struck near the capital, left roughly 9,000 people dead and injured another 22,000. The damage is estimated at around 40% of the country's GDP: roughly $10 billion against a 2015 GDP of $24.5 billion.

Our goals for this research were twofold. First, gain a better understanding of why certain buildings sustained more damage than others. Which attributes offered the most explanation? Was there something the buildings that sustained the most damage had in common? Could we recommend any strategies to mitigate this level of damage in the future? And secondly, build a model that offered the best understanding of how much damage a given building would sustain.

Data

The data used for this project came from an ongoing drivendata.com competition and is extensive in both the number of records (one per building) and the number of attributes per building: 260,606 records and 38 columns.

We began with an exploratory data analysis (EDA) to get some insight into the dataset.

EDA

Correlation

We first ran a correlation plot for the dataset.

Correlation for the dataset
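
The post only shows the resulting figure; a minimal sketch of how such a correlation heatmap can be produced (assuming the raw training frame is named train_set, as in the later code) is:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only
corr = train_set.select_dtypes('number').corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()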

Outlier

Age Column Distribution

We found that the age column has 1,363 rows with an age of 995 years. We looked into this value to determine whether it was a data-entry error or deliberate. When we turned to Google, we learned that there is a historical district in Nepal, and 995 is used as a stand-in age for all of the buildings in that district.

Search for Nepal & 995

Most of these ancient buildings were destroyed during the earthquake. The mean and median barely change when we remove these rows, since they make up only 0.5% of the entire dataset, so we decided to keep them.
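
As a quick check (a sketch; the exact code is not in the original post), the shift in the center statistics can be verified directly:

# Compare the age distribution with and without the 995-year placeholder rows
mask = train_set['age'] != 995
print((~mask).sum(), 'rows have age == 995')   # 1,363 rows, roughly 0.5% of the data
print('with outliers:   ', train_set['age'].mean(), train_set['age'].median())
print('without outliers:', train_set.loc[mask, 'age'].mean(), train_set.loc[mask, 'age'].median())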

Data preprocessing

Scaling

We applied a standard scaler to all the numerical columns.

from sklearn.preprocessing import StandardScaler

# Standardize the numeric columns in place
num_cols = ['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id', 'age',
            'area_percentage', 'height_percentage', 'count_floors_pre_eq']
scaler = StandardScaler()
train_set_combined[num_cols] = scaler.fit_transform(train_set_combined[num_cols])

One hot encoding

We applied one-hot encoding to all the categorical variables and combined the resulting columns with the original dataset.

import pandas as pd

# One-hot encode the categorical columns and drop the originals
dummy = ['land_surface_condition', 'foundation_type', 'roof_type',
         'ground_floor_type', 'other_floor_type', 'position',
         'plan_configuration', 'legal_ownership_status']
dummy_col = pd.get_dummies(train_set[dummy])
train_set_combined = pd.concat([train_set, dummy_col], axis=1)
train_set_combined.drop(dummy, axis=1, inplace=True)

No null data

We also checked for null values and found that the dataset contains none.
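
A one-line check along these lines (a sketch) confirms this:

# Total count of missing values across all columns; returns 0 for this dataset
train_set.isnull().sum().sum()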

Predictive modeling

Regression Analysis

Regression analysis is the simplest way to estimate the relationships between a dependent variable and a set of independent variables, so we started with regression as our first method of analysis to pick out the most relevant variables in the dataset.

The dataset was divided into training and test sets with an 80:20 ratio. The regression models were built on the training data, and the test data was used to measure their accuracy.

Since the outcome variable, damage grade, was an ordinal variable with 3 levels (1–3), the following methodology was followed:

1. For simple linear regression and regression with regularization, the dependent variable was treated as continuous; after prediction, the values were rounded to the nearest grade to compute accuracy (see the sketch after this list)

2. In addition to simple linear regression, ordinal regression was also performed, since it is the natural fit for an ordered outcome like damage grade
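
The regression code itself is not shown in the post; a minimal sketch of the round-off approach from step 1, assuming scikit-learn's LinearRegression and the 80:20 split described above, looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Treat damage_grade as continuous, then round predictions back to grades 1-3
lin_reg = LinearRegression().fit(X_train, y_train)
pred_grades = np.clip(np.rint(lin_reg.predict(X_test)), 1, 3).astype(int)
print('Accuracy:', accuracy_score(y_test, pred_grades))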

In total, the following 8 regression models were created:

  • Simple Linear Regression
  • Lasso Regression
  • Ridge Regression
  • Elastic Net Regression
  • Ordinal Regression with the following link functions:
    — Logit
    — Probit
    — Exponential
    — Complementary Log-Log

Results:

The model results were as follows:

  • Generalized Linear Models: accuracy 57.5%, R-squared -13.07%
  • Ordinal Regression (Probit): accuracy 57.8%, R-squared -13.02%
  • Ordinal Regression (Logit): accuracy 57.9%, R-squared -13.03%
  • Ordinal Regression (Exponential): accuracy 57.35%, R-squared -13.34%
  • Ordinal Regression (Complementary Log-Log): accuracy 57.76%, R-squared -13.47%

The regression models fit the data very poorly, as evidenced by the low accuracy scores and negative R-squared values. A negative R-squared indicates that the model explains the variance in the data worse than a baseline that simply predicts the mean. The models also had very high RMSE values, especially considering that the outcome variable only ranges from 1 to 3. Another interesting observation is that, even with regularization, the linear models showed no noticeable improvement over simple linear regression, another sign that this family of methods cannot explain this particular dataset. Of the regression models, ordinal regression with the logit link had the best accuracy at 57.9%; its most important variables were the number of floors in the building and the number of families living there. Because of the poor performance of the regression models, tangible conclusions could not be drawn, so our next steps involved more complex models such as random forests, KNN, and neural networks.
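
The ordinal regression code is not included in the post either; a sketch of an ordinal logit model using statsmodels' OrderedModel (our assumption about the library, with y_train holding the damage grade and X_train the scaled and encoded features) is:

import numpy as np
from sklearn.metrics import accuracy_score
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Ordinal regression with a logit link on the three ordered damage grades.
# OrderedModel estimates its own thresholds, so X_train should not contain a constant column.
ord_logit = OrderedModel(y_train, X_train, distr='logit')
ord_res = ord_logit.fit(method='bfgs', disp=False)
probs = np.asarray(ord_res.predict(X_test))   # one column of probabilities per grade
pred_grades = probs.argmax(axis=1) + 1        # map column indices 0-2 back to grades 1-3
print('Accuracy:', accuracy_score(y_test, pred_grades))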

Random forest

After running regression, we wanted to try decision trees and random forest models. Decision trees are simple and easily interpretable. Since we didn't know the best depth for our tree, we ran the model through a loop over depths from 1 to 25.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# Try tree depths from 1 to 25 and report test accuracy for each
for i in range(1, 26):
    decisionTree = DecisionTreeClassifier(criterion='entropy', max_depth=i)
    decisionTree = decisionTree.fit(X_train, y_train)
    y_pred = decisionTree.predict(X_test)
    print("Depth:", i, " Accuracy:", metrics.accuracy_score(y_test, y_pred))

From this loop, we learned that a depth of 13 does the best job of classifying the test set, with an accuracy of 69.27%. Beyond a depth of 13, accuracy starts to fall and we see the effects of overfitting. An advantage of decision trees is that you can extract the most important features, as shown below.
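
The post does not show how the importances were pulled out; a sketch (assuming X_train is a DataFrame so that column names are available) is:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Refit the best tree (depth 13) and rank its feature importances
best_tree = DecisionTreeClassifier(criterion='entropy', max_depth=13).fit(X_train, y_train)
importances = pd.Series(best_tree.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(6))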

The most important features are:

  • geo_level_1_id
  • foundation_type_r
  • superstructure_mud_mortar_stone
  • foundation_type_i
  • geo_level_2_id
  • age

When we dig in deeper, we realize that foundation_type_i and superstructure_mud_mortar_stone are so important because the splits they produce are very pure in damage grade 1. By using them as splits, the tree can correctly classify more than 80% of grade 1 buildings.

Decision trees are very quick and easy to understand, but sometimes they can be too simple. To get a little more complex, we run a Random Forest Classifier, which runs multiple trees and takes the most common result as its prediction. This ensemble method introduces a little bit of bias and reduces the variance. The code for our Random Forest model is similar to our decision tree model, but instead of tree depth, we’re testing for the correct number of trees to use.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Try forests of 5 to 95 trees in steps of 5
for i in range(5, 100, 5):
    rf_clf = RandomForestClassifier(n_estimators=i)
    rf_clf.fit(X_train, y_train)
    rf_pred = rf_clf.predict(X_test)
    print("N_Trees:", i, "Accuracy score:", accuracy_score(rf_pred, y_test))

We saw a small, steady improvement up to 45 trees, after which accuracy plateaued. The best accuracy we achieved with random forest is 71.64%. Our biggest takeaway from these models is that decision trees are a step in the right direction, and that we should use them inside more complex models to see if we can reach higher accuracy.

Boosting

XGBoost and CatBoost turned out to be the two models that achieved the highest accuracy on our dataset.

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)
xgb_pred = xgb_clf.predict(X_test)
accuracy_score(xgb_pred, y_test)

cat_clf = CatBoostClassifier()
cat_clf.fit(X_train, y_train)
cat_pred = cat_clf.predict(X_test)
accuracy_score(y_test, cat_pred)
Confusion Matrix for XGBoost
Confusion Matrix for CatBoost

We let each boosting model fit the training data with its default settings and predict on the test data. Both boosting models use tree-based learners with a tree depth of 6; the main difference between the two is the learning rate, with XGBoost using a higher learning rate than CatBoost. The accuracy from the XGBoost model is 72.8% and from the CatBoost model 72.4%. From the confusion matrices, we can tell that both models have about the same accuracy on damage grade 2; however, neither model does particularly well on grades 1 and 3.
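
The code for the two confusion matrices above is not shown; a sketch using scikit-learn's display helper is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Plot one confusion matrix per boosting model
for name, pred in [('XGBoost', xgb_pred), ('CatBoost', cat_pred)]:
    cm = confusion_matrix(y_test, np.asarray(pred).ravel())
    ConfusionMatrixDisplay(cm, display_labels=[1, 2, 3]).plot()
    plt.title(name)
plt.show()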

KNN

After preprocessing the data, we built a K-Nearest Neighbors model, starting with 10 neighbors. This first KNN model achieved an accuracy score of 66.3%.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test['damage_grade'].tolist(), y_pred)

To check whether our initial number of neighbors was a good choice, we fitted KNN models with the number of neighbors ranging from 1 to 20 and examined their corresponding confusion matrices and accuracy scores.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy = []
for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    ypred = knn.predict(X_test)
    # Each prediction row appears to contain more than one value; keep only the
    # damage grade in position 1
    y_pred = []
    for j in range(len(ypred)):
        y_pred.append(ypred[j][1])
    accuracy.append(accuracy_score(y_test['damage_grade'].tolist(), y_pred))

confusion_matrix(y_test['damage_grade'].tolist(), y_pred)
print(classification_report(y_test['damage_grade'].tolist(), y_pred))

Based on the plotted accuracy chart, starting from 6 neighbors the accuracy scores fluctuate around 66%. The model with the highest accuracy uses 15 neighbors, achieving an accuracy score of 66.5%.

KNN Accuracy
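
The plotting code is not included in the post; the chart above can be reproduced with a sketch like:

import matplotlib.pyplot as plt

# Accuracy for k = 1..19, collected in the loop above
plt.plot(range(1, 20), accuracy, marker='o')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Test accuracy')
plt.title('KNN accuracy vs. number of neighbors')
plt.show()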

Neural Network — Classification

First, we wanted to build neural networks focused on classification. We built two models using the Keras Sequential API and trained each for 50 and for 500 epochs. The output layer has 4 nodes with a softmax activation function (the damage grades 1–3 are used directly as labels, so output index 0 goes unused).

For the first model, we used 2 hidden layers. The first layer is set to have 64 neurons and a Sigmoid activation function. The second layer is set to have 64 neurons and a ReLU activation function.

import tensorflow as tf
from tensorflow import keras

NN_clf = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation=tf.nn.sigmoid),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(4, activation=tf.nn.softmax)])
NN_clf.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])

For the second model, we used 6 hidden layers: 64 neurons (sigmoid), 64 (ReLU), 32 (tanh), 32 (ReLU), 16 (sigmoid), and 16 (ReLU).

NN_clf_1 = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation=tf.nn.sigmoid),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(32, activation=tf.nn.tanh),
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(4, activation=tf.nn.softmax)])
NN_clf_1.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

With 50 epochs of training, the first neural network reaches an accuracy of 68% and the second reaches 68.4%.
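
The training and scoring calls are not shown above; a sketch of the 50-epoch run and test-set evaluation (the evaluate call is our assumption about how the accuracy was computed) is:

# Train each compiled model for 50 epochs, then score it on the held-out set
NN_clf.fit(X_train, y_train, epochs=50)
test_loss, test_acc = NN_clf.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)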

When the networks were trained for 500 epochs, we got lower scores than with 50 epochs because the networks overfit.
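
One common way to guard against this kind of overfitting, which we did not use in these experiments, is to hold out a validation split and stop training early; a sketch:

from tensorflow import keras

# Stop when validation loss stops improving instead of running all 500 epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                           restore_best_weights=True)
NN_clf.fit(X_train, y_train, epochs=500,
           validation_split=0.2, callbacks=[early_stop])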

Neural Network — Regression

Next, we wanted to see whether using a neural network for regression would change our accuracy. We changed the output layer to a single node with a linear activation function.

NN_clf_1 = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation=tf.nn.sigmoid),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(32, activation=tf.nn.tanh),
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(1, activation='linear')])
NN_clf_1.compile(optimizer='adam',
                 loss='mean_squared_error',
                 metrics=['mean_squared_error'])
NN_clf_1.fit(X_train, y_train, epochs=50)

Then we set threshold values to convert the continuous predictions back into damage grades. We first used 2.5 as the cutoff for damage grade 3 and 1.5 as the division between grades 1 and 2. With those thresholds, the model achieved an accuracy of 68.9%.
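
Applying the thresholds to the network's continuous output can be sketched as follows (NN_pred_1 is the prediction array used in the search code further below):

import numpy as np
from sklearn.metrics import accuracy_score

NN_pred_1 = NN_clf_1.predict(X_test)
# Values below 1.5 become grade 1, values between 1.5 and 2.5 become grade 2,
# and values of 2.5 or more become grade 3
pred_grades = np.digitize(NN_pred_1.ravel(), [1.5, 2.5]) + 1
print('Accuracy:', accuracy_score(y_test, pred_grades))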

Neural Network Regression Prediction

Looking at that confusion matrix, the model does a good job predicting damage grade 2, which makes sense given that more than 50% of the rows have a damage grade of 2. However, it does not do well on grades 1 and 3. Therefore, we tried narrowing the range for grade 2 by setting the thresholds to 1.9 and 2.3.

Neural Network Regression Prediction 2

From the confusion matrix, we can see the tradeoff: grades 1 and 3 are now predicted better, but grade 2 suffers because its range is narrower. We then looped over candidate cutoffs to find the pair with the highest accuracy score. The best cutoffs are 1.55 and 2.54, where the model achieves an accuracy of 69%.

import numpy as np
from sklearn.metrics import accuracy_score

# Search over cutoff pairs: i separates grades 1/2, j separates grades 2/3
j_list = np.linspace(2, 3, 100)
best_i = 0
best_j = 0
best_ = 0
for j in j_list:
    i = 1
    while i <= j:
        NN_reg_class = []
        for num in NN_pred_1:
            if num >= j:
                NN_reg_class.append(3)
            elif num >= i and num < j:
                NN_reg_class.append(2)
            elif num < i:
                NN_reg_class.append(1)
        score_ = accuracy_score(y_test, NN_reg_class)
        if score_ >= best_:
            best_ = score_
            best_i = i
            best_j = j
        i += 0.01

PCA (Compression)

Since we have 64 features in our dataset after preprocessing, we would need a large amount of data to train our models. Even though we have about 260,000 rows, we thought it would be worth checking whether dimensionality reduction could improve our accuracy.

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with as many components as columns and print the cumulative explained variance
pca_30 = PCA(n_components=len(train_set_combined.columns), random_state=43)
X_pca_30 = pca_30.fit_transform(train_set_combined)
for ix, cum_var in enumerate(np.cumsum(pca_30.explained_variance_ratio_ * 100)):
    print(ix + 1, cum_var)

From the results, we found that 6 principal components explain 66% of the variance in the data, and adding another 18 components increases the cumulative explained variance by only 18%. Therefore, we decided to transform the dataset down to 6 dimensions.

from sklearn.model_selection import train_test_split

pca_6 = PCA(n_components=6, random_state=42)
X_pca_6 = pca_6.fit_transform(train_set_combined)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca_6, train_label, test_size=0.2, random_state=42)

We first fit a logistic regression on the transformed dataset, using GridSearchCV to iterate over different parameter settings.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

log_param = {'multi_class': ['multinomial'],
             'solver': ['lbfgs', 'liblinear', 'sag']}
log_clf_pca = GridSearchCV(LogisticRegression(), log_param)
log_clf_pca.fit(X_train_pca, y_train_pca)
log_pred_pca = log_clf_pca.predict(X_test_pca)
accuracy_score(y_test_pca, log_pred_pca)

On the transformed dataset, our logistic regression model achieved an accuracy of only 56.4%.

Then we also ran a 6 hidden layer neural network with the transformed dataset.
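
The code for this network is not shown in the post; a sketch that reuses the same 6-hidden-layer architecture on the PCA-transformed data (the model name NN_pca and the epoch count are our assumptions) is:

NN_pca = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation=tf.nn.sigmoid),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(32, activation=tf.nn.tanh),
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(4, activation=tf.nn.softmax)])
NN_pca.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
NN_pca.fit(X_train_pca, y_train_pca, epochs=50)
NN_pca.evaluate(X_test_pca, y_test_pca)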

We got an accuracy of 63.2% for the test dataset.

From these accuracy scores, it seems that dimensionality reduction does not help our accuracy. With such a large dataset, it is better to keep all of the features. Also, since PCA compresses the features into a few major components, it is not worth running a boosting method on the reduced data, because boosting benefits from the variation spread across the original features.

Real-world Applications:

Based on the two major outputs of our study, predictive modeling and feature importance, we propose two real-world applications: a Building Assessment Tool and potential legislation governing future construction.

Our best predictive model correctly predicts the damage level of 73% of the buildings. Based on this model, we can help Nepal assess the vulnerability of its buildings before the next earthquake hits. By inputting a building's geography, age, use, and construction type, we can predict its damage level and take corresponding measures to minimize damage through more meticulous inspection and maintenance procedures.

By identifying the most important features exposing buildings to earthquake damage, we can help legislators propose research-backed zoning laws that regulate the specific building use, structure, and construction materials in particular areas in Nepal where earthquakes hit most frequently. We believe our findings can contribute to the endeavors of local governments in Nepal to minimize the casualties and monetary loss.

Limitations/Next Steps:

Due to the limitations of the features in the data we collected, although each building has geographic level identifiers, we could not connect these identifiers to exact longitudes and latitudes. We were unable to map the buildings in our dataset to cities, states, or townships in Nepal. We did our best to map the ids to districts and to the municipalities within each district.

We discovered that there are 31 unique Geo_level_1_id values, and we then found that 31 districts were impacted by the earthquake.

Damage due to the Gorkha earthquake was attributed to these four major shakings in central Nepal, which caused 8790 fatalities, 22,300 injuries, and affected 8 million people from 31 out of 75 districts in Nepal (NPC, 2015). — https://www.sciencedirect.com/topics/earth-and-planetary-sciences/gorkha-earthquake-2015

We now have a sense that Geo_level_1_id may map to districts, but we still do not know how the ids are assigned to districts. Therefore, we counted the number of unique geo_level_2_id values within each geo_level_1_id using groupby and nunique().

train_set.groupby('geo_level_1_id')['geo_level_2_id'].nunique()

We were able to locate two districts in Nepal: Gorkha and Myadi. From our search, we know that Gorkha has eleven municipalities, and only Geo_level_1_id 27 has eleven unique Geo_level_2_id values; therefore, we may conclude that Geo_level_1_id 27 represents Gorkha. Following the same process for Myadi, we map it to Geo_level_1_id 3. However, when we look at a map, we cannot piece together the logic behind the geographic ids and the physical locations, so we have no way to map the remaining ids to their districts.

https://en.wikipedia.org/wiki/Gorkha_District
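
The filtering step can be sketched as follows, building on the groupby above:

# Candidate geo_level_1_id values whose count of unique level-2 ids matches
# Gorkha's eleven municipalities
counts = train_set.groupby('geo_level_1_id')['geo_level_2_id'].nunique()
print(counts[counts == 11])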

There are other factors such as culture, policies, and natural resources that could impact the buildings in each region. Since we could not tie the Geo_level ids to a certain location, we could not identify the socioeconomic status of each district. Having that data would allow us to see the relationship between socioeconomic status and damage grade. For example, we’d like to determine if certain building materials are more common in low socioeconomic status areas.

Since this dataset only covers the damage caused by the 2015 earthquake, our models were trained with a single epicenter implicitly baked in. The next earthquake's epicenter will certainly be in a different location, so our models cannot capture how damage depends on distance from the epicenter.

Contributors

Shehzad Ali

Dongxuan Zhang

Mu-An Shen

Aatheep Gurubaran

Sam Worley

Link to GitHub: https://github.com/sworley1/AdvML-TermProject
