Richter’s Predictor: Modeling Earthquake Damage

Favour Oyelami
Published in Analytics Vidhya
8 min read · Jun 16, 2020

Using data science to model the severity of building damage post-earthquake

Photo Source: CircleOfBlue

The use of data science to help identify issues and solve business problems in today’s world cannot be overemphasized. Across social, health and environmental challenges, data science can help identify the severity of these problems and their impacts, and provide possible solutions to tackle them.

One of the many issues that has recently become a topic of sustained interest is how to investigate earthquake damage to buildings post-earthquake. As a result, we decided to research this problem. Interestingly, I came across it through an ongoing competition hosted by DrivenData, a platform that brings cutting-edge practices in data science and crowdsourcing to some of the world’s biggest social challenges and the organizations taking them on.

Case Study

Following a large earthquake, many field investigations of damage to buildings are carried out. Due to the enormous number of buildings and their variety, it is a difficult task to inspect every building affected post-earthquake. However, having some level of description of a building can help us determine the level of damage it sustained.

About the Project

Photo Source:Relief Web

Following the 7.8 Mw Gorkha Earthquake on April 25, 2015, Nepal carried out a massive household survey using mobile technology to assess building damage in the earthquake-affected districts. Although the primary goal of this survey was to identify beneficiaries eligible for government assistance for housing reconstruction, it also collected other useful census-level socio-economic information.

The Goal

To propose a new method for describing the severity of building damage, helping investigators classify damaged buildings without gross error.

The solution to the case given will follow the CRISP-DM approach, which includes:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation & Analysis
  4. Data Modelling/Validation
  5. Result
  6. Deployment

Business Understanding

Based on aspects of building location and construction, our goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

Data Understanding

The data was collected through surveys by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal.

The dataset mainly consists of information on the buildings’ structure and their legal ownership. Each row in the dataset represents a specific building in the region that was hit by the Gorkha earthquake.

Building Data Head

There are 39 columns in this dataset, where the building_id column is a unique and random identifier.

Data Preparation & Analysis

  • Data Preparation

To get started, let’s look at the target variable, damage grade, which comes as an ordinal variable (1, 2, 3): one (1) represents low damage, two (2) represents a medium level of damage, and three (3) represents almost complete destruction of the building hit by the earthquake.

I used a map function to map each value to a categorical name to make it suitable for analysis.

Target Variable
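As a minimal sketch of that mapping step (assuming pandas and a label column named `damage_grade`, which mirrors the competition's naming but is an illustration here):

```python
import pandas as pd

# Hypothetical slice of the training labels (the real data has ~260k rows)
labels = pd.DataFrame({"damage_grade": [1, 2, 3, 2, 2]})

# Map each ordinal grade to a descriptive category name
grade_names = {1: "low", 2: "medium", 3: "high"}
labels["damage_grade"] = labels["damage_grade"].map(grade_names)

print(labels["damage_grade"].tolist())
# → ['low', 'medium', 'high', 'medium', 'medium']
```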

Secondly, I checked whether the dataset contains duplicates or missing values; it contains neither null values nor duplicate rows.

Missing values and duplicates
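That check can be sketched in one or two pandas calls (the tiny frame below is a hypothetical stand-in for the building dataset):

```python
import pandas as pd

# Small hypothetical frame standing in for the building dataset
df = pd.DataFrame({
    "building_id": [101, 102, 103],
    "age": [10, 25, 40],
})

n_missing = df.isnull().sum().sum()   # total missing cells across all columns
n_duplicates = df.duplicated().sum()  # count of fully duplicated rows

print(n_missing, n_duplicates)
# → 0 0
```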

Exploratory Data Analysis

This stage involves exploring the data to gather insights that will help the model learn from the data and improve its performance.

Univariate Analysis

To begin with, let’s take a look at the distribution of the target/dependent variable, damage grade.

Target Variable Damage

From the above, about 56.89% of the buildings have a damage grade of medium severity, followed by high with 33.47% and low with 9.64%. This implies that the majority of buildings in Nepal sustained medium or high levels of damage following the earthquake, with only a few buildings having a low damage grade.

Next, we look at the relationship between the dependent variable and the independent variables.

We begin by checking the relationship between the damage grade and the geographic region in which each building sits.

Damage grade and geographic region of building

From the above, buildings at geographic level one (1) appear to mostly have a low damage grade, with about 75% of them falling in geographic levels 9 to 26. Buildings at geographic level two (2) similarly show a low damage grade, with about 75% of them at level 200 and above. Lastly, buildings at geographic level three (3) show a similar damage grade across all levels.

Next, we explore damage grade by the number of floors in the building.

Number of floors

From the above graph, we observe that buildings with 2 floors account for a significant share of the damage, followed by buildings with 3 floors and with a single floor. We also observe that buildings with 2 floors mostly have a medium damage grade followed by a high damage grade, a pattern that also holds for buildings with 3 floors and a single floor.

Next, we look at the relationship between the age of a building and its damage grade.

Age and damage grade

A significant observation from the above chart is that buildings less than 50 years old are dominated by a medium damage grade, with a notable increase in damage for buildings between zero and 20 years of age and a steady decline from 25 to 45 years of age. Another interesting observation is that buildings around 100 and 150 years old also show an increase in damage of medium severity. This implies that age is a major factor in determining the level of damage a building will sustain.

Next, we look at the relationship between the normalized area and height of buildings and the damage grade.

Normalized area and height of building in percentages

The above charts illustrate the distribution of damage grade over the area and height of buildings. We can see that most of the observations fall in the low and high damage grades.

Next, we look at the relationship between the damage grade and the other categorical variables in the dataset.

Notable Insights

  1. Looking at the land surface condition of the building, type T has a significant impact on the severity of damage to the building
  2. Looking at the foundation type, the value R has a significant impact on the severity of damage to the building
  3. A ground floor type of F also has a significant impact on the level of damage to the building

Many of the findings here are observational, and they suggest these features will have a significant impact when building our model.

Data Modeling

This stage involves training a machine learning model on all the listed features to make predictions for the target variable, damage_grade. We selected damage grade as the target given the goal of the project: to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

To measure the performance of our algorithms, we’ll use the F1 score, which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate a binary classifier, but since we have three possible labels we will use a variant called the micro-averaged F1 score.

Micro F1 Score
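For single-label multiclass problems like ours, the micro-averaged F1 pools the per-class counts before computing precision and recall, which makes it equal to plain accuracy. A small sketch with scikit-learn's `f1_score` (toy labels, not real predictions):

```python
from sklearn.metrics import f1_score

# Toy labels for the three damage grades (1 = low, 2 = medium, 3 = high)
y_true = [1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 3, 3, 3, 2]

# Micro averaging sums true/false positives across classes first;
# with single-label multiclass data this equals accuracy (4 of 6 correct)
micro = f1_score(y_true, y_pred, average="micro")
print(round(micro, 4))
# → 0.6667
```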

Before the model was fitted on the data, the necessary feature transformations were performed, including (though not exhaustively):

  1. Feature normalization
  2. Getting dummy variable from categorical features
  3. Dropping features etc.
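The three steps above can be sketched in pandas as follows. The column names and the min-max flavour of normalization are illustrative assumptions, not the exact transformations used in the project:

```python
import pandas as pd

# Hypothetical subset of the building features
X = pd.DataFrame({
    "age": [10, 30, 50],
    "foundation_type": ["r", "w", "r"],
    "building_id": [101, 102, 103],
})

# 3. Drop identifier columns that carry no predictive signal
X = X.drop(columns=["building_id"])

# 1. Min-max normalize the numeric features into [0, 1]
X["age"] = (X["age"] - X["age"].min()) / (X["age"].max() - X["age"].min())

# 2. One-hot encode (dummy-variable) the categorical features
X = pd.get_dummies(X, columns=["foundation_type"])

print(sorted(X.columns))
# → ['age', 'foundation_type_r', 'foundation_type_w']
```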

The data modeling involved the following phases:

  • Splitting our dataset into training and test sets, used to select our baseline model
  • Cross-validation with the synthetic minority oversampling technique (SMOTE)
F1_micro Score

From the above, after trying out different algorithms together with the synthetic minority oversampling technique and performing cross-validation, the average scores of the models turned out to be roughly the same. I then proceeded to choose one of the algorithms as our baseline model: XGBoost.
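The candidate-comparison step can be sketched with scikit-learn's `cross_val_score` on a synthetic imbalanced dataset. The candidate algorithms and data below are stand-ins; in the article, SMOTE resampling (from the imbalanced-learn package) is additionally applied inside each training fold before fitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the building data (3 damage grades)
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.10, 0.57, 0.33], random_state=42)

# Score each candidate with 5-fold stratified CV on the micro F1 metric
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=42),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1_micro")
    print(name, round(scores.mean(), 3))
```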

  • Selecting a baseline model
  • Lastly, building a pipeline for an easy model workflow

Deployment

The deployment of a machine learning or data science solution can vary: a web application, a mobile application, storytelling in the form of data visualization for stakeholders, or technical reports for a manager or superior. The deployment therefore depends on how the solution will be utilized.

The deployment required for the competition and this project is in the form of storytelling via a blog or article. A web application may be considered in the future, as well as further analysis to discover insights useful for training the machine learning model and improving its performance on unseen future data.

Thank you for your time.


To see more about this analysis, see the link to my GitHub available here.

Connect with me on Twitter

Connect with me on linkedin
