Richter’s Predictor: Modeling Earthquake Damage
Using data science to model the severity of building damage post-earthquake
The use of data science to identify issues and solve real-world problems can hardly be overemphasized. For problems arising from social, health, and environmental challenges, data science can help quantify their severity and impact and suggest possible solutions.
One issue that has recently become a topic of sustained interest is how to assess earthquake damage to buildings. I came across this problem through an ongoing competition hosted by DrivenData, a platform that brings cutting-edge practices in data science and crowdsourcing to some of the world’s biggest social challenges and the organizations taking them on.
Case Study
Following a large earthquake, extensive field investigations of building damage are carried out. Given the enormous number of affected buildings and their variety, inspecting each one individually is a difficult task. However, having a structured description of a building can help us estimate the level of damage it sustained.
About the Project
Following the 7.8 Mw Gorkha Earthquake on April 25, 2015, Nepal carried out a massive household survey using mobile technology to assess building damage in the earthquake-affected districts. Although the primary goal of this survey was to identify beneficiaries eligible for government assistance for housing reconstruction, it also collected other useful census-level socio-economic information.
The Goal
To propose a new method for describing the severity of building damage, helping investigators classify damaged buildings without gross error.
The solution to the case follows the CRISP-DM approach, which includes:
- Business Understanding
- Data Understanding
- Data Preparation & Analysis
- Data Modelling/Validation
- Result
- Deployment
Business Understanding
Based on aspects of building location and construction, our goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.
Data Understanding
The data was collected through surveys by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal.
The dataset mainly consists of information on the buildings’ structure and their legal ownership. Each row in the dataset represents a specific building in the region hit by the Gorkha earthquake.
There are 39 columns in this dataset, where the building_id column is a unique and random identifier.
Data Preparation & Analysis
- Data Preparation
To get started, let’s look at the target variable, damage grade, which comes as an ordinal variable taking the values 1, 2, and 3: one (1) represents low damage, two (2) a medium level of damage, and three (3) almost complete destruction of the building hit by the earthquake.
I used a map function to map each grade to a categorical name, making it more suitable for analysis.
Secondly, I checked whether the dataset contains duplicate rows or missing values; it contains neither.
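These two preparation steps can be sketched as follows; the small DataFrame here is a hypothetical stand-in for the real competition data:

```python
import pandas as pd

# Hypothetical sample of the training labels; the real data comes from
# the competition's CSV files.
df = pd.DataFrame({
    "building_id": [101, 102, 103, 104],
    "damage_grade": [1, 2, 3, 2],
})

# Map the ordinal grades 1/2/3 to categorical names for analysis.
grade_names = {1: "low", 2: "medium", 3: "high"}
df["damage_grade"] = df["damage_grade"].map(grade_names)

# Sanity checks: count missing values and duplicate rows.
print(df["damage_grade"].tolist())      # ['low', 'medium', 'high', 'medium']
print(df.isnull().sum().sum())          # 0
print(df.duplicated().sum())            # 0
```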
Exploratory Data Analysis
This stage involves exploring the data to gather insights that will help the model learn from the data and improve its performance.
Univariate Analysis
To begin with, let’s take a look at the distribution of the target/dependent variable, damage grade.
From the above, about 56.89% of the buildings have a damage grade of medium severity, followed by high with 33.47% and low with 9.64%. This implies that the majority of the buildings in Nepal sustained medium or high damage in the earthquake, with only a few buildings having a low damage grade.
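A minimal sketch of how such a distribution can be computed with pandas; the labels below are made up to illustrate the call, not the real survey data:

```python
import pandas as pd

# Hypothetical labels; the real distribution (~57% medium, ~33% high, ~10% low)
# comes from the competition's training labels.
labels = pd.Series(["medium"] * 6 + ["high"] * 3 + ["low"] * 1, name="damage_grade")

# Relative frequency of each severity level, as percentages.
dist = labels.value_counts(normalize=True).mul(100).round(2)
print(dist)  # medium 60.0, high 30.0, low 10.0
```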
Next, we look at the relationship between the dependent variable and the independent variables.
We begin by checking the relationship between the damage grade and the geographic region in which each building is located.
From the above, starting with buildings at geographic level one (1), most buildings in this region have a low damage grade, with about 75% of the buildings residing in geographic levels 9 to 26. Buildings at geographic level two (2) show a similarly low damage grade, with about 75% of the buildings residing at level 200 and above. Lastly, buildings at geographic level three (3) have a similar damage grade across all geographic levels.
Next we explore the damage grade on number of floors in the building
From the above graph, we observe that buildings with 2 floors show the most damage, followed by buildings with 3 floors and 1 floor. We also observe that buildings with 2 floors most often have a medium damage grade, followed by a high damage grade; the same pattern holds for buildings with 3 floors and 1 floor.
Next we look at the relationship between age of building and damage grade
A significant observation from the above chart is that buildings less than 50 years old are dominated by a medium damage grade, with a notable increase in damage for buildings between 0 and 20 years of age and a steady decline from 25 to 45 years of age. Another interesting observation is that buildings around 100 and 150 years old also show an increase in medium-severity damage. This implies that age is a major factor in determining the level of damage a building will sustain.
Next, we look at damage grade against the normalized area and height of the buildings.
The above charts illustrate the distribution of damage grade over building area and height. We can see that most observations here fall in the low and high damage grades.
Next, we look at the relationship between the damage grade and the other categorical variables in the dataset.
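One simple way to inspect such relationships is a cross-tabulation of each categorical feature against the damage grade; the sample data below is hypothetical:

```python
import pandas as pd

# Hypothetical sample; the real columns come from the competition data.
df = pd.DataFrame({
    "land_surface_condition": ["t", "n", "t", "o", "t", "n"],
    "damage_grade": ["medium", "low", "high", "medium", "medium", "high"],
})

# Count how often each category co-occurs with each severity level, to see
# which categories dominate each damage grade.
table = pd.crosstab(df["land_surface_condition"], df["damage_grade"])
print(table)
```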
Notable Insights
- Looking at the land surface condition of the building, the type `T` has a significant impact on the severity of damage to the building
- Looking at the foundation type, we can observe that the value `R` has a significant impact on the severity of damage to the building
- Also, the ground floor type `F` has a significant impact on the level of damage to the building
Many of the findings here are observational, as we can see, and they suggest these features will have a significant impact when building our model.
Data Modeling
This stage involves training a machine learning model on all the listed features to make predictions for the target variable `damage_grade`. We selected damage grade as the target feature given the project’s goal of predicting the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.
To measure the performance of our algorithms, we’ll use the F1 score which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate performance on a binary classifier, but since we have three possible labels we will use a variant called the micro averaged F1 score.
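The metric can be illustrated with scikit-learn on toy labels; note that for single-label multiclass problems, the micro-averaged F1 score reduces to plain accuracy:

```python
from sklearn.metrics import f1_score

# Toy ground truth and predictions over the three damage grades (1, 2, 3),
# just to illustrate the metric; real scores come from the trained model.
y_true = [1, 2, 3, 2, 2, 1, 3, 2]
y_pred = [1, 2, 2, 2, 3, 1, 3, 2]

# Micro-averaging pools true/false positives across all classes before
# computing precision and recall.
score = f1_score(y_true, y_pred, average="micro")
print(score)  # 6 of 8 predictions correct -> 0.75
```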
Before the model was fitted to the data, the necessary feature transformations were performed, including (though not exhaustively):
- Feature normalization
- Getting dummy variable from categorical features
- Dropping features, etc.
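These transformations might be sketched as below; the column names and values are hypothetical stand-ins for the real building features:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical slice of the building data.
X = pd.DataFrame({
    "age": [10, 25, 100, 40],
    "foundation_type": ["r", "w", "r", "i"],
    "building_id": [101, 102, 103, 104],  # identifier, not a predictive feature
})

# Drop the identifier, one-hot encode the categorical column, and
# normalize the numeric column to the [0, 1] range.
X = X.drop(columns=["building_id"])
X = pd.get_dummies(X, columns=["foundation_type"])
X[["age"]] = MinMaxScaler().fit_transform(X[["age"]])
print(X.columns.tolist())
```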
The data modeling involved the following phases:
- Splitting our dataset into training and test sets to be used in selecting our baseline model
- Cross-validation with the Synthetic Minority Oversampling Technique (SMOTE)
From the above, after trying out different algorithms together with SMOTE and performing cross-validation, the average scores came out about the same, so I chose one of the algorithms, XGBoost, as our baseline model
- Selecting a baseline model
- Lastly, we built a pipeline for a smooth model workflow
Deployment
The deployment of a machine learning or data science solution can vary: a web application, a mobile application, storytelling in the form of data visualization for stakeholders, or a technical report for a manager or superior. The deployment therefore depends on how the solution will be used.
The deployment required for the competition and this project takes the form of storytelling in a blog article. A web application may be considered in the future, along with further analysis to discover insights that could improve the model’s performance on unseen data.
Thank you for your time.
Please don’t forget to clap.
To see more about this analysis, see the link to my GitHub available here
Connect with me on Twitter
Connect with me on LinkedIn