Earthquake Damage Prediction with Machine Learning — Part 1

Ng Jen Neng
Published in Analytics Vidhya
3 min read · Nov 20, 2020

Photo by Jorge Fernández Salas on Unsplash

Machine learning is a tool that helps data scientists make predictions. To fully understand how to analyse a dataset in a given scenario, inspecting the data distribution and drafting a comprehensive experiment plan goes a long way toward delivering a better outcome.

As Kaggle has become the favourite platform for data scientists to learn from each other, many beginners get stuck on what to analyse and what to experiment with.

In this tutorial, I assume you have a basic understanding of machine learning from online courses or a coding background. Since most online course scenarios lack the data science view and go straight to coding, I will demonstrate a simple data science observation view in this article.

The basic flow of developing a machine learning solution is:

Part 1: Background and Related Work Research

  1. Understand the domain and objective
  2. Analyse and compare related work from different researchers

Part 2: Data Analysis

  1. Select the experiment programming language, models, and validation method
  2. Create an exploratory data analysis (EDA) report

Part 3 & Part 4: Implementation

  1. Data preprocessing
  2. Draft experiment plan
  3. Implement each experiment
  4. Summarise the accuracy scores across experiments
  5. Understand the trade-off point
  6. Submit solution

Environment Setup

Language: R

IDE: R Studio

Model: Decision Tree, Random Forest, XGBoost

Packages: caret, ggplot2, UBL, rpart, psych, MLmetrics (1.1.3)

Dataset: https://www.drivendata.org/competitions/57/nepal-earthquake/data/
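The setup above can be sketched as a short R session. This is a minimal sketch: the package list comes from the setup above (plus `xgboost` and `randomForest` for the named models), and the CSV file names are assumed to follow the DrivenData download page.

```r
# One-off installation of the packages used in this series
install.packages(c("caret", "ggplot2", "UBL", "rpart", "psych",
                   "MLmetrics", "xgboost", "randomForest"))

# Load them for the session
library(caret)      # training / cross-validation workflow
library(ggplot2)    # plotting for EDA
library(UBL)        # resampling utilities for imbalanced data
library(rpart)      # decision tree
library(psych)      # descriptive statistics
library(MLmetrics)  # F1 / accuracy metrics

# Read the competition files downloaded from DrivenData
train_values <- read.csv("train_values.csv")
train_labels <- read.csv("train_labels.csv")
```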

Part 1: Background & Related Work Research

This case study focuses on the Gorkha (Nepal) earthquake competition and dataset hosted on DrivenData.org and discussed on Kaggle. The competition uses the micro-averaged F1 score as its performance metric. This dataset is an abstracted version of the dataset on the Nepal government website.
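Because every building gets exactly one damage grade, the micro-averaged F1 score (which pools true positives, false positives, and false negatives across all classes before computing F1) reduces to plain accuracy. A minimal R sketch with hypothetical label vectors:

```r
library(MLmetrics)

# Hypothetical ground truth vs. predictions (damage grades 1-3)
y_true <- c(1, 2, 3, 3, 2, 2, 1, 3)
y_pred <- c(1, 2, 3, 2, 2, 3, 1, 3)

# Micro F1 for single-label multiclass data equals overall accuracy
micro_f1 <- sum(y_pred == y_true) / length(y_true)  # 0.75

# MLmetrics' Accuracy() gives the same value
acc <- Accuracy(y_pred, y_true)
```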

What is this prediction used for?

This dataset is used to determine the damage grade (low, medium, destroyed) corresponding to construction attributes and quality. The objective is to propose a machine learning kernel that predicts the damage grade accurately. It can be applied to any country or region with similar construction attributes to predict the potential damage grade and further improve construction work.
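Since we are predicting one of three grades, a natural first check is how imbalanced the classes are. A sketch, assuming the label column is named `damage_grade` as on the DrivenData page:

```r
library(ggplot2)

# Labels downloaded from the competition page
train_labels <- read.csv("train_labels.csv")

# Count and proportion of each damage grade
# (1 = low, 2 = medium, 3 = destroyed)
table(train_labels$damage_grade)
prop.table(table(train_labels$damage_grade))

# Quick bar chart of the class distribution
ggplot(train_labels, aes(x = factor(damage_grade))) +
  geom_bar() +
  labs(x = "Damage grade", y = "Count")
```

If one grade dominates, that motivates the sampling experiments discussed later in the series.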

The prediction is useful for pre-construction earthquake-proofing quality assessment, construction enhancement analysis, and quick post-earthquake fragility analysis. Compare this with other earthquake-related predictions. Earthquake prediction based on seismic signals tries to forecast the occurrence probability, location, and magnitude with a time-independent machine learning model to alert the public, but it is an extremely difficult task because earthquakes are extraordinarily stochastic. Another example is the LANL prediction, which works on acoustic time-series data to predict the time to failure, so that modern failsafe systems can be activated to mitigate earthquake damage. In conclusion, construction attribute-based prediction is valuable because it can determine which construction types are cost-effective in reducing damage, and so save people from earthquake harm.

Related work Comparison

Below is an example of how to record related-work details in a table. From it you can determine what other researchers have done and why they applied certain processes, such as focusing on the model, feature selection, hyperparameter tuning, or EDA.

Optionally, you may include other research related to the domain as well. It may provide some hints if the topic doesn't have much directly related work.

The table above shows the related-work summary. It highlights some key points:

  1. Approaches without data sampling achieve higher accuracy scores
  2. The decision tree's score is similar to the random forest's
  3. Parameter tuning is one of the keys to optimising accuracy
  4. Feature selection lowered the accuracy
  5. Random forest is the most stable kernel

For details on imbalanced-class prediction, you may take a look here.

Gap Analysis

Based on the earlier work, there are a few questions we can address:

  1. Can XGBoost achieve a similar score as LGBM?
  2. Is the data sampling necessary for this dataset?
  3. How does feature selection improve the prediction?

We can draft some related experiments to find out the answer.

>> Next Section
