Earthquake Damage Prediction with Machine Learning — Part 1

Ng Jen Neng
Published in Analytics Vidhya
3 min read · Nov 20, 2020

Photo by Jorge Fernández Salas on Unsplash

Machine learning is a tool that helps data scientists make predictions. To fully understand how to analyse a dataset in a given scenario, inspecting the data distribution and drafting a comprehensive experiment plan goes a long way toward delivering a better outcome.

As Kaggle has become the favourite platform for data scientists to learn from each other, many beginners get stuck on what to analyse and what to experiment with.

In this tutorial, I assume you have a basic understanding of machine learning from online courses or a coding background. Since most online course scenarios lack the data science view and go straight to coding, I will demonstrate a simple data science observation view in this article.

The basic flow of developing a machine learning solution is:

Part 1: Background and Related Work Research

  1. Understand the domain and objective
  2. Analyse and compare related work from different researchers

Part 2: Data Analysis

  1. Select the experiment programming language, models, and validation method
  2. Create an exploratory data analysis (EDA) report

Part 3 & Part 4: Implementation

  1. Data preprocessing
  2. Draft experiment plan
  3. Implement each experiment
  4. Summarise the accuracy scores across experiments
  5. Understand the trade-off point
  6. Submit solution

Environment Setup

Language: R

IDE: R Studio

Model: Decision Tree, Random Forest, XGBoost

Packages: caret, ggplot2, UBL, rpart, psych, MLmetrics (1.1.3)

Dataset: https://www.drivendata.org/competitions/57/nepal-earthquake/data/
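The setup above can be sketched as a short R session. This is a minimal sketch: the package list comes from the setup above (plus `xgboost` and `randomForest` for the named models), and the CSV file names are assumed to follow the DrivenData download page.

```r
# One-off installation of the packages used in this series
install.packages(c("caret", "ggplot2", "UBL", "rpart", "psych",
                   "MLmetrics", "xgboost", "randomForest"))

# Load them for the session
library(caret)      # training / cross-validation workflow
library(ggplot2)    # plotting for EDA
library(UBL)        # resampling utilities for imbalanced data
library(rpart)      # decision tree
library(psych)      # descriptive statistics
library(MLmetrics)  # F1 / accuracy metrics

# Read the competition files downloaded from DrivenData
train_values <- read.csv("train_values.csv")
train_labels <- read.csv("train_labels.csv")
```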

Part 1: Background & Related Work Research

This case study focuses on the Gorkha (Nepal) earthquake competition and dataset hosted on DrivenData.org and discussed on Kaggle. The competition uses the micro-averaged F1 score as its performance metric. This dataset is an abstracted version of the dataset on the Nepal government website.
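Because every building gets exactly one damage grade, the micro-averaged F1 score (which pools true positives, false positives, and false negatives across all classes before computing F1) reduces to plain accuracy. A minimal R sketch with hypothetical label vectors:

```r
library(MLmetrics)

# Hypothetical ground truth vs. predictions (damage grades 1-3)
y_true <- c(1, 2, 3, 3, 2, 2, 1, 3)
y_pred <- c(1, 2, 3, 2, 2, 3, 1, 3)

# Micro F1 for single-label multiclass data equals overall accuracy
micro_f1 <- sum(y_pred == y_true) / length(y_true)  # 0.75

# MLmetrics' Accuracy() gives the same value
acc <- Accuracy(y_pred, y_true)
```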

What is this prediction used for?

This dataset is used to determine the damage grade (low, medium, destroyed) corresponding to construction attributes and quality. The objective is to propose a machine learning kernel that predicts the damage grade accurately. It can be applied to any country or region with similar construction attributes to predict the potential damage grade and further improve construction work.
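Since we are predicting one of three grades, a natural first check is how imbalanced the classes are. A sketch, assuming the label column is named `damage_grade` as on the DrivenData page:

```r
library(ggplot2)

# Labels downloaded from the competition page
train_labels <- read.csv("train_labels.csv")

# Count and proportion of each damage grade
# (1 = low, 2 = medium, 3 = destroyed)
table(train_labels$damage_grade)
prop.table(table(train_labels$damage_grade))

# Quick bar chart of the class distribution
ggplot(train_labels, aes(x = factor(damage_grade))) +
  geom_bar() +
  labs(x = "Damage grade", y = "Count")
```

If one grade dominates, that motivates the sampling experiments discussed later in the series.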

The prediction is useful for pre-construction earthquake-proofing quality assessment, construction enhancement analysis, and quick post-earthquake fragility analysis. Compare this with other earthquake-related predictions. Earthquake prediction based on seismic signals tries to forecast the occurrence probability, location, and magnitude with a time-independent machine learning model to alert the public, but it is an extremely difficult task because earthquakes are extraordinarily stochastic. Another example is the LANL prediction, which works on acoustic time-series data to predict the time to failure, so that modern failsafe systems can be activated to mitigate earthquake damage. In conclusion, construction attribute-based prediction is valuable because it can determine which construction types are cost-effective in reducing damage, and so save people from earthquake harm.

Related work Comparison

Below is an example of how to record related-work details in a table. From it you can determine what other researchers have done and why they applied certain processes, such as focusing on the model, feature selection, hyperparameter tuning, or EDA.

Optionally, you may include other research related to the domain as well. It may provide some hints if the topic doesn't have much directly related work.

The table above shows the related-work summary. It highlights some key points:

  1. Approaches without data sampling achieve higher accuracy scores
  2. The decision tree's score is similar to the random forest's
  3. Parameter tuning is one of the keys to optimising accuracy
  4. Feature selection lowered the accuracy
  5. Random forest is the most stable kernel

For details on imbalanced-class prediction, you may take a look here.

Gap Analysis

Based on the earlier work, there are a few questions we can address:

  1. Can XGBoost achieve a similar score as LGBM?
  2. Is the data sampling necessary for this dataset?
  3. How does feature selection improve the prediction?

We can draft some related experiments to find out the answer.

>> Next Section
