Earthquake Classification Model

Pranav Vijay
INST414: Data Science Techniques
May 2, 2024

Question: Which earthquakes will be classified as “red alert”?

The stakeholder asking this question is an emergency management official who has to alert others when an earthquake hits the region.

The decision the stakeholder will make after answering this question is which alert label to assign to a future earthquake based on its features.

The data that can answer this question is a record of past earthquakes with fields for magnitude, tsunami occurrence, alert label, significance, depth, maximum reported intensity, and maximum estimated instrumental intensity. This data is relevant to my question because it lets me build a model of how alerts are assigned based on an earthquake's features.

I used Kaggle to collect a subset of this data. Kaggle hosts many datasets that are available for download. The dataset, called “Earthquake Dataset,” was uploaded by Chirag Chauhan and includes two CSV files; I chose to focus on the “earthquake_data.csv” file. I created a Jupyter notebook with a Python kernel to analyze the data and build a classification model. I downloaded the CSV file from Kaggle and read it into a dataframe from the Pandas library using the read_csv() function.
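The loading step is only a few lines. Here is a minimal sketch, assuming the CSV sits in the same directory as the notebook:

```python
import pandas as pd

# Read the Kaggle CSV into a dataframe
df = pd.read_csv("earthquake_data.csv")

# Quick look at the data and the available columns
print(df.head())
print(df.columns.tolist())
```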

I am using a classification model for this analysis since I am assigning a label to each earthquake; the target I am predicting, the alert level, is categorical.

The features I am using for my classification model are the magnitude, tsunami occurrence, significance, depth, maximum reported intensity, and maximum estimated instrumental intensity of the earthquake.
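As one way to picture this step, here is a minimal sketch using scikit-learn with a random forest classifier (one reasonable baseline; other classifiers would work too). It assumes the cleaned dataframe df from the cleaning step described below, and it uses my reading of the Kaggle file's column names: sig for significance, cdi for maximum reported intensity, and mmi for maximum estimated instrumental intensity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Feature columns as they appear in earthquake_data.csv (assumed names:
# sig = significance, cdi = max reported intensity,
# mmi = max estimated instrumental intensity)
features = ["magnitude", "tsunami", "sig", "depth", "cdi", "mmi"]
X = df[features]
y = df["alert"]  # target label: green, yellow, orange, or red

# Hold out a test set so the model is evaluated on unseen earthquakes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A random forest is one reasonable baseline for small tabular data
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Per-class precision and recall, including the rare "red" class
print(classification_report(y_test, model.predict(X_test)))
```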

I cleaned up my data by checking for missing values. After reviewing the dataset, I found rows with missing data, so I used the dropna() function to drop the rows that contain null values. I also checked the dataset for duplicate values and found none. Finally, I removed any columns that weren't needed in my analysis: in my dataframe, I dropped the “title,” “date_time,” “magType,” “location,” “continent,” “country,” “net,” “gap,” “dmin,” “nst,” “latitude,” and “longitude” columns, since these weren't features I wanted to use in my classification model.
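In code, this cleaning amounts to dropping the unused columns, dropping null rows, and checking for duplicates. A minimal sketch (dropping the columns before calling dropna() is my ordering choice, so that nulls in discarded columns don't remove otherwise-usable rows):

```python
# Columns that aren't features for the classification model
drop_cols = [
    "title", "date_time", "magType", "location", "continent",
    "country", "net", "gap", "dmin", "nst", "latitude", "longitude",
]
df = df.drop(columns=drop_cols)

# Drop rows with null values (including earthquakes missing an alert label)
df = df.dropna()

# Verify there are no duplicate rows; the review found none
print(df.duplicated().sum())  # expected: 0
```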

One limitation of my analysis is that many rows in the dataset had no alert value and were dropped from my dataframe. Since there are no alert labels for these rows, I am missing out on training data that could improve my model. I am also missing data about earthquakes from before 2001 and after 2022. Finally, by dropping latitude and longitude, I am assuming that an earthquake's location does not play a significant part in how strong it is or what alert label it receives, which may bias the model.

Here is a link to my GitHub repository that contains the Jupyter notebook that I used to load the data. The GitHub repository also contains the CSV file of the original dataset that I analyzed.

Link: https://github.com/pvijay2024/module6

Here is a link to the original Kaggle dataset I used to create a classification model.

Link: https://www.kaggle.com/datasets/warcoder/earthquake-dataset?select=earthquake_data.csv
