Predicting Earthquakes with Machine Learning

Bhargavi B
Published in Analytics Vidhya
7 min read · Mar 29, 2020

Among all natural calamities, earthquakes are the ones that occur most often and leave the highest impact, both in lives lost and in property damage worth billions of dollars. Surveys of earthquake tolls show death counts in the lakhs (hundreds of thousands) over the span of just a few years, which underscores the destruction the affected areas face.

So, if we can somehow find a way to predict earthquakes in advance, we will have a chance to save a lot of lives and infrastructure as well. Before thinking about how to forecast quakes, let's first understand what an earthquake is and how it occurs.

What is an Earthquake?

The top layer of the Earth (the crust) is called the lithosphere, and it is not a continuous piece that wraps around the whole Earth like an eggshell. It's actually made up of giant puzzle pieces called tectonic plates.

These tectonic plates move slowly, but they can get stuck at their edges due to friction. When the stress on an edge overcomes the friction, there is an earthquake that releases energy in waves, which travel through the Earth's crust and cause the shaking that we feel.

Thus, an earthquake is an intense shaking of Earth's surface, caused by movements in Earth's outermost layer.

[Figure: Earth Layers]

A lot of research has been going on over the past few decades to find a way to predict earthquakes a few hours or days in advance, but it has turned out to be a far more complex problem than anticipated. So, let's see if we can instead detect earthquakes a few seconds or minutes in advance, so that modern power systems can have fail-safes in place that try to mitigate earthquake damage.

Kaggle hosted a competition of this sort (https://www.kaggle.com/c/LANL-Earthquake-Prediction), and in this blog let's go through all the steps of a machine learning life cycle involved in solving this problem.

About the competition

This challenge was hosted by Los Alamos National Laboratory. In this competition, you address when an earthquake will take place. Specifically, you'll predict the time remaining before laboratory earthquakes occur, from real-time seismic data.

Data Overview

The data comes from a well-known experimental set-up used to study earthquake physics. We are given train and test datasets. The train data is collected from a single continuous experiment containing multiple quakes, while the test data consists of a folder containing many small segments. The data within each test file is continuous, but the test files do not represent a continuous segment of the experiment; thus, the predictions cannot be assumed to follow the same regular pattern seen in the training file.

The train dataset contains 600 million data points with two columns, acoustic_data and time_to_failure: acoustic_data is the seismic signal, and time_to_failure (TTF) is the time remaining until a laboratory earthquake occurs.

As of now, we know what our business problem is and also have the data collected from lab experiments. Now we are going to pose it as an ML problem and define the performance metric.

Performance Metric

Our objective here is to find the TTF value (a continuous value) before the real-time quake occurs, hence we can pose it as a regression problem. The main constraint is low latency at run time. The performance metric used is Mean Absolute Error (MAE), which measures the average absolute difference between two continuous variables (in our case, between the predicted and observed TTF).
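For illustration, MAE can be computed with scikit-learn (a minimal sketch; the values below are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1.4, 0.9, 3.2])  # observed TTF in seconds (made-up values)
y_pred = np.array([1.1, 1.2, 2.8])  # model predictions

# MAE = mean(|y_true - y_pred|)
print(mean_absolute_error(y_true, y_pred))  # 0.333...
```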

Exploratory data analysis

Before starting with data pre-processing/cleaning, let's first look at various aspects of the given data. As the training dataset has 600 million records, it is not easy to load and visualise the complete data, even on systems with a decent configuration like 12 GB RAM and an i5 processor.

One way to deal with this is to load just 10% of the dataset and perform EDA on that sample. In my case, I had always wanted to try GCP (Google Cloud Platform), so I created an account and spun up an instance with 64 GB RAM, which was more than enough to deal with such huge data.
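Either way, specifying compact dtypes makes the file much cheaper to load. A minimal sketch, assuming the competition's train.csv layout (the chunk size and the every-10th-row sampling are arbitrary choices):

```python
import pandas as pd

DTYPES = {"acoustic_data": "int16", "time_to_failure": "float64"}

# Full load (needs a large-memory machine, e.g. the 64 GB GCP instance):
train = pd.read_csv("train.csv", dtype=DTYPES)

# Or sample ~10% of the rows by streaming the file in chunks:
sample = pd.concat(
    chunk.iloc[::10]  # keep every 10th row of each chunk
    for chunk in pd.read_csv("train.csv", dtype=DTYPES, chunksize=10_000_000)
)
```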

From the below plot of the acoustic signal and TTF (train data), we can see that a total of 16 laboratory earthquakes were produced in the training dataset.
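A plot like this can be drawn with matplotlib on twin y-axes (a sketch using the train dataframe loaded earlier; plotting every 1,000th point keeps the figure manageable):

```python
import matplotlib.pyplot as plt

sub = train.iloc[::1000]  # downsample purely for plotting

fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(sub["acoustic_data"].values, color="tab:blue", linewidth=0.5)
ax1.set_ylabel("acoustic_data")

ax2 = ax1.twinx()  # second y-axis for TTF
ax2.plot(sub["time_to_failure"].values, color="tab:red")
ax2.set_ylabel("time_to_failure (s)")

plt.title("Acoustic signal vs. time to failure")
plt.show()
```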

In the above plot, we can see that just before TTF drops to zero (i.e., just before an earthquake occurs), there is a sudden spike in the acoustic signal value. Let's see the distribution of the acoustic signal over the whole data.
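The histogram can be reproduced along these lines (a sketch; the clipping range around zero is an arbitrary choice to keep the plot readable):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.hist(train["acoustic_data"].values[::100], bins=200, range=(-50, 50))
plt.xlabel("acoustic_data")
plt.ylabel("count")
plt.title("Distribution of the acoustic signal")
plt.show()
```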

The distribution is centred around zero, because most of the time, excluding the quake occurrences, the seismic signal stays around zero.

Feature Engineering

As a data-validation step, I checked whether there were any null entries or outliers in the data but did not find any. Hence, I headed towards feature engineering. The main thing with the given data is that we have a single feature: an acoustic signal as a time series. I have mostly worked with text data until now, where I always try to reduce the number of features to improve overall performance; this time it's the opposite, where we have to generate multiple features from a single one.

Below are some of the most important of the roughly 40 features generated from the given input (a sketch of the extraction code follows this list).

Mel Frequency Cepstral Coefficients (MFCC): These are a set of 20 features which concisely describe the overall shape of a spectral envelope.

Percentiles: A percentile represents the position of a value within the data set.

Zero Crossings: This represents the number of times the seismic signal crosses zero.

Number of peaks: As the name suggests, this feature simply counts the number of peaks in a window.

Mean, std: Mean is the average value of the seismic signal in a given segment, and standard deviation measures the spread of the data around the mean.

Skew, Kurtosis: Skewness is used to measure the asymmetry in the acoustic signal and kurtosis measures the tail-heaviness of the distribution.

Fast Fourier Transform: The Fourier transform decomposes a signal, like our seismic signal, into its constituent frequencies.

Slope and Intercept: The slope and intercept terms of a straight line fitted to the segment.
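Here is a minimal sketch of how one such feature row could be computed per 150,000-sample segment and assembled into a training table, reusing the train dataframe loaded earlier; the 4 MHz sampling rate passed to librosa and the exact subset of features are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats, signal
import librosa  # for MFCCs

def extract_features(seg, sr=4_000_000):
    """Summarise one acoustic segment as a dict of features."""
    x = seg.astype(np.float32)
    feats = {
        "mean": x.mean(), "std": x.std(),
        "skew": stats.skew(x), "kurtosis": stats.kurtosis(x),
        "q05": np.percentile(x, 5), "q95": np.percentile(x, 95),
        # sign changes between consecutive samples ~ zero crossings
        "zero_crossings": int(((x[:-1] * x[1:]) < 0).sum()),
        "n_peaks": len(signal.find_peaks(x)[0]),
        "fft_max": np.abs(np.fft.rfft(x)).max(),
    }
    # linear trend over the segment
    slope, intercept = np.polyfit(np.arange(len(x)), x, 1)
    feats["slope"], feats["intercept"] = slope, intercept
    # means of the 20 MFCCs
    for i, m in enumerate(librosa.feature.mfcc(y=x, sr=sr, n_mfcc=20).mean(axis=1)):
        feats[f"mfcc_{i}"] = m
    return feats

# Build one row per non-overlapping 150,000-sample window
seg_size = 150_000
rows = []
for start in range(0, len(train) - seg_size + 1, seg_size):
    row = extract_features(train["acoustic_data"].values[start:start + seg_size])
    # target = TTF at the end of the window
    row["ttf"] = train["time_to_failure"].values[start + seg_size - 1]
    rows.append(row)

features_df = pd.DataFrame(rows)
X, y = features_df.drop(columns="ttf"), features_df["ttf"]
```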

Feature Selection

Given that this is real-time data, low latency is as important as accuracy in this case. Hence, it is good to remove less important features using techniques like VIF (variance inflation factor) or a model's feature-importance property.

Here I used the feature-importance property of a Random Forest and considered the top 12 features for further modelling.
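For instance (a sketch, using the X and y built in the previous snippet):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Rank the engineered features by importance and keep the top 12
ranked = sorted(zip(X.columns, rf.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
top12 = [name for name, _ in ranked[:12]]
X_selected = X[top12]
```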

Modelling

In general, choosing the right model that complements the business constraints is equally important. In our case, as we have segmented the data and generated a single observation from each set of 150,000 records, we have a total of ~5,000 records in the train dataset and ~2,600 records in the test dataset.

As the final train dataset is not too large, we can start with simple models like SVR and linear regression to avoid overfitting issues.

SVR
Support Vector Regression is a regression algorithm that finds the best-fit hyperplane such that the maximum number of points lie within a margin (epsilon) around it.

I also tried some more complex models like random forest, GBDT, etc., as the data is not too small.

RF
Random Forest is an ensemble model (bagging) that uses bootstrap aggregation: many decision trees are trained on bootstrapped samples of the data and their predictions are averaged.

GBDT
Gradient Boosted Decision Trees are another form of ensemble technique (boosting), where models are combined additively so that each new model corrects the errors of the previous ones.

Conclusion

Across all the above models, with hyper-parameter tuning, Random Forest was found to yield the best results.
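The tuning itself can be done with, for example, a grid search over the Random Forest (a sketch; this particular grid is an assumption, not the one actually used):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, None],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_selected, y)
print(search.best_params_, -search.best_score_)
```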

[Table: Mean Absolute Error scores of the models]

Future Work

In order to further improve the performance of the models, we can try complex models like neural networks (e.g., LSTMs), and we can also include automatic extraction of relevant features with the tsfresh library.
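For example, with tsfresh (a sketch: tsfresh expects a "long" dataframe with an id column marking segment membership, and running it on all 600 million rows would be expensive, so only a small sample is used here):

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features as ts_extract

n_segments, seg_size = 100, 150_000  # small sample purely for illustration

# One row per sample; "id" says which segment the sample belongs to
long_df = pd.DataFrame({
    "id": np.repeat(np.arange(n_segments), seg_size),
    "acoustic_data": train["acoustic_data"].values[:n_segments * seg_size],
})

auto_features = ts_extract(long_df, column_id="id")
```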

References

https://tsfresh.readthedocs.io/en/latest/text/list_of_features.htm

https://spaceplace.nasa.gov/earthquakes/en
