Earthquake History (1965 — 2016): Data Visualization and Model Development

Sameer
Published in Analytics Vidhya · 6 min read · Oct 23, 2019

An implementation of RandomForestRegressor, tuned with GridSearchCV to find the best-fit parameters, that predicts earthquake occurrences for a selected country.

Note: I developed this model inspired by this notebook, then modified it substantially with my own programming skills and creativity.

Step by step procedure

  1. Download the data from here.
  2. If we look carefully, the data does not contain a Place column, which is essential for segregating events by country. It does, however, contain latitude and longitude coordinates.
  3. To generate the Place column, we pass the latitude and longitude values to the openweathermap.org API, which returns the place name (any reverse-geocoding method will work).
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> data_df = pd.read_csv('quake_db_1965-2016.csv')
>>> # after geocoding implementation
>>> print(data_df['Place'])
["Ngidihopitonu, ID", "Pangai, TO", "Unknown", "Dapdap, PH" ... "Namie, JP"]

Each value contains a place name and a country code separated by “, ”. Some entries are Unknown because the API could not resolve the location. JP stands for Japan, ID for Indonesia, and so on.
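The geocoding step itself is not shown above. A minimal sketch of the parsing half, assuming a reverse-geocoding response shaped as a list of hits with `name` and `country` fields (this schema and the helper name are my assumptions, not the article's code):

```python
# Hypothetical helper: map a reverse-geocoding JSON response (a list of hits)
# to the "Place, CC" format used in the dataset, falling back to 'Unknown'.
def response_to_place(results):
    if not results:
        return 'Unknown'
    hit = results[0]
    name = hit.get('name')
    country = hit.get('country')
    if not name or not country:
        return 'Unknown'
    return '{}, {}'.format(name, country)

# Example with a mocked API response:
print(response_to_place([{'name': 'Pangai', 'country': 'TO'}]))  # Pangai, TO
print(response_to_place([]))                                     # Unknown
```

In practice you would call the geocoding endpoint once per (latitude, longitude) pair and feed the JSON into a helper like this.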

4. Get the count of earthquakes per country code. This is easy to compute and visualize with pandas' value_counts() function.

>>> data_df = data_df[data_df['Place'] != 'Unknown']
>>> data_df['CountryCode'] = data_df['Place'].apply(lambda x: x.split(', ')[-1])
>>> top_thirty = data_df['CountryCode'].value_counts()
>>> top_thirty = top_thirty.nlargest(30).index
>>> top_thirty = list(top_thirty)
['ID', 'PG', 'JP', 'PH', 'VU', 'CL', 'RU', 'US', 'SB', 'TO', 'MX', 'PE', 'CN', 'FJ', 'AR', 'TW', 'IR', 'AF', 'NZ', 'GR', 'GT', 'CO', 'TR', 'NI', 'KZ', 'EC', 'PK', 'IN', 'PA', 'MM']
>>> df_update = data_df['CountryCode'].where(data_df['CountryCode'].isin(top_thirty), other='Other')
>>> df_update.value_counts().plot(kind='bar', figsize=(15, 8))
Total earthquakes by country

Let’s do some data visualization

For this, I have considered only top_thirty affected countries.

Create a slider (dcc.Slider) displaying the scaled years from 1965 to 2016

import numpy as np
import dash
import dash_core_components as dcc
import dash_html_components as html

year_marks = np.linspace(1965, 2016, 15)
year_marks = [int(i) for i in year_marks]

range_slider = html.Div([
    dcc.Slider(
        id='year-slider',
        min=min(year_marks),
        max=max(year_marks),
        step=1,
        marks={i: '{}'.format(i) for i in year_marks},
        value=year_marks[0]
    )
])

Make a dropdown showing the top_thirty affected countries

country_dropdown = html.Div([
    dcc.Dropdown(
        id='top-thirty-risky',
        options=[
            {'label': 'Indonesia', 'value': 'ID'},
            {'label': 'Japan', 'value': 'JP'},
            {'label': 'India', 'value': 'IN'},
            ...
            {'label': 'Myanmar', 'value': 'MM'}
        ],
        clearable=False,
        searchable=True,
        value='ID'
    )
])

Add an output container that will hold the per-country, per-year result map

country_map = html.Div(id='map-history')

Use a helper function to extract the data for a given year and country code.

def GetCountryDataByYear(risky_code, year_value):
    dataq = pd.read_csv('quake_db_1965-2016.csv')
    risky_country = dataq[dataq['Place'].str.contains(risky_code)]
    risky_country = risky_country[['Date', 'Latitude', 'Longitude', 'Magnitude', 'Depth', 'Type', 'Place']]
    # Date is formatted as MM/DD/YYYY; keep only the year part
    _, _, risky_country['Year'] = risky_country['Date'].str.split('/').str
    risky_country.loc[:, 'Year'] = risky_country.loc[:, 'Year'].astype(int)
    risky_country = GetDataYearValue(risky_country, year_value)
    return risky_country
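GetDataYearValue is called above but not shown in this excerpt. A minimal sketch, assuming it simply filters the frame down to the rows for the selected year (the name comes from the call site; the body is my assumption):

```python
import pandas as pd

# Assumed implementation of the GetDataYearValue helper used above:
# keep only the rows whose Year matches the slider's selected value.
def GetDataYearValue(country_df, year_value):
    return country_df[country_df['Year'] == int(year_value)]

# Usage with a tiny frame:
demo = pd.DataFrame({'Year': [1965, 2011, 2011], 'Magnitude': [6.0, 9.1, 6.3]})
print(GetDataYearValue(demo, 2011).shape[0])  # 2
```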

Display the output in the dash application

import plotly.graph_objs as go

insightful_history = html.Div([
    html.Div([]),
    country_dropdown,
    country_map,
    range_slider,
])
from dash.dependencies import Input, Output

app = dash.Dash(__name__)

# <top_thirty_countries_occurrences_by_year>
@app.callback(
    Output('map-history', 'children'),
    [Input('top-thirty-risky', 'value'), Input('year-slider', 'value')]
)
def history_scatter_map(risky_code, year_value):
    risky_country = GetCountryDataByYear(risky_code, year_value)
    if risky_country.shape[0] > 0:
        lats = risky_country['Latitude'].to_list()
        lons = risky_country['Longitude'].to_list()
        magnitudes = risky_country['Magnitude'].to_list()
        mags_info = ['Magnitude : ' + str(m) for m in magnitudes]
        depths = risky_country['Depth'].to_list()
        deps_info = ['Depth : ' + str(d) for d in depths]
        places = risky_country['Place'].to_list()
        country_risky_info = [places[i] + '<br>' + mags_info[i] + '<br>' + deps_info[i]
                              for i in range(risky_country.shape[0])]
        # center the map on the coordinates of the weakest event
        center_lat = risky_country[risky_country['Magnitude'] <= risky_country['Magnitude'].min()]['Latitude'].to_list()[0]
        center_lon = risky_country[risky_country['Magnitude'] <= risky_country['Magnitude'].min()]['Longitude'].to_list()[0]
        country_map = PlotScatterMap(lats, lons, 10, magnitudes, default_colorscale, country_risky_info)
        country_layout = LayoutScatter(400, 1000, 'stamen-terrain', center_lat, center_lon, 2.5)
        result_country = html.Div([
            dcc.Graph(
                id='risky-country-result',
                figure={'data': [country_map], 'layout': country_layout}
            )
        ], style={'margin-top': 20, 'margin-left': 10})
        return result_country
    return html.Div([
        html.H6('No Earthquakes found for {} in the year {}'.format(risky_code, year_value))
    ], style={'margin-top': 150, 'margin-bottom': 150, 'margin-left': 250})
# </top_thirty_countries_occurrences_by_year>
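The callback relies on two plotting helpers, PlotScatterMap and LayoutScatter, that are not shown in this excerpt. Since Plotly accepts plain dicts for traces and layouts, a minimal sketch might look like this (the signatures are inferred from the call sites; the bodies are my assumptions):

```python
# Assumed PlotScatterMap: build a scattermapbox trace dict with one marker
# per earthquake, colored by magnitude, with per-point hover text.
def PlotScatterMap(lats, lons, size, color_values, colorscale, texts):
    return {
        'type': 'scattermapbox',
        'lat': lats,
        'lon': lons,
        'mode': 'markers',
        'marker': {'size': size, 'color': color_values, 'colorscale': colorscale},
        'text': texts,
        'hoverinfo': 'text',
    }

# Assumed LayoutScatter: build a layout dict centering the map on the
# given coordinates at the given zoom level.
def LayoutScatter(height, width, style, center_lat, center_lon, zoom):
    return {
        'height': height,
        'width': width,
        'mapbox': {'style': style,
                   'center': {'lat': center_lat, 'lon': center_lon},
                   'zoom': zoom},
        'margin': {'l': 0, 'r': 0, 't': 0, 'b': 0},
    }

trace = PlotScatterMap([35.0], [139.0], 10, [6.5], 'Viridis', ['Magnitude : 6.5'])
layout = LayoutScatter(400, 1000, 'stamen-terrain', 35.0, 139.0, 2.5)
```

Either dict can be passed directly into the `figure={'data': [...], 'layout': ...}` argument of dcc.Graph.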

Finally run the application

if __name__ == '__main__':
    app.run_server(debug=True, dev_tools_props_check=False, dev_tools_ui=False)

Results

Besides the visualization, I also compute some basic statistics: for a selected year and country, the app displays the total number of occurrences, the yearly totals, the highest magnitude, the depth of that highest-magnitude event, and the actual place name for its coordinates.

Country Map result of Indonesia for the year 1965.

Indonesia earthquakes in 1965

Country Map result of Japan for the year 2011 (the year of the devastating Tōhoku earthquake).

Japan earthquakes in 2011

Predictions for Japan

For this, we consider Japan's data alone.

>>> qdb = pd.read_csv('quake_db_1965-2016.csv')
>>> japan = qdb[qdb['Place'].str.contains('JP')]
>>> japan.shape
(1347, 22)

To handle NaN values by interpolating over them, we need a couple of helper functions.

def nan_helper(y):
    # return a boolean NaN mask and a function mapping a mask to indices
    return np.isnan(y), lambda z: z.nonzero()[0]

def get_interpolation(my_df, nan_series):
    arr_series = np.array(my_df[str(nan_series)])
    nans, x = nan_helper(arr_series)
    arr_series[nans] = np.interp(x(nans), x(~nans), arr_series[~nans])
    return arr_series.round(2)

To replace the NaNs with actual interpolated values, call the function on each affected column.

japan.loc[:, 'Depth Error'] = get_interpolation(japan, 'Depth Error')
japan.loc[:, 'Depth Seismic Stations'] = get_interpolation(japan, 'Depth Seismic Stations')
japan.loc[:, 'Magnitude Error'] = get_interpolation(japan, 'Magnitude Error')
japan.loc[:, 'Magnitude Seismic Stations'] = get_interpolation(japan, 'Magnitude Seismic Stations')
japan.loc[:, 'Azimuthal Gap'] = get_interpolation(japan, 'Azimuthal Gap')
japan.loc[:, 'Horizontal Distance'] = get_interpolation(japan, 'Horizontal Distance')
japan.loc[:, 'Horizontal Error'] = get_interpolation(japan, 'Horizontal Error')
japan.loc[:, 'Root Mean Square'] = get_interpolation(japan, 'Root Mean Square')
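The interpolation pattern above is easy to verify in isolation. A self-contained check on a toy series rather than the real columns:

```python
import numpy as np

# np.interp-based NaN filling: indices of NaNs become the query points,
# indices of known values and the values themselves form the reference curve.
y = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
nans, x = np.isnan(y), lambda z: z.nonzero()[0]
y[nans] = np.interp(x(nans), x(~nans), y[~nans])
print(y)  # [1. 2. 3. 4. 5.]
```

Each NaN is replaced by the linear interpolation of its nearest known neighbors, which is exactly what get_interpolation does per column.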

Our model accepts only numeric values, so we need to encode the string columns as integers. For that we use scikit-learn's LabelEncoder class.

from sklearn.preprocessing import LabelEncoder

def label_integer_encoder(my_df, series_name):
    arr_name = np.array(list(my_df[str(series_name)]))
    label_arr_encoder = LabelEncoder()
    integer_arr_encoded = label_arr_encoder.fit_transform(arr_name)
    return integer_arr_encoded

japan.loc[:, 'Type'] = label_integer_encoder(japan, 'Type')
japan.loc[:, 'Status'] = label_integer_encoder(japan, 'Status')
japan.loc[:, 'Magnitude Type'] = label_integer_encoder(japan, 'Magnitude Type')
japan.loc[:, 'Place'] = label_integer_encoder(japan, 'Place')

Now split the data into features and targets.

X = japan[['Depth', 'Magnitude Error', 'Magnitude Type', 'Depth Error', 'Azimuthal Gap', 'Horizontal Distance', 'Horizontal Error', 'Root Mean Square', 'Place']]
y = japan[['Latitude', 'Longitude', 'Magnitude']]

Our targets are the coordinate values and the Magnitude, so we can locate the predicted points on the map.

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
>>> X_train.shape
(1077, 9)
>>> y_train.shape
(1077, 3)
>>> X_test.shape
(270, 9)
>>> y_test.shape
(270, 3)

Model Development

RandomForestRegressor is an ensemble regression model (not a classifier) that supports multi-output targets and tends to perform well on larger datasets.

>>> from sklearn.ensemble import RandomForestRegressor
>>> reg = RandomForestRegressor()
>>> reg.fit(X_train, y_train)
>>> preds = reg.predict(X_test)
>>> accuracy = reg.score(X_test, y_test)  # R² score, not classification accuracy
>>> accuracy
0.7162181047909523

GridSearchCV exhaustively searches a grid of hyperparameters with cross-validation to find the combination that fits the data best. We just need to define a dictionary of parameters and pass in the model above. Check this for more information.

>>> parameters = {'n_estimators' : [16, 23, 43, 87, 45, 550, 680]}
>>> gs = GridSearchCV(reg, parameters, cv=10)
>>> grid_fit = gs.fit(X_train, y_train)
>>> best_fit = grid_fit.best_estimator_
>>> y_hat = best_fit.predict(X_test)
>>> gs_a = best_fit.score(X_test, y_test)
>>> print(gs_a)
0.7900631674945803

Let's put the result (y_hat) into a data frame to inspect the predictions more conveniently.

>>> preds_df = pd.DataFrame()
>>> preds_df['lats'] = [row[0] for row in y_hat]
>>> preds_df['lons'] = [row[1] for row in y_hat]
>>> preds_df['mags'] = [round(row[2], 2) for row in y_hat]
>>> preds_df.head()
        lats        lons  mags
0  37.621949  141.460630  5.72
1  39.431155  143.188569  6.72
2  36.210025  141.240586  6.04
3  35.908162  139.069894  6.00
4  36.863911  140.763899  5.85
Japan predictions

Similarly, we can implement predictions for other countries. I used Japan because it is the country most prone to earthquakes after Indonesia.
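The per-country pipeline is the same shape regardless of which country's rows you feed it. A minimal sketch on synthetic data (the array sizes mirror the Japan example, but the data here is random, not real earthquake data), showing that one fitted RandomForestRegressor predicts the three targets jointly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: nine feature columns and three targets
# (Latitude, Longitude, Magnitude), as in the Japan example.
rng = np.random.RandomState(0)
X = rng.rand(200, 9)
y = rng.rand(200, 3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
print(reg.predict(X_test).shape)  # (40, 3)
```

Swapping in another country means nothing more than changing the `str.contains(...)` filter that selects the rows before this step.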

Full code for this project is here.

Conclusion

  1. Earthquakes are natural disasters so random that predicting the next actual occurrence is very difficult.
  2. To predict earthquakes reliably, we would need real-time tectonic-movement data, since it is the rupture of shifting plates that causes earthquakes.
  3. Sequence mining could help predict the next occurrence when events follow some sequential pattern.
