Detecting Potentially Hazardous Asteroids using Deep Learning (Part 1)

Jatin Kataria
Published in Analytics Vidhya
6 min read · Aug 7, 2020

Witness the power of deep learning on huge datasets!

Asteroids are small, rocky objects that orbit the Sun. Although they orbit the Sun just like the planets, they are much smaller. A Potentially Hazardous Asteroid (PHA) is a near-Earth object, either an asteroid or a comet, whose orbit can bring it close to the Earth and which is large enough to cause significant regional damage in the event of an impact.

I will explain how to detect potentially hazardous asteroids, first using traditional machine learning classifiers and later using an artificial neural network, so that the two approaches can be compared. For this purpose I have split this article into 2 parts:

Part 1: Hazardous asteroid detection by traditional classifiers

Part 2: Hazardous asteroid detection by building a neural network (https://medium.com/@jatin.kataria94/detecting-potentially-hazardous-asteroids-using-deep-learning-part-2-b3bfd1e6774c)

Traditional ML Classification

To understand how to carry out traditional classification, hyperparameter tuning and visualization, kindly refer to my previous article:

If you have understood from that article how to carry out classification using the traditional approach, you will be able to follow the results below.

For the source code please visit the following link:

To access the dataset, visit the following link:

Data Description

The data looks like this: 
neo pha H diameter ... sigma_tp sigma_per class rms
0 N 0 3.40 939.400 ... 3.782900e-08 9.415900e-09 MBA 0.43301
1 N 0 4.20 545.000 ... 4.078700e-05 3.680700e-06 MBA 0.35936
2 N 0 5.33 246.596 ... 3.528800e-05 3.107200e-06 MBA 0.33848
3 N 0 3.00 525.400 ... 4.103700e-06 1.274900e-06 MBA 0.39980
4 N 0 6.90 106.699 ... 3.474300e-05 3.490500e-06 MBA 0.52191
[5 rows x 38 columns]
The shape of data is: (958524, 38)
The missing values in data are:
albedo 823421
diameter_sigma 822443
diameter 822315
sigma_per 19926
sigma_ad 19926
sigma_q 19922
sigma_e 19922
sigma_a 19922
sigma_i 19922
sigma_om 19922
sigma_w 19922
sigma_ma 19922
sigma_n 19922
sigma_tp 19922
moid 19921
pha 19921
H 6263
moid_ld 127
neo 4
per 4
ad 4
rms 2
ma 1
per_y 1
w 0
om 0
i 0
q 0
a 0
e 0
epoch_cal 0
epoch_mjd 0
epoch 0
orbit_id 0
class 0
tp 0
tp_cal 0
n 0
dtype: int64
The summary of data is:
H diameter ... sigma_per rms
count 952261.000000 136209.000000 ... 9.385980e+05 958522.000000
mean 16.906411 5.506429 ... 8.525815e+04 0.561153
std 1.790405 9.425164 ... 2.767681e+07 2.745700
min -1.100000 0.002500 ... 9.415900e-09 0.000000
25% 16.100000 2.780000 ... 1.794500e-05 0.518040
50% 16.900000 3.972000 ... 3.501700e-05 0.566280
75% 17.714000 5.765000 ... 9.775475e-05 0.613927
max 33.200000 939.400000 ... 1.910700e+10 2686.600000
[8 rows x 34 columns]
Some useful data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958524 entries, 0 to 958523
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 neo 958520 non-null object
1 pha 938603 non-null object
2 H 952261 non-null float64
3 diameter 136209 non-null float64
4 albedo 135103 non-null float64
5 diameter_sigma 136081 non-null float64
6 orbit_id 958524 non-null object
7 epoch 958524 non-null float64
8 epoch_mjd 958524 non-null int64
9 epoch_cal 958524 non-null float64
10 e 958524 non-null float64
11 a 958524 non-null float64
12 q 958524 non-null float64
13 i 958524 non-null float64
14 om 958524 non-null float64
15 w 958524 non-null float64
16 ma 958523 non-null float64
17 ad 958520 non-null float64
18 n 958524 non-null float64
19 tp 958524 non-null float64
20 tp_cal 958524 non-null float64
21 per 958520 non-null float64
22 per_y 958523 non-null float64
23 moid 938603 non-null float64
24 moid_ld 958397 non-null float64
25 sigma_e 938602 non-null float64
26 sigma_a 938602 non-null float64
27 sigma_q 938602 non-null float64
28 sigma_i 938602 non-null float64
29 sigma_om 938602 non-null float64
30 sigma_w 938602 non-null float64
31 sigma_ma 938602 non-null float64
32 sigma_ad 938598 non-null float64
33 sigma_n 938602 non-null float64
34 sigma_tp 938602 non-null float64
35 sigma_per 938598 non-null float64
36 class 958524 non-null object
37 rms 958522 non-null float64
dtypes: float64(33), int64(1), object(4)
memory usage: 277.9+ MB
None
The columns in data are:
['neo' 'pha' 'H' 'diameter' 'albedo' 'diameter_sigma' 'orbit_id' 'epoch'
'epoch_mjd' 'epoch_cal' 'e' 'a' 'q' 'i' 'om' 'w' 'ma' 'ad' 'n' 'tp'
'tp_cal' 'per' 'per_y' 'moid' 'moid_ld' 'sigma_e' 'sigma_a' 'sigma_q'
'sigma_i' 'sigma_om' 'sigma_w' 'sigma_ma' 'sigma_ad' 'sigma_n' 'sigma_tp'
'sigma_per' 'class' 'rms']
The target variable is divided into:
0 930269
1 2066
Name: pha, dtype: int64
The numerical features are:
['pha', 'H', 'epoch', 'epoch_mjd', 'epoch_cal', 'e', 'a', 'q', 'i', 'om', 'w', 'ma', 'ad', 'n', 'tp', 'tp_cal', 'per', 'per_y', 'moid', 'moid_ld', 'sigma_e', 'sigma_a', 'sigma_q', 'sigma_i', 'sigma_om', 'sigma_w', 'sigma_ma', 'sigma_ad', 'sigma_n', 'sigma_tp', 'sigma_per', 'rms']
The categorical features are:
['neo', 'orbit_id', 'class']
The categorical variable is divided into:
N 909452
Y 22883
Name: neo, dtype: int64
The categorical variable is divided into:
1 50142
JPL 1 47494
JPL 2 34563
JPL 3 29905
12 29136

JPL 453 1
204 1
JPL 480 1
JPL 528 1
241 1
Name: orbit_id, Length: 525, dtype: int64
The categorical variable orbit_id has too many divisions to plot
The categorical variable is divided into:
MBA 832650
OMB 27170
IMB 19702
MCA 17789
APO 12684
AMO 8448
TJN 8122
TNO 3459
ATE 1729
CEN 503
AST 57
IEO 22
Name: class, dtype: int64
Execution Time for EDA: 7.86 minutes

Observations

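  • The dataset is huge, with nearly a million entries (958,524).
  • The classes are heavily imbalanced: only 2,066 of the 932,335 labelled asteroids (about 0.22%) are hazardous.
  • There are three categorical features: neo, orbit_id and class.
  • A few features have a very large number of null values: albedo, diameter and diameter_sigma are each missing in over 85% of the rows.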
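The checks behind these observations can be sketched with pandas. This is only an illustration, not the code that produced the listing above: the small DataFrame is a hypothetical stand-in with made-up values, and only a handful of the 38 columns are included.

```python
import pandas as pd

# Tiny hypothetical stand-in for the asteroid table (values are made up).
df = pd.DataFrame(
    {
        "neo": ["N", "N", "Y", "Y", "N", "N"],
        "pha": [0, 0, 1, 0, 0, None],
        "H": [3.4, 4.2, 20.1, 22.3, None, 16.9],
        "diameter": [939.4, None, None, None, None, None],
        "class": ["MBA", "MBA", "APO", "AMO", "MBA", "MBA"],
    }
)

# Missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)

# Class balance of the target, ignoring rows that have no label.
target_counts = df["pha"].dropna().astype(int).value_counts()
hazardous_share = target_counts.get(1, 0) / target_counts.sum()

# Split the columns by dtype into numerical and categorical features.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(missing)
print(target_counts)
print(f"hazardous share: {hazardous_share:.2%}")
print("numerical:", numerical)
print("categorical:", categorical)
```

On the real data the same calls would be applied to the full table after loading it, for example with `pd.read_csv`. The file name is not given in this article.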

Results

Plot of feature importance used for selecting features

Heatmaps of confusion matrix and classification reports for imbalanced data

Heatmaps of confusion matrix and classification reports for balanced data

Plot of decision region

Performance of traditional classifier

Imbalanced data

  • Accuracy — 99.86%
  • Precision — 95.6%
  • Recall — 39.2%

Balanced data

  • Accuracy — 99.27%
  • Precision — 23.2%
  • Recall — 100%
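For readers who want to see where these three numbers come from, here is a minimal sketch of how accuracy, precision and recall are derived from the four cells of a confusion matrix. The counts passed in at the bottom are hypothetical and chosen only to mimic a rare positive class. They are not the confusion matrix from this article.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: hazardous asteroids correctly flagged
    fp: non-hazardous asteroids wrongly flagged
    fn: hazardous asteroids that were missed
    tn: non-hazardous asteroids correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is flagged or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 100 hazardous objects among 100,000.
acc, prec, rec = classification_metrics(tp=40, fp=2, fn=60, tn=99_898)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

Because true negatives dominate the total, accuracy stays very high even when most hazardous objects are missed, which is why precision and recall are reported alongside it.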

After correcting for class imbalance, the model recalls every hazardous asteroid (100% recall) but with only 23.2% precision. This means that a large number of non-hazardous asteroids were misclassified as hazardous: roughly three out of every four asteroids flagged as hazardous are false alarms. Repeatedly flagging non-hazardous asteroids as potentially hazardous can cause needless panic and fear among the public.
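A little arithmetic on the numbers reported in this article makes the trade-off concrete. The sketch below uses the class counts from the data description and the 23.2% precision of the balanced model. It assumes that the evaluation split has roughly the same class ratio as the full labelled data, which the article does not state.

```python
# Class counts of the target variable, taken from the data description.
non_hazardous = 930_269
hazardous = 2_066

# Accuracy of a trivial model that labels every asteroid as non-hazardous.
baseline_accuracy = non_hazardous / (non_hazardous + hazardous)

# Precision of the balanced model, as reported in this article.
precision_balanced = 0.232

# Number of false alarms raised for each correctly flagged hazardous asteroid.
false_alarms_per_hit = (1 - precision_balanced) / precision_balanced

print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"false alarms per true detection: {false_alarms_per_hit:.2f}")
```

Under that assumption, a model that never flags anything would already score close to the accuracies listed here, so accuracy alone says little about how well hazardous asteroids are detected.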

We will try to improve the precision using a deep learning model.

Deep Learning

Deep learning deals with algorithms called artificial neural networks, which are inspired by the structure and function of the brain. Deep learning models tend to keep improving as the amount of data grows, whereas traditional machine learning models stop improving after a saturation point.

In Part 2, we will look at how to build a deep learning artificial neural network model and compare its performance with that of the traditional classifier used in this part.
