Detecting Potentially Hazardous Asteroids using Deep Learning (Part 1)

Published in Analytics Vidhya, Aug 7, 2020 (6 min read)

Witness the power of deep learning on huge datasets!

Asteroids are small, rocky objects that orbit the Sun. Although they orbit the Sun just like the planets, they are much smaller. A Potentially Hazardous Asteroid (PHA) is a near-Earth object, either an asteroid or a comet, whose orbit can bring it close to the Earth and that is large enough to cause significant regional damage in the event of an impact.

I will explain how to detect potentially hazardous asteroids, first using traditional machine learning classifiers and then using an artificial neural network, so that the two approaches can be compared. For this purpose I have split the article into two parts:
Part 1: Hazardous asteroid detection with traditional classifiers
Part 2: Hazardous asteroid detection by building a neural network (https://medium.com/@jatin.kataria94/detecting-potentially-hazardous-asteroids-using-deep-learning-part-2-b3bfd1e6774c)

Traditional ML Classification

To understand how traditional classification, hyperparameter tuning and visualization are carried out, kindly refer to my previous article on medium.com:

Pulsars Detection, Hyperparameter Tuning and Visualization: Explore the power of Yellowbrick and mlxtend!

Once you are familiar with the traditional classification workflow described in that article, the results presented below should be easy to follow.

For the source code, please visit the GitHub repository jatinkataria94/Asteroid-Detection (github.com).

To access the data, visit the Asteroid Dataset (NASA JPL Asteroid Dataset) on www.kaggle.com.

Data Description

The data looks like this: 
   neo pha     H  diameter  ...      sigma_tp     sigma_per class      rms
0   N   0  3.40   939.400  ...  3.782900e-08  9.415900e-09   MBA  0.43301
1   N   0  4.20   545.000  ...  4.078700e-05  3.680700e-06   MBA  0.35936
2   N   0  5.33   246.596  ...  3.528800e-05  3.107200e-06   MBA  0.33848
3   N   0  3.00   525.400  ...  4.103700e-06  1.274900e-06   MBA  0.39980
4   N   0  6.90   106.699  ...  3.474300e-05  3.490500e-06   MBA  0.52191

[5 rows x 38 columns]

The shape of data is:  (958524, 38)

The missing values in data are:
albedo            823421
diameter_sigma    822443
diameter          822315
sigma_per          19926
sigma_ad           19926
sigma_q            19922
sigma_e            19922
sigma_a            19922
sigma_i            19922
sigma_om           19922
sigma_w            19922
sigma_ma           19922
sigma_n            19922
sigma_tp           19922
moid               19921
pha                19921
H                   6263
moid_ld              127
neo                    4
per                    4
ad                     4
rms                    2
ma                     1
per_y                  1
w                      0
om                     0
i                      0
q                      0
a                      0
e                      0
epoch_cal              0
epoch_mjd              0
epoch                  0
orbit_id               0
class                  0
tp                     0
tp_cal                 0
n                      0
dtype: int64

The summary of data is:
                    H       diameter  ...     sigma_per            rms
count  952261.000000  136209.000000  ...  9.385980e+05  958522.000000
mean       16.906411       5.506429  ...  8.525815e+04       0.561153
std         1.790405       9.425164  ...  2.767681e+07       2.745700
min        -1.100000       0.002500  ...  9.415900e-09       0.000000
25%        16.100000       2.780000  ...  1.794500e-05       0.518040
50%        16.900000       3.972000  ...  3.501700e-05       0.566280
75%        17.714000       5.765000  ...  9.775475e-05       0.613927
max        33.200000     939.400000  ...  1.910700e+10    2686.600000

[8 rows x 34 columns]

Some useful data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958524 entries, 0 to 958523
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   neo             958520 non-null  object 
 1   pha             938603 non-null  object 
 2   H               952261 non-null  float64
 3   diameter        136209 non-null  float64
 4   albedo          135103 non-null  float64
 5   diameter_sigma  136081 non-null  float64
 6   orbit_id        958524 non-null  object 
 7   epoch           958524 non-null  float64
 8   epoch_mjd       958524 non-null  int64  
 9   epoch_cal       958524 non-null  float64
 10  e               958524 non-null  float64
 11  a               958524 non-null  float64
 12  q               958524 non-null  float64
 13  i               958524 non-null  float64
 14  om              958524 non-null  float64
 15  w               958524 non-null  float64
 16  ma              958523 non-null  float64
 17  ad              958520 non-null  float64
 18  n               958524 non-null  float64
 19  tp              958524 non-null  float64
 20  tp_cal          958524 non-null  float64
 21  per             958520 non-null  float64
 22  per_y           958523 non-null  float64
 23  moid            938603 non-null  float64
 24  moid_ld         958397 non-null  float64
 25  sigma_e         938602 non-null  float64
 26  sigma_a         938602 non-null  float64
 27  sigma_q         938602 non-null  float64
 28  sigma_i         938602 non-null  float64
 29  sigma_om        938602 non-null  float64
 30  sigma_w         938602 non-null  float64
 31  sigma_ma        938602 non-null  float64
 32  sigma_ad        938598 non-null  float64
 33  sigma_n         938602 non-null  float64
 34  sigma_tp        938602 non-null  float64
 35  sigma_per       938598 non-null  float64
 36  class           958524 non-null  object 
 37  rms             958522 non-null  float64
dtypes: float64(33), int64(1), object(4)
memory usage: 277.9+ MB
None

The columns in data are:
['neo' 'pha' 'H' 'diameter' 'albedo' 'diameter_sigma' 'orbit_id' 'epoch'
 'epoch_mjd' 'epoch_cal' 'e' 'a' 'q' 'i' 'om' 'w' 'ma' 'ad' 'n' 'tp'
 'tp_cal' 'per' 'per_y' 'moid' 'moid_ld' 'sigma_e' 'sigma_a' 'sigma_q'
 'sigma_i' 'sigma_om' 'sigma_w' 'sigma_ma' 'sigma_ad' 'sigma_n' 'sigma_tp'
 'sigma_per' 'class' 'rms']

The target variable is divided into:
0    930269
1      2066
Name: pha, dtype: int64

The numerical features are:
['pha', 'H', 'epoch', 'epoch_mjd', 'epoch_cal', 'e', 'a', 'q', 'i', 'om', 'w', 'ma', 'ad', 'n', 'tp', 'tp_cal', 'per', 'per_y', 'moid', 'moid_ld', 'sigma_e', 'sigma_a', 'sigma_q', 'sigma_i', 'sigma_om', 'sigma_w', 'sigma_ma', 'sigma_ad', 'sigma_n', 'sigma_tp', 'sigma_per', 'rms']

The categorical features are:
['neo', 'orbit_id', 'class']

The categorical variable is divided into:
N    909452
Y     22883
Name: neo, dtype: int64

The categorical variable is divided into:
1          50142
JPL 1      47494
JPL 2      34563
JPL 3      29905
12         29136
 
JPL 453        1
204            1
JPL 480        1
JPL 528        1
241            1
Name: orbit_id, Length: 525, dtype: int64

The categorical variable orbit_id has too many divisions to plot

The categorical variable is divided into:
MBA    832650
OMB     27170
IMB     19702
MCA     17789
APO     12684
AMO      8448
TJN      8122
TNO      3459
ATE      1729
CEN       503
AST        57
IEO        22
Name: class, dtype: int64

Execution Time for EDA: 7.86 minutes
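
A summary like the one above can be produced with a handful of pandas calls. The following is only a minimal sketch, not the code from the linked repository: the helper name `summarize` and the tiny in-memory frame are hypothetical stand-ins for the real 958,524-row CSV.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "pha") -> dict:
    """Collect the basic EDA facts: shape, nulls, target split, feature types."""
    return {
        "shape": df.shape,
        # Null count per column, largest first
        "missing": df.isnull().sum().sort_values(ascending=False),
        # How many rows fall into each target class
        "target_counts": df[target].value_counts(),
        "numeric": df.select_dtypes(include="number").columns.tolist(),
        "categorical": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Hypothetical toy rows; the real data would come from the Kaggle CSV.
toy = pd.DataFrame({
    "neo": ["N", "N", "Y", "N"],
    "pha": [0, 0, 1, 0],
    "H": [3.4, 4.2, None, 16.9],
    "diameter": [939.4, None, None, None],
    "class": ["MBA", "MBA", "APO", "MBA"],
})

report = summarize(toy)
print("The shape of data is:", report["shape"])
print("The missing values in data are:\n", report["missing"])
print("The target variable is divided into:\n", report["target_counts"])
print("The categorical features are:", report["categorical"])
```

On the full dataset the same calls would be applied to the DataFrame returned by `pd.read_csv`; the file name is omitted here because it depends on how the Kaggle download is stored locally.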

Observations

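Looking at the above results, we can see the following:
The dataset is huge, with nearly a million entries (958,524).
The target class is heavily imbalanced: only 2,066 of the 932,335 asteroids counted in the target summary are hazardous (about 0.2%).
There are three categorical features (neo, orbit_id and class).
A few features (albedo, diameter and diameter_sigma) have a very large number of null values, more than 820,000 each.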
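
One common way to act on these observations is sketched below, under stated assumptions: drop columns that are mostly empty, drop the high-cardinality orbit_id, remove rows that still contain nulls, and one-hot encode the remaining categorical features. The helper name `clean`, the 50% threshold and the toy rows are my own illustrative choices; the source does not state which cleaning steps were actually used.

```python
import pandas as pd


def clean(df, target="pha", max_missing=0.5, drop_cols=("orbit_id",)):
    """Hypothetical cleaning pipeline for a frame shaped like the asteroid data."""
    # Drop identifier-like columns with too many distinct values to encode
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Keep only columns whose fraction of nulls is at most max_missing
    keep = df.columns[df.isnull().mean() <= max_missing]
    df = df[keep]
    # Remove rows that still have a null in any remaining column
    df = df.dropna()
    # One-hot encode the non-numeric features (never the target)
    cat_cols = [c for c in df.select_dtypes(exclude="number").columns if c != target]
    return pd.get_dummies(df, columns=cat_cols)


# Hypothetical toy rows mimicking the null pattern seen above.
toy = pd.DataFrame({
    "neo": ["N", "N", "Y", "N", "Y"],
    "pha": [0, 0, 1, None, 0],
    "H": [3.4, 4.2, 20.1, 16.9, None],
    "diameter": [939.4, None, None, None, None],
    "orbit_id": ["JPL 1", "JPL 2", "12", "1", "JPL 3"],
    "class": ["MBA", "MBA", "APO", "MBA", "AMO"],
})

cleaned = clean(toy)
print(cleaned.columns.tolist())
print(cleaned.shape)
```

Dropping rows is affordable here because the dataset is so large; with a smaller dataset, imputing the missing values might be preferable.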

Results

Plot of feature importance used for selecting features

Heatmaps of confusion matrix and classification reports for imbalanced data

Heatmaps of confusion matrix and classification reports for balanced data

Plot of decision region

Performance of traditional classifier

Imbalanced data

Accuracy — 99.86%
Precision — 95.6%
Recall — 39.2%

Balanced data

Accuracy — 99.27%
Precision — 23.2%
Recall — 100%

After correcting for class imbalance, the model recalls every hazardous asteroid (100% recall), but with only 23.2% precision, which means that more than three out of four asteroids it flags as hazardous are actually non-hazardous. By contrast, the model trained on the imbalanced data is precise (95.6%) but finds only 39.2% of the hazardous asteroids. Frequent false alarms can lead to panic and fear among the masses if non-hazardous asteroids keep being flagged as potentially hazardous.
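
To make this trade-off concrete, accuracy, precision and recall can be computed directly from the four confusion-matrix counts. The sketch below uses hypothetical counts, chosen only to illustrate a perfect-recall, low-precision outcome on imbalanced data; they are not the confusion matrix from this article.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts.

    tp: hazardous asteroids correctly flagged
    fp: non-hazardous asteroids wrongly flagged (false alarms)
    fn: hazardous asteroids that were missed
    tn: non-hazardous asteroids correctly ignored
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Of everything flagged as hazardous, how much really is?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything truly hazardous, how much was flagged?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical example: 1,000 asteroids, 10 hazardous, all 10 caught,
# but 30 harmless ones flagged as well.
m = classification_metrics(tp=10, fp=30, fn=0, tn=960)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Because the non-hazardous class dominates, accuracy stays high even when most alarms are false, which is why precision and recall are the more informative numbers for this problem.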

We will try to improve the precision using a deep learning model.

Deep Learning

Deep learning deals with algorithms called artificial neural networks, which are inspired by the structure and function of the brain. Deep learning models tend to keep improving as the amount of data grows, whereas traditional machine learning models often stop improving beyond a saturation point.

In Part 2, we will look at how to build a deep learning artificial neural network model and compare its performance with that of the traditional classifier used in this part.

Detecting Potentially Hazardous Asteroids using Deep Learning (Part 2): https://medium.com/@jatin.kataria94/detecting-potentially-hazardous-asteroids-using-deep-learning-part-2-b3bfd1e6774c

Written by Jatin Kataria