Classifying Asteroids Using ML : A beginner’s tale (Part 1)

Tarushi Pathak
Published in Analytics Vidhya · 4 min read · Jul 12, 2020

Hello, Machine Learning Enthusiast! This is a two-part, beginner-level tutorial on machine learning classification algorithms. So, without any further delay, let's plunge right into it.

About the Dataset

The data we'll be using for this tutorial is original data about asteroids, provided by NASA. It has 50 columns, one of which classifies whether the asteroid is hazardous or not. It is common practice in such tutorials to explain the features (columns) of the data, but I won't be doing so, nor will I ask you to research them. I want you to just work with the data at hand and discover the insights it may hold. This way you'll feel the beauty of Data Analytics for yourself. You can find the dataset here.

Importing Libraries and Uploading the CSV

From the above pic, you can see that I have imported the pandas, numpy, seaborn and matplotlib libraries.
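In case the screenshot doesn't render for you, a minimal version of those imports looks like this:

```python
# Standard data-analysis imports used throughout this tutorial
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```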

Once we are done importing the libraries, the next step is loading the CSV file. This is usually done with pandas' read_csv function, and the result is stored in a variable (nasa_df). To see the contents of the file, we call nasa_df.head(), which shows the first five rows.
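A rough sketch of that step is below. The file name nasa.csv is an assumption; use whatever name your downloaded copy has.

```python
# Load the asteroid data into a DataFrame (the file name here is assumed)
nasa_df = pd.read_csv("nasa.csv")

# Peek at the first five rows to get a feel for the columns
print(nasa_df.head())
```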

Exploratory Data Analysis

Once we are done uploading the file, the next step is to perform Exploratory Data Analysis. Don't get scared by this fancy term. It simply means drawing insight from the data, normalizing it if required, and dropping the features that do not contribute much to the variable we want to predict.

For this dataset , I will be covering the following topics for analysis:

  • Label Encoding
  • Resampling
  • Dropping certain features
  • Correlation Heatmap

Label Encoding

Label encoding is generally performed to convert labels into numeric form so that they are easily interpreted by the machine. In our dataset, the Hazardous column has the labels True and False, so we'll apply label encoding to it.

The method that allows us to do so is sklearn's LabelEncoder. The function value_counts returns the count of each unique value in a column. From the values returned, we can see that the data is highly imbalanced, as there are far more 0s than 1s. To deal with this, we'll perform resampling.
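As a sketch, encoding the labels and checking the class counts could look like this (the column name "Hazardous" is taken from the description above; adjust it if your copy spells it differently):

```python
from sklearn.preprocessing import LabelEncoder

# Convert the True/False labels in the Hazardous column to 1/0
le = LabelEncoder()
nasa_df["Hazardous"] = le.fit_transform(nasa_df["Hazardous"])

# Count how many rows fall into each class to check for imbalance
print(nasa_df["Hazardous"].value_counts())
```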

Resampling

Resampling is performed when we have an imbalanced dataset. It can be done in many ways, but for this tutorial I am using the one I feel is the easiest and best suited to the problem at hand.

We simply separate the rows corresponding to the two labels into two different dataframes. Then we concatenate the two dataframes, limiting the number of rows we take for label 0. After this, the class distribution becomes balanced.
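A minimal sketch of that undersampling idea, assuming the encoded Hazardous column from the previous step, might look like this:

```python
# Split the rows by class label
hazardous_df = nasa_df[nasa_df["Hazardous"] == 1]
non_hazardous_df = nasa_df[nasa_df["Hazardous"] == 0]

# Take only as many label-0 rows as there are label-1 rows,
# so both classes end up the same size
n_minority = len(hazardous_df)
balanced_df = pd.concat([
    hazardous_df,
    non_hazardous_df.sample(n=n_minority, random_state=42),
])

# The distribution should now be balanced
print(balanced_df["Hazardous"].value_counts())
```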

Dropping certain Features

There are a number of reasons, apart from a feature not contributing much to the target variable, for which a feature can be dropped.

As can be seen above, we have dropped most of these columns because they store the same quantities in different units. We will be using kilometres as our unit of measurement.
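As a sketch of that idea, continuing from the balanced dataframe above: the column names listed here are assumptions based on a common version of this dataset, so check your own columns (balanced_df.columns) before dropping anything.

```python
# Columns that repeat the same quantity in non-kilometre units
# (names are assumptions -- verify against balanced_df.columns in your copy)
duplicate_unit_cols = [
    "Est Dia in M(min)", "Est Dia in M(max)",
    "Est Dia in Miles(min)", "Est Dia in Miles(max)",
    "Est Dia in Feet(min)", "Est Dia in Feet(max)",
    "Relative Velocity km per hr", "Miles per hour",
    "Miss Dist.(Astronomical)", "Miss Dist.(lunar)", "Miss Dist.(miles)",
]
balanced_df = balanced_df.drop(columns=duplicate_unit_cols, errors="ignore")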

Next, we drop the features that have just one unique value throughout the column. In this case, those features are Orbiting Body and Equinox.
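One generic way to find and drop such constant columns, rather than hard-coding the two names, could be:

```python
# Find columns that contain only a single unique value,
# e.g. Orbiting Body and Equinox in this dataset
constant_cols = [col for col in balanced_df.columns
                 if balanced_df[col].nunique() == 1]
print(constant_cols)

# Drop them, since a constant column carries no information for classification
balanced_df = balanced_df.drop(columns=constant_cols)
```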

After dropping most of the columns in a similar manner, the next step is to plot a correlation heatmap. Do read up on it a bit.

I'll be covering the correlation heatmap, normalizing the values and implementing the machine learning algorithm in part 2.

Go to part 2 by clicking here.

Hope you had fun reading the article and learnt something new! Leave some claps and comments, if you'd like.
