By Mr. Data Science
An excellent way to learn data science is to do data science: get some data and start analyzing it. The techniques used in this article can be applied to any data, and some of the issues we will encounter are typical of the challenges real-world data analysis throws up.
This article will investigate some data on asteroids to find out whether there is a threat of collision. Example 3 will use machine learning to classify asteroids as potential threats. We will come up against a common issue with data: unbalanced datasets, where most of the training data belongs to just one class.
Motivation for our analysis:
Every few years, the media get excited about an asteroid that will come close to the Earth, and stories about the end of the world appear on the internet, in newspapers, etc. There seems to be some interest in this topic within popular culture, because Hollywood has released several movies with an asteroid or comet threatening the Earth. But how much of a threat do asteroids pose? This article will analyze some of the data we have on asteroids, looking at things like the probability of a collision with the Earth and what that would mean based on the object’s size.
First, some quick definitions and information: an asteroid is a small rocky object orbiting the sun. In this case, small could mean something as small as a typical family car, almost as large as a continent, or anything in between. The largest asteroid is an object called Ceres. It is just under 600 miles in diameter (just under 1000km). Most asteroids orbit the sun between the orbits of the planets Mars and Jupiter. This puts them far away from the Earth, so they do not represent any threat to us. However, other asteroids are classified as Near Earth Objects; as the name implies, they could represent a threat.
Side note: apparently, June 30 is World Asteroid Day.
As mentioned above, we will also look at working with unbalanced data. In the real world, this can be a common issue in classification problems. Sometimes it is the normal situation, for example, fraud detection or cybersecurity. In these situations, the problem is usually treated as outlier detection. We will look at some possible solutions, including a one-class support vector machine approach.
Before we get started, let’s set up your environment:
In order to analyze this data, you will need to install the following Python libraries:
Near Earth Asteroid Data Analysis
Let’s look at a few examples:
Example 1 — how many asteroids are near earth objects?
As we’ve already seen, some asteroids orbit the sun millions of miles from the Earth, so they can be ignored as potential hazards. It is the near-earth objects (NEOs) that are potentially more dangerous. In this example, we will analyze the data to determine the percentage of total asteroids that are NEOs, and also what percentage of NEOs have orbits that cross the Earth’s orbit. This is important because it is these objects that are potential impact objects. To answer these questions we will look at:
- data types in pandas
- how to sub-divide a pandas dataset
- how to visualize the distribution of values within a column
The dataset used is available on Kaggle.
import pandas as pd
df_1 = pd.read_csv('asteroidData/asteroid_data.csv')
D:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (3,4,5) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Running the above command reveals the first issue with this data. The warning “Columns (3,4,5) have mixed types.Specify dtype option on import or set low_memory=False.” is telling us that some of the columns seem to contain more than one data type. Processing such data can be costly in terms of the memory required. The first step to solving this is to take a look at the data:
Columns 3, 4 and 5 are the full_name, pdes and name columns. The pdes column is just a numerical index for the rows, starting at a value of 1 and incremented by 1 for each new row. Pandas already gives us a numerical row index, so we don’t need this column. The easiest fix is to drop it.
df_1 = df_1.drop('pdes', axis=1)
df_1.head(2)
The columns full_name and name contain basically the same data: full_name is the name plus the numerical index we dropped in the previous step. Not all asteroids have been given a name; some have only an alphanumerical catalog ID. The full_name column includes these IDs whereas the name column does not, so we can also drop the column ‘name’:
df_1 = df_1.drop('name', axis=1)
Note: you can drop more than one column at the same time by passing a list of the column names, something like
columns_to_drop = ['column_name_1', 'column_name_2']
df = df.drop(columns_to_drop, axis=1)
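If you would rather keep such columns, another way to deal with the DtypeWarning is to declare the column types when reading the file. The sketch below is only an illustration: it reads a tiny in-memory CSV with made-up rows instead of the real file, and only the column names full_name, pdes and name are taken from the dataset.

```python
import io
import pandas as pd

# Made-up rows standing in for asteroid_data.csv (illustrative values only)
csv_text = """id,full_name,pdes,name,neo
a1,1 Ceres,1,Ceres,N
a2,2 Pallas,2,Pallas,N
a3,3 Juno,3,Juno,N
"""

# Declaring the mixed columns as strings stops pandas from guessing their types
mixed_columns = {"full_name": str, "pdes": str, "name": str}
df_demo = pd.read_csv(io.StringIO(csv_text), dtype=mixed_columns)

print(df_demo.dtypes)
```

Passing low_memory=False to read_csv, as the warning itself suggests, is the other option; it lets pandas read the whole file before deciding on a type for each column.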
We still have the full_name column with different data types. We can investigate this:
The dtype ‘O’ (object) in pandas is a ‘mixed’ type: a column with this dtype could include strings, NaN and numbers. You can read more about dtypes in the pandas documentation.
We’ll keep this column as it is for now.
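As a rough sketch of how a mixed column can be inspected, the snippet below checks the dtype of a toy Series and then counts the Python type of each element. The values are made up for illustration; on the real data the same calls would be applied to df_1['full_name'].

```python
import numpy as np
import pandas as pd

# Toy column mixing strings, a number and a missing value (made-up values)
full_name = pd.Series(["1 Ceres", "433 Eros", 2010, np.nan], name="full_name", dtype=object)

# dtype of the column as a whole
print(full_name.dtype)

# Python type name of each individual element, then how often each occurs
type_counts = full_name.apply(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

Note that NaN is stored as a float, which is one reason a column of names can end up with more than one type.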
Two columns that could be of interest to us are neo (near earth object) and pha (potentially hazardous asteroid).
So less than 10% of the asteroids in this dataset are NEOs. Of these, how many actually pose a threat? We can create a subset of our data as a new dataframe using a conditional selection:
df_neo = df_1[df_1['neo']=='Y']
Every row in df_neo will have a value of Y for the column ‘neo’.
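The two steps used here, getting the share of each value in a column and selecting rows with a condition, can be sketched on a toy dataframe. The flags below are made up; only the column names neo and pha and the 'Y'/'N' coding follow the dataset.

```python
import pandas as pd

# Made-up flags: 'Y'/'N' values mirror the neo and pha columns
df_demo = pd.DataFrame({
    "neo": ["N", "N", "Y", "N", "Y", "N", "N", "N", "Y", "N"],
    "pha": ["N", "N", "Y", "N", "N", "N", "N", "N", "N", "N"],
})

# Share of each value in the neo column, as fractions that sum to 1
neo_share = df_demo["neo"].value_counts(normalize=True)
print(neo_share)

# A boolean mask keeps only the rows where neo == 'Y'
df_demo_neo = df_demo[df_demo["neo"] == "Y"]
pha_share = df_demo_neo["pha"].value_counts(normalize=True)
print(pha_share)
```

The fractions printed here describe the toy frame only, not the asteroid dataset. Dropping normalize=True returns raw counts instead of fractions.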
So about 10% of NEOs are potential risks to the Earth. That’s about 1% of the asteroids in the original dataset. In terms of actual numbers:
Name: pha, dtype: int64
Just over 2,000 asteroids in this dataset are classed as potential hazards. While that’s just a tiny percentage of the total number of known asteroids, it is still a significant number. In the next section we will look at the level of risk these objects represent.
Objects enter the Earth’s atmosphere every day, but they tend to be so small that they burn up in the atmosphere and don’t impact the surface. The exact size the object has to be to survive passing through the atmosphere depends on factors like the angle of entry, speed and what the object is made of. A ball-park figure would be about half an inch in diameter for rocky objects. If an object this size hit your house or car it would do some minor damage. It is estimated that an asteroid about 60 miles (about 100km) in diameter could end most life on Earth. So what is the largest object in the dataframe df_neo?
# the data measures diameters in meters
So the largest of these objects is well below the critical size of 100km. Smaller objects can still have a devastating effect. An object the size of a small house, maybe 15 to 20 meters across, could release more energy than the 1945 Hiroshima nuclear bomb. That kind of explosion close to a town or city could cause loss of life. Let’s look at the size distribution of the asteroids:
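A minimal sketch of how the largest value and the size distribution of a column can be summarized without a plot is shown below. The diameters are made-up numbers and the bin edges are arbitrary choices for the illustration; on the real data the same calls would be applied to df_neo['diameter'].

```python
import pandas as pd

# Made-up diameters, assumed to be in the same units as the diameter column
diameters = pd.Series([0.5, 1.2, 2.0, 3.5, 4.1, 8.0, 12.0, 37.0], name="diameter")

# Largest object in the column
largest = diameters.max()
print(largest)

# Count how many values fall in each size bin (the right edge is inclusive)
bins = [0, 1, 5, 10, 50]
size_counts = pd.cut(diameters, bins=bins).value_counts().sort_index()
print(size_counts)
```

With matplotlib installed, diameters.plot.hist() would draw a comparable histogram.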
We can see that most of the objects in this dataset are less than ten meters in diameter. So this data suggests that the probability of a major asteroid impact causing large-scale loss of life is low. Different datasets can tell different stories. The next dataset gives actual probabilities of impacts and the Palermo Technical Impact Hazard Scale values. The Palermo scale is a logarithmic value that takes account of the probability of impact as well as the potential energy released by any impact. The Wikipedia page on the Palermo scale gives more details.
The impacts dataset is available on Kaggle.
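To make the idea of a logarithmic hazard value concrete, here is a small sketch of the commonly published definition of the Palermo scale: the base-10 logarithm of the impact probability divided by the background risk, where the annual background frequency is 0.03 × E^(-0.8) for an impact energy E in megatons of TNT. The function names and the example object are hypothetical and are not taken from the dataset.

```python
import math

def background_frequency(energy_mt: float) -> float:
    """Annual background frequency of impacts with at least this energy (megatons of TNT)."""
    return 0.03 * energy_mt ** -0.8

def palermo_scale(impact_probability: float, energy_mt: float, years_to_impact: float) -> float:
    """Log10 of the impact probability relative to the background risk over the same period."""
    background_risk = background_frequency(energy_mt) * years_to_impact
    return math.log10(impact_probability / background_risk)

# Hypothetical object: 1-in-10,000 chance of impact, 100 megatons, 50 years away
ps = palermo_scale(1e-4, 100.0, 50.0)
print(round(ps, 2))
```

A value of 0 means the object is exactly as threatening as the background risk, and each step of -1 means ten times less.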
df_2 = pd.read_csv('asteroidData/impacts.csv')
df_2.head(2)
This dataset gives an estimate of the actual probability of an impact. It also has a start and end year for the period of possible impacts. The first thing we can do is eliminate any asteroids with a ‘Period End’ value before 2021 (the year this article was written).
df_3 = df_2[df_2['Period End'] > 2020]
df_3.head()
The asteroids in df_3 all have a possibility of colliding with the Earth. The Torino scale is a more basic form of the Palermo scale. A value of 10 is very bad: it represents a global catastrophe. A value of zero represents an asteroid that is not a threat.
df_3['Maximum Torino Scale'].value_counts()
0    677
Name: Maximum Torino Scale, dtype: int64
The Torino scale values also indicate there are no major threats from asteroid collision.
According to the Wikipedia page for the Palermo scale: “Scale values less than −2 reflect events for which there are no likely consequences, while Palermo Scale values between −2 and 0 indicate situations that merit careful monitoring.”
So let’s get the maximum value for this column:
df_3['Maximum Palermo Scale'].max()
-1.42
Now this data suggests there could be potentially dangerous collisions. The number of asteroids with ‘Maximum Palermo Scale’ above -2 is:
df_4 = df_3[df_3['Maximum Palermo Scale'] > -2]
df_4.shape[0]
2
.shape returns the number of rows and columns; the first element is the row count.
So there are two objects in this dataset which humanity should watch. The objects are:
However, the possible impact period doesn’t start for at least another 164 years. It is unlikely that many of us will still be here.
To sum up: the threat from known NEOs is low. Of course we have not considered undiscovered objects. If something the size of a house could cause an explosion similar to the Hiroshima bomb, then we cannot ignore the problem. That’s why organizations like NASA look for and track these objects.
In the next section we will use some machine learning on the data.
Example 3 — dealing with unbalanced training data
A highly unbalanced dataset can result in something called the accuracy paradox. Let’s say we want to predict if an asteroid is a potential hazard, so we want to predict the value in the ‘pha’ column. In this case our dataframe df_neo is very unbalanced: almost all the training data (91%) belongs to one class, pha = ‘N’:
Name: pha, dtype: float64
In this situation, if we just labeled everything as ‘N’ we would get an accuracy of 91%. But of course this value is meaningless, because such a model would never flag a single hazardous object. There are several possible ways to handle this situation, for example:
- we can try resampling: either over-sample (add copies of minority class examples) or under-sample (remove most of the majority class examples)
- generate synthetic examples of the minority class
- if possible, get more genuine examples of the minority class
- some algorithms, called penalized models, put more emphasis on correctly predicting the under-represented class
- one-class classification algorithms, which are suited to situations like credit card fraud and cyber security, where most examples are in one class (the normal class) but occasionally there is abnormal activity
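The first option in the list, random over-sampling, can be sketched with pandas alone. The training frame below is made up (nine ‘N’ rows and one ‘Y’ row), and the variable names are just for this illustration.

```python
import pandas as pd

# Made-up training frame: 9 majority rows ('N') and 1 minority row ('Y')
df_train = pd.DataFrame({
    "feature": range(10),
    "pha": ["N"] * 9 + ["Y"],
})

majority = df_train[df_train["pha"] == "N"]
minority = df_train[df_train["pha"] == "Y"]

# Over-sample: draw minority rows with replacement until the classes match
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
df_balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

print(df_balanced["pha"].value_counts())
```

Resampling is normally applied to the training split only, so that the test data keeps the class balance the model will meet in practice.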
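The paper “SVMs Modeling for Highly Imbalanced Classification”, listed in the references at the end of this article, goes into more detail on classification of unbalanced datasets using SVM classifiers.
Let’s look at an example applying scikit-learn’s one-class SVM.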
from sklearn.svm import OneClassSVM
To keep things simple we will drop all non-numeric columns:
drop_columns = ['id','spkid','full_name','prefix','neo','orbit_id','equinox','class']
df_neo = df_neo.drop(drop_columns, axis=1)
df_neo.shape
(22895, 35)
df_neo.isnull().sum()
pha 1
Three of the remaining columns are mostly nulls, so the easiest solution is to drop those columns as well, then run df_neo.dropna() to get rid of the remaining rows that contain nulls.
df_neo = df_neo.drop(['diameter', 'albedo', 'diameter_sigma'], axis=1)
df_neo = df_neo.dropna()
df_neo.shape
(22883, 32)
Now we can create a predictive model. We create a variable ‘y’, which holds the value we are trying to predict (the pha column). The other columns in the data become a variable called X; the SVM will try to find statistical patterns in this data:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
X = df_neo.drop('pha', axis=1)
y = df_neo['pha']
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
model = OneClassSVM(gamma='scale', nu=0.01)
# optional: train on the majority class only
#trainX = trainX[trainy == 'N']
# the model must be fitted before it can predict
model.fit(trainX)
# predict returns 1 for inliers and -1 for outliers
y_pred_test = model.predict(testX)
print(y_pred_test)
[ 1 1 1 ... 1 -1 -1]
The model outputs an array where each element is either 1 or -1: 1 indicates the majority class, which in this case is pha = N, and -1 indicates objects which are possibly hazardous. I’ll leave it as an exercise for you to determine if we are at risk or not. Note that there are many potential applications for the one-class SVM model, including credit card fraud detection and cyber security, and the paper by Mourão-Miranda listed in the references describes a potential medical application. I encourage you to explore its use in other applications.
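As one possible starting point for that exercise, the sketch below scores a one-class SVM with the F1 measure and compares it with a baseline that labels everything as the majority class. It does not use the asteroid data: the features come from scikit-learn’s make_classification with roughly the same 91% / 9% class balance, and fitting on the majority class only is an assumption made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the asteroid features: about 91% class 0, 9% class 1
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.91], random_state=2)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

model = OneClassSVM(gamma="scale", nu=0.01)
model.fit(trainX[trainy == 0])   # learn the majority class only
y_pred = model.predict(testX)    # 1 = inlier, -1 = outlier

# Recode the true labels to the same convention: majority -> 1, minority -> -1
testy_recoded = np.where(testy == 0, 1, -1)

# Baseline that labels everything as the majority class
baseline = np.ones_like(testy_recoded)
baseline_accuracy = accuracy_score(testy_recoded, baseline)
baseline_f1 = f1_score(testy_recoded, baseline, pos_label=-1, zero_division=0)
model_f1 = f1_score(testy_recoded, y_pred, pos_label=-1, zero_division=0)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1 (minority class): {baseline_f1:.3f}")
print(f"one-class SVM F1 (minority class): {model_f1:.3f}")
```

The baseline shows the accuracy paradox in numbers: its accuracy is high, yet its F1 score for the minority class is zero because it never predicts that class. F1 on the minority class is therefore a more honest measure for comparing models on data like this.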
A quick review of what you’ve learned:
If you’ve made it this far, you should have a good understanding of:
- How to drop columns and check data types for columns in a pandas dataframe
- How to sub-divide a dataframe based on a condition
- How to deal with unbalanced training data for classification problems
If you have any feedback or suggestions for improving this article, we would love to hear from you.
- Yuchun, T., SVMs Modeling for Highly Imbalanced Classification, date retrieved=04/27/2021, [link](https://ieeexplore.ieee.org/abstract/document/4695979)
- Mourão-Miranda, J., Patient classification as an outlier detection problem: An application of the One-Class Support Vector Machine, date retrieved=04/27/2021, [link](https://www.sciencedirect.com/science/article/pii/S1053811911006872)