Exoplanet Habitability — Preprocessing Tutorial

Jordan
9 min read · Dec 23, 2022


Earlier this year, NASA confirmed that the number of known exoplanets has passed 5,000. Machine learning can be a great tool for determining whether any of those exoplanets are habitable. There is a lot of data on these planets that would be difficult to go through by hand, so machine learning models can drastically speed things up by automating the determination of (possible) habitability.

Photo by NASA on Unsplash

The Data

There are several sources of data surrounding exoplanet habitability. The dataset I used is linked below.

https://www.kaggle.com/datasets/chandrimad31/phl-exoplanet-catalog

The Kaggle data has 117 features for each planet, as the following snapshot indicates. The class label, called “P_HABITABLE,” has three categories: 0 indicates the exoplanet is not habitable, 1 means possibly habitable, and 2 means optimistically habitable. The features include information about each planet such as mass, temperature, and location, among many others.

Getting Started

First, we need to import a bunch of stuff. I know this is a long list, but I’ll explain everything as we use it.

# Imports

# Data
import pandas as pd
import numpy as np

# Machine Learning Utilities
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectFromModel

# Machine Learning Models
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import tree
from sklearn.naive_bayes import GaussianNB

# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

The next step is to create a dataframe from the Kaggle data using a Python library called pandas. If you haven’t used pandas before, now is a great time to start! It’s one of the most important libraries I use in machine learning and data science. At this stage, it lets us read a ‘.csv’ file (one containing comma-separated values) into a convenient table.

Put your data into the directory you want to code in (probably where your Python script already is). I used Google Colab when I wrote this code, so I put my data in Google Drive. By the way, Google Colab is a great tool for coding in the browser if you’re interested! Once everything is snuggled up in the same place, the snippets below bring in the exoplanet data.
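If you’re following along in Colab, Google Drive has to be mounted before pandas can see the file, and you’ll want a path prefix pointing at the folder that holds the CSV. Here is a minimal sketch (the folder name is a placeholder for your own setup, not necessarily what I used):

# Mount Google Drive so files saved there are visible to the notebook
from google.colab import drive
drive.mount('/content/drive')

google_file_prefix = '/content/drive/MyDrive/' # Placeholder: point this at the folder holding the CSV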

data = google_file_prefix + 'phl_exoplanet_catalog_2019.csv' # Full path to the Kaggle CSV
df = pd.read_csv(data) # Reads the CSV into a pandas dataframe

To double-check that you’ve successfully loaded your data, you can run the command df.head() to see the first five lines (exoplanets) of the dataframe. Now we’re all set up for the next stage!
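For example:

df.head() # Shows the first five exoplanets
print(df.shape) # (number of rows, number of feature columns) as a quick sanity check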

Preprocessing

Most data needs to be preprocessed before it can be used. Preprocessing involves dealing with missing values and categorical data, and even handling class imbalance. Let’s start with missing values. The reason we have to resolve null or missing data points is that our model may not be able to recognize trends and patterns when the data is incomplete. Consider a math test as an example: you might not get a good grade if you haven’t studied all of the material, or if the professor never covered something that ended up on the test.

There are several different ways we can account for missing data, but first we need to understand what we’re working with. The following line computes the fraction of missing values in each column of the dataframe and sorts the columns in descending order.

nullVals = df.isnull().mean().sort_values(ascending=False) # Fraction of missing values in each column

The first 30 or so columns are shown below.

Missing Values

As you can see, there are a few columns that are completely empty! There isn’t any point in trying to give those features any numbers, so we can simply remove them. Further, I decided to remove any feature with more than 60% of its values missing. The following code removes each of those columns from the data.

df = df.drop(["P_DETECTION_MASS", "P_GEO_ALBEDO", "S_MAGNETIC_FIELD", "S_DISC", "P_ATMOSPHERE", "P_ALT_NAMES", "P_DETECTION_RADIUS", "P_GEO_ALBEDO_ERROR_MIN", "P_TEMP_MEASURED", "P_GEO_ALBEDO_ERROR_MAX",
"P_TPERI_ERROR_MAX", "P_TPERI_ERROR_MIN", "P_TPERI", "P_OMEGA_ERROR_MIN", "P_OMEGA_ERROR_MAX", "P_DENSITY", "P_ESCAPE", "P_POTENTIAL", "P_GRAVITY", "P_OMEGA",
"P_INCLINATION_ERROR_MAX", "P_INCLINATION_ERROR_MIN", "P_INCLINATION", "P_ECCENTRICITY_ERROR_MAX", "P_ECCENTRICITY_ERROR_MIN", "S_TYPE", "P_ECCENTRICITY",
"P_IMPACT_PARAMETER_ERROR_MIN", "P_IMPACT_PARAMETER_ERROR_MAX", "P_IMPACT_PARAMETER", "P_MASS_ERROR_MAX", "P_MASS_ERROR_MIN", "P_HILL_SPHERE",
"P_SEMI_MAJOR_AXIS_ERROR_MIN", "P_SEMI_MAJOR_AXIS_ERROR_MAX", "P_MASS"], axis=1)

So, we’re all done with missing values, right? Sorry, we still have a bunch left to deal with! We can impute null values in several ways. Imputation fills in a missing entry by inferring a reasonable value from the rest of the data. Different types of data might be better imputed using different methods. Let’s take a look at the categorical missing values first.

Categorical values are non-numerical and describe the subject in some way. For example, the “P_TYPE” column in our data describes the exoplanet category and consists of six possible values: Jovian, Superterran, Neptunian, Terran, Subterran, and Miniterran. This feature had 17 missing values.

To handle missing values in text-based data, we can use the mode of the column. For example, if most planets are Terran, we can fill the missing values in that column with “Terran” and we might be right. There were three categorical columns with missing data, so the following code block fills all of them using a method called ‘fillna’. This does essentially what it sounds like: it fills all null/missing rows in a column with the desired value.

df["P_TYPE"] = df["P_TYPE"].fillna(df["P_TYPE"].mode()[0])
df["P_TYPE_TEMP"] = df["P_TYPE_TEMP"].fillna(df["P_TYPE_TEMP"].mode()[0])
df["S_TYPE_TEMP"] = df["S_TYPE_TEMP"].fillna(df["S_TYPE_TEMP"].mode()[0])

Finally, there are still missing values in the numerical data, which can be imputed in a few different ways. The most common approach uses scikit-learn’s SimpleImputer, a univariate technique that works on one feature column at a time. Given the number of missing values spread across multiple columns in this data, I thought it could be a good use case for multivariate imputation, which uses the other feature columns to estimate reasonable values for the entries that are missing. This technique is slightly more complex, but I thought it would give more useful data in the missing rows.

Scikit-learn’s IterativeImputer does exactly this: it models each feature that has missing values as a function of the other features, looping over the features in several rounds. This multivariate technique can give more accurate estimates of the missing values in each column. However, it is also a bit more complicated and still marked as experimental in scikit-learn’s documentation, so you may want to try the SimpleImputer instead if you’re newer to machine learning.
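For reference, a univariate fill with the SimpleImputer we imported earlier might look something like this. It’s only a sketch (the median strategy and the restriction to numeric columns are my choices, not a required part of the tutorial):

# Univariate alternative: fill each numeric column with its own median
num_cols = df.select_dtypes(include=np.number).columns
simple = SimpleImputer(strategy='median')
df[num_cols] = simple.fit_transform(df[num_cols])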

To implement the iterative imputer, the following libraries should be imported (also shown above). The first line enables the iterative imputer from Scikit Learn’s experimental category and the second actually imports it.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

Next, we can apply the imputer to our data as follows. The imputer only accepts numeric input, so the first line selects the numeric columns; the second creates an iterative imputer object with its default parameters. Then we fit the imputer to the numeric data and transform it to get our results. This returns a numpy array rather than a pandas dataframe, so we use ‘pandas.DataFrame’ to slot the imputed values back into our dataframe.

num_cols = df.select_dtypes(include=np.number).columns # Selects the numeric columns (the imputer can't handle text)
imp = IterativeImputer(max_iter=10, verbose=0) # Creates an iterative imputer object
imputed = imp.fit_transform(df[num_cols]) # Fits the imputer to the numeric data and transforms it
df[num_cols] = pd.DataFrame(imputed, columns=num_cols, index=df.index) # Writes the imputed values back into the dataframe

And with that, we have corrected all of our missing values! But we still have a major class imbalance to contend with. Since we know of very few potentially habitable planets, the non-habitable exoplanets vastly outnumber the habitable ones. This will likely push our model to guess that an exoplanet is non-habitable even when it isn’t, because doing so still yields something like 99% accuracy. Thus, we need to find a way to balance the scales.
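You can see the imbalance for yourself with a quick count (the exact numbers depend on the catalog version you downloaded):

# How many planets fall into each habitability class?
print(df["P_HABITABLE"].value_counts())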

Again, there are plenty of different algorithms to explore for this step, but first let’s look at the broader picture. There are two basic techniques for balancing the class distribution in a dataset: upsampling and downsampling, and each has merits in different situations. Upsampling adds data to the smaller class, while downsampling reduces the size of the larger class. In our situation, I would argue that upsampling is the better bet: if we downsampled the non-habitable planets, we would be left with very little data to work with.

Upsampling, again, can be accomplished with different strategies. I chose the Synthetic Minority Oversampling Technique (SMOTE for short). Rather than simply repeating the exact same rows over and over again, this algorithm creates new synthetic rows by interpolating between a minority-class row and its nearest neighbors. I used the 5 nearest neighbors in my implementation, but the results didn’t seem to change much with other neighbor counts. The code block below shows how to implement SMOTE using the previously imported imblearn class, after one small preparation step.
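One practical note before running it: SMOTE, like the imputers, only works on numeric input, so any columns that are still plain text (planet names, the type columns we filled earlier, and so on) need to be encoded or set aside first. Here is a minimal sketch using the LabelEncoder we imported at the start; treat it as one possible approach rather than the tutorial’s exact step.

# Encode any remaining text columns as integers so SMOTE can handle them
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))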

# Handling class imbalance with the SMOTE algorithm
seed = 100 # Random seed for reproducibility
k = 5 # Number of nearest neighbors used by SMOTE

X = df.loc[:, df.columns != "P_HABITABLE"] # Features
y = df["P_HABITABLE"] # Class label

smote = SMOTE(sampling_strategy='auto', k_neighbors=k, random_state=seed) # Creates a SMOTE object with the parameters designated above
X_res, y_res = smote.fit_resample(X, y) # Resamples the input data

df = pd.concat([pd.DataFrame(X_res), pd.DataFrame(y_res)], axis=1) # Concatenates dataframes of the resampled X and y data
df.head()

The first two lines of code create variables for a random seed (used by the algorithm when it randomly generates the synthetic samples) and the number of neighbors. Next, we define the X data (all of the features) and the y data (the class labels). We then create a SMOTE object with those parameters and fit it to the data so that it can be resampled. Finally, we rebuild our dataframe from the resampled feature and class data. The output is shown below.
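A quick check confirms that the resampling worked as intended:

# Each class should now contain roughly the same number of rows
print(df["P_HABITABLE"].value_counts())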

Preprocessed Data

Now we have about 4000 samples in each class, which should let us train a much more balanced and useful model. That’s about it for the basic preprocessing! But I can’t let you leave without a few more fun graphs, so let’s dig a little deeper into the correlations between features. This matters, especially with so many feature columns, because highly correlated features carry redundant information and can lead the model to lean on relationships between features rather than anything actually tied to habitability. We can see how strongly two variables are correlated using a heatmap and a pandas method called corr, as shown below.

# Correlation Heatmap
corr = df.corr() # Pairwise correlation between the feature columns
mask = np.triu(np.ones_like(corr, dtype=bool)) # Masks the upper triangle so each pair is shown only once
pal = sns.diverging_palette(260, 330, as_cmap=True) # Diverging palette for the heatmap (any colormap works here)
f, ax = plt.subplots(figsize=(20, 20))
# Plots a heatmap of the correlation between each feature column.
# Values near 1 (or -1) mean the two features are highly correlated.
sns.heatmap(corr, mask=mask, cmap=pal, center=0, square=True, annot=False, linewidths=.5, cbar_kws={"shrink": 0.9})

Don’t worry too much about the plotting details, but if you’re curious about how to come up with a fun color scheme for your heatmap, check out this great blog post by Michael Blow! I used seaborn to plot the heatmap of the correlation results, which is shown below. The bright pink squares indicate that the two corresponding variables are highly related to one another and might cause unwanted connections in our machine learning model.

Correlation Heatmap

As you can probably see, there are several bright pink squares in our heatmap. I decided to remove one of each of the correlated pairs.
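If you’d rather list those pairs programmatically instead of eyeballing the plot, here is a quick sketch that reuses the corr and mask objects from the heatmap code above (the 0.9 cutoff is my own choice, not a fixed rule):

# List feature pairs whose absolute correlation exceeds the cutoff
high_corr = (
    corr.where(~mask)                 # Keep only the lower triangle so each pair appears once
        .stack()                      # (feature_a, feature_b) -> correlation value
        .loc[lambda s: s.abs() > 0.9] # Keep the strongly correlated pairs
        .sort_values(ascending=False)
)
print(high_corr)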

#Drops one of each of the highly correlated pairs
working_data = df.drop(['S_NAME', 'P_RADIUS', 'P_RADIUS_ERROR_MIN', 'P_RADIUS_ERROR_MAX', 'P_DISTANCE', 'P_PERIASTRON', 'P_APASTRON',
'P_DISTANCE_EFF', 'P_FLUX_MIN', 'P_FLUX_MAX', 'P_TEMP_EQUIL', 'P_TEMP_EQUIL_MIN', 'P_TEMP_EQUIL_MAX',
'S_RADIUS_EST', 'S_RA_H', 'S_RA_T', 'S_LUMINOSITY', 'S_HZ_OPT_MIN', 'S_HZ_OPT_MAX', 'S_HZ_CON_MIN',
'S_HZ_CON_MAX', 'S_HZ_CON0_MIN', 'S_HZ_CON0_MAX', 'S_HZ_CON1_MIN', 'S_HZ_CON1_MAX', 'S_SNOW_LINE',
'P_PERIOD_ERROR_MIN', 'P_PERIOD_ERROR_MAX', 'S_MAG', 'S_DISTANCE_ERROR_MIN', 'S_DISTANCE_ERROR_MAX',
'S_METALLICITY', 'S_METALLICITY_ERROR_MIN', 'S_METALLICITY_ERROR_MAX', 'S_AGE', 'S_TEMPERATURE_ERROR_MIN',
'S_TEMPERATURE_ERROR_MAX', 'S_ABIO_ZONE', 'P_ESI', 'S_CONSTELLATION_ABR', 'P_SEMI_MAJOR_AXIS_EST'], axis=1)

Dropping those columns gives us a new heatmap, shown below. Most of the remaining features are only weakly correlated with one another, so the model should be in better shape overall.
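For completeness, the updated figure just comes from re-running the same plotting code on the reduced dataframe:

# Re-plot the correlation heatmap on the reduced feature set
corr2 = working_data.corr()
mask2 = np.triu(np.ones_like(corr2, dtype=bool))
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr2, mask=mask2, cmap=pal, center=0, square=True, annot=False, linewidths=.5, cbar_kws={"shrink": 0.9})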

Updated Correlation Heatmap

And that’s about it! We’re ready to move on to feature selection in the next part! You can find all of the code in this tutorial and the other parts in the series here. See you next time!

