Exoplanet Habitability P2 — Feature Selection

Jordan
5 min read · Jan 19, 2023


Image of a planet (decorative)
Photo by Daniel Olah on Unsplash

Hello everyone! Last time, we left off after finishing all of the preprocessing we needed to do. If you missed that article, check it out here before moving on. Today, we’re going to focus on feature selection, the process of eliminating unimportant features in order to simplify our model.

Before we eliminate some of our whopping 117 remaining features, we should split the data into a training and a testing set. This allows us to test our model on unseen data from the testing set later on. Scikit-Learn makes this pretty easy with its train_test_split function. The following code block shows how this works.

#Imports the splitting function
from sklearn.model_selection import train_test_split

#Gets X (feature data) and y (class label) data
X = working_data.loc[:, working_data.columns != "P_HABITABLE"]
y = working_data["P_HABITABLE"]

#Uses train_test_split to split the data into a training and
#testing set with 25% of the data going to the latter.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=1)

First, we take the portions of the current data frame that correspond to the features and the class labels, respectively. (Note that the dataset is called working_data after our preprocessing step.) Next, we use the function mentioned above to split both the X feature data and the y class labels, keeping 75% for training. So, we should be left with a test set that is 25% the size of the original data. We can confirm this simply by printing:

#Prints the ratio of training to testing data. 
print("Training/testing: ", len(X_train), "/", len(X_test))
# Gives Training/testing: 8984 / 2995
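As a quick sanity check (this snippet isn’t in the original walkthrough, just a minimal sketch using the variables defined above), we can compute the test fraction directly from the split sizes:

#Computes the fraction of the data that ended up in the test set.
#With the counts printed above (8984 train / 2995 test), this is roughly 0.25.
test_fraction = len(X_test) / (len(X_train) + len(X_test))
print("Test fraction: %0.2f" % test_fraction)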

Great! Now we can move on to feature selection. Feature selection is useful to optimize our machine learning models (which we’ll work on in part three) because it simplifies what can be an overly complex situation. For example, say you’re deciding between two books. You have information about the price, reviews, publisher information, and the date the book was released. I would (personally) argue that the cost and popularity of the books would be more important than the publishing data. Having these unnecessary features might be confusing. Maybe one book seems to have better reviews, but the other came from a more well-known publisher or has more recent information. It might be more difficult to decide which book is better. This information overload leads to our machine learning models taking longer to ‘make decisions’ while training. So, we should try to eliminate any unnecessary information before proceeding.

There are different ways of selecting the most important features in a dataset. I chose a model-based method: fit a machine learning model to the data and then use its learned feature weights as importance scores. To start, let’s try a random forest classifier. The following code snippet defines the classifier and fits it to our data.

#Imports the classifier
from sklearn.ensemble import RandomForestClassifier

#Gets the feature column labels
feat_labels = X_train.columns

#Creates a Random Forest Classifier with 500 estimators.
forest = RandomForestClassifier(n_estimators=500, random_state=1)

#Fits the Random Forest Classifier to the training data.
forest.fit(X_train, y_train)

After the import, the first line gets the names of each of our features. Next, a random forest classifier is instantiated with some general parameters. Finally, the classifier is fit to our training data. We can now use a property of the classifier to get the feature importances.

#Imports NumPy for sorting the importances
import numpy as np

#Gets the importances of the features.
importances = forest.feature_importances_

#Gets the indices that sort the features from most to least important.
indices = np.argsort(importances)[::-1]

We also use argsort to store the feature indices ordered from most to least important (the [::-1] reverses argsort’s ascending order).
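Before plotting, it can help to print the ranked importances. This snippet isn’t from the original post; it’s a minimal sketch using the variables defined above:

#Prints the ten most important features with their importance scores.
for rank, idx in enumerate(indices[:10], start=1):
    print("%2d) %-30s %f" % (rank, feat_labels[idx], importances[idx]))

Now, let’s create a bar chart showing each feature’s importance!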

#Imports Matplotlib for plotting
import matplotlib.pyplot as plt

#Creates a bar chart depicting this data.
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        align='center',
        color="#ffc7f5")

#Labels the x-axis with the feature names and shows the chart
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()

The results are shown below. As you can see, the mass and age of the planet are the two most important features, whereas the distance is the least important.

Bar chart of feature importances.
Feature Importance

Now we can use a Scikit-Learn class called SelectFromModel, which uses our random forest classifier to determine which features should be kept based on a given importance threshold. The following command sets up the SelectFromModel object.

#Imports the selector
from sklearn.feature_selection import SelectFromModel

#Creates the SelectFromModel object using the (already fitted) forest
#from above and the desired threshold.
sfm = SelectFromModel(forest, threshold=0.01, prefit=True)

This means that any features with importance below 1% will be removed. Now we can get the selected X data by using the transform method.

#Gets the feature columns from the select from model object. 
X_selected_train = sfm.transform(X_train)
X_selected_test = sfm.transform(X_test)

We can cast the arrays to DataFrame objects for easier use later on; the pandas DataFrame constructor does this. The second part of the code block below simply sets the column names of the dataframes we created.

#Creates dataframes from the selected train and test data.
A = pd.DataFrame(X_selected_train)
B = pd.DataFrame(X_selected_test)

#Sets the column names of these dataframes for viewability.
features = X_train.loc[:, sfm.get_support()].columns
A.columns = features
B.columns = A.columns
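To confirm which features survived the threshold (this check isn’t in the original post; it’s a minimal sketch using the features variable defined above), we can print their count and names:

#Counts and lists the features kept by SelectFromModel.
print("Number of selected features:", len(features))
print(list(features))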

This process leaves us with only 15 features, far fewer than the 117 we started with! This leaves us with the following information.

Dataset with our selected features.
Dataset with selected features

Now we can move on to training! I trained and tested several models to get a baseline for which ones perform the best before we move on to hyperparameter tuning. Consider the code block below. The models dictionary holds all of the classifiers I wanted to test, keyed by a name that is used for printing in the next step. Speaking of which, the loop that follows fits each model to the selected training data and estimates its accuracy with 5-fold cross-validation using another function called cross_val_score.

#Imports the models and the cross-validation helper
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

#Creates a dictionary of all of the models I wanted to test
models = {
    "Logistic Regression": LogisticRegression(random_state=0, max_iter=10),
    "Decision Tree": tree.DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=5),
    "Support Vector": SVC(),
    "KNN": KNeighborsClassifier(),
    "SGD": SGDClassifier(),
    "Naive Bayes": GaussianNB()
}

#Loops through the dictionary above.
for name, model in models.items():
    model.fit(A, y_train) #Fits each model to the selected training data
    scores = cross_val_score(model, A, y_train, cv=5) #Evaluates the model with 5-fold cross-validation
    print(name + " trained")
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)) #Prints the accuracy results.

This results in the following scores. As you can see, we already achieve 100% accuracy with both the random forest and decision tree classifiers!

Accuracy Scores
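Since we set aside a testing set earlier, one optional sanity check (not part of the original post, just a minimal sketch using the objects defined above) is to score the fitted models on that unseen data as well:

#Scores each already-fitted model on the held-out, feature-selected test set.
for name, model in models.items():
    print(name + " test accuracy: %0.2f" % model.score(B, y_test))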

So, yes, we could definitely stop here after achieving such high accuracy scores. But, I think it will be fun to see how much we can improve some of the lower-performing models using a grid search. Stay tuned for the next edition of this series to learn how to do just that!
