Pulsars Detection, Hyperparameter Tuning and Visualization

Jatin Kataria
Published in Analytics Vidhya
8 min read · Aug 2, 2020

Explore the power of Yellowbrick and mlxtend!

Pulsars are highly magnetized rotating neutron stars that emit beams of electromagnetic radiation out of their magnetic poles. Pulsars are one of the candidates for the source of ultra-high-energy cosmic rays. They are important as they help study extreme states of matter, search for exoplanets, measure cosmic distances and find gravitational waves.

We will classify pulsar candidates and cover every stage of the problem: EDA with an automated EDA module, feature selection using feature importance, hyperparameter tuning with RandomizedSearchCV, handling imbalanced classes, and visualization with Yellowbrick and mlxtend.

For the source code please visit the following link:

For accessing dataset, visit the following link:

Prerequisites

yellowbrick, mlxtend and imblearn need to be installed separately. Install them with pip from either the Command Prompt or the Anaconda Prompt (imblearn is published on PyPI under the name imbalanced-learn):

$ pip install yellowbrick
$ pip install mlxtend
$ pip install imbalanced-learn

Importing Packages

import timeit
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import mlxtend
import warnings
warnings.filterwarnings("ignore")
from ClfAutoEDA import *

ClfAutoEDA is an automated EDA for virtually any classification problem. Its usage can be understood by referring to my previous article:

Loading the Data

# Load the pulsar dataset from the csv file using pandas
df=pd.read_csv('pulsar_stars.csv')

Run the automated EDA program

#set the values of EDA function parameters and then run the program
labels=["non-pulsar","pulsar"]
target_variable_name='target_class'
df_processed,num_features,cat_features=EDA(df,labels,
                                           target_variable_name,
                                           data_summary_figsize=(6,6),
                                           corr_matrix_figsize=(6,6),
                                           corr_matrix_annot=True,
                                           pairplt=True)

Voila! It automatically gives you the following data description and plots

The data looks like this: 
Mean of the integrated profile ... target_class
0 140.562500 ... 0
1 102.507812 ... 0
2 103.015625 ... 0
3 136.750000 ... 0
4 88.726562 ... 0
[5 rows x 9 columns]
The shape of data is: (17898, 9)
The missing values in data are:
target_class 0
Skewness of the DM-SNR curve 0
Excess kurtosis of the DM-SNR curve 0
Standard deviation of the DM-SNR curve 0
Mean of the DM-SNR curve 0
Skewness of the integrated profile 0
Excess kurtosis of the integrated profile 0
Standard deviation of the integrated profile 0
Mean of the integrated profile 0
dtype: int64
The summary of data is:
Mean of the integrated profile ... target_class
count 17898.000000 ... 17898.000000
mean 111.079968 ... 0.091574
std 25.652935 ... 0.288432
min 5.812500 ... 0.000000
25% 100.929688 ... 0.000000
50% 115.078125 ... 0.000000
75% 127.085938 ... 0.000000
max 192.617188 ... 1.000000
[8 rows x 9 columns]
Some useful data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mean of the integrated profile 17898 non-null float64
1 Standard deviation of the integrated profile 17898 non-null float64
2 Excess kurtosis of the integrated profile 17898 non-null float64
3 Skewness of the integrated profile 17898 non-null float64
4 Mean of the DM-SNR curve 17898 non-null float64
5 Standard deviation of the DM-SNR curve 17898 non-null float64
6 Excess kurtosis of the DM-SNR curve 17898 non-null float64
7 Skewness of the DM-SNR curve 17898 non-null float64
8 target_class 17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB
None
The columns in data are:
[' Mean of the integrated profile'
' Standard deviation of the integrated profile'
' Excess kurtosis of the integrated profile'
' Skewness of the integrated profile' ' Mean of the DM-SNR curve'
' Standard deviation of the DM-SNR curve'
' Excess kurtosis of the DM-SNR curve' ' Skewness of the DM-SNR curve'
'target_class']
The target variable is divided into:
0 16259
1 1639
Name: target_class, dtype: int64
The numerical features are:
[' Mean of the integrated profile', ' Standard deviation of the integrated profile', ' Excess kurtosis of the integrated profile', ' Skewness of the integrated profile', ' Mean of the DM-SNR curve', ' Standard deviation of the DM-SNR curve', ' Excess kurtosis of the DM-SNR curve', ' Skewness of the DM-SNR curve', 'target_class']
The categorical features are:
[]
Execution Time for EDA: 0.48 minutes

Looking at all the above plots and description we can conclude:

  • No null values in the dataset
  • Imbalanced classes
  • Data is quite separable
  • No categorical features
  • Some features are highly skewed

Skewed features are usually transformed (log, Box-Cox, etc.) to bring them closer to a normal distribution. Here, however, the skewness arises because the features that describe a pulsar candidate are themselves measurements of skewness, standard deviation and so on, so we keep these features as they are.

RadViz plot

RadViz is a multivariate data visualization algorithm that places each feature dimension uniformly around the circumference of a circle and then plots each data point in the interior of the circle.

It is used to detect separability between classes i.e. is there an opportunity to learn from the feature set or is there just too much noise?

#dividing the X and the y
X=df_processed.drop([target_variable_name], axis=1)
y=df_processed[target_variable_name]
#RadViz plot
from yellowbrick.features import RadViz
visualizer = RadViz(classes=labels, features=X.columns.tolist(), size=(800,300))
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()

This plot shows that there is not much noise and that the classes seem to be quite separable.
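
To make the RadViz idea more concrete, here is a minimal NumPy sketch of the projection it is based on. This is not Yellowbrick's implementation; the function name radviz_coordinates and the toy array are made up for illustration. Each feature is min-max scaled, one anchor per feature is placed on the unit circle, and every row lands at the average of the anchors weighted by its feature values.

```python
import numpy as np

def radviz_coordinates(X):
    """Project the rows of X into the unit disc in the spirit of a RadViz plot."""
    X = np.asarray(X, dtype=float)
    # Min-max scale every feature to [0, 1] so each one pulls with comparable strength
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant features
    X_norm = (X - X.min(axis=0)) / span
    # Place one anchor per feature, evenly spaced on the unit circle
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each row is the average of the anchors, weighted by its scaled feature values
    weights = X_norm.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0  # a row of all zeros stays at the centre
    return (X_norm @ anchors) / weights

# Toy data: two rows dominated by a single feature, one row with all features equal
toy = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 1.0, 1.0]])
points = radviz_coordinates(toy)
print(points.round(3))
```

A row dominated by one feature is pulled towards that feature's anchor, while a row with balanced features stays near the centre. Classes whose points gather in different parts of the circle are therefore likely to be separable.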

Feature Importance

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
#import classification models
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
logreg=LogisticRegression()
SVM=SVC()
knn=KNeighborsClassifier()
etree=ExtraTreesClassifier(random_state=42)
rforest=RandomForestClassifier(random_state=42)
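scaler=StandardScaler()
features=X_train.columns.tolist()
#fit the scaler on the training set only, then reuse it on the test set to avoid leakage
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
#assumption: no features are dropped, so the "sfs" arrays used later are simply the scaled arrays
X_train_sfs_scaled=X_train_scaled
X_test_sfs_scaled=X_test_scaled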
#get feature importance
start_time = timeit.default_timer()
mod=etree
# fit the model
mod.fit(X_train_scaled, y_train)
# get importance
importance = mod.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
df_importance=pd.DataFrame({'importance':importance},index=features)
df_importance.plot(kind='barh')
elapsed = timeit.default_timer() - start_time
print('Execution Time for feature selection: %.2f minutes'%(elapsed/60))

When a dataset has a large number of features, this plot lets us keep only as many of the top features as we choose. Here, we will use all 8 features.
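
If we did want to keep only the top features, the selection could be sketched as below. The helper top_k_features and the scores are hypothetical and only illustrate the idea; they are not the importances obtained above.

```python
import numpy as np

def top_k_features(feature_names, importances, k):
    """Return the k feature names with the largest importance, best first."""
    order = np.argsort(importances)[::-1][:k]
    return [feature_names[i] for i in order]

# Made-up scores for illustration only; they are not the article's results
names = ["f0", "f1", "f2", "f3"]
scores = np.array([0.10, 0.45, 0.05, 0.40])
selected = top_k_features(names, scores, k=2)
print(selected)
```

With the variables from the code above, the call would be top_k_features(features, importance, k), and the returned names could then be used to subset the columns of X_train and X_test before scaling.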

Hyperparameter Tuning

In contrast to GridSearchCV, not all parameter values are tried out in RandomizedSearchCV, but rather a fixed number of parameter settings is sampled from the specified distributions and therefore it significantly speeds up the search.
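
The difference is easy to see by counting how many parameter settings each search evaluates. The sketch below uses a small synthetic dataset from make_classification and small n_estimators values purely to keep it fast. The names X_demo, y_demo, grid and rand_demo are illustrative and not part of the pulsar pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic dataset, used only to keep the example quick
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {"criterion": ["gini", "entropy"], "n_estimators": [10, 20, 30]}

# Grid search evaluates every combination: 2 criteria x 3 sizes = 6 settings
grid = GridSearchCV(ExtraTreesClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_demo, y_demo)

# Randomized search evaluates only n_iter sampled settings
rand_demo = RandomizedSearchCV(ExtraTreesClassifier(random_state=42), param_grid,
                               n_iter=3, cv=3, random_state=42)
rand_demo.fit(X_demo, y_demo)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand_demo.cv_results_["params"])
print("grid search settings:", n_grid)
print("randomized search settings:", n_rand)
```

Note that n_iter defaults to 10. When the parameter lists contain fewer than 10 combinations in total, scikit-learn warns and simply tries them all, so the speed-up only appears for larger search spaces.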

#create a list of models you want to try for random search
models=[knn,rforest,etree,SVM]
#create a list of dictionaries containing hyperparameters of different models
param_distributions=[
    {'n_neighbors':[5,10,15]},
    {'criterion':['gini', 'entropy'],'n_estimators':[100,200,300]},
    {'criterion':['gini', 'entropy'],'n_estimators':[100,200,300]},
    {'kernel':['rbf','linear'],'C':[0.1,1,10],'gamma':[0.1,0.01,0.001]}]
#run a randomized search for each model with its own hyperparameter dictionary
for model,params in zip(models,param_distributions):
    rand=RandomizedSearchCV(model,param_distributions=params,cv=3,scoring='accuracy', n_jobs=-1, random_state=42,verbose=10)
    rand.fit(X_train_sfs_scaled,y_train)
    print(rand.best_params_,rand.best_score_)

It gives the best parameters and accuracy for ExtraTreesClassifier: {'n_estimators': 300, 'criterion': 'gini'} 0.979565772669221

We will use these tuned hyperparameters and fit the model to the data.

Fitting and Visualizing the data

We will fit the data using the tuned Extra Trees classification model and use Yellowbrick to visualize its performance. The visualizations comprise a confusion matrix, a classification report and a ROC-AUC curve.

#fitting the data with tuned model
classes=['Non-Pulsar','Pulsar']
model=ExtraTreesClassifier(n_estimators=300,criterion='gini',random_state=42)
model.fit(X_train_sfs_scaled,y_train)
y_pred = model.predict(X_test_sfs_scaled)
#Creating a visualization function using Yellowbrick
from yellowbrick.classifier import ConfusionMatrix, ClassificationReport, ROCAUC
from sklearn.metrics import classification_report
def yellowbrick_visualizations(model,classes,X_tr,y_tr,X_te,y_te):
    visualizer=ConfusionMatrix(model,classes=classes)
    visualizer.fit(X_tr,y_tr)
    visualizer.score(X_te,y_te)
    visualizer.show()

    visualizer = ClassificationReport(model, classes=classes, support=True)
    visualizer.fit(X_tr,y_tr)
    visualizer.score(X_te,y_te)
    visualizer.show()

    visualizer = ROCAUC(model, classes=classes)
    visualizer.fit(X_tr,y_tr)
    visualizer.score(X_te,y_te)
    visualizer.show()


yellowbrick_visualizations(model,classes,X_train_sfs_scaled,y_train,X_test_sfs_scaled,y_test)

Looking at the reports, we can see that recall for pulsar detection is only 83.3%. This is because we did not rectify the class imbalance, so the model is biased towards non-pulsars, which make up 90.84% of the total data.
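
A quick calculation shows why accuracy alone hides this problem. Using the class counts from the EDA output, a trivial model that labels every candidate as a non-pulsar (a hypothetical baseline, not a model trained in this article) already reaches about 90.84% accuracy while detecting no pulsars at all.

```python
# Class counts reported by the EDA output
n_non_pulsar, n_pulsar = 16259, 1639
total = n_non_pulsar + n_pulsar

# A trivial model that labels every candidate as non-pulsar
true_positives = 0             # no pulsar is ever predicted
false_negatives = n_pulsar     # every pulsar is missed
true_negatives = n_non_pulsar  # every non-pulsar is labelled correctly

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}, pulsar recall = {recall:.1f}")
```

This is why recall for the pulsar class is the number to watch here, and why it may be worth trying a different scoring option than 'accuracy' in the randomized search.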

Rectifying Class Imbalance

We can use various techniques such as SMOTE, Near Miss and random sampling to deal with class imbalance. We have used SMOTE for this problem, but you can try the other rectifiers and see how the results change.

from imblearn.over_sampling import SMOTE,RandomOverSampler,BorderlineSMOTE
from imblearn.under_sampling import NearMiss,RandomUnderSampler
smt = SMOTE()
nr = NearMiss()
bsmt=BorderlineSMOTE(random_state=42)
ros=RandomOverSampler(random_state=42)
rus=RandomUnderSampler(random_state=42)
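#Use one of these class imbalance rectifiers to balance the training data only
#(fit_resample replaces the older fit_sample, which was removed from recent imbalanced-learn releases)
X_train_bal, y_train_bal = smt.fit_resample(X_train_sfs_scaled, y_train)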
print(np.bincount(y_train_bal))
#fit the tuned model with balanced data
model_bal=model
model_bal.fit(X_train_bal, y_train_bal)
y_pred = model_bal.predict(X_test_sfs_scaled)

yellowbrick_visualizations(model_bal,classes,X_train_bal, y_train_bal,X_test_sfs_scaled,y_test)

Looking at the reports, we can see that recall for pulsar detection has improved to nearly 90% (from 83.3%) while precision remains high (88.8%).
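
For intuition about what SMOTE adds to the training set, here is a simplified NumPy sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is only an illustration, not imbalanced-learn's implementation. The function smote_like_samples and the random toy data are made up for the example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick random minority rows and, for each, one of its k nearest neighbours
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place each new point at a random position on the segment between the pair
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class: 20 random points in 2 dimensions
toy_rng = np.random.default_rng(0)
minority = toy_rng.normal(size=(20, 2))
synthetic = smote_like_samples(minority, n_new=30)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority samples, the new data fills in the region the minority class already occupies instead of duplicating existing rows, which is the main difference from random oversampling.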

Decision Region

A decision region is an area marked by cuts in the pattern space. All of the patterns within a usable decision region belong to the same class. As a result, the location of a pattern (that is, which decision region it lies in) can be used to classify it.

We will use mlxtend to plot the decision region for our tuned Extra Trees Classifier by first reducing the number of features to 2 using PCA.

from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions as plot_dr
#Plot decision region
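def plot_classification(model,X_t,y_t):
    clf=model
    #reduce the features to 2 principal components so the regions can be drawn
    pca = PCA(n_components = 2)
    X_t2 = pca.fit_transform(X_t)
    #refit the classifier on the 2D data, since it must predict in the plotted space
    clf.fit(X_t2,np.array(y_t))
    plot_dr(X_t2, np.array(y_t), clf=clf, legend=2)
plot_classification(model_bal,X_test_sfs_scaled,y_test)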

To learn more about Yellowbrick and mlxtend, see their documentation:

Before You Go

Thanks for reading! Feel free to apply this methodology to your classification problems. If you have any difficulty or any doubts kindly comment below. Your support is always highly appreciated. If you want to get in touch with me, reach me on jatin.kataria94@gmail.com.
