Pulsars Detection, Hyperparameter Tuning and Visualization

Published in

Analytics Vidhya

8 min read · Aug 2, 2020

Explore the power of Yellowbrick and mlxtend!

Pulsars are highly magnetized rotating neutron stars that emit beams of electromagnetic radiation out of their magnetic poles. Pulsars are one of the candidates for the source of ultra-high-energy cosmic rays. They are important as they help study extreme states of matter, search for exoplanets, measure cosmic distances and find gravitational waves.

We will build a pulsar classifier and cover every stage of the problem: EDA with an automated EDA module, feature selection using feature importance, hyperparameter tuning with RandomizedSearchCV, handling imbalanced classes, and visualization with Yellowbrick and mlxtend.

For the source code please visit the following link:

jatinkataria94/Pulsar-star-detection (github.com)

To access the dataset, visit the same repository:

jatinkataria94/Pulsar-star-detection (github.com)

Prerequisites

yellowbrick, mlxtend and imblearn need to be installed separately. Install them with pip from the Command Prompt or Anaconda Prompt (imblearn is published on PyPI as imbalanced-learn):

$ pip install yellowbrick
$ pip install mlxtend
$ pip install imbalanced-learn

Importing Packages

import timeit
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import mlxtend
import warnings
warnings.filterwarnings("ignore")

# ClfAutoEDA is the author's automated EDA module; it must be on the Python path
from ClfAutoEDA import *

ClfAutoEDA is an automated EDA module for virtually any classification problem. Its usage is explained in my previous article:

Automated EDA for Classification (medium.com): Exploratory Data Analysis made simple in a few lines of code!

Loading the Data

# Load the pulsar dataset from the csv file using pandas
df=pd.read_csv('pulsar_stars.csv')

Run the automated EDA program

# Set the values of the EDA function parameters and then run the program
labels = ["non-pulsar", "pulsar"]
target_variable_name = 'target_class'

df_processed, num_features, cat_features = EDA(df, labels,
                                         target_variable_name,
                                         data_summary_figsize=(6,6),
                                         corr_matrix_figsize=(6,6),
                                         corr_matrix_annot=True,
                                         pairplt=True)

Voila! It automatically gives you the following data description and plots:

The data looks like this: 
     Mean of the integrated profile  ...  target_class
0                       140.562500  ...             0
1                       102.507812  ...             0
2                       103.015625  ...             0
3                       136.750000  ...             0
4                        88.726562  ...             0

[5 rows x 9 columns]

The shape of data is:  (17898, 9)

The missing values in data are: 
 target_class                                     0
 Skewness of the DM-SNR curve                    0
 Excess kurtosis of the DM-SNR curve             0
 Standard deviation of the DM-SNR curve          0
 Mean of the DM-SNR curve                        0
 Skewness of the integrated profile              0
 Excess kurtosis of the integrated profile       0
 Standard deviation of the integrated profile    0
 Mean of the integrated profile                  0
dtype: int64

The summary of data is: 
         Mean of the integrated profile  ...  target_class
count                     17898.000000  ...  17898.000000
mean                        111.079968  ...      0.091574
std                          25.652935  ...      0.288432
min                           5.812500  ...      0.000000
25%                         100.929688  ...      0.000000
50%                         115.078125  ...      0.000000
75%                         127.085938  ...      0.000000
max                         192.617188  ...      1.000000

[8 rows x 9 columns]

Some useful data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0    Mean of the integrated profile                17898 non-null  float64
 1    Standard deviation of the integrated profile  17898 non-null  float64
 2    Excess kurtosis of the integrated profile     17898 non-null  float64
 3    Skewness of the integrated profile            17898 non-null  float64
 4    Mean of the DM-SNR curve                      17898 non-null  float64
 5    Standard deviation of the DM-SNR curve        17898 non-null  float64
 6    Excess kurtosis of the DM-SNR curve           17898 non-null  float64
 7    Skewness of the DM-SNR curve                  17898 non-null  float64
 8   target_class                                   17898 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.2 MB
None

The columns in data are: 
 [' Mean of the integrated profile'
 ' Standard deviation of the integrated profile'
 ' Excess kurtosis of the integrated profile'
 ' Skewness of the integrated profile' ' Mean of the DM-SNR curve'
 ' Standard deviation of the DM-SNR curve'
 ' Excess kurtosis of the DM-SNR curve' ' Skewness of the DM-SNR curve'
 'target_class']

The target variable is divided into: 
 0    16259
1     1639
Name: target_class, dtype: int64

The numerical features are: 
 [' Mean of the integrated profile', ' Standard deviation of the integrated profile', ' Excess kurtosis of the integrated profile', ' Skewness of the integrated profile', ' Mean of the DM-SNR curve', ' Standard deviation of the DM-SNR curve', ' Excess kurtosis of the DM-SNR curve', ' Skewness of the DM-SNR curve', 'target_class']

The categorical features are: 
 []

Execution Time for EDA: 0.48 minutes

Looking at all the above plots and description we can conclude:

No null values in the dataset
Imbalanced classes
Data is quite separable
No categorical features
Some features are highly skewed

Skewed features are usually transformed (log, Box-Cox, etc.) to bring them closer to a normal distribution. In this case, however, the skewness arises because the features used to describe pulsars are themselves measurements of skewness, standard deviation and so on, so we keep these features as they are.
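To make the usual transformation step concrete, here is a minimal sketch on synthetic data (not the pulsar dataset). The helper `sample_skewness` and the log-normal sample are assumptions introduced only for illustration; the sketch shows how a log transform can pull a right-skewed variable back toward symmetry.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment of a 1-D sample (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic, right-skewed data: a log-normal sample (illustrative only)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# The log of a log-normal variable is normal, so its skewness is close to 0
print("skewness before log:", sample_skewness(skewed))
print("skewness after log: ", sample_skewness(np.log(skewed)))
```

For the pulsar features this step is deliberately skipped, for the reason given above.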

RadViz plot

It is a multivariate data visualization algorithm that plots each feature dimension uniformly around the circumference of a circle then plots points on the interior of the circle.

It is used to detect separability between classes i.e. is there an opportunity to learn from the feature set or is there just too much noise?
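The placement rule described above can be sketched in a few lines of NumPy. This is an illustration of the general RadViz idea under simple assumptions (min-max scaling, anchors evenly spaced on the unit circle); `radviz_project` is a hypothetical helper, and Yellowbrick's own implementation may differ in its details.

```python
import numpy as np

def radviz_project(X):
    """Project rows of X (n_samples, n_features) into the unit circle, RadViz style."""
    X = np.asarray(X, dtype=float)
    # Min-max scale each feature to [0, 1] so every feature pulls with comparable strength
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant features
    W = (X - X.min(axis=0)) / span
    # Place one anchor per feature, evenly spaced on the unit circle
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each sample lands at the weighted average of the anchors
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # rows that are all zero after scaling stay at the centre
    return (W @ anchors) / totals

# Tiny made-up example with 4 features
X_demo = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
points = radviz_project(X_demo)
print(points)
```

A sample dominated by one feature is pulled to that feature's anchor, while a sample with equal values on all features sits at the centre. Classes that occupy different parts of the circle are therefore likely to be separable.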

# Divide the data into features X and target y
X = df_processed.drop([target_variable_name], axis=1)
y = df_processed[target_variable_name]

# RadViz plot
from yellowbrick.features import RadViz
visualizer = RadViz(classes=labels, features=X.columns.tolist(), size=(800, 300))
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()

This plot shows that there is not much noise and the classes seem to be quite separable.

Feature Importance

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Import classification models
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

logreg = LogisticRegression()
SVM = SVC()
knn = KNeighborsClassifier()
etree = ExtraTreesClassifier(random_state=42)
rforest = RandomForestClassifier(random_state=42)

scaler = StandardScaler()
features = X_train.columns.tolist()

# Fit the scaler on the training data only, then reuse it for the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Get feature importance
start_time = timeit.default_timer()
mod = etree
# Fit the model
mod.fit(X_train_scaled, y_train)
# Get importance
importance = mod.feature_importances_
# Summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# Plot feature importance
df_importance = pd.DataFrame({'importance': importance}, index=features)
df_importance.plot(kind='barh')
elapsed = timeit.default_timer() - start_time
print('Execution Time for feature selection: %.2f minutes' % (elapsed/60))

When a dataset has a large number of features, this plot lets us pick as many of the top-ranked features as we like. Here, we will use all 8 features.
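If you do want to keep only the top features, the selection itself is a simple sort. The sketch below uses made-up feature names and scores (`f0` to `f4` are hypothetical, not results from the pulsar data) and picks the `k` highest-scoring ones.

```python
# Hypothetical importance scores, for illustration only
importances = {'f0': 0.05, 'f1': 0.30, 'f2': 0.10, 'f3': 0.40, 'f4': 0.15}

# Keep the k features with the highest importance
k = 3
top_k = sorted(importances, key=importances.get, reverse=True)[:k]
print(top_k)  # ['f3', 'f1', 'f4']
```

With the real data, the same idea would apply to the importance values returned by the fitted tree ensemble.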

Hyperparameter Tuning

In contrast to GridSearchCV, not all parameter values are tried out in RandomizedSearchCV, but rather a fixed number of parameter settings is sampled from the specified distributions and therefore it significantly speeds up the search.
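The difference between the two strategies can be sketched with the standard library alone. This is an illustration of the idea, not scikit-learn's implementation: the search space is the SVC space used in this article, and `n_iter = 10` is assumed here to mirror the default number of sampled settings in RandomizedSearchCV.

```python
import itertools
import random

# SVC search space used in this article
space = {'kernel': ['rbf', 'linear'],
         'C': [0.1, 1, 10],
         'gamma': [0.1, 0.01, 0.001]}

# Grid search idea: enumerate every combination
keys = list(space)
grid = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]

# Random search idea: evaluate only a fixed number of sampled combinations
n_iter = 10
rng = random.Random(42)
sampled = rng.sample(grid, n_iter)

print(len(grid), len(sampled))  # 18 10
```

Each candidate is cross-validated, so with cv=3 the grid would need 54 fits while the random sample needs 30. The saving grows quickly as more hyperparameters or values are added.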

# Create a list of models you want to try for random search
models = [knn, rforest, etree, SVM]

# Create a list of dictionaries containing hyperparameters of the different models
param_distributions = [
    {'n_neighbors': [5, 10, 15]},
    {'criterion': ['gini', 'entropy'], 'n_estimators': [100, 200, 300]},
    {'criterion': ['gini', 'entropy'], 'n_estimators': [100, 200, 300]},
    {'kernel': ['rbf', 'linear'], 'C': [0.1, 1, 10], 'gamma': [0.1, 0.01, 0.001]},
]

# All 8 features are kept, so the scaled training data is used directly
for model, params in zip(models, param_distributions):
    rand = RandomizedSearchCV(model, param_distributions=params, cv=3,
                              scoring='accuracy', n_jobs=-1, random_state=42, verbose=10)
    rand.fit(X_train_scaled, y_train)
    print(rand.best_params_, rand.best_score_)

It gives the best parameters and accuracy for ExtraTreesClassifier: {'n_estimators': 300, 'criterion': 'gini'} 0.979565772669221

We will use these tuned hyperparameters and fit the model to the data.

Fitting and Visualizing the data

We will fit the tuned Extra Trees classification model to the data and use Yellowbrick to visualize its performance. The visualization consists of a confusion matrix, a classification report and a ROC-AUC curve.

# Fit the data with the tuned model
# (all 8 features are kept, so the scaled data is used directly)
classes = ['Non-Pulsar', 'Pulsar']
model = ExtraTreesClassifier(n_estimators=300, criterion='gini', random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Create a visualization function using Yellowbrick
from yellowbrick.classifier import ConfusionMatrix, ClassificationReport, ROCAUC
from sklearn.metrics import classification_report

def yellowbrick_visualizations(model, classes, X_tr, y_tr, X_te, y_te):
    # Confusion matrix
    visualizer = ConfusionMatrix(model, classes=classes)
    visualizer.fit(X_tr, y_tr)
    visualizer.score(X_te, y_te)
    visualizer.show()

    # Classification report (precision, recall, F1 and support per class)
    visualizer = ClassificationReport(model, classes=classes, support=True)
    visualizer.fit(X_tr, y_tr)
    visualizer.score(X_te, y_te)
    visualizer.show()

    # ROC-AUC curve
    visualizer = ROCAUC(model, classes=classes)
    visualizer.fit(X_tr, y_tr)
    visualizer.score(X_te, y_te)
    visualizer.show()


yellowbrick_visualizations(model, classes, X_train_scaled, y_train, X_test_scaled, y_test)

Looking at the reports, we can see that recall for pulsar detection is only 83.3%. This is because we did not rectify the class imbalance, so our model is biased towards non-pulsars, which make up 90.84% of the total data.
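The class counts printed by the EDA step show why accuracy alone is misleading here. The short calculation below uses those counts; the "always predict non-pulsar" classifier is a hypothetical baseline added only for comparison.

```python
# Class counts from the EDA output above
n_non_pulsar = 16259
n_pulsar = 1639
total = n_non_pulsar + n_pulsar

# Share of the majority class in the data
majority_share = n_non_pulsar / total

# Hypothetical baseline that labels every candidate as "non-pulsar"
baseline_accuracy = n_non_pulsar / total   # every non-pulsar is counted as correct
baseline_recall_pulsar = 0 / n_pulsar      # no pulsar is ever detected

print(f"majority share:    {majority_share:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline recall:   {baseline_recall_pulsar:.2%}")
```

Such a baseline already reaches about 90.84% accuracy while finding no pulsars at all, which is why recall on the pulsar class is the number to watch.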

Rectifying Class Imbalance

We can use various techniques such as SMOTE, NearMiss and random sampling to deal with class imbalance. We use SMOTE for this problem, but you can try the other rectifiers and see how the results change.
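To show what "balancing" means in practice, here is a minimal NumPy sketch of the simplest rectifier, random oversampling, on synthetic data. It is an illustration of the idea rather than imblearn's implementation, and it is not SMOTE (which creates new synthetic points instead of repeating existing ones). The arrays `X` and `y` below are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic imbalanced data: 90 majority samples (0) and 10 minority samples (1)
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 2))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep = np.concatenate([majority_idx, minority_idx, extra_idx])
X_bal, y_bal = X[keep], y[keep]

print(np.bincount(y), np.bincount(y_bal))  # [90 10] [90 90]
```

Only the training data should be balanced this way; the test set keeps its original class proportions so that the reported metrics reflect real conditions.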

from imblearn.over_sampling import SMOTE, RandomOverSampler, BorderlineSMOTE
from imblearn.under_sampling import NearMiss, RandomUnderSampler

smt = SMOTE()
nr = NearMiss()
bsmt = BorderlineSMOTE(random_state=42)
ros = RandomOverSampler(random_state=42)
rus = RandomUnderSampler(random_state=42)

# Use one of these class imbalance rectifiers to balance the training data
# (fit_resample replaces fit_sample, which recent imbalanced-learn releases no longer provide)
X_train_bal, y_train_bal = smt.fit_resample(X_train_scaled, y_train)
print(np.bincount(y_train_bal))

# Fit the tuned model with balanced data
# (model_bal is the same object as model, so this refits it)
model_bal = model
model_bal.fit(X_train_bal, y_train_bal)
y_pred = model_bal.predict(X_test_scaled)

yellowbrick_visualizations(model_bal, classes, X_train_bal, y_train_bal, X_test_scaled, y_test)

Looking at the reports, we can see that recall for pulsar detection has improved to nearly 90% (from 83.3%) with a high precision value (88.8%).

Decision Region

A decision region is an area of the pattern space marked off by cuts. All of the patterns within a usable decision region belong to the same class. As a result, a pattern can be classified by identifying which decision region it lies in.
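The idea can be illustrated with a toy rule in NumPy. In this sketch a hypothetical nearest-centroid classifier stands in for a real model, and the two centroids are made-up values; evaluating the rule on a grid labels every cell, and cells with the same label form a decision region.

```python
import numpy as np

# Two hypothetical class centroids in a 2-D pattern space (illustrative values)
centroids = np.array([[0.0, 0.0],   # class 0
                      [3.0, 3.0]])  # class 1

def classify(points):
    """Assign each 2-D point to the class of its nearest centroid."""
    # distances has shape (n_points, n_classes)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Evaluate the rule on a regular grid covering the pattern space
xs, ys = np.meshgrid(np.linspace(-2, 5, 8), np.linspace(-2, 5, 8))
grid = np.column_stack([xs.ravel(), ys.ravel()])
region = classify(grid).reshape(xs.shape)

print(region)
print(classify(np.array([[0.5, 0.2], [2.8, 3.1]])))  # [0 1]
```

Plotting libraries draw decision regions in essentially this way: they predict a label for every point of a fine grid and colour the grid by label.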

We will use mlxtend to plot the decision region for our tuned Extra Trees Classifier by first reducing the number of features to 2 using PCA.

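from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions as plot_dr

# Plot decision region
def plot_classification(model, X_t, y_t):
    clf = model
    # Reduce the features to 2 principal components so the regions can be drawn
    pca = PCA(n_components=2)
    X_t2 = pca.fit_transform(X_t)
    # Refit the classifier on the 2-D data (this overwrites the earlier fit)
    clf.fit(X_t2, np.array(y_t))
    plot_dr(X_t2, np.array(y_t), clf=clf, legend=2)

# All 8 features are kept, so the scaled test data is used directly
plot_classification(model_bal, X_test_scaled, y_test)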

To learn more about Yellowbrick and mlxtend, see their documentation:

Yellowbrick: Machine Learning Visualization - Yellowbrick v1.1 documentation (www.scikit-yb.org)

Home - mlxtend (rasbt.github.io): Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks.

Before You Go

Thanks for reading! Feel free to apply this methodology to your classification problems. If you have any difficulty or any doubts kindly comment below. Your support is always highly appreciated. If you want to get in touch with me, reach me on jatin.kataria94@gmail.com.

Written by Jatin Kataria