Predicting Pulsar Stars

Andrew Bergman
Published in Analytics Vidhya · 4 min read · Sep 17, 2019

Astronomy and astrophysics have always been a pet interest of mine, so I jumped at the chance to work with data from the High Time Resolution Universe Survey. The data set I worked with can be found here on Kaggle.

Schematic of a pulsar with the jet in blue

What’s A Pulsar?

A pulsar is a type of stellar remnant, usually a neutron star, formed from the collapsed core of a giant star (a star 10x to 29x the mass of the Sun). Pulsars have very strong magnetic fields and emit beams of electromagnetic radiation like a lighthouse, but we can only detect them when the beam is angled toward Earth.

Why?

Pulsars are not very common, and automating the process of identifying them would help astronomers and physicists study them: pulsars have been used to study nuclear physics and General Relativity, and they even helped prove the existence of gravitational waves.

Pulsar spinning

Cleaning & Preprocessing

The data was surprisingly clean: there were no missing values! The only real cleaning step I took was to shorten the names of the columns because they were so long.

When visualizing the data, I noticed that there are some very strong correlations in the data, both negative and positive, which informed how I approached the preprocessing work that I did.

There were two parts to my preprocessing: transforming the data and creating interaction columns. The first was easy enough: some of the columns are close to normally distributed, so I squared their values to make them more normal. The second was creating interaction columns: there was a very strong correlation between the mean & standard deviation and between the skew & excess kurtosis of each data type (the integrated profile and the DM-SNR curve).
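As a sketch of those two steps (the values are made up, and the shortened column names mean_ip, std_ip, skew_ip, and kurt_ip are my own illustrative choices, not the dataset's originals):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the HTRU2 data; column names are illustrative
df = pd.DataFrame({
    "mean_ip": [140.5, 102.5, 103.0],
    "std_ip":  [55.7, 58.9, 39.3],
    "skew_ip": [-0.23, 0.47, 0.32],
    "kurt_ip": [-0.70, -0.52, 1.05],
})

# Step 1: squaring a near-normal column to push it closer to normal
df["mean_ip_sq"] = df["mean_ip"] ** 2

# Step 2: interaction columns for the strongly correlated pairs
df["mean_x_std"] = df["mean_ip"] * df["std_ip"]
df["skew_x_kurt"] = df["skew_ip"] * df["kurt_ip"]
```

The same two interactions would then be repeated for the DM-SNR columns.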

Modeling

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.metrics import recall_score
from sklearn.preprocessing import StandardScaler

My imports were fairly straightforward and were mostly for the modeling process.

The neural networks are composed of three basic layers:

  • Input layer: the features (columns) from the data
  • Hidden layer: the part of the model that makes predictions; linear combinations of the inputs are modified with weights and biases and then passed through a ReLU activation function, which forces each output to be positive or zero.
  • Output layer: the results of the model. A sigmoid function (like the logit link in a logistic regression) “bends” the predictions to fall between 0 and 1.
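The three layers above can be sketched as a single forward pass in NumPy; the weights below are random illustrative values, not the trained network:

```python
import numpy as np

def relu(z):
    # Hidden-layer activation: clips negatives to zero
    return np.maximum(0.0, z)

def sigmoid(z):
    # Output activation: squashes any score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# One sample with 4 input features (the real data has more columns)
x = np.array([0.5, -1.2, 0.3, 2.0])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output

hidden = x @ W1 + b1            # weights and biases...
hidden = relu(hidden)           # ...then ReLU: all values >= 0
prob = sigmoid(hidden @ W2 + b2)    # sigmoid "bends" the score into (0, 1)
pred = int(prob[0] > 0.5)           # threshold to pulsar / non-pulsar
```

Keras's Sequential model with Dense layers wires up exactly this kind of stack, with the weights learned from data instead of drawn at random.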

Because of how I created the interaction columns, I created three subsets of the data and modeled on each; the architecture stays the same, so whichever subset produces the best scores yields the best model.

Evaluation & Best Model

My best model was the neural network I ran with the interaction features. That being said, the difference between the models was minute.

Each of the models was evaluated on five metrics in total:

  • Accuracy: how many predictions were correct
  • Specificity: of all actual non-pulsars, how many are correctly identified
  • Sensitivity: of all actual pulsars, how many are correctly identified
  • Matthews Corr. Coef.: how similar the predictions and true values are
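All four metrics can be read off a confusion matrix with scikit-learn; this is a sketch on toy labels, not my actual predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, accuracy_score

# Toy labels: 1 = pulsar, 0 = non-pulsar (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])

# confusion_matrix returns [[tn, fp], [fn, tp]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = accuracy_score(y_true, y_pred)  # fraction of correct predictions
specificity = tn / (tn + fp)                  # true non-pulsars caught
sensitivity = tp / (tp + fn)                  # true pulsars caught
mcc         = matthews_corrcoef(y_true, y_pred)
```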

I optimized to tolerate false positives: pulsars are very important to astronomers, so a false negative is much worse than a false positive. As it turns out, the models as a whole had very few false negatives. The accuracy score is much better than the baseline, both because the baseline is 9.1% and because the negative class was predicted better than the positive. The Matthews Correlation Coefficient shows that the predictions and true values are very similar.

The ROC curve plots the neural network’s ability to distinguish between pulsars and non-pulsars. The curve itself shows the relationship between sensitivity and the false positive rate. More important, however, is the AUC (area under the curve) because it summarizes how well the model separates the two classes. The lowest possible score is 0.5 and my best model’s score is 0.91437, which indicates a high level of performance.
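A minimal sketch of computing the curve and its AUC with scikit-learn, on made-up sigmoid outputs rather than my model's real scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative probabilities: 1 = pulsar, 0 = non-pulsar
y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

auc = roc_auc_score(y_true, y_score)               # 0.5 = chance, 1.0 = perfect
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points for the ROC plot
```

Note that AUC uses the raw probabilities, not the thresholded 0/1 predictions, which is why it captures class separation better than accuracy alone.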

Conclusions

I was able to predict pulsars satisfactorily, but there is still room for improvement: these stars are important to science so having a higher sensitivity is important.

I would like to continue with the interaction features and regularization, and experiment with different sampling techniques: either down-sampling the majority class or up-sampling the minority class.
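As a sketch of the up-sampling idea, sklearn.utils.resample can duplicate minority-class rows until the classes balance (toy data below, not the HTRU2 set):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 1 = pulsar (minority), 0 = non-pulsar (majority)
df = pd.DataFrame({"feat": np.arange(10), "target": [0] * 8 + [1] * 2})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Up-sample the minority class with replacement to match the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
```

Down-sampling the majority is the mirror image: resample the majority frame with replace=False and n_samples=len(minority).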

This project was a lot of fun to work on because astronomy and astrophysics have always been a pet interest of mine and because of how important these stars are to science.

The repository for the project can be found here.

I can also be reached on LinkedIn.
