Ariel Goldberger
Jan 4 · 7 min read

This article shows how the integrated electromagnetic emission profile of a pulsing neutron star, combined with data-analysis algorithms, can be used to classify these dense rotating stars across the universe.

The basics of Pulsars

After a high-mass star collapses under its own gravity and superheated plasma is expelled at high speed (a phenomenon we call a supernova), an enormously dense object made mostly of neutrons is formed. During the collapse, protons, electrons, and other subatomic particles are compressed together and merge to form neutrons. A neutron star packs roughly the mass of the entire Sun into a sphere around 20 km across. Now, if you take a spinning object and shrink it, its rotation rate increases, just like an ice skater pulling in her arms while spinning. The same is true for the star’s core: the collapsed object can rotate several times per second. Its magnetic field also grows dramatically. Because of the rapid rotation and the strong magnetic field, beams of energy sweep away from the star like the beams of a lighthouse, which from Earth we see as a blink or a pulse. These beams are normally in the invisible part of the spectrum, mainly radio waves, though some pulsars also emit X-rays and gamma rays.
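The ice-skater analogy can be put into numbers with a back-of-the-envelope angular-momentum calculation. The radii and initial period below are illustrative round figures, not measurements:

```r
# Conservation of angular momentum for a uniform sphere: I * omega = constant,
# with I proportional to M * R^2, so the spin rate scales as (R_before / R_after)^2.
r_core <- 7e3             # assumed radius of the collapsing stellar core, km (illustrative)
r_ns   <- 10              # neutron star radius, km (~20 km across)
p_core <- 30 * 24 * 3600  # assumed initial rotation period: 30 days, in seconds

spin_up <- (r_core / r_ns)^2  # factor by which the rotation speeds up: 4.9e5
p_ns    <- p_core / spin_up   # resulting rotation period: about 5.3 seconds
cat(sprintf("Spin-up factor: %.2e, new period: %.2f s\n", spin_up, p_ns))
```

Even with these rough numbers, the collapsed core ends up rotating on the second scale; real pulsars span roughly milliseconds to seconds.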

Simulated pulsar — by NASA

Fortunately, our telescopes can capture these signals and show us the unique radio pulse profile of each star. While single pulses (the emission from each individual rotation) are highly variable, integrated profiles are consistent and stable over time. These metrics will be our first set of observations. Our second set of data corrects for the effect of the interstellar medium via the Dispersion Measure (DM); in other words, it helps us distinguish pulsars from other sources of radio-frequency noise scattered across the near-emptiness of space.
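“Integrating” a profile means folding: averaging many noisy single pulses aligned at the rotation period, so random noise cancels while the pulse shape remains. A toy simulation (with an assumed Gaussian pulse shape, not real telescope data) illustrates why the integrated profile is stable:

```r
# Fold 500 noisy single pulses into one integrated profile.
set.seed(42)
phase    <- seq(0, 1, length.out = 100)          # rotation phase bins
template <- exp(-((phase - 0.5)^2) / 0.002)      # assumed Gaussian pulse shape
single_pulses <- replicate(500, template + rnorm(100, sd = 1))  # 500 noisy pulses
integrated    <- rowMeans(single_pulses)         # the integrated (folded) profile

# Off-pulse noise shrinks like 1/sqrt(n): sd ~ 1/sqrt(500), about 0.045
sd(integrated[phase < 0.3])
```

Any single column of `single_pulses` is dominated by noise, but their average recovers the pulse shape, which is exactly why integrated profiles make reliable features.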

Single and integrated profiles of a pulsar — by A. Lyne

Required and selected libraries

An R environment is required to run the following codes:

library(ggplot2) 
library(rpart)
library(GGally)
library(dplyr)
library(DataExplorer)
library(Rmisc)
library(caret)
library(e1071)
library(rpart)
library(MASS)
library(randomForest)
library(nnet)
library(pROC)

The elegance of this data science exercise lies in the simplicity of the data, considering that astronomical data is usually extremely large and complex. In this case, we will use the first four statistical moments (mean, standard deviation, excess kurtosis, and skewness) of the IP (Integrated Profile) and the DM curve. The dataset contains 17,898 observations taken from The High Time Resolution Universe Pulsar Survey I. Download the dataset here.
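For intuition, the eight features are just these four moments computed twice: once over the integrated profile and once over the DM curve. A minimal base-R sketch of the computation (using the common definitions of skewness and excess kurtosis, which may differ slightly from the survey’s exact pipeline):

```r
# First four moments of a signal: mean, standard deviation,
# excess kurtosis (0 for a normal distribution), and skewness.
moments4 <- function(x) {
  m <- mean(x)
  s <- sd(x)
  z <- (x - m) / s                 # standardized values
  c(mean = m,
    std  = s,
    exK  = mean(z^4) - 3,          # excess kurtosis
    ske  = mean(z^3))              # skewness
}

# Illustrative example on simulated data:
set.seed(1)
round(moments4(rnorm(1000)), 3)
```

Applying `moments4` to each star’s integrated profile and DM curve yields exactly the eight numeric columns of the dataset.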

Data exploration

Exploring data about space is, indeed, space exploration.

# Loading the data set:
df = read.csv('pulsar_stars.csv')
# Renaming the columns to a coding-friendly nomenclature
colnames(df) = c('mean_IP','std_IP','exK_IP','ske_IP','mean_DM','std_DM','exK_DM','ske_DM','target')
# Setting the correct variable class
df['target'] = as.factor(df$target)
# Separating data for testing (75% for training); seed set for reproducibility
set.seed(123)
selection_index = sort(sample(nrow(df), nrow(df)*.75))
df_train = df[selection_index,]
df_test = df[-selection_index,]

Notice that we use 75% of the observations to train the models and the remaining 25% for testing, which means we evaluate on data the models have never seen and therefore avoid rewarding overfit models. However, note that only around a couple of thousand of the records are classified as pulsars, which represents a challenge for evaluation. Relative metrics are more suitable here, as is defining a benchmark: for instance, classifying all stars as non-pulsars already yields about 90% accuracy.
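That benchmark follows directly from the class proportions. Using the published HTRU2 counts (1,639 pulsars out of 17,898 records):

```r
# Null-model benchmark: classify every record as "not a pulsar".
n_total   <- 17898   # records in the HTRU2 survey data
n_pulsars <- 1639    # positive (pulsar) records
null_accuracy <- (n_total - n_pulsars) / n_total
round(null_accuracy, 3)   # 0.908, i.e. the ~90% benchmark quoted above
```

Any model worth keeping must clear this bar, which is why raw accuracy alone is a weak yardstick on imbalanced data.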

summary(df)

A first glance at our dataset reveals eight numeric variables and one binary factor variable called ‘target’, which indicates whether the observation corresponds to a pulsar or not.

Summary of variables

Since our purpose is to identify the pulsars from other objects in space, we will have a first look at the distribution of the variables separated by this categorical variable. As you might recognize from the plots below, the conclusions of this first study are simple and clear:

(0) There is no missing data

# Checking data-set structure
plot_intro(df)

(1) We are facing an imbalanced dataset, where most of the observations are not pulsars
(2) Some pulsars may have very distinguishing characteristics which will allow us to easily spot them, but others can be easily confused with other celestial objects.

It is because of this hard-to-detect group that we need machine learning techniques, which can combine the measures taken by the radio telescope into multiple automated rules. This approach makes fewer mistakes than manual rules (higher accuracy) and benefits from fast computation (lower cost).

# Pulsars and not pulsars (distribution)
ggplot0 <- ggplot(df, aes(x=target, fill=target))+geom_bar()
ggplot1 <- ggplot(df, aes(x=target, y=mean_IP, color=target)) + geom_boxplot()

Apparently, all variables play a relevant role in identifying our target variable. Although the correlation between the skewness of DM and the factor variable is low, we will not ignore this variable in our models.

# Studying variables relationship
ggpairs(df, aes(colour = target))

Modeling

In case you are curious about how these models perform classification, you can check the application I built to easily see how each model works behind the scenes. (click here)

We will be testing five different models that can face the classification challenge. Some parameters have been tuned to get the best performance out of each model. Once the models are fitted, we will run them on the test set and see how well each one did by comparing the predicted category against the real category.

# Naive Bayes model
mod_NB <- naiveBayes(target ~ ., data = df_train)
pred_NB = data.frame(target = predict(mod_NB, newdata = df_test[,-9], type = "class"))
# Linear Discriminant Analysis model
mod_LDA <- lda(target ~ ., data = df_train)
pred_LDA = data.frame(target = predict(mod_LDA, newdata = df_test[,-9])$class)
# Random Forest model (cutoff penalizes votes for the pulsar class)
mod_RF <- randomForest(target ~ ., data = df_train, cutoff = c(.25,.75), ntree = 200)
pred_RF = data.frame(target = predict(mod_RF, newdata = df_test[,-9]))
# Support Vector Machine model
mod_SVM <- svm(target ~ ., data = df_train, kernel = "linear")
pred_SVM = data.frame(target = predict(mod_SVM, newdata = df_test[,-9]))
# Neural network model (multinomial log-linear model via nnet)
mod_NN <- nnet::multinom(target ~ ., data = df_train, maxit = 1000, trace = FALSE)
pred_NN = data.frame(target = predict(mod_NN, newdata = df_test[,-9]))

Remember 90% accuracy is our null model benchmark. Fortunately, every model beat this benchmark. The results can be seen in the following confusion matrix and accuracy levels:

Model Evaluation

# Checking model performance
NB = confusionMatrix(pred_NB$target, df_test$target)
NB$overall['Accuracy']
AUC_NB = auc(as.numeric(df_test$target), as.numeric(pred_NB$target))
# After repeating the above for each model, collect the accuracies:
acc = c(NB = NB$overall[['Accuracy']])  # extend with LDA, RF, SVM, NN
barplot(acc, xlab = 'model', ylab = 'accuracy', col = 'orange', ylim = c(0.92,1), xpd = FALSE)
Confusion Matrix per each model

Since we have unbalanced data, we might be more interested in evaluating the models using the ROC curve and the AUC (area under the ROC curve). These metrics carry more weight when selecting the best model.
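A ROC curve simply traces the true-positive rate against the false-positive rate as the decision threshold sweeps over the model’s scores, and the AUC is the area under that trace. A self-contained sketch with made-up labels and scores (not output from the models above):

```r
# ROC by hand: sweep a threshold over scores, record TPR and FPR at each step.
truth  <- c(0, 0, 0, 1, 0, 1, 1, 1)                   # made-up class labels
scores <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)   # made-up model scores
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[truth == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[truth == 0] >= t))

# Area under the curve by the trapezoidal rule (prepend the (0, 0) point):
auc_value <- sum(diff(c(0, fpr)) * (head(c(0, tpr), -1) + tpr) / 2)
round(auc_value, 3)   # 0.938 for this toy example
```

An AUC of 0.5 is random guessing and 1.0 is perfect separation, which is why it is a fairer score than accuracy when 90% of the records belong to one class.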

roc_NB = roc(as.numeric(df_test$target), as.numeric(pred_NB$target))
plot(roc_NB, col = 1, lty = 2, main = "NB ROC")
Receiver Operating Characteristic of the model
AUC_NB = auc(as.numeric(df_test$target), as.numeric(pred_NB$target))
barplot(AUC_NB, xlab = 'model', ylab = 'Area Under Curve', col = 'gray', ylim = c(0.80,1), xpd = FALSE)

The numbers above show that a random forest model with a cutoff adjustment that penalizes false positives (non-pulsars classified as pulsars) is the highest performer among our five models. This model misclassified 99 records (only 10 of them false positives) and correctly classified 4,376 records. In total, 314 observations were flagged as pulsars based on the variables and model used.
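Assuming the 314 flagged observations are the model’s predicted positives (an interpretation of the counts above, not an output printed by the code), precision and recall follow by simple arithmetic:

```r
# Derived metrics from the reported random forest results:
n_wrong  <- 99    # total misclassified test records
fp       <- 10    # false positives among them
pred_pos <- 314   # records predicted as pulsars (assumed)

fn <- n_wrong - fp        # remaining errors are false negatives: 89
tp <- pred_pos - fp       # true positives among predicted pulsars: 304
precision <- tp / pred_pos        # fraction of flagged stars that are pulsars
recall    <- tp / (tp + fn)       # fraction of real pulsars that were found
round(c(precision = precision, recall = recall), 3)
```

Under this reading, the cutoff buys very high precision at the cost of some recall, a sensible trade when follow-up observation time is expensive.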


Conclusion

There is an arsenal of techniques to identify pulsars; however, we rarely get a reliable model that uses only a few characteristics of a phenomenon as complex as these neutron stars. Here, a random forest model delivered the best classification. Without a doubt, the model presented in this article could be improved with further feature engineering and parameter tuning, but as a first approach, the results are robust, cheap, and fast.


My eureka moment was in the dead of night, the early hours of the morning, on a cold, cold night, and my feet were so cold, they were aching. But when the result poured out of the charts, you just forget all that. You realize instantly how significant this is — what it is you’ve really landed on — and it’s great!

- Jocelyn Bell Burnell [About her discovery of the first pulsar radio signals.]

Duke AI Society Blog

Duke Artificial Intelligence Society Blog

Written by Ariel Goldberger

M.Sc. Quantitative Analytics, Duke University - M.Sc. Financial Engineering, Adolfo Ibanez University
