Pulsar Candidate Classification

“But we soon realized that what we were actually saying is that it’s small in size and big in mass — hence, a neutron star or a white dwarf” — Jocelyn Bell Burnell

Sowmya Krishnan
The Startup
7 min read · Sep 27, 2020


Two years of grueling labor by Jocelyn Bell Burnell and Antony Hewish went into building a new radio telescope, the Interplanetary Scintillation Array, at the Mullard Radio Astronomy Observatory. Spread over four and a half acres of land, the array consisted of 2,048 dipoles that constantly surveyed the sky at a time resolution of about 0.1 second. The telescope was built specifically to study quasars, now understood to be powered by supermassive black holes. Bell got results almost immediately, discovering plenty of quasars with the telescope. But on August 6, 1967, Bell observed and logged for the first time an odd stretch of data occupying about 5 millimeters of the 500 meters of paper readouts.

Source: https://blog.csiro.au/pioneer-pulsars-pops-into-parkes/

When the signal reappeared in subsequent logs, Bell concluded it was a genuine radio signal and reported it to Hewish.

“His first reaction was that it was manmade,” she recalls. “He came out to the observatory the next day at the appropriate time and saw [it] with his own eyes and began to believe much more that this was not terrestrial — that it was moving with the stars.”

Burnell spotted the radio signal again in November 1967, and this first known pulsar was eventually nicknamed Little Green Man 1 (LGM-1) and designated PSR B1919+21. With each pulse lasting only 0.3 seconds, the rapid rise and fall of the signal meant the source had to be small. But the pulse also repeated every 1.337 seconds with extreme regularity, which meant the source had to have great reserves of energy.

That was like saying the source was big and small at the same time!

Eventually Bell, Hewish and the team of scientists concluded that the source was small in size but big in mass. That could only mean a neutron star or a white dwarf.

The Pulsar Classification Problem

Identifying pulsars is a prerequisite for studying both pulsars themselves and gravitational waves. With an ever-growing amount of survey data, manual visual classification can no longer be relied upon to screen pulsar candidates. The emergence of machine learning and related technology has simplified this area of astronomical research.

Individual pulses differ in shape and amplitude, so astronomers stack many of them to create an average integrated profile of the pulse. The pulse's arrival time also varies across radio frequencies: lower frequencies arrive later. This frequency-dependent delay is known as dispersion, and it looks like this:

Source: https://www.researchgate.net/figure/The-effect-of-dispersion-on-a-pulsar-signal-courtesy-of-Lorimer-and-Kramer-3_fig3_271457910

The DM-SNR curve (dispersion measure versus signal-to-noise ratio) records how strong the folded signal is for each trial dispersion measure used to compensate for this delay. From these two curves, the integrated profile and the DM-SNR curve, we get the eight numerical features on which a pulsar candidate can be evaluated (a sketch of how such statistics can be computed follows the list). These are:

  1. Integrated Profile:
    a) Mean
    b) Standard deviation
    c) Kurtosis
    d) Skewness
  2. DM-SNR Curve:
    a) Mean
    b) Standard deviation
    c) Kurtosis
    d) Skewness
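
The sketch below is not part of the original analysis; it only illustrates how these four statistics might be computed for a single curve with NumPy and SciPy, using a synthetic profile as input.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def profile_statistics(curve):
    """Return mean, standard deviation, excess kurtosis and skewness of a 1-D curve."""
    curve = np.asarray(curve, dtype=float)
    return {
        "mean": curve.mean(),
        "std": curve.std(),
        "kurtosis": kurtosis(curve),  # excess kurtosis
        "skewness": skew(curve),
    }

# Example with a synthetic integrated profile (a noisy Gaussian pulse)
phase = np.linspace(0, 1, 128)
profile = np.exp(-((phase - 0.5) ** 2) / 0.001) + 0.05 * np.random.randn(128)
print(profile_statistics(profile))
```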

Machine Learning for Pulsar Identification

In this post, I'll explain the basic machine learning theory associated with pulsar candidate recognition. You can find the repository for the complete code here:

The dataset used for this classification problem can be found here: https://archive.ics.uci.edu/ml/datasets/HTRU2

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators.

Import the necessary libraries:
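
A minimal set of imports covering the steps below might look like this; pandas, seaborn and scikit-learn are assumed, along with Matplotlib for plotting.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
```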

Let us explore the first five rows of the dataset.
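
A sketch of loading the data, assuming the HTRU2 CSV has been downloaded locally; the file name and column names below are my own labels (the file ships without a header row).

```python
# Eight statistical features plus the binary target.
columns = [
    "ip_mean", "ip_std", "ip_kurtosis", "ip_skewness",
    "dm_mean", "dm_std", "dm_kurtosis", "dm_skewness",
    "target",
]
df = pd.read_csv("HTRU_2.csv", header=None, names=columns)
print(df.head())
```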

Knowing our target variable, let us explore the proportion of each class.
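
One way to do this, assuming the column names from the loading step above:

```python
# Count and plot the two classes: 0 = noise/RFI, 1 = real pulsar.
print(df["target"].value_counts(normalize=True))

sns.countplot(x="target", data=df)
plt.title("Class distribution (0 = noise, 1 = pulsar)")
plt.show()
```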

We get the following plot.

From the plot above we see that approximately 10% of our samples are real pulsars, while the other 90% of detected signals are most probably noise. A heatmap of the dataset shows us the correlation between its different features.
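
A sketch of the heatmap, assuming the DataFrame loaded above:

```python
# Correlation matrix across the eight features and the target.
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlation matrix")
plt.show()
```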

We can see that four of the eight features in our dataset have a positive correlation with our target variable, while the other four have a negative correlation. This is helpful, as the separation between the classes becomes clearer from this plot.

To make sure our hypothesis that the classes separate well along most features is true, we'll now construct a pairplot between the features to confirm this visually.
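
A pairplot along these lines (subsampling is my own addition, just to keep the figure manageable):

```python
# Pairwise scatter plots coloured by class, on a random subsample of rows.
sample = df.sample(n=2000, random_state=42)
sns.pairplot(sample, hue="target", corner=True)
plt.show()
```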

The dataset can now be split into training and test sets, with eight input features and one output (the class label).
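
For example, with scikit-learn's train_test_split (the 80/20 split and stratification are my choices, not necessarily those of the original code):

```python
# Eight input features, one binary output.
X = df.drop("target", axis=1).values
y = df["target"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```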

KNN Classification

One of the best-known and most effective methods of supervised classification is the k-nearest neighbors (k-NN) approach. The k-NN algorithm stores all the available training data and classifies a new data point based on its similarity to the stored points, so when new data appears it can be assigned to the category of its nearest neighbors.
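
A minimal k-NN fit with scikit-learn might look like this (k = 5 is the library default, not a tuned value):

```python
# Fit a k-nearest neighbors classifier and evaluate on the held-out test set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN accuracy:", knn.score(X_test, y_test))
```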

Random Forest Classification

Another popular supervised ML algorithm for classification problems is the random forest. Each tree in the forest learns from a bootstrap sample of the complete training dataset (a technique known as bagging) and considers only a random subset of the features at each split.
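
A sketch with scikit-learn's RandomForestClassifier (100 trees is the library default, not a tuned value):

```python
# Each tree sees a bootstrap sample of the training data and a random
# subset of features at every split.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random forest accuracy:", rf.score(X_test, y_test))
```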

Decision Tree Classification

Decision trees are another form of supervised machine learning: a predictive model built by an algorithm that recursively splits the dataset in different ways based on conditions on the features.
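
A sketch with scikit-learn's DecisionTreeClassifier (the depth limit is my own choice, not from the original code):

```python
# A single decision tree; limiting depth guards against overfitting.
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision tree accuracy:", dt.score(X_test, y_test))
```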

Performance Metrics

Theory for this particular section has been outlined in my previous post (https://towardsdatascience.com/multivariate-logistic-regression-in-python-7c6255a286ec)

Following are the precision, recall and f-score values for the various models discussed above:
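
These can be produced with classification_report, assuming the predictions made above (only the three classifiers sketched here are included):

```python
# Precision, recall and F1-score for each fitted model.
models = {"KNN": y_pred_knn, "Random Forest": y_pred_rf, "Decision Tree": y_pred_dt}
for name, y_pred in models.items():
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred, digits=3))
```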

Confusion matrices for the four models are as shown below:
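
For example, with confusion_matrix, again assuming the predictions from the sketches above:

```python
# Rows are true classes, columns are predicted classes.
for name, y_pred in models.items():
    print(f"{name}:\n{confusion_matrix(y_test, y_pred)}\n")
```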

For a more visual, threshold-independent comparison of the models, we'll use the ROC (receiver operating characteristic) curve. The predict_proba function gives the predicted probability of each target class, from which the curve is built.
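
The original post plots micro- and macro-averaged curves; the sketch below shows a simpler per-model ROC, assuming the classifiers fitted earlier:

```python
# ROC curves from predicted probabilities of the positive (pulsar) class.
plt.figure(figsize=(8, 6))
for name, model in [("KNN", knn), ("Random Forest", rf), ("Decision Tree", dt)]:
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], "k--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curves")
plt.legend()
plt.show()
```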

We see there is a mismatch between the micro average and the macro average curve here. This indicates a class imbalance and a higher number of misclassified data points from the minority class.

Conclusion

One of the things that has come out of the discovery of pulsars is that we now know more about the space between the stars. Despite an arsenal of techniques to study and identify pulsars, we have yet to arrive at a definitive model that predicts results confidently while dealing with the characteristic complexities of these neutron stars. The dataset used in this particular problem is highly imbalanced, with only about 10% of samples belonging to the pulsar class; weight adjustment and sampling techniques would be required to further reduce this imbalance. Still, as a first pass, the models presented here do the required job.

While working on this dataset, I came across terms like 'integrated profile' and 'DM-SNR curve' and the numerical features associated with them, namely mean, standard deviation, skewness and kurtosis. With more background on the significance of these metrics and the associated mathematics, I hope to touch upon them in forthcoming articles.
