Following the Light to Discover New Worlds!

Rafael Ferreira
Published in The Startup
Feb 10, 2021 · 7 min read

Using machine learning and the Kepler dataset to detect planets orbiting stars in our galaxy.

Kepler-186f is the first exoplanet found to orbit its local star in the habitable zone.

Background

On April 17, 2014, the Kepler satellite made one of the most important discoveries in modern astronomy: it detected a planet orbiting its host star in the habitable zone, also called the Goldilocks zone. The find led to speculation about how many planets out there are like our own. You may be wondering how the Kepler spacecraft detected such a planet when the stars we observe are so far away, and the planets that orbit these stars are so much smaller than their host stars.

Kepler uses a technique widely used among astrophysicists and astronomers alike called the transit method: the star being observed shows a dip in brightness during the period of time that it is being observed. If the perceived dip in brightness happens cyclically, then astronomers can say with confidence that a planet is orbiting said star.
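To make the idea concrete, here is a toy sketch (not Kepler's actual pipeline) of how periodic dips can be flagged in a simulated light curve; the period, dip depth, and noise level are all made-up values for illustration:

```python
import numpy as np

# Simulate a flat light curve with small Gaussian noise around brightness 1.0
rng = np.random.default_rng(1)
flux = 1.0 + rng.normal(0.0, 0.001, 500)

# Inject a transit-like dip every 50 samples (toy period, width, and depth)
period, width, depth = 50, 3, 0.01
for start in range(20, flux.size, period):
    flux[start:start + width] -= depth

# Flag samples well below the baseline; repeated, evenly spaced dips
# are the cyclic signature the transit method looks for
dips = np.where(flux < 1.0 - 5 * 0.001)[0]
print(dips[:6])  # indices of the first flagged transit samples
```

In a real light curve, the spacing between successive dip events is what gives the orbital period of the candidate planet.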

The animation shows what the Kepler spacecraft observes: three planets orbiting a star, and how the dips in brightness reflect their presence.

The above image shows how Kepler detects these planets. Kepler measures the FLUX, or brightness, of the star. If the brightness of the star suddenly drops during the observation, the assumption is that an object is orbiting that star, most likely a planet. In order to interpret the differences in FLUX, we first need to understand what it means to have a detection of a planet. This is where machine learning comes into play.

Machine learning provides us with many tools for detecting phenomena within large datasets, including classification models. In order to use them effectively, we break our observations into four separate categories. For our purposes these four categories are:

  • True Positives (TP): The number of observations where the model predicts there is a planet (1), and there actually is a planet (1).
  • True Negatives (TN): The number of observations where the model predicts there is not a planet (0), and there, in fact, is not a planet present (0).
  • False Positives (FP): The number of observations where the model predicts that there is a planet (1), but there actually is not a planet (0).
  • False Negatives (FN): The number of observations where the model predicts there is not a planet (0), but there actually is a planet (1).
Confusion Matrix to show how precise our machine learning model actually is

Using the Exoplanet Hunting in Deep Space dataset, I was able to explore this phenomenon on my own. This labeled dataset catalogues fluxes for over 5,500 known stars, with and without planets. I will now show you how to train a model that can be used on future observations using Python.

First, I need to import all the necessary packages:

This dataset is already split into a training set and a test set, so no additional hold-out split is needed. I then want to explore the data to see how flux differs for stars with and without the presence of planets.
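On Kaggle the dataset ships as exoTrain.csv and exoTest.csv, with a label column followed by the flux measurements. Since those CSVs are not bundled here, the plotting step is sketched below against a small synthetic stand-in for the two kinds of flux series:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe backend for this sketch
import matplotlib.pyplot as plt

# Real data would come from the Kaggle CSVs, e.g.:
#   train = pd.read_csv('exoTrain.csv')  # label column, then flux columns
# Here we fake two flux series just to show the exploration step.
rng = np.random.default_rng(0)
t = np.arange(300)
flux_with_planet = 100 + 5 * np.sin(2 * np.pi * t / 60) + rng.normal(0, 1, t.size)
flux_without_planet = 100 + rng.normal(0, 1, t.size)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 5))
ax1.plot(t, flux_with_planet)
ax1.set_title('Flux of a star with a planet (synthetic)')
ax2.plot(t, flux_without_planet)
ax2.set_title('Flux of a star without a planet (synthetic)')
ax2.set_xlabel('Observation number')
fig.savefig('flux_exploration.png')
```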

Stars fluxes with the presence of planets
Stars fluxes without the presence of planets

What we see is that the fluxes of stars with planets, when graphed, show a more sinusoidal wave shape, whereas the stars without planets show a constant shape. From the image below, there is a drastic difference in the amount of flux observed with or without a planet. This tells us that the scaling for the two classes is drastically different, due to the light being blocked by a planet in comparison to a star that does not have any planet in orbit. I did not scale for this project, but in future work, scaling would be required to get both classes on the same scale and see how the difference in flux compares between the two classes.

Stars without planets show a higher brightness in comparison to stars that do have a planet

Now that I have explored the data, I can begin using the different models to see how well they perform on the dataset.

Logistic Regression:

The first model I want to look at is logistic regression, since this is a binary classification problem. Logistic regression uses the sigmoid function to put each observation into a class of 1 or 0. Running the code below, we see that our logistic regression model has a base accuracy of 56%, but we can see from our confusion matrix that we have low precision.
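The fitting step would look roughly like the sketch below; since the Kaggle CSVs are not bundled here, an imbalanced synthetic dataset stands in for the flux features (class 1 = planet, class 0 = no planet):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in: heavily imbalanced, like the exoplanet data
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print('accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows = actual, cols = predicted
```

The confusion matrix, not the accuracy score alone, is what exposes the low precision on the planet class.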

Confusion matrix for logistic regression

KNN:

The next model I will be using is K-nearest neighbors (KNN). KNN predicts our target value by looking at the observations that are nearest in distance to the observation we are trying to predict. I used Manhattan, Euclidean, and Minkowski distances to see how they differ in their predictions of the presence of a planet.
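A sketch of fitting KNN with the three distance metrics, again on a synthetic imbalanced stand-in for the flux data (the metric names follow scikit-learn's KNeighborsClassifier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

scores = {}
for metric in ('manhattan', 'euclidean', 'minkowski'):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    scores[metric] = accuracy_score(y_test, knn.predict(X_test))
print(scores)
```

Note that with scikit-learn's default power parameter p=2, the Minkowski metric is exactly the Euclidean metric, so those two models give identical predictions.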

Confusion Matrix based on KNN

Each of the KNN models gives an accuracy score of 99%. This is a huge improvement over my logistic regression model, but one consideration to take into account is that the KNN models did not predict any of the stars with planets in the testing set; they predicted all the stars without a planet. This is because of the class imbalance: the majority of the stars are associated with class 0, no presence of a planet. This creates a bias in the results toward the majority class.

Decision Tree:

Now I am going to run the data through a decision tree model. A decision tree uses recursive splitting of the feature space to group similar observations into regions. When making predictions, it associates the observation with the class region it is most similar to.
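The decision-tree step, sketched on the same kind of synthetic stand-in for the flux features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Each internal node splits on one feature threshold; each leaf assigns a class
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print('accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```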

After running the decision tree model, I get an accuracy score of 98%. However, the logistic regression model found more TPs. Meanwhile, KNN picks up all TNs, whereas the decision tree was only able to pick out 98% of all TNs. Now, let's see if resampling our dataset will change the accuracy of each model.

SMOTE:

Due to how drastic our class imbalance is, I will be using a resampling technique to create better symmetry. SMOTE creates synthetic data from the minority class; in this case the minority class is the stars with planets, as mentioned above. Here we create a class balance of 1 to 2, meaning the number of observations that have planets is now 2,025 individual observations.

After re-running all three models (specifically using Euclidean distance for KNN), logistic regression became 54% accurate and detected 3 TPs, KNN became 98% accurate but now detected 1 TP, and the decision tree became 97% accurate, with 0 TPs.

Conclusion:

After running all three models and seeing how the accuracy scores compare, I wanted to compare precision, recall, and F1 scores based on the resampled data. The reason for doing so is that I made a mistake in using accuracy as my primary metric of how well each model performed on my dataset.
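The per-class metric table can be produced with scikit-learn's classification_report, sketched here for one model on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1 in one table
print(classification_report(y_test, y_pred, target_names=['no planet', 'planet']))

# Or each metric for the planet class (label 1) individually
print('precision:', precision_score(y_test, y_pred, zero_division=0))
print('recall:   ', recall_score(y_test, y_pred))
print('f1:       ', f1_score(y_test, y_pred))
```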

Precision, recall, and F1 score for all three models.

We can clearly see that accuracy is not the best indicator when predicting whether a planet exists or not. We need to take it a step further by looking at the recall metric. Recall measures the sensitivity of each model, specifically, in the context of this problem, at detecting the stars with planets. Our models need to minimize FNs: cases where the model predicts no planet when there actually is a planet present. From the recall metric, I can say that KNN was the best model at detecting stars with planets.

Further Steps:

I want to further improve this research by using recall as the primary metric for how well each model avoids FNs, by creating time stamps for each campaign that Kepler made (each lasted, on average, 80 days), and by taking into account the spectrum of each star to cancel out noise and output the best flux for each observation.

Repo Link


Undergraduate in Physics with a concentration in Astrophysics. Thesis on the classification of eight Young Stellar Objects in the SMC. Data Science Bootcamp at Flatiron.