ExoNET: discovering exoplanets through Deep Learning
A Convolutional Neural Network for classifying planetary transit data
Four weeks! Four weeks to design, build, and execute a project aimed at pushing the limits of what we’d learned in our local chapter of AI Saturdays in Madrid. Naturally, four weeks doesn’t sound like a lot of time because, well… it isn’t. Many ideas were proposed as groups formed, but in the end, the group writing this article chose to work on astronomy data.
We’re a professionally diverse group made up of Borja, our resident astrophysicist; Austin, a neuroscience Ph.D. student; Enrique, our wizard programmer (a Java architect retraining in AI); and Miguel, an aerospace engineer by day and Deep Learning student by night. In the beginning, the idea was just that: astronomy. Completely undefined, but with the idea of working with the vast amounts of data available from astronomical observations.
In the end, we were inspired by a wonderful paper by Shallue et al. (a Google AI and UT Austin collaboration) in which deep learning techniques were used to discover two new planets. We set out to build upon this work. From then on, our challenge was clear: create a deep learning model to identify the most likely planet candidates from Kepler data.
Exploring the Final Frontier
The Kepler space telescope was launched into heliocentric orbit in 2009 and was recently retired in 2018. Throughout its years of service, Kepler propelled a revolution in the detection of exoplanets — that is, planets from other stellar systems — and enabled the discovery of thousands of them by making use of the planetary transit method.
The planetary transit method? Let’s imagine a single star with one hypothetical planet in its orbit, and that the planet’s orbital plane intersects with our point of view in the Solar System. Continuous imaging of the light coming from the star means that, at some point, when the planet passes in front of the star (called a transit), the intensity perceived by the telescope will decrease, as the planet partially eclipses the light from its star.
This transit will happen again once the planet completes a full orbit around the star, which takes a fixed amount of time called the orbital period. If the telescope monitors the same star for enough time, it will perceive multiple transits evenly spaced in time. The repeated detection of these transits enables the discovery of new exoplanets, although confirmation often requires detection through a second method.
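The size of the dip follows directly from geometry: during a central transit, the planet blocks a fraction of the stellar disk equal to the squared ratio of the radii. A quick sanity check in Python (the radii are standard reference values; the function is just our illustration, not part of any pipeline):

```python
# Rough transit-depth estimate: a transiting planet blocks a fraction of the
# stellar disk equal to the ratio of projected areas, (R_planet / R_star)^2.

def transit_depth(r_planet_km: float, r_star_km: float) -> float:
    """Fractional dimming of the star during a central transit."""
    return (r_planet_km / r_star_km) ** 2

R_SUN = 696_000      # km
R_JUPITER = 69_911   # km
R_EARTH = 6_371      # km

# A Jupiter-sized planet dims a Sun-like star by about 1%...
print(f"{transit_depth(R_JUPITER, R_SUN):.4f}")  # 0.0101
# ...while an Earth-sized planet dims it by less than 0.01%.
print(f"{transit_depth(R_EARTH, R_SUN):.6f}")    # 0.000084
```

That factor-of-a-hundred gap in signal size is why small planets are so much harder to pick out of the photometric noise.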
In the first part of its mission, lasting more than 4 years, Kepler recorded photometric data from hundreds of thousands of stars. This massive amount of data has been made available to the public through the NASA Exoplanet Archive.
Thankfully, the Kepler preprocessing pipeline automatically identifies candidate signals for planetary transits by looking for repeated events evenly spaced in time. However, it is far from infallible. There are many other types of events that can pass this first filter, such as measurement artifacts, eclipses from another star in a binary system, intrinsic fluctuations of the star’s brightness, or events in other star systems aligned with the candidate star from our point of view. In particular, some binary eclipses can look a lot like planetary transits from just a light curve, and can very well fool even an astronomer at first glance.
As a result, careful and often arduous vetting of these candidates is needed before committing resources into confirming a discovery. This is where we believe that machine learning can make a valuable contribution to the field.
This idea is not new. Shallue et al. used heavily processed photometric data, specifically light curves labeled as confirmed planets or false positives, to train a Convolutional Neural Network (CNN). Can we do the same in four weeks?
Taming light curves
Now that we’ve gotten the background out of the way, let’s get to the nuts and bolts. We had four weeks to obtain the data, clean it, and implement a state-of-the-art Convolutional Neural Network capable of discriminating between true exoplanets and false positives.
Our data comes in two parts. The first is a table of labeled Kepler Objects of Interest (KOIs, one for each star), with the orbital period associated with each object and its exoplanet status: confirmed or false positive. These will be our labels. The second part is the light curve itself, which will be the input to the model. Each light curve is built from Kepler’s photometric data (continuous measurements taken every 29 minutes).
Luckily, there is an incredible Python package for Kepler data which makes citizen science accessible. Using the aptly named Lightkurve library, we can download a light curve from its KOI identifier.
Unfortunately, we can’t use the light curve in its raw form. We won’t get into specifics, but we had to take the mess you can see above and shift, concatenate, normalize, fold, and rebin the light curve in order to get the pretty view you see below, where the dimming caused by the planetary transit is clearly visible.
This view has two parts. The global folded light curve shows the full orbital period of a potential planet, while the local folded light curve only shows the time around the transit, although at a much higher resolution.
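For the curious, the fold-and-bin step and the two views can be sketched with NumPy on a synthetic light curve. Every number below (period, dip depth, bin counts) is illustrative, not a value from our actual pipeline:

```python
import numpy as np

# Synthetic light curve: 90 days of flat flux with a 1% dip every 10 days.
rng = np.random.default_rng(0)
period = 10.0                                    # days
t = np.arange(0.0, 90.0, 0.02)                   # observation timestamps
flux = 1.0 + rng.normal(0, 1e-4, t.size)         # normalized flux + noise
flux[(t % period) < 0.2] -= 0.01                 # transit dip at each orbit

# Fold: map every timestamp to its phase within one orbital period.
phase = (t % period) / period

def binned_view(phase, flux, lo, hi, n_bins):
    """Median flux in n_bins equal-width phase bins over [lo, hi)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.digitize(phase, edges) - 1
    keep = (idx >= 0) & (idx < n_bins)
    return np.array([np.median(flux[keep][idx[keep] == b])
                     for b in range(n_bins)])

global_view = binned_view(phase, flux, 0.0, 1.0, 201)  # whole orbit
local_view = binned_view(phase, flux, 0.0, 0.05, 20)   # zoom on the transit
```

In practice the fold must use the orbital period from the KOI table, and real Kepler fluxes need detrending and stitching across quarters before this step.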
We applied this process to all the labeled data from Kepler in order to build our training, test, and validation sets. In total, we had 2296 light curves from confirmed planets and 4841 from false positives.
How to train your ExoNET
Our model, which we called ExoNET, is a Convolutional Neural Network implemented in PyTorch.
We chose a CNN because of its capabilities for identifying complex features in the data, such as the presence of a transit, by building up from simpler patterns. They also have a lower risk of overfitting than fully connected neural networks, which is a major concern with the type of training set we have.
In fact, ExoNET incorporates two one-dimensional CNNs, one for the global view of the light curve, and another for the local view. These extract features from each of the views and then four fully connected layers act as the classifier, assigning a likelihood that there is a planetary transit in the input light curve. Also, the network incorporates dropout and batch normalization layers to minimize any overfitting.
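A minimal sketch of this two-branch architecture in PyTorch might look as follows; all layer sizes, kernel widths, and view lengths here are illustrative guesses, not ExoNET’s actual hyperparameters:

```python
import torch
import torch.nn as nn

class TwoViewCNN(nn.Module):
    """Two 1-D convolutional branches (global + local view) feeding a
    fully connected classifier, with batch norm and dropout throughout."""

    def __init__(self, global_len: int = 201, local_len: int = 61):
        super().__init__()

        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.BatchNorm1d(16),
                nn.ReLU(), nn.MaxPool1d(2),
                nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.BatchNorm1d(32),
                nn.ReLU(), nn.MaxPool1d(2),
            )

        self.global_branch = branch()
        self.local_branch = branch()
        # Each branch halves its input length twice, hence the // 4.
        n_features = 32 * (global_len // 4) + 32 * (local_len // 4)
        self.classifier = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 1), nn.Sigmoid(),  # likelihood of a transit
        )

    def forward(self, global_view, local_view):
        g = self.global_branch(global_view).flatten(1)
        l = self.local_branch(local_view).flatten(1)
        return self.classifier(torch.cat([g, l], dim=1))

model = TwoViewCNN()
scores = model(torch.randn(4, 1, 201), torch.randn(4, 1, 61))
print(scores.shape)  # torch.Size([4, 1])
```

The key design point is that each view gets its own feature extractor, and the two feature vectors are only merged at the classifier stage.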
We used a train/test/validation split of 70/20/10. The training set was used for optimizing the weights in the network (using a simple SGD algorithm with momentum) and the test set was used for hyperparameter optimization. Finally, the validation set was used exclusively for evaluation, and no further changes were made to the model based on it.
The confusion matrix for the validation set is shown on the right. Overall, choosing an arbitrary threshold of 0.5 for classification, we achieved an accuracy of 95.2%.
This can change if we move the threshold up or down, depending on whether we put more value on not classifying any planets as false positives or any false positives as planets. The precision vs recall curve below shows this tradeoff as the threshold value changes.
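The tradeoff is easy to reproduce on a toy example. With a handful of made-up scores and labels (not ExoNET outputs), precision and recall at a given threshold come out as:

```python
import numpy as np

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # 1 = confirmed planet
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])

def precision_recall(threshold):
    """Precision and recall when scores >= threshold are called planets."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))   # planets correctly flagged
    fp = np.sum(pred & (labels == 0))   # false positives flagged as planets
    fn = np.sum(~pred & (labels == 1))  # planets we missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for thr in (0.25, 0.5, 0.75):
    p, r = precision_recall(thr)
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold catches fewer true planets (lower recall), while lowering it lets more false positives through (lower precision); the right operating point depends on how costly each kind of mistake is.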
However, the more appropriate measure for this model is the AUC (area under the ROC curve), which is the probability that a randomly chosen confirmed planet is ranked higher by the model than a randomly chosen false positive. We achieved an AUC of 98.5%, just shy of the state of the art at 98.8%.
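This probabilistic reading of the AUC can be computed directly, without building the ROC curve at all. On a tiny set of made-up scores (ties, which would count as half, are ignored here for brevity):

```python
import itertools

# AUC as a rank statistic: the fraction of (planet, false positive) pairs
# where the planet receives the higher score. Scores below are illustrative.
planet_scores = [0.9, 0.7, 0.6, 0.3]
false_pos_scores = [0.8, 0.4, 0.35, 0.1]

pairs = list(itertools.product(planet_scores, false_pos_scores))
auc = sum(p > f for p, f in pairs) / len(pairs)
print(auc)  # 0.6875
```

Unlike accuracy, this number does not depend on any particular threshold, which is why it is the fairer way to compare ranking models.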
Now that we have trained and tested our model, let’s try to use it to explore strange new worlds.
We further applied our light curve processing pipeline to an additional 2420 unconfirmed planet candidates present in the Kepler catalog. This last set of candidates is our model’s time to shine: can we determine which of them are most likely to truly be exoplanets?
The model gives us a number between zero and one. The higher the number, the greater the likelihood that the signal corresponds to an exoplanet according to the model.
The candidate with the highest score is K01206.01, at 0.940. Close behind is K01861.01, at 0.927. This information could potentially be used to prioritize observations on their stars in order to confirm the discoveries.
Conversely, the model can also be used to automatically discard the unlikeliest candidates, such as the examples shown below, which have the lowest scores.
… and beyond
While we are happy with the results we got in just four weeks, we have many ideas on how to expand upon this work in the future:
- We want to add local views in the opposite phase to the transit to catch more information on secondary eclipses from binary stars, which could be used to discard false positives more efficiently.
- It would also be interesting to include information on the position of the centroid of the incoming light intensity as a secondary channel, which would also help discard false positives, since the centroid should not move perceptibly during a planetary transit.
- With more time, we would also like to experiment with adding other parameters such as the absolute transit durations or the orbital period as input for the dense layers of the model.
- And, of course, we want to apply this model to more unlabelled data. The TESS mission was launched last year and, just like Kepler, it is searching for exoplanets using the transit method, although on an area of the sky 400 times wider than that of its predecessor. Our model would be great for sifting through this massive amount of data.
Overall, this has been a thrilling learning experience for the four of us, in astronomy, data science, computer science, and deep learning. We all come from very different academic backgrounds, but we wouldn’t have it any other way, because this combination of disciplines is what has allowed us to learn from each other and grow together.
We would like to thank the organizers of AI Saturdays for their amazing initiative and we invite you to check out our repository on GitHub.
Thank you for reading.