Planet Finding App
Supervised machine learning models are trained on a combined dataset from Caltech and STScI to identify probable planet-bearing stars in NASA’s TESS satellite light-curve data.
NASA’s TESS satellite started taking light readings of 200,000 of the brightest, closest stars in 2018. NASA expects to find about 300 Earth-sized exoplanets within the next year. Data from this ongoing project is being made available to the public. For my very first co-op project, another Data Science student and I paired up to see if we could find anything interesting.
To be clear, neither of us is an astrophysicist. Nonetheless, after brainstorming together, we came up with a way that machine learning might contribute to NASA’s project without any reference to physics. To train a classification model, all we needed was a set of training data labelled with and without exoplanets (think hotdog / not a hotdog from the show Silicon Valley). We had access to the raw data without any classification. If we could get a list of stars in the data that had already been established as likely to have planets, then we were in business.
As it turned out, Caltech had already begun composing a list of the stars in the TESS data most likely to have planets. They wanted to get this information out there to direct the attention of the physics community towards the stars most likely to yield interesting results, and to get people pointing equipment in the right directions. As it so happened, this was exactly the information we needed to build a classification model.
We combined the Caltech data with data from STScI in Baltimore, MD; together, the data formed the X and y vectors of our models. We also found a number of helpful Jupyter Notebooks in the community resources section of NASA’s website for creating nice light-curve visuals like the one below.
Essentially, our model took in the whole swath of these scattered data points for each star and was charged with producing a single probability that the star had at least one planet. Unlike the graph above, we were not fitting a curve or doing any real mathematical analysis. The power of this approach, though it could never stand alone, is that it lets us look for patterns that we were not necessarily looking for explicitly. While physics might initially find probable planets in the data using rigid, well-defined rules, our secondary, piggy-back approach allowed us to look for general patterns. Most notably, we are not really able to say exactly what patterns our model is picking up on, only that it must be catching on to something if it is predicting accurately.
PART 1: GETTING DATA AND VISUALIZING IT
Getting data from Caltech:
import pandas as pd
# Getting labelled TESS Objects of Interest df from Caltech:
toi = pd.read_csv('https://exofop.ipac.caltech.edu/tess/' +
                  'download_toi.php?sort=toi&output=csv')
# Isolating columns of interest:
toi = toi[['TIC ID',
           'TOI',
           'Epoch (BJD)',
           'Period (days)',
           'Duration (hours)',
           'Depth (mmag)',
           'Planet Radius (R_Earth)',
           'Planet Insolation (Earth Flux)',
           'Planet Equil Temp (K)',
           'Planet SNR',
           'Stellar Distance (pc)',
           'Stellar log(g) (cm/s^2)',
           'Stellar Radius (R_Sun)',
           'TFOPWG Disposition']]
# Some cleaning: replacing spaces in column names with underscores:
toi.columns = toi.columns.str.replace(' ', '_')
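Before moving on, it can help to glance at the label column that the model in Part 2 will eventually train on; a minimal sketch of that check:
# Quick look at the disposition labels (KP/CP = known/confirmed planets, FP = false positive):
print(toi.shape)
print(toi['TFOPWG_Disposition'].value_counts(dropna=False))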
Getting data from STScI
from astroquery.mast import Catalogs
from astropy.table import Table
from tqdm import tqdm
# Getting additional data on TESS Objects of Interest from STScI:
catalog_rows = []
for tic_id in tqdm(toi['TIC_ID'].unique()):
    row_data = Catalogs.query_criteria(catalog="Tic", ID=tic_id)
    # Collecting each result and concatenating once at the end:
    catalog_rows.append(row_data.to_pandas())
tic_catalog = pd.concat(catalog_rows).reset_index(drop=True)
# Renaming ID column to make this consistent with Caltech TOI dataframe:
tic_catalog = tic_catalog.rename(columns={'ID': 'TIC_ID'})
# Isolating columns of interest:
tic_catalog = tic_catalog[['TIC_ID',
                           'ra',
                           'dec',
                           'pmRA',
                           'pmDEC',
                           'plx',
                           'gallong',
                           'gallat',
                           'eclong',
                           'eclat',
                           'Tmag',
                           'Teff',
                           'logg',
                           'MH',
                           'rad',
                           'mass',
                           'rho',
                           'lum',
                           'd',
                           'ebv',
                           'numcont',
                           'contratio',
                           'priority']]
tic_catalog.columns = tic_catalog.columns.str.replace(' ', '_')
Getting data-product URLs from STScI for the stars in the dataframes above, cataloged by TIC ID:
from astroquery.mast import Observations
# Getting all dataproducts for TESS Objects of Interest from STScI:
product_rows = []
for tic_id in tqdm(toi['TIC_ID'].unique()):
    row_data = Observations.query_criteria(obs_collection="TESS",
                                           target_name=tic_id)
    # Collecting each result and concatenating once at the end:
    product_rows.append(row_data.to_pandas())
dataproducts = pd.concat(product_rows).reset_index(drop=True)
# Renaming target_name column to make this consistent with the Caltech TOI dataframe:
dataproducts = dataproducts.rename(columns={'target_name': 'TIC_ID'})
# Isolating TIC IDs (target_name) and dataURL values to get associated files:
dataproducts = dataproducts[['TIC_ID', 'dataURL']]
dataproducts.columns = dataproducts.columns.str.replace(' ', '_')
To inspect which stars (TIC IDs) appear in each dataframe, and so which have data products available:
# Getting list of unique TIC IDs in each of the three dataframes:
toi_unique_ids = toi['TIC_ID'].unique().tolist()
tic_catalog_unique_ids = tic_catalog['TIC_ID'].unique().tolist()
dataproducts_unique_ids = dataproducts['TIC_ID'].unique().tolist()
dataproducts_unique_ids = [int(tic_id) for tic_id in dataproducts_unique_ids]
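For instance, coverage across the three sources can be compared with simple set arithmetic; a minimal sketch (IDs normalized to integers so the comparisons line up):
# Comparing coverage across the three dataframes:
toi_ids = set(int(i) for i in toi_unique_ids)
catalog_ids = set(int(i) for i in tic_catalog_unique_ids)
product_ids = set(dataproducts_unique_ids)  # already cast to int above
print(len(toi_ids), 'stars in the Caltech TOI list')
print(len(toi_ids & catalog_ids), 'of those have a TIC catalog entry')
print(len(toi_ids & product_ids), 'have at least one TESS data product')
print(len(toi_ids - product_ids), 'have no data products')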
After exploring the resulting lists above, it was time to try to extract some information. Relying heavily on tutorials found through the NASA community pages (see https://github.com/spacetelescope/notebooks/blob/master/notebooks/), the following yields all light curves for an example star:
from astropy.io import fits
import matplotlib.pyplot as plt
# Getting urls for all dataproducts associated with TIC ID 29781292:
urls_for_29781292 = dataproducts[dataproducts['TIC_ID'] == '29781292'][
    'dataURL'].tolist()
# Getting urls for all lc.fits files associated with TIC ID 29781292:
lcs_for_29781292 = [url for url in urls_for_29781292 if
                    url.endswith('lc.fits')]
# Describing and plotting all lc.fits files associated with TIC ID 29781292:
# See https://github.com/spacetelescope/notebooks/blob/master/notebooks/
# MAST/TESS/beginner_how_to_use_lc/beginner_how_to_use_lc.ipynb
for url in lcs_for_29781292:
    fits_file = ('https://mast.stsci.edu/api/v0.1/Download/file?uri=' + url)
    # print(fits.info(fits_file), "\n")
    # print(fits.getdata(fits_file, ext=1).columns)
    with fits.open(fits_file, mode="readonly") as hdulist:
        tess_bjds = hdulist[1].data['TIME']
        sap_fluxes = hdulist[1].data['SAP_FLUX']
        pdcsap_fluxes = hdulist[1].data['PDCSAP_FLUX']
    fig, ax = plt.subplots()
    ax.plot(tess_bjds, pdcsap_fluxes, 'ko')
    ax.set_ylabel("PDCSAP Flux (e-/s)")
    ax.set_xlabel("Time (TBJD)")
    plt.show()
For the sample star chosen for this example, the code block above produced another 12 plots like the one above. Some stars had several data products, while others had none.
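To get a sense of how uneven that coverage is, one could count the light-curve files per star straight from the dataproducts dataframe; a minimal sketch:
# Counting lc.fits light-curve files per star:
lc_products = dataproducts[
    dataproducts['dataURL'].str.endswith('lc.fits', na=False)]
lc_counts = lc_products.groupby('TIC_ID').size().sort_values(ascending=False)
print(lc_counts.head(10))  # stars with the most light-curve files
no_lc = set(dataproducts['TIC_ID']) - set(lc_products['TIC_ID'])
print(len(no_lc), 'stars have no lc.fits files at all')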
For most stars, the data plots below were also available. Again, information on how to do this was found at https://github.com/spacetelescope/notebooks/blob/master/notebooks/MAST/
import numpy as np
# Getting urls for all dvt.fits files associated with TIC ID 29781292:
dvts_for_29781292 = [url for url in urls_for_29781292 if
                     url.endswith('dvt.fits')]
# Describing and plotting all dvt.fits files associated with TIC ID 29781292:
# See https://github.com/spacetelescope/notebooks/blob/master/notebooks/MAST/
# TESS/beginner_how_to_use_dvt/beginner_how_to_use_dvt.ipynb
for url in dvts_for_29781292:
    fits_file = ('https://mast.stsci.edu/api/v0.1/Download/file?uri=' + url)
    print(fits.info(fits_file), "\n")
    print(fits.getdata(fits_file, ext=1).columns)
    with fits.open(fits_file, mode="readonly") as hdulist:
        star_teff = hdulist[0].header['TEFF']
        star_logg = hdulist[0].header['LOGG']
        star_tmag = hdulist[0].header['TESSMAG']
        period = hdulist[1].header['TPERIOD']
        duration = hdulist[1].header['TDUR']
        epoch = hdulist[1].header['TEPOCH']
        depth = hdulist[1].header['TDEPTH']
        times = hdulist[1].data['TIME']
        phases = hdulist[1].data['PHASE']
        fluxes_init = hdulist[1].data['LC_INIT']
        model_fluxes_init = hdulist[1].data['MODEL_INIT']
    sort_indexes = np.argsort(phases)
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(phases[sort_indexes], fluxes_init[sort_indexes], 'ko',
            markersize=2)
    ax.plot(phases[sort_indexes], model_fluxes_init[sort_indexes], '-r')
    ax.set_ylabel("Flux (relative)")
    ax.set_xlabel("Orbital Phase")
    plt.show()
This produces all of the dvt light-curve plots available for a star in the TESS data. An example dvt plot:
Finally, we could save our wrangled data in new csv files:
toi.to_csv('toi_example.csv')
tic_catalog.to_csv('tic_catalog_example.csv')
dataproducts.to_csv('dataproducts_example.csv')
PART 2: BUILDING A PREDICTIVE MODEL
Since we were eventually going to need our data in a SQL database, we started by pulling the CSV data saved above into a pandas dataframe, transferring it into a SQLite database, then pulling it back into a pandas dataframe and merging on the unique star number (TIC ID).
from sqlalchemy import create_engine
from astropy.io import fits
# Importing example data:
toi = pd.read_csv('toi_example.csv')
tic_catalog = pd.read_csv('tic_catalog_example.csv')
# Getting it into an sqlite database:
engine = create_engine('sqlite://', echo=False)
toi.to_sql('toi', con=engine, index=False)
tic_catalog.to_sql('tic_catalog', con=engine, index=False)
# Pulling data from sql database:
toi = pd.read_sql_table('toi', engine).drop(columns='Unnamed: 0')
tic_catalog = pd.read_sql_table(
    'tic_catalog', engine).drop(columns='Unnamed: 0')
# Assembling data needed for model:
df = toi.merge(tic_catalog, on='TIC_ID')
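Several of these catalog fields can be missing for a given star, which is why the modelling pipeline below includes an imputation step. A quick, illustrative check of the merged dataframe:
# Shape of the merged dataframe and the columns with the most missing values:
print(df.shape)
print(df.isna().sum().sort_values(ascending=False).head(10))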
Next came a train/test split. In this case, because the training set was relatively small, we decided against carving out a separate validation set, relying instead on the validation_split that Keras applies during training.
from sklearn.model_selection import train_test_split
# Separating confirmed planets from false positives:
df['TFOPWG_Disposition'] = df[
    'TFOPWG_Disposition'].replace({'KP': 1, 'CP': 1, 'FP': 0})
# Creating confirmed planets dataframe:
cp_df = df[df['TFOPWG_Disposition'] == 1]
# Creating false positives dataframe:
fp_df = df[df['TFOPWG_Disposition'] == 0]
# Train/test split on both dataframes:
cp_train, cp_test = train_test_split(cp_df, random_state=42)
fp_train, fp_test = train_test_split(fp_df, random_state=42)
# Combining training dataframes:
train = pd.concat([cp_train, fp_train])
train = train.sample(frac=1, random_state=42).reset_index(drop=True)
# Combining test dataframes:
test = pd.concat([cp_test, fp_test])
test = test.sample(frac=1, random_state=42).reset_index(drop=True)
# Dropping columns from training dataframe that aren't used in model:
X_train = train.drop(columns=['TIC_ID', 'TOI', 'TFOPWG_Disposition'])
# Getting labels for training data:
y_train = train['TFOPWG_Disposition'].astype(int)
# Dropping columns from test dataframe that aren't used in model:
X_test = test.drop(columns=['TIC_ID', 'TOI', 'TFOPWG_Disposition'])
# Getting labels for test data:
y_test = test['TFOPWG_Disposition'].astype(int)
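Splitting the confirmed planets and false positives separately before recombining amounts to a manual stratified split, so the class balance should carry through to both sets; a quick, illustrative check:
# Class balance in the train and test sets (1 = planet, 0 = false positive):
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))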
Finally, it was time to build a Keras neural network model:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# Setting up model architecture for neural net:
def create_model():
    # Instantiate model:
    model = Sequential()
    # Add input layer:
    model.add(Dense(20, input_dim=33, activation='relu'))
    # Add hidden layer:
    model.add(Dense(20, activation='relu'))
    # Add output layer:
    model.add(Dense(1, activation='sigmoid'))
    # Compile model:
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
# Putting neural net in a pipeline:
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    RobustScaler(),
    KerasClassifier(build_fn=create_model, verbose=2)
)
# Fitting the neural net:
pipeline.fit(X_train,
             y_train,
             kerasclassifier__batch_size=5,
             kerasclassifier__epochs=50,
             kerasclassifier__validation_split=.2,
             kerasclassifier__verbose=2)
# Printing train and test accuracy:
print('\n\n')
print('Train Accuracy Score:', pipeline.score(X_train, y_train), '\n')
print('Test Accuracy Score:', pipeline.score(X_test, y_test))
After running 50 epochs, accuracy homed in on about 80%, with a loss around 1.1:
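Accuracy alone says little about how the errors split between the two classes, so it is also worth looking at per-class metrics. A minimal sketch of how one might do that with the fitted pipeline:
from sklearn.metrics import classification_report, confusion_matrix
# Per-class precision/recall and the confusion matrix on the held-out test set:
y_pred = pipeline.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=['false positive', 'planet']))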
In addition to building the model itself, my DS partner and I set out to create a simple interactive app with our findings. As a minimal product, we hoped a user would be able to select a star from the data and see a light-curve graph with our model’s predictions.
To tackle this, I built a Python Flask app. In retrospect, we really didn’t need to reproduce all of our work in the app, but we did. I created a number of Python classes in an object-oriented style, using SQLAlchemy, to transfer our local work on a SQLite database over to populating our production PostgreSQL DB as painlessly as possible. The idea was to build something that we could come back to and build out with more features.
from flask_sqlalchemy import SQLAlchemy

# Instantiate the database object:
DB = SQLAlchemy()

# Create visualization info class:
class Visual_Table(DB.Model):
    id = DB.Column(DB.Integer, primary_key=True)
    TIC_ID = DB.Column(DB.BigInteger, nullable=False)
    dataURL = DB.Column(DB.String(100))

    def __repr__(self):
        return '<Visual_Table {}>'.format(self.TIC_ID)

class TOI_Table(DB.Model):
    id = DB.Column(DB.Integer, primary_key=True)
    TIC_ID = DB.Column(DB.BigInteger, nullable=False)
    TOI = DB.Column(DB.Float)
    Epoch = DB.Column(DB.Float)
    Period = DB.Column(DB.Float)
    Duration = DB.Column(DB.Float)
    Depth = DB.Column(DB.Float)
    Planet_Radius = DB.Column(DB.Float)
    Planet_Insolation = DB.Column(DB.Float)
    Planet_Equil_Temp = DB.Column(DB.Float)
    Planet_SNR = DB.Column(DB.Float)
    Stellar_Distance = DB.Column(DB.Float)
    Stellar_log_g = DB.Column(DB.Float)
    Stellar_Radius = DB.Column(DB.Float)
    TFOPWG_Disposition = DB.Column(DB.String)

    def __repr__(self):
        return '<TOI_Table {}>'.format(self.TIC_ID)

class TIC_Cat_Table(DB.Model):
    id = DB.Column(DB.Integer, primary_key=True)
    TIC_ID = DB.Column(DB.BigInteger, nullable=False)
    ra = DB.Column(DB.Float)
    dec = DB.Column(DB.Float)
    pmRA = DB.Column(DB.Float)
    pmDEC = DB.Column(DB.Float)
    plx = DB.Column(DB.Float)
    gallong = DB.Column(DB.Float)
    gallat = DB.Column(DB.Float)
    eclong = DB.Column(DB.Float)
    eclat = DB.Column(DB.Float)
    Tmag = DB.Column(DB.Float)
    Teff = DB.Column(DB.Float)
    logg = DB.Column(DB.Float)
    MH = DB.Column(DB.Float)
    rad = DB.Column(DB.Float)
    mass = DB.Column(DB.Float)
    rho = DB.Column(DB.Float)
    lum = DB.Column(DB.Float)
    d = DB.Column(DB.Float)
    ebv = DB.Column(DB.Float)
    numcont = DB.Column(DB.Float)
    contratio = DB.Column(DB.Float)
    priority = DB.Column(DB.Float)

    def __repr__(self):
        return '<TIC_Cat_Table {}>'.format(self.TIC_ID)
To populate this database, I created three functions, living in a single .py file, to fetch data for the three tables above. The nearly 200 lines of code for these functions can be found at https://github.com/mikedcurry/TESS_Flask_App/blob/master/App_TESS/Data_in.py
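The real functions live in the linked Data_in.py. Purely as an illustration of the pattern (the function name and the handful of columns below are hypothetical, not the actual repo code), a much-simplified population function for the TOI table could look something like this:
# Hypothetical, simplified sketch of populating TOI_Table from the toi dataframe:
def insert_toi_rows(toi_df):
    for _, row in toi_df.iterrows():
        record = TOI_Table(TIC_ID=int(row['TIC_ID']),
                           TOI=float(row['TOI']),
                           Period=float(row['Period_(days)']),
                           TFOPWG_Disposition=str(row['TFOPWG_Disposition']))
        DB.session.add(record)
    DB.session.commit()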