Exoplanet Exploration with F# and MongoDb: Part 1

Introduction

To celebrate my acceptance into UWashington’s Data Science Certificate program, I decided to brush up on my F# skills and apply them to multiple studies of Exoplanets.

Pillars of Creation: Where it all began

Through this blogpost, I plan to highlight how I acquired and persisted the Exoplanetary and Stellar data via MLab and MongoDb.

Additionally, I describe the first study I carried out with this data that included generating basic descriptive statistics of the Exoplanetary Mass using MathNet.Numerics and creating some plots based on the data via XPlot.GoogleCharts.

Data Acquisition

Our data comes from the catalog offered by Exoplanet.eu, which is available for download in CSV form.

We make use of the CSV type provider in the following manner to get the collection of exoplanets.
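As a minimal sketch of that step, assuming the FSharp.Data package and a simplified set of column headers (the real catalog has many more columns, and its header names may differ), the type provider usage could look roughly like this. The inline sample stands in for the downloaded file and reuses values that appear later in this post.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Inline stand-in for the downloaded file. The header names are simplified
// assumptions; the two rows reuse values shown elsewhere in this post.
[<Literal>]
let SampleCsv =
    "Name,Mass,Radius,OrbitalPeriod\n" +
    "ups And c,9.1,,240.937\n" +
    "EPIC 22881391 b,0.7,0.079,0.179715\n"

// Force the numeric columns to float so that an empty cell is read as nan.
type ExoplanetCsv =
    CsvProvider<SampleCsv, Schema = "Mass=float,Radius=float,OrbitalPeriod=float">

// With the real download, the sample would be the file itself and this call
// would become ExoplanetCsv.Load with the path to that file.
let exoplanetRows = ExoplanetCsv.Parse(SampleCsv).Rows |> Seq.toList

for row in exoplanetRows do
    printfn "%s: mass=%g radius=%g period=%g" row.Name row.Mass row.Radius row.OrbitalPeriod
```

The provider infers one typed property per column, so a misspelled column name fails at compile time rather than at run time.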

Our downloaded CSV has many columns and a lot of rows all pertinent to Exoplanetary and Stellar data. Our first step is to sieve out exactly the columns we want to use for our studies.

Modeling the Domain

From the catalog, we gather details about the units of the fields:

Stellar:
Radius in terms of Solar Radius
Mass in terms of Solar Mass
Temperature in Kelvin
Planetary:
Radius in terms of Jupiter Radius
Mass in terms of Jupiter Mass
Temperature in Kelvin
Orbital Period in Days

To stay as close to the science as possible, we use an awesome feature built into the F# language called Units of Measure; more information can be found in the F# language documentation.

In a nutshell, these provide an extra layer of type safety by allowing only values with compatible units to be combined mathematically.

In tandem with our data, our units of measure definitions look like:
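A minimal sketch of such definitions follows. The unit names here are assumptions chosen to mirror the list of fields above.

```fsharp
// One unit of measure per unit used in the catalog (names are assumptions).
[<Measure>] type SolarRadius
[<Measure>] type SolarMass
[<Measure>] type JupiterRadius
[<Measure>] type JupiterMass
[<Measure>] type Kelvin
[<Measure>] type Days

let stellarMass = 1.27<SolarMass>
let planetMass = 9.1<JupiterMass>

// Values that share a unit can be added.
let combinedMass = planetMass + 0.7<JupiterMass>

// Mixing units is rejected by the compiler, so this line would not build:
// let nonsense = stellarMass + planetMass

// Converting to a plain float strips the unit for printing.
printfn "Combined planetary mass: %g Jupiter masses" (float combinedMass)
```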

The relationship between the Jupiter and Solar mass and radii is highlighted by the following function:
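Here is a sketch of what such conversions could look like. The function names are hypothetical, and the constants are rounded approximations: one Solar Mass is roughly 1047.35 Jupiter Masses and one Solar Radius is roughly 9.73 Jupiter Radii.

```fsharp
[<Measure>] type SolarRadius
[<Measure>] type SolarMass
[<Measure>] type JupiterRadius
[<Measure>] type JupiterMass

// Rounded conversion factors (approximate values).
let jupiterMassesPerSolarMass = 1047.35<JupiterMass/SolarMass>
let jupiterRadiiPerSolarRadius = 9.73<JupiterRadius/SolarRadius>

let solarMassToJupiterMass (mass : float<SolarMass>) : float<JupiterMass> =
    mass * jupiterMassesPerSolarMass

let jupiterMassToSolarMass (mass : float<JupiterMass>) : float<SolarMass> =
    mass / jupiterMassesPerSolarMass

let solarRadiusToJupiterRadius (radius : float<SolarRadius>) : float<JupiterRadius> =
    radius * jupiterRadiiPerSolarRadius

let jupiterRadiusToSolarRadius (radius : float<JupiterRadius>) : float<SolarRadius> =
    radius / jupiterRadiiPerSolarRadius

// Example: the mass of the star ups And expressed in Jupiter masses.
let upsAndInJupiterMasses = solarMassToJupiterMass 1.27<SolarMass>
printfn "%g" (float upsAndInJupiterMasses)
```

Because the factors carry compound units such as JupiterMass/SolarMass, the compiler checks that each conversion multiplies or divides in the right direction.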

Our domain consists of a Planet record type and Star record type. The Star record is embedded in the Planet record similar to how the catalog is structured. We’ll explore in a later blogpost how to change this to a more normalized form but here is what our domain looks like:

The BsonObjectId is going to be used to generate ids for our MongoDb collection. Our sequence of Exoplanets can be generated from the CSV file in the following manner:
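The domain and the row-to-record step described above can be sketched together as follows. The field names and their order follow the printed record shown further down; the CatalogRow type and the toPlanet function are hypothetical stand-ins for whatever the CSV type provider yields, and the unit names are assumptions.

```fsharp
#r "nuget: MongoDB.Bson"
open MongoDB.Bson

[<Measure>] type SolarRadius
[<Measure>] type SolarMass
[<Measure>] type JupiterRadius
[<Measure>] type JupiterMass
[<Measure>] type Kelvin
[<Measure>] type Days

type Star =
    { Id : BsonObjectId
      Name : string
      Radius : float<SolarRadius>
      Mass : float<SolarMass>
      Temperature : float<Kelvin> }

type Planet =
    { Id : BsonObjectId
      Name : string
      OrbitalPeriod : float<Days>
      Mass : float<JupiterMass>
      Radius : float<JupiterRadius>
      Temperature : float<Kelvin>
      Star : Star }

// Hypothetical flat row standing in for one line of the CSV.
type CatalogRow =
    { PlanetName : string; Period : float; PlanetMass : float
      PlanetRadius : float; PlanetTemperature : float
      StarName : string; StarRadius : float
      StarMass : float; StarTemperature : float }

let newId () = BsonObjectId(ObjectId.GenerateNewId())

// Attach units by multiplying with a unit-carrying 1.0; nan stays nan.
let toPlanet (row : CatalogRow) : Planet =
    let star : Star =
        { Id = newId ()
          Name = row.StarName
          Radius = row.StarRadius * 1.0<SolarRadius>
          Mass = row.StarMass * 1.0<SolarMass>
          Temperature = row.StarTemperature * 1.0<Kelvin> }
    { Id = newId ()
      Name = row.PlanetName
      OrbitalPeriod = row.Period * 1.0<Days>
      Mass = row.PlanetMass * 1.0<JupiterMass>
      Radius = row.PlanetRadius * 1.0<JupiterRadius>
      Temperature = row.PlanetTemperature * 1.0<Kelvin>
      Star = star }

let sampleRow =
    { PlanetName = "ups And c"; Period = 240.937; PlanetMass = 9.1
      PlanetRadius = nan; PlanetTemperature = nan
      StarName = "ups And"; StarRadius = 1.631
      StarMass = 1.27; StarTemperature = 6212.0 }

let exoplanets = [ sampleRow ] |> Seq.map toPlanet |> Seq.toList
```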

For giggles, we print out all the freshly baked records via:

An example of which looks like:

{Id = 59828c032998980c8b42b749;
 Name = "ups And c";
 OrbitalPeriod = 240.937;
 Mass = 9.1;
 Radius = nan;
 Temperature = nan;
 Star = {Id = 59828c032998980c8b42b74a;
         Name = "ups And";
         Radius = 1.631;
         Mass = 1.27;
         Temperature = 6212.0;};}

Ugh, we see a bunch of nans indicating the data simply doesn’t exist in the catalog. It turns out that if we filter out every record containing a nan, we are left with almost no data; instead, we’ll leave them in and selectively filter out the nan records according to the field we consider in each study.

Persistence via Mongo

We’ll use MLab for our MongoDb storage needs. Our first step is to create a database for our planetary studies. That can be done easily by choosing “Create New” on the MLab home screen after setting up an account.

Create a new Database

We’ll name the database “astrosharp”, choose the free tier and fill in the remaining fields based on your location. Subsequently, we create a new collection called “Exoplanets” and add a new Database user with a password; this will give us all the information needed for our connection string.

Create a new Collection
Add Database Users
Grab the Connection String based on the Db User and Password

We make use of the connection string to connect to the astrosharp database, get the Exoplanets collection and add our newly created exoplanet records in the following manner:
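A hedged sketch of that step, assuming the MongoDB.Driver package, might look like this. The document types are simplified (units of measure dropped for brevity), saveExoplanets is a hypothetical helper, and ASTROSHARP_MONGO_URI is a made-up environment variable used here so that the user name and password stay out of the source.

```fsharp
#r "nuget: MongoDB.Driver"
open System
open MongoDB.Bson
open MongoDB.Driver

// Simplified documents for illustration only.
type Star = { Id : BsonObjectId; Name : string; Mass : float }
type Planet = { Id : BsonObjectId; Name : string; Mass : float; Star : Star }

let newId () = BsonObjectId(ObjectId.GenerateNewId())

// Connects, picks the database and collection named in this post, and inserts.
let saveExoplanets (connectionString : string) (planets : Planet list) =
    let client = new MongoClient(connectionString)
    let database = client.GetDatabase("astrosharp")
    let collection = database.GetCollection<Planet>("Exoplanets")
    collection.InsertMany(planets)

let upsAnd : Star = { Id = newId (); Name = "ups And"; Mass = 1.27 }
let planets : Planet list =
    [ { Id = newId (); Name = "ups And c"; Mass = 9.1; Star = upsAnd } ]

// Only talk to the database when a connection string has been supplied.
let insertedCount =
    match Environment.GetEnvironmentVariable "ASTROSHARP_MONGO_URI" with
    | null | "" ->
        printfn "No connection string set; skipping the insert."
        None
    | uri ->
        saveExoplanets uri planets
        Some planets.Length
```

Reading the documents back into F# records may need extra care, such as marking the records with CLIMutable, depending on the driver version.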

If all goes well, we see our records added to the database:

An example of a document looks like:

{
    "_id": {
        "$oid": "59827b462998980c8b426211"
    },
    "Name": " EPIC 22881391 b",
    "OrbitalPeriod": 0.179715,
    "Mass": 0.7,
    "Radius": 0.079,
    "Temperature": "NaN",
    "Star": {
        "_id": {
            "$oid": "59827b462998980c8b426212"
        },
        "Name": "EPIC 228813918",
        "Radius": 0.442,
        "Mass": 0.463,
        "Temperature": 3492
    }
}

Exoplanetary Studies

Once our data has been successfully persisted, we want to be able to create cleaned data sets that we’ll run some descriptive statistical methods on and generate some plots.

For this we define a DescriptiveStatistics record type and make use of the MathNet.Numerics library for the following statistics:

  1. Mean
  2. Median
  3. Standard Deviation
  4. Min
  5. Max

in our various studies.
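As a rough sketch, assuming the MathNet.Numerics package, the record and a helper that fills it could look like this. The field names follow the results printed later in this post; computeDescriptiveStatistics is a hypothetical name, and the sample values are made up for illustration.

```fsharp
#r "nuget: MathNet.Numerics"
open MathNet.Numerics.Statistics

type DescriptiveStatistics =
    { Mean : float
      StandardDeviation : float
      Median : float
      Min : float
      Max : float
      Count : int }

// Statistics.StandardDeviation is the sample (N - 1) standard deviation.
let computeDescriptiveStatistics (values : float list) : DescriptiveStatistics =
    { Mean = Statistics.Mean values
      StandardDeviation = Statistics.StandardDeviation values
      Median = Statistics.Median values
      Min = Statistics.Minimum values
      Max = Statistics.Maximum values
      Count = List.length values }

// Made-up sample values, only to exercise the helper.
let sampleStatistics = computeDescriptiveStatistics [ 1.0; 2.0; 3.0; 4.0; 5.0 ]
printfn "%A" sampleStatistics
```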

In addition to getting these descriptive statistics, we want a way to visualize our results for which we’ll be using XPlot.GoogleCharts.

To compare different masses with that of Earth, we add a new unit of measure called EarthMass and provide a conversion function.
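A small sketch of that addition follows. The function name is hypothetical and the factor is a rounded approximation: one Jupiter Mass is roughly 317.83 Earth Masses.

```fsharp
[<Measure>] type JupiterMass
[<Measure>] type EarthMass

// Rounded conversion factor (approximate value).
let earthMassesPerJupiterMass = 317.83<EarthMass/JupiterMass>

let jupiterMassToEarthMass (mass : float<JupiterMass>) : float<EarthMass> =
    mass * earthMassesPerJupiterMass

// Example: the planet ups And c (9.1 Jupiter masses) in Earth masses.
let upsAndCInEarthMasses = jupiterMassToEarthMass 9.1<JupiterMass>
printfn "%g" (float upsAndCInEarthMasses)
```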

Exoplanet Study 1 : Exoplanetary Mass

Cleaning The Data

As mentioned before, we want to get rid of the pesky nan masses littered across our dataset for both the exoplanet and its corresponding star.
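One way to sketch this cleaning step is shown here, using simplified records without units of measure. The helper names and the two "hypothetical" planets are made up for illustration.

```fsharp
open System

type Star = { Name : string; Mass : float }
type Planet = { Name : string; Mass : float; Star : Star }

// nan never compares equal to anything, including itself,
// so Double.IsNaN is used instead of an equality check.
let hasKnownMasses (planet : Planet) =
    not (Double.IsNaN planet.Mass) && not (Double.IsNaN planet.Star.Mass)

let cleanForMassStudy (planets : Planet list) =
    planets |> List.filter hasKnownMasses

let upsAnd : Star = { Name = "ups And"; Mass = 1.27 }
let unknownStar : Star = { Name = "hypothetical star"; Mass = nan }

let planets : Planet list =
    [ { Name = "ups And c"; Mass = 9.1; Star = upsAnd }
      { Name = "hypothetical b"; Mass = nan; Star = upsAnd }
      { Name = "hypothetical c"; Mass = 1.0; Star = unknownStar } ]

let cleaned = cleanForMassStudy planets
printfn "%d of %d planets kept" cleaned.Length planets.Length
```

Filtering per study, rather than once up front, keeps records that lack a radius or temperature available for the mass study.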

Descriptive Statistics Results

We generate our descriptive statistics with a function called computeDescriptiveStatisticsForMass, which computes them from the cleaned exoplanetary masses.

{Mean = 4.137164328;
 StandardDeviation = 9.372201446;
 Median = 1.05;
 Min = 6.3e-05;
 Max = 93.6;
 Count = 1415;}

It’s evident from the gap between the Median (1.05) and the Mean (about 4.14) that there are a considerable number of high-mass outliers and therefore the distribution is far from a bell curve.

Plots

  1. Histogram

Let’s first take a look at the frequencies of the Exoplanetary masses using a histogram.

Dat Skew. The frequency distribution is positively skewed, implying that there are a lot more exoplanets in this catalog whose mass is close to that of Jupiter.

2. Pie Chart

Next, let’s compare the mass of the Earth to that of the exoplanets via a Pie Chart indicating how many planets are greater in mass than the Earth.
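A sketch of how the slices could be computed and handed to XPlot.GoogleCharts is shown here, assuming that package. The three masses are values that appear elsewhere in this post, and the Earth mass is expressed in Jupiter masses using the rounded factor of 317.83 Earth masses per Jupiter mass.

```fsharp
#r "nuget: XPlot.GoogleCharts"
open XPlot.GoogleCharts

// Mass of the Earth expressed in Jupiter masses (approximate).
let earthMassInJupiterMasses = 1.0 / 317.83

// A few masses in Jupiter masses, standing in for the cleaned data set.
let massesInJupiterMasses = [ 9.1; 0.7; 6.3e-05 ]

let heavier, lighter =
    massesInJupiterMasses
    |> List.partition (fun mass -> mass > earthMassInJupiterMasses)

let slices =
    [ "More massive than Earth", heavier.Length
      "Earth mass or less", lighter.Length ]

let chart =
    slices
    |> Chart.Pie
    |> Chart.WithTitle "Exoplanet mass compared to Earth"
    |> Chart.WithLegend true

// Piping the chart into Chart.Show opens it in the default browser.
```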

Woah dude, according to this catalog there are a lot more planets out there with larger masses than the Earth!

3. Scatter Plot

Let’s create a scatter plot of the Stellar Mass in Solar Mass on the Y axis and Exoplanetary mass in Jupiter Mass on the X Axis.

Pretty interesting that a majority of the data points stay very close to the origin. Additionally, it appears to be rare in this catalog for large planets to orbit an ultra massive star, though some Physics would be needed to establish this thoroughly.

Conclusion

Through this blogpost, we were able to use MongoDb to save Exoplanetary data and use some basic statistical analysis to create a study on the mass of Exoplanets. It’s definitely fun to try to come up with observational theories about the data despite operating on a sample set.

I plan to write about some other, more interesting studies based on the same Exoplanetary catalog in a future blog post. Needless to say, the ability to generate descriptive statistics and concoct plots this easily is simply:

100% Awesome