My First Data Science Research Paper

As a freshman in high school

Aryaman Kukal
12 min read · Jul 2, 2020

This is a paper I recently submitted as part of a research program I applied to and was accepted into. Over the course of around four months, we will be learning the concepts of data science and machine learning. My first project in the field of data science was to analyze a data set, describe it, and finally evaluate it.

The data represents 3,839 exoplanets, each one having its own set of 355 attributes. The goal of the data is to find exoplanets that could potentially support life as we know it, though our task was not to draw those conclusions ourselves. It was no easy task, but it was truly an eye-opening experience for me, even though data science still holds many incredibly difficult concepts that I’m not sure when I’ll fully understand.

Regardless, the following is my paper analyzing this data set.

See the CSV here and the column definitions here.

1. Describing the Data

First, we should start off with what the data actually represents. This data consists of 355 attributes, or variables, which make up the first row of the data, and 3,839 observations, or exoplanets, involved in the NASA study. Essentially, there are 3,839 observed exoplanets, each with 355 attributes, although not every attribute has been measured for every planet. The purpose of the data is to compare each exoplanet’s variables with those of our own planet, Earth, and see if that particular exoplanet can support life as we know it.

This data represents a mixed-attribute problem, where there are multiple data types involved. There are some text attributes, such as “Radial Velocity” and “Imaging”, and mostly numerical values, including floats and integers. The data has many exceptions, the biggest one being that there are a lot of missing values, which show up as blank spaces in the comma-separated values file. For example, pl_orbincl is mostly empty, meaning that planet inclination isn’t a good attribute to draw conclusions from because its data is insufficient.

My method for analyzing the data is a correlation matrix based on the Kendall rank correlation coefficient: a matrix that shows the correlation coefficient between every pair of variables in the data frame. Dark red represents a high positive correlation and dark blue represents a high negative correlation. More on the correlation matrix can be found in the 5th section of this paper. Note: the matrix excludes the exceptions of the data, the missing values. I did not replace them with the mean.

Let’s take a look at the correlation matrix for the final data frame, which was cut down in terms of variables. This is the completely zoomed-out version. Each individual box, or pixel, represents the correlation coefficient for one pair of variables. We need to observe the darkest areas for the highest correlation. However, the diagonal line through the middle should be ignored because it represents the correlation of each variable with itself, which is always 1. Hence, those are not true correlations.

Out of the entire matrix, which consists of 123 variables and 15,129 cells after cutting out a number of columns (more on this in the 5th part), there are a few quite significant variable correlations that matter in the context of the problem, which is to find planets that can support life as we know it.

Correlations:

Since the goal of this data is to find planets that can support life as we know it, we need to focus on the single most important ingredient for life to emerge: water. Hence, we need to observe the variables that relate to the availability of water on the exoplanet.

(left visual: zoomed-in correlation matrix, right visual: scatter correlation plot with linear regression showing correlation factor)
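As a rough sketch, a scatter plot like the right-hand visual could be generated with seaborn; the file name and the choice of the pl_eqt ~ pl_radj pair are my assumptions for illustration, not the exact code used here:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("exoplanets.csv")  # file name assumed for illustration

# Scatter plot with a fitted linear regression line for one pair;
# seaborn drops missing values automatically
sns.regplot(x="pl_radj", y="pl_eqt", data=df)

# Kendall rank coefficient for the same pair (NaNs excluded pairwise)
tau = df["pl_radj"].corr(df["pl_eqt"], method="kendall")
plt.title(f"pl_eqt ~ pl_radj (Kendall tau = {tau:.2f})")
plt.show()
```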

st_teff ~ st_sp (negative correlation)

This correlation has a factor of -0.784 on the -1 to 1 scale. The effective temperature of an exoplanet’s star is crucial in determining if water can be sustained on that planet, and it is inferred from the electromagnetic radiation the star emits. Our Sun’s temperature is 5,780 K, so it can easily be compared with the stars of other exoplanets to find if the exoplanets within a certain range of that star are habitable. The spectral type of a star can be determined using the effective temperature, as they negatively correlate at a high factor.

pl_eqt ~ pl_radj (positive correlation)

This correlation has a factor of 0.694 on the -1 to 1 scale. A planet’s equilibrium temperature, or the temperature at which the energy coming from the star equals the energy emitted by the planet, is another temperature factor that determines if water can be sustained on the exoplanet. Earth’s equilibrium temperature is 255 K, so it can easily be compared with other exoplanets’ to find if they’re habitable. The radius of the planet can help find the equilibrium temperature, as they positively correlate at a high factor.

pl_masse ~ pl_rvamp (positive correlation)

This correlation has a factor of 0.761 on the -1 to 1 scale. Without a certain amount of mass, a planet won’t have enough gravity to keep its water from escaping into space, so mass is the base attribute of a planet that determines if water can be sustained. pl_masse is measured in Earth masses, and because Earth’s mass is exactly 1 on that scale, it can easily be compared to any other exoplanet’s. The radial velocity amplitude, in m/s, can help in determining the mass of an exoplanet, as they positively correlate at a high factor.

st_lum ~ pl_eqt (positive correlation)

This correlation has a factor of 0.614 on the -1 to 1 scale. Star luminosity, or the total amount of energy a star emits per second, is perhaps its most important attribute in terms of the star and the planets in its system. The amount of energy and brightness it emits determines if water can be sustained on an exoplanet. Our Sun’s luminosity is 1 on the solar scale, so it can easily be compared with other stars. A planet’s equilibrium temperature can help us determine its star’s luminosity because they positively correlate at a high factor.

pl_dens ~ st_bmy (positive correlation)

This correlation has a factor of 0.984 on the -1 to 1 scale. Since a small planet experiences less gravitational compression than a larger planet, density is key in determining if a planet can support life as we know it. Earth’s density is 5.51 g/cm³, so it can easily be compared to any exoplanet in the data. The star’s color, measured by the difference between its b and y (Strömgren) magnitudes, can be estimated from the exoplanet’s density, as they positively correlate at a high factor.

2. Strengths & Shortcomings of the Data

Let’s start off with the strengths.

  1. It is quite obvious that the larger the data frame is, the more accurate the results, because more observations and variables were taken into account. The exoplanet data includes 3,839 observations, or exoplanets, with 355 different attributes, or variables, recorded for each planet. We can better predict which exoplanets are habitable because we can compare 355 individual attributes of each exoplanet to Earth’s. That is quite a large advantage, and a model built on this data has a lot of information to work with. Having more data also allows us to identify anomalies, such as outliers, faster, because we have more data to see where the general trend lies.
  2. This data represents a mixed-attribute problem, meaning there are two or more types of data. In this case, there is numerical data, which makes up most of the data frame, and text data, which only makes up the first few columns but is still present. The more variation there is in the data, the stronger it is, because it takes many different types of variables into account.
  3. The final strength of the exoplanet data is the fact that it came from a very reliable source. We should always know where the data came from, what instruments were used to collect it, and who was involved in its collection. In our case, the data came from a NASA research project, and NASA is a well-known space agency that we can fully trust.

Moving on to the shortcomings.

  1. There is a large number of missing values in the data frame. Some variables have only 5 of the over 3,000 values required. This is quite problematic because we might not have enough data to achieve what we are trying to achieve, especially if those variables have a large significance in the data. Missing data is never a good thing; the more data there is, the more accurate the results or predictions will turn out.
  2. There is too much data. This is an advantage as well as a disadvantage. The more data you have, the more of a hassle it is to work with, the longer it takes to train a machine learning model, and the harder it is to understand the data fully. With 355 different variables and 3,839 observations, it is not reasonable to expect that we, as the analyzers of the data, can fully understand what each and every variable represents and its significance to what we are trying to achieve.

3. Exceptions

This data frame has quite an obvious exception, and it is the only one we can identify, because we don’t know exactly how the data was collected and turned into a data set. Hence, we have to rely on plain visuals to look for anomalies in our data. The biggest exception in this data is the large number of missing values, or empty boxes. In the exoplanet data, we can clearly see that variables such as pl_denserr1 have large gaps of missing data between a few values here and there. There are only 6 columns in the entire data frame that are 100% filled. This could pose many issues. First, most data processors automatically eliminate rows with blank values, which can turn into a big hassle: in the end, you may not have enough data to perform the analysis, and the analysis you wanted to run might not be possible with the inadequate data. Second, even if the data produces output, the results may not be statistically significant because of the small amount of input data and the large number of missing values.
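As a rough sketch of how the missing values can be counted with Pandas (the file name is my assumption):

```python
import pandas as pd

df = pd.read_csv("exoplanets.csv")  # file name assumed for illustration

# Count missing values per column, worst offenders first
missing = df.isna().sum().sort_values(ascending=False)
print(missing.head(10))

# Number of columns with no missing values at all
print((missing == 0).sum())
```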

4. Options for Dealing with this Exception

  1. Dropping Variables: This isn’t the best solution, because losing valuable data is the worst thing that can happen, that is, if the variable you’re dealing with is valuable. However, if a lot of its data is missing and you feel the variable isn’t too significant in helping you reach what you’re trying to prove, dropping the variable entirely should be an option.
  2. Imputing the mean/median/mode: We can also calculate the mean, median, or mode of that variable’s data, whichever one is most representative of it, and fill the gaps with it. This is a good method because it is fast and can be done easily and efficiently.
  3. Dropping observations: If an observation is only missing a few variable values, you can drop the observation entirely and not consider it in any final outcome. However, this has the same potential consequence as dropping variables: losing potentially significant data is never a good thing. A sketch of all three options in Pandas follows this list.
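As a hedged sketch (not the exact code used for this paper; the file name is my assumption), the three options might look like this in Pandas:

```python
import pandas as pd

df = pd.read_csv("exoplanets.csv")  # file name assumed for illustration

# 1. Dropping variables: keep only columns where at least half
#    of the observations are present
thinned = df.dropna(axis=1, thresh=len(df) // 2)

# 2. Imputing the mean: fill each numeric column's NaNs with its mean
numeric = df.select_dtypes(include="number")
imputed = df.copy()
imputed[numeric.columns] = numeric.fillna(numeric.mean())

# 3. Dropping observations: remove every row with any missing value
complete_rows = df.dropna(axis=0)
```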

5. My Approach to Analyzing the Data

Dealing with the exception: My preferred approach to dealing with missing values in this data set is to impute the mean in place of the missing values. The mean will represent the calculated average of every column that has numeric values. The resource I will use to analyze and alter the exoplanet data is a Python library called Pandas. The issue of missing data can be solved with a simple Pandas method: fillna(), which fills NaN values according to a certain rule. In our case, we have to find all the NaN values, calculate the mean of the column each NaN value is in, if it’s numerical, and replace that NaN value with the mean. This takes place throughout the data frame. This is the safest way to deal with the exception of missing values because there is no risk of losing potentially valuable data.

These are the first 10 rows of the exoplanet data (not all columns are shown). There are a few NaN values in st_c1err and st_c1lim.

Using this code in Pandas, every NaN value will be replaced by its column’s mean.
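A minimal sketch of what that code might look like (the file name is my assumption):

```python
import pandas as pd

df = pd.read_csv("exoplanets.csv")  # file name assumed for illustration

# fillna with a Series of column means replaces each numeric column's
# NaN values with that column's mean; text columns are left untouched
df = df.fillna(df.mean(numeric_only=True))
```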

The result is as shown. The NaN values disappear and are successfully replaced.

Analyzing the data: I prefer using the Pandas library for data analysis as well, run in PyCharm. My method for analyzing this data is mainly based on one type of visual representation: a “heat map” chart called a correlation matrix. It’s a table showing the correlation coefficients of all the variables, in pairs. Each and every variable in the data frame is paired with every other variable in a matrix, which allows you to see which pairs have the highest correlation. The strength of the correlation is shown by color: in our case, dark red represents a high positive correlation and dark blue represents a high negative correlation, as shown on the left. This particular correlation matrix uses the Kendall rank correlation coefficient, which is on a scale of -1 to 1.

Before creating the chart, however, we need to cut down the data, or it will be too hard to observe. There are many repeated variables; not exact duplicates, but repetitive ones. For example, the variable pl_masse comes with the variables pl_masseerr1, pl_masseerr2, and pl_masselim, which represent error (err) and limit (lim) values. These are not important, so I dropped those columns entirely with this code. Note: the matrix excludes the exceptions of the data, the missing values. I did not replace them with the mean.
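As a hedged sketch of both steps, dropping the err/lim columns and drawing the Kendall heat map might look like this (the file name, the column-name patterns, and the use of seaborn are my assumptions):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("exoplanets.csv")  # file name assumed for illustration

# Drop the repetitive error (err) and limit (lim) columns
drop_cols = [c for c in df.columns if c.endswith(("err1", "err2", "lim"))]
df = df.drop(columns=drop_cols)

# Kendall rank correlation over the numeric columns; missing values
# are excluded pairwise rather than imputed
corr = df.select_dtypes(include="number").corr(method="kendall")

# Heat map: dark red = strong positive, dark blue = strong negative
sns.heatmap(corr, cmap="RdBu_r", vmin=-1, vmax=1)
plt.show()
```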

We want to find certain pairs of variables that correlate strongly and contribute greatly to the goal of the data, which is to find exoplanets that can support life as we know it. The major correlations were covered in the first section of this paper.

6. Identifying & Describing the Weaknesses of my Approach

Let’s go over some disadvantages of my approaches to analyzing this data.

Dealing with the exception: Imputing the mean/median/mode only works on numerical data, which is a big limitation because our exoplanet data is a mixed-attribute problem. It also decreases the overall variance of the data, because many values will be identical, which is never a good thing: less varied data leads to higher chances of inaccurate results or predictions.
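A tiny worked example of the variance problem, with made-up numbers:

```python
import pandas as pd

# Three observed values and two missing ones
s = pd.Series([1.0, 2.0, None, None, 5.0])

print(s.var())                   # variance of the observed values: ~4.33
print(s.fillna(s.mean()).var())  # after mean imputation: ~2.17, much lower
```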

Using a correlation matrix: The Kendall coefficients that make up the visual representation only measure monotonic relationships between x and y; that is, whether y consistently rises or falls as x rises. If the relationship is not monotonic, for example if y falls and then rises again as x increases, the coefficient understates it and the result is inaccurate.
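For example, a perfectly U-shaped relationship gets a Kendall coefficient near zero, as this small sketch shows:

```python
import numpy as np
import pandas as pd

# y depends perfectly on x, but not monotonically (y falls, then rises)
x = pd.Series(np.linspace(-1, 1, 101))
y = x ** 2

print(x.corr(y, method="kendall"))  # close to 0 despite perfect dependence
```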

Using Pandas: The main disadvantage of Pandas is that it holds the entire data frame in memory, which is quite easy to fill with a data set this large. Because of this, Pandas, along with Python in general, proves to be slow at times.

And that concludes the paper. Stay safe everyone!

