The Bubble Nebula, an emission nebula blown by the wind of a massive star (Credit: T.A. Rector/University of Alaska Anchorage, H. Schweiker/WIYN and NOIRLab/NSF/AURA)

My First Data Science Project (2/5)

Data set, Approach and Pre-processing

Marco De Pascale
3 min read · Oct 10, 2020


As I said in my previous post, the question to answer was:

Am I able to reproduce supernova classification based on spectroscopy (aka a highly informative observation method) using a sufficiently large number of photometric observations (aka a low-information observation method)?

The data set

The data set I used contained roughly 20K simulated supernovae; each simulation was provided in up to four optical filters as a time series of light measurements (in astronomy called light curves), with an associated error for each data point. All in all, the data set contained about 80K light curves; the following figure shows an example.

Example light curve of a simulated supernova, with data points and associated error bars.

As the figure shows, light curves are characterised by a fast rise in the first days, a maximum, and then a decay, fast or slow depending on the star that exploded. Light curves can also show a lower second maximum or a plateau following the maximum.

As happens in real astronomy, data points in the light curves are not equally spaced (on some nights clouds may cover the sky, hiding the supernova); also, the precision with which the received light is measured is not always the same, so each data point can have a different associated error. One more thing that happens in real astronomy is that a supernova may only be observed after its maximum, in which case the light curve has only the decaying part.
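To make this concrete, here is how a single light curve can be represented in code. The column names and the pandas layout below are only an illustration of the structure (time, filter, flux, error), not the exact format of the data set.

```python
import pandas as pd

# Hypothetical representation of one simulated supernova: each row is one
# photometric measurement in one optical filter.
light_curve = pd.DataFrame({
    "mjd":      [56200.1, 56204.3, 56209.0, 56215.7, 56230.2],  # observation time (days)
    "filter":   ["g", "g", "r", "g", "r"],                      # optical filter
    "flux":     [12.4, 35.8, 41.2, 28.9, 10.3],                 # measured flux (arbitrary units)
    "flux_err": [1.1, 1.4, 1.2, 1.6, 1.9],                      # error on each data point
})

# The sampling is irregular: gaps between consecutive epochs are not constant.
print(light_curve["mjd"].diff())
```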

The approach

The idea behind this project was to build a data-driven classifier, with as few assumptions as possible, and using weak assumptions where needed: the information would have to come only from the data set.

Reasoning backwards from the final target, it is easier to understand which kind of pre-processing is needed:

  • The final target is a classification model whose goal is to find clusters of similar objects (supernova light curves) in a parameter space;
  • Clusters are defined on the basis of similarities between objects: similar objects are expected to lie near each other in that parameter space, which is why they cluster;
  • To assess similarities, objects have to be comparable.

As pointed out above, the light curves in the data set are not comparable at all: their data points lie on different, irregular time grids, and they are subject to astrophysical effects of different magnitudes.
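A quick way to see why this matters: clustering algorithms expect every object to be described by a feature vector of the same length. Purely as an illustration (the actual model comes later in this series), scikit-learn's KMeans only accepts an array where each row is one supernova and each column is the flux at one common epoch:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: after pre-processing, every light curve becomes a vector
# of fluxes sampled on the same regular grid of n_epochs points.
rng = np.random.default_rng(0)
n_supernovae, n_epochs = 100, 50
X = rng.random((n_supernovae, n_epochs))

# Only with this fixed-length, comparable representation can a clustering
# algorithm group similar light curves together.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(labels[:10])
```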

Pre-Processing

The tasks of this step are:

  • To “clean” the light curves of the astrophysical effects;
  • To resample them onto the same regular time grid, so that every light curve has the same number of equally spaced data points.

The astrophysical effects to take into account are the reddening of the emitted radiation caused by interstellar extinction, and time dilation.

Interstellar extinction is caused by dust grains along the path between the supernova and the Earth, which absorb and scatter light. The process is complex, involving the radiation wavelength, the dust grains' shape and composition, and the direction of the line of sight. For us here it is enough to know that bluer light is more affected than redder light. The correction was carried out using the extinction caused by the host galaxy and by the Milky Way; the first was provided in the data set, while the second was calculated.
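In practice, correcting for extinction amounts to scaling each measured flux by the total extinction (in magnitudes) along the line of sight, Milky Way plus host galaxy. The sketch below is a minimal version with made-up extinction values; in the project the host value came with the data set and the Milky Way value was computed separately.

```python
import numpy as np

def deredden(flux, a_mw, a_host):
    """Correct observed fluxes for interstellar extinction.

    a_mw, a_host: extinction in magnitudes in the given filter, for the
    Milky Way and the host galaxy respectively.
    """
    a_total = a_mw + a_host
    # An extinction of A magnitudes dims the flux by 10**(-0.4 * A),
    # so the correction multiplies the observed flux by 10**(0.4 * A).
    return flux * 10 ** (0.4 * a_total)

flux_obs = np.array([12.4, 35.8, 41.2])            # observed fluxes (arbitrary units)
print(deredden(flux_obs, a_mw=0.05, a_host=0.12))  # illustrative extinction values
```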

Time dilation is due to the expansion of the Universe, which causes the wavelength of the light travelling toward the Earth to get “stretched”, to become longer, thus redder; the same stretching, by a factor of 1 + z, also applies to the light curve's time axis. The information needed to correct for this effect (the so-called redshift z) was included in the data set.
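Given the redshift z, the time-dilation correction is simply a rescaling of the observed time axis by 1 + z, bringing every light curve back to its rest-frame duration. A minimal sketch, with a made-up redshift value:

```python
import numpy as np

def to_rest_frame(t_obs, z, t_ref=0.0):
    """Convert observed epochs (days) to rest-frame epochs.

    Cosmological time dilation stretches the light curve by (1 + z),
    so the rest-frame time axis is the observed one divided by (1 + z).
    """
    return (np.asarray(t_obs) - t_ref) / (1.0 + z)

t_obs = np.array([0.0, 4.2, 9.0, 15.7, 30.2])  # days since maximum, observer frame
print(to_rest_frame(t_obs, z=0.3))             # illustrative redshift
```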

To resample the light curve measurements onto the same regular time grid I used a regression method called Gaussian Processes, a non-parametric method that finds the best curve interpolating a set of points using only weak assumptions. I found Gaussian Processes very powerful, and they deserve their own post… see you there.
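As a small preview of that post, here is the kind of Gaussian Process regression I mean, sketched with scikit-learn's GaussianProcessRegressor; the kernel and its length scale are placeholders, and the per-point measurement errors enter through the alpha parameter.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# One irregularly sampled light curve (rest-frame days, corrected flux, errors).
t = np.array([0.0, 3.2, 6.9, 12.1, 23.2])
flux = np.array([12.4, 35.8, 41.2, 28.9, 10.3])
flux_err = np.array([1.1, 1.4, 1.2, 1.6, 1.9])

# A smooth kernel encodes only weak assumptions about the curve's shape;
# measurement errors are passed as per-point variances through `alpha`.
kernel = ConstantKernel(1.0) * RBF(length_scale=10.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=flux_err**2, normalize_y=True)
gp.fit(t.reshape(-1, 1), flux)

# Resample every light curve onto the same regular grid of epochs.
t_grid = np.linspace(0.0, 30.0, 50).reshape(-1, 1)
flux_grid, flux_grid_std = gp.predict(t_grid, return_std=True)
```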

To be continued …
