Data Science in Astronomy

Michael Gordon
Trends in Data Science
10 min read · Aug 18, 2019


Astronomy is a scientific discipline with a long history stretching back to antiquity, with roots among the ancient indigenous peoples of the Americas and Aboriginal Australians (Wikipedia 2019; Bhathal, R. 2006). The primary focus of astronomy is the study of celestial objects and phenomena (Wikipedia 2019), as well as the development and validation of scientific theories concerning such phenomena.

There are many branches or subfields in astronomy; however, this paper focuses primarily on examples of data science techniques and approaches to handling data in radio astronomy and optical astronomy.

Radio astronomy studies the radio frequencies emitted by celestial objects such as stars, quasars, pulsars and masers. Radio astronomy is credited with the discovery of the cosmic microwave background radiation, which is a notable piece of evidence for the Big Bang theory (the scientific theory, not the TV show) (Wikipedia 2019), and more recently the Event Horizon Telescope (EHT) has captured the first-ever images of the event horizon of a black hole (Wikipedia 2019).

Optical astronomy, also known as visible-light astronomy, is by far the oldest branch of astronomy. With high-quality telescopes available to the general public, it is practised widely by amateur and professional astronomers alike. However, the business end of optical astronomy is done with space-based telescopes such as the Hubble Space Telescope and, in future, the James Webb Space Telescope, planned for launch on 30th March 2021 (Wikipedia 2019).

Some of the challenges and factors driving innovation in astronomy are the volume of data being generated by various instruments, the low signal-to-noise ratio in the data being captured, and in some cases a lack of observations of rare phenomena.

Volume of data: In optical astronomy the Galaxy Zoo project has been highly successful in classifying galaxy morphology using a crowdsourcing approach, classifying galaxy images at a rate of approximately 100,000 images per year. However, the Large Synoptic Survey Telescope (LSST), coming online in 2020, is anticipated to image approximately 10^10 galaxies. Classifying these galaxies with the same crowdsourcing approach is estimated to take approximately 10^6 years, so clearly this task will need to be automated (Kuminski et al 2014).

Low signal-to-noise ratio: In radio astronomy the data captured by instruments used to observe the skies typically exhibits a lot of noise, including background thermal noise as well as interference from local terrestrial sources such as mobile phones, Wi-Fi, television and other broadcast signals. Boosting the signal-to-noise ratio, and hence the detection rate of signals, presents a significant challenge (Zhang et al 2018).

Lack of observations: Some astronomical events and phenomena, such as Fast Radio Bursts (FRBs), are extremely rare. The rarity of these events, and therefore the lack of observations, makes them difficult to study in general. This also has implications for machine learning approaches, as there are simply not enough labelled examples to train a model (Zhang et al 2018).

Galaxy Morphological Classification using Machine Learning

The Galaxy Zoo (GZ) and Galaxy Zoo 2 (GZ2) projects, run by the Zooniverse citizen science initiative, were set up to assist in the morphological classification of large datasets of galaxy images (Wikipedia 2019; Kuminski et al 2014). Whilst these projects have been shown to be very effective at classifying galaxies, the approach is not going to scale up to the demands of modern digital sky surveys powered by robotic telescopes (Kuminski et al 2014). However, the datasets produced by the GZ and GZ2 projects may prove crucial to the development of an automated galaxy classification system (Kuminski et al 2014).

One example of how this data might be used is the experimental work carried out by Kuminski et al (2014) which has shown that the GZ2 dataset can be used to train a machine learning algorithm, which can then be used to automatically classify various aspects of galaxy morphology.

An important aspect of the GZ and GZ2 data is that each image is classified multiple times by multiple users. For some galaxies where the morphological features are ambiguous, there can be a high degree of variance in the classifications (Kuminski et al 2014). To provide the machine learning algorithm with the cleanest possible training data, images with a high degree of agreement on classification are required. There is a trade-off to be made here, however: a higher agreement threshold reduces the number of images available for training the model, and depending on how high the threshold is set, some classes of galaxy can end up being completely excluded from the training set (Kuminski et al 2014). Additionally, this can lead to an unbalanced dataset, which can introduce bias. To balance the dataset, the number of samples in each galaxy class is reduced to match that of the smallest class (Kuminski et al 2014).
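The filtering and balancing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the toy records and the 0.9 threshold are hypothetical and are not taken from Kuminski et al (2014).

```python
import random
from collections import defaultdict


def filter_and_balance(samples, min_agreement, seed=0):
    """Keep samples that meet the agreement threshold, then undersample
    every surviving class to the size of the smallest surviving class.

    samples: list of (image_id, label, agreement), agreement in [0, 1].
    """
    by_class = defaultdict(list)
    for image_id, label, agreement in samples:
        if agreement >= min_agreement:
            by_class[label].append(image_id)
    if not by_class:
        return {}
    smallest = min(len(ids) for ids in by_class.values())
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    return {label: sorted(rng.sample(ids, smallest))
            for label, ids in by_class.items()}


# Hypothetical records: (image id, majority label, fraction of voters agreeing)
votes = [
    ("g1", "spiral", 0.95), ("g2", "spiral", 0.91), ("g3", "spiral", 0.97),
    ("g4", "spiral", 0.60), ("g5", "elliptical", 0.92),
    ("g6", "elliptical", 0.93), ("g7", "edge-on", 0.70),
]
balanced = filter_and_balance(votes, min_agreement=0.9)
print(balanced)
```

In this toy run the "edge-on" class disappears entirely because its only image falls below the threshold, which is the trade-off the paragraph above describes.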

Due to the complicated and heterogeneous nature of galaxy images, a wide range of measures from different aspects of the image need to be analysed, and in this particular instance the Wndchrm schema was chosen (Kuminski et al 2014). Wndchrm is an open source utility which was originally developed for biological image analysis; however, it has previously been used for classification of galaxy morphology (Kuminski et al 2014).

The Wndchrm schema extracts over 2,000 descriptive features from each image. These features are drawn from various aspects of the image such as texture, pixel intensity, contrast and polynomial representations (Kuminski et al 2014; Shamir et al 2008). Several transforms are stacked on top of each other to extract further features, which can be informative in some cases (Kuminski et al 2014; Shamir et al 2008). For a detailed description of the algorithms used by the Wndchrm utility, refer to Orlov et al (2008) and Shamir et al (2008).

The Wndchrm schema was originally designed for analysing biological images, so only a limited number of the features it produces are going to be sufficiently informative for galaxy images. Additionally, some of these features will be representative of noise in the imagery, and retaining them can impact negatively on the classification accuracy (Kuminski et al 2014; Shamir et al 2008). To address this issue, Fisher discriminant scores are calculated for each feature, and based on these scores only the top 5% of features are kept (Kuminski et al 2014; Shamir et al 2008). A weighted nearest neighbour algorithm is then trained, in which the Fisher discriminant scores are used as weights on the features (Kuminski et al 2014).
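To make the feature selection step more concrete, here is a minimal sketch of Fisher-score ranking followed by a weighted nearest neighbour lookup. It is a simplification and not the Wndchrm implementation: the score used here is the variance of the class means divided by the mean within-class variance, the classifier is a plain weighted 1-nearest-neighbour, and the four-feature toy data is invented for illustration.

```python
import math
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Simplified Fisher score per feature: between-class variance of the
    class means divided by the mean within-class variance."""
    classes = sorted(set(y))
    scores = []
    for f in range(len(X[0])):
        per_class = [[row[f] for row, lab in zip(X, y) if lab == c]
                     for c in classes]
        between = pvariance([mean(vals) for vals in per_class])
        within = mean([pvariance(vals) for vals in per_class])
        scores.append(between / (within + 1e-12))  # avoid division by zero
    return scores


def top_features(scores, fraction):
    """Indices of the highest-scoring fraction of features (at least one)."""
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:keep])


def predict(x, X, y, scores, kept):
    """Label of the nearest training sample under a Fisher-weighted
    squared distance restricted to the kept features."""
    def dist(a, b):
        return sum(scores[f] * (a[f] - b[f]) ** 2 for f in kept)
    nearest = min(range(len(X)), key=lambda i: dist(x, X[i]))
    return y[nearest]


# Invented toy data: only feature 0 separates the two classes.
X = [[1.0, 5.1, 0.3, 2.0], [1.2, 4.9, 0.9, 2.2], [0.9, 5.0, 0.1, 1.9],
     [3.0, 5.0, 0.5, 2.1], [3.1, 5.2, 0.2, 2.0], [2.9, 4.8, 0.8, 2.1]]
y = ["spiral"] * 3 + ["elliptical"] * 3

scores = fisher_scores(X, y)
kept = top_features(scores, fraction=0.25)  # the article's pipeline keeps 5%
print(kept, predict([2.8, 5.0, 0.1, 1.9], X, y, scores, kept))
```

With over 2,000 features a 5% cut would keep roughly a hundred of them; the toy example uses 25% only because it has four features.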

The resulting model was able to achieve an accuracy of 85% on 8 out of 10 morphological classifications. For some classifications, however, the high agreement threshold left very few or no samples for training, and for these galaxies automatic classification was not possible (Kuminski et al 2014).

Detecting Fast Radio Bursts using Machine Learning

An FRB is a radio pulse, usually quite short in duration, emanating from a high-energy astrophysical object or event. Whilst the energies are extremely high at the source, the signal is very weak by the time it reaches the Earth, making it very difficult to detect (Wikipedia 2019). The origins and inner workings of FRBs are yet to be fully explained. Some theorise that FRBs are emitted by highly magnetized neutron stars coming into contact with high-velocity gas streams near a supermassive black hole, while other theories suggest that they could be technological signatures of an advanced alien civilization (Breakthrough Initiatives 2018).

Traditional methods of detecting FRBs using de-dispersion based algorithms are susceptible to mistaking radio frequency interference (RFI) and background noise for FRBs, leading to a high false positive rate (Zhang et al 2018). Further to this, observed FRB pulses exhibit signal-to-noise ratios which extend down to the detection threshold of these algorithms, which suggests the presence of further FRB pulses waiting to be discovered in archival datasets (Zhang et al 2018).
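For readers unfamiliar with de-dispersion, the core idea can be sketched as follows. Lower radio frequencies arrive later than higher ones by an amount that depends on the dispersion measure (DM), so a search shifts each frequency channel back by its expected delay and sums the channels. The sketch below uses the standard cold-plasma delay of about 4.149 ms × DM × (ν_lo⁻² − ν_hi⁻²), with frequencies in GHz and DM in pc cm⁻³. The channel frequencies, DM and time resolution are illustrative assumptions, and this is not the pipeline used by Zhang et al (2018).

```python
K_DM_MS = 4.148808  # dispersion constant in ms GHz^2 pc^-1 cm^3


def delay_ms(dm, freq_ghz, ref_ghz):
    """Arrival delay at freq_ghz relative to ref_ghz for a given DM."""
    return K_DM_MS * dm * (freq_ghz ** -2 - ref_ghz ** -2)


def channel_shifts(freqs_ghz, dm, dt_ms):
    """Delay of each channel, in time samples, relative to the top channel."""
    ref = max(freqs_ghz)
    return [round(delay_ms(dm, f, ref) / dt_ms) for f in freqs_ghz]


def dedisperse(spectrogram, freqs_ghz, dm, dt_ms):
    """Shift every channel back by its delay and sum over frequency."""
    n_time = len(spectrogram[0])
    series = [0.0] * n_time
    shifts = channel_shifts(freqs_ghz, dm, dt_ms)
    for channel, shift in zip(spectrogram, shifts):
        for t in range(n_time - shift):
            series[t] += channel[t + shift]
    return series


# Illustrative, noise-free toy observation: 5 channels, 1 ms samples.
freqs = [8.0, 7.0, 6.0, 5.0, 4.0]
true_dm, dt, n_time, t0 = 500.0, 1.0, 200, 50
spectrogram = [[0.0] * n_time for _ in freqs]
for row, shift in zip(spectrogram, channel_shifts(freqs, true_dm, dt)):
    row[t0 + shift] = 1.0  # the pulse sweeps to later times at lower frequency

aligned = dedisperse(spectrogram, freqs, true_dm, dt)
smeared = dedisperse(spectrogram, freqs, 0.0, dt)
print(max(aligned), max(smeared))
```

At the correct DM the five channel contributions line up in a single sample, while at the wrong DM they stay spread out. A real search repeats this over many trial DMs and thresholds the summed series, which is where noise and RFI near the threshold become a problem.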

One example of how modern data science techniques, specifically machine learning (ML), have been used to detect signals missed by traditional de-dispersion techniques is the work carried out by Zhang et al at UC Berkeley. The Breakthrough Listen (BL) team conducted a five-hour observation of FRB 121102. The initial analysis of this 400TB dataset detected 21 bursts, and a scientific paper based on this analysis alone was accepted for publication in the Astrophysical Journal (Breakthrough Initiatives 2018). Further analysis was conducted by the BL team, in which Zhang et al developed a Convolutional Neural Network (CNN) that was able to detect an additional 72 bursts within this existing dataset (Zhang et al 2018).

There were a number of challenges in conducting this work, which were addressed in some particularly innovative ways: firstly, a lack of labelled examples of FRB signals for training the model; secondly, the low signal-to-noise ratio; and thirdly, the interpretability of the model and hence the appropriateness of how the model is used.

Lack of labelled examples

The rarity of known sources of FRB pulses presents a significant problem for supervised learning, which requires a labelled training set (Zhang et al 2018). The BL team was able to work around this issue by simulating FRB pulses and superimposing the simulations on preprocessed negative examples (Zhang et al 2018). Using this approach the BL team was able to produce a training set consisting of 400,000 images, half of them positive simulated examples and the other half negative examples containing only noise and RFI characteristics (Zhang et al 2018).
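The idea of building a balanced training set by injecting simulated pulses can be sketched as below. This is a toy version under loud assumptions: Gaussian noise stands in for the real preprocessed negative examples, the pulse is a single-sample spike following the standard cold-plasma dispersion sweep, and the parameter ranges and function names are invented rather than taken from Zhang et al (2018).

```python
import random

K_DM_MS = 4.148808  # dispersion constant in ms GHz^2 pc^-1 cm^3


def inject_pulse(background, freqs_ghz, dm, t0, amplitude, dt_ms):
    """Return a copy of background with a dispersed one-sample pulse added.
    Channels where the pulse falls outside the time window are left as-is."""
    ref = max(freqs_ghz)
    out = [row[:] for row in background]
    for row, freq in zip(out, freqs_ghz):
        shift = round(K_DM_MS * dm * (freq ** -2 - ref ** -2) / dt_ms)
        t = t0 + shift
        if 0 <= t < len(row):
            row[t] += amplitude
    return out


def noise_image(rng, n_chan, n_time):
    """Stand-in for a real negative example: unit-variance Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_time)]
            for _ in range(n_chan)]


def make_training_set(n_pairs, freqs_ghz, n_time, dt_ms, seed=0):
    """Return [(image, label), ...]: half noise only (0), half with a pulse (1)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_pairs):
        examples.append((noise_image(rng, len(freqs_ghz), n_time), 0))
        dm = rng.uniform(100.0, 1000.0)   # hypothetical DM range
        t0 = rng.randrange(n_time // 4)   # hypothetical arrival window
        amp = rng.uniform(0.5, 3.0)       # includes pulses below the noise level
        positive = inject_pulse(noise_image(rng, len(freqs_ghz), n_time),
                                freqs_ghz, dm, t0, amp, dt_ms)
        examples.append((positive, 1))
    return examples


freqs = [8.0, 7.0, 6.0, 5.0, 4.0]
training = make_training_set(3, freqs, n_time=200, dt_ms=1.0)
print(len(training), sum(label for _, label in training))
```

Because every positive example is synthetic, the labels are known exactly, which is what makes supervised training possible despite the small number of real detections.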

Noisy data

Radio astronomy data typically exhibit high levels of noise, which makes detection of FRBs particularly difficult, especially when the FRB signals are often well below the noise amplitude (Zhang et al 2018). This is one of the issues the BL team had to address in developing the CNN model. To increase the signal-to-noise ratio, large convolution kernels and strides were used and pooling was avoided except at the last fully connected layer; this reduces the noise by compressing the output volume, resulting in simpler features. Skip connections are used to enable the lower-level features typical of FRB signals to propagate to deeper layers in the network (Zhang et al 2018).
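One way to build intuition for why a large kernel combined with a stride suppresses noise is to apply a fixed averaging kernel to pure noise and compare the spread before and after. The sketch below does exactly that in one dimension. It is only an intuition aid: a trained CNN learns its kernel weights rather than using a box average, and the kernel size of 16 and stride of 8 are arbitrary choices, not the values used by Zhang et al (2018).

```python
import random
from statistics import pstdev


def strided_conv1d(signal, kernel, stride):
    """Valid 1-D convolution (no padding, kernel not flipped) with a stride."""
    k = len(kernel)
    return [
        sum(w * signal[start + i] for i, w in enumerate(kernel))
        for start in range(0, len(signal) - k + 1, stride)
    ]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]

box = [1.0 / 16] * 16  # fixed averaging kernel standing in for learned weights
smoothed = strided_conv1d(noise, box, stride=8)

print(len(noise), "->", len(smoothed))                  # output is compressed
print(round(pstdev(noise), 3), "->", round(pstdev(smoothed), 3))
```

Averaging 16 independent noise samples reduces the noise standard deviation by roughly a factor of four, while the stride shrinks the output length, so a signal that persists across the kernel window stands out more clearly in a smaller feature map.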

Interpretability of the model

The interpretability of the model, and determining how it has arrived at a decision, is often flagged as an issue with using neural networks. In this case, the confidence threshold was set relatively high (98% confidence) to reduce the number of false positives; the remaining detections were then manually screened for RFI and subjected to the traditional de-dispersion algorithms to verify the signal (Zhang et al 2018). Consequently, the CNN model is used as a tool for detecting FRB signals, but these detections are not taken at face value. Because traditional techniques are used to confirm the presence of a signal, the interpretability of the model is not of great concern (Zhang et al 2018).
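The role of the CNN as a first-pass filter can be illustrated with a very small sketch. The candidate records and function name are hypothetical; only the 98% threshold comes from the description above, and the manual RFI screening and de-dispersion checks are left as downstream steps that this code does not attempt to model.

```python
def select_for_verification(candidates, threshold=0.98):
    """Keep candidates whose model confidence meets the threshold,
    ordered from most to least confident, for later manual checks."""
    kept = [c for c in candidates if c["confidence"] >= threshold]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)


# Hypothetical CNN outputs for three candidate detections.
candidates = [
    {"id": "c1", "confidence": 0.999},
    {"id": "c2", "confidence": 0.97},
    {"id": "c3", "confidence": 0.985},
]
shortlist = select_for_verification(candidates)
print([c["id"] for c in shortlist])
```

Everything on the shortlist still has to pass the independent verification steps, so a wrong model score costs some reviewer time rather than producing a false claim.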

Conclusion

The groundbreaking work carried out by Zhang et al (2018) has demonstrated that by utilising machine learning algorithms the rate of detection of FRBs can be boosted to unprecedented levels. The abundance of detections from this work has even led to speculation that FRBs may not be as rare as was previously thought (Zhang et al 2018). It will be interesting to see if follow-up work utilising machine learning to analyse new observations and archival data will confirm this hypothesis.

In optical astronomy, with the LSST expected to come online next year, the volumes of data and the effort required to classify the galaxies are anticipated to be enormous, and the demand for an automatic classification system for galaxy images is clear (Kuminski et al 2014). The work done by Kuminski et al (2014) demonstrates the potential of using machine learning to at least partially automate the classification process. This automated approach can be used in conjunction with a GZ crowdsourcing approach for classifying the more complex galaxies (Kuminski et al 2014).

Data science and machine learning are evolving at a rapid rate, with new domains of application emerging all the time. From detecting FRBs to classifying galaxy morphology, it is clear that data science techniques and machine learning are already a contributing factor in driving innovation in observational astronomy, and no doubt we will continue to see new and exciting developments in this space.

References

Zhang, Y., Gajjar, V., Foster, G., Siemion, A., Cordes, J., Law, C. and Wang, Y. (2018). Fast Radio Burst 121102 Pulse Detection and Periodicity: A Machine Learning Approach. The Astrophysical Journal, 866(2), p.149.
[online] Available at:
https://arxiv.org/abs/1809.03043

Kuminski, E., George, J., Wallin, J. and Shamir, L. (2014). Combining Human and Machine Learning for Morphological Analysis of Galaxy Images. Publications of the Astronomical Society of the Pacific, 126(944), pp.959–967.
[online] Available at:
https://iopscience.iop.org/article/10.1086/678977

Shamir, L., Orlov, N., Eckley, D., Macura, T., Johnston, J. and Goldberg, I. (2008). Wndchrm — an open source utility for biological image analysis. Source Code for Biology and Medicine, 3(1).
[online] Available at:
https://scfbm.biomedcentral.com/articles/10.1186/1751-0473-3-13
[Accessed 12 Apr. 2019]

Orlov, N., Shamir, L., Macura, T., Johnston, J., Eckley, D. and Goldberg, I. (2008). WND-CHARM: Multi-purpose image classification using compound image transforms. Pattern Recognition Letters, 29(11), pp.1684–1693.
[online] Available at:
https://www.sciencedirect.com/science/article/abs/pii/S0167865508001530
[Accessed 12 Apr. 2019]

En.wikipedia.org. (2019). Astronomy. [online] Available at: https://en.wikipedia.org/wiki/Astronomy
[Accessed 27 Mar. 2019].

Bhathal, R. (2006). Astronomy in Aboriginal culture. Astronomy and Geophysics, 47(5), pp.5.27–5.30.
[online] Available at:
https://academic.oup.com/astrogeo/article/47/5/5.27/231805
[Accessed 25 Mar. 2019]

Dataskeptic.libsyn.com. (2019). Data Skeptic : Detecting Fast Radio Bursts with Deep Learning. [online] Available at: http://dataskeptic.libsyn.com/detecting-fast-radio-bursts-with-deep-learning
[Accessed 31 Mar. 2019].

Breakthroughinitiatives.org. (2019). Breakthrough Initiatives. [online] Available at: https://breakthroughinitiatives.org/news/22
[Accessed 6 Apr. 2019]

En.wikipedia.org. (2019). Radio astronomy. [online] Available at: https://en.wikipedia.org/wiki/Radio_astronomy
[Accessed 31 Mar. 2019].

En.wikipedia.org. (2018). Visible-light astronomy. [online] Available at: https://en.wikipedia.org/wiki/Visible-light_astronomy
[Accessed 12 Apr. 2019].

En.wikipedia.org. (2019). James Webb Space Telescope. [online] Available at: https://en.wikipedia.org/wiki/James_Webb_Space_Telescope
[Accessed 12 Apr. 2019].

En.wikipedia.org. (2019). Large Synoptic Survey Telescope. [online] Available at: https://en.wikipedia.org/wiki/Large_Synoptic_Survey_Telescope
[Accessed 10 Apr. 2019].

En.wikipedia.org. (2012). Square Kilometre Array. [online] Available at: https://en.wikipedia.org/wiki/Square_Kilometre_Array
[Accessed 26 Mar. 2019].

En.wikipedia.org. (2019). Fast radio burst. [online] Available at: https://en.wikipedia.org/wiki/Fast_radio_burst
[Accessed 6 Apr. 2019].
