Galaxies and Random Forests

Peter Ma
Dec 15, 2018 · 6 min read

Look at these two pictures. Can you tell the difference? Classification problems like these are super important to astronomers because they reveal the evolutionary path of galaxies. Noticing the differences between celestial objects is difficult; they are all so different yet so similar at the same time.

Traditionally, astronomers classified galaxies by hand with Hubble's Tuning Fork. The Tuning Fork assigns set labels and characteristics to each galaxy, revealing its evolutionary path based on its shape.

Above is Hubble's Tuning Fork.

Hubble's Tuning Fork assisted astronomers with classification for decades. Here is how it works.

The top right prong of the fork holds the spiral galaxies, galaxies with bands of spiral arms. These tend to be younger galaxies since they still have a well-defined shape. The bottom right prong holds the barred spiral galaxies, a variation on spiral galaxies. As you move from right to left, the galaxies get older and become elliptical galaxies: the more elliptical a galaxy is (furthest to the left), the older it is, since its original spiral shape has collapsed over billions of years. So what's the challenge?

Can you tell if this is a spiral or an elliptical galaxy? You can't really see the structure from this odd angle. Hubble's classification fails!

The process of classifying galaxies gets harder when viewing angles change and images become distorted. This is where Hubble's classification begins to fail.

Furthermore, astronomers collect tons and tons of data from sky surveys. It's a painstaking task to classify all that data by hand. Imagine filtering through petabytes' worth of data. So much time could be saved.

One way to simplify this problem is to use machine learning to classify the features instead of doing it by hand. A specific example is using a decision tree/random forest.

Colour filter readings from Sloan Digital Sky Survey (SDSS)

Astronomers collect features of galaxies to train a computer algorithm. Commonly used features are colour indices, adaptive moments (which capture orientation), eccentricities (a numerical description of shape), and concentrations (how much of the galaxy's luminosity sits at its centre). These features tend to be easy to define and unique to each galaxy.

Astronomers choose decision trees because they are easy to implement and work well at scale.

Here’s a quick intro to decision trees and random forests.

The Intuition Of Decision Trees and Random Forests.

Firstly, a decision tree is exactly like a tree… well, an upside-down tree. The idea is to model the flow from one decision to another decision, and so forth. Each branch represents a decision, and at each node the decision splits into two possibilities, denoted as negative and positive outcomes.

A general layout of a decision tree.

The first decision is called the root node, and each decision connected after that is called a branch. At the end of the tree, our final classification is called a leaf node.

However, most single-tree models are fairly weak by themselves. What can we do? We create a forest of trees. In a forest, each tree processes the information slightly differently and outputs its own classification. Each tree's final output acts as a vote, and the model aggregates the votes of the entire forest to crunch out the final output. We use a forest because classifications tend to be made with higher confidence the more trees you have; it also decreases the variance of the model.
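The voting step above can be sketched in a few lines. This is a toy illustration of majority voting, not code from the actual model; the function name is made up for the example.

```python
from collections import Counter

def forest_vote(tree_predictions):
    # Each tree casts one vote; the forest's answer is the majority class.
    return Counter(tree_predictions).most_common(1)[0][0]

forest_vote(['spiral', 'elliptical', 'spiral'])  # majority wins: 'spiral'
```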

Here is a walkthrough of my own Random Forest implementation.

My Model

My model uses a Sloan Digital Sky Survey (SDSS) sample of 780 examples to train a small model to classify galaxies. Here is how I did it.

First, I imported the Python libraries numpy (data manipulation), math (basic math operations), sklearn.tree (the machine learning model), and matplotlib (visualization). These libraries reduce the amount of coding significantly and generally run faster since they are optimized. Next, we shuffle the data and partition it into separate training and test sets. By splitting up the data, we can use the same source of images to see how well our model does. This is done with a simple method that returns the spliced training and test sets.
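The shuffle-and-split step could look something like the sketch below. The function name and the 70/30 split fraction are my own illustrative choices, not taken from the original code.

```python
import numpy as np

def splitdata_train_test(data, fraction_training):
    # Shuffle the rows, then slice off the first fraction as the training set.
    np.random.seed(0)        # fixed seed so the split is reproducible
    np.random.shuffle(data)  # shuffles rows in place
    split_index = int(fraction_training * len(data))
    return data[:split_index], data[split_index:]
```

Called as `train, test = splitdata_train_test(data, 0.7)`, this gives a 70% training set and a 30% test set drawn from the same shuffled pool.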

Then we set the features of our data. These include the colour indices, adaptive moments (orientation), eccentricities (numerical description of shape), and concentrations (luminosity of the centre of the galaxy). This method labels the original data and returns the correctly labelled training set.
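Building that feature matrix from an SDSS-style table might look like this sketch. The structured-array field names (`u`, `g`, `ecc`, `m4_r`, `petroR50_r`, `petroR90_r`, `class`) are assumptions about the dataset's columns, not confirmed by the original code.

```python
import numpy as np

def generate_features_targets(data):
    # Pull the class labels and assemble one row of features per galaxy.
    targets = data['class']
    features = np.empty((len(data), 7))
    features[:, 0] = data['u'] - data['g']   # colour indices
    features[:, 1] = data['g'] - data['r']
    features[:, 2] = data['r'] - data['i']
    features[:, 3] = data['i'] - data['z']
    features[:, 4] = data['ecc']             # eccentricity (shape)
    features[:, 5] = data['m4_r']            # adaptive (fourth) moment
    # Concentration: ratio of the radii containing 50% and 90% of the light.
    features[:, 6] = data['petroR50_r'] / data['petroR90_r']
    return features, targets
```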

**Note: u-g, g-r, r-i, and i-z are colour indices. Ecc denotes eccentricity values, the m4 values represent adaptive moments, and the rest is concentration data.**

After that, we code a very simple decision tree classification model. First, you label the data, then you split and shuffle it for training. Then you call the DecisionTreeClassifier model from sklearn, which accepts your features and labels as arguments. The model then fits itself to the training data, crunches out its predictions, and returns them at the end. Finally, the accuracy of the model is calculated.
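A minimal sketch of that pipeline, assuming numpy arrays of features and labels; the function name and 70/30 split are illustrative, not from the original code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dtc_predict_actual(features, targets, fraction_training=0.7):
    # Shuffle indices, split into train/test, fit a tree, predict on held-out data.
    indices = np.random.permutation(len(features))
    split = int(fraction_training * len(features))
    train_idx, test_idx = indices[:split], indices[split:]

    model = DecisionTreeClassifier()
    model.fit(features[train_idx], targets[train_idx])

    predictions = model.predict(features[test_idx])
    accuracy = np.mean(predictions == targets[test_idx])
    return predictions, targets[test_idx], accuracy
```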

Now, what's even better is to implement a random forest, built from our previous trees!

The method takes the dataset and the dimensions of the forest (the number of trees). The data is loaded into the method and the random forest classifier is called from the library. The forest accepts the features and the predictions are calculated.
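Swapping the single tree for a forest is a small change. This sketch assumes the same array layout as before; the function name and the 70/30 split are again my own illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_predict_actual(features, targets, n_estimators):
    # n_estimators is the "dimension" of the forest: how many trees get a vote.
    indices = np.random.permutation(len(features))
    split = int(0.7 * len(features))
    train_idx, test_idx = indices[:split], indices[split:]

    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(features[train_idx], targets[train_idx])
    return model.predict(features[test_idx]), targets[test_idx]
```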

Here’s How Things Went.

It did pretty okay even with such a tiny data sample! Keep in mind my dataset was tiny compared to a full-scale project, which often requires tens of thousands of samples.

My model made a false prediction on trial number 94.

This is the output of my final program after training (6 example outputs). The left-hand side is the prediction made by the model and the right-hand side is the actual label. The program scored an accuracy of 81% on the test set, which could be improved with a larger dataset.

The problem with this model is that it's way too simple. Its simplicity makes it sensitive to noise and limits the model's ability to generalize beyond the training set.

CNNs are better suited to image classification, so stay tuned for my next article on computer vision (here).

This is a quick and simple implementation of my code, meant to let you build your own decision tree model. Visit my GitHub project here!

Key Takeaways

  • Galaxy classification is difficult for astronomers.
  • Classifying this data is super important for revealing how galaxies form.
  • Astronomers can use random forest classification to classify galaxies automatically.
  • These models are super easy to implement but aren't always the best at classification.

Before You Go

Connect with me on LinkedIn
Feel free to reach out by e-mail with any questions:
And clap for the article if you enjoyed it 😊

Data Driven Investor

from confusion to clarity not insanity
