Classifying Galaxies Using a Multilayered Perceptron Model

Rohan Sai . N · Published in The Startup · Aug 25, 2020 · 7 min read
Hubble Ultra Deep Field, Source - NASA/ESA

The cosmos is an intriguing space to observe and analyse; it underpins every science we have discovered to this point. Galaxies are mines of immense data clusters lying in this infinite space, waiting to be explored. The picture above, known as the Hubble Ultra Deep Field, is the result of a roughly one-million-second exposure by the Hubble Space Telescope, revealing some of the earliest galaxies. Each star-like object in the photograph is actually an entire galaxy.

The first galaxy beyond our own was observed by the Persian astronomer Abd al-Rahman al-Sufi over 1,000 years ago, and it was initially recorded as an unknown extended structure. That object is now known as Messier 31, the famous Andromeda Galaxy. From that point on, such structures were observed and recorded more frequently, but it took over nine centuries for astronomers to reach an agreement that these were not simply another kind of astronomical object but entire galaxies. As discoveries multiplied, astronomers noticed their divergent morphologies and began grouping the previously reported and newly discovered galaxies by their morphological characteristics, which later formed a significant classification scheme.

Modern Advancements

Astronomy in the contemporary age has evolved massively alongside advances in computing. Sophisticated computational techniques such as machine learning models are far more practical now because of substantially improved computing performance and the enormous volumes of data available. In the past, classification tasks like this one were done by hand by large groups of people, with results evaluated through cross-checking and collective agreement.

Why I chose to work on this Data

I recently started working through theoretical deep learning concepts and wanted my first practical attempt at applying them to be a task driven by passion, yet still prominent and relevant to my core goal of learning neural networks.

Data Collection

The Galaxy Zoo project hosts diverse sky-survey data catalogues online, which astronomers around the world can access, study and analyse. For this classification task, I grabbed a dataset whose features define the classes very well. For the feature descriptions, see [METADATA].

The Classification Schema

Hubble's tuning fork is the most famous classification scheme. Edwin Hubble divided galaxies into three main types, which in simplified terms are the Elliptical, Spiral and Merger classes.

Elliptical galaxies have a smooth, spheroidal appearance with little internal structure; they are dominated by a spheroidal bulge and have no prominent thin disk. Spiral galaxies all show spiral arms. The third class, Merger galaxies, are irregular-looking and often described as having chaotic appearances; they are most likely the remnants of a collision between two galaxies, as described in The Realm of the Nebulae.

Diving Deep into the Neural Networks

First, we read the astronomical data from the flat file.
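The loading code itself isn't reproduced here, so below is a minimal sketch, assuming the catalogue is a CSV flat file (the filename is a placeholder, not the actual catalogue name):

import pandas as pd

# Placeholder filename -- substitute the actual Galaxy Zoo catalogue file.
data = pd.read_csv('galaxy_data.csv')

print(data.shape)   # (rows, columns)
print(data.head())  # quick look at the first few records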

The dimensionality of the entire dataset, in (rows, columns) shape, is (667944, 13).

Data Preprocessing

The first three columns have no effect on the final performance of this classification model because they carry no information that correlates with the classes. OBJID is a unique ID assigned to each object of interest in the dataset, while RA (Right Ascension) and DEC (Declination) are the recorded absolute positions of those objects, which are also unique to every data point. Dropping these three columns is therefore the better choice, and we can do that with:

data = data.drop(['OBJID','RA','DEC'], axis=1)

Next, look out for null values and NaNs.
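A quick way to do that, as a minimal sketch with pandas:

# Count missing values per column; all zeros means no nulls/NaNs.
print(data.isnull().sum())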

Since this is a classification task, we need to check for class imbalance. In any classification task, even a binary one, class imbalance can have a major effect during the training phase and ultimately on the accuracy. We can plot the value counts for the three class columns with the following code snippet.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Melt the three one-hot class columns and count the 0s and 1s for each.
plt.figure(figsize=(10,7))
plt.title('Count plot for Galaxy types')
countplt = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']]
sns.countplot(x='variable', hue='value', data=pd.melt(countplt))
plt.xlabel('Classes')

Inferring from the plot and considering the 1s, there is some class imbalance, but the difference is not pronounced enough to affect the model's performance. (Since the classes are one-hot encoded, 0s will obviously outnumber 1s; we only consider the 1s because a 1 marks the class of a given set of features, while the other classes are 0 for the same features.) Checking for imbalance is part of my typical workflow for classification tasks, and I believe it to be good practice.

Normalisation and train_test_split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X = data.drop(['SPIRAL','ELLIPTICAL','UNCERTAIN'],axis=1).values
y = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

For any machine learning model to learn from the data, it is conventional to split the original data into a training set and a testing set; here the split is 80% training and 20% testing. A reasonably large dataset (at least on the order of a thousand data points) also gives the model enough examples to learn from and reduces the risk of overfitting.

Instantiating Neural Network and setting Hyper-parameters

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Sequential, in Keras, lets us build the multilayered perceptron model from scratch. We add each layer with a unit count passed to Dense, where the unit count is the number of densely connected neurons in that layer.

model = Sequential()

# Input Layer
model.add(Dense(10, activation='relu'))

# Hidden Layer
model.add(Dense(5, activation='relu'))

# Output Layer
model.add(Dense(3, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Explanation,

The arrangement of my neural network is 10-5-3: 10 input neurons because we have 10 feature columns, 3 output neurons because we have 3 output classes, and everything in between is chosen arbitrarily. The 5 nodes in between are collectively called the hidden layer.

Each neuron/perceptron computes a weighted sum of its inputs followed by an activation function. In my case I used the most common activation, ReLU (Rectified Linear Unit), and for the last three output neurons the activation is softmax, which returns a probability distribution over the three classes.

The Adam optimizer is used for efficient gradient descent, i.e. finding an optimal minimum over the weights, and the conventional loss function for a multi-class classification task is categorical cross-entropy. The metric for real-time evaluation is accuracy, as is usual for classification tasks.

Following is the theoretical model description for the above Sequential setup.
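The description table itself isn't reproduced here; Keras can print an equivalent summary with model.summary(). A minimal sketch, assuming the model is first built with the known 10-feature input shape (the layers above don't declare one):

# Build the model with 10 input features so summary() can report parameter counts
# before training.
model.build(input_shape=(None, 10))
model.summary()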

Deploying the Model

import time

start = time.perf_counter()
model.fit(x=X_train, y=y_train, epochs=20)
print('\nTIME ELAPSED {} Seconds'.format(time.perf_counter() - start))

Note, the elapsed time will differ between computer configurations, and measuring it is completely optional; I did it because I was curious how long all the epochs would take.

Plotting Accuracy at each Epoch
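The plotting code isn't shown above; below is a minimal sketch, assuming the training history is read from model.history after the fit call (one could equally capture the return value of model.fit):

import matplotlib.pyplot as plt

# Plot per-epoch training accuracy recorded by model.fit().
history = model.history.history
plt.figure(figsize=(10, 7))
plt.plot(range(1, len(history['accuracy']) + 1), history['accuracy'])
plt.title('Training accuracy per epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()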

From this accuracy plot we can infer that after a certain point, roughly the 6th epoch, the accuracy remained constant for the remaining epochs. (A single epoch means one whole cycle through the entire training set.)

The achieved model accuracy is 0.90, i.e. 90%.
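The post doesn't show how this figure was obtained; one way to check accuracy on the held-out test set is model.evaluate. A minimal sketch:

# Evaluate loss and accuracy on the held-out test set.
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy: {:.2f}'.format(test_accuracy))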

Classification Report
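The report itself isn't reproduced here; below is a minimal sketch of how one might generate it with scikit-learn, converting the softmax outputs and one-hot labels back to class indices (the class ordering is assumed to match the label columns used earlier):

import numpy as np
from sklearn.metrics import classification_report

# Convert softmax probabilities and one-hot labels back to class indices.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

print(classification_report(y_true, y_pred,
                            target_names=['SPIRAL', 'ELLIPTICAL', 'UNCERTAIN']))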

Improvement?

Well-structured feature data is paramount for any machine learning or deep learning model: the more features that define a particular target class, the better the anticipated performance. Feature engineering is one of the most essential stages in any data-analysis task, but it cannot be performed efficiently without domain knowledge of the task and data at hand, because without it there is no leverage for knowing how the features relate to one another. It is always worth performing feature engineering to derive additional features by mathematically relating the existing ones. A basic approach is to look at the correlation matrix of the data and investigate the feature columns based on their correlation trends, as sketched below. This, however, is core astronomical data with features I cannot even begin to interpret; with an astronomy background to study, organise and add more features, this model could certainly perform far better than it did.
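As a minimal sketch of that basic approach, assuming the DataFrame data from earlier, one could render the correlation matrix as a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Inspect pairwise correlations between the remaining columns.
plt.figure(figsize=(12, 10))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature correlation matrix')
plt.show()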

Interstellar

We’ve always defined ourselves by the ability to overcome the impossible.

For the source code, visit my GitHub, where I integrated this task along with the other computational astronomy concepts I have worked on; I call it The Anecdote of Computational-Astronomy ~ GitHub.

PS, this is my first ever blog post; I want to develop the habit of writing down whatever I experience while drifting through time.

References:

Edwin Powell Hubble, The Realm of the Nebulae.
