Deep Learning

Multi-Class Classification? Yes.

Let’s discuss what it is!

Danyal Jamil
Published in The Startup
4 min read · Dec 31, 2020



This article is a continuation of a series that focuses on a basic understanding of the building blocks of Deep Learning.

Machine Learning is not just classifying whether an image contains a dog or not. Neither is it just for predicting house prices in Boston. When one looks into how many applications this vast field really has, one is usually stunned!

Back to the point: a model is not limited to predicting a single yes-or-no probability for a given input. For example, what if you want to know whether an image shows a dog, a cat, or something else entirely? You see? This is where we have multi-class classification.

Multi-class Classification

Multi-class classification covers tasks where each example is assigned exactly one of more than two classes. Binary Classification: Classification tasks with two classes. Multi-class Classification: Classification tasks with more than two classes.[1]

Softmax Regression

We have seen many examples of how to classify between two classes, i.e. Binary Classification. Now, we will discuss what to do when we want to classify among more than two classes.

Suppose we have four classes we want to classify among. Then, our model’s last layer must have four nodes, each responsible for outputting the probability that the input belongs to its class.

Model architecture for 4 predictions

Here is a labelled version.

same but labelled

So, the output layer will have dimension (4, 1), because it outputs one probability for each of the four classes. Also, these probabilities must sum to 1.

Softmax Layer

To get the probabilities among the different classes, we change the activation function of the output layer.

The Softmax function is different from the activation functions we’ve been studying: for each class, it outputs, roughly speaking, the probability of that class relative to all the classes.



t = e^(Z[l])    # where Z[l] is the linear output (pre-activation) of the last layer
a[l] = e^(Z[l]) / Sum(t[i])
a[l][i] = t[i] / Sum(t[i])
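As a rough illustration, the formula above can be sketched in plain Python. The raw scores in `z` are made-up values, not taken from this article, and subtracting the largest score before exponentiating is an extra numerical-stability step that the formula itself does not include (it leaves the result unchanged).

```python
import math

def softmax(z):
    """Turn a list of raw scores z into probabilities that sum to 1."""
    # Assumption (not in the article): shift by max(z) so exp() cannot overflow.
    # The shift cancels out in the ratio, so the probabilities are the same.
    shift = max(z)
    t = [math.exp(v - shift) for v in z]   # t = e^Z
    total = sum(t)                         # Sum(t[i])
    return [t_i / total for t_i in t]      # a[i] = t[i] / Sum(t[i])

# Hypothetical raw scores for four classes.
z = [5.0, 2.0, -1.0, 3.0]
a = softmax(z)
print(a)       # four probabilities, the largest one for the largest score
print(sum(a))  # 1.0, up to floating-point rounding
```

The class with the largest raw score always ends up with the largest probability, so picking the index of the maximum gives the predicted class.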

The main difference between the Softmax and other Activation functions is that the others take in a single number and output a single number, whereas Softmax takes in a list of numbers and returns a list of the same length.

Softmax regression generalizes logistic regression to N classes.

And if N == 2, Softmax regression essentially reduces to logistic regression.
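One way to convince yourself of this is a quick numerical check. With two classes, the softmax probability of the first class should equal the sigmoid (the logistic function) applied to the difference of the two raw scores. The sketch below assumes two made-up scores `z1` and `z2`; the helper names are only for illustration.

```python
import math

def softmax(z):
    """Softmax over a list of raw scores (shifted by the max for stability)."""
    shift = max(z)
    t = [math.exp(v - shift) for v in z]
    total = sum(t)
    return [t_i / total for t_i in t]

def sigmoid(x):
    """Logistic function used in binary classification."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores for the two classes.
z1, z2 = 1.5, -0.5

p_softmax = softmax([z1, z2])[0]   # probability of class 1 from softmax
p_sigmoid = sigmoid(z1 - z2)       # probability of class 1 from logistic regression

print(p_softmax, p_sigmoid)        # the two values agree up to rounding
```

Only the difference between the two scores matters here, which is why a single output node is enough in the binary case.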

Loss Function

Suppose the true label is y = [0, 1, 0, 0] and the model’s prediction is y` = [0.1, 0.2, 0.4, 0.3].

So, the loss function is:

L(y, y`) = -Sum(y * log(y`))
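To make this concrete, here is a small sketch that plugs the example vectors into the loss. Because y is one-hot, every term with y = 0 vanishes and only the true class survives, so the loss is simply -log(0.2). The function name `cross_entropy` is chosen here for illustration, and the natural logarithm is assumed.

```python
import math

def cross_entropy(y, y_hat):
    """L(y, y_hat) = -Sum(y * log(y_hat)), using the natural logarithm."""
    # Terms where y_i == 0 contribute nothing, so they are skipped.
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat) if y_i != 0)

y = [0, 1, 0, 0]              # true label (one-hot)
y_hat = [0.1, 0.2, 0.4, 0.3]  # predicted probabilities

loss = cross_entropy(y, y_hat)
print(round(loss, 4))  # 1.6094, i.e. -log(0.2)
```

The loss is fairly large because the model gave the true class a probability of only 0.2. A prediction closer to 1 for that class would push the loss toward 0.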

Cost Function

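The cost function becomes the average of the loss over all m training examples. It is a function of the model’s parameters (the weights and biases):

J(parameters) = (1 / m) * Sum(L(y, y`))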
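As a minimal sketch, the cost can be computed by averaging the per-example loss. The first example below reuses the label and prediction from the Loss Function section; the second one is made up for illustration, as are the function names.

```python
import math

def cross_entropy(y, y_hat):
    """Per-example loss: -Sum(y * log(y_hat)) for a one-hot label y."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat) if y_i != 0)

def cost(Y, Y_hat):
    """Average the per-example loss over all m examples."""
    m = len(Y)
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(Y, Y_hat)) / m

# m = 2 examples: true labels and the corresponding predictions.
Y = [[0, 1, 0, 0],
     [1, 0, 0, 0]]              # second label is hypothetical
Y_hat = [[0.1, 0.2, 0.4, 0.3],
         [0.7, 0.1, 0.1, 0.1]]  # second prediction is hypothetical

J = cost(Y, Y_hat)
print(J)  # average of -log(0.2) and -log(0.7)
```

Training then amounts to adjusting the weights and biases so that this average goes down.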


In this article, we have just stepped into what it looks like to have our models predict more than a single yes or no. We will continue this discussion with more concepts next time! Follow for updates!


If you want to keep up with my latest articles and projects, follow me on Medium.

Happy Learning. :)


