Semih Gülüm
Deeper Deep Learning TR
6 min read · Mar 16, 2021


Softmax: An Activation Function

This article was written within the scope of the studies we carried out at AdresGezgini R&D Center.

We use linear regression when we want a numerical estimate from our data (for example, a computer's price). In some cases, however, we want the output to be a class rather than a number. Deciding whether a picture shows a cat or a dog is an example of this: cat and dog are each a class, and we expect the model to output a probability value for each class. Therefore, the more classes we have, the more outputs we have. Softmax comes into play when the model needs to turn its outputs into such probabilities.

Why Do We Need Softmax?

When a die is rolled, each face has a certain probability of coming up, and this probability is never negative. Even an outcome that is not among the specified classes gets a probability of 0, never a negative value. For example, the probability of rolling a 1 is positive (1/6), while the probability of rolling a 7 is 0.

In a CNN architecture, we have to enforce this constraint ourselves. Otherwise, negative values may appear as the "probabilities" of the determined classes, and negative values go against the basic axioms of probability.

There are two main rules that must hold if we want to interpret our prediction output ŷ as a probability:

1. No probability can be less than 0 (nonnegativity).

2. The sum of the probabilities of all classes must be 1.

Let’s try to derive the formula ourselves:

To ensure the second rule:

For normalization, the value whose probability is to be calculated must be divided by the sum of all the values. Thanks to this operation, the probabilities of all the classes sum to 1.
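In symbols, a first attempt at such a formula would be plain normalization (presumably what the original figure showed here):

    ŷ_k = x_k / Σ_j x_j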

For the x_k and x_j in the above formula to produce values that obey the probability axioms, both rules must be satisfied. We have two options for ensuring the first rule:

  • Using absolute values
  • Using exponentials

The best way to see why absolute values will mislead us is to work through an example.

Let's give an example where the model's outputs are [-2, 0, 2, 5, 7] and we feed them to Softmax:
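The original appears to illustrate this step with a figure; as a stand-in, here is a small NumPy sketch of what absolute-value normalization gives for these inputs (the variable names are mine):

    import numpy as np

    x = np.array([-2, 0, 2, 5, 7])

    # "Probabilities" via absolute values: |x_k| / sum_j |x_j|
    abs_probs = np.abs(x) / np.sum(np.abs(x))
    print(abs_probs)  # [0.125  0.     0.125  0.3125 0.4375]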

Looking at the values -2 and 2, we see that both end up with the same probability. However, we expect the probability obtained from -2 to be much smaller than that from 2, since the inputs we feed to softmax are essentially unnormalized versions of our prediction scores.

Let’s consider another option, the exponential.
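This leads to the softmax formula itself (again, presumably what the original figure showed), where the exponential of each value replaces the raw value in the normalization above:

    softmax(x)_k = e^(x_k) / Σ_j e^(x_j)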

This way we can distinguish between negative and positive values, which we could not do with absolute values. In addition, the results are guaranteed to stay within [0, 1]. Let's observe this with our largest value, 7:
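Working this out numerically for our inputs (my own arithmetic, rounded):

    e^7 / (e^(-2) + e^0 + e^2 + e^5 + e^7) ≈ 1096.63 / 1253.57 ≈ 0.875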

Another advantage of using the exponential is that the resulting probabilities do not change if a constant is added to all the inputs. To observe this, let's assume the constant 100 is added to all our inputs:
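The reason is that the common factor cancels. Sketching the algebra for a constant c (c = 100 in our case):

    e^(x_k + c) / Σ_j e^(x_j + c) = (e^c · e^(x_k)) / (e^c · Σ_j e^(x_j)) = e^(x_k) / Σ_j e^(x_j)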

However, if we had done the same with the plain normalization formula:
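Plugging our example into plain normalization (my own arithmetic, rounded to two decimals):

    x / Σ_j x_j for [-2, 0, 2, 5, 7]                    → [-0.17, 0.00, 0.17, 0.42, 0.58]
    (x+100) / Σ_j (x_j+100) for [98, 100, 102, 105, 107] → [ 0.19,  0.20, 0.20, 0.21, 0.21]

Note that the first result even contains a negative value, which already violates rule 1.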

As we can see, even though all of our values change in exactly the same way, the resulting probability distribution is strongly affected.

Kindly Reminder: Another important point to know when using Softmax is that the class labels are given to the model as vectors. These vectors are as long as the number of classes, and only one element is equal to 1. This method is called one-hot encoding. To give a simple example, let's take shoe color as the class. Assume the shoes we have come in red, black, blue, and white. Then our vectors will have length 4. In each vector, the class it represents is set to 1 and the rest to 0. In this case, the vectors representing red, black, blue, and white are, respectively, (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1).
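A minimal sketch of this encoding in Python (the function name is mine, not from the article):

    colors = ["red", "black", "blue", "white"]

    def one_hot(label, classes=colors):
        # Vector as long as the number of classes, with a single 1
        vector = [0] * len(classes)
        vector[classes.index(label)] = 1
        return vector

    print(one_hot("blue"))  # [0, 0, 1, 0]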

Important Note: The Softmax operation squashes the scores into probabilities between 0 and 1, but it does not change their ordering. In other words, the class with the largest score going in will also have the largest probability coming out of softmax. Let's revisit our example:

  • argmax(pre-softmax(ŷ)) = argmax(softmax(ŷ))
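A quick way to verify this on our example (a sketch, not from the original article):

    import numpy as np

    x = np.array([-2, 0, 2, 5, 7])
    probs = np.exp(x) / np.sum(np.exp(x))
    print(np.argmax(x), np.argmax(probs))  # 4 4 -> the same class wins before and after softmax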

A Final Look at Softmax

Explaining why the formulation is the way it is

NOTE: If x_k in the numerator is too large, the exponential may overflow, and if x_j is too small, it may underflow. Overflow means a number grows beyond the largest value that can be represented, while underflow means it shrinks below the smallest representable value. For example, all numbers held in two's complement form have a limit. When representing a 4-bit integer in two's complement form, the largest integer that can be represented is 7 (0111) and the smallest is -8 (1000). It is not possible to represent 9 this way; that is, the sum of 5 (0101) and 4 (0100), both of which exist in the four-bit representation, cannot be represented with 4 bits.
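A standard remedy, which the note does not spell out but which follows directly from the shift-invariance property above, is to subtract the maximum input before exponentiating; the largest exponent then becomes 0, so overflow is avoided:

    softmax(x)_k = e^(x_k - max(x)) / Σ_j e^(x_j - max(x))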

Now it’s code time!!

[Code figure: Softmax activation function]
[Figure: Output of the code above]
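The article's code is embedded as an image and is not reproduced here; a minimal NumPy sketch of such a softmax function (my reconstruction, not necessarily the author's exact code) could look like this:

    import numpy as np

    def softmax(x):
        # Subtract the max for numerical stability (see the note above);
        # by shift invariance this does not change the result.
        exps = np.exp(x - np.max(x))
        return exps / np.sum(exps)

    x = np.array([-2, 0, 2, 5, 7])
    print(softmax(x))        # [0.0001 0.0008 0.0059 0.1184 0.8748] (rounded)
    print(softmax(x).sum())  # 1.0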

The purpose of this explanation is to show that everyone can develop their own activation functions according to the needs of the model. Activation functions whose underlying math is simple can be written from scratch, specific to the problem. For example, if at some point we want no probability to fall below 0.3, nothing stops us from writing our own activation function, call it "Deep2er_learning", that simply applies np.maximum(0.3, y).
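As a hypothetical illustration (the name and the 0.3 threshold come from the article's example; the implementation is my own sketch):

    import numpy as np

    def deep2er_learning(y, floor=0.3):
        # Clamp every output so that it is never below the chosen floor
        return np.maximum(floor, y)

    print(deep2er_learning(np.array([0.1, 0.5, 0.9])))  # [0.3 0.5 0.9]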

See you in the next article!

