Semih Gülüm
Deeper Deep Learning TR
6 min read · Mar 16, 2021


Softmax: An Activation Function

This article was written within the scope of the studies we carried out at AdresGezgini R&D Center.

We use linear regression when we want a numerical estimate from our data (for example, a computer's price). In some cases, however, we want the output to be a class rather than a number. Deciding whether a picture shows a cat or a dog is an example of this: cat and dog are each a class, and we expect the model to output a probability value for each class. Therefore, the more classes we have, the more outputs we have. Softmax comes into play when the model needs to turn its outputs into such probabilities.

Why Do We Need Softmax?

When a die is rolled, each face has a certain probability of coming up, and this probability is never negative. Even an outcome that is not among the specified classes gets a probability of 0, never a negative value. For example, the probability of rolling a 1 is positive (1/6), while the probability of rolling a 7 is 0.

In a CNN architecture, we have to enforce this constraint ourselves. Otherwise, negative values may appear as the "probabilities" of the determined classes, and negative values go against the basic axioms of probability.

There are two main rules that must hold if we want to interpret our prediction output ŷ as a probability:

1. No probability can be less than 0 (nonnegativity).

2. The sum of the probabilities of all classes must be 1.

Let’s try to derive the formula ourselves:

To ensure the second rule:

For normalization, the value whose probability is to be calculated must be divided by the sum of all the values. Thanks to this operation, the probabilities of all the classes sum to 1.
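In symbols, a first attempt at such a formula would be plain normalization (presumably what the original figure showed here):

    ŷ_k = x_k / Σ_j x_j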

For the x_k and x_j in the above formula to produce values that obey the probability axioms, both rules must be satisfied. We have two options for ensuring the first rule:

  • Using absolute values
  • Using exponentials

The best way to see why absolute values will mislead us is to work through an example.

Let's give an example where the model's outputs are [-2, 0, 2, 5, 7] and we feed them to Softmax:
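The original appears to illustrate this step with a figure; as a stand-in, here is a small NumPy sketch of what absolute-value normalization gives for these inputs (the variable names are mine):

    import numpy as np

    x = np.array([-2, 0, 2, 5, 7])

    # "Probabilities" via absolute values: |x_k| / sum_j |x_j|
    abs_probs = np.abs(x) / np.sum(np.abs(x))
    print(abs_probs)  # [0.125  0.     0.125  0.3125 0.4375]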

Looking at the values -2 and 2, we see that both end up with the same probability. However, we expect the probability obtained from -2 to be much smaller than that from 2, since the inputs we feed to softmax are essentially unnormalized versions of our prediction scores.

Let’s consider another option, the exponential.
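This leads to the softmax formula itself (again, presumably what the original figure showed), where the exponential of each value replaces the raw value in the normalization above:

    softmax(x)_k = e^(x_k) / Σ_j e^(x_j)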

This way we can distinguish between negative and positive values, which we could not do with absolute values. In addition, the results are guaranteed to stay within [0, 1]. Let's observe this with our largest value, 7:
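Working this out numerically for our inputs (my own arithmetic, rounded):

    e^7 / (e^(-2) + e^0 + e^2 + e^5 + e^7) ≈ 1096.63 / 1253.57 ≈ 0.875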

Another advantage of using the exponential is that the resulting probabilities do not change if a constant is added to all the inputs. To observe this, let's assume the constant 100 is added to all our inputs:
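The reason is that the common factor cancels. Sketching the algebra for a constant c (c = 100 in our case):

    e^(x_k + c) / Σ_j e^(x_j + c) = (e^c · e^(x_k)) / (e^c · Σ_j e^(x_j)) = e^(x_k) / Σ_j e^(x_j)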

However, if we had done the same with the plain normalization formula:
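Plugging our example into plain normalization (my own arithmetic, rounded to two decimals):

    x / Σ_j x_j for [-2, 0, 2, 5, 7]                    → [-0.17, 0.00, 0.17, 0.42, 0.58]
    (x+100) / Σ_j (x_j+100) for [98, 100, 102, 105, 107] → [ 0.19,  0.20, 0.20, 0.21, 0.21]

Note that the first result even contains a negative value, which already violates rule 1.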

As we can see, even though all of our values change in exactly the same way, the resulting probability distribution is strongly affected.

Kindly Reminder: Another important point to know when using Softmax is that the class labels are given to the model as vectors. These vectors are as long as the number of classes, and only one element is equal to 1. This method is called one-hot encoding. To give a simple example, let's take shoe color as the class. Assume the shoes we have come in red, black, blue, and white. Then our vectors will have length 4. In each vector, the class it represents is set to 1 and the rest to 0. In this case, the vectors representing red, black, blue, and white are, respectively, (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1).
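A minimal sketch of this encoding in Python (the function name is mine, not from the article):

    colors = ["red", "black", "blue", "white"]

    def one_hot(label, classes=colors):
        # Vector as long as the number of classes, with a single 1
        vector = [0] * len(classes)
        vector[classes.index(label)] = 1
        return vector

    print(one_hot("blue"))  # [0, 0, 1, 0]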

Important Note: The Softmax operation squashes the scores into probabilities between 0 and 1, but it does not change their ordering. In other words, the class with the largest score going in will also have the largest probability coming out of softmax. Let's revisit our example:

  • argmax(pre-softmax(ŷ)) = argmax(softmax(ŷ))
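A quick way to verify this on our example (a sketch, not from the original article):

    import numpy as np

    x = np.array([-2, 0, 2, 5, 7])
    probs = np.exp(x) / np.sum(np.exp(x))
    print(np.argmax(x), np.argmax(probs))  # 4 4 -> the same class wins before and after softmax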

A Final Look at Softmax

Explaining why the formulation is the way it is

NOTE: If x_k in the numerator is too large, the exponential may overflow, and if x_j is too small, it may underflow. Overflow means a number grows beyond the largest value that can be represented, while underflow means it shrinks below the smallest representable value. For example, all numbers held in two's complement form have a limit. When representing a 4-bit integer in two's complement form, the largest integer that can be represented is 7 (0111) and the smallest is -8 (1000). It is not possible to represent 9 this way; that is, the sum of 5 (0101) and 4 (0100), both of which exist in the four-bit representation, cannot be represented with 4 bits.
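A standard remedy, which the note does not spell out but which follows directly from the shift-invariance property above, is to subtract the maximum input before exponentiating; the largest exponent then becomes 0, so overflow is avoided:

    softmax(x)_k = e^(x_k - max(x)) / Σ_j e^(x_j - max(x))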

Now it’s code time!!

[Code figure: Softmax activation function]
[Figure: Output of the code above]
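The article's code is embedded as an image and is not reproduced here; a minimal NumPy sketch of such a softmax function (my reconstruction, not necessarily the author's exact code) could look like this:

    import numpy as np

    def softmax(x):
        # Subtract the max for numerical stability (see the note above);
        # by shift invariance this does not change the result.
        exps = np.exp(x - np.max(x))
        return exps / np.sum(exps)

    x = np.array([-2, 0, 2, 5, 7])
    print(softmax(x))        # [0.0001 0.0008 0.0059 0.1184 0.8748] (rounded)
    print(softmax(x).sum())  # 1.0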

The purpose of this explanation is to show that everyone can develop their own activation functions according to the needs of the model. Activation functions whose underlying math is simple can be written from scratch, specific to the problem. For example, if at some point we want no probability to fall below 0.3, nothing stops us from writing our own activation function, call it "Deep2er_learning", that simply applies np.maximum(0.3, y).
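As a hypothetical illustration (the name and the 0.3 threshold come from the article's example; the implementation is my own sketch):

    import numpy as np

    def deep2er_learning(y, floor=0.3):
        # Clamp every output so that it is never below the chosen floor
        return np.maximum(floor, y)

    print(deep2er_learning(np.array([0.1, 0.5, 0.9])))  # [0.3 0.5 0.9]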

See you in the next article!

