Softmax function and the maths behind it.

Shine Mohammed
7 min read · Jan 6, 2022


Hey, my fellow peers. I have always believed that in any field the fundamentals matter most and give you a solid hold on the domain, and data science is no different. So in this article I am excited to discuss with you the fundamentals of the well-liked, our beloved, non-linear softmax function. Let's build from the bottom up!

As we all know, the softmax function squashes any set of values into the range 0 to 1. But beyond that, the softmax function is used as the activation function in the output layer of a neural network to produce a multinomial probability distribution. In simple words, it gives a probability distribution over the classes the current input could belong to. A more detailed explanation of this rather bold statement comes later in the article. No holding back, let's jump in!

Agenda

By the end of the article, you should understand what the softmax function is, how it is applied in neural networks, and how it is optimised using a loss function.

Contents

  • Sketchy Details
  • Probabilistic Interpretation
  • Recap of the calculus
  • Derivative of softmax
  • Softmax and its optimisation (cross entropy)

Sketchy Details

As we all know, the softmax function squashes any set of values into the range 0 to 1. But softmax is not that soft.

Mathematically speaking, the softmax function takes any N-dimensional vector and outputs an N-dimensional vector with values between 0 and 1: softmax(a): R^N → R^N. Or more precisely:

S_i(a) = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}

where 1 ≤ i ≤ N and a_i is the ith element of the input vector a.

We can observe from the softmax function that all outputs are positive due to the exponent, lie in the range 0 to 1, and add up to 1, thanks to the summation in the denominator. Let's take an example vector, [1.0, 2.0, 5.0], and apply softmax over it: its softmax version is roughly [0.02, 0.05, 0.93]. We can see that the order of the values is preserved even though they are rescaled. We can think of softmax as an emphasiser function: it pushes bigger values close to 1 and smaller values close to 0.
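If you want to verify those numbers yourself, here is a minimal NumPy sketch of the softmax function (the helper name `softmax` and the shift by the maximum for numerical stability are my own additions, not something from the article):

```python
import numpy as np

def softmax(a):
    """Softmax of a 1-D array: exponentiate, then normalise so the outputs sum to 1."""
    # Subtracting the max is a common numerical-stability trick; it does not change the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1.0, 2.0, 5.0])
print(softmax(a))        # ~[0.0171, 0.0466, 0.9362], i.e. roughly [0.02, 0.05, 0.93]
print(softmax(a).sum())  # 1.0
```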

Probabilistic Interpretation

Okay, we understood what the softmax function does. But a fascinating probabilistic story is hidden in the softmax vector. Let's understand this probabilistic interpretation using an example. Say we are building a multi-class classification model where we input a vector of size 5, apply some weight matrix M (10×5), and then apply softmax to the result, getting a probability distribution vector of size 10. The values of this vector give us a sense of which class the particular input belongs to. It is simply telling us the probability that the input belongs to each specific class, given the input.

Diagrammatic representation of a simple logistic regression model. S_i is the ith element of the P vector in the above diagram.
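As a concrete sketch of that diagram (my own illustration: the dimensions 5 and 10 come from the example above, and the weight matrix here is random, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=5)        # input vector of size 5
M = rng.normal(size=(10, 5))  # weight matrix M (10x5)

logits = M @ x                # raw scores, one per class
P = np.exp(logits - logits.max())
P = P / P.sum()               # softmax: probability distribution over 10 classes

print(P.shape)     # (10,)
print(P.sum())     # 1.0
print(P.argmax())  # index of the most probable class
```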

Recap of the calculus

Before diving deep into the derivative of the softmax function, let's flip some pages of our calculus book to understand what it actually means to take the derivative of a vector. Normally the derivative of a scalar function f(x) is written as:

f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

Coming to the point, the derivative of a softmax output is normally taken with respect to a particular input value:

\frac{\partial S_i}{\partial a_j}

Here we are taking the partial derivative of the ith output with respect to the jth input. Since our input and output are both vectors, we should compute the Jacobian to obtain the complete derivative.

\frac{\partial S}{\partial a} =
\begin{bmatrix}
\frac{\partial S_1}{\partial a_1} & \cdots & \frac{\partial S_1}{\partial a_N} \\
\vdots & \ddots & \vdots \\
\frac{\partial S_N}{\partial a_1} & \cdots & \frac{\partial S_N}{\partial a_N}
\end{bmatrix}

The Jacobian matrix, as told by my professor in school :)

Derivative of softmax

Enough of the basics; let's work out the derivative of the softmax function. The derivative of softmax is computed extensively during the chain rule in backpropagation in a neural network. If you want to learn more about backpropagation and the maths behind it, binge-watch this amazing video. The mathematical expression for the softmax function is given by:

S_i(a) = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}

From here, it's all school-grade calculus… let's play around and get some interesting results.

If a function f(x) is of the form:

f(x) = \frac{g(x)}{h(x)}

then by the quotient rule:

f'(x) = \frac{g'(x)\,h(x) - g(x)\,h'(x)}{h(x)^2}

Similarly, we can find the derivative of the softmax function, with g = e^{a_i} and h = \sum_{k=1}^{N} e^{a_k}:

\frac{\partial S_i}{\partial a_j} = \frac{\partial}{\partial a_j}\left(\frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}\right)
= \frac{\frac{\partial e^{a_i}}{\partial a_j}\sum_{k=1}^{N} e^{a_k} \;-\; e^{a_i}\,\frac{\partial}{\partial a_j}\sum_{k=1}^{N} e^{a_k}}{\left(\sum_{k=1}^{N} e^{a_k}\right)^2}

To keep the notation compact, let's write the denominator summation as K = \sum_{k=1}^{N} e^{a_k} (noting that \partial K / \partial a_j = e^{a_j}), so the above expression becomes:

\frac{\partial S_i}{\partial a_j} = \frac{\frac{\partial e^{a_i}}{\partial a_j}\,K - e^{a_i}\,e^{a_j}}{K^2}

Evaluating the above expression for i ≠ j, where \partial e^{a_i} / \partial a_j = 0, we get:

\frac{\partial S_i}{\partial a_j} = \frac{0 \cdot K - e^{a_i} e^{a_j}}{K^2} = -\,\frac{e^{a_i}}{K}\cdot\frac{e^{a_j}}{K} = -S_i S_j

and evaluating it for i = j, where \partial e^{a_i} / \partial a_j = e^{a_i}, we get:

\frac{\partial S_i}{\partial a_i} = \frac{e^{a_i} K - e^{a_i} e^{a_i}}{K^2} = \frac{e^{a_i}}{K}\left(1 - \frac{e^{a_i}}{K}\right) = S_i\,(1 - S_i)

Putting the two cases together, the derivative can be written as:

\frac{\partial S_i}{\partial a_j} = S_i\,(\delta_{ij} - S_j)

or, equivalently,

\frac{\partial S_i}{\partial a_j} = S_i\,\delta_{ij} - S_i S_j

where \delta_{ij} is the Kronecker delta, i.e. the (i, j) entry of the identity matrix: 1 when i = j and 0 otherwise.

It is always recommended to come up with the most condensed form, so that it is easier to compute gradients and more complex derivatives later.
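As a sanity check, here is a small NumPy sketch (my own illustration, not from the original article) that builds the Jacobian from the formula S_i(δ_ij − S_j) and compares it against finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """Analytic Jacobian: dS_i/da_j = S_i * (delta_ij - S_j)."""
    S = softmax(a)
    return np.diag(S) - np.outer(S, S)

a = np.array([1.0, 2.0, 5.0])

# Numerical Jacobian via central finite differences, for comparison.
eps = 1e-6
num = np.zeros((3, 3))
for j in range(3):
    da = np.zeros(3)
    da[j] = eps
    num[:, j] = (softmax(a + da) - softmax(a - da)) / (2 * eps)

print(np.allclose(softmax_jacobian(a), num, atol=1e-6))  # True
```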

Softmax and its optimisation (cross entropy)

As we already discussed, the softmax function is used in the output layer to get a probability distribution over the output classes for a particular input. A simple diagrammatic representation of image classification is shown below.

Okay, all right… the softmax function gives us a probability distribution over the classes for each input (here, a watermelon image), but what helps the model reach the correct predictions (or output probabilities)? This is where cross entropy comes into the picture. Softmax with cross entropy as the loss function is the most popular pairing in the machine learning world. Let's understand how the two of them work together to give us good results.

Cross Entropy

Cross entropy is a measure of the difference between two probability distributions for a given random variable or set of events. To make cross entropy a bit easier, let's first revise what entropy is. Entropy is simply a measure of how uncertain the probability distribution of a random variable X is. For a uniform distribution, entropy is highest (a balanced probability distribution), while for a sharply peaked distribution, entropy is low, showing that the outcomes are more certain (a skewed probability distribution).


For a discrete random variable X, entropy is defined as:

H(X) = -\sum_{x} p(x)\,\log p(x)

Then why the negative sign? Since the p(x) values are probabilities between 0 and 1, their logarithms are negative, so the minus sign makes the entropy non-negative. You got the point!
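A quick numerical illustration (my own sketch, with made-up distributions): the uniform distribution has the highest entropy, while a peaked distribution has much lower entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum p(x) * log p(x), ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

uniform = np.full(4, 0.25)                     # balanced distribution
peaked  = np.array([0.97, 0.01, 0.01, 0.01])   # almost certain outcome

print(entropy(uniform))  # ~1.386 (= log 4), the maximum for 4 outcomes
print(entropy(peaked))   # ~0.168, much more certain
```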

If you want to know more about information entropy, read this article.

Coming back to cross entropy: it measures the difference between two probability distributions. In machine learning use cases, one distribution is the true label vector, which is one-hot encoded, and the other is the vector of predicted probabilities.

The cross entropy between two probability distributions P(X) and Q(X) is calculated as:

H(Q, P) = -\sum_{i} q_i \,\log p_i

where p_i is the predicted probability for the ith class and q_i is the true value for the ith class.

So multi-class classification models with softmax as the final layer use a loss function of the form in the above expression.

If we take the probability from the softmax function to be P(k) and the actual one-hot encoded value at the kth position to be Q(k), then we can write the above term as:

L = -\sum_{k} Q(k)\,\log P(k) = -\log P(y)

where y is the index of the true class (the only position where Q(k) = 1).
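Putting softmax and cross entropy together in a small NumPy sketch (again my own illustration, with made-up logits and a made-up true class):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(q, p):
    """Cross entropy -sum q_k * log(p_k) between true distribution q and predicted p."""
    return -np.sum(q * np.log(p))

logits = np.array([1.0, 2.0, 5.0])  # raw scores from the final layer
P = softmax(logits)                 # predicted probabilities

true_class = 2
Q = np.zeros(3)
Q[true_class] = 1.0                 # one-hot encoded true vector

print(cross_entropy(Q, P))          # equals -log P[true_class]
print(-np.log(P[true_class]))       # same value (~0.066 here)
```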

If you are interested in the calculation of the derivative of cross entropy, refer to this article. I hope this article helped you understand softmax more personally and how it works in the background. Keep learning and sharing knowledge until then…
