Softmax vs LogSoftmax

Abhirami V S
4 min read · Oct 10, 2021


Softmax is a mathematical function that takes a vector of K real numbers as input and converts it into a probability distribution of K probabilities proportional to the exponentials of the input numbers (it is a generalized form of the logistic function, refer Figure 1). Before applying softmax, the vector components can be negative, or greater than one, and need not sum to 1; after applying softmax, each component lies in the interval [0, 1] and the components add up to 1, so we can interpret these values as probabilities.

Figure 1

Have you ever wondered why the name ‘softmax’?

Well, according to the book ‘Neural Networks for Pattern Recognition’,

“The term softmax is used because this activation function represents a smooth version of the winner-takes-all activation model in which the unit with the largest input has output +1 while all other units have output 0.”

-Page 238.

The softmax function is defined by the formula,

Equation 1: softmax(Z)_i = exp(Z_i) / Σ_j exp(Z_j), where the sum in the denominator runs over all K components of Z

The exponential function is applied to each element Z_i of the input vector Z, and these values are normalized by dividing by the sum of all the exponentials. This normalization ensures that the components of the output vector sum to 1.

The most common use of softmax in Machine Learning / Deep Learning is as an activation function in neural networks. An activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias, and it introduces non-linearity into the neuron’s output.

In a multi-class classification problem, the network is configured to output N values, one for each class in the classification task. The softmax function is used to normalize these outputs, converting them from raw weighted-sum values (logits) into probabilities that sum to 1. Each value in the output of the softmax function is the probability of the corresponding class.
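As a quick illustration (the logits below are made-up numbers, not from any particular network), applying softmax to a vector of raw outputs produces class probabilities that sum to 1:

```python
import torch

# Made-up raw network outputs (logits) for a 3-class problem
logits = torch.tensor([2.0, 1.0, 0.1])

probs = torch.softmax(logits, dim=0)
print(probs)        # ≈ tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())  # tensor(1.)
```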

The softmax function is a good way of normalizing any value in (−∞, +∞) by applying an exponential function. However, it can cause numerical issues, because the exponential produces very large intermediate values even for moderately large inputs. For example,
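a minimal sketch of the problem, using a naive implementation of Equation 1 (the large input values are arbitrary, chosen only for illustration):

```python
import torch

def naive_softmax(z):
    # Direct translation of Equation 1, with no numerical safeguards
    e = torch.exp(z)
    return e / e.sum()

big = torch.tensor([1000.0, 2000.0, 3000.0])
print(torch.exp(big))      # tensor([inf, inf, inf]) -- exp overflows in float32
print(naive_softmax(big))  # tensor([nan, nan, nan]) -- inf / inf gives NaN
```

(PyTorch’s built-in torch.softmax avoids this by subtracting the maximum input before exponentiating, but the naive formula shows the underlying problem.)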

When the input numbers are too big, the exponentials blow up (the computer cannot represent such large numbers), giving NaN as output. Dividing such large numbers, as in Equation 1, can also be numerically unstable.

Log softmax is advantageous over softmax because it offers better numerical stability and better-behaved gradients during optimization.

Log softmax is the logarithm of the softmax function; mathematically,

Equation 2: log softmax(Z)_i = log( exp(Z_i) / Σ_j exp(Z_j) ) = Z_i − log Σ_j exp(Z_j)

At the heart of using log-softmax over softmax is the use of log probabilities rather than probabilities; a log probability is simply the logarithm of a probability. Using log probabilities means representing probabilities on a logarithmic scale instead of on the standard unit interval [0, 1]. Since the probabilities of independent events multiply, and logarithms convert multiplication into addition, the log probabilities of independent events add. Log probabilities are thus practical for computation.
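A tiny sketch of why this matters (the probabilities here are made up): multiplying many small probabilities underflows to zero, while summing their logarithms stays well-behaved:

```python
import torch

# Made-up probabilities of 1000 independent events, each 0.01
p = torch.full((1000,), 0.01)

print(p.prod())       # tensor(0.) -- the product underflows to zero
print(p.log().sum())  # ≈ tensor(-4605.17) -- the sum of log probabilities is fine
```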

We can implement log softmax using PyTorch,
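A minimal sketch, using torch.nn.functional.log_softmax and the equivalent manual computation via torch.logsumexp (the input values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0])

# Functional form
print(F.log_softmax(x, dim=0))

# Equivalent manual computation: log(softmax(x)) = x - log(sum(exp(x)))
print(x - torch.logsumexp(x, dim=0))

# Both print ≈ tensor([-2.4076, -1.4076, -0.4076])
```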

We can also use log softmax directly through nn.LogSoftmax; the implementation is shown below.

We create a tensor filled with random numbers drawn from a normal distribution with mean 0 and variance 1 as input.
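A minimal sketch (the tensor name inp and the 1×5 shape are assumptions chosen for illustration):

```python
import torch
import torch.nn as nn

# Input: random values from a standard normal distribution (mean 0, variance 1)
inp = torch.randn(1, 5)
print(inp)
```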

The softmax and log softmax functions are applied to this input:
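Continuing from the snippet above (dim=1 assumes the 1×5 input shape used there):

```python
softmax = nn.Softmax(dim=1)
log_softmax = nn.LogSoftmax(dim=1)

output = softmax(inp)       # probabilities in [0, 1], summing to 1
log_out = log_softmax(inp)  # logarithm of those probabilities

print(output)
print(log_out)
```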

Both output values are stored in the variables output and log_out.

To confirm that log softmax is indeed the logarithm of the softmax function, we can take the exponential of log_out and check whether it matches output.
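Continuing the sketch, with outp holding the exponential of log_out:

```python
outp = torch.exp(log_out)  # undo the logarithm

print(output)
print(outp)
print(torch.allclose(output, outp))  # True, up to floating-point precision
```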

As you can see, output and outp (the exponential of the log softmax output) are the same.

Happy Learning!!!



Abhirami V S

Senior Machine Learning Engineer at Conga with more than 5 years of experience. Connect with me on LinkedIn: https://www.linkedin.com/in/abhiramivs/