Is the term “softmax” driving you nuts?

Yusaku Sako
Jun 2, 2018


The softmax activation function is a basic building block that we use often in machine learning. Got a multi-class classification problem? Just whip out softmax to turn a vector of scores into a probability distribution.

But the term softmax never sat well with me and really bugged the heck out of me every time I typed it or read it!
The “soft” part is clear… It is continuous and differentiable so that you can do gradient descent to optimize the loss function. But “max”? C’mon, really?

Max returns the largest element of the input.
max([-2, 1, 5, 0]) => 5. This operation preserves the magnitude of the largest element.

Softmax, on the other hand, transforms an input vector to a probability distribution:
softmax([-2, 1, 5, 0]) => [0.0009, 0.0179, 0.9747, 0.0066]
This is not really telling us what the max of the input is. But it does seem to tell us which element is the largest. Isn’t that argmax?
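
For concreteness, here is a minimal NumPy sketch of that computation (the function name and the max-subtraction trick for numerical stability are incidental choices, not part of the definition):

import numpy as np

def softmax(x):
    # Exponentiate and normalize; subtracting the max first is a standard
    # numerical-stability trick and does not change the result.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([-2.0, 1.0, 5.0, 0.0])
print(softmax(x).round(4))  # [0.0009 0.0179 0.9747 0.0066]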

Argmax returns the index of the largest element of the input:
argmax([-2, 1, 5, 0]) => 2 since the element at index 2 is the largest (assuming 0-based indexing). This would be expressed in one-hot encoding as:
[0, 0, 1, 0]

This looks a lot like the softmax output: softmax is a soft version of one-hot encoded argmax, not max.
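
To see the resemblance concretely, here is a quick NumPy sketch (the variable names are just for illustration):

import numpy as np

x = np.array([-2.0, 1.0, 5.0, 0.0])

idx = np.argmax(x)             # 2, the index of the largest element
one_hot = np.eye(len(x))[idx]  # [0. 0. 1. 0.]

z = np.exp(x - x.max())
soft = z / z.sum()             # [0.0009 0.0179 0.9747 0.0066]

# soft is a smoothed-out version of one_hot; it says nothing about max(x) = 5
print(one_hot)
print(soft.round(4))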

So what would be a “soft” version of max?
softmax(x) * x.T
Following the example above:
softmax([-2, 1, 5, 0]) * [-2, 1, 5, 0].T => [0.0009, 0.0179, 0.9747, 0.0066] * [-2, 1, 5, 0].T => 4.8896
This is close to the actual max of 5.
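
In code, this is just the dot product of the softmax probabilities with the input, i.e., the expected value of the input under the softmax distribution. Another quick NumPy sketch:

import numpy as np

x = np.array([-2.0, 1.0, 5.0, 0.0])
z = np.exp(x - x.max())
soft = z / z.sum()

# Weight each input by its softmax probability and sum:
# softmax(x) dotted with x, a smooth approximation of max(x).
soft_max_value = soft @ x
print(soft_max_value)  # ≈ 4.89, close to the true max of 5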

This misnomer of softmax is also mentioned briefly in the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. The authors said:

It would perhaps be better to call the softmax function “softargmax” but the current name is an entrenched convention.

I have heard folks refer to the naming of softmax in contrast to hardmax, as if hardmax is the term that came before softmax to represent one-hot-encoded argmax. However, I can’t find any basis for this. The first usage of the word softmax seems to have come from “Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters” by John S. Bridle in 1990. To quote:

This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the ‘winner-take-all’ operation of picking the maximum value. For this reason we like to refer to it as soft max.

Note that there was no mention of softmax being the soft version of hardmax.

If softmax doesn’t bother you, then you are good. But if it does, as it did me, think softargmax.
