Softmax Function Beyond the Basics

May 11 · 9 min read

Welcome! If you are learning about the Softmax function for the first time, please read our beginner-friendly article Understand Softmax in Minutes. If you are a machine learning professional or a data scientist, you will likely want to go further: Softmax beyond the basics, Softmax use cases, Softmax “in the wild”. This article covers what experts discuss beyond the basics.

Post under construction — Updated Weekly

How should you use this article? Read our disclaimer here. This article should not be used for production, implementation, or commercial purposes. It is for your personal reading only.

Softmax Formula | Softmax Activation Function

Softmax is used as the activation function for multi-class classification tasks, usually in the last layer. We talked about its role in transforming numbers (aka logits) into probabilities that sum to one. Let’s not forget it is also an activation function, which means it helps our model achieve non-linearity. Linear combinations of linear combinations will always be linear, but adding an activation function gives our model the ability to handle non-linear data.

The outputs of other activation functions, such as sigmoid, do not necessarily sum to one. Having outputs that sum to one makes the softmax function great for probability analysis.
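As a quick illustration of the difference, here is a minimal sketch in plain Python. The logits are made-up values and the helper names sigmoid and softmax are ours, not taken from any library.

```python
import math

def sigmoid(z):
    # Squashes one number into (0, 1), independently of the others.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Exponentiate every logit, then divide by the sum of the exponentials.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up example values

sigmoid_out = [sigmoid(z) for z in logits]
softmax_out = softmax(logits)

print(sum(sigmoid_out))  # well above 1 for these logits
print(sum(softmax_out))  # 1.0 up to floating-point rounding
```

Each sigmoid output is a valid probability on its own, but only the softmax outputs form a single distribution over the classes.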

The important assumption is that the classes are mutually exclusive. That is to say, each sample of data can belong to only one class. For example, an image cannot be both a cat and a dog at the same time; its true label belongs to exactly one class. For images that contain multiple objects and therefore have multiple true labels (multi-label classification), use an independent sigmoid (logistic regression) output per class instead. For a two-class problem (0 or 1, true or false, positive or negative), plain logistic regression is enough.

Why is Softmax function called Softmax?

The name Softmax is a bit confusing and does not describe exactly what the function does. Other names for Softmax include … Here’s the best explanation on the internet on why Softmax is called Softmax: check out this answer on Quora. In short, it comes from a smooth / soft approximation of the max function, which in one dimension looks a lot like a ReLU. Strictly speaking, Softmax itself is a soft version of argmax (hence the alternative name softargmax), while the smooth curve that approximates max is log-sum-exp. The smooth and soft part is the key: that’s what makes the function differentiable.

This Quora post by Mr. Abhishek Patnia (Staff Machine Learning Engineer at Tinder) is the best! He also provided this amazing image showing that the soft version closely approximates max(0, input), especially when the input is far from zero.
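To see the "soft" part numerically, one can compare max(0, x) with its smooth counterpart log(1 + e^x). This is a minimal sketch: the name softplus is the common name for that smooth curve and is our addition, not something taken from the Quora answer, and the sample points are arbitrary.

```python
import math

def relu(x):
    # Hard maximum of 0 and x: has a kink at x = 0.
    return max(0.0, x)

def softplus(x):
    # log(1 + e^x), rearranged so that exp never overflows for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, relu(x), softplus(x))
```

Far from zero the two columns almost coincide; near zero the smooth curve rounds off the corner, which is what keeps it differentiable.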

Textbook Softmax

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. — Ian Goodfellow, Yoshua Bengio and Aaron Courville Deep Learning textbook

Softmax Formula

\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}

In the basic version of this article, Understand Softmax in Minutes, we illustrated how to implement this formula using a Python list comprehension. That basic implementation is a teaching aid, not production-ready code.

Softmax on StackOverflow

Source: wikipedia Softmax definition and formula

Softmax on Wikipedia Before and After Softmax

Softmax is often used in neural networks, to map the non-normalized output of a network to a probability distribution over predicted output classes. — Wikipedia


Softmax, aka softargmax or normalized exponential function (a name that literally describes what it does), is a function that takes a vector as input and normalizes it into a probability distribution with the same dimension as the input vector. Prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1. After applying softmax, each component will be in the range between 0 and 1, and all the components will add up to 1, so they can be interpreted as probabilities. Larger components will correspond to larger probabilities. Often, softmax is used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes. (Source: Wikipedia, paraphrased)

As explained in our beginner-friendly article, Understand Softmax in Minutes, the non-normalized input of a Softmax function is called logits. This input is usually the output of a neural network. The Softmax function takes in logits and outputs probabilities that sum to one. Why? Because e is raised to the power of each logit, and each of these exponentials is then divided by the sum of all the exponentials. Hence the entire collection of outputs always sums to one.
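The recipe above can be sketched in a few lines of plain Python. One addition that is not in the text: the largest logit is subtracted before exponentiating. This is a common trick that leaves the result unchanged while avoiding overflow for very large logits. The input values are made up.

```python
import math

def softmax(logits):
    # Subtracting the max does not change the result, because the same
    # factor cancels in the numerator and the denominator.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-1.0, 0.0, 3.0, 5.0])    # negative and > 1 inputs are fine
big = softmax([1000.0, 1001.0, 1002.0])   # would overflow without the shift

print(probs, sum(probs))
print(big, sum(big))
```

Every output lies between 0 and 1, larger logits receive larger probabilities, and each result sums to one up to floating-point rounding.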

Cross Entropy Loss Best Buddy of Softmax

Read more about cross entropy loss in our tutorial

Graphing the Softmax function

Coming soon

Where does Softmax fit in the deep learning workflow?

Often Softmax is the last layer of a multi-class classification architecture. A great example is a popular model called VGG16 used in computer vision image classification tasks.

Softmax Implementation in Production

There are quite a few flavors of Softmax. Choosing the best option is a matter of computational efficiency and accuracy. Here we dive into the API code and source code of Pytorch and Tensorflow to see how Facebook and Google implement the function for production and research use.

Pytorch Implementation of Softmax

Coming soon

Below is the function signature of Softmax in Pytorch.

It’s helpful to see how Pytorch defines the function signature. Here are some important highlights:
1. The docstring describes Softmax as rescaling the input so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.
2. The LaTeX formula of Softmax is \text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}

class Softmax(Module):
    r"""Applies the Softmax function to an n-dimensional input Tensor
    rescaling them so that the elements of the n-dimensional output Tensor
    lie in the range [0,1] and sum to 1.

    Softmax is defined as:

    .. math::
        \text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}

    Shape:
        - Input: :math:`(*)` where `*` means, any number of additional
          dimensions
        - Output: :math:`(*)`, same shape as the input

    Returns:
        a Tensor of the same dimension and shape as the input with
        values in the range [0, 1]

    Arguments:
        dim (int): A dimension along which Softmax will be computed (so every slice
            along dim will sum to 1).

    .. note::
        This module doesn't work directly with NLLLoss,
        which expects the Log to be computed between the Softmax and itself.
        Use `LogSoftmax` instead (it's faster and has better numerical properties).

    Examples::

        >>> m = nn.Softmax(dim=1)
        >>> input = torch.randn(2, 3)
        >>> output = m(input)
    """
    __constants__ = ['dim']

    def __init__(self, dim=None):
        super(Softmax, self).__init__()
        self.dim = dim

    def __setstate__(self, state):
        self.__dict__.update(state)
        if not hasattr(self, 'dim'):
            self.dim = None

    def forward(self, input):
        return F.softmax(input, self.dim, _stacklevel=5)

    def extra_repr(self):
        return 'dim={dim}'.format(dim=self.dim)

The Softmax transformation can be summarized with this pattern F.softmax(logits, dim=1) .

Tip for using Softmax result in Pytorch:

Choosing the best Softmax result: in multi-class classification, the Softmax activation function is often used. Pytorch has a dedicated function to extract the top results, that is, the most likely classes, from Softmax output. torch.topk(input, k, dim) returns the k largest probabilities together with their indices. See the torch.topk documentation.

torch.topk(input, k, dim=None, largest=True, sorted=True, out=None) -> (Tensor, LongTensor)
Returns the k largest elements of the given input tensor along a given dimension.
If dim is not given, the last dimension of the input is chosen.
If largest is False then the k smallest elements are returned.
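To make the (values, indices) return shape concrete, here is a plain-Python sketch of the same idea for a single row of probabilities. The helper topk below is a hypothetical stand-in written for illustration, not the Pytorch function, and the probabilities are made up.

```python
def topk(values, k):
    # Rank positions by their value, largest first, and keep the first k.
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in ranked], ranked

probs = [0.05, 0.7, 0.1, 0.15]  # made-up softmax output for 4 classes

top_values, top_indices = topk(probs, 2)
print(top_values)   # [0.7, 0.15]
print(top_indices)  # [1, 3]
```

The indices are what you would map back to class labels; the values are the corresponding probabilities.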

Implement Softmax from Scratch in Pytorch

import torch
def softmax(x):
    # Exponentiate every entry, then divide each row by that row's own sum.
    return torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)

The numerator is the exponential of each x. The denominator also needs exponentials, hence torch.exp(x) again. dim=1 tells torch.sum() to sum each row across all of its columns. .view(-1,1) reshapes the result so that broadcasting does the right thing. If we don’t reshape the denominator, the numerator (a matrix with one row per image) is divided by torch.sum(torch.exp(x), dim=1), which is a plain vector, and broadcasting lines that vector up with the columns instead of the rows. What we really want is for every element of a row to be divided by that row’s single sum, so we reshape the vector into a column. If the numerator is batch_size by 10 in the MNIST example, we want the denominator to be batch_size by 1 so that the result is again a batch_size by 10 tensor. Very tricky. Use the API for ease of use and numerical accuracy; the hand-written formula is best kept as a lightweight teaching aid.
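The row-by-row division that the reshape is meant to achieve can be spelled out without tensors. This is a plain-Python sketch under our own naming (softmax_rows is not a Pytorch function), with a tiny made-up batch standing in for the image data.

```python
import math

def softmax_rows(batch):
    out = []
    for row in batch:
        exps = [math.exp(z) for z in row]
        # One sum per row: this plays the role of the (rows, 1) denominator.
        row_sum = sum(exps)
        out.append([e / row_sum for e in exps])
    return out

batch = [[1.0, 2.0, 3.0],   # made-up logits, one row per sample
         [0.0, 0.0, 0.0]]

result = softmax_rows(batch)
for row in result:
    print(row, sum(row))
```

Each row is normalized by its own sum, so each row of the result is a separate probability distribution.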

Tensorflow Implementation of Softmax

Post under construction

According to the official TensorFlow Core 1.13 documentation, tf.nn.softmax performs the equivalent of softmax = tf.exp(logits) / tf.reduce_sum(tf.exp(logits), axis)

Similarly, developers can use the keras API equivalent tf.keras.activations.softmax

tf.sparse.softmax applies softmax to a batched N-D SparseTensor.

There’s also tf.nn.softmax_cross_entropy_with_logits_v2, which computes softmax cross entropy between logits and labels (the documentation flags deprecated arguments). Warning: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results. The corresponding cross entropy API in TensorFlow 1.x is tf.nn.softmax_cross_entropy_with_logits_v2.
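The warning about feeding softmax output into an op that already applies softmax can be checked by hand. The sketch below is plain Python rather than TensorFlow; softmax and cross_entropy are our own helper names, and the logits are made up with class 0 as the true label.

```python
import math

def softmax(logits):
    shift = max(logits)  # shift for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    # Negative log of the probability assigned to the true class.
    return -math.log(probs[true_index])

logits = [4.0, 1.0, 0.5]  # made-up values; true class is index 0

correct = cross_entropy(softmax(logits), 0)
# Mistake: the logits were already turned into probabilities once.
double = cross_entropy(softmax(softmax(logits)), 0)

print(correct, double)
```

Applying softmax twice flattens the distribution, so the confident prediction is reported with a much larger loss than it deserves.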

Implementing Softmax from Scratch

Rachael Tatman of Kaggle does a live softmax implementation from scratch.

Select the Maximum Softmax Output

After the Softmax layer, in Pytorch, we still need to select the index of the maximum value as the top label. One way to do it is to use torch.argmax.

Here’s an example:

x = torch.FloatTensor([[0.2, 0.1, 0.7],[0.6, 0.2, 0.2],[0.1, 0.8, 0.1]])
y = torch.argmax(x, dim=1)
# tensor([2, 0, 1])
y = torch.argmax(x, dim=0)
# tensor([1, 2, 0])

Notice that argmax outputs the position of the max value, not the max value itself, within each row (dim=1) or within each column (dim=0).
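One consequence worth noting: because the exponential is strictly increasing, Softmax preserves the ordering of the logits, so the predicted label is the same whether argmax is taken before or after Softmax. A minimal plain-Python check, with helper names of our own and made-up logits:

```python
import math

def softmax(logits):
    shift = max(logits)  # shift for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Position of the largest value, not the value itself.
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.3, 3.4, 0.0]  # made-up values

label_from_logits = argmax(logits)
label_from_probs = argmax(softmax(logits))
print(label_from_logits, label_from_probs)  # 2 2
```

Softmax is still needed when you want calibrated-looking probabilities or a loss, but not merely to pick the top class.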

Softmax vs Sigmoid

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y(i)∈{0,1}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y(i)∈{1,…,K} where K is the number of classes. — Stanford Deep Learning Tutorial

Softmax regression is also known as Multi-Class Logistic Regression (citation needed). When using sigmoid activation for a classification task, we want to map every logit onto one of 2 classes (0, 1): the output is rounded to either 0 or 1, so the number of classes is k = 2.

(Under construction) To see that the two functions are equivalent for two classes, remember that in binary classification the ground truth takes only two values, 0 or 1, while the predicted probability lies between 0 and 1. Softmax calculates P(class = c | logits): the probability that the class is c given an array of logits as input. The logits come from the near-final layer of a neural network, so in pseudo code each one is matrix_multiplication_of(inputs, weights), where the weights are learned.

With two classes there are two logits, one for class 0 and one for class 1. Softmax depends only on the differences between logits, so we can fix the logit of class 0 at zero (equivalently, set the weights of class 0 to zero) without changing the output. Take P(class = 1 | logits); for class 0 we just calculate one minus P(class = 1 | logits). Softmax divides the e exponent of the current class by the sum of the e exponents of all classes: np.exp(logit of class 1) / (np.exp(logit of class 0) + np.exp(logit of class 1)). The logit of class 0 is zero and the e exponent of zero is one, so the formula simplifies to np.exp(logit of class 1) / (1 + np.exp(logit of class 1)), which is exactly the sigmoid of the class 1 logit.

It is more of a proof than an obvious substitution. It took our staff writer a few tries to understand why Softmax and Sigmoid are equivalent for two-class classification when Udacity’s Luis Serrano mentioned it. If you like reading formulas, this Quora post explains it best.
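The two-class equivalence can also be checked numerically. This is a small plain-Python sketch with helper names of our own; it uses the fact that a two-class softmax depends only on the difference between the two logits, and the logit values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shift = max(logits)  # shift for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

z1, z0 = 1.7, -0.4  # arbitrary logits for class 1 and class 0

p_class1_softmax = softmax([z1, z0])[0]
p_class1_sigmoid = sigmoid(z1 - z0)
print(p_class1_softmax, p_class1_sigmoid)

# Special case from the text: the class 0 logit fixed at zero.
print(softmax([z1, 0.0])[0], sigmoid(z1))
```

Both pairs agree up to floating-point rounding, which is the sense in which sigmoid is Softmax with two classes.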

Softmax in Technical Interviews

Further Reading

Difference between LogSoftmax and Softmax in Pytorch

class Softmax(Module): … This module doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use `LogSoftmax` instead (it’s faster and has better numerical properties). (Source: Pytorch Softmax documentation)
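The "better numerical properties" remark can be illustrated without Pytorch. The sketch below is our own plain-Python approximation of the idea, not the library's implementation: taking the log of a softmax output fails once a probability underflows to zero, while computing the log-softmax directly through a shifted log-sum-exp stays finite. The logits are deliberately extreme, made-up values.

```python
import math

def naive_log_softmax(logits):
    # Softmax first, log afterwards: breaks when a probability is exactly 0.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    # log softmax(z_i) = z_i - logsumexp(z), with a shift by the max logit.
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum for z in logits]

logits = [0.0, -1000.0]  # extreme on purpose

try:
    naive = naive_log_softmax(logits)
except ValueError:
    naive = None  # exp(-1000) underflows to 0.0 and log(0) is undefined

stable = stable_log_softmax(logits)
print(naive)
print(stable)
```

A negative log-likelihood loss then only has to pick out the entry of the stable result that belongs to the true class and flip its sign.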

Other Flavors of Softmax

Full softmax versus candidate sampling softmax.


Written by


Learn coding, data and software package skills with Uniqtech tutorials and articles. Contact us We’d like to hear from you!
