Welcome! If you are learning about the Softmax function for the first time, please read our beginner friendly article Understand Softmax in Minutes. If you are a machine learning professional or a data scientist, you will likely want to go further: Softmax beyond the basics, Softmax use cases, Softmax "in the wild". This article covers what experts care about, beyond the basics.
Post under construction — Updated Weekly
How should you use this article? Read our disclaimer here. This article should not be used for production, implementation, or commercial purposes; it is for your personal reading only.
Softmax Formula | Softmax Activation Function
Softmax is used as the activation function for multi-class classification tasks, usually in the last layer. We talked about its role in transforming numbers (aka logits) into probabilities that sum to one. Let's not forget it is also an activation function, which means it helps our model achieve non-linearity. Linear combinations of linear combinations will always be linear, but adding an activation function gives our model the ability to handle non-linear data.
The outputs of other activation functions, such as sigmoid, do not necessarily sum to one. Having outputs that sum to one is what makes the Softmax function great for probability analysis.
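As a quick sanity check, here is a minimal numpy sketch (the logit values are made up for illustration) showing that element-wise sigmoid outputs do not sum to one, while Softmax outputs do:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])              # made-up logits for three classes
sigmoid = 1.0 / (1.0 + np.exp(-logits))          # element-wise sigmoid
softmax = np.exp(logits) / np.exp(logits).sum()  # Softmax normalizes by the sum of exponentials

print(sigmoid.sum())   # roughly 2.14, not a probability distribution
print(softmax.sum())   # 1.0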
The important assumption is that the classes are mutually exclusive. That is to say, each sample of data can belong to exactly one class. For example, an image cannot be both a cat and a dog at the same time; its true label can only belong to one class. For images that contain multiple objects and therefore multiple true labels (multi-label classification), use a per-class sigmoid output instead of Softmax. For the two-class case, 0 or 1, true or false, positive or negative, plain logistic regression is enough.
Why is Softmax function called Softmax?
The name Softmax is a bit confusing and does not describe exactly what the function does. Other names for Softmax include … Here's the best explanation on the internet of why Softmax is called Softmax; check out this answer on Quora. In short, it is a smooth / soft approximation of the max function, which kind of looks like a ReLU as well. The smooth and soft part is the key: that's what makes this function differentiable.
This Quora post by Mr. Abhishek Patnia (Staff Machine Learning Engineer at Tinder) is the best! He also provided this amazing image showing that Softmax nicely approximates max(0, input), especially when input is not near zero.

Textbook Softmax
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. — Ian Goodfellow, Yoshua Bengio and Aaron Courville Deep Learning textbook
Softmax Formula

In the basic version of this article, Understand Softmax in Minutes, we illustrated how to implement this formula using a Python list comprehension. That basic implementation is a teaching aid, not production-ready code.
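As a refresher, a minimal list-comprehension sketch along those lines (not the exact code from the basic article, and numerically naive) might look like this:

import math

def softmax(logits):
    # exponentiate each logit, then normalize by the sum of all the exponentials
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ~[0.659, 0.242, 0.099], sums to 1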
Softmax on StackOverflow

Softmax on Wikipedia: Before and After Softmax
Softmax is often used in neural networks, to map the non-normalized output of a network to a probability distribution over predicted output classes. — Wikipedia

Softmax, aka softargmax or the normalized exponential function (which literally describes what it does), is a function that takes a vector as input and normalizes it into a probability distribution of the same dimension as the input vector. Prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1. After applying softmax, each component will be in the range between 0 and 1 and all the components will add up to 1, so they can be interpreted as probabilities. Larger components will correspond to larger probabilities. Often, softmax is used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes. (Source: Wikipedia. Paraphrased)
As explained in our beginner friendly article, Understand Softmax in Minutes, the non-normalized input to a Softmax function is called the logits. This input is usually the output of a neural network. The Softmax function takes in logits and outputs probabilities that sum to one. Why? Because e is raised to the power of each logit, and each of these exponentials is then divided by the sum of all the exponentials. Hence the entire collection of outputs always sums to one.
Cross Entropy Loss Best Buddy of Softmax
Read more about cross entropy loss in our tutorial
Graphing the Softmax function
Coming soon
Where does Softmax fit in the deep learning workflow?
Often Softmax is the last layer of a multi-class classification architecture. A great example is the popular model VGG16, used in computer vision image classification tasks.
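As a sketch of what that looks like in practice (using torchvision's VGG16 with random weights, purely for illustration), the network outputs raw logits and Softmax is applied as the final step:

import torch
import torch.nn.functional as F
from torchvision import models

model = models.vgg16()                      # VGG16's classifier head outputs 1000 raw logits
model.eval()

image_batch = torch.randn(1, 3, 224, 224)   # a dummy image batch for illustration
with torch.no_grad():
    logits = model(image_batch)             # shape (1, 1000), not yet probabilities
    probs = F.softmax(logits, dim=1)        # Softmax turns the logits into class probabilities

print(probs.sum(dim=1))                     # tensor([1.0000])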

Softmax Implementation in Production
There are quite a few flavors of Softmax. Choosing the best option is a matter of computational efficiency and accuracy. Here we dive into the API code and source code of Pytorch and Tensorflow to see how Facebook and Google implement the function for production and research use.
Pytorch Implementation of Softmax
Coming soon
Below is how the Softmax module is defined in Pytorch. It's helpful to read through the docstring and the function signatures. Here are some important highlights:
1. The input is rescaled "so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1".
2. The LaTeX formula of Softmax is \text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, which renders as the screenshot below.

class Softmax(Module):
    r"""Applies the Softmax function to an n-dimensional input Tensor
    rescaling them so that the elements of the n-dimensional output Tensor
    lie in the range [0,1] and sum to 1.

    Softmax is defined as:

    .. math::
        \text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}

    Shape:
        - Input: :math:`(*)` where `*` means, any number of additional
          dimensions
        - Output: :math:`(*)`, same shape as the input

    Returns:
        a Tensor of the same dimension and shape as the input with
        values in the range [0, 1]

    Arguments:
        dim (int): A dimension along which Softmax will be computed (so every slice
            along dim will sum to 1).

    .. note::
        This module doesn't work directly with NLLLoss,
        which expects the Log to be computed between the Softmax and itself.
        Use `LogSoftmax` instead (it's faster and has better numerical properties).

    Examples::

        >>> m = nn.Softmax(dim=1)
        >>> input = torch.randn(2, 3)
        >>> output = m(input)
    """
    __constants__ = ['dim']

    def __init__(self, dim=None):
        super(Softmax, self).__init__()
        self.dim = dim

    def __setstate__(self, state):
        self.__dict__.update(state)
        if not hasattr(self, 'dim'):
            self.dim = None

    @weak_script_method
    def forward(self, input):
        return F.softmax(input, self.dim, _stacklevel=5)

    def extra_repr(self):
        return 'dim={dim}'.format(dim=self.dim)
The Softmax transformation can be summarized with the pattern F.softmax(logits, dim=1).
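For example, a minimal usage sketch (the tensor values are arbitrary):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 3.0]])     # a batch of two samples, three classes each
probs = F.softmax(logits, dim=1)             # normalize along the class dimension
print(probs.sum(dim=1))                      # tensor([1., 1.]); each row sums to one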
Tip for using Softmax result in Pytorch:
Choosing the best Softmax result: in multi-class classification, the Softmax activation function is often used, and Pytorch has a dedicated function to extract the top results, i.e. the most likely classes, from the Softmax output. torch.topk(input, k, dim) returns the k largest probabilities and their indices. See the pytorch.topk documentation excerpt and the short example after it.
torch.topk(input, k, dim=None, largest=True, sorted=True, out=None) -> (Tensor, LongTensor)
Returns the k largest elements of the given input tensor along a given dimension.
If dim is not given, the last dimension of the input is chosen.
If largest is False then the k smallest elements are returned.
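A quick sketch of picking the single most likely class (k=1) from Softmax probabilities; the logit values are made up:

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.5, 2.0, 0.1]])         # one sample, three classes
probs = F.softmax(logits, dim=1)
top_prob, top_class = torch.topk(probs, k=1, dim=1)
print(top_prob)    # the largest probability, roughly 0.73
print(top_class)   # tensor([[1]]); index of the most likely class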
Implement Softmax from Scratch in Pytorch
import torch

def softmax(x):
    return torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)
The numerator is the exponential of each element of x. The denominator also needs exponentials, so torch.exp(x) appears again. dim=1 tells torch.sum() to sum each row across all the columns. .view(-1, 1) reshapes the result so the division broadcasts correctly: if we don't reshape the denominator, the numerator (a matrix with one row per image) would be divided by torch.sum(torch.exp(x), dim=1), which is a flat vector, and the division would not line up row by row. What we really want is for each row of the numerator to be divided by its own single sum, so we reshape the vector into a column. If the numerator is batch_size by 10 in the MNIST example, we want the denominator to be batch_size by 1 so that the result is a batch_size by 10 tensor. Very tricky. Use the API for ease of use and numerical accuracy; use the customized formula when you want a lightweight, transparent implementation.
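A quick sanity check (random inputs, purely illustrative) that the from-scratch version agrees with the built-in F.softmax:

import torch
import torch.nn.functional as F

def softmax(x):
    # repeated from above so the snippet is self-contained
    return torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)

x = torch.randn(4, 10)                                    # e.g. a batch of 4 MNIST logit vectors
print(torch.allclose(softmax(x), F.softmax(x, dim=1)))    # True
print(softmax(x).sum(dim=1))                              # each row sums to 1

Note that this naive version can overflow for very large logits; production implementations typically subtract the maximum logit before exponentiating for numerical stability.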
Tensorflow Implementation of Softmax
Post under construction
According to the official documentation of TensorFlow Core 1.13, tf.nn.softmax performs the equivalent of softmax = tf.exp(logits) / tf.reduce_sum(tf.exp(logits), axis)
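A small sketch comparing the API call with the manual formula; the logit values are made up, and keepdims=True is added so the division broadcasts per row:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 0.5, 3.0]])
probs_api = tf.nn.softmax(logits, axis=-1)
probs_manual = tf.exp(logits) / tf.reduce_sum(tf.exp(logits), axis=-1, keepdims=True)
# both rows sum to 1 and the two results match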
Similarly, developers can use the keras API equivalent tf.keras.activations.softmax
tf.sparse.softmax applies softmax to a batched N-D SparseTensor.
There’s also tf.nn.softmax_cross_entropy_with_logits_v2, which computes softmax cross entropy between logits and labels (deprecated arguments). Warning: this op expects unscaled logits, since it performs a softmax on the logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results. The corresponding cross entropy API in older versions of TensorFlow is softmax_cross_entropy_with_logits.
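A minimal sketch of the intended usage under TensorFlow 1.x (in TensorFlow 2.x the op is named tf.nn.softmax_cross_entropy_with_logits); the key point is that raw logits, not Softmax output, are passed in:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])   # raw, unscaled logits straight from the network
labels = tf.constant([[1.0, 0.0, 0.0]])   # one-hot true label
# correct: pass the logits; the op applies softmax internally
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
# incorrect: passing tf.nn.softmax(logits) here would apply softmax twice and give wrong results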
Implementing Softmax from Scratch
On Kaggle, Rachael Tatman did a live Softmax implementation from scratch.
Select the Maximum Softmax Output
After the Softmax layer, in Pytorch, we still need to select the maximum value index as the top label. One from-scratch way to do it is to use torch.argmax.
Here's an example demonstrating it:
x = torch.FloatTensor([[0.2, 0.1, 0.7],[0.6, 0.2, 0.2],[0.1, 0.8, 0.1]])
y = torch.argmax(x, dim=1)
y
# tensor([2, 0, 1])
y = torch.argmax(x, dim=0)
y
# tensor([1, 2, 0])
Notice that argmax outputs the position of the max value, not the max value itself, whether along each row (dim=1) or each column (dim=0).
Softmax vs Sigmoid
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y(i)∈{0,1}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y(i)∈{1,…,K} where K is the number of classes. — Stanford Deep Learning Tutorial
Softmax is also known as Multi-Class Logistic Regression (citation needed). Using sigmoid activation for a classification task, we want to turn all logits into 2 classes (0, 1). The output should be rounded to either 0 or 1. The number of classes k = 2.
Sigmoid is a special case of Softmax where the possible outcomes are just 0 or 1.
(Under construction) To understand why the two functions are equivalent, it helps to remember that in binary classification the ground truth can only take two forms, 0 or 1, and the predicted label is a number between 0 and 1. Softmax calculates P(class = c | logits), explained verbosely: the probability that the class is c given an array of logits as input. Remember the logits come from the near-final layer of a neural network, so in pseudo code they are matrix_multiplication_of(inputs, weights), where the weights are learned. With only two classes we can fix the logit of class 0 at 0 (a reference class) and let the network learn a single logit z for class 1 (for class 0 we just calculate one minus P(class=1|logits)). For P(class = 1 | logits), the Softmax function calculates the e exponent of the current class divided by the sum of the e exponents of all classes: np.exp(z) divided by np.exp(0) + np.exp(z). Since e to the power of zero is one, the formula simplifies to np.exp(z) / (1 + np.exp(z)), which is exactly the sigmoid of z. It is more of a short proof than an obvious substitution. It took our staff writer a few tries to understand why Softmax and Sigmoid are equivalent for two-class classification when Udacity's Luis Serrano mentioned it. If you like reading formulas, this Quora post explains it best.
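A tiny numpy sketch of that argument (the logit value is arbitrary): with class 0's logit fixed at 0, the two-class Softmax probability of class 1 equals the sigmoid of class 1's logit.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v)
    return e / e.sum()

z = 1.7                                                  # an arbitrary logit for class 1
p_class1_softmax = softmax(np.array([0.0, z]))[1]        # two-class Softmax, class 0's logit is 0
p_class1_sigmoid = sigmoid(z)
print(np.isclose(p_class1_softmax, p_class1_sigmoid))    # True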
Softmax in Technical Interviews
Further Reading
Difference between LogSoftmax and Softmax in Pytorch
class Softmax(Module): … This module doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use `LogSoftmax` instead (it’s faster and has better numerical properties) — Pytorch Softmax documentation
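A small sketch of the recommended pairing (random values for illustration): feed LogSoftmax output to NLLLoss rather than taking the log of Softmax yourself.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                   # a batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])         # true class indices

log_probs = F.log_softmax(logits, dim=1)     # LogSoftmax: log of the Softmax, computed stably
loss = F.nll_loss(log_probs, targets)        # NLLLoss expects log-probabilities

# mathematically the same, but less numerically stable:
loss_naive = F.nll_loss(torch.log(F.softmax(logits, dim=1)), targets)
print(torch.isclose(loss, loss_naive))       # tensor(True)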
Other Flavors of Softmax
Full softmax versus candidate sampling softmax.