A Simple Introduction to Softmax

Hunter Phillips
5 min read · May 10, 2023


Softmax normalizes an input vector into a probability distribution using the exponential function.

Overview

According to DeepAI:

The softmax function is a function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, then it turns it into a large probability, but it will always remain between 0 and 1.

Softmax is a generalization of logistic regression that can be used for multi-class classification, and its formula is very similar to the sigmoid function which is used for logistic regression. The softmax function can be used in a classifier only when the classes are mutually exclusive.

Many multi-layer neural networks end in a penultimate layer which outputs real-valued scores that are not conveniently scaled and which may be difficult to work with. Here the softmax is very useful because it converts the scores to a normalized probability distribution, which can be displayed to a user or used as input to other systems. For this reason it is usual to append a softmax function as the final layer of the neural network.
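To illustrate that last point, here is a minimal sketch (with made-up layer sizes: 4 input features, 3 classes) of a small PyTorch classifier whose final layer is a softmax, so the real-valued scores from the linear layer come out as a probability distribution:

import torch
import torch.nn as nn

# a tiny classifier head with made-up sizes: 4 input features, 3 classes
model = nn.Sequential(
    nn.Linear(4, 3),    # outputs unnormalized real-valued scores
    nn.Softmax(dim=1)   # converts each row of scores to a probability distribution
)

probabilities = model(torch.randn(2, 4))  # 2 example inputs with 4 features each
print(probabilities.sum(dim=1))           # each row sums to 1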

The Softmax Function’s Components

The Input

The input to the softmax function is a vector of K elements, where z without an arrow represents an element of the vector:
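z⃗ = (z_1, z_2, …, z_K)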

An example can be seen below:
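z⃗ = [5, 7, 10]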

The Numerator

[Plot of the exponential function e^z. Image by Author]

The numerator applies the exponential function to each element of the vector, so the largest input produces the largest output. Negative inputs also become positive, since the exponential's range is (0, ∞). This can be seen in the plot above or by examining the interval below.
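z ∈ (−∞, ∞)  →  e^z ∈ (0, ∞)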

The Denominator

The summation in the denominator normalizes the output by ensuring its elements sum to 1, creating a probability distribution. All the exponentiated elements are added together, so when each exponentiated element is divided by this sum, the result is that element's fraction of the total. The sum of the exponentiated elements of [5, 7, 10] can be seen below:
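e^5 + e^7 + e^10 ≈ 148.41 + 1096.63 + 22026.47 ≈ 23271.51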

Example with a Vector

This example will use a 3-element vector, [5, 7, 10], to demonstrate softmax’s normalization capabilities.
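softmax(z⃗)_i = e^(z_i) / (e^(z_1) + e^(z_2) + … + e^(z_K)),  for i = 1, …, K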

The subscript i indicates that the function is evaluated for one element at a time; applying it to every element produces an output vector of K values.

Since K = 3, the function will be calculated three times:
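softmax(z⃗)_1 = e^5 / (e^5 + e^7 + e^10) ≈ 148.41 / 23271.51 ≈ 0.006
softmax(z⃗)_2 = e^7 / (e^5 + e^7 + e^10) ≈ 1096.63 / 23271.51 ≈ 0.047
softmax(z⃗)_3 = e^10 / (e^5 + e^7 + e^10) ≈ 22026.47 / 23271.51 ≈ 0.946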

The output is [0.006, 0.047, 0.946], which sums to approximately 1; technically it is 0.999 due to truncation of the displayed values. The smallest input, 5, has the lowest probability, and the largest input, 10, has the highest probability.

PyTorch has a softmax function that can be used to automatically calculate this, but it can also be calculated using the exponentiation and summation functions.

import torch

# set the vector to a tensor
z = torch.Tensor([5, 7, 10])

# apply softmax
softmax = torch.exp(z) / torch.sum(torch.exp(z))
print(softmax)
# tensor([0.0064, 0.0471, 0.9465])

Example with a Matrix

This time, softmax will be applied to each row of a 3 × 3 matrix, x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]. In order to sum the values of each row, torch.sum must have axis = 1, and keepdims = True in order to preserve the matrix's shape. This produces a 3 × 1 matrix of row sums that can be broadcast during division:

This can be seen in the code below.

x = torch.Tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

print(torch.sum(x, axis=1, keepdims=True))
# tensor([[ 6.],
#         [15.],
#         [24.]])

If keepdims = False, then the output would be a single vector with three elements: [6, 15, 24]. This can be seen below.

print(torch.sum(x, axis=1, keepdims=False))
# tensor([ 6., 15., 24.])

With this in mind, almost the same calculation can be used to calculate softmax on the matrix:

# apply softmax to each row
softmax = torch.exp(x) / torch.sum(torch.exp(x), axis=1, keepdims=True)
print(softmax)
# tensor([[0.0900, 0.2447, 0.6652],
#         [0.0900, 0.2447, 0.6652],
#         [0.0900, 0.2447, 0.6652]])

Each row now sums to approximately 1. However, all three output rows are identical even though the input rows were different. Why is this? It turns out that softmax depends only on the differences between a vector's elements, not their absolute values: adding the same constant c to every element multiplies the numerator and the denominator by the same factor e^c, which cancels:
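e^(z_i + c) / Σ_j e^(z_j + c) = (e^c · e^(z_i)) / (e^c · Σ_j e^(z_j)) = e^(z_i) / Σ_j e^(z_j)

Since each row of the matrix is just [1, 2, 3] shifted by a constant (0, 3, or 6), every row produces the same output. A quick check, using the first row and a shift of 6:

# adding a constant to every element leaves the softmax output unchanged
z = torch.Tensor([1, 2, 3])
shifted = z + 6   # same spacing between elements as [7, 8, 9]

print(torch.exp(z) / torch.sum(torch.exp(z)))
# tensor([0.0900, 0.2447, 0.6652])
print(torch.exp(shifted) / torch.sum(torch.exp(shifted)))
# tensor([0.0900, 0.2447, 0.6652])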

Torch Implementation

The same output can be generated by using nn.Softmax.

import torch.nn as nn

x = torch.Tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# apply softmax along each row (dim=1)
softmax_layer = nn.Softmax(dim=1)

output = softmax_layer(x)
print(output)
# tensor([[0.0900, 0.2447, 0.6652],
#         [0.0900, 0.2447, 0.6652],
#         [0.0900, 0.2447, 0.6652]])
