How to implement the derivative of Softmax independently from any loss function

Ms Aerin
Sep 3, 2017

The main job of the softmax function is to turn a vector of real numbers into a vector of probabilities: non-negative values that sum to 1.
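As a quick illustration, here is a minimal sketch, assuming the usual definition softmax(z)_i = exp(z_i) / sum_k exp(z_k). The helper name `softmax_1d` is hypothetical and only used for this example.

```python
import numpy as np

def softmax_1d(z):
    # Hypothetical helper: softmax of a 1-D vector of real numbers.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # shifting by the max avoids overflow in exp
    return e / e.sum()

p = softmax_1d([1.0, 2.0, 3.0])
print(p)        # every entry lies strictly between 0 and 1
print(p.sum())  # the entries sum to 1, up to floating-point rounding
```

Larger inputs receive larger probabilities, and the ordering of the inputs is preserved.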

The softmax function takes a vector as an input and returns a vector as an output. Therefore, when calculating the derivative of the softmax function, we require a Jacobian matrix, which is the matrix of all first-order partial derivatives.
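To make the shape concrete: for an input z with n entries and output σ = softmax(z), the Jacobian is an n × n matrix. The layout below (rows index outputs, columns index inputs) is one common convention and is an assumption here; for softmax the matrix turns out to be symmetric, so the choice does not change the result.

```latex
J = \frac{\partial \sigma}{\partial z} =
\begin{bmatrix}
\frac{\partial \sigma_1}{\partial z_1} & \cdots & \frac{\partial \sigma_1}{\partial z_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial \sigma_n}{\partial z_1} & \cdots & \frac{\partial \sigma_n}{\partial z_n}
\end{bmatrix}
```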

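In math notation, the derivative of the softmax output σ_j with respect to the logit z_i (for example, z_i = W_i * X) is written as:

$$\frac{\partial \sigma_j}{\partial z_i} = \sigma_j \left(\delta_{ij} - \sigma_i\right)$$

where δ_ij is the Kronecker delta: 1 if i = j and 0 otherwise.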
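For readers who want to see where this comes from, here is a short derivation sketch. It assumes the standard definition of softmax, σ_j = e^{z_j} / Σ_k e^{z_k}, and applies the quotient rule:

```latex
\sigma_j = \frac{e^{z_j}}{\sum_k e^{z_k}}
\quad\Longrightarrow\quad
\frac{\partial \sigma_j}{\partial z_i}
= \frac{\delta_{ij}\, e^{z_j} \sum_k e^{z_k} - e^{z_j} e^{z_i}}{\left(\sum_k e^{z_k}\right)^2}
= \sigma_j \left(\delta_{ij} - \sigma_i\right)

\frac{\partial \sigma_j}{\partial z_i} =
\begin{cases}
\sigma_i \,(1 - \sigma_i) & \text{if } i = j \\
-\sigma_j \,\sigma_i & \text{if } i \neq j
\end{cases}
```

The two cases are exactly the diagonal and off-diagonal entries of the Jacobian.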

How can we put this into code?

If you implement this iteratively in Python:

import numpy as np

def softmax_grad(s):
    # Jacobian of softmax with respect to the logits (usually z_i = W_i * X).
    # Input s is the softmax output for the original input x, not x itself.
    # s.shape = (n,)
    # e.g. x = np.array([0, 1]) gives s of roughly np.array([0.27, 0.73])
    # Initialize the 2-D Jacobian matrix with s on the diagonal.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                # Diagonal entries: s_i * (1 - s_i)
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                # Off-diagonal entries: -s_i * s_j
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

Let’s test.

In [95]: x = np.array([1, 2])

def softmax(z):
    # Subtract the max for numerical stability. This creates a new array,
    # so the caller's input is not modified in place.
    z = z - np.max(z)
    sm = (np.exp(z).T / np.sum(np.exp(z), axis=0)).T
    return sm

In [96]: softmax(x)
Out[96]: array([ 0.26894142,  0.73105858])

In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])
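One way to gain confidence in a hand-written Jacobian is to compare it against finite differences. The following is a minimal sketch of such a check, not part of the original post; the names `jacobian_formula` and `jacobian_numeric` and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def jacobian_formula(s):
    # Analytic Jacobian from the formula: J[i, j] = s[i] * (delta_ij - s[j])
    return np.diag(s) - np.outer(s, s)

def jacobian_numeric(z, eps=1e-6):
    # Central differences: column j approximates d softmax / d z[j]
    n = z.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * eps)
    return J

z = np.array([1.0, 2.0])
analytic = jacobian_formula(softmax(z))
numeric = jacobian_numeric(z)
print(np.allclose(analytic, numeric, atol=1e-8))  # do the two Jacobians agree?
print(analytic.sum(axis=1))                       # row sums of the Jacobian
```

The row sums are a useful extra sanity check: because the softmax outputs always sum to 1, each row of the Jacobian should sum to zero up to rounding.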

If you make a vectorized version of it:

soft_max = softmax(x)

def softmax_grad(sm):
    # Reshape the 1-D softmax output into a 2-D column vector so that
    # np.dot performs a matrix multiplication (the outer product s * s^T).
    s = sm.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

In [18]: softmax_grad(soft_max)
Out[18]:
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])
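Because this Jacobian does not depend on any loss, it can be combined with the gradient of whatever loss sits on top of the softmax via the chain rule. The sketch below assumes a cross-entropy loss with a one-hot target purely as an example; the target `y`, the variable names `dL_ds` and `dL_dz`, and the choice of loss are illustrative and not from the original post.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_grad(sm):
    # Vectorized Jacobian: diag(s) - s s^T
    s = sm.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

z = np.array([1.0, 2.0])
y = np.array([0.0, 1.0])   # hypothetical one-hot target
s = softmax(z)

# Gradient of the example loss L = -sum(y * log(s)) with respect to s
dL_ds = -y / s

# Chain rule: dL/dz = J^T @ dL/ds. The softmax Jacobian is symmetric,
# so no explicit transpose is needed here.
dL_dz = softmax_grad(s) @ dL_ds

print(dL_dz)
print(s - y)   # closed form for softmax followed by cross-entropy
```

Swapping in a different loss only changes the line that computes `dL_ds`; the Jacobian code stays the same.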
