How to implement the Softmax derivative independently from any loss function?

Aerin Kim 🙏

Mathematically, the derivative of the Softmax output σ(j) with respect to the logit z_i (for example, w_i · x) is

∂σ(j)/∂z_i = σ(j) (δ_ij − σ(i)),

where δ_ij is the Kronecker delta (1 if i = j, 0 otherwise).

If you implement iteratively:

import numpy as np

def softmax_grad(s):
    # Take the derivative of each softmax element w.r.t. each logit (usually w_i * x).
    # Input s is the softmax value of the original input x.
    # s.shape = (n,)
    # e.g. s = np.array([0.3, 0.7]) for x = np.array([0, 1])
    # Initialize the 2-D Jacobian matrix.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

Let’s test.

In [95]: x = np.array([1, 2])

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    sm = (np.exp(z).T / np.sum(np.exp(z), axis=0)).T
    return sm

In [96]: softmax(x)
Out[96]: array([ 0.26894142,  0.73105858])

In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])
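As a quick sanity check (my own addition, not in the original post), the analytic Jacobian can be compared against a central finite-difference approximation of the softmax:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D input.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_grad(s):
    # Iterative Jacobian, as above.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

x = np.array([1.0, 2.0])
analytic = softmax_grad(softmax(x))

# Central finite differences: column j approximates d softmax / d z_j.
eps = 1e-6
numeric = np.zeros((len(x), len(x)))
for j in range(len(x)):
    d = np.zeros_like(x)
    d[j] = eps
    numeric[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

If the two matrices disagree, the analytic derivative (not the finite difference) is almost always the one with the bug.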

If you implement it in a vectorized version:

soft_max = softmax(x)

def softmax_grad(softmax):
    # Reshape the 1-D softmax to 2-D so that we can do the matrix multiplication.
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

In [18]: softmax_grad(soft_max)
Out[18]:
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])
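Since the point is a loss-independent derivative, here is a sketch (my own addition; the variable names are illustrative) of plugging the Jacobian into the chain rule. Given an upstream gradient dL/ds from any loss, the logit gradient is dL/dz = Jᵀ · dL/ds. With cross-entropy and a one-hot target y, where dL/ds = −y / s, this collapses to the familiar s − y:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D input.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_grad(s):
    # Vectorized Jacobian, as above.
    s = s.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

x = np.array([1.0, 2.0])
s = softmax(x)
y = np.array([0.0, 1.0])           # one-hot target (illustrative)
dL_ds = -y / s                     # cross-entropy gradient w.r.t. softmax output
dL_dz = softmax_grad(s).T @ dL_ds  # chain rule through the Jacobian

print(np.allclose(dL_dz, s - y))  # True
```

Swapping in a different dL/ds (e.g. from MSE) requires no change to softmax_grad, which is exactly the separation the title promises.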

Written by Aerin Kim 🙏

I’m a Research Engineer at Microsoft AI Research and this is my notepad for Applied Math / CS / Deep Learning topics. Follow me on Twitter for more!
