# How to implement the Softmax derivative independently from any loss function?

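Mathematically, the derivative of the Softmax output **σ(j)** with respect to the logit **Zi** (for example, Wi*X) is

$$
\frac{\partial \sigma_j}{\partial z_i} = \sigma_j \left( \delta_{ij} - \sigma_i \right)
$$

where δ_ij is the Kronecker delta: 1 when i = j and 0 otherwise.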
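For readers who want the intermediate step, here is a short sketch of where that expression comes from. It assumes the usual definition of Softmax, σ_j = e^{z_j} / Σ_k e^{z_k}; the lowercase symbols are notation chosen for this sketch. Applying the quotient rule:

$$
\frac{\partial \sigma_j}{\partial z_i}
= \frac{\delta_{ij}\, e^{z_j} \sum_k e^{z_k} - e^{z_j}\, e^{z_i}}{\left( \sum_k e^{z_k} \right)^2}
= \delta_{ij}\, \sigma_j - \sigma_j \sigma_i
= \sigma_j \left( \delta_{ij} - \sigma_i \right)
$$

So the diagonal entries (i = j) are σ_i (1 − σ_i) and the off-diagonal entries (i ≠ j) are −σ_i σ_j, which are the two branches any implementation has to handle.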

If you implement it iteratively:

    import numpy as np

    def softmax_grad(s):
        # Take the derivative of each softmax element w.r.t. each logit, which is usually Wi * X.
        # Input s is the softmax value of the original input x.
        # s.shape = (n,), i.e. a 1-D array, so that np.diag(s) builds an (n, n) matrix.
        # e.g. s ≈ np.array([0.27, 0.73]) for x = np.array([0, 1])

        # Initialize the 2-D Jacobian matrix.
        jacobian_m = np.diag(s)
        for i in range(len(jacobian_m)):
            for j in range(len(jacobian_m)):
                if i == j:
                    jacobian_m[i][j] = s[i] * (1 - s[i])
                else:
                    jacobian_m[i][j] = -s[i] * s[j]
        return jacobian_m

Let’s test.

    In [95]: x = np.array([1, 2])
        ...: def softmax(z):
        ...:     # Shift by the max for numerical stability, without modifying the caller's array.
        ...:     z = z - np.max(z)
        ...:     sm = (np.exp(z).T / np.sum(np.exp(z), axis=0)).T
        ...:     return sm

    In [96]: softmax(x)
    Out[96]: array([ 0.26894142, 0.73105858])

    In [97]: softmax_grad(softmax(x))
    Out[97]:
    array([[ 0.19661193, -0.19661193],
           [-0.19661193, 0.19661193]])
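A single two-element example is easy to get right by accident, so it can help to compare the analytic Jacobian with a finite-difference estimate. The following is a minimal sketch, not part of the original walkthrough: the helper names `analytic_jacobian` and `numerical_jacobian`, the step size `eps`, and the test vector are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def analytic_jacobian(z):
    # Hypothetical helper: J[i, j] = s[i] * (delta_ij - s[j]).
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(z, eps=1e-6):
    # Hypothetical helper: central differences, one logit at a time.
    # Column i holds the change in every softmax output when z[i] moves.
    z = np.asarray(z, dtype=float)
    n = z.size
    jac = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        jac[:, i] = (softmax(z + step) - softmax(z - step)) / (2 * eps)
    return jac

z = np.array([1.0, 2.0, 0.5])  # arbitrary test logits
print(np.allclose(analytic_jacobian(z), numerical_jacobian(z), atol=1e-8))
```

If the two matrices disagree, the usual suspects are a sign error in the off-diagonal branch or passing the raw logits where the softmax output is expected.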

If you implement it in a vectorized version:

    soft_max = softmax(x)

    def softmax_grad(s):
        # Reshape the 1-D softmax output to a 2-D column vector so that np.dot does the matrix multiplication.
        # The parameter is named s so that it does not shadow the softmax function.
        s = s.reshape(-1, 1)
        return np.diagflat(s) - np.dot(s, s.T)

    In [18]: softmax_grad(soft_max)
    Out[18]:
    array([[ 0.19661193, -0.19661193],
           [-0.19661193, 0.19661193]])
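Because this Jacobian does not depend on the loss, it can be combined with whatever gradient arrives from the next layer via the chain rule. The sketch below assumes a hypothetical upstream gradient `grad_s` (dL/ds for some loss L) and compares the explicit Jacobian product with a closed form that never builds the n-by-n matrix. The function names `backward_via_jacobian` and `backward_closed_form` are illustrative, not from any library.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_grad(s):
    # Vectorized Jacobian: diag(s) - s s^T.
    s = s.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

def backward_via_jacobian(s, grad_s):
    # Chain rule: dL/dz = J^T @ dL/ds. The softmax Jacobian is symmetric,
    # so the transpose is kept only to make the general rule explicit.
    return softmax_grad(s).T @ grad_s

def backward_closed_form(s, grad_s):
    # Same product without materializing J: s * (g - <g, s>).
    return s * (grad_s - np.dot(grad_s, s))

s = softmax(np.array([1.0, 2.0, 3.0]))
grad_s = np.array([0.1, -0.4, 0.25])  # hypothetical upstream gradient
print(np.allclose(backward_via_jacobian(s, grad_s), backward_closed_form(s, grad_s)))
```

As a sanity check on the idea, plugging in the cross-entropy gradient for a one-hot target `y`, `grad_s = -y / s`, reduces the result to the familiar `s - y`.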