Computing the Jacobian matrix of a neural network in Python

Julien Herzen

Published in

Unit8 - Big Data & AI

8 min readNov 13, 2018

In general, a neural network is a multivariate, vector-valued function looking like this:

The function f has some parameters θ (the weights of the neural net), and it maps a N-dimensional vector x (e.g., the N pixels of a cat picture) to a M-dimensional vector (e.g., the probabilities that x belongs to each of M different classes):

During training, one usually attaches a scalar loss value to the output — a typical example for classification would be the cross-entropy loss over the predicted class probabilities. When using such a scalar loss, M=1 and the parameters are then learned by performing (stochastic) gradient descent, during which the gradient of the loss function with respect to θ is repeatedly computed. Therefore, it is very common to compute the gradient of a scalar output value with respect to the network’s parameters during training, and all common machine learning libraries can do this, usually very efficiently using automatic differentiation [1].

At inference time however, the output of the network is often a vector (e.g., the class probabilities). In such cases, it can be interesting to look at the Jacobian matrix of the network. In this article, I quickly explain what the Jacobian is, and then I explore and compare a few possible implementations done in Python.

What is the Jacobian matrix and why would we care?

Let’s call y the output vector of f. The Jacobian matrix of f contains the partial derivatives of each element of y, with respect to each element of the input x:

This matrix tells us how local perturbations the neural network input would impact the output. There are cases where this information can be valuable. For example, in ML systems used for creative tasks, it can be convenient to have the system provide the users with some interactive feedback, telling them how modifying each input dimension would impact each of the output classes.

Tensorflow

Let’s try right ahead with Tensorflow. First, we need a toy network to play with. Here I am just interested in computing the Jacobian of an existing network f at test time, so I do not focus on training. Let’s say we have a simple network [affine → ReLU → affine → softmax]. Let’s first define some random parameters:

import numpy as npN = 500  # Input size
H = 100  # Hidden layer size
M = 10   # Output sizew1 = np.random.randn(N, H)  # first affine layer weights
b1 = np.random.randn(H)     # first affine layer biasw2 = np.random.randn(H, M)  # second affine layer weights
b2 = np.random.randn(M)     # second affine layer bias

Using Keras, we implement our network as follows:

import tensorflow as tf
from tensorflow.keras.layers import Densesess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())model = tf.keras.Sequential()
model.add(Dense(H, activation='relu', use_bias=True, input_dim=N))
model.add(Dense(O, activation='softmax', use_bias=True, input_dim=O))
model.get_layer(index=0).set_weights([w1, b1])
model.get_layer(index=1).set_weights([w2, b2])

Let’s now try to compute the Jacobian of this model. Unfortunately, Tensorflow does not currently provide a method to compute Jacobian matrices out-of-the-box. The method tf.gradients(ys, xs) returnssum(dy/dx) for each x in xs, which in our case would be N-dimensional vector containing the sums of the Jacobian rows; not quite what we are looking for (see this issue). However, we can still compute our Jacobian matrix, by computing the gradients vectors for each yi, and grouping the output into a matrix:

def jacobian_tensorflow(x):    
    jacobian_matrix = []
    for m in range(M):
        # We iterate over the M elements of the output vector
        grad_func = tf.gradients(model.output[:, m], model.input)
        gradients = sess.run(grad_func, feed_dict={model.input: x.reshape((1, x.size))})
        jacobian_matrix.append(gradients[0][0,:])
        
    return np.array(jacobian_matrix)

Let’s make sure the Jacobian we compute is actually correct by checking it using numerical differentiation. The function is_jacobian_correct() below takes in argument a function computing the Jacobian and the feedforward function f:

def is_jacobian_correct(jacobian_fn, ffpass_fn):
    """ Check of the Jacobian using numerical differentiation
    """
    x = np.random.random((N,))
    epsilon = 1e-5    """ Check a few columns at random
    """
    for idx in np.random.choice(N, 5, replace=False):
        x2 = x.copy()
        x2[idx] += epsilon        num_jacobian = (ffpass_fn(x2) - ffpass_fn(x)) / epsilon
        computed_jacobian = jacobian_fn(x)
        
        if not all(abs(computed_jacobian[:, idx] - num_jacobian) < 1e-3): 
            return False    return Truedef ffpass_tf(x):
    """ The feedforward function of our neural net
    """    
    xr = x.reshape((1, x.size))
    return model.predict(xr)[0]is_jacobian_correct(jacobian_tensorflow, ffpass_tf)>> True

Very good, it’s correct. Let’s see how long this computation takes:

tic = time.time()
jacobian_tf = jacobian_tensorflow(x0, verbose=False)
tac = time.time()print('It took %.3f s. to compute the Jacobian matrix' % (tac-tic))>> It took 0.658 s. to compute the Jacobian matrix

So it takes about 650 ms on my Macbook Pro 4-cores CPU. It can be possible to do better with Tensorflow; but it doesn’t look like we could do drastically better at the time of writing, because Tensorflow anyways require to loop the gradient computation over the M outputs (note that I didn’t try using GPUs here). 650 ms is too slow for such a toy example, especially if we have interactive usage at test time in mind, so let’s see if we can do better.

Autograd

Autograd is a very nice library, which notably performs automatic differentiation on top of Numpy. To use it, we first have to specify our feedforward function f using Autograd’s encapsulated Numpy:

import autograd.numpy as anpdef ffpass_anp(x):
    a1 = anp.dot(x, w1) + b1   # affine
    a1 = anp.maximum(0, a1)    # ReLU
    a2 = anp.dot(a1, w2) + b2  # affine
    
    exps = anp.exp(a2 - anp.max(a2))  # softmax
    out = exps / exps.sum()
    return out

First, let’s check that this function is correct, by comparing it with our previous Tensorflow feedforward function ffpass_tf().

out_anp = ffpass_anp(x0)
out_keras = ffpass_tf(x0)np.allclose(out_anp, out_keras, 1e-4)>> True

OK, we have the same function f. Let’s now compute the Jacobian matrix. It’s straightforward with Autograd:

from autograd import jacobiandef jacobian_autograd(x):
    return jacobian(ffpass_anp)(x)is_jacobian_correct(jacobian_autograd, ffpass_np)>> True

OK good, we have the function and it looks correct. And how long does it take?

%timeit jacobian_autograd(x0)>> 3.69 ms ± 135 µs

Wow, so where our Tensorflow implementation took about 650 ms, Autograd needs 3.7 ms, giving us a ~170x speed increase in this case. Of course, it’s not always convenient to specify one’s model using Numpy, because Tensorflow and Keras provide a lot of useful functions and training facilities out-of-the-box… But now that we crossed this step and wrote our network using Numpy, could we perhaps make it even faster? Well, if you look at the implementation of the jacobian() function of Autograd, it turns out it’s still mapping over the dimensions of the function output. This is a hint that we could perhaps improve our results by relying on better vectorization with Numpy directly.

Numpy

If we want a proper Numpy implementation, we have to specify the forward and backward passes of each of our layers in order to implement backprop ourselves. I did it below for the three kinds of layers that our toy network contains — affine, ReLU and softmax. The implementation of the layers here is quite generic (it could be made more compact if we cared about this one network only). The signature is inspired from cs231n, but I slightly changed it to avoid returning the gradients w.r.t. the weights, as we do not need them here.

Basically, the backward pass now propagates matrices (or, in the general case, tensors) containing the gradients of each network output, and we use Numpy efficient vectorized operations:

def affine_forward(x, w, b):
    """
    Forward pass of an affine layer
    :param x: input of dimension (I, )
    :param w: weights matrix of dimension (I, O)
    :param b: biais vector of dimension (O, )
    :return output of dimension (O, ), and cache needed for backprop
    """
    out = np.dot(x, w) + b
    cache = (x, w)
    return out, cachedef affine_backward(dout, cache):
    """
    Backward pass for an affine layer.
    :param dout: Upstream Jacobian, of shape (M, O)
    :param cache: Tuple of:
      - x: Input data, of shape (I, )
      - w: Weights, of shape (I, O)
    :return the jacobian matrix containing derivatives of the M neural network outputs with respect to
            this layer's inputs, evaluated at x, of shape (M, I)
    """
    x, w = cache
    dx = np.dot(dout, w.T)
    return dxdef relu_forward(x):
    """ Forward ReLU
    """
    out = np.maximum(np.zeros(x.shape), x)
    cache = x
    return out, cachedef relu_backward(dout, cache):
    """
    Backward pass of ReLU
    :param dout: Upstream Jacobian
    :param cache: the cached input for this layer
    :return: the jacobian matrix containing derivatives of the M neural network outputs with respect to
             this layer's inputs, evaluated at x.
    """
    x = cache
    dx = dout * np.where(x > 0, np.ones(x.shape), np.zeros(x.shape))
    return dxdef softmax_forward(x):
    """ Forward softmax
    """
    exps = np.exp(x - np.max(x))
    s = exps / exps.sum()
    return s, s
    
def softmax_backward(dout, cache):
    """
    Backward pass for softmax
    :param dout: Upstream Jacobian
    :param cache: contains the cache (in this case the output) for this layer
    """
    s = cache
    ds = np.diag(s) - np.outer(s, s.T)
    dx = np.dot(dout, ds)
    return dx

Now that we have defined our layers, let’s use them in a forward and backward pass:

def forward_backward(x):
    layer_to_cache = dict()  # for each layer, we store the cache needed for backward pass    # Forward pass
    a1, cache_a1 = affine_forward(x, w1, b1)
    r1, cache_r1 = relu_forward(a1)
    a2, cache_a2 = affine_forward(r1, w2, b2)
    out, cache_out = softmax_forward(a2)    # backward pass
    dout = np.diag(np.ones(out.size, ))  # the derivatives of each output w.r.t. each output.
    dout = softmax_backward(dout, cache_out)
    dout = affine_backward(dout, cache_a2)
    dout = relu_backward(dout, cache_r1)
    dx = affine_backward(dout, cache_a1)
    
    return out, dx

Is the feedforward output correct?

out_fb = forward_backward(x0)[0]
out_tf = ffpass_tf(x0)np.allclose(out_fb, out_tf, 1e-4)>> True

And is our Jacobian matrix correct?

is_jacobian_correct(lambda x: forward_backward(x)[1], ffpass_tf)>> True

Good. Finally: how long does it take?

%timeit forward_backward(x0)>> 115 µs ± 2.38 µs

So where Autograd needed 3.7 ms, we now only need 115 µs. Much better :)

Conclusions

I have explored a few possible ways to compute the Jacobian matrix, using Tensorflow, Autograd and Numpy on CPU. Each method comes with different pros and cons. If you’re ready to specify your layers’ forward and backward passes, you can earn a lot performance-wise using Numpy directly — about ~5,000x for my toy network and example implementations. Of course your mileage will vary depending on your network architecture; typically, the larger the output dimension M the more you could win over methods that need to loop over M scalar outputs.
I find this good to know, as it is quite common that networks need to run efficiently at test time. In those cases it can be worth spending a bit of time to make sure their implementation is efficient. This is even more true when one needs to compute the Jacobian matrix in an interactive way. It’s also a good example why “yes, you should understand backprop”.

You can find all the code in a notebook here.

Note that my tests are by no means exhaustive, and it’s possible that there could be more efficient implementations. Finally, let me mention that during some latter experiments, I found PyTorch to perform close to autograd (also looping of the M outputs’ gradients).

[1]: Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(153):1–43, 2018