The Matrix Calculus You Need For Deep Learning | By Terence Parr & Jeremy Howard | Part I

Wei Wen
7 min read · Sep 6, 2022



In this post I will try to summarize and explain all the essential mathematics concepts from this paper that you will need in order to understand what happens under the hood when training a neural network.

Introduction

The activation of a single computation unit in a neural network (neuron) is typically calculated by:
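In the paper this is a weighted sum of the inputs plus a bias term, passed through an activation function such as the rectified linear unit; roughly:

$$\operatorname{activation}(\mathbf{x}) = \max\!\left(0,\ \mathbf{w}\cdot\mathbf{x} + b\right) = \max\!\left(0,\ \sum_{i=1}^{n} w_i x_i + b\right)$$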

Neural networks consist of many of these units, organized into multiple collections of neurons called layers. The activations of one layer’s units become the inputs to the next layer’s units. The activation of the unit or units in the final layer is called the network output.

When training a neuron, the goal is to slowly tweak the weights and bias so that the overall loss gets smaller across all the x inputs. To do that, we minimize a loss function using gradient descent.

The gradient can be derived by differentiating the common loss function (mean-squared error):
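Averaging the squared difference between the desired and predicted outputs over the N examples x in the training set (calling the training set X here), the loss is:

$$\frac{1}{N}\sum_{\mathbf{x} \in X}\big(\operatorname{target}(\mathbf{x}) - \operatorname{activation}(\mathbf{x})\big)^{2}$$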

target(x) = desired output; activation(x) = model’s output

But this is just one neuron, and neural networks must train the weights and biases of all neurons in all layers simultaneously. Because there are multiple inputs and (potentially) multiple network outputs, we really need general rules for the derivative of a function with respect to a vector and even rules for the derivative of a vector-valued function with respect to a vector.

This article walks through the derivation of some important rules for computing partial derivatives with respect to vectors, particularly those useful for training neural networks. This field is known as matrix calculus.

Review: Scalar derivative rules

Here are some of the main scalar derivative rules. If you need a refresher, have a look at the Khan Academy video on scalar derivative rules.
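For reference, these are the constant, constant-multiple, power, sum, product, and chain rules:

$$\begin{aligned}
&\frac{d}{dx}c = 0, \qquad \frac{d}{dx}\big(c\,f(x)\big) = c\,\frac{df}{dx}, \qquad \frac{d}{dx}x^{n} = n\,x^{n-1},\\[4pt]
&\frac{d}{dx}\big(f(x)+g(x)\big) = \frac{df}{dx}+\frac{dg}{dx}, \qquad \frac{d}{dx}\big(f(x)\,g(x)\big) = \frac{df}{dx}\,g(x)+f(x)\,\frac{dg}{dx},\\[4pt]
&\frac{d}{dx}f\big(g(x)\big) = \frac{df}{dg}\,\frac{dg}{dx}.
\end{aligned}$$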

Introduction to vector calculus and partial derivatives

When it comes to a function with multiple parameters, such as f(x, y), how the function changes when we wiggle the parameters depends on which parameter we change. For example, how the product xy changes depends on whether we are changing x or y.

We compute derivatives with respect to one variable (parameter) at a time, giving us two different partial derivatives for this two-parameter function (one for x and one for y). Instead of the operator d/dx, we use the partial derivative operator ∂/∂x, giving the partials ∂(xy)/∂x and ∂(xy)/∂y.
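For the product xy, holding one variable constant while differentiating with respect to the other gives:

$$\frac{\partial (xy)}{\partial x} = y, \qquad \frac{\partial (xy)}{\partial y} = x$$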

Consider the function f(x, y) = 3x²y. The partial derivative with respect to x treats y like a constant:
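Applying the power rule to the x² factor while holding y fixed:

$$\frac{\partial}{\partial x}\,3x^{2}y = 3y\,\frac{\partial}{\partial x}x^{2} = 3y \cdot 2x = 6xy$$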

The partial derivative with respect to y treats x like a constant:
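Here 3x² acts as the constant coefficient of y:

$$\frac{\partial}{\partial y}\,3x^{2}y = 3x^{2}\,\frac{\partial}{\partial y}y = 3x^{2} \cdot 1 = 3x^{2}$$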

Here’s the Khan Academy video on partials if you need a refresher.
Then we organize the partials into a horizontal vector; this vector is the gradient of f(x, y), which we write as:
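Using the two partials just computed:

$$\nabla f(x,y) = \left[\frac{\partial f(x,y)}{\partial x},\ \frac{\partial f(x,y)}{\partial y}\right] = \left[\,6xy,\ 3x^{2}\,\right]$$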

the gradient of f(x, y) is simply a vector of its partials.

Matrix calculus

When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. Let’s bring in g(x,y) = 2x + y⁸. The gradient for g has two entries, a partial derivative for each parameter:
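The y⁸ term is a constant with respect to x:

$$\frac{\partial g(x,y)}{\partial x} = \frac{\partial}{\partial x}\big(2x + y^{8}\big) = 2 + 0 = 2$$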

and
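this time 2x is the constant term with respect to y:

$$\frac{\partial g(x,y)}{\partial y} = \frac{\partial}{\partial y}\big(2x + y^{8}\big) = 0 + 8y^{7} = 8y^{7}$$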

giving us the gradient ∇g(x, y) = [2, 8y⁷].

If we have two functions, we can also organize their gradients into a matrix by stacking them. When we do so, we get the Jacobian matrix (or just the Jacobian), where the gradients are rows:
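Stacking ∇f on top of ∇g for the two example functions:

$$J = \begin{bmatrix} \nabla f(x,y) \\ \nabla g(x,y) \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f(x,y)}{\partial x} & \dfrac{\partial f(x,y)}{\partial y} \\[8pt] \dfrac{\partial g(x,y)}{\partial x} & \dfrac{\partial g(x,y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6xy & 3x^{2} \\ 2 & 8y^{7} \end{bmatrix}$$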

numerator layout

or
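with the same entries arranged so that the gradients are columns:

$$J^{\top} = \begin{bmatrix} 6xy & 2 \\ 3x^{2} & 8y^{7} \end{bmatrix}$$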

denominator layout (transpose of the numerator layout)

Generalization of the Jacobian

To define the Jacobian matrix more generally, let’s combine multiple parameters into a single vector argument: f(x, y, z) ⇒ f(x). Lowercase letters in bold font are vectors (x) and those in italic font are scalars (x). We’ll assume that all vectors are vertical (column vectors) by default, of size n × 1:
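That is, x collects the n scalar parameters into one column:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$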

Let y = f(x) be a vector of m scalar-valued functions that each take a vector x of length n = |x| (the cardinality (count) of elements in x). Each fᵢ function within f returns a scalar just as in the previous section:
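Component by component, y is:

$$\begin{aligned} y_1 &= f_1(\mathbf{x}) \\ y_2 &= f_2(\mathbf{x}) \\ &\;\;\vdots \\ y_m &= f_m(\mathbf{x}) \end{aligned}$$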

It’s very often the case that m = n because we will have a scalar function result for each element of the x vector. For example, consider the identity function y = f(x) = x:
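Here each fᵢ simply picks out its own element of x:

$$\begin{aligned} y_1 &= f_1(\mathbf{x}) = x_1 \\ y_2 &= f_2(\mathbf{x}) = x_2 \\ &\;\;\vdots \\ y_n &= f_n(\mathbf{x}) = x_n \end{aligned}$$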

Generally speaking, though, the Jacobian matrix is the collection of all m × n possible partial derivatives (m rows and n columns, i.e., m functions and n inputs), which is the stack of m gradients with respect to x:
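In numerator layout, that stack of gradients is:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1(\mathbf{x})}{\partial x_1} & \dfrac{\partial f_1(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_m(\mathbf{x})}{\partial x_1} & \dfrac{\partial f_m(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}$$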

For example, with 5 inputs (x1…x5) and 5 functions (f1…f5) producing 5 outputs (y1…y5), the Jacobian is a 5 × 5 matrix.

The possible Jacobian shapes visually:
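Depending on whether the output and the input are scalars or vectors, the shape of ∂y/∂x in numerator layout works out to:

$$\begin{array}{c|cc}
\partial y / \partial x & x \text{ scalar} & \mathbf{x} \text{ vector of size } n \\ \hline
y \text{ scalar} & 1 \times 1 \text{ (scalar)} & 1 \times n \text{ (gradient, a row vector)} \\
\mathbf{y} \text{ vector of size } m & m \times 1 \text{ (column vector)} & m \times n \text{ (Jacobian matrix)}
\end{array}$$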

and an example if it helps:
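As one illustrative example (not from the paper): take m = 2 functions of n = 3 inputs, f₁(x) = x₁ + 2x₂ and f₂(x) = x₂x₃. Then:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dfrac{\partial f_1}{\partial x_3} \\[8pt] \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dfrac{\partial f_2}{\partial x_3} \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 \\ 0 & x_3 & x_2 \end{bmatrix}$$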

The Jacobian of the identity function f(x) = x, with fᵢ(x) = xᵢ, has n functions and each function has n parameters held in a single vector x. The Jacobian is, therefore, a square matrix since m = n:
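Each fᵢ(x) = xᵢ depends only on its own input, so every off-diagonal partial is zero:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial x_1}{\partial x_1} & \cdots & \dfrac{\partial x_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial x_n}{\partial x_1} & \cdots & \dfrac{\partial x_n}{\partial x_n} \end{bmatrix} = \begin{bmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{bmatrix} = I$$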

Derivatives of vector element-wise binary operators

Element-wise binary operations on vectors can be generalized with the notation

y = f(w) O g(x) where m = n = |y| = |w| = |x|

The O symbol represents any element-wise operator (such as +, −, ⋅, /). Here’s what the equation y = f(w) O g(x) looks like when we zoom in to examine the scalar equations:
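Writing the generic element-wise operator O as ○, each output element combines the corresponding elements of the two function results:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x}) \\ f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x}) \\ \vdots \\ f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x}) \end{bmatrix}$$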

Using the ideas from the last section, we can see that the general case for the Jacobian with respect to w is the square matrix:
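Each entry differentiates one scalar equation with respect to one element of w:

$$J_{\mathbf{w}} = \frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \dfrac{\partial}{\partial w_1}\big(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})\big) & \cdots & \dfrac{\partial}{\partial w_n}\big(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})\big) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial}{\partial w_1}\big(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})\big) & \cdots & \dfrac{\partial}{\partial w_n}\big(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})\big) \end{bmatrix}$$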

and the Jacobian with respect to x is:
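Identical in form, but with the partials taken with respect to the elements of x:

$$J_{\mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial}{\partial x_1}\big(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})\big) & \cdots & \dfrac{\partial}{\partial x_n}\big(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})\big) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial}{\partial x_1}\big(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})\big) & \cdots & \dfrac{\partial}{\partial x_n}\big(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})\big) \end{bmatrix}$$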

which can be simplified to:
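In the element-wise case each fᵢ depends only on wᵢ and each gᵢ only on xᵢ, so the off-diagonal partials are zero and only the diagonal survives:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \dfrac{\partial}{\partial w_1}\big(f_1(w_1) \bigcirc g_1(x_1)\big) & & 0 \\ & \ddots & \\ 0 & & \dfrac{\partial}{\partial w_n}\big(f_n(w_n) \bigcirc g_n(x_n)\big) \end{bmatrix}$$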

More succinctly, we can write:
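Collecting the diagonal entries with the diag notation:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \operatorname{diag}\!\left(\frac{\partial}{\partial w_1}\big(f_1(w_1) \bigcirc g_1(x_1)\big),\ \ldots,\ \frac{\partial}{\partial w_n}\big(f_n(w_n) \bigcirc g_n(x_n)\big)\right)$$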

and
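likewise for x:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \operatorname{diag}\!\left(\frac{\partial}{\partial x_1}\big(f_1(w_1) \bigcirc g_1(x_1)\big),\ \ldots,\ \frac{\partial}{\partial x_n}\big(f_n(w_n) \bigcirc g_n(x_n)\big)\right)$$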

where diag(x) constructs a matrix whose diagonal elements are taken from vector x.

Very often, the function f is just the vector w itself (the identity), in which case fᵢ(w) reduces to fᵢ(wᵢ) = wᵢ (and likewise gᵢ(x) = xᵢ). For example, vector addition w + x fits our element-wise diagonal condition because f(w) + g(x) has scalar equations yᵢ = fᵢ(w) + gᵢ(x) that reduce to just yᵢ = fᵢ(wᵢ) + gᵢ(xᵢ) = wᵢ + xᵢ, with partial derivatives:
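Differentiating wᵢ + xᵢ with respect to wⱼ gives 1 when i = j and 0 otherwise, and the same holds for xⱼ, so both Jacobians are the identity matrix:

$$\frac{\partial (w_i + x_i)}{\partial w_j} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}
\qquad\Rightarrow\qquad
\frac{\partial (\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = \frac{\partial (\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = \operatorname{diag}(1, \ldots, 1) = I$$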

Derivatives involving scalar expansion
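Briefly, sketching the results of this part of the paper rather than the full derivation: when a scalar z is added to, or multiplies, every element of a vector x, the same diagonal reasoning gives (with 1 denoting a vector of ones):

$$\frac{\partial (\mathbf{x} + z)}{\partial \mathbf{x}} = I, \qquad \frac{\partial (\mathbf{x} + z)}{\partial z} = \vec{1}, \qquad \frac{\partial (\mathbf{x} z)}{\partial \mathbf{x}} = zI, \qquad \frac{\partial (\mathbf{x} z)}{\partial z} = \mathbf{x}$$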

Vector sum reduction

Let y be the sum over the elements of a vector of functions f(x). We are careful here to leave the parameter as the full vector x because each fᵢ could use all of its values, not just xᵢ:
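In summation notation:

$$y = \operatorname{sum}\big(\mathbf{f}(\mathbf{x})\big) = \sum_{i=1}^{n} f_i(\mathbf{x})$$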

The sum is over the results of the function and not the parameter.

The gradient (1 × n Jacobian) of vector summation is:
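Differentiating the sum with respect to each element of x and moving the derivative inside the summation:

$$\frac{\partial y}{\partial \mathbf{x}} = \left[\frac{\partial y}{\partial x_1},\ \ldots,\ \frac{\partial y}{\partial x_n}\right] = \left[\frac{\partial}{\partial x_1}\sum_{i} f_i(\mathbf{x}),\ \ldots,\ \frac{\partial}{\partial x_n}\sum_{i} f_i(\mathbf{x})\right] = \left[\sum_{i}\frac{\partial f_i(\mathbf{x})}{\partial x_1},\ \ldots,\ \sum_{i}\frac{\partial f_i(\mathbf{x})}{\partial x_n}\right]$$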

Let’s look at the gradient of the simple y = sum(x). The function inside the summation is just fᵢ(x) = xᵢ and the gradient is then:
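Substituting fᵢ(x) = xᵢ into the gradient above:

$$\nabla y = \left[\sum_{i}\frac{\partial x_i}{\partial x_1},\ \sum_{i}\frac{\partial x_i}{\partial x_2},\ \ldots,\ \sum_{i}\frac{\partial x_i}{\partial x_n}\right]$$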

Because ∂xᵢ/∂xⱼ = 0 for j ≠ i, we can simplify this to:
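Only the i = j term survives in each sum:

$$\frac{\partial y}{\partial \mathbf{x}} = \left[\frac{\partial x_1}{\partial x_1},\ \frac{\partial x_2}{\partial x_2},\ \ldots,\ \frac{\partial x_n}{\partial x_n}\right] = [\,1, 1, \ldots, 1\,] = \vec{1}^{\,\top}$$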

In the next post I will talk about the chain rule and the gradient of neuron activation.
