The Matrix Calculus You Need For Deep Learning | By Terence Parr & Jeremy Howard | Part II

Wei Wen
Sep 6, 2022



Hopefully you have gone through part one and have a clear idea of vector calculus and partial derivatives. In this post, I will finish up with the chain rule, the gradient of neuron activation, and the gradient of the neural network loss function.

The Chain Rules

The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute.

Single-variable chain rule

Chain rules are typically defined in terms of nested functions, such as y = f(g(x)) for single-variable chain rules. Here is the formulation of the single-variable chain rule the authors recommend:

dy/dx = dy/du · du/dx, where u = g(x)

To deploy the single-variable chain rule, follow these steps:
1. Introduce intermediate variables for nested subexpressions and subexpressions for both binary and unary operators; e.g., × is binary, sin(x) and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
2. Compute derivatives of the intermediate variables with respect to their parameters.
3. Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
4. Substitute intermediate variables back in if any are referenced in the derivative equation.

Let’s see how the chain rule works with deeply nested expressions such as f₄(f₃(f₂(f₁(x)))), which is exactly how data flows through a neural network: the output of one layer becomes the input to the next.

For example, y = f(x) = ln(sin(x²)):

1. Introduce intermediate variables. Let u₁ = x² represent subexpression x², and u₂ the next level of nesting:

u₁ = x²
u₂ = sin(u₁)
y = ln(u₂)

2. Compute derivatives of each intermediate variable with respect to its parameter:

du₁/dx = 2x
du₂/du₁ = cos(u₁)
dy/du₂ = 1/u₂

3. Combine by multiplying the derivatives together:

dy/dx = (dy/du₂) · (du₂/du₁) · (du₁/dx) = (1/u₂) · cos(u₁) · 2x

4. Substitute the intermediate variables back in:

dy/dx = 2x · cos(x²) / sin(x²)

Here is the data flow through the chain of operations from x to y: x feeds the squaring operation, whose output u₁ feeds sin, whose output u₂ feeds ln, which produces y.
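As a quick sanity check, here is a small Python sketch (my own, not from the original post) that compares the symbolic result against a central finite difference:

    import math

    def y(x):
        return math.log(math.sin(x**2))

    def dydx(x):
        # result from steps 2-4: (1/u2) * cos(u1) * 2x, with u1 = x**2, u2 = sin(u1)
        return 2 * x * math.cos(x**2) / math.sin(x**2)

    x, h = 1.0, 1e-6
    numeric = (y(x + h) - y(x - h)) / (2 * h)  # central finite difference
    print(numeric, dydx(x))                    # both ≈ 1.2842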

Single-variable total-derivative chain rule

When it comes to a function like y = f(x) = x + x², the single-variable chain rule above breaks down. Introducing the intermediate variable u₁(x) = x² turns the outer expression into u₂(x, u₁) = x + u₁, a function of two variables, and the derivative operator d/dx does not apply to multivariate functions.

A quick look at the data flow diagram for y = u₂(x, u₁) shows that x reaches y along two paths: directly, as an operand of u₂, and indirectly, through u₁.

A change in x affects y both as an operand of the addition and as the operand of the square operator. Here’s an equation that describes how a tweak Δx to x affects the output:

ŷ = (x + Δx) + (x + Δx)²

So, the “law” of total derivatives says that to compute dy/dx, we need to sum up all possible contributions from changes in x to the change in y. The total derivative with respect to x assumes all variables, such as u₁ in this case, are functions of x and potentially vary as x varies.

The total derivative of f(x) = u₂(x, u₁), which depends on x directly and indirectly via the intermediate variable u₁(x), is given by:

∂y/∂x = ∂u₂/∂x · ∂x/∂x + ∂u₂/∂u₁ · ∂u₁/∂x

Since ∂x/∂x = 1, this simplifies to:

∂y/∂x = ∂u₂/∂x + ∂u₂/∂u₁ · ∂u₁/∂x

For our example, ∂u₂/∂x = 1, ∂u₂/∂u₁ = 1, and ∂u₁/∂x = 2x, so dy/dx = 1 + 2x, exactly what we get by differentiating x + x² directly.
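Here is a tiny Python sketch (again my own illustration) confirming that the two contributions add up to the true derivative:

    # Total-derivative check for y = u2(x, u1) = x + u1, with u1 = x**2.
    def y(x):
        return x + x**2

    x, h = 3.0, 1e-6
    numeric = (y(x + h) - y(x - h)) / (2 * h)
    total = 1 + 1 * (2 * x)  # du2/dx + du2/du1 * du1/dx
    print(numeric, total)    # both ≈ 7.0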

Vector chain rule

Now let’s see how the chain rule applies to vectors. Consider a vector function of a scalar, y = f(x):

y = [y₁, y₂]ᵀ = [f₁(x), f₂(x)]ᵀ = [ln(x²), sin(3x)]ᵀ

Let’s introduce two intermediate variables, g₁ and g₂, one for each fᵢ, so that y looks more like y = f(g(x)):

g₁(x) = x², g₂(x) = 3x
f₁(g) = ln(g₁), f₂(g) = sin(g₂)

The derivative of vector y with respect to scalar x is a vertical vector whose elements are computed using the single-variable total-derivative chain rule:

∂y/∂x = [∂f₁/∂g₁ · ∂g₁/∂x + ∂f₁/∂g₂ · ∂g₂/∂x , ∂f₂/∂g₁ · ∂g₁/∂x + ∂f₂/∂g₂ · ∂g₂/∂x]ᵀ
      = [(1/g₁) · 2x + 0 , 0 + cos(g₂) · 3]ᵀ
      = [2/x, 3cos(3x)]ᵀ

This is just the vector analogue of the scalar rule:

Scalar rule: dy/dx = (dy/du) · (du/dx)
Vector rule: ∂y/∂x = (∂y/∂g) · (∂g/∂x)

So, the complete vector chain rule is:

∂y/∂x = (∂y/∂g)(∂g/∂x)

the product of two Jacobians.

The shapes tell us which components to multiply in order to get the Jacobian: if y has m elements, g has k elements, and x has n elements, then ∂y/∂g is m × k, ∂g/∂x is k × n, and their product ∂y/∂x is the m × n Jacobian. Scalars behave as 1 × 1 matrices, so the same rule covers every scalar/vector combination.
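A minimal numpy sketch (my own) of this Jacobian product for the example above:

    import numpy as np

    x = 0.5
    g = np.array([x**2, 3 * x])          # intermediate variables g1, g2
    # Jacobian of f with respect to g (diagonal, since each f_i uses only g_i)
    df_dg = np.array([[1 / g[0], 0.0],
                      [0.0, np.cos(g[1])]])
    # Jacobian of g with respect to the scalar x: a 2x1 column
    dg_dx = np.array([[2 * x],
                      [3.0]])
    dy_dx = df_dg @ dg_dx                # vector chain rule: multiply the Jacobians
    print(dy_dx.ravel())                 # [2/x, 3cos(3x)] = [4.0, 0.2122...]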

The gradient of neuron activation

A single neural network computation unit computes the following with respect to the model parameters, w and b:

activation(x) = max(0, w · x + b)

Let’s focus on computing ∂(w · x + b)/∂w and ∂(w · x + b)/∂b, ignoring the max for now. The dot product w · x is just the sum of the element-wise multiplication of the elements: Σᵢ₌₁ⁿ wᵢxᵢ = sum(w ⊗ x), where ⊗ denotes element-wise multiplication.

Introduce an intermediate vector variable u:

u = w ⊗ x
y = sum(u)

The partials of these two subexpressions were already calculated in the previous section:

∂u/∂w = diag(x)
∂y/∂u = 1ᵀ (a row vector of ones)

The vector chain rule says to multiply the partials:

∂y/∂w = ∂y/∂u · ∂u/∂w = 1ᵀ diag(x) = xᵀ

Now, let y = w · x + b, the full affine expression inside the max activation function call. We have two different partials to compute, but we don’t need the chain rule:

∂y/∂w = ∂(w · x)/∂w + ∂b/∂w = xᵀ + 0ᵀ = xᵀ
∂y/∂b = ∂(w · x)/∂b + ∂b/∂b = 0 + 1 = 1
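These two results are easy to verify numerically; here is a short sketch (my own) using central finite differences:

    import numpy as np

    w = np.array([1.0, -2.0, 0.5])
    x = np.array([3.0, 1.0, 2.0])
    b, h = 4.0, 1e-6

    def y(w, b):
        return w @ x + b

    I = np.eye(len(w))
    grad_w = np.array([(y(w + h * I[i], b) - y(w - h * I[i], b)) / (2 * h)
                       for i in range(len(w))])
    grad_b = (y(w, b + h) - y(w, b - h)) / (2 * h)
    print(grad_w)  # ≈ x = [3. 1. 2.]
    print(grad_b)  # ≈ 1.0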

Let’s tackle the partials of the neuron activation, max(0, w · x + b). The max(0, z) function just says to treat all negative z values as 0. The derivative of the max function is a piecewise function. When z ≤ 0, the derivative is 0 because max(0, z) is the constant 0. When z > 0, the derivative of the max function is just the derivative of z, which is 1:

∂max(0, z)/∂z = { 0 if z ≤ 0 ; 1 if z > 0 }

An aside on broadcasting functions across scalars: when one or both of the max arguments are vectors, such as max(0, x), we broadcast the single-variable function max across the elements:

max(0, x) = [max(0, x₁), max(0, x₂), …, max(0, xₙ)]ᵀ

For the derivative of the broadcast version, then, we get a vector of zeros and ones, where:

∂max(0, xᵢ)/∂xᵢ = { 0 if xᵢ ≤ 0 ; 1 if xᵢ > 0 }  for i = 1, …, n
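In numpy, the broadcast max and its derivative are one-liners (my own illustration):

    import numpy as np

    z = np.array([-1.5, 0.0, 2.0])
    relu = np.maximum(0, z)            # broadcast max over the elements
    drelu = np.where(z > 0, 1.0, 0.0)  # element-wise derivative: 0 or 1
    print(relu, drelu)                 # [0. 0. 2.] [0. 0. 1.]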

To get the derivative of the activation(x) function, we need the chain rule because w · x + b is a nested subexpression. Introduce an intermediate scalar variable z to represent the affine function:

z(w, b, x) = w · x + b
activation(x) = max(0, z)

The vector chain rule tells us:

∂activation/∂w = (∂activation/∂z) · (∂z/∂w)

which we can rewrite as follows, using the piecewise derivative of max and ∂z/∂w = xᵀ:

∂activation/∂w = { 0 · ∂z/∂w = 0ᵀ if z ≤ 0 ; 1 · ∂z/∂w = xᵀ if z > 0 }

and then substitute z = w · x + b back in:

∂activation/∂w = { 0ᵀ if w · x + b ≤ 0 ; xᵀ if w · x + b > 0 }

Turning now to the derivative of the neuron activation with respect to b, and using ∂z/∂b = 1, we get:

∂activation/∂b = { 0 if w · x + b ≤ 0 ; 1 if w · x + b > 0 }
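The piecewise results translate directly into code; here is a small sketch (mine) showing both the active and the “dead” branch:

    import numpy as np

    def grad_activation_w(w, b, x):
        # derived above: zero vector if w·x + b <= 0, otherwise x (as a row vector)
        return x if w @ x + b > 0 else np.zeros_like(x)

    def grad_activation_b(w, b, x):
        return 1.0 if w @ x + b > 0 else 0.0

    x = np.array([3.0, 1.0, 2.0])
    w = np.array([1.0, -2.0, 0.5])
    print(grad_activation_w(w, 4.0, x), grad_activation_b(w, 4.0, x))      # active: w·x + b = 6 > 0
    print(grad_activation_w(w, -10.0, x), grad_activation_b(w, -10.0, x))  # dead: w·x + b = -8 ≤ 0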

The gradient of the neural network loss function

Training a neuron requires that we take the derivative of our loss or cost function with respect to the parameters of the model, w and b.

We have three elements:

Input: a set of vectors X = [x₁, x₂, …, x_N]ᵀ, where N = |X|
Output: the corresponding targets y = [target(x₁), target(x₂), …, target(x_N)]ᵀ
Loss function: the mean squared error C(w, b, X, y) = (1/N) Σᵢ₌₁ᴺ (yᵢ - activation(xᵢ))² = (1/N) Σᵢ₌₁ᴺ (yᵢ - max(0, w · xᵢ + b))²

Then, introduce these intermediate variables:

u(w, b, x) = max(0, w · x + b)
v(y, u) = y - u
C(v) = (1/N) Σᵢ₌₁ᴺ vᵢ²

The gradient with respect to the weights

From before, we know:

∂u/∂w = { 0ᵀ if w · x + b ≤ 0 ; xᵀ if w · x + b > 0 }

and

∂v/∂w = ∂(y - u)/∂w = { 0ᵀ if w · x + b ≤ 0 ; -xᵀ if w · x + b > 0 }

Then, for the overall gradient, we get:

∂C/∂w = (1/N) Σᵢ₌₁ᴺ ∂vᵢ²/∂w = (2/N) Σᵢ₌₁ᴺ vᵢ · ∂vᵢ/∂w = (2/N) Σᵢ₌₁ᴺ (w · xᵢ + b - yᵢ) xᵢᵀ

where the sum includes only those xᵢ for which w · xᵢ + b > 0; the other terms contribute 0ᵀ.

To interpret that equation, we can substitute an error term eᵢ = w · xᵢ + b - yᵢ, yielding:

∂C/∂w = (2/N) Σᵢ₌₁ᴺ eᵢ xᵢᵀ

From there, notice that this computation is a weighted average across all xᵢ in X. The weights are the error terms, the difference between the actual neuron output and the target output for each input xᵢ.

The resulting gradient will point in the direction of higher cost or loss because large eᵢ emphasize their associated xᵢ. Imagine we had only one input vector, N = |X| = 1; then the gradient is just:

∂C/∂w = 2 e₁ x₁ᵀ

  • If the error is 0, then the gradient is zero and we have arrived at the minimum loss.
  • If e₁ is some small positive difference, the gradient is a small step in the direction of x₁.
  • If e₁ is large, the gradient is a large step in that direction.
  • If e₁ is negative, the gradient is reversed, meaning the highest cost is in the negative direction.

We want to reduce the loss, which is why the gradient descent recurrence relation takes the negative of the gradient to update the current position:

wₜ₊₁ = wₜ - η ∂C/∂w

where η is the learning rate.

Because the gradient indicates the direction of higher cost, we want to update w in the opposite direction.
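Here is a compact numpy sketch (my own, with made-up data) that computes ∂C/∂w with the formula above and checks it against a finite difference of the loss before taking one descent step:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))             # N = 5 input vectors
    y = rng.normal(size=5)                  # target outputs
    w = rng.normal(size=3)
    b, eta, h = 0.1, 0.05, 1e-6

    def cost(w, b):
        u = np.maximum(0, X @ w + b)        # neuron activations
        return np.mean((y - u) ** 2)

    z = X @ w + b
    e = z - y                               # error terms e_i = w·x_i + b - y_i
    gate = (z > 0).astype(float)            # inputs where the neuron is off contribute 0
    grad_w = 2 / len(X) * (gate * e) @ X    # (2/N) Σ e_i x_iᵀ over active inputs

    I = np.eye(3)
    fd = np.array([(cost(w + h * I[i], b) - cost(w - h * I[i], b)) / (2 * h)
                   for i in range(3)])
    print(np.allclose(grad_w, fd, atol=1e-4))  # True
    w = w - eta * grad_w                    # one gradient descent step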

The derivative with respect to the bias

The partial with respect to the bias for the equation u(w, b, x) is:

∂u/∂b = { 0 if w · x + b ≤ 0 ; 1 if w · x + b > 0 }

For v, the partial is:

∂v/∂b = ∂(y - u)/∂b = { 0 if w · x + b ≤ 0 ; -1 if w · x + b > 0 }

And for the partial of the cost function itself, we get:

∂C/∂b = (2/N) Σᵢ₌₁ᴺ (w · xᵢ + b - yᵢ)

again summing only over those xᵢ for which w · xᵢ + b > 0.

As before, we can substitute the error term:

∂C/∂b = (2/N) Σᵢ₌₁ᴺ eᵢ

The partial derivative is then just the average error, or zero, according to the activation level. To update the neuron bias, we nudge it in the opposite direction of increased cost:

bₜ₊₁ = bₜ - η ∂C/∂b

In practice, it is convenient to combine w and b into a single vector parameter rather than having to deal with two different partials:

ŵ = [w₁, w₂, …, wₙ, b]ᵀ

This requires a tweak to the input vector x as well, but it simplifies the activation function. By tacking a 1 onto the end of x, x̂ = [x₁, x₂, …, xₙ, 1]ᵀ, the affine expression w · x + b becomes the single dot product ŵ · x̂.
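A two-line check of the bias trick (my own snippet):

    import numpy as np

    w = np.array([1.0, -2.0, 0.5]); b = 4.0
    x = np.array([3.0, 1.0, 2.0])

    w_hat = np.append(w, b)    # fold the bias into the weight vector
    x_hat = np.append(x, 1.0)  # tack a 1 onto the end of x
    print(w @ x + b, w_hat @ x_hat)  # identical: 6.0 6.0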

This finishes off the optimization of the neural network loss function because we have the two partials necessary to perform a gradient descent.
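To close the loop, here is a self-contained training sketch (my own illustration with synthetic data, not code from the paper) that uses the derived gradient in the combined ŵ, x̂ form:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.25
    y = np.maximum(0, X @ true_w + true_b)          # synthetic targets from a known neuron

    X_hat = np.hstack([X, np.ones((len(X), 1))])    # bias trick: append 1 to every input
    w_hat = 0.1 * rng.normal(size=4)                # small nonzero init so the neuron isn't dead
    eta = 0.1

    for step in range(500):
        z = X_hat @ w_hat
        e = z - y                                   # e_i = ŵ·x̂_i - y_i
        gate = (z > 0).astype(float)
        grad = 2 / len(X_hat) * (gate * e) @ X_hat  # (2/N) Σ e_i x̂_iᵀ over active inputs
        w_hat -= eta * grad                         # ŵ_{t+1} = ŵ_t - η ∂C/∂ŵ

    print(w_hat)  # should approach [2.0, -1.0, 0.5, 0.25]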
