Using Matrix to Represent Fully Connected Layer and Its Gradient

FunCry
5 min read · Feb 20, 2023


Introduction

In this era of rapid machine learning development, you can easily find an instructional video on YouTube. These videos typically use diagrams similar to the one above to teach neural networks. Although the picture looks simple, you can easily run into difficulties during development if you do not have a good understanding of the math behind it. This article aims to bridge that gap by explaining how to represent a fully connected layer with matrices and how to calculate its gradient.

Prerequisites:

  1. Fully Connected Layer
  2. Gradient Descent
  3. Chain Rule
  4. Matrix Multiplication

Using Matrices to Represent Fully Connected Layer

PyTorch provides nn.Linear(input_dim, output_dim) to initialize a fully connected layer, where Linear refers to a linear transformation.

Since any linear transformation can be represented by a matrix, it is natural to use the following formula:

Z = XW

where X is the input, represented by a row vector, W is the weight matrix of the linear transformation with dimensions input_dim × output_dim, and Z is the output row vector with dimension output_dim.

The reason we use XW instead of WX is that, by convention in computer science, a row is used to represent a single piece of data.
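To make the correspondence concrete, here is a minimal sketch (assuming PyTorch is available) that checks nn.Linear against the matrix form. Note that PyTorch stores the weight as an (output_dim, input_dim) matrix, so the W used in this article is the transpose of layer.weight.

```python
import torch
import torch.nn as nn

input_dim, output_dim = 3, 2
layer = nn.Linear(input_dim, output_dim)

X = torch.randn(1, input_dim)   # one piece of data as a row vector
W = layer.weight.T              # the article's W: shape (input_dim, output_dim)
b = layer.bias

# nn.Linear computes exactly Z = XW + b in this notation
print(torch.allclose(layer(X), X @ W + b))  # True
```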

Finding Elements in W

After drawing the above diagram, a natural question to ask is what wᵢⱼ represents. To answer this, we first need to observe the following graphs:

From these two graphs, we can see that the first column of W represents all the weights connected to z₁ and the second column represents those connected to z₂. The weight connecting xᵢ and zⱼ is represented by wᵢⱼ, and we can locate each weight in the matrix by this logic.
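The indexing can be double-checked with a small sketch: computing Z with two explicit loops makes it clear that wᵢⱼ is the weight multiplying xᵢ on its way into zⱼ (the dimensions below are arbitrary).

```python
import torch

input_dim, output_dim = 3, 2
X = torch.randn(1, input_dim)
W = torch.randn(input_dim, output_dim)

Z = X @ W                              # matrix form
Z_manual = torch.zeros(1, output_dim)
for j in range(output_dim):            # z_j uses the j-th column of W ...
    for i in range(input_dim):         # ... summing x_i * w_ij over i
        Z_manual[0, j] += X[0, i] * W[i, j]

print(torch.allclose(Z, Z_manual))     # True
```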

Find Gradient for W

During gradient descent, the ideal approach is to build another matrix, ∂L/∂W, that serves as a look-up table. The (i, j) entry of this matrix is the gradient of wᵢⱼ, which allows us to update all the weights at once with the usual gradient-descent update W ← W − η·∂L/∂W.

To create this matrix, let’s first observe the following formula:

∂L/∂wᵢⱼ = (∂L/∂zⱼ) · (∂zⱼ/∂wᵢⱼ) = (∂L/∂zⱼ) · xᵢ

(since zⱼ = Σᵢ xᵢwᵢⱼ, we have ∂zⱼ/∂wᵢⱼ = xᵢ). Using the chain rule as above, we can see that for every weight w, its gradient is the product of the input x connected to it and the gradient of its connected output z.

If we perform this calculation for each weight wᵢⱼ and arrange the results into a matrix, we arrive at the following formula:

∂L/∂W = Xᵀ (∂L/∂Z)

where ∂L/∂Z is the row vector [∂L/∂z₁, ∂L/∂z₂, …]. This gives us the formula for calculating the gradient of W.
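A quick sketch to sanity-check the formula against autograd; the loss used here (the sum of the squared outputs) is arbitrary and only chosen so that ∂L/∂Z has the simple closed form 2Z.

```python
import torch

input_dim, output_dim = 3, 2
X = torch.randn(1, input_dim)
W = torch.randn(input_dim, output_dim, requires_grad=True)

Z = X @ W
L = (Z ** 2).sum()        # arbitrary scalar loss
L.backward()

dL_dZ = (2 * Z).detach()  # dL/dZ for this particular loss
print(torch.allclose(W.grad, X.T @ dL_dZ))  # True
```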

Batch Gradient Calculation

When implementing machine learning, we often use batch calculation to speed up training.

To understand how the formula for calculating the gradient behaves in a batch calculation, we can examine its behavior when the batch size is equal to 1 and when it is equal to n.

Batch Size = 1

When batch size is equal to 1, the result is exactly the same as the one we derived earlier.

Batch Size = n

When the batch size is greater than 1, Xᵀ(∂L/∂Z) equals the per-sample gradients summed over the batch. If you want the average gradient of the batch, don’t forget to divide the result by n.
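The same autograd sanity check with n rows shows this summing behavior (again assuming the arbitrary squared-sum loss):

```python
import torch

n, input_dim, output_dim = 4, 3, 2
X = torch.randn(n, input_dim)
W = torch.randn(input_dim, output_dim, requires_grad=True)

Z = X @ W
L = (Z ** 2).sum()
L.backward()

dL_dZ = (2 * Z).detach()
# X^T (dL/dZ) already sums the per-sample gradients over the batch;
# divide by n if you want the average gradient instead.
print(torch.allclose(W.grad, X.T @ dL_dZ))  # True
```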

Find Gradient for X

The formula for ∂L/∂W that we derived includes ∂L/∂Z. In a neural network with multiple layers, Z becomes the input vector X of the next layer after passing through the activation function. Therefore, it is necessary to calculate the gradient for X as well.

As usual, we begin by observing the following graphs:

From the graph above, we can see that the gradient for xᵢ is the sum, over all outputs zⱼ, of the weight wᵢⱼ connecting them multiplied by the corresponding ∂L/∂zⱼ. If we examine the weight matrix we derived earlier, we can see that the i-th row of W contains all the weights connected to xᵢ, which means the gradient for xᵢ is the inner product of the i-th row of W and ∂L/∂Z.

We arrive at the following formula after some simple calculation:

∂L/∂X = (∂L/∂Z) Wᵀ
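One more autograd sanity check for this formula, under the same assumed squared-sum loss; each row of X.grad matches the corresponding row of (∂L/∂Z) Wᵀ.

```python
import torch

n, input_dim, output_dim = 4, 3, 2
X = torch.randn(n, input_dim, requires_grad=True)
W = torch.randn(input_dim, output_dim)

Z = X @ W
L = (Z ** 2).sum()
L.backward()

dL_dZ = (2 * Z).detach()
print(torch.allclose(X.grad, dL_dZ @ W.T))  # True
```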

The Gradient for Bias Term

We have omitted the bias term in the previous discussion for the sake of clarity, but since Z = XW + b and ∂zⱼ/∂bⱼ = 1, a quick calculation shows that its gradient is the same as ∂L/∂Z.
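A short sketch of the single-sample case (again with the arbitrary squared-sum loss) confirming that the bias gradient equals ∂L/∂Z:

```python
import torch

input_dim, output_dim = 3, 2
X = torch.randn(1, input_dim)
W = torch.randn(input_dim, output_dim)
b = torch.randn(output_dim, requires_grad=True)

Z = X @ W + b
L = (Z ** 2).sum()
L.backward()

dL_dZ = (2 * Z).detach()
print(torch.allclose(b.grad, dL_dZ[0]))  # True: dL/db equals dL/dZ
```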

It is important to note that the gradient for the bias term still has to be handled correctly when the batch size is greater than one; how exactly to combine the per-sample bias gradients is left as an exercise for the reader.

Conclusion

This article provides an explanation of how matrices can be used to represent a fully connected (Linear) layer, and how to calculate the gradients of its weights and inputs. However, when implementing backpropagation from scratch, it’s also important to consider how to connect the layers and how to define the derivative of each activation function. For layers that are not linear, you will also have to work out the gradient yourself, but with a solid understanding of the content discussed above, this should be a piece of cake for you!
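To tie the formulas together, here is a minimal from-scratch sketch of a linear layer with a forward and backward pass; the class name, learning-rate handling, and overall structure are illustrative choices, not any particular library’s API.

```python
import numpy as np

class Linear:
    def __init__(self, input_dim, output_dim):
        self.W = np.random.randn(input_dim, output_dim) * 0.01
        self.b = np.zeros(output_dim)

    def forward(self, X):
        self.X = X                   # cache the input for the backward pass
        return X @ self.W + self.b   # Z = XW + b

    def backward(self, dZ, lr=0.1):
        dW = self.X.T @ dZ           # dL/dW = X^T (dL/dZ), summed over the batch
        db = dZ.sum(axis=0)          # dL/db, summed over the batch
        dX = dZ @ self.W.T           # dL/dX, passed to the previous layer
        self.W -= lr * dW            # gradient-descent update
        self.b -= lr * db
        return dX
```

On the backward pass, dZ is whatever gradient arrives from the layer above, after it has been pushed back through the activation function.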
