Low-Rank Approximation for 4D Kernels in Convolutional Neural Networks through SVD

Anish Hilary
May 14, 2024


In this blog post, we’ll explore a research paper that demonstrates the use of Singular Value Decomposition (SVD) to decompose 4D kernels in CNNs. We’ll break down the concepts, discuss key findings, and explore how this technique enhances the efficiency and effectiveness of deep learning models.

Prerequisites:

  1. a basic understanding of convolutional neural networks
  2. the singular value decomposition technique (refer)

Training a Convolutional Neural Network on an input dataset requires updating the network parameters. These learnable weight parameters are termed kernels in a CNN, and their dimensions are given by (N, C, d, d).

where,

N corresponds to the number of output channels.

C corresponds to the number of input channels.

d × d corresponds to the spatial dimensions of the kernel.
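
As a quick sanity check (a minimal PyTorch sketch, not from the paper), the weight tensor of a Conv2d layer follows exactly this (N, C, d, d) layout:

import torch.nn as nn

# hypothetical layer: 16 input channels, 32 output channels, 3x3 kernels
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
print(conv.weight.shape)   # torch.Size([32, 16, 3, 3])  ->  (N, C, d, d)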

One important insight is that, during training, these 4D weight tensors store a lot of repetitive information in the model. Methods that help break down these tensors therefore appear to offer a promising way to reduce this redundancy.

Low-rank approximation is a technique that breaks down higher-dimensional tensors by reducing their rank. That said, performing low-rank tensor decompositions such as CP decomposition or Tucker decomposition can be challenging, primarily due to the difficulty of finding the best rank-k approximation, the need for iterative operations, and their non-convex nature. The low-rank constraint and the multilinear structure of tensors inherently induce non-convexity when approximating tensors in low-rank form.

Why is it difficult to solve non-convex problems?

Non-convex functions can feature numerous local minima, unlike convex functions, which possess only a single global minimum. So, while we can be sure that the minimum we reach when solving a convex problem is the global minimum, we cannot be sure of this for non-convex problems, as seen in Fig.1.

Fig.1 Plot of functions with convex and non-convex surfaces

However, by employing Singular Value Decomposition (SVD), we can attain a low-rank approximation for matrices. SVD provides a non-iterative, closed-form solution for decomposing any matrix into its constituent orthogonal matrices and singular values through standard algebraic operations.

If you are wondering what is meant by closed-form solutions, here are some examples:

  1. Finding the roots of a quadratic equation.
  2. Solving a system of linear equations.
  3. Geometric formulas for the areas, volumes, and centroids of basic shapes like rectangles and triangles.

As seen from these examples, all of these problems have a mathematical expression or formula that provides an exact solution using well-defined operations and functions. SVD also falls into this category when decomposing a matrix into its component elements.

Eq.1 SVD resulting in a closed-form solution: A = U Σ Vᵀ, where U and V are orthogonal matrices and Σ is a diagonal matrix of singular values
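
To make the closed-form property concrete, here is a minimal NumPy sketch (not from the paper's code): np.linalg.svd returns the factors directly, with no iterative fitting, and truncating them gives a low-rank approximation.

import numpy as np

A = np.random.randn(6, 4)                       # any matrix
U, S, V_t = np.linalg.svd(A, full_matrices=False)

# exact reconstruction from the closed-form factors
print(np.allclose(A, U @ np.diag(S) @ V_t))     # True

# rank-2 approximation: keep only the two largest singular values
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ V_t[:k, :]
print(np.linalg.norm(A - A_k))                  # small residual (optimal in the Frobenius norm)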

So, we can adopt SVD to perform low-rank approximation without any non-convexity problem. But wait…

SVD can be applied only to 2D matrices, not to higher-order tensors. Our CNN kernel is a 4D tensor.

To address this issue, the authors have come up with a simple yet highly effective solution: convert the 4D tensor into a 2D matrix. How is it done?

Step 1:

For a 4D kernel tensor of shape (N, C, d, d), transpose its dimensions so that we get (C, d, d, N).

Step 2:

Now, combine the first two dimensions together and the last two dimensions together, so that our kernel becomes a matrix of dimension (C*d, d*N).

Reason for combining the dimensions in this way:

  • When performing the SVD, the U part, whose columns are the (vertical) left singular vectors, carries the C dimension, which corresponds to the number of input channels.
  • This U part, containing the decomposed elements along the input-channel dimension, can therefore be convolved directly with the input feature map.
  • The resulting intermediate output is then combined, through a second convolution, with the decomposed V part, whose rows are the (horizontal) right singular vectors.
  • This V part contains the decomposed elements for the corresponding output channels.

So, instead of applying all the elements of the full weight kernel to the input feature map in a single convolution, decomposing the tensor reduces both the number of parameters and the number of operations many-fold, depending on the rank value used for the decomposition.

Step 3:

Once the SVD is performed on this matrix, the 2D weight matrices have to be converted back into two 4D tensors so that they can be fitted into the compressed model, which has one additional layer.

For a given rank value k used in the decomposition, the dimension conversion is done by reshaping the decomposed matrices.

Fig.2 Weight Dimensions Before and After decomposition

The U part of the matrix is reshaped to occupy the lower conv layer, while the V part of the matrix is reshaped to occupy the upper conv layer.

Eq.2 Condition for choosing rank value ‘k’

The number of parameters is reduced if the condition in Eq.2 is satisfied when selecting the rank for the decomposition: the two new layers hold k*C*d + N*k*d parameters (lower plus upper), so the count shrinks whenever k*d*(C + N) < N*C*d*d, i.e. k < N*C*d / (N + C).
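
A quick numerical check of this condition, using hypothetical layer sizes rather than values from the paper:

# hypothetical layer: N = 32 output channels, C = 16 input channels, 3x3 kernel
N, C, d = 32, 16, 3
original = N * C * d * d                     # 4608 parameters in the full kernel

for k in (4, 8, 16, 31):
    decomposed = k * C * d + N * k * d       # lower (k, C, d, 1) + upper (N, k, 1, d)
    print(k, decomposed, decomposed < original)

# parameters are reduced for any k < N*C*d / (N + C) = 1536 / 48 = 32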

Thus, using the SVD technique, the 4D weight kernels in the CNN layers can be decomposed.
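
As a sanity check of the whole procedure (a minimal sketch with assumed layer sizes, not code from the paper or the repo), keeping the full rank should make the stacked d×1 and 1×d convolutions reproduce the original d×d convolution up to floating-point error:

import numpy as np
import torch
import torch.nn.functional as F

N_out, C_in, d = 8, 4, 3                                  # hypothetical sizes
W = torch.randn(N_out, C_in, d, d, dtype=torch.float64)   # original 4D kernel
x = torch.randn(1, C_in, 14, 14, dtype=torch.float64)     # dummy input feature map

# Steps 1-2: (N, C, d, d) -> (C, d, d, N) -> (C*d, d*N)
M = W.permute(1, 2, 3, 0).reshape(C_in * d, d * N_out).numpy()
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Step 3 with full rank k = min(C*d, d*N), so no information is lost
k = min(C_in * d, d * N_out)
lower = (U[:, :k] * np.sqrt(S[:k])).reshape(C_in, d, 1, k).transpose(3, 0, 1, 2)
upper = (Vt[:k, :] * np.sqrt(S[:k])[:, None]).reshape(k, 1, d, N_out).transpose(3, 0, 1, 2)

y_full = F.conv2d(x, W)                                   # original convolution
y_two = F.conv2d(F.conv2d(x, torch.tensor(lower)),        # (k, C, d, 1) conv
                 torch.tensor(upper))                     # (N, k, 1, d) conv
print(torch.allclose(y_full, y_two))                      # True (up to float error)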

CODE explanation for performing SVD on 4D kernels

A practical example of the technique explained above is demonstrated using a simple CNN model trained on the MNIST dataset.

The complete code is available in the GitHub repo.

Fig.3 CNN model (left) ; compressed CNN model (right) using rank value ‘k’

An overview of the steps to perform this decomposition is given below.

Step 1:

  • Create a normal CNN model and train its weight parameters until convergence to classify the images in the MNIST dataset (a minimal sketch of such a model is shown below).
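
The exact architecture is defined in the GitHub repo; the sketch below is only an assumed stand-in that matches the conv2 naming used later and the MNIST input size:

import torch.nn as nn

# Assumed baseline model (Fig.3, left); the real one is defined in the repo.
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3)    # MNIST images have 1 channel
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3)   # weight shape (32, 16, 3, 3)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(32 * 12 * 12, 10)           # 10 digit classes

    def forward(self, x):
        x = self.relu(self.conv1(x))                    # 28x28 -> 26x26
        x = self.relu(self.conv2(x))                    # 26x26 -> 24x24
        x = self.pool(x)                                # 24x24 -> 12x12
        return self.fc(x.flatten(1))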

Step 2:

  • From the trained model, access the conv2 layer weights, which have dimension (N, C, d, d).
  • Transpose and reshape the dimensions to (C*d, d*N).

import numpy as np
import torch

# wght_tensor is the conv2 weight as a NumPy array
# (e.g. obtained via state_dict['conv2.weight'].detach().numpy())
out_channel, in_channel, k1, k2 = wght_tensor.shape

# (N, C, d, d) -> (C, d, d, N)
wght_matrix = np.transpose(wght_tensor, (1, 2, 3, 0))

# Reshape to combine the first two dimensions and the last two dimensions: (C*d, d*N)
wght_matrix = wght_matrix.reshape(wght_matrix.shape[0] * wght_matrix.shape[1],
                                  wght_matrix.shape[2] * wght_matrix.shape[3])

Step 3:

  • Perform SVD on this matrix and reshape the resulting factors back into 4D tensors.
  • The lower weight has shape (k, C, d, 1).
  • The upper weight has shape (N, k, 1, d).
  • Convert them back to torch tensors so that they can be loaded as parameters in the CNN model.
# apply SVD on the weight matrix
U, S, V_t = np.linalg.svd(wght_matrix, full_matrices=True)

# scale by the square root of the singular values, then reshape and
# transpose to fit the lower conv layer
lower = U[:, :rank] * np.sqrt(S[:rank])
lower = lower.reshape(in_channel, k1, 1, -1)
lower = lower.transpose(3, 0, 1, 2)      # lower.shape -> (rank, in_channel, k1, 1)

# scale, reshape and transpose to fit the upper conv layer
upper = V_t[:rank, :] * np.sqrt(S)[:rank, np.newaxis]
upper = upper.reshape(-1, 1, k2, out_channel)
upper = upper.transpose(3, 0, 1, 2)      # upper.shape -> (out_channel, rank, 1, k2)

# Convert NumPy arrays to PyTorch tensors
lower = torch.tensor(lower)
upper = torch.tensor(upper)

Step 4:

  • Create the weight names used to transfer these decomposed weights to the compressed model.

# layer name to decompose, mapped to the rank value to use
wghts_to_decompose = {'conv2.weight': 10}
decomposed_tensors = {}

# Iterate through the trained state_dict and decompose the selected weights
for parameter_name, wghts in state_dict.items():
    if parameter_name in wghts_to_decompose.keys():
        lower, upper = svd_decompose(wghts, wghts_to_decompose[parameter_name])

        # after decomposition, the new weights are paired with their corresponding names,
        # e.g. 'conv2.weight' -> 'conv2_L.weight' and 'conv2_U.weight'
        parts = parameter_name.split('.')
        parts[0] = parts[0] + '_L'
        new_name_L = '.'.join(parts)

        parts = parameter_name.split('.')
        parts[0] = parts[0] + '_U'
        new_name_U = '.'.join(parts)

        decomposed_tensors[new_name_L] = lower
        decomposed_tensors[new_name_U] = upper

Step 5:

  • Now, both the trained weights and the decomposed weights are transferred to the compressed model, created as shown on the right side of Fig.3 (a minimal sketch of this transfer is given below).
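
A hedged sketch of this transfer, assuming the compressed model exposes conv2_L and conv2_U layers as in Fig.3 (the exact loading code is in the repo):

# build the compressed model's state dict from the trained weights
compressed_state = compressed_model.state_dict()

# copy every parameter whose name already matches (conv1, fc layers, ...)
for name, tensor in state_dict.items():
    if name in compressed_state:
        compressed_state[name] = tensor

# add the decomposed conv2_L / conv2_U weights produced in Step 4
compressed_state.update(decomposed_tensors)
compressed_model.load_state_dict(compressed_state)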

Step 6:

Though the SVD itself yields a closed-form, globally optimal low-rank approximation of the reshaped matrix, its use within the broader low-rank tensor approximation problem can still induce non-convexity, due to additional constraints and regularizers, and due to the reshaping of the factors back into 4D tensors inside the network.

Therefore, we need to fine-tune this new compressed model, which has a reduced number of parameters.

Intuitively, even though the low-rank constraint makes the overall problem non-convex, by using the SVD-decomposed weights as initialization, it is highly likely that fine-tuning will converge to a good minimum.
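
A minimal fine-tuning sketch, assuming a standard cross-entropy setup and an MNIST train_loader; the hyperparameters are placeholders, not the repo's settings:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(compressed_model.parameters(), lr=1e-4)

compressed_model.train()
for epoch in range(3):                       # a few epochs of fine-tuning
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(compressed_model(images), labels)
        loss.backward()
        optimizer.step()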

Results

Table.1 Result tabulation on parameter reduction

In a small CNN model, by decomposing one layer and varying the rank value, the parameters in that layer were reduced to different levels. By applying the same technique to all the layers of a large convolutional network, we could achieve a substantial reduction in the model's parameter count.

Note: The presence of additional layers significantly complicates numerical optimization, causing issues such as exploding and vanishing gradients, particularly in larger networks. To tackle this challenge, Batch Normalization (BN) is used. It normalizes the activations of the internal hidden units, thereby offering an effective way to mitigate issues related to exploding or vanishing gradients.
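
For illustration, one place a BN layer could sit in the compressed model is between the two decomposed convolutions; this is only an assumed sketch, not necessarily the repo's exact architecture:

import torch.nn as nn

# decomposed replacement for the original conv2 layer, with BN between the two convs
k = 10                                                        # rank used for the decomposition
conv2_L = nn.Conv2d(16, k, kernel_size=(3, 1), bias=False)    # weight shape (k, C, d, 1)
bn = nn.BatchNorm2d(k)                                        # normalizes the intermediate activations
conv2_U = nn.Conv2d(k, 32, kernel_size=(1, 3))                # weight shape (N, k, 1, d)

compressed_conv2 = nn.Sequential(conv2_L, bn, conv2_U)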

I have used only snippets of my code for the explanations in this blog; for full clarity, refer to the Python files in the GitHub repository.

GitHub repo: Low Rank Approximation of 4d kernel using SVD

References

  1. CNN with low rank approximation
  2. Non-convexity of Low rank Tensors

