How does Fast R-CNN use singular value decomposition (SVD) for fc-layer compression?

Anish Hilary
13 min read · May 6, 2024


Ever wondered how large neural networks are squeezed down to reduce the memory space occupied by the model? One popular deep learning model, Fast R-CNN, addresses this issue by applying the Singular Value Decomposition (SVD) technique.

Fast R-CNN is an object detection model built upon the earlier R-CNN (Region-based Convolutional Neural Network) model, improving both its speed and accuracy. Though it uses CNN layers for the feature extraction process, the final layers are fully connected linear layers, which are responsible for predicting the class and bounding box of each object. The inclusion of these fully connected layers significantly enlarges the model size. However, since the weight parameters of an fc-layer are stored as a matrix, we can leverage SVD to decompose it.

Before jumping into how this technique has been adapted into Fast R-CNN, we need a good understanding of how singular value decomposition is performed on a matrix. In my previous blog post I introduced all the important matrix operations and decomposition concepts, so I will focus directly on SVD here. If you feel disconnected, I highly recommend referring to that blog.

Contents

1. What is SVD and how is it performed?
2. How can SVD be used for image compression?
3. Adaptation of SVD in Fast-RCNN.

1. What is SVD and how is it performed?

Singular Value Decomposition (SVD) is a matrix factorization technique similar to Eigen/Spectral Decomposition, but with the added advantage of being able to decompose non-square matrices. SVD makes use of a matrix A to generate the symmetric matrices A.At and At.A, which are then utilized in its decomposition process.

A rectangular matrix A can be decomposed by the SVD technique into the following three matrices.

A = U . Σ . Vt

Eq.1 Singular Value Decomposition representation

where,
A : Matrix to be decomposed
U : Orthogonal matrix (contains the eigenvectors of the symmetric matrix A.At)
Σ : Diagonal matrix (contains the square roots of the eigenvalues obtained from the symmetric matrices A.At / At.A)
Vt : Orthogonal matrix (contains the transposed eigenvectors of the symmetric matrix At.A)

Fig.1 SVD applied on matrix A

Fig.1 shows how a rectangular matrix is decomposed into three factor matrices using SVD. Let's find out how this decomposition can be performed step-by-step using simple numpy operations.

Step 1: Define the rectangular matrix to be decomposed

import numpy as np

# Matrix A with 3x4 shape is defined

A = np.array([[3,2,1,5],[5,6,7,9],[7,3,4,8]])

Step 2: Create a symmetric matrix using A.At

# To create a symmetric matrix, A.A_transpose is performed

A_A_t = np.dot(A, np.transpose(A))
A_A_t


array([[ 39,  79,  71],
       [ 79, 191, 153],
       [ 71, 153, 138]])

Step 3: Determine its eigen vectors and eigen values

# Eigen vectors and Eigen values for this symmetric matrix are determined

eig_val, eig_vec = np.linalg.eigh(A_A_t)

# eigh returns eigenvalues in ascending order, so reverse both so that
# the largest eigenvalue (and its eigenvector) comes first
eig_val = eig_val[::-1]
eig_vec = eig_vec[:, ::-1]

eig_root = np.sqrt(np.round(eig_val, 2))   # singular values = sqrt(eigenvalues)
eig_root

array([18.85205559, 3.26802693, 1.38564065])

Since the eigenvalues are determined from the product of two matrices, the singular values are the square roots of these eigenvalues. The singular values are then arranged in order of their importance, i.e., in descending sequence.

# Eigen vectors of this symmetric matrix (A.A_transpose) is one part of this decomposition

left = eig_vec
print(left)


array([[-0.31812709, -0.38644937, -0.86570898],
       [-0.72327789,  0.68928434, -0.04190693],
       [-0.61291455, -0.61281644,  0.4987903 ]])

These vectors form the left singular vectors of the decomposed factors.

Step 4: Create another symmetric matrix using At.A

# Another symmetric matrix is created by A_transpose.A operation

A_t_A = np.dot(np.transpose(A),A)
A_t_A

array([[ 83,  57,  66, 116],
       [ 57,  49,  56,  88],
       [ 66,  56,  66, 100],
       [116,  88, 100, 170]])

Step 5: Determine its eigen vectors and eigen values.

# Eigen vectors and Eigen values for this symmetric matrix are determined

eig_val, eig_vec = np.linalg.eig(A_t_A)

# np.linalg.eig gives no particular ordering, so the eigenvectors are
# sorted by descending eigenvalue to align them with the singular values
order = np.argsort(eig_val)[::-1]
eig_vec = eig_vec[:, order]

# Eigen vectors of this symmetric matrix (A_transpose.A) are used to create the other part of this decomposition

right = eig_vec
right_t = np.transpose(right)
right_t

array([[-0.47003649, -0.36148052, -0.41548329, -0.68976168],
       [-0.61274266,  0.4664036 ,  0.60804111, -0.1931324 ],
       [ 0.4947056 , -0.35140754,  0.60394656, -0.51674719],
       [-0.39859567, -0.72685092,  0.30480845,  0.46893608]])

These vectors form the right singular vectors of the decomposed factors.

Note: Since both sets of eigenvectors are determined from symmetric matrices, they form orthogonal matrices. One practical caveat: because the two eigendecompositions are computed independently, the signs of matching eigenvector pairs may disagree; if the reconstruction in Step 6 comes out with flipped signs, simply negate the affected columns of the left matrix (or rows of the right one).
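
A quick way to verify this orthogonality (a small check reusing the left matrix computed above):

# an orthogonal matrix multiplied by its transpose gives (approximately) the identity matrix
print(np.round(np.transpose(left) @ left, 6))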

Step 6: Cross check — if the decomposed matrices re-produces the original matrix

# Cross checking if the decomposed matrices produces the original matrix

svd_composed = left[:,:3]*eig_root[0:3]@right_t[:3,:]
np.round(svd_composed,2)

array([[3., 2., 1., 5.],
       [5., 6., 7., 9.],
       [7., 3., 4., 8.]])

From this we can confirm that the SVD has been performed successfully, as the left singular matrix, right singular matrix and the corresponding singular values reproduce the original data matrix.

Geometric interpretation of SVD

Fig.2 Generalized Geometric representation of decomposing 3x4 matrix using SVD

In Fig.2, for a 3x4 matrix A, the left singular vectors u1, u2, u3 are arranged as columns, the right singular vectors v1, v2, v3 are arranged as rows, and the singular values of the corresponding vector pairs sit on the diagonal of the middle matrix as s1, s2, s3.

The higher the singular value, the more information is carried by the corresponding left and right singular vectors.

Therefore, by using only the top-k singular vectors along with their corresponding singular values, we can closely approximate the original data.

A = s1.u1.v1t + s2.u2.v2t + ... + sr.ur.vrt   (where r is the number of non-zero singular values)

Eq.2 Linear Combination equation representing the SVD technique

This expresses the SVD as a summation of rank-one terms: each column of the U matrix is multiplied by its corresponding singular value and by the matching row of the Vt matrix, and the results are added together.

A ≈ s1.u1.v1t + s2.u2.v2t

Eq.3 Approximating matrix A using the SVD technique

By keeping only the first two singular-value terms of Eq.2, as in Eq.3, SVD gives the best possible rank-2 approximation of matrix A.
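
As a small illustration of Eq.3 (reusing the left, eig_root and right_t arrays computed in the steps above), a rank-2 approximation of A can be built and compared against the original:

# rank-2 approximation: keep only the two largest singular values and their vectors
A_rank2 = left[:, :2] * eig_root[:2] @ right_t[:2, :]

print(np.round(A_rank2, 2))
# the remaining error equals the discarded third singular value
print(np.linalg.norm(A - A_rank2))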

Performing SVD using Numpy function

We can utilize the built-in numpy function np.linalg.svd to perform the SVD of a matrix.

import numpy as np

# Matrix A with 20x300 shape is defined

A = np.random.randn(20, 300)

# Perform SVD
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# verify the approximation
np.allclose(A, np.dot(U * S, Vh))

True

In the above code, the number of parameters in matrix A is 20*300 = 6000, but after decomposing, the total number of parameters is 20*20 (U) + 20 (S) + 20*300 (Vh) = 6420, roughly a 7% increase. This happens when we use a k value equal to min(m, n) for the decomposition, where m and n are the numbers of rows and columns of the original matrix A.

So it is advisable to take k << min(m, n). For example, if we take k = 10 then the total number of parameters reduces to 20*10 (U) + 10 (S) + 10*300 (Vh) = 3210, a decrease of about 46.5% (roughly a factor of 1.87). But we should remember that as k decreases, the approximation deteriorates. Therefore, it is important to balance the trade-off between the parameter count and the approximation accuracy.
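
To make the trade-off concrete, the snippet below (reusing A, U, S and Vh from the block above) counts the retained parameters and the relative reconstruction error for a few values of k; the exact errors will vary since A is random:

# parameters kept by a rank-k truncation: U_k (m*k) + S_k (k) + Vh_k (k*n)
def truncated_params(m, n, k):
    return m * k + k + k * n

for k in (20, 10, 5):
    A_k = np.dot(U[:, :k] * S[:k], Vh[:k, :])
    rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
    print(k, truncated_params(20, 300, k), round(rel_err, 3))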

Now that we have a basic understanding of how SVD decomposes non-square matrices, let's apply the technique to an image and test our understanding of the concept.

2. How can SVD be used for image compression?

One use case of this decomposition is to compress images to lower ranks while preserving the most significant information. By reducing the rank of the image matrix considerably, we can reduce the memory size of the image, of course with a proportional loss of information.

The magnitude of a singular value reflects the amount of information carried by the corresponding singular vectors. The singular values are arranged in descending order along the diagonal matrix, so by selecting the top-k singular values and their corresponding vectors, a much smaller factorization is obtained that still contains enough information to closely reproduce the original image.

Let's try to understand the impact of applying SVD on an image. We consider an RGB image here; since each channel is a matrix of numbers ranging between 0 and 255, we can apply SVD to each of these channels.

Fig.3 Image with dimension (876, 580, 3), where 3 is the number of channels
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np


# Open the image
img = Image.open('zebra.png')

# Display the image
plt.imshow(img)
plt.axis('off') # Turn off axis
plt.show()
# Split the image into its RGB components
r, g, b = img.split()

plt.subplot(1, 3, 1)
plt.imshow(r, cmap='Reds_r')
plt.axis('off')
plt.subplot(1, 3, 2)
plt.imshow(g, cmap='Greens_r')
plt.axis('off')
plt.subplot(1, 3, 3)
plt.imshow(b, cmap='Blues_r')
plt.axis('off')
plt.show()
  • The RGB channels are split from the image and each channel is plotted separately.
Fig.4 RGB channels are split and represented using cmap colors
# convert the RGB channels into numpy array

img_np_r = np.array(r)
img_np_g = np.array(g)
img_np_b = np.array(b)

# Find the rank of the array
rank_r = np.linalg.matrix_rank(img_np_r)
rank_g = np.linalg.matrix_rank(img_np_g)
rank_b = np.linalg.matrix_rank(img_np_b)

print("Rank of the red-array:", rank_r)
print("Rank of the green-array:", rank_g)
print("Rank of the blue-array:", rank_b)


Rank of the red-array: 580
Rank of the green-array: 580
Rank of the blue-array: 580
  • Each channel is converted from a PIL image to a numpy array before performing the SVD operation.
  • The rank of each channel array is noted to be 580 before decomposition.
# SVD for each channel

U_R, S_R, Vt_R = np.linalg.svd(img_np_r, full_matrices=False)
U_G, S_G, Vt_G = np.linalg.svd(img_np_g, full_matrices=False)
U_B, S_B, Vt_B = np.linalg.svd(img_np_b, full_matrices=False)

The SVD operation is performed on each channel matrix separately; the factors will be recombined in the next step to reproduce the image.

def svd_compress(r, g, b):
    # reconstruct each channel using only the top r/g/b singular values
    R_compressed = U_R[:, :r] @ np.diag(S_R[:r]) @ Vt_R[:r, :]
    G_compressed = U_G[:, :g] @ np.diag(S_G[:g]) @ Vt_G[:g, :]
    B_compressed = U_B[:, :b] @ np.diag(S_B[:b]) @ Vt_B[:b, :]

    # clipping of values outside the pixel range
    R_compressed = np.clip(R_compressed, 1, 255)
    G_compressed = np.clip(G_compressed, 1, 255)
    B_compressed = np.clip(B_compressed, 1, 255)

    # Find the rank of the array
    rank_r = np.linalg.matrix_rank(R_compressed)
    rank_g = np.linalg.matrix_rank(G_compressed)
    rank_b = np.linalg.matrix_rank(B_compressed)

    print("Rank of the red-array:", rank_r)
    print("Rank of the green-array:", rank_g)
    print("Rank of the blue-array:", rank_b)

    # convert numpy to PIL
    R_pil = Image.fromarray(R_compressed.astype(np.uint8))
    G_pil = Image.fromarray(G_compressed.astype(np.uint8))
    B_pil = Image.fromarray(B_compressed.astype(np.uint8))

    return R_pil, G_pil, B_pil
  • The above function takes a different rank as input for each channel matrix and recombines the decomposed matrices.
  • The numpy clipping operation maps values below the lower bound to 1 and values above the upper bound to 255 to avoid image distortion; each channel is then converted back to a PIL image.

RESULT IMAGES:

The following code assigns different ranks for decomposing the different channels and combines them to produce the final image.

# compressed using top 10, 10, 10 singular values

R_pil_10, G_pil_10, B_pil_10 = svd_compress(10, 10, 10)
merged_image_10 = Image.merge('RGB', (R_pil_10, G_pil_10, B_pil_10))
plt.imshow(merged_image_10)
plt.axis('off')
plt.show()


# compressed using top 5, 10, 5 singular values

R_pil_mix, G_pil_mix, B_pil_mix = svd_compress(5, 10, 5)
merged_image_mix = Image.merge('RGB', (R_pil_mix, G_pil_mix, B_pil_mix))
plt.imshow(merged_image_mix)
plt.axis('off')
plt.show()
Fig.5 Reconstructed images for different rank values

From the above output, it can be seen that increasing the number of singular values used for reconstructing the decomposed image arrays gives a better approximation.

RESULT MEMORY SIZES :

Further, the memory sizes of the images before and after reconstruction have been noted to quantify the trade-off between the memory saved and the information lost.
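
As a rough back-of-the-envelope comparison (counting stored matrix entries rather than encoded bytes, since the PNG file size also depends on the encoder), a rank-10 factorization of one 876 x 580 channel needs far fewer values than the full channel:

# values stored for one full channel vs. its rank-10 SVD factors
m, n, k = 876, 580, 10

full_values   = m * n               # 508080 values in the original channel
rank_k_values = k * (m + n + 1)     # 14570 values in U_k, S_k and Vt_k combined

print(full_values, rank_k_values)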

import os

# Get the size of the original image file in bytes
file_size_bytes = os.path.getsize('zeb_og.png')

# Convert bytes to kilobytes
file_size_kb = file_size_bytes / 1024 # 1 kilobyte = 1024 bytes

print("File size of the image in KB:", file_size_kb)

File size of the image in KB: 378.1826171
Tab.1 Memory size of images with different set of singular values

Thus, we can conclude that by manipulating the internal matrix structure of the various channels in an image using SVD, images can be decomposed and reconstructed using appropriate rank values.

3. Adaptation of SVD in Fast-RCNN

(a) SVD applied on weight matrix of a fully connected layer

In a fully connected layer (fc-layer), the weight matrix W typically has dimensions (n_out, n_in), representing the number of output nodes and the number of input nodes respectively.

This weight matrix can be decomposed into two matrices: one that maps from the input space to a hidden space, and one that maps from the hidden space to the output space. Let's call these matrices W1 and W2, respectively.

So, instead of having just one weight matrix W, we have two weight matrices W1 and W2. An additional fully connected layer is added to accommodate these decomposed matrices. The dimensions of W1 would be (n_hidden, n_in), while the dimensions of W2 would be (n_out, n_hidden).

By introducing this additional hidden layer, we can potentially decrease the total number of parameters and thus reduce the model size once training is complete.
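
A back-of-the-envelope count (weights only, biases ignored) makes the condition explicit; the layer sizes below are the toy ones used later in this post:

n_out, n_in, n_hidden = 10, 84, 5    # toy fc-layer sizes used in the example below

original   = n_out * n_in                          # W : 840 weights
decomposed = n_hidden * n_in + n_out * n_hidden    # W1 + W2 : 420 + 50 = 470 weights

print(original, decomposed)
# compression only helps when n_hidden < (n_out * n_in) / (n_out + n_in)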

(b) SVD in Fast-RCNN

The authors of the Fast R-CNN paper applied truncated SVD to the fully connected layers. Fast R-CNN introduced the Region of Interest (RoI) pooling operation, which efficiently extracts fixed-size feature maps from each region proposal. Though this pooling ensures that features extracted from different-sized regions can be fed into the subsequent layers of the network, those fully connected layers are evaluated once for every region proposal, so they account for a large share of both the parameters and the inference time. Furthermore, such detectors typically use deep convolutional neural networks (CNNs), such as VGG, ResNet, or similar architectures, which contain millions of parameters.

Nevertheless, by decomposing the trained weight matrices of the fully connected layers, the number of parameters in the inference model can be reduced. This is exactly what the authors did in Fast R-CNN.
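
To get a feel for the scale at which this matters, here is a rough count assuming a VGG16-style fc6 layer (25088 inputs, 4096 outputs) truncated to its top 1024 singular values, which is in line with the setting described in the Fast R-CNN paper:

# approximate weight counts for a VGG16-style fc6 layer before and after truncated SVD
n_in, n_out, t = 25088, 4096, 1024

original   = n_out * n_in            # ~102.8 million weights
compressed = t * (n_in + n_out)      # ~29.9 million weights

print(original, compressed, round(compressed / original, 2))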

Instead of using the actual Fast R-CNN architecture to explain how this operation is done, I have decided to use a simple CNN model with fc-layers, as it is more important to understand the steps involved than the particular architecture in which they are performed.

WE ARE GOING TO CREATE 4 files:

  1. model.py
  2. compressed_model.py
  3. weight_transfers.py
  4. svd_compress.py

1: A trained classification model is considered

model.py


import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(841, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def layer_decompose():

    net = Net()
    # NOTE: in practice the trained weights would be loaded here,
    # e.g. net.load_state_dict(torch.load(...)), before decomposition

    # Count the number of parameters
    total_params = sum(p.numel() for p in net.parameters())
    print("Total number of parameters in the model:", total_params)

    # detach the fc3 weight matrix from the autograd graph and convert it to numpy
    fc3_weight = net.fc3.weight.detach().numpy()
    print(fc3_weight.shape, type(fc3_weight))

    return fc3_weight
  • Before performing the SVD operation on the fc-layers, the classification (or detection) model has to be trained on the required dataset.
  • Once the model has been trained, the weight matrix of the last fc-layer is detached from the autograd graph and converted to numpy. In the code above, the fc3 layer is used.

— Note: we can perform this operation for any number of fc-layers in a model.

2: The same model with decomposed fc-layers is created

compressed_model.py

import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch

class compressed_Net(nn.Module):
    def __init__(self, k):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(841, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3_L = nn.Linear(84, k, bias=False)
        self.fc3_U = nn.Linear(k, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3_L(x)
        x = self.fc3_U(x)

        return x

This model is created just like our original trained model, except for the following changes:

  • The weights are randomly generated here, i.e. it is an untrained model.
  • The final fc-layer has been split into two layers to accommodate the decomposed matrices.
  • The value k is the rank given as input to this model class; the compressed model is created based on it (a quick parameter count for k = 5 is sketched just below).
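
As a quick, hypothetical check of the saving for this toy network, the compressed model can be instantiated with k = 5 and its parameters counted:

# assumes: from compressed_model import compressed_Net
# with k = 5, the split fc3 holds 84*5 + (5*10 + 10) = 480 parameters,
# compared with 84*10 + 10 = 850 in the original fc3 layer
model_k5 = compressed_Net(5)
print(sum(p.numel() for p in model_k5.parameters()))   # 114556 (vs 114926 in the original toy model)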

3: Transfer weights from trained model to the un-trained model

weight_transfers.py

from model import Net
from compressed_model import compressed_Net
import torch


def transfer_weights(k):

    source_model = Net()    # in practice, the same trained model used in layer_decompose()
    target_model = compressed_Net(k)

    # Define which parameters to transfer based on their names
    params_to_transfer = {}

    for name, parameters in source_model.named_parameters():
        if 'fc3.weight' in name:
            # fc3.weight is skipped; it will be replaced by its SVD factors
            continue
        elif 'fc3.bias' in name:
            # the original fc3 bias is carried over to the second (U) layer
            params_to_transfer[name] = 'fc3_U.bias'
        else:
            params_to_transfer[name] = name

    # Transfer parameters without tracking gradients
    with torch.no_grad():
        for source_name, target_name in params_to_transfer.items():
            if source_name in source_model.state_dict() and target_name in target_model.state_dict():
                target_model.state_dict()[target_name].copy_(source_model.state_dict()[source_name])

    return target_model


def transfer_decomposed_weights(target_model, fc3_L_weight, fc3_U_weight):

    # copy the decomposed weight matrices into the two new fc-layers
    # (assigning into state_dict() directly would not update the model, so copy_ is used)
    with torch.no_grad():
        target_model.fc3_L.weight.copy_(fc3_L_weight)
        target_model.fc3_U.weight.copy_(fc3_U_weight)

    return target_model

This python file has two functions:

  • transfer_weights copies the trained weights from the trained model to the newly created un-trained model, except for the fc3 weights, which will be decomposed (the original fc3 bias is mapped to fc3_U).
  • transfer_decomposed_weights copies the matrices decomposed from the fc3 layer into the weights of fc3_L and fc3_U.

4: Main file for SVD operation

svd_compress.py

import numpy as np
import torch

from model import layer_decompose
from compressed_model import compressed_Net
from weight_transfers import transfer_weights, transfer_decomposed_weights


# fc3 weight matrix is obtained as a numpy array of shape (10, 84)
np_wght_matrix = layer_decompose()
r, c = np_wght_matrix.shape
k = min(r, c)   # maximum possible rank of the fc3 weight matrix

print(k)

rank = 5

# apply SVD on the weight matrix
U, S, V_t = np.linalg.svd(np_wght_matrix, full_matrices=True)

# check how closely a rank-5 reconstruction matches the original weights
# (False is expected here, since rank << min(r, c))
print(np.allclose(np_wght_matrix, np.dot(U[:, :rank] * S[:rank], V_t[:rank, :])))


# transfer weights from the source model to the new model, except for the decomposed layer
decomposed_model = transfer_weights(rank)

# build the two decomposed weight matrices
fc3_L_weight = V_t[:rank, :]              # shape (rank, 84) -> matches nn.Linear(84, rank)
fc3_U_weight = U[:, :rank] * S[:rank]     # shape (10, rank) -> matches nn.Linear(rank, 10)

# Convert NumPy arrays to PyTorch tensors
fc3_L_weight = torch.tensor(fc3_L_weight)
fc3_U_weight = torch.tensor(fc3_U_weight)


# The decomposed model used for inference
decomposed_model = transfer_decomposed_weights(decomposed_model, fc3_L_weight, fc3_U_weight)

# Count the parameters of the compressed model
total_params = sum(p.numel() for p in decomposed_model.parameters())
print("Total number of parameters in the compressed model:", total_params)


# Run a forward pass on a random input
random_array = torch.randn(3, 128, 128)

out = decomposed_model(random_array)



Total number of parameters in the model: 114926
Total number of parameters in the compressed model: 114556
  • In this main file, the SVD operation is performed on the fc3 weight matrix and truncated to the user-defined rank, producing the weights for fc3_L and fc3_U.
  • The other supporting files are used to transfer these decomposed weights into the compressed model.
  • By decomposing the other fc-layers as well and lowering the rank further, the number of parameters can be reduced even more (a small consistency check on the decomposition is sketched below).
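
Finally, a small sanity check (a sketch using the variables already defined in svd_compress.py) confirms that the product of the two factor matrices approximates the original fc3 weight matrix, with the error shrinking as the chosen rank approaches min(10, 84):

# the two decomposed layers together should approximate the original fc3 weights
W_approx = fc3_U_weight.numpy() @ fc3_L_weight.numpy()
print(np.abs(W_approx - np_wght_matrix).max())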

In conclusion, Singular Value Decomposition (SVD) stands as a versatile mathematical technique with implications across many fields. We have explored its mechanics, from decomposing 2D matrices to its practical applications in image compression and fc-layer compression in convolutional neural networks (CNNs) like Fast R-CNN. Its integration into Fast R-CNN demonstrates how SVD contributes to model optimization, reducing both memory footprint and computational cost. In a world driven by data, SVD remains a cornerstone technique, empowering us to navigate and leverage the vast realms of information with precision and ingenuity.

GitHub: SVD image compression and FC-layer decomposition codes

References

  1. Fast-RCNN
  2. Image compression
  3. SVD visualize

