Singular Value Decomposition (SVD)

Kaaira Gupta
Vision and Language Group
May 25, 2020 · 6 min read

In this article, in our quest to understand SVD, we will first see what it is mathematically. Even if the mathematical portion seems overwhelming, I encourage you to read further to understand its intuitive reasoning and to see how SVD helps us in Data Science as well. Readers should have a basic understanding of eigenvalues and eigenvectors to follow the mathematical portion of this article thoroughly.

SVD is a matrix factorization method, i.e. it is a method to represent a matrix as a product of matrices. This representation is used in a wide range of applications, including compression and data reduction. SVD makes these processes reasonably easy due to some exceptional properties of this decomposition. Let us take a look at them.

Mathematics

The mathematical representation of a “transformation” (I will get to it later) of a matrix A is

A = USVᵀ

where,

  • U is an m × m unitary matrix (its columns are the left singular vectors).
  • S is an m × n diagonal matrix with non-negative real numbers on the diagonal.
  • V is an n × n unitary matrix (its columns are the right singular vectors).
  • Vᵀ is the conjugate transpose of V.
  • The diagonal values of S are known as the “singular values” of A.

Let us understand a few terms used here:

  • Unitary Matrix: We can think of it as a “complex number” version of an orthogonal matrix, which is a square matrix whose columns and rows are orthonormal vectors.
  • Diagonal Matrix: It is a matrix in which the entries outside the main diagonal are all zero.
  • Conjugate Transpose: The conjugate transpose of a matrix interchanges the row and column index of each element and replaces each entry with its complex conjugate. For real matrices, it is simply the transpose.
  • Singular Values: These are the square roots of the eigenvalues of XXᵀ, where X is the matrix being decomposed.
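If you would like to see these pieces concretely before we move on, here is a minimal NumPy sketch (the matrix and variable names are my own, purely illustrative) that computes the SVD of a small real matrix and checks the shapes and the reconstruction described above:

```python
import numpy as np

m, n = 4, 3
A = np.random.randn(m, n)

# full_matrices=True returns U as m x m and Vt as n x n (the "full" SVD).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

print(U.shape, Vt.shape)           # (4, 4) (3, 3)
print(s)                           # singular values: non-negative, sorted descending

# Rebuild the m x n diagonal matrix S and verify A = U S V^T.
S = np.zeros((m, n))
np.fill_diagonal(S, s)
print(np.allclose(A, U @ S @ Vt))  # True (up to floating-point error)
```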

But none of this seems very intuitive, right? It is just pure mathematics, and it tells us nothing about how to visualize this tool or put it to use. So, let us now take a step in that direction.

Understanding SVD intuitively

We know that a matrix is a transformation that acts on input vectors: it is a function that takes a vector as input and gives another vector as output. Hence, it works as a map between them.

Now, say we apply the transformation A to a set of orthonormal vectors, collected as the columns of a matrix V. A maps them to a new set of orthogonal directions, which we can write as the columns of an orthonormal matrix U, each scaled by a diagonal entry of S. Hence, to put it mathematically:

AV = US (here, U and V are orthogonal, i.e., rotation matrices)

A = USVᵀ

This follows from the property of orthogonal matrices which says, ‘An orthogonal matrix multiplied by its transpose is equal to the identity matrix’, i.e.,

VVᵀ = I

Hence, we can decompose a transformation (here, A) into a scaling sandwiched between two rotations. Let us take a simple example to make it more intuitive.

Suppose a square has to be transformed (A) into a parallelogram. We can rotate it by θ (Vᵀ), scale the axes by different factors (S), and then rotate it back by θ in the other direction (U). And voila! We have a parallelogram!

Note: The two rotations are by the same angle θ here only because of the particular parallelogram we chose; in general, the two rotation angles can be different.
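Here is a small NumPy sketch of this “rotation, scaling, rotation” picture; the angle and the scale factors are arbitrary choices of mine, not values taken from the figures:

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = np.pi / 6
U = rotation(theta)              # rotate
S = np.diag([3.0, 0.5])          # stretch one axis, squash the other
Vt = rotation(theta).T           # rotate the other way

A = U @ S @ Vt                   # the composed transformation

# Applying A to the corners of the unit square gives a parallelogram.
square = np.array([[0, 1, 1, 0],
                   [0, 0, 1, 1]])
print(A @ square)
```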

This is how we can imagine a transformation acting on a vector:

Applying transformation A on a vector.

Now, if we decompose A, we will get something like this:

Decomposing the transformation (A) into U, S, and Vᵀ

To sum it up, the transformation can be broken down into a change from one coordinate system to another, dictated by the orthonormal columns of V and U, and a scaling dictated by the diagonal matrix S.

Also, we see that if we multiply A with its transpose Aᵀ, we get:

AAᵀ = (USVᵀ)(VSᵀUᵀ)

From the orthogonality property of V (and U) we discussed above:

AAᵀ = U(SSᵀ)Uᵀ

We ultimately arrive at the usual eigendecomposition of a matrix (here, AAᵀ): the diagonal entries of SSᵀ are its eigenvalues (λ), and the columns of U are its eigenvectors! Similarly, we can represent AᵀA in terms of V.
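A quick numerical check of this relationship, on a random matrix of my own choosing, looks like this:

```python
import numpy as np

A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A)

# Eigenvalues of A A^T (the three largest) equal the squared singular values.
eigvals = np.linalg.eigvalsh(A @ A.T)
print(np.sort(eigvals)[::-1][:3])
print(s ** 2)

# Similarly, the eigenvalues of A^T A are the same squared singular values,
# with the columns of V as the corresponding eigenvectors.
print(np.allclose(np.sort(np.linalg.eigvalsh(A.T @ A))[::-1], s ** 2))
```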

Data science and SVD

Whether we want to compress our images or prevent our models from over-fitting the data, we come across the need for dimensionality reduction time and again. SVD is a widely used method in data science for this very purpose. But so is PCA, so what is the difference, or is there a relationship between the two? Let us try to assess that with an example.

Suppose you have some observation that depends on a lot of variables, of which only certain combinations make a significant contribution to the observation. What PCA does is that it projects everything to those “strong” combinations (these are called the principal components). Now, to do that, we can:

  • Use eigenvalue and eigenvector of the covariance matrix to calculate and rank the importance of features.
  • Use SVD on the covariance matrix to calculate and rank the importance of the features.

Let us see the difference mathematically (consider all the data preprocessed to have zero mean):

So, while performing PCA, we are required to compute the eigenvalues and eigenvectors of the covariance matrix, which is the product (XXᵀ)/(n−1), where X is the data matrix (with each sample stored as a column) and n is the number of samples. Since the covariance matrix is symmetric, it is diagonalizable, and its eigenvectors can be normalized so that they are orthonormal:

(XXᵀ)/(n−1) = WDWᵀ/(n−1)

where the columns of W are the orthonormal eigenvectors and D is the diagonal matrix of eigenvalues.

On the other hand, applying SVD to the data matrix X as follows:

X = USVᵀ

and attempting to construct the covariance matrix from this decomposition gives:

(XXᵀ)/(n−1) = (USVᵀ)(VSᵀUᵀ)/(n−1)

and since V is an orthogonal matrix (VᵀV = I),

(XXᵀ)/(n−1) = U(SSᵀ)Uᵀ/(n−1)

and the correspondence is easily seen: U plays the role of W, SSᵀ plays the role of D, and the square roots of the eigenvalues of XXᵀ are the singular values of X (as mentioned earlier in the article).
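The following sketch runs both routes on some random zero-mean data (the data and shapes are arbitrary illustrative choices of mine) and confirms that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))      # 5 features, 100 samples (as columns)
X -= X.mean(axis=1, keepdims=True)     # zero-mean each feature
n = X.shape[1]

# Route 1: eigendecomposition of the covariance matrix X X^T / (n - 1).
cov = X @ X.T / (n - 1)
eigvals, W = np.linalg.eigh(cov)       # W holds the orthonormal eigenvectors
eigvals = eigvals[::-1]                # sort eigenvalues in descending order

# Route 2: SVD of X directly, without ever forming the covariance matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The covariance eigenvalues are s^2 / (n - 1), and the columns of U
# span the same principal directions as the columns of W.
print(np.allclose(eigvals, s ** 2 / (n - 1)))
```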

But why do we prefer to use SVD to perform PCA over calculating it directly? This is because forming XXᵀ can cause a loss of precision, for example, when X contains entries of a tiny size ϵ. Some elements of XXᵀ will then be of the order ϵ² and can be lost to floating-point rounding. Hence, using SVD here is beneficial: it works on X directly and never forms XXᵀ at all.
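The matrix from the original example is not reproduced here, so as a stand-in the sketch below uses the classic Läuchli-style matrix, which has exactly this flavour: entries of size ϵ whose squares are lost the moment XXᵀ is formed.

```python
import numpy as np

eps = 1e-10
X = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]]).T     # 3 x 4, with rows as "features"

# Forming X X^T squares eps: 1 + eps**2 rounds to exactly 1 in float64,
# so the eps information is destroyed before the eigensolver even starts.
gram = X @ X.T
print(np.linalg.eigvalsh(gram))        # smallest eigenvalues come out as ~0

# SVD works on X directly and still recovers singular values of order eps.
print(np.linalg.svd(X, compute_uv=False))
```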

Hence, SVD can be used to reduce a rank R matrix to a rank K matrix, i.e., we can take a list of R linearly independent vectors and approximate them as linear combinations of K new vectors, aiding dimensionality reduction and making our lives easier!
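As a minimal sketch of such a rank-K approximation (the matrix size and K below are arbitrary choices of mine), the truncated SVD looks like this in NumPy:

```python
import numpy as np

A = np.random.randn(100, 80)
k = 10

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their singular vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense,
# and it needs only k * (100 + 80 + 1) numbers instead of 100 * 80.
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))
```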

If you want to explore more, you can read about some other matrix decomposition techniques as well such as LU decomposition, LU reduction, Cholesky decomposition, and QR decomposition.

Bonus: Here is an interactive tool for you to play around with and understand SVD better.

Thanks to Aaryan Garg for the animations.
