The singular value decomposition in a nutshell
Matrix decomposition is a ubiquitous technique in mathematics with a wide variety of practical and theoretical applications. Here a matrix is decomposed, or factorized, as the product of two or more matrices, where the factors satisfy some desirable properties depending on the nature of the original problem. So by matrix decomposition we mean a multiplicative decomposition, not an additive one (although additive decompositions can also be useful in some cases).
For instance, the LU-decomposition and its variants (LDU, LUP, LLᵀ) are useful for solving linear systems of the form Ax = b, where A ∈ ℝⁿˣⁿ is a given square matrix and b ∈ ℝⁿ is a given vector. Here L is a lower triangular matrix with all 1’s on the diagonal and U is an upper triangular matrix.
Knowing that A can be written as A = LU, we can transform the above equation to LUx = b, which is equivalent to the system Ly = b, Ux = y. Solving this system is fast since L and U are triangular matrices: the computational cost of these two solves (usually referred to as forward and backward substitution) is O(n²). The matrices L and U are essentially “by-products” of Gaussian elimination, where, loosely speaking, L stores the steps of the elimination process and U stores the resulting matrix after the elimination. Solving Ax = b by Gaussian elimination directly requires about 2/3 n³ flops, whilst creating A = LU and then solving the two triangular systems costs 2/3 n³ + O(n²) flops, which is of the same order of magnitude, so it is not obvious why this extra effort would make a difference. The gain becomes clear when we have to solve many linear systems with the same matrix A where only the right-hand side b changes. In this case the elimination procedure, that is, the creation of the LU-decomposition, needs to be done only once. Another nice application of this decomposition is calculating the determinant of A. Because of the forms of L and U:
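det(A) = det(L) · det(U) = 1 · (u₁₁ u₂₂ ⋯ uₙₙ) = u₁₁ u₂₂ ⋯ uₙₙ,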
so det(A) can be calculated in O(n³) flops as well (instead of using Cramer’s rule, which has O(n⋅n!) asymptotic complexity and is therefore practically useless).
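As an illustration of this reuse pattern, here is a minimal sketch in Python, assuming NumPy and SciPy are available (scipy.linalg.lu_factor computes the factorization once and lu_solve reuses it for each right-hand side; the random test matrix is just an example):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(seed=0)
A = rng.standard_normal((5, 5))

# Factorize once: this is the O(n^3) part of the work.
lu, piv = lu_factor(A)

# Reuse the factors for many right-hand sides at O(n^2) each.
for _ in range(3):
    b = rng.standard_normal(5)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)

# det(A) is the product of the diagonal of U (the diagonal of L is all 1's),
# with a sign flip for every row swap recorded in piv.
sign = (-1) ** np.count_nonzero(piv != np.arange(5))
assert np.isclose(sign * np.prod(np.diag(lu)), np.linalg.det(A))
```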
The LU-decomposition is one of the most widely used matrix decomposition techniques in applied mathematics. However, it is much less known in data science than in numerical analysis. The decomposition that is as widespread in data science as LU is in numerical analysis is a different one, which I would like to introduce in the following sections.
The singular value decomposition of an arbitrary matrix
What data scientists use quite often is the singular value decomposition, which can be found behind linear regression and least squares methods, and which is a useful technical tool for solving linear systems that have no unique solution (the Moore–Penrose pseudoinverse), performing principal component analysis, and calculating low-rank approximations. There is also a plethora of real-world applications of the singular value decomposition, such as image compression, recommender systems, numerical weather forecasting and natural language processing.
In what follows we would like to introduce the concept of the singular value decomposition (SVD for short) and illustrate it by showing some applications.
Let A ∈ ℝⁿˣᵐ be an arbitrary (not necessarily square) matrix. It can be complex valued as well, but in the examples we are going to deal with real matrices only. Then there exist matrices U ∈ ℝⁿˣⁿ, D ∈ ℝⁿˣᵐ and V ∈ ℝᵐˣᵐ such that
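A = U D V*,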
where U and V are unitary matrices, that is, U*U = UU* = Iₙ and V*V = VV* = Iₘ, and D is a diagonal matrix, that is, dᵢⱼ = 0 if i ≠ j. The star operation means the conjugate transpose, that is,
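(A*)ᵢⱼ = āⱼᵢ,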
but since we are dealing with real matrices now, this is the same as the transpose of the matrix. The diagonal elements of D are nonnegative numbers in non-increasing order: dᵢᵢ = σᵢ, σ₁ ≥ σ₂ ≥ … ≥ σᵣ > σᵣ₊₁ = … = σₘᵢₙ₍ₙ,ₘ₎ = 0, where r is the rank of the matrix A. These σ values on the diagonal of D are called the singular values of A.
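To see the definition in action, here is a short check with NumPy (a sketch; note that np.linalg.svd returns the singular values as a vector and the factor V*, named Vt below, rather than V):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, m = 5, 3
A = rng.standard_normal((n, m))

# Full SVD: U is n x n, Vt is V* (m x m), s holds the singular values
# in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular diagonal matrix D of shape n x m.
D = np.zeros((n, m))
D[:len(s), :len(s)] = np.diag(s)

assert np.allclose(U @ D @ Vt, A)         # A = U D V*
assert np.allclose(U.T @ U, np.eye(n))    # U is unitary (orthogonal here)
assert np.allclose(Vt @ Vt.T, np.eye(m))  # V is unitary (orthogonal here)
```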
Before going into more detail, I would like to show how this decomposition can help compress an image. We will rely on the following property of the SVD-decomposition.
Low-rank approximations of A
Let k ∈ ℕ be a given natural number with k ≤ rank(A) ≤ min{n, m}. What we are looking for is a matrix Aₖ with rank(Aₖ) = k which is the best approximation of A among all matrices of rank k. To formulate the low-rank approximation problem, we would like to solve the following minimization problem:
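Aₖ = argmin { ||A − X||ꜰ : rank(X) = k }.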
Here ||X||ꜰ denotes the Frobenius norm of a matrix X, which is the square root of the sum of squares of the elements of X, that is, ||X||ꜰ = √(Σᵢ,ⱼ xᵢⱼ²).
The solution of this problem can be obtained from the SVD-decomposition of A. If A = UDV*, then we keep the first k values in D as they are and set the subsequent singular values to zero. Let us denote the resulting diagonal matrix by Dₖ. It is easy to see that we only have to keep the first k columns of U and the first k columns of V, since the remaining ones would be multiplied by zeros anyway. To sum up, the matrix Aₖ := UₖDₖVₖ* is the closest matrix to A (in the Frobenius norm) having rank k, where Uₖ and Vₖ consist of the first k columns of U and V, respectively.
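In code this construction is only a few lines; a sketch with NumPy (the helper name best_rank_k is ours):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first k singular values and vectors; the remaining ones
    # would be multiplied by zeros anyway.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```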
How can this knowledge be useful? Well, if A is a large matrix, that is, n and m are large and k is relatively small, then the amount of data we need to store to approximate the information content of A is much smaller. That is, we can reduce the storage space significantly while still retaining almost the same information as the original matrix.
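To make this concrete: storing Uₖ, Dₖ and Vₖ requires only k·n + k + k·m = k(n + m + 1) numbers instead of the n·m entries of A; for example, with n = m = 1000 and k = 20 this is 40,020 numbers instead of 1,000,000.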
Welcome to the lab
I would like to stop here, before the number of mathematical expressions becomes intractable, and do some experiments with a computer instead. We will see how the SVD-decomposition can be computed in Python and how to compute the best rank-k approximation of a matrix; later, in the second part, we will see how to use this machinery to compress an image.
I would like to illustrate the above concepts on a toy example. We define a matrix of size 4 × 2 which has rank 2, and we create its rank-1 approximation using the SVD-decomposition. For this purpose we have created an IPython notebook in which all the steps described above can be followed one by one. The notebook can be downloaded from this location.
In the first screenshot we create our test matrix of rank 2 and perform the SVD-decomposition on it.
Then we check whether we can restore the original matrix from the factors (up to round-off errors). We also define another matrix B which has rank 1 and seems to be a good candidate for a rank-1 approximation of A.
Finally, we create the true rank-1 approximation of A and calculate the Frobenius norm of the residual matrix. We can then see that the matrix that minimizes the objective function yields a much smaller value of this function than the naive approximation matrix B does.
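The same steps can be sketched in a few lines of NumPy; the concrete matrices A and B below are illustrative choices and not necessarily the ones used in the notebook:

```python
import numpy as np

# An illustrative 4 x 2 matrix of rank 2.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The original matrix is restored from the factors up to round-off errors.
assert np.allclose(U @ np.diag(s) @ Vt, A)

# A naive rank-1 candidate: repeat the first column of A.
B = np.column_stack([A[:, 0], A[:, 0]])

# The best rank-1 approximation from the SVD.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])

print(np.linalg.norm(A - B, "fro"))   # residual of the naive candidate
print(np.linalg.norm(A - A1, "fro"))  # much smaller residual (equals sigma_2)
```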
Next week we will continue with an interesting application of the SVD-decomposition, namely, how to compress images.
Originally published at www.balabit.com on April 7, 2016 by Tamás Kurics.