Implementing SVD algorithm in Rust
Singular Value Decomposition (SVD) is one of the fundamental algorithms used in ML for computing low-rank decompositions of large matrices. Some notable applications of SVD are as follows:
- Dimensionality reduction of large, sparse matrices in document search and retrieval systems such as search engines, where SVD helps reduce the time complexity of search.
- In recommender systems, SVD can be used to find latent vectors for users and items. The latent vectors can then be used to find similar users and items.
- Image compression by reducing the rank of the original image matrix.
- Finding pseudo-inverse of non-square matrices. This is useful for many regression problems where the number of features (or variables) may not be equal to the number of training instances.
In this post I’m not going to deep dive into the applications, as I just want to focus on how to implement the SVD algorithm from scratch. If you are interested in understanding more about the applications, check out my slides here:
I am assuming that you are familiar with what SVD is and what its different components are. As a refresher, a matrix A of dimensions NxM can be decomposed into 3 matrices U, S and V as follows:
A = U.S.V^T
U is an NxN matrix, S is an NxM matrix and V is an MxM matrix.
The columns of U are the eigenvectors of A.A^T and the columns of V are the eigenvectors of A^T.A.
The matrix S is a diagonal matrix whose diagonal elements are the square roots of the eigenvalues of A.A^T (or A^T.A).
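As a quick reference, NumPy's built-in SVD returns factors with exactly these shapes when called with full_matrices=True (a small sanity check, not part of the algorithm we will build):
import numpy as np

a = np.random.normal(0, 1, (6, 4))           # N=6, M=4
u, s, vt = np.linalg.svd(a, full_matrices=True)
print(u.shape, s.shape, vt.shape)            # (6, 6) (4,) (4, 4)
# s holds the min(N, M) singular values; embed them in an NxM diagonal S
S = np.zeros((6, 4))
np.fill_diagonal(S, s)
np.testing.assert_allclose(u @ S @ vt, a, atol=1e-10)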
Thus, we now have sufficient information to calculate the matrices U, S and V. We need to find the eigenvectors of the matrices A.A^T and A^T.A, and the eigenvalues of either one of them. That brings us to the question of how to find the eigenvalues and eigenvectors of a matrix.
If you are a Pythonista, you can find them quite easily using the NumPy library as follows:
import numpy as np
n, m = 100, 100
a = np.random.normal(0,1,(n,m))
eigenvalues, eigenvectors = np.linalg.eig(a)
print(eigenvalues)
print(eigenvectors)
Note that the eigenvalues and eigenvectors may be complex numbers. This is because, to find the eigenvalues, we are actually solving the following equation:
det(A - yI) = 0
where y denotes an eigenvalue of A and det() is the determinant. Expanding the determinant gives a polynomial in y, and not all of its roots need be real.
But note that if A is a symmetric matrix, then all of its eigenvalues are real and its eigenvectors can also be chosen to be real. The matrices A.A^T and A^T.A are symmetric, and thus their eigenvalues and eigenvectors are real.
Thus, it is pretty easy to calculate the matrices U and V using the above.
n, m = 100, 100
a = np.random.normal(0,1,(n,m))
eigenvalues_u, u = np.linalg.eig(a @ a.T)
eigenvalues_v, v = np.linalg.eig(a.T @ a)
s = np.diag(np.sqrt(eigenvalues_u))
np.testing.assert_allclose(u @ s @ v.T, a)
Did you notice anything after running the above code? You will most likely get an assertion error. But why?
This is because of the sign ambiguity of the eigenvectors (and because np.linalg.eig does not return eigenvalues in any particular order, the columns of U and V may also fail to pair up consistently).
Note that if you flip the sign of an eigenvector, it is still a valid eigenvector of the matrix. This is because in the eigenvector equation:
(A-yI).x = 0
replacing x with -x (i.e. multiplying both sides by -1) has no effect.
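A small NumPy check of this sign ambiguity (a sketch using the same A^T.A construction as above):
import numpy as np

a = np.random.normal(0, 1, (5, 5))
w, x = np.linalg.eig(a.T @ a)      # a.T @ a is symmetric, so w and x are real
v = x[:, 0]                        # an eigenvector for the eigenvalue w[0]
# both v and -v satisfy the eigenvector equation
np.testing.assert_allclose((a.T @ a) @ v, w[0] * v, atol=1e-8)
np.testing.assert_allclose((a.T @ a) @ (-v), w[0] * (-v), atol=1e-8)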
If we compute U and V independently, we run into exactly this trouble: multiplying U, S and V^T does not give back the original matrix A. To overcome this problem, we can compute V first and then derive U from it:
U = A.V.S^{-1}
where S^{-1} takes the reciprocal of every non-zero diagonal element of S and keeps the zero diagonal elements as 0.
n, m = 100, 100
a = np.random.normal(0,1,(n,m))
eigenvalues_v, v = np.linalg.eig(a.T @ a)
s = np.diag(np.sqrt(eigenvalues_v))
s_inv = np.where(s == 0, 0, 1.0/s)
u = a @ v @ s_inv
np.testing.assert_allclose(u @ s @ v.T, a)
But we shall see that the above approach is not stable when the entries of A.A^T or A^T.A are smaller than the machine precision. For example, in the above code, if we sample the entries of A from a normal distribution with standard deviation 1e-200, the approach fails with an assertion error.
n, m = 100, 100
a = np.random.normal(0,1e-200,(n,m))
eigenvalues_v, v = np.linalg.eig(a.T @ a)
s = np.diag(np.sqrt(eigenvalues_v))
s_inv = np.where(s == 0, 0, 1.0/s)
u = a @ v @ s_inv
np.testing.assert_allclose(u @ s @ v.T, a)
AssertionError                            Traceback (most recent call last)
----> np.testing.assert_allclose(u @ s @ v.T, a)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatched elements: 25 / 25 (100%)
Max absolute difference: 1.90398337e-200
Max relative difference: 1.
x: array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],...
y: array([[-1.834095e-200, -2.743976e-201, -1.228526e-200, 1.444880e-200,
1.197758e-200],
[-5.629092e-202, -6.866588e-201, -1.684314e-201, 8.075045e-201,...
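The reconstructed x above is all zeros because of floating-point underflow: squaring entries of the order of 1e-200 produces numbers far below the smallest representable double (about 5e-324), so A^T.A evaluates to an exactly zero matrix and every computed singular value collapses to 0. A quick illustration:
import numpy as np

x = 1e-200
print(x * x)                        # 0.0 -- underflows below the double range
a = np.random.normal(0, 1e-200, (5, 5))
print(np.count_nonzero(a.T @ a))    # 0 -- the Gram matrix is exactly zero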
Instead, we will look into one of the earliest stable algorithms for computing the SVD, known as the Golub-Kahan-Reinsch SVD algorithm.
In the rest of this post we will understand this algorithm and implement it using Rust. Note that this algorithm is not the fastest out there; faster methods, such as the divide-and-conquer approach, are used in standard packages.
The idea behind this algorithm is to multiply the matrix A by some orthogonal matrices L1, L2, … Lk on the left and some orthogonal matrices R1, R2, … Rk on the right so that the product becomes a diagonal matrix, say X, i.e.
Lk…L2.L1.A.R1.R2…Rk = X
Then we can write A as follows:
A = L1^T.L2^T…Lk^T.X.Rk^T…R2^T.R1^T
The above follows from the fact that the matrices L and R are orthogonal, i.e. L.L^T = L^T.L = I, and similarly for R.
If we denote:
U = L1^T.L2^T…Lk^T and V = R1.R2…Rk
then we can rewrite A as:
A = U.X.V^T
This form looks familiar, and it is in fact the SVD of the matrix A, because X is a diagonal matrix and U and V are orthogonal. Now the only question remaining is how to find these orthogonal matrices L and R?
Let’s deviate a bit from here into another related algorithm.
Remember QR decomposition from linear algebra class?
QR decomposition of an NxM matrix A produces 2 matrices Q and R, where Q is an NxN orthogonal matrix and R is an NxM upper triangular matrix.
The dimensions of Q and R can be varied depending on the use case. For example, if M < N, we can instead use an NxM matrix for Q (its first M columns) and an MxM matrix for R (its top M rows).
This does not change the result because, when N > M, the bottom (N-M)xM block of the upper triangular R is all zeros and contributes nothing to the product.
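NumPy exposes both conventions: mode='complete' gives the NxN Q and NxM R, while the default mode='reduced' gives the smaller NxM Q and MxM R described above.
import numpy as np

a = np.random.normal(0, 1, (6, 4))                  # N=6, M=4
q_full, r_full = np.linalg.qr(a, mode='complete')   # shapes (6, 6) and (6, 4)
q_thin, r_thin = np.linalg.qr(a, mode='reduced')    # shapes (6, 4) and (4, 4)
np.testing.assert_allclose(q_full @ r_full, a, atol=1e-10)
np.testing.assert_allclose(q_thin @ r_thin, a, atol=1e-10)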
There are 2 very popular algorithms to perform QR decomposition:
- Householder reflection
- Givens’ rotation
In the Householder reflection method, we first zero out all the values in rows 1 to N-1 (0-indexed) of column 0, then rows 2 to N-1 of column 1, and so on, until all sub-diagonal elements have been zeroed out.
The algorithm is as follows:
Let G = A (an NxM matrix)
Let H = I (the NxN identity matrix)
For each column i from 0 to min(N-1, M-1), repeat steps 1 to 6:
1. Let x = G[i:,i]
2. Let l2 = L2_norm(x)
3. Update x[0] = x[0]-sgn(x[0])*l2
4. Let u = x/L2_norm(x)
5. Update G[i:,i:] = G[i:,i:] - 2*(u @ (u.T @ G[i:,i:]))
6. Update H[i:,:] = H[i:,:]- 2*(u @ (u.T @ H[i:,:]))
Let K = min(N, M)
return Q=H.T[:,:K], R=G[:K,:]
The time complexity of the above algorithm is O(NM²), ignoring constant and lower-order terms. Thus, if M > N, it is beneficial to perform the QR decomposition on the transpose of A.
A more elaborate explanation can be found in these slides:
Applied Numerical Linear Algebra. Lecture 9 (chalmers.se)
It can be shown that once a column has been zeroed out below the main diagonal, the reflections applied for subsequent columns do not affect the already zeroed-out columns.
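Before moving to Rust, here is a direct NumPy transcription of the pseudocode above. It is only a reference sketch; in practice np.linalg.qr should be preferred.
import numpy as np

def sgn(x):
    # sign convention used in this post: sgn(0) = +1
    return -1.0 if x < 0 else 1.0

def householder_qr(a):
    # QR via Householder reflections, following steps 1-6 above.
    # Returns Q (N x K) and R (K x M), K = min(N, M), with Q @ R ~ A.
    a = np.asarray(a, dtype=float)
    n, m = a.shape
    g = a.copy()              # becomes R
    h = np.eye(n)             # accumulates the reflections; equals Q^T at the end
    for i in range(min(n, m)):
        x = g[i:, i].copy()
        x[0] -= sgn(x[0]) * np.linalg.norm(x)
        nrm = np.linalg.norm(x)
        if nrm == 0.0:
            continue
        u = (x / nrm).reshape(-1, 1)
        g[i:, i:] -= 2.0 * u @ (u.T @ g[i:, i:])
        h[i:, :] -= 2.0 * u @ (u.T @ h[i:, :])
    k = min(n, m)
    return h.T[:, :k], g[:k, :]

a = np.random.normal(0, 1, (6, 4))
q, r = householder_qr(a)
np.testing.assert_allclose(q @ r, a, atol=1e-8)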
Implementing the above algorithm using Rust:
use std::cmp::min;
use rand::thread_rng;
use rand_distr::{Distribution, Normal};

// returns sign of a real number
fn sgn(x:f64) -> f64 {
if x < 0.0 {
return -1.0;
}
return 1.0;
}
// generates an nxn identity matrix
fn identity(n:usize) -> Vec<f64> {
let mut q = vec![0.0;n*n];
for i in 0..n {
q[i*(n+1)] = 1.0;
}
return q;
}
// transpose a matrix
fn transpose(a:&[f64], n:usize, m:usize) -> Vec<f64> {
let mut b = vec![0.0;n*m];
for i in 0..n {
for j in 0..m {
b[j*n+i] = a[i*m+j];
}
}
return b;
}
// householder reflection algorithm
fn householder_reflection_left_multiply(a:&[f64], n:usize, m:usize) -> (Vec<f64>, Vec<f64>) {
let mut q_lt = identity(n);
let mut r = a.to_vec();
let w = min(n, m);
for i in 0..w {
let n1 = n-i;
let mut nm = 0.0;
let mut u = vec![0.0;n1];
for i1 in i..n {
// compute the L2 norm
nm += r[i1*m+i]*r[i1*m+i];
u[i1-i] = r[i1*m+i];
}
u[0] -= sgn(r[i*m+i])*nm.sqrt();
nm = (nm-r[i*m+i]*r[i*m+i]+u[0]*u[0]).sqrt();
    if nm > 0.0 {
for i1 in 0..n1 {
u[i1] = u[i1]/nm;
}
// r1 stores the values of u.T @ R for each column
let mut r1 = vec![0.0;m-i];
for i1 in i..n {
for j1 in i..m {
r1[j1-i] += u[i1-i]*r[i1*m+j1];
}
}
// update the matrix R
for i1 in i..n {
for j1 in i..m {
r[i1*m+j1] -= 2.0*u[i1-i]*r1[j1-i];
}
}
// q1 stores the values of u.T @ Q for each column
let mut q1 = vec![0.0;n];
for i1 in i..n {
for j1 in 0..n {
q1[j1] += u[i1-i]*q_lt[i1*n+j1];
}
}
// update the matrix Q
for i1 in i..n {
for j1 in 0..n {
q_lt[i1*n+j1] -= 2.0*u[i1-i]*q1[j1];
}
}
}
}
return (q_lt, r);
}
pub fn householder_reflection_qr(a:&[f64], n:usize, m:usize) -> (Vec<f64>, Vec<f64>) {
    // if n < m, decompose the transpose instead (this yields an LQ decomposition of A)
    if n < m {
        let a_t = transpose(a, n, m);
        let (q, r) = householder_reflection_left_multiply(&a_t, m, n);
        return (transpose(&r, m, n), q);
    }
    else {
        let (q, r) = householder_reflection_left_multiply(a, n, m);
        return (transpose(&q, n, n), r);
    }
}
fn main() {
let n = 500;
let m = 500;
//generate matrix of nxm shape from normal distribution
let mut rng = thread_rng();
let normal:Normal<f64> = Normal::new(0.0, 1.0).ok().unwrap();
let mut a:Vec<f64> = vec![0.0;n*m];
for i in 0..n*m {
a[i] = normal.sample(&mut rng);
}
let (q, r) = householder_reflection_qr(&a, n, m);
}
In Givens rotation, instead of targeting an entire column at a time, individual cells (i, j) are targeted. For example, to zero out A[i,j] we can left-multiply the 2xM block A[i-1:i+1, :] by a 2x2 rotation matrix H = [[c, -s], [s, c]], where r = sqrt(A[i-1,j]² + A[i,j]²), c = A[i-1,j]/r and s = -A[i,j]/r.
H.A[i-1:i+1, :] only affects rows i-1 and i. Thus, we have to zero out the values in a column in a bottom-up fashion. The algorithm is as follows:
Let G = A (an NxM matrix)
Let H = I (the NxN identity matrix)
For each column j from 0 to M-1:
For each row i from N-1 down to j+1:
1. Let r = sqrt(G[i-1,j]**2 + G[i,j]**2)
2. Let c = G[i-1,j]/r
3. Let s = -G[i,j]/r
4. Let Q = [[c, -s], [s, c]]
5. Update G[i-1:i+1, j:] = Q @ G[i-1:i+1, j:]
6. Update H[i-1:i+1, :] = Q @ H[i-1:i+1, :]
Let K = min(N, M)
return Q=H.T[:,:K], R=G[:K,:]
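A minimal NumPy illustration of a single such rotation, zeroing one entry of a small random matrix:
import numpy as np

a = np.random.normal(0, 1, (4, 3))
i, j = 3, 0                             # zero out A[3, 0] using rows 2 and 3
r = np.hypot(a[i - 1, j], a[i, j])
c, s = a[i - 1, j] / r, -a[i, j] / r
q = np.array([[c, -s], [s, c]])
a[i - 1:i + 1, :] = q @ a[i - 1:i + 1, :]
assert abs(a[i, j]) < 1e-12             # the targeted entry is now zero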
The time complexity of the above algorithm is O(NM²) which is similar to the Householder reflection method.
One advantage of Givens rotation over the Householder reflection method is that some cells can be zeroed out in parallel.
Since the Givens rotation for (i, j) affects only rows i-1 and i, once we have zeroed out A[n-1,0] and A[n-2,0] we can start zeroing out columns 0 and 1 in parallel, because column 0 will then be working from row n-3 upwards while column 1 works from row n-1 upwards.
Similar logic applies for all subsequent columns too.
Cells with the same sequence number can be zeroed out in parallel. For example, for a matrix with 8 rows, cells (3,0), (5,1) and (7,2) all get sequence number 5 and can be processed in the same round (see the schedule sketch below).
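One way to generate such a schedule programmatically; the numbering below is a sketch that is consistent with the example above (rounds are 1-indexed, and the example assumes a matrix with 8 rows):
def givens_schedule(n, m):
    # round in which the sub-diagonal entry (i, j) can be zeroed out;
    # entries sharing a round touch disjoint row pairs and can run in parallel
    return {(i, j): (n - 1 - i) + 2 * j + 1
            for j in range(m) for i in range(n - 1, j, -1)}

sched = givens_schedule(8, 3)
assert sched[(3, 0)] == sched[(5, 1)] == sched[(7, 2)] == 5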
Here is an implementation for Givens Rotation using Rust (without the parallel implementation):
fn givens_left_rotation(a:&[f64], _n:usize, m:usize, i:usize, j:usize) -> (f64, f64) {
let x = a[(i-1)*m+j];
let y = a[i*m+j];
let r = (x*x + y*y).sqrt();
return (x/r, -y/r);
}
fn givens_left_rotation_multiply(a:&mut [f64], _n:usize, m:usize, c:f64, s:f64, i:usize, _j:usize) {
for j1 in 0..m {
let p = a[(i-1)*m+j1];
let q = a[i*m+j1];
a[(i-1)*m+j1] = c*p - s*q;
a[i*m+j1] = s*p + c*q;
}
}
pub fn givens_rotation_qr(a:&[f64], n:usize, m:usize) -> (Vec<f64>, Vec<f64>, usize, usize, usize) {
    // returns (B1, B2, rows of B1, cols of B1, cols of B2) such that A = B1 @ B2;
    // if n < m, we decompose the transpose instead, which yields an LQ decomposition of A
    if n < m {
let mut r = transpose(&a, n, m);
let mut q = identity(m);
for j in 0..n {
for i in (j+1..m).rev() {
let b = givens_left_rotation(&r, m, n, i, j);
givens_left_rotation_multiply(&mut r, m, n, b.0, b.1, i, j);
givens_left_rotation_multiply(&mut q, m, m, b.0, b.1, i, j);
}
}
        return (transpose(&r, m, n), q, n, m, m);
}
else {
let mut r = a.to_vec();
let mut q = identity(n);
for j in 0..m {
for i in (j+1..n).rev() {
let b = givens_left_rotation(&r, n, m, i, j);
givens_left_rotation_multiply(&mut r, n, m, b.0, b.1, i, j);
givens_left_rotation_multiply(&mut q, n, n, b.0, b.1, i, j);
}
}
        return (transpose(&q, n, n), r, n, n, m);
}
}
The best part is that both Householder reflection matrices and Givens rotation matrices are orthogonal.
Now, coming back to our SVD algorithm.
The first step of the algorithm is the ‘bidiagonalization’ of the input matrix A: we reduce A to a matrix in which only the main diagonal and the diagonal just above it (the superdiagonal) can have non-zero values. Such a matrix is called an “upper bidiagonal” matrix.
Bidiagonalization has quite a few advantages:
- It is “close” in shape to a diagonal matrix, which we ultimately want.
- The matrix is very sparse, so different operations such as matrix multiplication can be performed quite cheaply.
- If we denote the bidiagonal matrix with B, then the matrix B^T.B has a tridiagonal form, which is again sparse. We would be interested in finding the eigenvalues and eigenvectors of B^T.B.
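To make the last point concrete, here is a small NumPy check that B^T.B is tridiagonal when B is upper bidiagonal:
import numpy as np

d = np.array([4.0, 3.0, 2.0, 1.0])        # main diagonal of B
e = np.array([1.5, 0.5, 2.5])             # superdiagonal of B
B = np.diag(d) + np.diag(e, k=1)          # upper bidiagonal matrix
T = B.T @ B
print(np.round(T, 2))
# everything outside the main, super- and sub-diagonal of T is zero
outside_band = np.abs(np.subtract.outer(np.arange(4), np.arange(4))) > 1
assert np.allclose(T[outside_band], 0.0)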
Bidiagonalization of a matrix can be done using methods similar to those for QR decomposition. Assuming we choose the Householder reflection method, we first zero out all the elements below the main diagonal, exactly as in the QR step seen earlier.
Then we zero out all elements to the right of the superdiagonal by performing the Householder reflection on each row instead of each column. Note that we cannot zero out the superdiagonal itself, because the corresponding reflection would destroy the zeros already created below the main diagonal.
Bidiagonalization using Rust:
fn householder_reflection_bidiagonalization(a:&[f64], n:usize, m:usize) -> (Vec<f64>, Vec<f64>, Vec<f64>) {
let mut q_lt = identity(n);
let mut q_rt = identity(m);
let mut r = a.to_vec();
for i in 0..min(n, m) {
let n1 = n-i;
let mut nm = 0.0;
let mut u = vec![0.0;n1];
for i1 in i..n {
nm += r[i1*m+i]*r[i1*m+i];
u[i1-i] = r[i1*m+i];
}
u[0] -= sgn(r[i*m+i])*nm.sqrt();
let z = (nm-r[i*m+i]*r[i*m+i]+u[0]*u[0]).sqrt();
if z > 0.0 {
for i1 in 0..n1 {
u[i1] = u[i1]/z;
}
let mut r1 = vec![0.0;m-i];
for i1 in i..n {
for j1 in i..m {
r1[j1-i] += u[i1-i]*r[i1*m+j1];
}
}
for i1 in i..n {
for j1 in i..m {
r[i1*m+j1] -= 2.0*u[i1-i]*r1[j1-i];
}
}
let mut q1 = vec![0.0;n];
for i1 in i..n {
for j1 in 0..n {
q1[j1] += u[i1-i]*q_lt[i1*n+j1];
}
}
for i1 in i..n {
for j1 in 0..n {
q_lt[i1*n+j1] -= 2.0*u[i1-i]*q1[j1];
}
}
}
if m-i-1 > 0 {
let n1 = m-i-1;
let mut nm = 0.0;
let mut u = vec![0.0;n1];
for j1 in i+1..m {
nm += r[i*m+j1]*r[i*m+j1];
u[j1-i-1] = r[i*m+j1];
}
u[0] -= sgn(r[i*m+i+1])*nm.sqrt();
let z = (nm-r[i*m+i+1]*r[i*m+i+1]+u[0]*u[0]).sqrt();
if z > 0.0 {
for i1 in 0..n1 {
u[i1] = u[i1]/z;
}
let mut r1 = vec![0.0;n-i];
for i1 in i..n {
for j1 in i+1..m {
r1[i1-i] += u[j1-i-1]*r[i1*m+j1];
}
}
for i1 in i..n {
for j1 in i+1..m {
r[i1*m+j1] -= 2.0*u[j1-i-1]*r1[i1-i];
}
}
let mut q1 = vec![0.0;m];
for i1 in 0..m {
for j1 in i+1..m {
q1[i1] += u[j1-i-1]*q_rt[i1*m+j1];
}
}
for i1 in 0..m {
for j1 in i+1..m {
q_rt[i1*m+j1] -= 2.0*u[j1-i-1]*q1[i1];
}
}
}
}
}
return (q_lt, r, q_rt);
}
Thus, after the bidiagonalization step we obtain a bidiagonal matrix B:
Q_L.A.Q_R = B, i.e. A = Q_L^T.B.Q_R^T
But note that we still have not diagonalized the matrix A, only bidiagonalized it. The matrices Q_L and Q_R are orthogonal, and so are their products.
Thus, finding the SVD of A reduces to finding the SVD of B. If the SVD of B is represented as:
B = U_B.S.V_B^T
then the SVD of A can be written as:
A = (Q_L^T.U_B).S.(Q_R.V_B)^T
The next natural step would be to find the SVD of B, which we could again do by finding the eigenvalues and eigenvectors of B^T.B as seen above.
But remember that the reason we chose this method in the first place is that explicitly forming A^T.A or B^T.B can produce values smaller than machine precision, giving unstable results when the eigenvalues are small.
Thus, we should avoid explicitly calculating B^T.B. What are the alternative ways to diagonalize B then? The idea is to work with B^T.B implicitly, without ever forming it.
Let T = B^T.B.
If B is an upper bidiagonal matrix with diagonal elements d1, d2, … and superdiagonal elements e1, e2, …, then T is a symmetric tridiagonal matrix: its diagonal entries are d1², d2² + e1², d3² + e2², … and its off-diagonal entries are d1.e1, d2.e2, …
Assume there are Givens rotation matrices G1, G2, … Gk to be multiplied on the right of T (with their transposes on the left) that eventually zero out the off-diagonal elements of T, i.e.
Gk^T…G1^T.T.G1…Gk = D, where D is a diagonal matrix.
Note that in QR decomposition once we zero out a cell (i, j) using Givens Left Rotation, we do not have to be bothered about the other columns that are affected in rows i-1 and i. But here, it is not so straightforward.
Remember that a Givens Left Rotation matrix for zeroing out (i, j) will affect only rows i-1 and i, whereas a Givens Right Rotation matrix will affect only columns j-1 and j.
Thus, if we zero out (0,1) in T using G1 (a right rotation), it affects only columns 0 and 1, and (2,0) becomes non-zero. All other zeros in columns 0 and 1 are unaffected (if both columns have a 0 in the same row, both 0s remain 0). So while zeroing out one cell, we introduce a non-zero in another cell.
Similarly, if we apply G1^T to T.G1, this is an application of a Givens left rotation and affects only rows 0 and 1. It introduces new non-zero values at (0,1) and (0,2). Columns where both rows 0 and 1 are 0 are unaffected.
Note that applying G1^T to T.G1 will not zero out (1,0), because the value at (1,0) was already changed by T.G1.
Repeat the process to zero out (0,2) and (2,0) by multiplying with G2 (right rotation) and G2^T (left rotation). This will introduce new non-zero values at (1,3) and (3,1).
Continue to repeat this process, until the new non-zero values are all chased out from T and we again obtain the tridiagonal form.
This entire process of “chasing out” a new non-zero element from T and restoring the tridiagonal form of T using Givens Left and Right rotation matrices is known as the Golub Kahan step.
In the below figure, the ‘x’ symbolizes a non-zero value. See how the 2 additional ‘x’’s are chased out to regain the original tridiagonal form.
After one round of the Golub-Kahan step described above, the off-diagonal elements move closer to 0 but are generally not yet zero. With repeated application of the above steps, some of the off-diagonal elements get very close to 0, and whenever the absolute value of an off-diagonal element falls below some threshold we can treat it as 0.
To speed up convergence, the Wilkinson shift strategy is applied: instead of the original T we work with (T - uI), where u is the eigenvalue of the bottom-right 2x2 sub-matrix of T that is closer to the last diagonal element.
Using (T - uI) in place of T in the above steps improves the rate at which the off-diagonal elements converge to 0.
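A small NumPy sketch of the shift computation from the trailing 2x2 block of T; this is essentially what the eigenvalue_bidiagonal function in the Rust code below does, except that it builds the 2x2 block directly from the entries of B:
import numpy as np

def wilkinson_shift(t):
    # eigenvalue of the trailing 2x2 block of the symmetric tridiagonal T
    # that is closest to its last diagonal entry
    a, b, c = t[-2, -2], t[-2, -1], t[-1, -1]
    evals = np.linalg.eigvalsh(np.array([[a, b], [b, c]]))
    return evals[np.argmin(np.abs(evals - c))]

B = np.diag([4.0, 3.0, 2.0, 1.0]) + np.diag([1.5, 0.5, 2.5], k=1)
print(wilkinson_shift(B.T @ B))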
But so far we have been describing left and right multiplications with Givens matrices on T, and remember that we do not want to form T explicitly, because that leads to errors in the calculations when the eigenvalues of T are smaller than the machine precision.
In the equation:
Gk^T…G1^T.T.G1…Gk = D
if we replace T with B^T.B, we get:
Gk^T…G1^T.B^T.B.G1…Gk = (B.G1…Gk)^T.(B.G1…Gk) = D
Thus, instead of applying G1, G2, … Gk to T, we can apply them directly to B. But note that applying only the right rotation matrices G1, G2, … to B will introduce new non-zero elements below the diagonal of B.
These new non-zero elements can be chased out by Givens left rotation matrices.
For example, let there be Givens left rotation matrices Q1, Q2, … such that
Qk…Q1.B.G1…Gk = B'
where B' is again bidiagonal. Left-multiplying by the Q matrices to keep B in bidiagonal form does not alter the product B^T.B, since the Q matrices are orthogonal:
B'^T.B' = Gk^T…G1^T.B^T.(Qk…Q1)^T.(Qk…Q1).B.G1…Gk = Gk^T…G1^T.T.G1…Gk
Similar to the tridiagonal case, where multiplying on the left and right by Givens matrices chased away the new non-zero elements, the same thing happens with B.
Note that the above steps, multiplying B on the left and right by Givens rotation matrices to gradually remove the off-diagonal elements and chase away the unwanted non-zero elements introduced along the way, apply only to a matrix in bidiagonal form.
Once some off-diagonal or diagonal elements become 0, we have to split the current matrix so that each part is bidiagonal, and process each bidiagonal part separately.
For example, if an off-diagonal element, say B[i,i+1], becomes 0 at some point, then we separately process B[:i+1,:i+1] and B[i+1:,i+1:].
With i = 2, for instance, we would process B[:3,:3] and B[3:,3:] separately.
In the event that a diagonal element becomes 0 (implying at least one eigenvalue of T is 0), we can handle the situation as follows:
For example, if B[i,i] = 0 and B[i,i+1] != 0, then applying a Givens left rotation to B to zero out B[i,i+1] will introduce a new non-zero at B[i,i+2]. Applying another Givens left rotation to zero out B[i,i+2] will introduce a new non-zero at B[i,i+3], and so on up to B[i,M-1].
After that, the final Givens Rotation matrix will chase out the unwanted non-zero value from the matrix.
In the below example B[2,2] = 0
Once we obtain multiple bidiagonal sub-matrices, we can run the Golub-Kahan steps on each of them in parallel. The Golub-Kahan step works fine even if the last diagonal element of a bidiagonal matrix is 0.
Since we multiply the bidiagonal matrix on the left and right with Givens matrices after shifting by the eigenvalue of the bottom-right 2x2 sub-matrix that is closer to the last diagonal element, the last superdiagonal element, i.e. B[N-2,N-1], is the first to go to zero. After we then process the remaining upper-left block, the next off-diagonal element to go to zero is B[N-3,N-2], and so on.
Thus, the off-diagonal elements of B go to zero starting from the bottom right. Starting from i = N-2, for each i we check whether B[i,i+1] = 0; if it is, we continue moving towards the top by decreasing i, otherwise we stop and process the sub-matrix B[:i+1,:i+1].
Return once B has become a diagonal matrix.
The algorithm is as follows:
Instead of processing the multiple bidiagonal matrices produced by the splits in parallel, here we process them in a bottom-up fashion. We could use multithreading to process them in parallel, but we would not expect large performance gains as these are CPU-bound calculations.
Let A be an NxM matrix and r = min(N, M)
1. Compute B = P @ A @ Q where B is a bidiagonal matrix (Householder reflections)
2. Loop over steps 3-8:
3. Set B[i,i+1] = 0 whenever abs(B[i,i+1]) < eps*(abs(B[i,i]) + abs(B[i+1,i+1]))
4. Find the largest i with B[i,i+1] != 0 and set q = i+1; if no such i exists, set q = 0
5. Break if q = 0 (B is now diagonal)
6. Find the largest i < q with B[i,i+1] = 0 and set p = i+1; if no such i exists, set p = 0
   (every superdiagonal element of the block from (p,p) to (q,q) is now non-zero)
7. If B[i,i] = 0 for some i in [p, q), remove the off-diagonal element B[i,i+1]
   using Givens left rotations as described above and update the matrix P from step 1.
8. Otherwise, run a Golub-Kahan step on the sub-matrix from (p,p) to (q,q)
   and update the matrices P and Q from step 1.
return (P, B, Q)
Let U^T be the product of all the orthogonal matrices applied to the left of A (the Householder reflections used for bidiagonalization and the Givens left rotations applied to B), and let V be the product of all the orthogonal matrices applied to the right of A. Then
U^T.A.V = S, i.e. A = U.S.V^T
Thus, we have found the SVD of A; S is the final diagonal matrix.
Here is the Rust implementation for the above algorithm (the helper functions sgn, identity and transpose are the ones defined earlier in the post):
#![feature(portable_simd)] // std::simd is unstable and requires a nightly toolchain
use std::simd::prelude::*;
use rand_distr::{Distribution, Normal};
use rand::thread_rng;
use std::cmp::min;
use std::time::SystemTime;
// Get sub-matrix (r_start, c_start) and (r_end, c_end)
fn sub_mat(a:&[f64], _n:usize, m:usize, r_start:usize, r_end:usize, c_start:usize, c_end:usize) -> Vec<f64> {
let mut b:Vec<f64> = vec![0.0;(r_end-r_start+1)*(c_end-c_start+1)];
let u = c_end-c_start+1;
let mut k = 0;
for i in r_start..(r_end+1) {
b[k*u..(k+1)*u].copy_from_slice(&a[i*m+c_start..i*m+c_end+1]);
k += 1;
}
return b;
}
// Bidiagonalization using Householder transformation
fn householder_reflection_bidiagonalization(a:&[f64], n:usize, m:usize) -> (Vec<f64>, Vec<f64>, Vec<f64>) {
let mut q_lt = identity(n);
let mut q_rt = identity(m);
let mut r = a.to_vec();
for i in 0..min(n, m) {
let n1 = n-i;
let mut nm = 0.0;
let mut u = vec![0.0;n1];
for i1 in i..n {
nm += r[i1*m+i]*r[i1*m+i];
u[i1-i] = r[i1*m+i];
}
u[0] -= sgn(r[i*m+i])*nm.sqrt();
let z = (nm-r[i*m+i]*r[i*m+i]+u[0]*u[0]).sqrt();
if z > 0.0 {
for i1 in 0..n1 {
u[i1] = u[i1]/z;
}
let mut r1 = vec![0.0;m-i];
for i1 in i..n {
for j1 in i..m {
r1[j1-i] += u[i1-i]*r[i1*m+j1];
}
}
for i1 in i..n {
for j1 in i..m {
r[i1*m+j1] -= 2.0*u[i1-i]*r1[j1-i];
}
}
let mut q1 = vec![0.0;n];
for i1 in i..n {
for j1 in 0..n {
q1[j1] += u[i1-i]*q_lt[i1*n+j1];
}
}
for i1 in i..n {
for j1 in 0..n {
q_lt[i1*n+j1] -= 2.0*u[i1-i]*q1[j1];
}
}
}
if m-i-1 > 0 {
let n1 = m-i-1;
let mut nm = 0.0;
let mut u = vec![0.0;n1];
for j1 in i+1..m {
nm += r[i*m+j1]*r[i*m+j1];
u[j1-i-1] = r[i*m+j1];
}
u[0] -= sgn(r[i*m+i+1])*nm.sqrt();
let z = (nm-r[i*m+i+1]*r[i*m+i+1]+u[0]*u[0]).sqrt();
if z > 0.0 {
for i1 in 0..n1 {
u[i1] = u[i1]/z;
}
let mut r1 = vec![0.0;n-i];
for i1 in i..n {
for j1 in i+1..m {
r1[i1-i] += u[j1-i-1]*r[i1*m+j1];
}
}
for i1 in i..n {
for j1 in i+1..m {
r[i1*m+j1] -= 2.0*u[j1-i-1]*r1[i1-i];
}
}
let mut q1 = vec![0.0;m];
for i1 in 0..m {
for j1 in i+1..m {
q1[i1] += u[j1-i-1]*q_rt[i1*m+j1];
}
}
for i1 in 0..m {
for j1 in i+1..m {
q_rt[i1*m+j1] -= 2.0*u[j1-i-1]*q1[i1];
}
}
}
}
}
return (q_lt, r, q_rt);
}
// Get the eigenvalue of the bottom right 2x2 sub-matrix which is closer
// to the last diagonal element in B
fn eigenvalue_bidiagonal(a:&[f64], n:usize, m:usize, i1:usize, i2:usize, j1:usize, j2:usize) -> f64{
let h = min(i2, j2)+1;
let mut d1 = 0.0;
let mut d2 = 0.0;
let mut d3 = 0.0;
let mut d4 = 0.0;
if h >= 3 {
d1 = a[(h-3)*m+h-2];
}
if h >= 2 {
d2 = a[(h-2)*m+h-2];
d3 = a[(h-2)*m+h-1];
}
if h >= 1 {
d4 = a[(h-1)*m+h-1];
}
let a1 = d2*d2 + d1*d1;
let a2 = d2*d3;
let a3 = d2*d3;
let a4 = d4*d4 + d3*d3;
let u = 1.0;
let b = -(a1+a4);
let c = a1*a4-a2*a3;
let v1 = (-b + (b*b-4.0*u*c).sqrt())/(2.0*u);
let v2 = (-b - (b*b-4.0*u*c).sqrt())/(2.0*u);
if (v1-a4).abs() < (v2-a4).abs() {
return v1;
}
return v2;
}
// Get (c, s) value for Givens right rotation matrix
fn givens_right_rotation(a:&[f64], _n:usize, m:usize, i:usize, j:usize, flip:bool) -> (f64, f64) {
let x = a[i*m+j-1];
let y = a[i*m+j];
let r = (x*x + y*y).sqrt();
if flip {
return (y/r, -x/r);
}
return (x/r, -y/r);
}
// Multiple a by Givens right rotation matrix
fn givens_right_rotation_multiply(a:&mut [f64], n:usize, m:usize, c:f64, s:f64, _i:usize, j:usize, r1:usize, r2:usize, c1:usize, c2:usize) {
for i1 in r1..r2+1 {
let p = a[i1*m+j-1];
let q = a[i1*m+j];
a[i1*m+j-1] = c*p - s*q;
a[i1*m+j] = s*p + c*q;
}
}
// Get (c, s) value for Givens left rotation matrix
fn givens_left_rotation(a:&[f64], _n:usize, m:usize, i:usize, j:usize, flip:bool) -> (f64, f64) {
let x = a[(i-1)*m+j];
let y = a[i*m+j];
let r = (x*x + y*y).sqrt();
if flip {
return (y/r, -x/r);
}
return (x/r, -y/r);
}
// Multiple a by Givens left rotation matrix
fn givens_left_rotation_multiply(a:&mut [f64], _n:usize, m:usize, c:f64, s:f64, i:usize, _j:usize, r1:usize, r2:usize, c1:usize, c2:usize) {
for j1 in c1..c2+1 {
let p = a[(i-1)*m+j1];
let q = a[i*m+j1];
a[(i-1)*m+j1] = c*p - s*q;
a[i*m+j1] = s*p + c*q;
}
}
// Golub Kahan step on biadiagonal matrix
fn golub_kahan(a:&mut [f64], l:&mut [f64], r:&mut [f64], n:usize, m:usize, z:usize, i:usize, j:usize) {
let mu = eigenvalue_bidiagonal(&a, z, z, i, j, i, j);
let u = a[i*z+i];
let v = a[i*z+i+1];
a[i*z+i] = u*u-mu;
a[i*z+i+1] = u*v;
for k in i..j {
let mut x;
let mut y;
if k > i {
x = k-1;
y = k+1;
}
else {
x = i;
y = i+1;
}
let b = givens_right_rotation(&a, z, z, x, y, false);
if k == i {
a[i*z+i] = u;
a[i*z+i+1] = v;
}
givens_right_rotation_multiply(a, z, z, b.0, b.1, x, y, i, j, i, j);
givens_right_rotation_multiply(r, m, z, b.0, b.1, x, y, 0, m-1, 0, z-1);
if k > i {
x = k+1;
y = k;
}
else {
x = i+1;
y = i;
}
let b = givens_left_rotation(&a, z, z, x, y, false);
givens_left_rotation_multiply(a, z, z, b.0, b.1, x, y, i, j, i, j);
givens_left_rotation_multiply(l, z, n, b.0, b.1, x, y, 0, z-1, 0, n-1);
}
}
// Golub Reisch SVD algorithm
pub fn golub_reisch_svd(a:&[f64], mut n:usize, mut m:usize) -> (Vec<f64>, Vec<f64>, Vec<f64>) {
let mut a1 = a.to_vec();
let mut do_transpose = false;
if n < m {
do_transpose = true;
a1 = transpose(&a1, n, m);
let g = n;
n = m;
m = g;
}
let hr = householder_reflection_bidiagonalization(&a1, n, m);
let r = min(n, m);
let mut u = sub_mat(&hr.0, n, n, 0, r-1, 0, n-1);
a1 = sub_mat(&hr.1, n, m, 0, r-1, 0, r-1);
let mut v = sub_mat(&hr.2, m, m, 0, m-1, 0, r-1);
let eps = 1.0e-7;
loop {
for i in 0..r-1 {
if a1[i*r+i+1].abs() < eps*(a1[i*r+i].abs() + a1[(i+1)*r+i+1].abs()) {
a1[i*r+i+1] = 0.0;
}
}
let mut q = 0;
for i in (0..r-1).rev() {
if a1[i*r+i+1].abs() > 0.0 {
q = i+1;
break;
}
}
if q == 0 {
break;
}
let mut p = 0;
for i in (0..q).rev() {
if a1[i*r+i+1].abs() == 0.0 {
p = i+1;
break;
}
}
let mut flag: bool = false;
for i in p..q {
if a1[i*r+i].abs() == 0.0 {
flag = true;
for j in i+1..r {
let b = givens_left_rotation(&a1, r, r, i+1, j, true);
givens_left_rotation_multiply(&mut a1, r, r, b.0, b.1, i+1, j, 0, r-1, 0, r-1);
givens_left_rotation_multiply(&mut u, r, n, b.0, b.1, i+1, j, 0, r-1, 0, n-1);
}
}
}
if !flag && p < q {
golub_kahan(&mut a1, &mut u, &mut v, n, m, r, p, q);
}
}
if do_transpose {
return (v, transpose(&a1, r, r), u);
}
return (transpose(&u, r, n), a1, transpose(&v, m, r));
}
// Multiply matrices inp1 and inp2 using SIMD instructions
pub fn matrix_multiply_simd(inp1:&[f64], inp2:&[f64], n:usize, m:usize, p:usize) -> Vec<f64> {
const LANES:usize = 64;
let mut out:Vec<f64> = vec![0.0;n*p];
for i in 0..n {
for k in 0..m {
let a:Simd<f64, LANES> = Simd::splat(inp1[i*m+k]);
for j in (0..p).step_by(LANES) {
if k*p+j+LANES > (k+1)*p {
let mut r:usize = i*p+j;
for h in k*p+j..(k+1)*p {
out[r] += inp1[i*m+k]*inp2[h];
r += 1;
}
}
else {
let x:Simd<f64, LANES> = Simd::from_slice(&inp2[k*p+j..k*p+j+LANES]);
let z:Simd<f64, LANES> = a*x;
let mut c:Simd<f64, LANES> = Simd::from_slice(&out[i*p+j..i*p+j+LANES]);
c += z;
Simd::copy_to_slice(c, &mut out[i*p+j..i*p+j+LANES]);
}
}
}
}
return out;
}
fn main() {
let n = 500;
let m = 500;
let mut rng = thread_rng();
let normal:Normal<f64> = Normal::new(0.0, 1.0).ok().unwrap();
let mut a:Vec<f64> = vec![0.0;n*m];
for i in 0..n*m {
a[i] = normal.sample(&mut rng);
}
let start_time = SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap().as_millis();
let b = golub_reisch_svd(&a, n, m);
let end_time = SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap().as_millis();
println!("{:?}", end_time-start_time);
let r = min(n, m);
let mut c = matrix_multiply_simd(&b.0, &b.1, n, r, r);
c = matrix_multiply_simd(&c, &b.2, n, r, m);
for i in 0..n*m {
assert!((c[i]-a[i]).abs() < 1e-5, "Some issue in SVD !!!");
}
}
The Rust code is not highly optimized, but after building and running with the --release flag and compiling with "target-cpu=native", we saw pretty good performance for relatively small matrices of the order of N=500 and M=500, comparable to np.linalg.svd, which is backed by LAPACK’s highly optimized Fortran routines.
RUSTFLAGS="-C target-cpu=native" cargo run --release
Possible optimizations (which may or may not improve performance) are:
- Using multithreading for parallel operations.
- Using SIMD instructions for block operations.
But the above code is not yet ready to handle very small values, of the order of 1e-200. The reason is that if a value x ~ O(1e-200), then x² ~ O(1e-400), which underflows to 0 in double precision; computing sqrt(x² + y²) directly then returns 0 and causes errors in the calculations.
One way to handle this is to factor out the larger of the two magnitudes and expand the remaining square root as a Taylor (binomial) series. For example, in the Givens rotation matrices, the values of c and s could be computed as follows (assuming |x| > |y|):
r = |x|.sqrt(1 + (y/x)²), c = x/r, s = -y/r
with sqrt(1 + w) = 1 + w/2 - w²/8 + w³/16 - … (the binomial series).
We can terminate the series at the point where the absolute value of the i-th term becomes less than, say, 1% of the running sum.
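A quick Python illustration of why the scaled form matters for tiny inputs (the standard library hypot functions use the same idea):
import math

x = y = 1e-200
print(math.sqrt(x * x + y * y))                # 0.0 -- the squares underflow
print(abs(x) * math.sqrt(1.0 + (y / x) ** 2))  # ~1.414e-200, the correct value
print(math.hypot(x, y))                        # ~1.414e-200 as well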
The following Rust code can be used to calculate the value of r in the above equations.
// Computes sqrt(x^2 + y^2) without underflow, by factoring out the larger
// magnitude and expanding sqrt(1 + w^2) as a binomial series
fn hypot(x:f64, y:f64) -> f64 {
    let mut k = 1.0;       // index of the current series term
    let mut fact = 1.0;    // k!
    let mut s = 1.0;       // running sum of the series (the first term is 1)
    let mut r = 0.5;       // product (1/2)(1/2 - 1)...(1/2 - k + 1)
    let w;                 // ratio of the smaller to the larger magnitude, |w| <= 1
    let u;                 // the larger magnitude, factored out of the square root
    if x.abs() > y.abs() {
        w = (y/x).abs();
        u = x.abs();
    }
    else {
        w = (x/y).abs();
        u = y.abs();
    }
    if u == 0.0 {
        return 0.0;
    }
    let mut z = w*w;
    loop {
        let p = (r/fact)*z;          // k-th term of the binomial series
        if p.abs()/s < 0.01 {        // stop once the term is < 1% of the running sum
            break;
        }
        s += p;
        r *= 0.5-k;
        k += 1.0;
        fact *= k;
        z *= w*w;
    }
    return s.abs()*u;
}
Similarly, we should take care of every place where x² terms are computed and, where possible, avoid forming them explicitly.
Given that the time complexity of the above algorithm is O(NM²), we can try to improve the run-time performance of the SVD by working with a smaller value of M (the number of columns).
But how do we choose a smaller M? By projecting the full matrix of M columns into a lower dimension.
Let P be a matrix of dimensions MxL where L < M. If we multiply A.P, we get a new matrix A’ of dimensions NxL. If we now run the SVD on A’, the time complexity becomes O(NL²); if L << M, this improves the run time a lot.
A’ = A.P
If we take the SVD of A’, we have:
A’ = U’.S’.V’^T
To recover the SVD of A from this, we would need P to be an orthogonal matrix, and how to choose such an orthogonal matrix is the question here.
Instead of choosing an orthogonal matrix directly, which is a non-trivial operation, we can first find the QR decomposition of A’ = A.P, i.e.
A.P = Q.R
So Q is an orthogonal matrix. Then, instead of multiplying A.P, we work with Q^T.A, i.e.
A’ = Q^T.A, so that A ≈ Q.A’ = Q.Q^T.A
Now, if the SVD of A’ is U’.S.V^T, then since both Q and U’ have orthonormal columns, so does their product; thus the matrix U corresponding to A is Q.U’, and A ≈ (Q.U’).S.V^T.
Since A.P is of dimensions NxL with N > L, Q is an orthogonal matrix of dimensions NxL (see the reduced QR above) and R is an upper triangular matrix of dimensions LxL.
A’ = Q^T.A is a matrix of dimensions LxM. We can perform the SVD of A’ by first transposing it to an MxL matrix (since L << M) and then transposing the resulting factors U, S and V back, which the golub_reisch_svd function above already does internally.
The initial matrix P of shape MxL can be chosen from a random normal distribution. This process is popularly known as the Randomized SVD (or Truncated SVD) algorithm and is widely used with low-rank matrices, such as in text classification or recommendation problems.
The results obtained from the Randomized SVD may or may not be close to the true SVD, because this is a probabilistic algorithm. Increasing the value of L brings the results closer to the true SVD of A.
The task is to find a good balance between speed and accuracy by choosing an optimal value of L.
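Here is a NumPy sketch of the same procedure, with np.linalg.qr and np.linalg.svd standing in for the Rust routines; it is handy for checking the idea on small matrices before looking at the Rust version below:
import numpy as np

def randomized_svd(a, l):
    n, m = a.shape
    p = np.random.normal(0, 1, (m, l))     # random projection matrix, M x L
    q, _ = np.linalg.qr(a @ p)             # orthonormal basis of A.P, N x L
    c = q.T @ a                            # L x M
    u_c, s, vt = np.linalg.svd(c, full_matrices=False)
    return q @ u_c, s, vt                  # U is N x L

a = np.random.normal(0, 1, (200, 100))
u, s, vt = randomized_svd(a, 50)
print(s[:5])                                   # approximate top singular values
print(np.linalg.svd(a, compute_uv=False)[:5])  # exact top singular values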
pub fn randomized_svd(a:&[f64], n:usize, m:usize, l:usize) -> (Vec<f64>, Vec<f64>, Vec<f64>){
let mut rng = thread_rng();
let normal:Normal<f64> = Normal::new(0.0, 1.0).ok().unwrap();
let mut p:Vec<f64> = vec![0.0;m*l];
for i in 0..m*l {
p[i] = normal.sample(&mut rng);
}
let b = matrix_multiply_simd(&a, &p, n, m, l);
let (mut q, _r, n1, m1, _p1) = givens_rotation_qr(&b, n, l);
let w= min(m1, l);
q = sub_mat(&q, n1, m1, 0, n1-1, 0, w-1);
let c = &matrix_multiply_simd(&transpose(&q, n1, w), &a, w, n1, m);
let (mut u, s, v) = golub_reisch_svd(&c, w, m);
u = matrix_multiply_simd(&q, &u, n1, w, w);
return (u, s, v);
}
Link to the entire code in Github: rust_machine_learning/src/svd.rs at main · funktor/rust_machine_learning (github.com)
References
- Class_05_dimensionality_reduction.pdf (edoliberty.github.io)
- Using Singular Value Decomposition to Build a Recommender System — MachineLearningMastery.com
- SVD Image Compression, Explained | dmicz devblog
- pseudo.pdf (caltech.edu)
- ln15.pdf (utep.edu)
- C5106_C045.tex (utexas.edu)
- Applied Numerical Linear Algebra. Lecture 9 (chalmers.se)
- arXiv:0909.4061v2 [math.NA] 14 Dec 2010
- people.duke.edu/~hpgavin/SystemID/References/Golub+Reinsch-NM-1970.pdf
- Vortrag2022.pdf (ethz.ch)
- chan.pdf (stanford.edu)
- Parallel Numerical Algorithms — Chapter 11 — QR Factorization (illinois.edu)