The SVD “transpose trick” — save time and make heavy matrix calculations possible

Benchmark results and reproducible code for an SVD shortcut in Python that is easily translatable into other languages. Fair warning: the method helps only if the number of data examples (e.g. images) is significantly smaller than the number of features (e.g. pixels).

Noname Noname
6 min read · May 17, 2020
Boost the snail in your matrix calculations (photo by Peggychoucair)

Singular value decomposition (known as SVD) is a common tool from linear algebra used in data science, machine learning and other areas of applied mathematics. SVD finds the singular vectors and singular values of a given matrix (which coincide with the eigenvectors and eigenvalues for the symmetric matrices used below) and is used, among other things, for principal component analysis (PCA) and zero-phase component analysis (ZCA).

Unfortunately, SVD calculations are veeery slow for the large matrices that frequently occur with big datasets in machine learning and data science. Examples of SVD succumbing to the sheer size of matrices can be found in this and this issue of keras (a deep learning library written in Python) regarding the use of the ZCA option. For example, on my laptop's CPU (Intel 7th Gen Core i7-Y), calculating the SVD of a data matrix with 19200 features (80 pixels * 80 pixels * 3 colour channels = 19200, actually quite small for images) would take hours, if not days, or it would crash my Python kernel.

So, is there a way to make SVD quicker? Yes. For one, if you are using Python, note that for SVD the package Jax is faster than scipy, which in turn is faster than numpy. This advice can gain you a speed-up of up to a factor of 10 on small images (ca. 50 pixels * 50 pixels), but if you need to go beyond that, you should use one of the standard approximations of SVD, which trade accuracy for speed: randomized SVD in scikit-learn, truncated SVD in irlb, Lanczos-method SVD and others. The downside of these approximations is that they require user-specified parameters, or that they sometimes converge and sometimes do not, even for the very same matrix.
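For illustration, here is a minimal sketch of two such approximations: scikit-learn's randomized_svd and scipy's ARPACK-based svds (standing in here for the truncated/Lanczos-type options). The matrix sizes and the number of components k are arbitrary choices for this sketch and are not part of the benchmarks below:

import numpy as np
from scipy.sparse.linalg import svds               # Lanczos-type truncated SVD (ARPACK)
from sklearn.utils.extmath import randomized_svd   # randomized SVD from scikit-learn

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 3072))  # m examples x n features, like flattened images

k = 50  # number of singular values/vectors to keep (a user-specified parameter!)

# randomized SVD: fast, accuracy depends on k and the oversampling settings
u_r, s_r, vt_r = randomized_svd(A, n_components=k, random_state=0)

# truncated Lanczos-type SVD: iterative, convergence speed depends on the matrix
u_l, s_l, vt_l = svds(A, k=k)

print(s_r[:5])                  # largest singular values, descending
print(np.sort(s_l)[::-1][:5])   # svds returns them in ascending order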

A simpler way: It turns out there is another SVD calculation method, widely known as the “transpose trick”, that involves no additional parameters and works by just transposing your initial data matrix X with dimensions m = number of examples, n = number of features. It has to be stated that the trick is faster than regular SVD only if m < n!
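As a rough back-of-the-envelope estimate (my own, not a measured benchmark): a dense SVD or eigendecomposition of a k x k matrix costs on the order of k^3 operations, so the trick replaces roughly

O(m n^2 + n^3)   (build the (n, n) covariance matrix, then decompose it)

with

O(n m^2 + m^3)   (build the (m, m) matrix, decompose it, map the eigenvectors back),

which is a huge saving when m is much smaller than n.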

Let’s see an example of SVD being applied in machine learning: the ZCA calculation in keras. What I show below holds whenever you apply SVD, regardless of whether it is for PCA, ZCA or something else; I just decided to use ZCA as my example. A minimal version of the code, taken from here, looks like this:

from keras.datasets import cifar10
import numpy as np
from scipy import linalg
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X = X_train[:1000]
flat_x = np.reshape(X, (X.shape[0], X.shape[1] * X.shape[2] * X.shape[3]))
# normalize x:
flat_x = flat_x / 255.
flat_x = flat_x - flat_x.mean(axis=0)
#calculating the covariance matrix sigma
sigma = np.dot(flat_x.T, flat_x) / flat_x.shape[0]
u, s, _ = linalg.svd(sigma)
s_inv = 1. / np.sqrt(s[np.newaxis] + 0.1) # the 0.1 is the epsilon value for zca
principal_components = (u * s_inv).dot(u.T)
whitex = np.dot(flat_x, principal_components)

The bulk of the calculation time gets spent in the line

u, s, _ = linalg.svd(sigma)

with sigma having the shape (n, n), n = number of features; in this particular case of cifar10, n = 32*32*3 = 3072 pixels. On my computer, executing the full ZCA code above takes ca. 28 sec. The “transpose trick” says that if the number m of examples (images) is smaller than the number n of features (pixels), you can save calculation time by transposing X and calling linalg.svd on a sigma of shape (m, m), with m = 1000 in our case. You can implement this change with just two modifications in the above code:

from keras.datasets import cifar10
import numpy as np
from scipy import linalg
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X = X_train[:1000]
flat_x = np.reshape(X, (X.shape[0], X.shape[1] * X.shape[2] * X.shape[3]))
# normalize x:
flat_x = flat_x / 255.
flat_x = flat_x - flat_x.mean(axis=0)
# CHANGES HAPPEN BELOW.
# if m >= n execute the svd as usual
if flat_x.shape[0] >= flat_x.shape[1]:
    sigma = np.dot(flat_x.T, flat_x) / flat_x.shape[0]
    u, s, _ = linalg.svd(sigma)
# and if m < n do the transpose trick
if flat_x.shape[0] < flat_x.shape[1]:
    sigma2 = np.dot(flat_x, flat_x.T) / flat_x.shape[0]   # shape (m, m) instead of (n, n)
    u2, s, _ = linalg.svd(sigma2)
    u = np.dot(flat_x.T, u2) / np.sqrt(s * flat_x.shape[0])  # map the eigenvectors back to feature space
s_inv = 1. / np.sqrt(s[np.newaxis] + 0.1) # the 0.1 is the epsilon value for zca
principal_components = (u * s_inv).dot(u.T)
whitex = np.dot(flat_x, principal_components)

Executing this code with the “transpose trick” (the flat_x.shape[0] < flat_x.shape[1] branch) reduces the calculation time from 28 sec to only 2 sec! This is quite an achievement, but what about the “trick” results, how close are they to the standard SVD solution? They are identical up to numerical precision; the whitened cifar10 images from above look exactly the same with either approach.
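If you want to convince yourself numerically, here is a small self-contained check on a random matrix (the sizes are arbitrary and this is separate from the cifar10 example): it compares the eigenvalues from both routes and verifies that the vectors recovered by the trick really are eigenvectors of sigma.

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
m, n = 200, 1000          # fewer examples than features, so the trick applies
flat_x = rng.standard_normal((m, n))

# standard route: eigenvectors of the (n, n) covariance matrix
sigma = np.dot(flat_x.T, flat_x) / m
u, s, _ = linalg.svd(sigma)

# transpose trick: eigenvectors of the (m, m) matrix, mapped back
sigma2 = np.dot(flat_x, flat_x.T) / m
u2, s2, _ = linalg.svd(sigma2)
u_trick = np.dot(flat_x.T, u2) / np.sqrt(s2 * m)

print(np.allclose(s[:m], s2))                           # same non-zero eigenvalues
print(np.allclose(sigma @ u_trick, u_trick * s2))       # u_trick are eigenvectors of sigma
print(np.allclose(np.abs(u[:, :m]), np.abs(u_trick)))   # same vectors, up to sign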

The improvement in calculation time becomes even more dramatic for larger images, i.e. a greater number of features n. When I try the standard SVD on a sigma with sigma.shape = (19200, 19200), where n = 19200 = 80 pixels * 80 pixels * 3 channels, it crashes my Python kernel; my computer just can't handle it. On the other hand, if I use only m = 500 images and apply the transpose trick, I get the results in < 1 sec.
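A back-of-the-envelope estimate (my own numbers, not measured in this benchmark) hints at why the standard route fails here: a dense 19200 x 19200 matrix of 64-bit floats alone occupies 19200^2 * 8 bytes ≈ 2.9 GB, and the SVD routine needs several working arrays of that size, while the (500, 500) matrix of the transpose trick takes only 500^2 * 8 bytes = 2 MB.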

For the curious ones — here is the mathematical justification

It’s friendly and goes like this: We want to find the eigenvectors of the matrix sigma = np.dot(flat_x.T, flat_x) (strictly speaking we want its SVD, but SVD is a generalization of the eigendecomposition to non-square matrices, and for a symmetric matrix like sigma the two coincide, so we can treat them as the same in this explanation), and this means that we want to find u such that
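X^T X u = λ u,    where X stands for flat_x and λ is the corresponding eigenvalue.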

We would like to do this with the line u, s, _ = linalg.svd(sigma) above, but for m < n this turns out to be very time consuming. So we do the transpose trick and find instead the eigenvectors u2 of np.dot(flat_x, flat_x.T), meaning they fulfil the equation
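X X^T u2 = λ u2.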

We do this in the line u2, s, _ = linalg.svd(sigma2) above. Alright, but how does u2 help us find u? This is answered by the mathematical expressions here (slide 30) and here, which look like this for our case:
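1.  X^T X u = λ u
2.  X X^T X u = X λ u
3.  (X X^T) (X u) = λ (X u)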

Line 1 shows what we actually want to achieve (get u). In line 2 we just multiply both sides of the equation with X from the left. In line 3 we pull λ to the front on the right side of the equation, and on the left side the brackets are reordered to display the transpose trick: X u is an eigenvector of np.dot(flat_x, flat_x.T), i.e. it is (up to normalization) our u2. Once we have u2, equation 1 from above shows us that multiplying it by X^T gives our long-desired u:
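X^T u2 ∝ X^T (X u) = (X^T X) u = λ u ∝ u.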

And this is the mathematical justification for why calculating the eigenvectors u2 of np.dot(flat_x, flat_x.T) and then multiplying u2 with flat_x.T gives us u. Equation 3 from above also shows that u and u2 belong to the same eigenvalues λ.

The part 1/np.sqrt(s*flat_x.shape[0]) is just a normalization factor that rescales the eigenvectors to unit length, and it is determined by the eigenvalues λ. You will find versions of this factor without flat_x.shape[0] when sigma2 is built without the division, i.e. as sigma2 = np.dot(flat_x, flat_x.T).
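To see where the factor comes from, here is a one-line derivation in the notation above: if u2 is a unit-length eigenvector of sigma2 = X X^T / m with eigenvalue s, then

|X^T u2|^2 = u2^T X X^T u2 = m · u2^T sigma2 u2 = m · s,

so dividing X^T u2 by np.sqrt(s * m) makes the resulting u a unit vector (here m = flat_x.shape[0]).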
