Why do we even care about tensor decomposition?
Tensor decomposition is an active area of research with many applications: signal processing, numerical linear algebra, computer vision, machine learning, graph analysis and more. Here, though, we will mainly learn about a technique that helps us train our image classifiers faster and compress our models so that they fit on the devices in our pockets and run efficiently there.
Speeding up training and reducing the memory footprint is of interest not only to big tech companies that want to run AI as mobile services, but also to practitioners with limited resources. DAWNBench is a great example of what you can achieve with limited resources as long as you know the right techniques.
Back to Basics
Before getting started, this post will assume that the reader is already familiar with basic deep learning concepts, convolutional neural networks and basic linear algebra.
If you are already familiar with matrix decomposition, then you can think of tensor decomposition as a higher-order matrix decomposition. If you are not familiar with it or need a refresher, here we will briefly go over some of the basic ideas. In general, we decompose matrices to solve systems of linear equations, to reduce the dimensionality of data, to fill in missing values, to find approximations for eigenvalues, and for many other cool applications.
Some of the popular matrix decomposition techniques are:
Singular Value Decomposition (SVD) is a fundamental algorithm that decomposes any given matrix into UΣV*. Here U and V are orthogonal (unitary) matrices and Σ is a diagonal matrix containing the singular values.
It generalizes the eigendecomposition of a positive semidefinite normal matrix to arbitrary matrices. In practice, the shifted QR algorithm can be used to find the singular values and compute the SVD faster.
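As a quick sanity check, here is a minimal sketch with PyTorch's torch.linalg.svd (the choice of library is mine for consistency with the rest of the post; NumPy works just as well):

```python
import torch

# Decompose a random matrix and verify that U @ diag(S) @ Vh reconstructs it.
A = torch.randn(5, 3, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)

# U has orthonormal columns, S holds the singular values in decreasing order
assert torch.allclose(U.T @ U, torch.eye(3, dtype=torch.float64))

A_rec = U @ torch.diag(S) @ Vh
assert torch.allclose(A, A_rec)
```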
Principal component analysis (PCA) is a very popular practical method in statistics and data science, and it can be thought of as a by-product of SVD. In SVD we decompose our matrix into orthogonal and diagonal components, and PCA tries to identify the orthogonal directions that maximize the variance explained in the data.
Let X = UΣV*. After ordering the singular values in decreasing order, the principal components can be written as rank-1 terms:
X1 = u1σ1v1*
X2 = u2σ2v2*
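The rank-1 terms above can be computed directly from the SVD of the centered data. This sketch uses random data purely for illustration:

```python
import torch

# PCA via SVD: center the data, take the SVD, and the rank-1 terms
# X_i = u_i * sigma_i * v_i^T are the ordered principal components.
torch.manual_seed(0)
X = torch.randn(100, 4, dtype=torch.float64)
Xc = X - X.mean(dim=0)                # center each feature

U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)

X1 = S[0] * torch.outer(U[:, 0], Vh[0])   # first principal component term
X2 = S[1] * torch.outer(U[:, 1], Vh[1])   # second principal component term

# Summing all rank-1 terms recovers the centered data exactly
full = sum(S[i] * torch.outer(U[:, i], Vh[i]) for i in range(len(S)))
assert torch.allclose(full, Xc)
```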
Non-negative matrix factorization (NMF) is widely used in recommender systems and topic modelling. NMF decomposes a non-negative data matrix into two components containing non-negative values by minimizing the reconstruction error, e.g. the Frobenius norm. In terms of optimization, the non-negativity constraint is not fundamental, but it increases the interpretability of the results obtained by the factorization.
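The idea can be sketched with the classic Lee–Seung multiplicative updates, one of several ways to minimize the Frobenius reconstruction error; the sizes, rank and iteration count here are illustrative:

```python
import torch

# NMF by multiplicative updates: X ≈ W @ H with W, H >= 0 elementwise.
torch.manual_seed(0)
X = torch.rand(20, 10, dtype=torch.float64)   # non-negative data matrix
r = 3                                         # chosen factorization rank
W = torch.rand(20, r, dtype=torch.float64)
H = torch.rand(r, 10, dtype=torch.float64)
eps = 1e-12                                   # avoid division by zero

err0 = torch.norm(X - W @ H)
for _ in range(500):
    # Each update keeps the factors non-negative and never increases the error
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

assert (W >= 0).all() and (H >= 0).all()
assert torch.norm(X - W @ H) < err0
```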
LU decomposition uses Gaussian elimination to decompose a matrix into the following two matrices: L, lower triangular, and U, upper triangular.
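A quick illustration with scipy.linalg.lu (an illustrative choice of library; note that SciPy returns a permutation matrix P from the partial pivoting, so that A = P @ L @ U):

```python
import numpy as np
from scipy.linalg import lu

# LU with partial pivoting: A = P @ L @ U, where L is lower triangular
# with unit diagonal and U is upper triangular.
A = np.array([[4., 3.], [6., 3.]])
P, L, U = lu(A)

assert np.allclose(P @ L @ U, A)
assert np.allclose(L, np.tril(L))   # L is lower triangular
assert np.allclose(U, np.triu(U))   # U is upper triangular
```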
QR is a very fundamental algorithm in linear algebra; it is used to answer three of the four main problems that linear algebra tries to solve: the least squares problem, the eigenvalue problem, and computing the SVD. For all of these problems, QR factorization is the workhorse algorithm.
There are three main ways of computing the QR decomposition: the Gram-Schmidt process, Householder reflections, and Givens rotations. QR decomposes a matrix into an orthogonal matrix and an upper triangular matrix.
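A short sketch with PyTorch's torch.linalg.qr (which uses Householder reflections under the hood), including a least-squares solve via Q and R; the matrix sizes are illustrative:

```python
import torch

# QR: A = Q @ R with Q having orthonormal columns and R upper triangular.
torch.manual_seed(0)
A = torch.randn(6, 3, dtype=torch.float64)
b = torch.randn(6, dtype=torch.float64)

Q, R = torch.linalg.qr(A)
assert torch.allclose(Q @ R, A)
assert torch.allclose(Q.T @ Q, torch.eye(3, dtype=torch.float64))

# Least squares: A x ≈ b  reduces to the triangular system R x = Q^T b
x = torch.linalg.solve_triangular(R, (Q.T @ b).unsqueeze(1), upper=True).squeeze(1)

# The normal equations A^T A x = A^T b hold for the least-squares solution
assert torch.allclose(A.T @ A @ x, A.T @ b)
```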
We can use the Cholesky decomposition to solve Ax = b and the least squares problem, although QR is still preferable to Cholesky in terms of numerical stability. Cholesky factorization is a special case of LU where A is positive definite, and the algorithm is stable. It decomposes A into a lower triangular matrix and the conjugate transpose of that matrix.
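A minimal sketch with torch.linalg.cholesky, solving Ax = b via two triangular solves; the matrix is made positive definite by construction:

```python
import torch

# Cholesky: A = L @ L^T for positive definite A, then solve A x = b
# with a forward and a backward triangular solve.
torch.manual_seed(0)
M = torch.randn(4, 4, dtype=torch.float64)
A = M @ M.T + 4 * torch.eye(4, dtype=torch.float64)  # positive definite
b = torch.randn(4, 1, dtype=torch.float64)

L = torch.linalg.cholesky(A)            # lower triangular factor
assert torch.allclose(L @ L.T, A)

y = torch.linalg.solve_triangular(L, b, upper=False)    # L y = b
x = torch.linalg.solve_triangular(L.T, y, upper=True)   # L^T x = y
assert torch.allclose(A @ x, b)
```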
Both CP (CANDECOMP/PARAFAC) and Tucker decompositions can be considered to be higher order generalizations of the matrix singular value decomposition (SVD) and principal component analysis (PCA).
Canonical Rank: Minimal possible R in CP decomposition.
In 2D, a low-rank approximation can be computed in a stable way using the SVD, or, if the matrix is large, by rank-revealing algorithms. This is not the case for the CP decomposition when d > 2, as there is no finite algorithm for determining the canonical rank of a tensor. Therefore, most algorithms approximate a tensor with different values of R until the approximation error becomes small enough.
Here is some terminology about tensors.
NLS: non-linear least squares, which minimizes the L2-norm of the approximation residual for a user-defined fixed R.
fiber: the vector obtained by fixing every index of a tensor but one.
row: a mode-2 fiber
column: a mode-1 fiber
slice: a two-dimensional section of a tensor
Third-order tensors have column, row, and tube fibers, denoted x:jk, xi:k and xij:.
horizontal slice: Xi::
lateral slice: X:j:
frontal slice: X::k
||X||: the norm of a tensor is the square root of the sum of the squares of all its elements.
The inner product of two same-sized tensors is the sum of the products of their entries:

import torch
import tensorly as tl
tl.set_backend('pytorch')  # using the PyTorch backend

A = torch.randint(0, 100, (3, 3, 3))
B = torch.randint(0, 100, (3, 3, 3))
assert tl.tenalg.inner(A, B) == torch.sum(A * B)
⟨X, X⟩ = ||X||²
An N way tensor or array is rank one if it can be written as the outer product of N vectors:
X = a o b o c
o: denotes outer product
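For instance, a rank-one third-order tensor can be built with an outer product via einsum; the vectors here are arbitrary:

```python
import torch

# A rank-one 3-way tensor X = a o b o c: element (i, j, k) is a_i * b_j * c_k.
a = torch.tensor([1., 2.])
b = torch.tensor([3., 4., 5.])
c = torch.tensor([6., 7.])

X = torch.einsum('i,j,k->ijk', a, b, c)
assert X.shape == (2, 3, 2)
assert X[1, 2, 0].item() == (a[1] * b[2] * c[0]).item()   # 2 * 5 * 6
```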
If a tensor has the same size along every dimension (every mode is the same size), then it is called cubical.
Matricization: unfolding a tensor along a given mode (or modes) to turn it into a matrix.
Important Matrix Operations
Kronecker product: the generalization of the outer product from vectors to matrices.
Khatri-Rao product: the column-wise Kronecker product of two matrices with the same number of columns.
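A small sketch contrasting the two, using torch.kron and building the Khatri-Rao product column by column (tensorly also provides tenalg.kronecker and tenalg.khatri_rao):

```python
import torch

# Kronecker vs Khatri-Rao: the Khatri-Rao product is the column-wise
# Kronecker product, so both inputs need the same number of columns.
torch.manual_seed(0)
A = torch.randn(3, 2, dtype=torch.float64)
B = torch.randn(4, 2, dtype=torch.float64)

K = torch.kron(A, B)                       # shape (3*4, 2*2)
assert K.shape == (12, 4)

# Khatri-Rao: one Kronecker product per matching pair of columns
KR = torch.stack([torch.kron(A[:, r], B[:, r]) for r in range(2)], dim=1)
assert KR.shape == (12, 2)

# Its columns are a subset of the Kronecker product's columns
assert torch.allclose(KR, K[:, [0, 3]])
```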
Now, we will look into a particular tensor decomposition algorithm called CP Decomposition.
The basic idea behind CP decomposition is that you fix a rank R, and a sum of R rank-1 tensors (each an outer product of vectors) is your approximation. Still, determining the rank of a tensor is a hard problem.
tensorly is a great library for working with tensors, and it provides off-the-shelf functions for tensor decomposition. In this library, CP decomposition minimizes the Frobenius norm of the difference between the reconstructed tensor and the original tensor using ALS (alternating least squares).
Here is our objective function: for a fixed rank R, minimize the squared Frobenius norm ||X − (a1 o b1 o c1 + … + aR o bR o cR)||² over the factor vectors ar, br, cr.
But we can also implement CP decomposition ourselves using PyTorch and the Adam optimizer (or any other optimizer). Below is a code snippet for decomposing a 3-way tensor via CP decomposition with Adam; the learning rate was adjusted by trial and error, so it may need to change for your own case. I also compared reconstruction errors for different ranks between tensorly and my implementation, and the difference was negligible. So we can say that this is a PyTorch version of the same optimization problem.
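The original snippet isn't reproduced here, so below is a sketch of what such an implementation could look like; the function name cp_adam, the learning rate, and the step count are my own illustrative choices and may need tuning for other tensors:

```python
import torch

def cp_adam(X, rank, steps=3000, lr=0.02):
    # One factor matrix per mode; this einsum string assumes a 3-way tensor.
    factors = [torch.randn(s, rank, dtype=X.dtype, requires_grad=True)
               for s in X.shape]
    opt = torch.optim.Adam(factors, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Reconstruction: sum over r of the outer products a_r o b_r o c_r
        approx = torch.einsum('ir,jr,kr->ijk', *factors)
        loss = ((X - approx) ** 2).sum()   # squared Frobenius norm
        loss.backward()
        opt.step()
    return factors

# Sanity check on an exactly rank-2 tensor
torch.manual_seed(0)
true = [torch.randn(s, 2, dtype=torch.float64) for s in (4, 5, 6)]
X = torch.einsum('ir,jr,kr->ijk', *true)

factors = cp_adam(X, rank=2)
X_hat = torch.einsum('ir,jr,kr->ijk', *factors).detach()
rel_err = (torch.norm(X - X_hat) / torch.norm(X)).item()
assert rel_err < 0.1
```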
Here is a plot of approximation errors for decompositions of different ranks. As you will notice, the reconstruction error decreases as we increase the rank, which is the expected behavior.
In part 2 of this blog post, I will explain how we can use CP decomposition to decompose the convolutional layers of a CNN and reduce the memory footprint of an image classifier like VGG, so stay tuned!
Thanks for reading, I hope you enjoyed it!