# Singular Value Decomposition (SVD) Tutorial: Applications, Examples, Exercises

## A complete tutorial on the singular value decomposition method

Every so often, maybe once or twice a decade, a new mathematical technique or algorithm comes along that changes the way we do things. Maybe the method starts out in a small niche or field but eventually expands to many other, completely unrelated disciplines and you cannot stop thinking of new uses for it. We’re talking about techniques like fast Fourier decomposition, Monte Carlo integration, simulated annealing, Runge Kutta integration, and pseudo-random number generation.

A friend of the Statsbot team, Peter Mills, calls such methods “power tools.” We asked him to tell us about one of these tools — singular value decomposition, or SVD, with examples and applications.

This singular value decomposition tutorial assumes you have a good working knowledge of both matrix algebra and vector calculus. We start with a short history of the method, then move on to the basic definition, including a brief outline of numerical procedures. This is then followed by a more intuitive derivation meant to demonstrate the meaning of singular value decomposition and then to a pair of common applications.

The final section works out a complete program that uses SVD in a machine-learning context. To help you become more familiar with the material, exercises are provided throughout.

# History

The technique of singular value decomposition, or SVD for short, has a long and somewhat surprising history. It started out in the social sciences with intelligence testing. Early intelligence researchers noted that tests given to measure different aspects of intelligence, such as verbal and spatial, were often closely correlated.

Because of this, they hypothesized that there was a general measure of intelligence in common, which they called “g,” for “general intelligence,” now more commonly known as “I.Q.” So they set about teasing out the different factors that made up intelligence so as to pull out the most important one.

SVD is known under many different names. In the early days, as the above passage implies, it was called, “factor analysis.” Other terms include principal component (PC) decomposition and empirical orthogonal function (EOF) analysis. All these are mathematically equivalent, although the way they are treated in the literature is often quite different.

Today, singular value decomposition has spread through many branches of science, in particular psychology and sociology, climate and atmospheric science, and astronomy. It is also extremely useful in machine learning and in both descriptive and predictive statistics.

# Technical introduction

Singular value decomposition is a method of decomposing a matrix into three other matrices:

Where:

• A is an m × n matrix
• U is an m × n orthogonal matrix
• S is an n × n diagonal matrix
• V is an n × n orthogonal matrix

The reason why the last matrix is transposed will become clear later on in the exposition. Also, the term, “orthogonal,” will be defined (in case your algebra has become a little rusty) and the reason why the two outside matrices have this property made clear.

For the moment, we will assume that m ≥ n. What happens when this isn’t true is quite interesting and is one of the keys, in my opinion, to understanding singular value decomposition.

This is already becoming quite complicated so I will rewrite Equation (1) using summation notation. This is my go-to method of proceeding whenever I am having trouble with a matrix equation. In this case, while it doesn’t make anything simpler, it does make everything absolutely explicit:

Note how we’ve collapsed the diagonal matrix, S, into a vector, thus simplifying the expression into a single summation. The variables, {sᵢ}, are called singular values and are normally arranged from largest to smallest:

The columns of U are called left singular vectors, while those of V are called right singular vectors.

We know that U and V are orthogonal, that is:

Where I is the identity matrix. Only the diagonals of the identity matrix are 1, with all other values being 0. Note that because U is not square we cannot say that U Transpose(U)=I, so U is only orthogonal in one direction.

Using the orthogonality property, we can rearrange (1) into the following pair of eigenvalue equations:

## Numerical procedure

Since Transpose(A)A is the same size or smaller than A Transpose(A), a typical procedure is to plug Equation (3) into an eigenvalue calculator to find V and and then find U by projecting A onto V:

Note that the method is completely symmetric; U and V change places when A is transposed:

Thus, if m < n, we can transpose A, perform the decomposition, then swap the roles of U and V.

In this case, U will be an m × m square matrix since there can be at most m non-zero singular values, while V will be an n × m matrix.

## Exercises

1. Use Equations (2) and (3) to show that both U and V are orthogonal and that the eigenvalues, {sᵢ²}, are all positive.
2. Show that if m < n there will be at most m non-zero singular values.
3. Show that the eigenvalues in Equations (2) and (3) must be one and the same.

# Understanding SVD

Above is just the dry, technical description. It doesn’t give us an intuitive feel for what the method is doing. So let’s imagine the simplest example in two dimensions. It generalizes very naturally to higher dimensions.

Suppose we have two, two-dimensional vectors, x₁=(x, y), and x₂=(x₂, y₂). We can fit an ellipse with major axis, a, and minor axis, b, to these two vectors as shown in the figure. But to make things easier on ourselves and save typing, we write out the equations using matrix algebra.

We can construct an ellipse of any size and orientation by stretching and rotating a unit circle.

Let x’=(x’, y’) be the transformed coordinates:

where R is a rotation matrix:

and M is a diagonal matrix containing the major and minor axes:

Lets write this out term-by-term, both for the general case:

where mᵢ is the ith diagonal of the matrix, M, and for the two-dimension case:

Note that the rotation is clockwise, opposite the usual sense because we are going from the untransformed to the transformed coordinate system rather than the other way around. For the resulting ellipse, the angle will be in the usual, counter-clockwise sense.

The equation for a unit circle is as follows:

We wish to fit a set of x’s, which we collect as the rows of a matrix, X:

The resulting matrix equation is given:

This is just a rearrangement of equation (3). If we regard A as a collection of points, then the singular values are the axes of a least squares fitted ellipsoid while V is its orientation. The matrix U is the projection of each of the points in A onto the axes.

# Applications

The sum of the squares of the singular values should be equal to the total variance in A. Thus, the size of each tells you how much of the total variance is accounted for by each singular vector. You can create a truncated SVD containing, for instance, 99% of the variance:

where p<n is the number of singular values that we’ve decided to keep. Typically, this will be fewer than the top ten (p=10) singular values. It is in this facility of singular value decomposition to exclude the least significant components of A that much of its power lies.

## Solving matrix equations

Some more rearrangement of (1) shows that SVD can be used for solving systems of linear equations:

or, in summation notation:

If this was all there was to it, there would be little to recommend SVD over simpler matrix solvers, such as QR decomposition or even Gaussian elimination. In many cases, however, the matrix will be ill-conditioned, making the solution unstable so that it blows up or produces floating point overflow. This will show up in the singular values: the smallest ones will be very close to zero as measured relative to the largest. To produce a stable solution, we just throw these components away as in Equation (6), above.

It also generalizes to non-square matrices. Since an m × n matrix, where m > n, will have only n singular values, in SVD this is equivalent to solving an m × m matrix using only n singular values. For non-square matrices, matrix inversion using singular value decomposition is equivalent to solving the normal equation:

and produces the solution for x that is closest to the origin, that is, of minimal length. The normal equation is the solution to the following minimization problem:

Thus, they are both generalized, linear, least squares fitting techniques.

## Data reduction

A typical machine learning problem might have several hundred or more variables, while many machine learning algorithms will break down if presented with more than a few dozen. This makes singular value decomposition indispensable in ML for variable reduction.

We have already seen in Equation (6) how an SVD with a reduced number of singular values can closely approximate a matrix. This can be used for data compression by storing the truncated forms of U, S, and V in place of A and for variable reduction by replacing A with U. Results will need to be transformed back to the original coordinate system by multiplying with S and V in accordance with Equation (4).

A full example, including computer code, will be worked out in the Example section, below.

## Exercises

1. Show that Equation (7) is equivalent to Equation (8), the normal equation.
2. Use least squares minimization in Equation (9) to derive the normal equation.

# Example: variable reduction

Suppose that the m × n matrix, A, stores a set of training data with each training vector taking up one row as in (5) and that n, the dimension of each vector, is very large.

We want to feed A to a clustering algorithm that outputs a fixed number of cluster centers. Because n is large, however, the algorithm takes too long or is unstable, so we want to reduce the number of variables using SVD. We want to do this in a way that’s transparent so it looks like we are still working in the un-transformed coordinates.

## Step 1: Reading in the data

To begin with, we will need to read in the data to fill up A. I’m not a big fan of Python and think that C or C++ are better languages for machine learning applications.

`//subroutine header for performing cluster analysis:#include "cluster.h"//maximum number of clusters:#define MAX_CLUSTER 10int main(int argc, char **argv) {  char *infile;            //input file  char *outfile;           //output file  FILE *fs;                //file pointer  double **a;              //matrix of training data/U  int m;                   //number of rows in matrix  int n;                   //number of columns in matrix  int nsv;                 //number of singular values  if (argc!=4) {    printf("syntax: cluster_svd nsv train centers\n");    printf("  where:\n");    printf("nsv      = number of singular values (0 = use untransformed data)\n");    printf("infile   = ASCII input file containing training data\n");    printf("output   = ASCII output file containing cluster centers\n");    printf("\n");    printf("file format:\n");    printf("- one line header containing number of rows and number of columns\n");    printf("- row major list of each matrix element\n");    exit(1);  }  if (sscanf(argv, "%d", &nsv)!=1) {    fprintf(stderr, "Error parsing first command line argument\n");    exit(1);  }  infile=argv;  outfile=argv;  fs=fopen(infile, "r");  if (fs==NULL) {    fprintf(stderr, "Error opening input file, %s\n", infile);    exit(1);  }  if (fscanf(fs, "%d %d", &m, &n)!=2) {    fprintf(stderr, "Format error in input file: %s line 0", infile);    exit(1);  }  if (nsv>n) {    fprintf(stderr, "Command line parameter nsv=%d out of range\n", nsv);    exit(1);  }  a=new double *[m];  a=new double[m*n];  for (int i=1; i<m; i++) a[i]=a+i*n;  for (int i=0; i<m; i++) {    for (int j=0; j<n; j++) {      if (fscanf(fs, "%lg", a[i]+j)!=1) {        fprintf(stderr, "Format error in input file, %s, line %d\n", infile, i);        exit(1);      }    }  }  fclose(fs);`

## Step 2: Performing SVD

We are using a canned singular value decomposition routine that is contained in the header file, svd.h:

`#ifndef SVD_H#define SVD_H//subroutine for singular value decomposition:int                       //returns an error code (0 for success)     svd (double **a,     //input matrix--replaced by U on output                int m,        //number of rows                int n,        //number of columns                double *s,    //singular values                double **vt); //V--right singular vectors#endif`

SVD routines are often more complicated than this, particularly in regards to the matrix and vector types used, but it would be straightforward to encapsulate the whole thing in a “wrapper” function.

Using the singular value decomposition routine is equally straightforward. Continuing from the previous section:

`  double *ave;  double *s;               //singular values  double **vt;             //right singular vectors  //first we calculate and remove the arithmetic means:  ave=new double[n];  for (int i=0; i<n; i++) ave[i]=0;  for (int i=0; i<m; i++) {    for (int j=0; j<n; j++) {      ave[j]+=a[i][j];    }  }  for (int i=0; i<n; i++) ave[i]/=m;  for (int i=0; i<m; i++) {    for (int j=0; j<n; j++) {      a[i][j]-=ave[j];    }  }  if (nsv>0) {    //make space for singular values:    s=new double[n];    //make space for right singular vectors:    vt=new double *[n];    vt=new double[n*n];    for (int i=1; i<n; i++) vt[i]=vt+i*n;    //perform the decomposition:    int err=svd(a, m, n, s, vt);    if (err!=0) {      fprintf(stderr, "Error in svd subroutine\n");      exit(err);    }  }`

## Step 3: Performing the cluster analysis

The operation of the clustering algorithm generates a set of c cluster centers, {𝝁ᵢ ; i ∈ [1, c]}:

`#ifndef CLUSTER_H#define CLUSTER_Hint                            //returns number of cluster centers    cluster (double ** x,      //training vectors                int m,         //number of training vectors                int n,         //dimension of each vector                int max_nc,    //maximum number of cluster centers                double **mu);  //returned cluster centers#endif`

In the third part of the program, continuing from above, we generate the cluster centers:

`double **mu_p;      //matrix of cluster centers  int nc;           //number of cluster centers  //make space for cluster centers:  mu_p=new double *[MAX_CLUSTER];  mu_p=new double[MAX_CLUSTER*n];  for (int i=1; i<MAX_CLUSTER; i++) mu_p[i]=mu_p+i*n;  if (nsv>0) {    //make space for cluster centers:    nc=cluster(a, m, nsv, MAX_CLUSTER, mu_p);  } else {    //make space for cluster centers:    nc=cluster(a, m, n, MAX_CLUSTER, mu_p);  }  if (nc <= 0) {    fprintf(stderr, "Cluster algorithm failed");    exit(-1);  }`

Because the clustering algorithm used the transformed training data, cluster centers will be in the transformed system:

or:

Writing it out component-by-component:

where p is the number of coordinates in the reduced system.

## Step 4: Storing the results

We want to store the cluster centers in the un-transformed coordinate system, thus we want to apply Equation (10).

`  double **mu;        //cluster centers in un-transformed coords  //allocate space for the un-transformed cluster centers:  mu=new double *[nc];  mu=new double[nc*n];  for (int i=1; i<nc; i++) mu[i]=mu+i*n;  //perform the coordinate transformation:  for (int i=0; i<nc; i++) {    for (int j=0; j<n; j++) {      mu[i][j]=ave[j];      if (nsv>0) {        for (int k=0; k<nsv; k++) mu[i][j]+=vt[k][j]*s[k]*mu_p[i][k];      } else {        mu[i][j]+=mu_p[i][j];      }    }  }  //write the results to a file:  fs=fopen(outfile, "w");  if (fs==NULL) {    fprintf(stderr, "Error opening output file, %s\n", outfile);    exit(1);  }  fprintf(fs, "%d %d\n", nc, n);  for (int i=0; i<nc; i++) {    for (int j=0; j<n; j++) fprintf(fs, "%16.8lg", mu[i][j]);    fprintf(fs, "\n");  }  fclose(fs);  //clean up:  delete [] mu;  delete [] mu;  delete [] mu_p;  delete [] mu_p;  delete [] ave;  delete [] a;  delete [] a;  if (nsv>0) {    delete [] s;    delete [] vt;    delete [] vt;  }  return 0;}`

## Exercises

1. How would you improve the example program?
2. Do you think the example program should be split into two? If so, how would you split it?
3. Encapsulate the vector multiplication routine in a subroutine called matrix_times_vector. Design the calling syntax for the subroutine.
4. Using struct, typedef or class, encapsulate both vectors and matrices into a pair of abstract types called vect and matrix, respectively. Design an API for the types.
5. Find at least three libraries online that do the above.
6. Find at least two libraries for performing SVD.
7. Encapsulate them within the svd subroutine, above.
8. Find at least two libraries for performing eigenvalue decomposition.
9. Encapsulate these within the svd subroutine.
10. Find an eigenvalue library with the option to return only the top k eigenvalues/vectors. Encapsulate this within the following singular value decomposition subroutine:
`int svd(double **a,          //matrix -- replaced by u on output  int m,                     //number of rows  int n,                     //number of columns  int k,                     //desired number of singular values  double*s,                  //singular values  double **vt);              //right singular vectors`

# Conclusion

In this tutorial, we have defined singular value decomposition and shown just a tiny fraction of the uses to which it can be put. The method can also be used to retrieve atmospheric variables from satellite measurements; it can be used to interpolate sparse measurements; and it can be used directly as a machine learning technique both for classification and regression as well as many, many other things. See if you can come up with something for your own project.