Principal Component Analysis — Stepping into the details of the math

Sho Nakagome
sho.jp
Feb 22, 2019


Today, we're going to dive a little deeper into a famous dimensionality reduction technique called Principal Component Analysis (PCA).

If you don’t know what PCA does, check the following video (5 mins)

Or if you like reading texts instead, check below:

In short, PCA projects the data onto a lower-dimensional space by maximizing the variance of the data along each principal component axis (or, equivalently, minimizing the distance from the data points to that axis).

First, let’s see the common way of performing PCA using eigendecomposition.

Let’s begin by defining a data matrix X. The dimension of X is “n” by “f”. You can think of it as a matrix with “n” samples (rows) and “f” features (columns).
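In symbols:

```latex
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1f} \\
x_{21} & x_{22} & \cdots & x_{2f} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nf}
\end{bmatrix}
\in \mathbb{R}^{n \times f}
```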

Before running PCA, it’s always better to center the data, because in the end, PCA is just rotating the data to project it onto the principal components.
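Centering just means subtracting each feature column’s mean. Writing the centered matrix as X_c (with 1_n a column of n ones and μ the vector of column means, notation introduced here for convenience):

```latex
X_c = X - \mathbf{1}_n \, \boldsymbol{\mu}^\top,
\qquad
\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}
```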

Of course, we only subtracted the mean from each feature column, so the dimension hasn’t changed.

Now, if you read about how to calculate PCA, one of the common steps is computing the sample covariance matrix from the centered data.
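With the centered data X_c, the sample covariance matrix is:

```latex
C = \frac{1}{n-1} X_c^\top X_c \;\in\; \mathbb{R}^{f \times f}
```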

Then you run either eigendecomposition or singular value decomposition (I’ll explain the SVD version later in the story).
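The eigendecomposition of the covariance matrix can be written as:

```latex
C = Q \Lambda Q^\top,
\qquad
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_f),
\qquad
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_f
```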

In short, the eigenvector matrix Q contains the principal components, and Λ is the diagonal matrix of eigenvalues. The eigenvalues correspond to the magnitude of the PCs, so you order them from largest to smallest: the larger the eigenvalue, the more important the corresponding PC.

So to reduce the dimensionality, you just pick a few principal components, let’s say “s” of them. Now you have a matrix Q_s containing the selected principal components.

In the end, if you want to project the original data matrix onto the principal component axes, you just multiply the (centered) data matrix with the principal component matrix above.
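Calling the projected (dimension-reduced) data T, just as a label:

```latex
Q_s = \begin{bmatrix} q_1 & q_2 & \cdots & q_s \end{bmatrix} \in \mathbb{R}^{f \times s},
\qquad
T = X_c \, Q_s \in \mathbb{R}^{n \times s}
```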

That’s it! Easy!

Wait a minute…

Yeah, calculating or implementing PCA yourself is not that hard. But I was always wondering how

“Maximizing variance along the principal components”

corresponds to

“Performing eigendecomposition of the covariance matrix”

Apparently, there are few articles explaining this. Among the few I found are the links I posted earlier, so let me explain it here.

Suppose you have a principal component vector v.

Now, let’s use that to project the data onto the PC.
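Writing the projected data as a vector z (one value per sample, a label introduced here):

```latex
z = X_c \, v \;\in\; \mathbb{R}^{n}
```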

If this is the projected data, the variance should be maximized, right? Because that was the whole point of PCA.

So first, let’s calculate the sample variance.
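Because the data is centered, z has zero mean, so the sample variance of the projection (the scalar called “Q” below, not to be confused with the eigenvector matrix Q from earlier) is:

```latex
Q(v) = \frac{1}{n-1} z^\top z
     = \frac{1}{n-1} v^\top X_c^\top X_c \, v
     = v^\top C \, v
```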

Since the whole purpose of PCA is to maximize the variance along the projected axis, we are going to do exactly that in the equation. But we need to be careful: if we don’t set a constraint, say we allow a very large “v”, then we can make the variance “Q” arbitrarily large. To avoid this, we constrain “v” to be a unit vector.

Now, we have a problem.

  1. We want to maximize “Q”
  2. But at the same time, we are constrained to keep “v” a unit vector (written out right after this list)
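Written as a constrained optimization problem:

```latex
\max_{v} \; v^\top C \, v
\qquad \text{subject to} \qquad
v^\top v = 1
```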

To solve this kind of problem, people usually use a Lagrange multiplier.
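Setting up the Lagrangian with multiplier λ and taking the derivative with respect to v:

```latex
\mathcal{L}(v, \lambda) = v^\top C \, v - \lambda \left( v^\top v - 1 \right),
\qquad
\frac{\partial \mathcal{L}}{\partial v} = 2 C v - 2 \lambda v = 0
\;\;\Longrightarrow\;\;
C v = \lambda v
```

At such a v, the variance is Q(v) = vᵀC v = λ vᵀv = λ, so picking the eigenvector with the largest eigenvalue gives the largest variance.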

(Note that to be more precise, you would also need to check the other derivatives, but we are ignoring those here.)

Solving the Lagrange condition shows that, to maximize the variance, the vector “v” has to be an eigenvector of the covariance matrix! This is where the idea of solving an eigendecomposition comes from. That’s why you can solve PCA without any iterative optimization.

OK, so finally, what’s the difference between eigendecomposition-based PCA and singular value decomposition (SVD) based PCA?

Eigendecomposition-based PCA was easy to understand, both for the proof and for solving PCA itself.

However, in practice, most implementations are based on Singular Value Decomposition (SVD). Why is that?

If you remember how we computed the eigenvectors and eigenvalues, eigendecomposition-based PCA needs to calculate the covariance matrix first. The covariance matrix is “f” by “f”. If “f” is small, it won’t be a problem. But imagine a huge “f”, say you are dealing with images: “f” suddenly becomes a huge number, more than a million in some cases. Then computing (let alone decomposing) the covariance matrix becomes infeasible. This is where SVD-based PCA can help.

Let’s see why.

If you remember the SVD we covered last time, the centered data can be decomposed into three matrices: “U” and “V” are unitary matrices and “Σ” is a rectangular diagonal matrix.
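As a reminder, the SVD of the centered data matrix is:

```latex
X_c = U \Sigma V^\top,
\qquad
U \in \mathbb{R}^{n \times n},\quad
\Sigma \in \mathbb{R}^{n \times f},\quad
V \in \mathbb{R}^{f \times f}
```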

Let’s see what happens if we follow the same logic as we did in the eigendecomposition.
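Plugging the SVD into the covariance matrix and using the fact that UᵀU = I:

```latex
C = \frac{1}{n-1} X_c^\top X_c
  = \frac{1}{n-1} V \Sigma^\top U^\top U \Sigma V^\top
  = V \left( \frac{\Sigma^\top \Sigma}{n-1} \right) V^\top
```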

It turns out that the matrix “V” is exactly the eigenvector matrix we needed! Therefore, if we do the same thing and project the data onto the principal components, we can completely skip the calculation of the covariance matrix, because the projection is just a rearranged form of the SVD of the centered data matrix! (You just need to multiply X_c = UΣV’ by the matrix “V” from the right.)
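Concretely, multiplying X_c = UΣVᵀ by V from the right (and keeping only its first s columns, written V_s here) gives the projection directly, since VᵀV = I:

```latex
X_c V = U \Sigma V^\top V = U \Sigma
\qquad\Longrightarrow\qquad
T = X_c V_s = \left( U \Sigma \right)_{:,\,1:s}
```

Here is a minimal NumPy sketch of both routes (the toy data and variable names are just illustrative assumptions), showing that the eigendecomposition-based and SVD-based projections agree up to the sign of each component:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # n = 100 samples, f = 5 features
s = 2                              # number of principal components to keep

# Center the data (subtract each feature's mean)
X_c = X - X.mean(axis=0)

# --- Eigendecomposition-based PCA ---
C = X_c.T @ X_c / (X_c.shape[0] - 1)        # f x f covariance matrix
eigvals, Q = np.linalg.eigh(C)              # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]           # sort eigenvalues large -> small
Q_s = Q[:, order[:s]]                       # keep the top-s eigenvectors
T_eig = X_c @ Q_s                           # n x s projected data

# --- SVD-based PCA (no covariance matrix needed) ---
U, sigma, Vt = np.linalg.svd(X_c, full_matrices=False)
T_svd = X_c @ Vt[:s].T                      # equivalently U[:, :s] * sigma[:s]

# The two projections match up to a sign flip per component
print(np.allclose(np.abs(T_eig), np.abs(T_svd)))   # True
```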

Summary

  • PCA projects the data onto the principal components, where the variance of the data is maximized (or, equivalently, the total distance from each point to the axis is minimized)
  • Using a Lagrange multiplier, you can derive why PCA boils down to solving the eigendecomposition of the covariance matrix
  • In practice, people use SVD to calculate PCA because of its computational efficiency

That’s all! Hope it helps! See you next time!

A Neuroengineer and Ph.D. candidate researching Brain Computer Interface (BCI). I want to build a cyberbrain system in the future. Nice meeting you!