Understanding Principal Component Analysis

Rajinish Aneel Bhatia · Published in CodeX · 10 min read · Aug 20, 2021

The primary goal of this blog is to emphasize that hitting a hurdle while learning something new should be a reason to learn more about the prerequisites rather than an excuse to shy away from the thing you wanted to learn in the first place. What I want to do is build up a natural approach to principal component analysis by asking somewhat natural questions; this will lead us to a calculus-free understanding of the topic. Linear algebra, however, will be put to use quite heavily; I’ll try to cover everything I can.

First, let’s start with the goal. We are given some data “X”, which is an f x m (features x instances) matrix; for the sake of simplicity, assume we have 2 features. Before moving forward, let me define some of the notation I’ll be using: @ is matrix multiplication (it’s skipped in some places where the multiplication is tacit, which keeps the equations succinct), and a dot (.) is used for dot products. What we want to develop are tools that help us get a better representation of the data. What do I mean by that? Take a look at the data plotted below:

It’s a line. Isn’t it redundant to use 2 dimensions for data that is clearly 1-dimensional? This isn’t a particularly great representation in itself, for there’s a lot of redundancy: the data could have just been on the x axis and it wouldn’t have made much difference. Why do we want a lower dimensional representation of the data? In the context of machine learning, fewer features speed up training. But more generally, this can be a quintessential tool for denoising the data: what if the function that generated the data in the first place was, as a matter of fact, the equation of a line, and the deviations of the data points from the line are random noise? Transforming the data to a lower dimension, i.e. projecting it onto the x axis, and then transforming it back to 2d (by, say, rotating the x axis to match the slope of the original line) would get rid of the noise.

Keeping these uses in mind, let’s start with the objective: we want to remove the redundant dimensions. How do we do that? Well, in this case the answer seems quite obvious: first we rotate our axes so that the x axis lines up with the line that the data is on, then we remove the y component of each data point, for it’s almost 0 for every data point after the rotation.

The axes were first rotated, then the data points were projected onto the x axis.
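To make this concrete, here is a minimal sketch of that rotate-and-project idea. The data and the slope are made up purely for illustration; this is not the general algorithm yet, just the picture above in code.

```python
import numpy as np

# Hypothetical noisy data on the line y = 2x, stored as features x instances.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 100)
X = np.vstack([x, 2 * x + rng.normal(0, 0.1, 100)])   # shape (2, 100)

# Rotate the axes so the x axis lines up with the line's direction.
theta = np.arctan2(2, 1)                     # angle of a line with slope 2
R = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])        # rotation by -theta
X_rot = R @ X                                # second row is now almost all zeros

# Project onto the (new) x axis by discarding the y component.
X_1d = X_rot[0:1, :]                         # 1-dimensional representation
```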

Now, we want to describe this mathematically, so we need a mathematically definable objective. First, it’s obvious that we want to remove the redundancy from the data; this involves the data being on a line in 2d, on a plane in 3d, and so on. (We’ll think about discarding the extra dimensions later; for now we’re only thinking about linear lower dimensional surfaces.) The reason we’re thinking about these lower dimensional linear surfaces is that if, say, the data is on a line in 2d, then we could just rotate our x axis onto the line and discard the other dimension; if it’s on a plane in 3d, then we could rotate the x,y plane to match the plane of the data and discard the other dimension. This is an assumption that PCA makes: that the important data lies on a lower dimensional linear surface. In more practical cases, images for instance, this assumption need not be true, and more sophisticated tools and algorithms have been developed to solve those problems.

Sticking with PCA for now, let’s try to come up with a more mathematical approach. If the data is on a line in 2d, that means the covariance of the data is non-zero, right? This comes from how covariance is defined: it measures the trends in the data. This is a particularly useful observation, because we have the means of measuring covariance. It’s simple to calculate, and if you want to recall the concept I recommend watching this video, since it explains the topic better than I ever could. If I mean center the data X along each feature dimension and call it Xbar, then the covariances are the off diagonal elements of (1/m)(Xbar@Xbar^T) (carry out the matrix multiplication for 2d data if this confuses you; I’m also skipping the 1/m factor in further mentions of the covariance matrix to avoid clutter). Now, what we want is 0 covariance. Why? Covariance is our measure of redundancy (remember, by redundancy we mean the data lying on a lower dimensional linear surface: a line in 2d, a plane in 3d, a line in 3d, and so forth), and developing the methods to mold the data so that it has 0 covariance would force the line, or more generally the hyperplane, that the data is on to be flattened. (First we remove the redundancy, then we remove the extra dimensions; in the above example removing the redundancy was the rotation of the axes, and the removal of redundant dimensions will be explained a bit later.) This gives us a mathematically definable objective: we want a transformation P that transforms Xbar to a new space, call it “Y”, such that the off diagonal elements (the covariances) of Y@Y^T are 0; all this comes straight from the objectives we have defined. So let’s write that down: Y = P@Xbar such that Y@Y^T = (P@Xbar)@(P@Xbar)^T is a diagonal matrix; simplifying:

Y@Y^T = (P@Xbar)@(P@Xbar)^T = (P@Xbar)@(Xbar^T@P^T) = P@(Xbar@Xbar^T)@P^T
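Before going further, here is a minimal sketch of what the mean centering and the covariance matrix look like in code, reusing the made-up data from the earlier snippet:

```python
import numpy as np

# Hypothetical data again: 2 features x m instances.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 100)
X = np.vstack([x, 2 * x + rng.normal(0, 0.1, 100)])

m = X.shape[1]
Xbar = X - X.mean(axis=1, keepdims=True)   # mean center each feature (row)
S = (Xbar @ Xbar.T) / m                    # 2x2 covariance matrix

# Diagonal entries: variances along each axis.
# Off-diagonal entries: the covariance we want to drive to zero.
print(S)
```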

Let’s write Xbar@Xbar^T as S; we find that P(S)P^T must be a diagonal matrix. If you have some knowledge of diagonalization in linear algebra, this equation might have rung some bells, because you must have seen that P^(-1)SP, where P is a matrix containing the eigenvectors of S, gives us a diagonal matrix whose diagonal entries are the eigenvalues of S. If you’re not familiar with the topic, or if you want to review it, I recommend checking out 3Blue1Brown’s awesome series on Linear Algebra. We have P(S)P^T instead of the other way around (ignoring, for the moment, the fact that we have a transpose in place of an inverse; this will be explained in a bit); this can be circumvented: just use Y = (P^T)Xbar, which gives us YY^T = (P^T Xbar)(P^T Xbar)^T, which is P^T(Xbar@Xbar^T)P, or P^T(S)P.

Moving on, this is where we hit our first hurdle: there’s a P^T instead of a P^(-1). What I want to do first is figure out for which matrices P we have P^T = P^(-1); we’ll then move forward with our analysis. To do so, I want to delve deeper into what transposes do, which will make this a bit easier to understand. When we imagine a linear transformation “A”, we imagine taking linear combinations of the vectors in its columns; so, for instance, if we have a 3x2 transformation, you may imagine it as taking linear combinations of 2 vectors in 3d, which gives us a plane. To visualize transposes, first let’s give names to some stuff: A has two 3d column vectors, call them v and w. Now, for A^T (which is a transformation from 3 to 2 dimensions), to see where a 3d vector lands after A^T has been applied to it, we dot that vector with v to get the x component of the transformed vector, and we dot it with w to get the y component (if this is not obvious, write it out on a piece of paper and think about what the matrix multiplication is doing). This, by the way, is also a way of understanding why the column space of A must be perpendicular (orthogonal) to the null space of A^T; I suggest you think about it a bit, if you want to. (If you’re not familiar with the concept of column spaces, just forget about what I said.)

Now say z is a 2d vector with components z_x, z_y, and say we applied A to z to get the transformed vector Az, call it z_t. Recall that v is where the unit vector ihat lands after A is applied to it, and w is where jhat lands. If A^T = A^(-1), this means that dotting z_t with v gives us z_x and dotting z_t with w gives us z_y; this should be obvious given the above explanation of the transpose. Note that z dotted with ihat is z_x too. What we now have is z . ihat = z_x and, writing v = ihat_t = A@ihat (remember, v is where ihat lands when A is applied to it), z_t . ihat_t = z_x. This means that A preserves dot products, because z dotted with ihat (both pristine and transformed) gives its x component either before or after the transformation. Preserving dot products means two things: angles between vectors are preserved, and the lengths of vectors are preserved. This suggests that such transformations are rotations. So for rotation matrices, A^T = A^(-1). Now, for a rotation matrix “R”, all the vectors in it are orthogonal to each other, for they are the rotated basis vectors, which were orthogonal to each other. Moreover, they are of unit length (perpendicular, normalized vectors are called orthonormal vectors).
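A quick numerical sanity check of that claim (a minimal sketch; the angle is arbitrary): for a rotation matrix R, R^T acts as R^(-1), and dot products are preserved under R.

```python
import numpy as np

theta = 0.7                                  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# R^T acts as R's inverse.
print(np.allclose(R.T @ R, np.eye(2)))       # True

# Dot products (hence lengths and angles) are preserved.
a, b = np.array([1.0, 2.0]), np.array([-3.0, 0.5])
print(np.isclose(a @ b, (R @ a) @ (R @ b)))  # True
```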
In our case we have the matrix S, and its eigenvectors are in the matrix P; what we wanted to show was that P^T = P^(-1), because if that is true then we can almost diagonalize (there’s more left to prove). This raises the question: are the eigenvectors of S orthonormal? S is Xbar@Xbar^T, and the only noticeable algebraic property is that it is symmetric: S^T = S. Do symmetric matrices have orthogonal eigenvectors? S also needs to have a full set of eigenvectors for us to be able to decompose it like that. Does it fulfill that requirement? We also do not want complex eigenvectors, so are the eigenvectors of symmetric matrices real? These questions, to the best of my knowledge, can be answered algebraically; I, sadly, lack a convincing geometrical argument. I encourage you to stop and look these things up; this was the primary theme of the blog, remember? A symmetric matrix does, as a matter of fact, satisfy all of these properties. Isn’t it surprising that this one symmetry property translates into a bunch of other very useful properties? So there we go: we have our matrix P that will remove any redundancy in the data for us. The matrix consists of the eigenvectors of Xbar@Xbar^T stacked side by side as its columns. We now come to the topic of dimensionality reduction, which is just a little addition on top of this. Take a look at the picture below.
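Here is a minimal numerical sketch of those claims, reusing the hypothetical Xbar from before: the eigendecomposition of the symmetric matrix S comes out real and orthonormal, and P^T @ S @ P is diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 100)
X = np.vstack([x, 2 * x + rng.normal(0, 0.1, 100)])
Xbar = X - X.mean(axis=1, keepdims=True)

S = Xbar @ Xbar.T                       # symmetric, so eigh applies
eigvals, P = np.linalg.eigh(S)          # real eigenvalues, orthonormal eigenvectors as columns

print(np.allclose(P.T @ P, np.eye(2)))             # eigenvectors are orthonormal
print(np.allclose(P.T @ S @ P, np.diag(eigvals)))  # P^T S P is diagonal
```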

Wouldn’t you agree that the y axis is a little redundant? How did you go about figuring that out? How do we mathematically determine the importance of a dimension? Maybe pause for a while and see if you can come to a reasonable conclusion. If you take a closer look at the picture above, you should be able to see that the data is more spread out along the x axis while it’s compressed along the y axis.

So a natural conclusion to the problem of measuring the importance of a dimension would be to measure the data’s spread along that dimension, right? Which is the variance of the data along that dimension (why don’t we measure the mean instead?). Think about a few cases if you have doubts about that claim. Now, we also know that the covariance matrix already encapsulates the variances along each dimension: the diagonal entries of Xbar@Xbar^T are the variances along each of the axes. Since we have transformed “X” into “Y”, we want to look at the diagonal entries of YY^T; each entry gives the variance along one of the transformed dimensions, which is our measure of the importance of that dimension. YY^T is P^T(S)P (S is Xbar@Xbar^T), and the diagonal entries of P^T(S)P are the eigenvalues of S, so the eigenvalues of S give us the importance of each dimension, i.e. the variance along each dimension.

Now, what is usually done is this: we transform Xbar into Y by applying P^T to Xbar, where P is the matrix of eigenvectors of Xbar@Xbar^T. We can also get the eigenvalues of Xbar@Xbar^T, which, as argued above, are the importances of the dimensions. What does that mean? It means that if we, say, take two vectors from P with eigenvalues e1 and e2, then after we apply P^T to Xbar and get Y (Y = P^T@Xbar), the variances along the transformed axes will be equal to e1 and e2. If e2 is sufficiently low, we could have multiplied Xbar with just P’s first eigenvector (the one with eigenvalue e1), so that we wouldn’t get a redundant dimension. If Xbar@Xbar^T has f eigenvectors with f eigenvalues, then we discard the eigenvectors with low eigenvalues, stack the remaining eigenvectors into a matrix P, and apply P^T to Xbar to get uncorrelated, dimensionally reduced data.

The data is transformed back to the original space by applying P to Y: X_t = P@Y. Now, P is a non square matrix; ideally we’d want to apply (P^T)^(-1) to Y, which would be (P^T)^T@Y, or just P@Y (if P is a rotation matrix then P^T is a rotation matrix too, of course, because P^T is P^(-1), which is just the rotation by the opposite angle); but the inverse is not defined for non square matrices. The justification for still using P for the inverse transformation comes from the singular value decomposition of the matrix; I won’t go too deep into that right now. When we transform back to the original space after having transformed to a lower dimensional space, some variance of the data is lost; this lost variance is the sum of the eigenvalues of the eigenvectors we discarded because they were simply too low. If we assume that this extra variance was noise, which is true for some datasets, then performing these steps and transforming back to the original space does, as a matter of fact, give us denoised data. These ideas of preserving only the high variance dimensions are important in their own right; if you learn about autoencoders in machine learning, they’re vaguely built on the scaffolding provided by PCA.
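Putting the whole procedure together, here is a minimal sketch of the pipeline described above, again on the made-up noisy line data (the feature count, noise level, and number of kept components are all assumptions for illustration):

```python
import numpy as np

def pca_reduce_reconstruct(X, k):
    """Project (features x instances) data onto its top-k principal components,
    then transform back to the original space."""
    mean = X.mean(axis=1, keepdims=True)
    Xbar = X - mean                          # mean center each feature

    S = Xbar @ Xbar.T                        # (unnormalized) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # symmetric -> real, orthonormal

    order = np.argsort(eigvals)[::-1]        # sort dimensions by importance (variance)
    P = eigvecs[:, order[:k]]                # keep the top-k eigenvectors as columns

    Y = P.T @ Xbar                           # k x m reduced, uncorrelated representation
    X_t = P @ Y + mean                       # transform back: approximately denoised data
    return Y, X_t

# Hypothetical noisy line data, as before.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 100)
X = np.vstack([x, 2 * x + rng.normal(0, 0.1, 100)])

Y, X_denoised = pca_reduce_reconstruct(X, k=1)
print(Y.shape, X_denoised.shape)             # (1, 100) (2, 100)
```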
