Principal Component Analysis (PCA)

Viceroy · Published in unpack · Jan 15, 2021 · 7 min read

We begin our Principal Component Analysis (PCA) by plotting our variables. Although PCA can be used on millions of variables, it's probably easiest to understand with two.

PCA then takes the average of variable 1 (or x), the average of variable 2, and so on (the average of variable 3, …, the average of variable n), and centers these average values a1, a2, …, an on the origin, shifting all the data points with them. Each data point's x-y values may change, but what is important is that they do not change relative to each other. For example, in the figure below a1 ≈ 4 and a2 ≈ 5, so we center the data on the origin by subtracting 4 from each data point's x and 5 from its y.

Figure 1: Re-centering around the origin. The data points are shifted, but keep the same distances relative to each other.
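
As a minimal sketch of this centering step (using a small made-up two-variable dataset), the averages are subtracted column by column:

```python
import numpy as np

# Hypothetical (variable 1, variable 2) measurements, chosen so the averages
# come out to roughly a1 = 4 and a2 = 5 as in the example above.
X = np.array([[4.0, 5.5],
              [3.5, 4.0],
              [5.0, 6.0],
              [3.5, 4.5]])

means = X.mean(axis=0)            # a1, a2: the per-variable averages
X_centered = X - means            # shift every point so the averages land on the origin

print(means)                      # [4. 5.]
print(X_centered.mean(axis=0))    # ~[0. 0.]: the centered data now averages to the origin
```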

Next, PCA generates a random line that goes through the origin.

Each data point is then projected onto the line, and the PCA algorithm takes note of the distance between each data point's coordinates and its projected coordinates on the line. For more on "projecting", look at this wiki article.

In theory, PCA could aim to minimize the distances from the points to the line. This is what we do when finding a Linear Least Squares (LLS) fit of a linear function (y = mx + b) to data.

But in practice it is much easier to maximize the distance from the projected points to the origin. That is, it's easier for PCA to find the best-fitting line by maximizing the sum of squared distances from the projected points to the origin than by minimizing the distances of the real points to the line.

Figure 2: Example of the PCA algorithm for a single data point. Vector 2 is an arbitrary "best fit line" determined by PCA. Image created by Author.
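
As a rough sketch of what projecting a point onto the line means (the point and direction below are made up), the projected point's distance to the origin and the point's distance to the line always satisfy the Pythagorean relationship described next:

```python
import numpy as np

point = np.array([1.0, 2.0])                        # a centered data point
direction = np.array([2.0, 1.0])
direction = direction / np.linalg.norm(direction)   # unit vector along the candidate line

b = point @ direction                               # distance from the origin to the projected point
projected = b * direction                           # coordinates of the projected point on the line
a = np.linalg.norm(point - projected)               # distance from the real point to the line

# c, the distance from the real point to the origin, never changes as the line rotates.
c = np.linalg.norm(point)
assert np.isclose(a**2 + b**2, c**2)
```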

Put another way with Pythagoras' theorem (a² + b² = c²): rather than minimizing a² as with LLS, with PCA we aim to maximize b². Through some straightforward algebra, this in turn minimizes a², because c² (the distance from the real data point to the origin) is a never-changing constant: as b² gets larger, a² must get smaller to satisfy the theorem. For example, if c² = 25 and b² grows from 9 to 16, a² must shrink from 16 to 9.

So PCA iterates over and over until it finds the line/vector that maximizes the sum of the squared distances of every projected point to the origin (in jargon, the SS(distances), aka sum of squared distances, aka the eigenvalue, aka the variance without the "divide by n-1"). This line/vector, with some slope m and offset b (written y = mx + b), is called Principal Component One (PC1). It will correspond most strongly to one variable (maybe variable x, and thus PC1 ≈ the x axis), but we will get to that later.

Figure 3: Formula for Sum of Squared Distances. Image created by Author.
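
A rough sketch of that search, assuming made-up two-variable data: try many candidate directions through the origin and keep the one with the largest SS(distances).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 1.0]])  # toy correlated data
X = X - X.mean(axis=0)                                              # center first

def ss_distances(data, angle):
    """Sum of squared distances from the projected points to the origin
    for the line through the origin at the given angle."""
    direction = np.array([np.cos(angle), np.sin(angle)])            # unit vector
    return np.sum((data @ direction) ** 2)

angles = np.linspace(0.0, np.pi, 1000)           # directions only need half a turn
best = max(angles, key=lambda a: ss_distances(X, a))

pc1 = np.array([np.cos(best), np.sin(best)])     # approximate eigenvector for PC1
eigenvalue = ss_distances(X, best)               # SS(distances), aka the eigenvalue for PC1
print(pc1, eigenvalue)
```

In practice the best direction is found with linear algebra (the SVD discussed below) rather than a brute-force search, but the quantity being maximized is the same.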

The next step is to apply singular value decomposition (SVD) to this line. The goal is to scale PC1 down to a vector that is exactly 1 unit long. To set this up, take the point on PC1 where y = 1 (and therefore x = (1-b)/m); the hypotenuse of the resulting right triangle, which rides along PC1 from the origin up to y = 1, is what we want to scale to equal 1.

So doing the math, if a²=y²=1 and b²=x²= ((1-b)/m)², then c² = 1 + ((1-b)/m)².

Figure 4: Setting up Pythagoras' theorem. Image created by Author.

In order to get c alone, we take the square root, sqrt(1 + ((1-b)/m)²), and find that c is equal to something complicated, assuming that m ≠ 0, i.e. that PC1 is not a horizontal (flat) line. Also, since ((1-b)/m)² is a square, it can never be negative, even if b > 1 or m is negative, so the expression under the square root is always positive and c is always a real number.

Figure 5: Solving for c. Since c is a line/vector it doesn’t matter if we use its positive or negative representation. Image created by Author.

Now that we have a value for c, made up of the real numbers b and m (from our equation for PC1, y = mx + b, when y equaled 1), we can divide a, b, and c by c to get c = 1 and the scaled values for a and b.

Figure 6: Note: the b on the left-hand side is the b in Pythagoras' theorem, a² + b² = c², i.e. the distance along the x axis. Not to be confused with the b on the right-hand side, which is the b found in PC1's y = mx + b. Image created by Author.

This 1-unit-long vector we just calculated, consisting of the scaled value b along the x axis (x1) and the scaled value a along the y axis (x2), is called the Singular Vector or Eigenvector for PC1. And the square root of the SS for PC1, sqrt(SS(distances for PC1)), is called the Singular Value for PC1.
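
A small numeric sketch of that scaling, assuming PC1 passes through the origin after centering (so its intercept b is 0) and using a made-up slope m:

```python
import numpy as np

m = 0.25                                   # hypothetical slope of PC1 (y = m*x after centering)
x_leg = (1.0 - 0.0) / m                    # x = (1 - b)/m with b = 0: the horizontal leg
y_leg = 1.0                                # the vertical leg, where y = 1
c = np.sqrt(y_leg**2 + x_leg**2)           # hypotenuse: sqrt(1 + ((1 - b)/m)**2)

eigenvector = np.array([x_leg, y_leg]) / c        # divide by c so the hypotenuse is 1 unit long
print(eigenvector, np.linalg.norm(eigenvector))   # the norm comes out to 1.0

ss_pc1 = 42.0                              # a made-up SS(distances) for PC1
singular_value = np.sqrt(ss_pc1)           # the Singular Value for PC1
```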

Now that we have PC1, let's solve for PC2. It's actually not as hard as you think: in 2D, PC2 is simply the line through the origin that is perpendicular to PC1. If we were in 3D space rather than 2D, PC2 would be the best-fitting line through the origin among those orthogonal to PC1, and PC3 would then be the line through the origin orthogonal to the PC1-PC2 plane.

Using the a, b and c we solved for with Pythagoras' theorem for PC1, PC2 (the perpendicular line) has the unit vector x = -a and y = b.

Now we have two lines/vectors that we can project our real data onto. Just to make it easier on our eyes, we can then rotate the axes so they align with our standard x-y plane; since both PC1 and PC2 have been standardized to 1-unit-long vectors, in linear algebra we would do this through a "change of basis", mapping them onto the identity matrix.
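
A minimal sketch of that change of basis, with a made-up unit eigenvector for PC1; projecting the centered data onto PC1 and PC2 is just one matrix product:

```python
import numpy as np

pc1 = np.array([0.97, 0.24])              # hypothetical eigenvector (b, a) for PC1
pc1 = pc1 / np.linalg.norm(pc1)           # make sure it really is 1 unit long
pc2 = np.array([-pc1[1], pc1[0]])         # the perpendicular unit vector: (-a, b)

X_centered = np.array([[1.0, 2.0],        # toy centered data points
                       [-0.5, 0.3],
                       [2.0, 1.0]])

basis = np.column_stack([pc1, pc2])       # columns are the new basis vectors
scores = X_centered @ basis               # each point's coordinates along PC1 and PC2
print(scores)
```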

We could continue this set of steps (find an orthogonal vector, apply SVD, rotate) for n dimensions. After that we have only 3 things left to do; it shouldn't take more than a minute.

Using the SS(distance) for PC1 and the SS(distance) for PC2 (etc.), we can calculate the variance, through the simple formula SS(distance)/(n-1), where n is the number of data points we have.

With these numbers, we can see that PC1 accounts for X% of the variance in our n data points and PC2 accounts for (100% - X%) of the variance in our dataset. Since we know each of these correlates to a variable, it would be interesting to see which (set of) variables each principal component corresponds to.
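
A short sketch of that calculation, with made-up SS values and a made-up number of data points:

```python
import numpy as np

ss = np.array([42.0, 8.0])        # SS(distance) for PC1 and PC2
n = 20                            # number of data points
variances = ss / (n - 1)          # variance accounted for by each component

percent = 100 * variances / variances.sum()
print(percent)                    # PC1 accounts for 84%, PC2 for the remaining 16%
```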

The last thing to do is plot the variances as a bar graph in descending order. Depending on how accurate you want your PCA-reduced dataset to be (e.g. >95%), you take the combination of {PC1, PC2, …, PCn} such that this threshold is met and discard the rest. What you're left with are only the principal components. Great analysis!

Source: Llewelyn Fernandes “Perform an Exploratory Data Analysis”
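
A sketch of that last step with made-up percentages: sort the variances, draw the bar graph (a scree plot), and keep just enough components to clear the chosen threshold.

```python
import numpy as np
import matplotlib.pyplot as plt

percent = np.array([72.0, 21.0, 5.0, 2.0])      # % of variance per component, descending
cumulative = np.cumsum(percent)                 # [72, 93, 98, 100]
keep = np.searchsorted(cumulative, 95.0) + 1    # smallest k whose cumulative variance >= 95%

plt.bar(range(1, len(percent) + 1), percent)
plt.xlabel("Principal component")
plt.ylabel("% of variance")
plt.show()

print(f"Keep the first {keep} components")      # here: the first 3 components (98%)
```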

This is especially useful when you can reduce a dataset with 4 or more variables (i.e. 4 dimensions or higher) into something, albeit less accurate, that we can visualize in 2D or 3D. The ultimate goal of PCA is to identify and detect correlation between variables. If it finds a strong correlation, you can reduce the dimensionality by keeping only one, the principal component.
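
As a hedged example of this kind of reduction, scikit-learn's PCA can project the classic 4-variable iris dataset down to 2 components for plotting:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                    # 150 samples, 4 variables
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # data centered and projected onto PC1 and PC2

print(pca.explained_variance_ratio_)    # share of the variance kept by each component
```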

Conclusion

Source: Pablo Bernabeu “Naive Principal Component Analysis in R”

Let's break it down into tl;dr steps (a compact NumPy sketch follows the list):

Step 1: "Plot" the data points (in quotation marks because plotting isn't possible in 4D+)

Step 2: Average each of the variables

Step 3: Recenter so that the average of each variable corresponds to the origin point (0, 0, …, 0)

Step 4: Generate a random vector that runs through the origin

Step 5: Rotate the candidate vector until the sum of squared distances between the projected points on the vector and the origin, aka SS(distance), is maximized. This vector is called PC1.

Step 6: Apply Singular Value Decomposition (SVD) to PC1 so that the vector is scaled down to a length of 1 unit. This vector is called the "Singular Vector" or "Eigenvector" for PC1.

Step 7: Find a vector orthogonal to PC1 that goes through the origin. This vector is called PC2.

Step 8: Apply SVD to PC2

Step 9: Rotate the vectors PC1 and PC2 so that they align with our basis vectors for that dimension (standard x-y plane).

Step 10: Repeat Steps 7, 8, and 9 until you get to the nth dimension. (E.g. for dimension 3, find the vector orthogonal to the PC1-PC2 plane, called PC3, apply SVD to PC3, and recenter around the origin / our basis {1, 0, 0}, {0, 1, 0}, {0, 0, 1}.)

Step 11: Find the variance of PC1, PC2, …, PCn

Step 12: Plot the variances with a bar graph in descending order

Step 13: Reduce the dimensions of your dataset to a threshold you are willing to accept (e.g. >95%) by throwing out all the insignificant principal components (PCi, PCj, PCk, etc.) with a combined variance of <5%.

Step 14: If you are down to 3 principal components or fewer, plot the projected data along the x-y-z axes. Sometimes, even if the top 3 PCs don't add up to the threshold, it may still be useful to plot combinations of the components with the largest variances to see what clusters or trends you can identify by eye.

Step 15: Congratulate yourself for analyzing the principal components, and follow me! ;)
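
Finally, here is the promised compact NumPy sketch of the whole recipe on made-up 3-variable data; it mirrors the steps above rather than any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))   # toy correlated dataset

# Steps 2-3: average each variable and recenter on the origin.
X_centered = X - X.mean(axis=0)

# Steps 4-10: SVD finds every principal component at once (the rows of Vt),
# each orthogonal to the previous ones and already 1 unit long.
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 11: variance of each component, SS(distance) / (n - 1).
n = X.shape[0]
variances = singular_values**2 / (n - 1)
percent = 100 * variances / variances.sum()

# Steps 12-13: keep enough components to reach a chosen threshold (e.g. 95%).
keep = np.searchsorted(np.cumsum(percent), 95.0) + 1

# Step 14: project the centered data onto the components we kept.
scores = X_centered @ Vt[:keep].T
print(percent, keep, scores.shape)
```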
