How does PCA (Principal Component Analysis) really work?
Demystifying the mathematics behind the working of Principal Component Analysis, why it is important, and how to interpret its results.
PCA, or Principal Component Analysis, is an unsupervised algorithm used for reducing the dimensionality of data while preserving as much information as possible. By keeping only the important directions of variation and creating new, uncorrelated variables that maximize variance, PCA helps to tackle issues like the curse of dimensionality and over-fitting, among others.
The code for implementing PCA (in either Python or R) is rather simple and easily available. Hence, I won’t spend time on that, but would instead like to focus on the math behind the successful functioning of the algorithm.
At this stage, I think it is important to give some attention to what we call the curse of dimensionality before diving deep into PCA, as it will help us realize the importance of employing PCA and when it’s necessary to do so. And unless we are convinced that something is indeed useful and important, it’s hardly worth learning about.
What is this “curse”?
Two assumptions form the basis of semi-supervised learning: embedding (that the underlying structure of high-dimensional data is often much simpler than its dimensionality suggests) and continuity (that data points near each other are more similar to each other than those far apart). These two assumptions, along with some results from statistics, give rise to several problems when dealing with data that has too many dimensions.
- As the number of dimensions grows, the amount of data needed to cover the space grows exponentially, which is likely to hurt performance speed, quality, or both. If each of the m dimensions can take n distinct values, there are nᵐ possible combinations to account for.
- Not all dimensions are truly independent of each other; many are correlated, which violates independence assumptions that common statistical methods rely on.
- In extremely high dimensions, the space is mostly empty and the data becomes very sparse. Nearly all observations end up roughly equidistant from each other, which nullifies the effectiveness of distance metrics.
So, how does PCA work?
Now that we know why PCA is important, let’s dig into the math behind it and see how it really works, with the help of some visualized examples. After all, why should we blindly trust an algorithm and take for granted that whatever it is doing is indeed correct?!
I’ll first take you through a simple 2-D example and then illustrate how it works similarly in the case of 3-D or even higher-dimensional datasets.
Why 2D? Because it’s the easiest for me to illustrate and also for you to visualize. After all, beyond 3 dimensions, things are more intriguing than clear to humans. As shown below, we are given the following data, plotted on the corresponding 2-D plane.
Now, what we will do is take the mean of all the observations along each of the 2 variable axes and use these means to calculate the center of the data. After that, we will shift the observations along the plane in such a way that the center of the data coincides with the origin of the 2-D plane.
An important thing we must not overlook here is that this transformation does not change the relative positions of the points with respect to each other. The rightmost point remains the rightmost, the bottommost remains so, and so on. I mention this just to clarify that shifting the points doesn’t affect the relative distribution of the data and is therefore an acceptable operation. The reason for doing it will soon become clear.
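Here is a minimal sketch of this centering step, using NumPy and a small hypothetical dataset (the numbers are made up purely for illustration, not read off any particular plot):

```python
import numpy as np

# Hypothetical 2-D data: each row is an observation (Var1, Var2)
data = np.array([[10.0, 6.0],
                 [11.0, 4.0],
                 [ 8.0, 5.0],
                 [ 3.0, 3.0],
                 [ 2.0, 2.8],
                 [ 1.0, 1.0]])

center = data.mean(axis=0)     # mean along each variable axis = the center of the data
centered = data - center       # shift the points so the center coincides with the origin

print(center)                  # this point now sits at (0, 0)
print(centered)                # relative positions of the points are unchanged
```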
Finding the first principal component:
Now, we will try to find the line of best fit for the points, passing through the center. The idea is similar in spirit to other techniques that fit a line using OLS (ordinary least squares); however, PCA works with the distances between the points and the line itself, and the requirement that the line pass through the center lets us set the problem up a little differently, which makes things a bit easier. As shown below, a random line through the center is chosen at first and the points are projected onto it.
What we do next is measure the distances from these projected points to the origin (labelled in the picture above as dᵢ’s), square them, and then maximize their sum. Yes, maximize! Intuitively, this is the same as minimizing the sum of squared distances from the points to the line: by Pythagoras’ theorem, each point’s squared distance to the origin (which is fixed) equals its squared distance to the line plus its squared projected distance dᵢ², so maximizing the sum of the dᵢ² automatically minimizes the distances to the line. Both ways yield the same result and thus an effective optimization, but I preferred to describe the exact criterion used by the PCA algorithm.
*The line of best fit is the one for which d₁² + d₂² + d₃² + d₄² + d₅² + d₆² is maximized*
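If you would rather not take this equivalence on faith, here is a rough numerical check on a hypothetical set of centered points: it sweeps candidate lines through the origin and confirms that the direction maximizing the sum of squared projections is the same one minimizing the sum of squared perpendicular distances.

```python
import numpy as np

# Hypothetical centered 2-D points (rows are observations)
X = np.array([[ 4.0,  1.2], [ 2.5,  0.4], [ 1.0,  0.5],
              [-1.0, -0.3], [-2.5, -0.9], [-4.0, -0.9]])

total = (X ** 2).sum()                   # sum of squared distances to the origin (fixed)
angles = np.linspace(0.0, np.pi, 1000)   # candidate lines through the origin

proj_ss = []                             # sum of squared projected distances per angle
perp_ss = []                             # sum of squared perpendicular distances per angle
for theta in angles:
    u = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the candidate line
    d = X @ u                                      # projected distances d_i to the origin
    proj_ss.append((d ** 2).sum())
    perp_ss.append(total - (d ** 2).sum())         # Pythagoras: ||x||^2 = d^2 + (perp dist)^2

# Both criteria pick (numerically) the same direction
print(angles[np.argmax(proj_ss)], angles[np.argmin(perp_ss)])
```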
Finally, we have obtained our first principal component, or PC1. All we need to do now is find the slope of the line. Say, for the sake of argument, the slope of PC1 for our data comes out to be 0.25. Geometrically, this means that for every 4 unit steps along the Var1-axis, we move 1 unit step up the Var2-axis. In the context of our data, PC1 is simply a combination of the 2 variables. One way to look at it is like a recipe for a chemical solution: we need to mix 4 parts of Var1 with 1 part of Var2 to obtain the right mixture.
Now is a suitable time to introduce some important terms. The unit vector along the line of best fit is called the eigenvector, and the lengths of its components along each variable axis are called the loading scores. For our example (slope of the line = 0.25), we can calculate them as follows:
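In NumPy, this is just a normalization of the direction vector (the slope of 0.25 is the made-up value from above):

```python
import numpy as np

direction = np.array([4.0, 1.0])     # slope 0.25 means 4 parts Var1, 1 part Var2
length = np.linalg.norm(direction)   # sqrt(4**2 + 1**2) = sqrt(17) ≈ 4.123

eigenvector = direction / length     # scale the direction down to unit length
print(eigenvector)                   # ≈ [0.970, 0.242] -> the loading scores
```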
This unit-length vector along the best fit line is known as the eigenvector, or the singular vector, of PC1.
The corresponding values of 0.97 and 0.242 are known as the loading scores of Var1 and Var2 for PC1. (As expected, the proportions remain unchanged and thus we can think of mixing 0.97 parts of Var1 with 0.242 parts of Var2).
The sum of squared distances that we maximized earlier is the eigenvalue of PC1.
Finding the second principal component:
Now we are done with the first major step, but we are yet to reach the final desired position. For that, we need to find the second principal component too. (Since there are only 2 variables, we can have at most 2 components.)
Now is the correct time to talk about a very important property of PCA.
All the principal components of any dataset always form a mutually orthogonal system. The short explanation is that the principal components are the eigenvectors of the data’s covariance matrix: the covariance matrix is symmetric, and symmetric matrices always have real eigenvalues and orthogonal eigenvectors.
For a detailed mathematical explanation, kindly refer to the following link: https://stats.stackexchange.com/questions/130882/why-are-principal-components-in-pca-eigenvectors-of-the-covariance-matrix-mutu
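If a full proof feels heavy, a quick numerical check makes the property easy to believe (the data below is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # arbitrary data: 50 observations of 3 variables
cov = np.cov(X, rowvar=False)           # the covariance matrix is symmetric

eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvectors.T @ eigenvectors)    # ≈ identity matrix: the eigenvectors are orthonormal
```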
Thus, PC2 for our case is simply the line through the origin perpendicular to PC1, which we have already found. Calculating in the same way as explained above, we get the following result for PC2: −1 part of Var1 must be mixed with 4 parts of Var2.
After that, we just need to rotate everything so that PC1 becomes the horizontal axis and PC2 the vertical axis, and redraw the points using their projections onto the two components. And we are done and dusted with our PC plot.
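Pulling the whole 2-D procedure together, here is a sketch of how one might compute such a PC plot by hand with NumPy, on the same hypothetical dataset as before (the covariance-matrix route shown here is the standard way of arriving at the same components described above):

```python
import numpy as np

# Hypothetical 2-D data: rows are observations (Var1, Var2)
data = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
                 [ 3.0, 3.0], [ 2.0, 2.8], [1.0, 1.0]])

centered = data - data.mean(axis=0)              # step 1: shift the center to the origin
cov = np.cov(centered, rowvar=False)             # step 2: covariance matrix (symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # step 3: eigenvectors = principal components

order = np.argsort(eigenvalues)[::-1]            # eigh returns ascending order; put PC1 first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

pc_scores = centered @ eigenvectors              # step 4: coordinates of each point in the PC plot
print(eigenvectors)                              # columns hold the loading scores of PC1 and PC2
print(pc_scores)                                 # the data redrawn in the PC1-PC2 system
```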
But wait, how does this help us? We started with only 2 variables, and here we end up with 2 principal components again! So how is this of any use to us? To answer that, we need to check how much of the total variance in the data is explained by each of the principal components individually.
Let SS₁ and SS₂ be the sums of squares for PC1 and PC2 respectively. Each sum of squares divided by (sample size − 1) gives us the variation around the origin for that component. From these, we can calculate the percentage of variation explained by each component.
For this example, let us consider that SS₁= 20 and SS₂ = 5.
Then the % of variation explained by PC1 alone: 20/(20+5) = 0.8 = 80%
and the % of variation explained by PC2 alone: 5/(20+5) = 0.2 = 20%
(Point to be noted: since the divisor (sample size − 1) is the same for every component, the raw SS scores are enough to calculate the % variation; we can skip the division step.)
Hence, we can conclude that PC1 alone is sufficient to explain most of the variance in the entire data. And that was exactly our initial aim: to explain as much of the variation as possible using fewer variables.
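In code, this bookkeeping boils down to a couple of lines (using the made-up SS values from the example above):

```python
import numpy as np

ss = np.array([20.0, 5.0])    # SS1 and SS2 from the example above
explained = ss / ss.sum()
print(explained)              # [0.8, 0.2] -> PC1 explains 80%, PC2 explains 20%
```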
Now that we have seen how the math works, let’s try it out for a 3-D example!
PCA for 3-D Data:
Following steps similar to the 2D case, we obtain the first principal component as shown above. However, since we are now dealing with 3 variables, we can have up to 3 principal components — of course mutually perpendicular and each of them passing through the origin.
Suppose SS₁, SS₂ and SS₃ are 79, 15 and 6 respectively. Then the percentages of variation explained by the 3 components are 79%, 15% and 6% respectively. Consequently, PC1 and PC2 together can explain 94% of the variance in the data, so PC3 can be dropped. The choice of threshold can vary from user to user as well as with the problem at hand, but the most important conclusion is that we have successfully been able to reduce the number of variables!
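In practice you would rarely do this by hand; a sketch of the same workflow with scikit-learn’s PCA on a made-up 3-D dataset might look like this (the data and the cumulative-variance threshold are illustrative assumptions, not values from the example above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 3-D data with one dominant direction of variation
rng = np.random.default_rng(42)
t = rng.normal(size=200)
X = np.column_stack([t + rng.normal(scale=0.3, size=200),
                     2 * t + rng.normal(scale=0.5, size=200),
                     rng.normal(scale=0.4, size=200)])

pca = PCA(n_components=3)
scores = pca.fit_transform(X)                       # the data expressed in PC coordinates
print(pca.explained_variance_ratio_)                # share of variance per component
print(np.cumsum(pca.explained_variance_ratio_))     # keep components until the chosen threshold is reached
```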
Generalization and Conclusion:
I hope I have been somewhat able to capture the main essence behind this precise yet beautiful algorithm. However, the datasets I used are tiny toy examples and don’t really require reduction. The real power of PCA shows while dealing with real-world datasets, which can easily have tens or even hundreds of variables. Quite obviously, not all of them will be useful for our task, and PCA is often the go-to technique for reducing the number of variables without losing much information.
A tool that comes in very handy in this regard is the scree plot. It describes the cumulative variation of the data explained by the first ‘n’ principal components. The user can then choose which components to keep and which to discard.
In the accompanying plot, 7 components are enough to explain over 80% of the total variance, a considerable decrease from the 11 variables originally present.
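A rough sketch of how such a scree plot can be produced with scikit-learn and matplotlib (the 11-variable dataset here is a random placeholder, so the resulting curve will not match the one described above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))                     # placeholder data with 11 variables

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.8, linestyle='--')                   # an example 80% threshold
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.title('Scree plot')
plt.show()
```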
Now, I would like to come back to a few things which I didn’t cover in detail above, lest the flow of things be disrupted. The terms I defined, namely eigenvector and eigenvalue, are worth a second look. An eigenvector represents a direction, while an eigenvalue represents magnitude (or importance). Naturally, the bigger the eigenvalue, the more important the corresponding direction (eigenvector). This is consistent with our earlier result, where we found that the PC with the largest eigenvalue explains the overall variance of the data most effectively.
Lastly, nothing in data science is of much use unless you know how to implement it properly. There is no point spending time explaining the lines of code here, but I will leave a couple of resources that are easy to access and understand:
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
P.S. This is my first article on Medium. A big thanks to you if you have been patient enough to read through the entire piece. Please feel free to leave your honest feedback, point out technical and typographic errors present in this writeup, or clarify your doubts and queries, if any, in the comments section. Cheers :)