Principal Component Analysis Simplified!

Sumaya Bai · Published in Analytics Vidhya · 6 min read · Sep 17, 2021

Through this story, I would like to explore what is probably one of the most important algorithms in Unsupervised Learning: Principal Component Analysis (PCA).
The PCA algorithm is pretty technical, but I have tried to explain it in a simplified form.
Going ahead, I'll present the theory of this method, the mathematical intuition behind it, and its major advantages and drawbacks.

Before heading towards PCA, I want to shed some light on Dimensionality Reduction.
Let's understand what it actually means.
Dimensionality Reduction: It simply refers to reducing the number of input variables, or features, in the training dataset. Having a lot of variables may lead to overfitting the model, and we may also face issues while studying the relationships between the variables.
So our main problem is how to select variables from a whole lot of input variables, or, technically speaking, how to reduce the dimension of our feature space. This is exactly what Dimensionality Reduction does.

There are various techniques to achieve dimensionality reduction. However, for the sake of simplicity, I'll concentrate on the two most popular ones:
1) Feature Selection: Feature selection uses statistical or scoring techniques to decide which features to keep and which to delete.
Feature selection is also known as feature elimination, because we reduce the feature space by eliminating features. We try to filter out redundant or unwanted features from our dataset.
2) Feature Extraction: Feature extraction creates a new, smaller set of features that still captures most of the useful information.
Many algorithms have feature selection and feature extraction built in; a small code sketch contrasting the two ideas follows below.
Principal Component Analysis is a technique for feature extraction.
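To make the distinction concrete, here is a minimal sketch using scikit-learn. The Iris dataset and the choice of SelectKBest versus PCA are purely illustrative assumptions on my part, not something prescribed by PCA itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 150 samples, 4 original features

# Feature selection: keep 2 of the original features, untouched
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Feature extraction: build 2 brand-new features as combinations of all 4 originals
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)
```

Both paths end with two columns, but feature selection keeps original, interpretable columns, while feature extraction (PCA) creates new, combined ones.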

Now that we have an idea of what dimensionality reduction is, let's dive in and get a grip on PCA.

So, as mentioned above, PCA is a dimensionality reduction algorithm: it reduces the dimensionality of large datasets by transforming a large number of input variables into a smaller set that still contains most of the information in the original variables.
PCA is also used as a tool for visualization, for noise filtering, for feature extraction and engineering, and much more.

The main idea behind PCA is pretty clear: it reduces the number of input variables in the dataset while preserving as much information as possible.
This is why PCA is considered a technique for feature extraction!

That's all about the theory of PCA. Next we will understand how PCA works.

I will be explaining the mathematical intuition behind PCA with the example below. I've used 2-D data because it is easier to visualize.

The above image depicts the data of the two variables on a scatter plot.
Step 1: Find the center of the data. For this purpose we'll take the mean of all the observations along the two variable axes.

Finding the center of the data.

Step 2: Once the center is obtained, we shift the observations in such a way that the center coincides with the origin of the plane.

Center coincides with the origin of the plane.
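As a minimal NumPy sketch of Steps 1 and 2 (the data points here are made-up values, purely for illustration):

```python
import numpy as np

# Hypothetical 2-D data: each row is an observation (Variable 1, Variable 2)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Step 1: find the center of the data (mean along each variable axis)
center = X.mean(axis=0)

# Step 2: shift the observations so the center coincides with the origin
X_centered = X - center

print(center)                    # [1.9375 2.075 ]
print(X_centered.mean(axis=0))   # ~[0. 0.] after centering
```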

Step 3: We have to find the first Principal Component (PC1), i.e. the best-fit line for the data points that passes through the center. We can start with any random line through the center and project the data points onto that line.

Finding the best fit line.

Step 4: Now we need to find the distances from the projected points to the origin, square them, and maximize their sum, as shown below.

Finding the distance between projected points to the origin.

So for the PCA algorithm, the line of best fit is the one for which the sum of the squared distances of the projected points from the origin is at its maximum.
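Here is a tiny brute-force sketch of that idea (the data, angle grid, and helper name are my own illustrative choices): for each candidate line through the origin, project the points onto it, sum the squared distances of the projections from the origin, and keep the direction that maximizes this sum.

```python
import numpy as np

# Same made-up, centered 2-D data as in the previous sketch
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
X_centered = X - X.mean(axis=0)

def projection_score(data, direction):
    """Sum of squared distances of the projected points from the origin."""
    d = direction / np.linalg.norm(direction)   # unit-length direction
    projections = data @ d                      # signed distance of each projection from the origin
    return np.sum(projections ** 2)

# Try many candidate lines through the origin and keep the best one
angles = np.linspace(0.0, np.pi, 1000)
candidates = np.column_stack([np.cos(angles), np.sin(angles)])
scores = [projection_score(X_centered, c) for c in candidates]
pc1_direction = candidates[int(np.argmax(scores))]

print(pc1_direction)   # direction of the best-fit line, i.e. the first principal component
```

(In practice this is not done by brute force; the maximizing direction comes out of an eigen-decomposition or SVD, but the objective being maximized is exactly this sum.)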

Step 5: Now we need to know the slope of this line. Let's assume the slope is 0.25, which means the best-fit line consists of four parts of Variable 1 for every one part of Variable 2.

Here, B = 4 and C = 1. With the help of the Pythagorean theorem we can easily find the value of A (the hypotenuse).
PCA scales these values so that A becomes a unit-length vector, which makes A = 1; this unit vector A is the Eigenvector!

Scaling the line down to 1 unit.

The sum of squared distances of the projected data points from the origin (i.e. d1, d2, d3, d4, ...) is the Eigenvalue!
So from this little example we can see that, for PC1, Variable 1 is almost four times as important as Variable 2.
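Working those numbers out explicitly (the 4 : 1 split comes from the assumed slope of 0.25; the rest is just the Pythagorean theorem):

```python
import numpy as np

# Best-fit line: 4 parts of Variable 1 for every 1 part of Variable 2 (slope = 1/4 = 0.25)
B, C = 4.0, 1.0
A = np.hypot(B, C)                # hypotenuse: sqrt(4^2 + 1^2) = sqrt(17) ≈ 4.123

# Scale so the vector along the line has unit length -> the eigenvector for PC1
eigenvector_pc1 = np.array([B, C]) / A
print(eigenvector_pc1)            # ≈ [0.970 0.243] -> Variable 1 weighs ~4x Variable 2
```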

Step 6: We are almost halfway to our destination, but there is one last stop: finding the second principal component (PC2). There are only two components because we are using only two variables.
Since there is zero correlation between principal components, PC2 will simply be a vector orthogonal to the PC1 we just found.
PC2 will just be a line passing through the origin that is orthogonal to PC1. So PC2 is made up of -1 part of Variable 1 mixed with 4 parts of Variable 2.

The red line is the PC2.
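A one-line check that the two directions really are orthogonal (using the same assumed 4 : 1 mix as above):

```python
import numpy as np

pc1 = np.array([4.0, 1.0]) / np.hypot(4.0, 1.0)    # unit vector along PC1
pc2 = np.array([-1.0, 4.0]) / np.hypot(1.0, 4.0)   # unit vector along PC2, orthogonal to PC1
print(np.dot(pc1, pc2))                             # 0.0 -> the components are uncorrelated
```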

Now, having these two components won't be of any use without the Explained Variance.

What's Explained Variance?
Explained Variance tells us how much of the variance in the data is explained by each principal component.
We get the variance for each PC by taking its sum of squared distances (of the projected points from the origin) and dividing it by the number of observations minus one.
Principal components are ranked in order of their explained variance. We select the top principal components once their total explained variance hits a sufficient value.
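In practice you rarely compute this by hand; here is a minimal scikit-learn sketch (the Iris data and the 95% cutoff are my own illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # standardize before PCA

pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_)            # share of variance explained by each PC, in decreasing order

# Keep the top components until the cumulative explained variance reaches 95%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_components)                             # e.g. 2 components are enough here
```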

Pros and Cons:
Pros:
1) Reduces Overfitting: PCA helps the model avoid overfitting by removing unwanted features from the dataset.
2) Helps in Visualization: It is difficult to visualize data with many dimensions (4-D or more); PCA makes visualization possible by reducing the dimensionality.
3) Improves Model Performance: Having too many features can keep a model from giving its best, most accurate results; thanks to PCA, the machine learning algorithm is sped up by getting rid of correlated variables.
Cons:
1) Standardization is a must before applying PCA; otherwise PCA will not be able to find the optimal principal components (see the short sketch after this list).
2) There is a chance of information loss if we don't choose the principal components with care.
3) The input variables tend to become less interpretable.
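To illustrate the first point, here is a small sketch comparing PCA on raw versus standardized data; the wine dataset is just an assumed example whose features live on very different scales.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)   # 13 features on very different scales

# Without standardization, the features with the largest raw variance dominate PC1
raw = PCA(n_components=2).fit(X)
print(raw.explained_variance_ratio_)      # roughly [0.998, 0.002]: PC1 mostly mirrors one feature

# After standardization, every feature contributes on an equal footing
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(scaled.explained_variance_ratio_)   # roughly [0.36, 0.19]
```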

So that's the end! I hope I was able to give my readers the basic essence of this beautiful algorithm.
I'm embedding some important resources that I found useful while learning this algorithm:
https://www.youtube.com/watch?v=FgakZw6K1QQ
https://www.youtube.com/watch?v=OFyyWcw2cyM
https://arxiv.org/pdf/1404.1100.pdf

Happy Learning Y’all.. :)

