Principal Component Analysis, But Why?

Bhanu Kiran
5 min read · Jan 10, 2023


If you are reading this blog, I'm sure you are aware of Exploratory Data Analysis, or EDA. If not, then in simple words, it is an approach for maximizing insight into the data before formal modeling. There are several reasons why we perform EDA, including:

  1. to guide hypothesis testing
  2. assess our assumptions about data
  3. identify essential features of data
  4. uncover hidden structures
  5. graphical analysis

And the list goes on and on, but one place EDA applies is in unsupervised learning. What is unsupervised learning? It is when your data does not have a target or output variable. In other words, your data is filled with features and nothing to map these features to. Unsupervised learning includes famous methods such as cluster analysis and, of course, PCA, or Principal Component Analysis.

Principal Component Analysis

Why did I mention it with EDA and not as a model itself? Because it can be part of EDA, and it can be an unsupervised learning method as well; it just depends on how you use it.

To get a better understanding of what PCA is, let us break down the term principal component. A principal component is a linear combination of the predictor variables. The idea in PCA is to combine multiple numeric predictor variables, also known as features, into a smaller set of variables, which are weighted linear combinations of the original set.

In simpler words, if I have 6 feature columns, I can combine them into 2 or 3 new columns that explain most of the variability of all 6 columns, reducing the dimension of the data.

Variability here means the extent to which data points in a statistical distribution or data set diverge, or vary, from the average value, as well as the extent to which these data points differ from each other. This matters because variables will often vary together, and a variation in one variable is duplicated by a variation in another.

Let's take an example of two variables/features, X and Y. For these two variables there are two principal components, PCi (i = 1 or 2):

PCi = wi,1 · X + wi,2 · Y

Fig .1 Principal components

The weights wi,1 and wi,2 are known as the component loadings, and these transform the original variables into PCs (principal components).

The first principal component, PC1, is the linear combination that best explains the total variation, and the second principal component, PC2, is orthogonal to PC1 and explains as much of the remaining variation as it can.

If there were additional components, each additional one would be orthogonal to all the others.
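The loadings and orthogonality described above can be computed directly. Here is a minimal numpy sketch on a hypothetical two-feature data set (the variable names and synthetic data are illustrative, not from the article): the eigenvectors of the covariance matrix give the component loadings wi,1 and wi,2, and PC1 and PC2 come out orthogonal.

```python
import numpy as np

# Hypothetical two-feature data set (X and Y from the example above)
rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = 0.8 * X + rng.normal(scale=0.4, size=200)  # Y varies together with X
data = np.column_stack([X, Y])

# Center the data, then eigendecompose the covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # re-sort by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Columns of eigvecs are the loadings (wi,1, wi,2) for PC1 and PC2
w1, w2 = eigvecs[:, 0], eigvecs[:, 1]
print("PC1 loadings:", w1)
print("PC1 orthogonal to PC2:", np.isclose(w1 @ w2, 0.0))
```

Note that PC1's loadings point along the direction in which X and Y vary together, which is exactly the "duplicated variation" the previous section described.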

In general, PCA is about generating new variables that are better at explaining our data, and PCs have three properties:

  1. unique
  2. orthogonal
  3. linear combination of the original attributes

Let's take an example. Consider that our data has 4 attributes/variables/features. Hence, we derive 4 PCs.

The PCs are ranked according to variance: the first PC has the highest variance, the second has less, and so on.

Fig 2. Scree Plot

Now, instead of the 4 columns that originally existed, our data has 4 new columns: PC1, PC2, PC3, and PC4. Each row holds the scores, that is, the transformed attribute values, for one data point.

Visual Data Exploration by PCA

The most common use case for PCA is feature extraction; the other common use case is dimension reduction.

Since the PCs are ranked by variance, as seen above in Fig 2, the top PCs carry the signal, while the bottom PCs, which capture little variation, are mostly background noise.

From the plot in Fig 2. and the transformed dataset, how do you select the number of PCs to keep?

Well, one trick is to look at the plot of variance, known as the scree plot, and apply the elbow method: the point where the curve bends into an "elbow" tells us how many PCs to keep.

Fig 3. Elbow method

This method gives us a reasonable number of principal components for our analysis.
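A closely related rule of thumb can be computed rather than eyeballed: keep enough PCs to cover some fraction of the total variance. The eigenvalues and the 90% threshold below are illustrative assumptions, not values from the article's figure.

```python
import numpy as np

# Hypothetical per-PC variances, like the bars of a scree plot
eigvals = np.array([5.0, 1.5, 0.3, 0.2])

explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Keep enough PCs to cover ~90% of the total variance
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print("explained ratio:", explained_ratio.round(3))
print("PCs to keep:", n_keep)  # prints 2: the first two PCs cover over 90%
```

Either way, the goal is the same as the elbow method: separate the high-variance signal PCs from the low-variance noise PCs.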

As far as dimension reduction is concerned, you can interpret the data as follows. You have 4 columns and no target column, and you would like to find patterns and analyze the data for hidden structures. The first step is to plot the data, but how are you going to plot 4 columns on a 2-axis plot? PCA is your answer: just as X and Y reduce to PC1 and PC2 in Fig 1, you can project your data onto the top two PCs, and then go ahead with your cluster analysis.
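As a final sketch, here is that projection in numpy on a hypothetical 4-column, no-target data set: every data point collapses to a (PC1, PC2) pair that fits on a 2-axis scatter plot, ready for cluster analysis.

```python
import numpy as np

# Hypothetical 4-column data set with no target column
rng = np.random.default_rng(2)
data = rng.normal(size=(150, 4))

centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]

# Keep only the top two components: each point becomes a (PC1, PC2) pair
coords_2d = centered @ eigvecs[:, order[:2]]
print(coords_2d.shape)  # (150, 2): ready for a 2-axis scatter plot or clustering
```

From here, a scatter of the two columns of `coords_2d` (e.g. with matplotlib) is the plot the paragraph above asks for, and any clustering algorithm can run on it directly.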
