A deep dive into Principal Component Analysis (PCA)

Sayantanee Sen · Published in Analytics Vidhya · Oct 3, 2020

There are three steps to understanding Principal Component Analysis: the mathematics required for PCA, the statistics concepts we will need, and finally connecting all the dots while solving the optimization problem at the heart of PCA.

Mathematics for PCA:

Pardon me, because I am going to take you back to high school mathematics! So, how much math do we actually need to understand Principal Component Analysis?

The answer is: concepts of linear algebra and a few trigonometry rules should be good enough to dive into PCA. Let’s start with a few basic equations of trigonometry.

Trigonometry Rules you should keep handy:

We will be using the cosine of angles quite frequently, hence we need to be very quick with the following rule for a right-angled triangle: cos θ = adjacent side / hypotenuse.

The second fundamental rule of trigonometry that we need to know for the upcoming steps is: cos(A - B) = cos A cos B + sin A sin B.

Finally, here is a quick reference for the sine and cosine of the common angles: sin 0° = 0, sin 30° = 1/2, sin 45° = √2/2, sin 60° = √3/2, sin 90° = 1, and the cosine values run through the same list in reverse order. (I would encourage the readers to revisit the basic formulations of trigonometry and their corresponding proofs.)

Linear Algebra : Vectors

Now that we know the basics of trigonometry, let us walk through the linear algebra prerequisites. We will base the entire theory of linear algebra on vectors, so it becomes very crucial to be able to visualize them. Any vector in n-dimensional space is represented by its “n” coordinates. Each vector has a direction in space and a related magnitude.


What is a unit vector?

A unit vector is a vector with a magnitude of 1, which means it is nothing but a direction. Take note of this point; we are going to refer to it in our PCA discussion.

** ǁvǁ : Magnitude of v
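As a quick illustration (this snippet is not from the original article; it assumes NumPy and an arbitrary example vector), here is how a vector’s magnitude is computed and how the vector is rescaled into a unit vector:

```python
import numpy as np

v = np.array([3.0, 4.0])           # an arbitrary example vector
magnitude = np.linalg.norm(v)      # ǁvǁ = sqrt(3^2 + 4^2) = 5.0
unit_v = v / magnitude             # same direction, magnitude 1

print(magnitude)                   # 5.0
print(np.linalg.norm(unit_v))      # 1.0, i.e. only the direction remains
```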

Matrices:

Matrices can be considered as a kind of transformation function which can be applied to vectors to move their coordinates, to scale them, or both. In the real world, data comes in the form of matrices (and rarely square ones). For instance, if we need to collect every patient’s information, such as height, weight, blood pressure levels, allergic/non-allergic to certain drugs etc., we will have each patient record as a row of our data matrix, and each of the attributes describing the patient can be treated as an individual column. In the above case, there are “m” patients and “n” medical attributes, thus forming an [m X n] (read as m by n) matrix.

Do you see it yet? Yes, each of our columns represents a vector known as a column vector, and each of the patient records represents a row vector. It is a general convention of linear algebra that we prefer to deal with column vectors, hence the row vectors can be represented as transposes of column vectors.

Vector Operations: Dot Product

The dot product of two vectors a and b is given by a · b = ǁaǁ ǁbǁ cos θ, where θ is the angle between them. Intuitively, it measures the projection of one vector onto another (the shadow cast by a light source placed exactly above the vector head; thus, when two vectors are perpendicular, the dot product becomes zero, just like the shadow you cast when you stand under the sun while it is directly overhead).

Here is a simple proof of the above. Don’t forget to refer back to what we learnt in the trigonometry section of this article.

**: θ = α - β (can be proved either way)

And finally, what we need to remember is how the dot product of two vectors can also be expressed in terms of their coordinates: a · b = aᵀb = Σ aᵢbᵢ.
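As a hedged check (not part of the original article; it uses NumPy and two made-up vectors a and b), the snippet below verifies that the coordinate form aᵀb and the geometric form ǁaǁ ǁbǁ cos(α - β) give the same number:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])

# Coordinate form: a . b = sum of a_i * b_i
dot_coord = float(np.dot(a, b))            # 2*1 + 1*3 = 5

# Geometric form: a . b = ǁaǁ ǁbǁ cos(theta), with theta = alpha - beta
alpha = np.arctan2(a[1], a[0])             # angle of a with the x-axis
beta = np.arctan2(b[1], b[0])              # angle of b with the x-axis
dot_geom = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(alpha - beta)

print(dot_coord, round(dot_geom, 10))      # both equal 5.0
```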

Eigenvectors and Eigenvalues:

As stated above, matrices can be treated as transformations which take a vector in the vector space to a different position. (If you need to witness a really good visualization of such transformations, head towards the 3Blue1Brown videos any day!) There are a few special kinds of vectors which are not moved (no change in direction/rotation) even after applying the transformation; however, they can be scaled by a certain value. Those vectors are known as “eigenvectors” and the corresponding scaling factors are called “eigenvalues”.

Mathematically, if A is the transformation matrix and v is an eigenvector, then Av = λv, where λ is a scalar.
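To make this concrete, here is a small sketch (not from the original article; it assumes NumPy and a toy diagonal matrix A) that computes eigenvectors and eigenvalues and checks that Av = λv holds for each pair:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])                  # a simple transformation matrix

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` is an eigenvector v satisfying A v = lambda v
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))      # True: direction unchanged, only scaled
```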

Statistics for PCA:

What do we mean by column normalization?

Differences in the scales of measurement between various features may cause a misleading depiction of the data. For instance, if we plot a simple scatter plot of the height and weight measurements of all the students in a class, do you think we are going to correctly categorize the student groups? Given that the height and weight measurement scales are different from each other, we need to find a way to eliminate this issue.

This is exactly where column normalization comes to the rescue. In short, this technique makes all the values of every feature fall in the range [0, 1]. The mathematical way of doing this is as follows:

Let [a1, a2, a3, ….., an] be the n values of feature “fi”,

such that Max(ai) = a_max ≥ ai and Min(ai) = a_min ≤ ai, ∀ ai (i: 1 to n).

Now, we are going to derive another data set [a1`, a2`, a3`, ….., an`], where ai` is given by:

ai` = (ai - a_min) / (a_max - a_min)

Now if we substitute ai with a_min and a_max (one at a time) and calculate the corresponding ai`, we will see that ai` always lies in the range [0, 1].
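Here is a minimal sketch of the same formula in code (not part of the original article; it assumes NumPy and a made-up feature column a):

```python
import numpy as np

# One feature column, measured on an arbitrary scale
a = np.array([150.0, 160.0, 172.0, 181.0, 190.0])

a_min, a_max = a.min(), a.max()
a_norm = (a - a_min) / (a_max - a_min)   # ai` = (ai - a_min) / (a_max - a_min)

print(a_norm)   # all values lie in [0, 1]; a_min maps to 0 and a_max maps to 1
```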

What do we mean by column standardization?

A dataset is column standardized when every column has a mean of 0 and a standard deviation of 1. This basically means centering our data points around the origin. Also, since the standard deviation must be 1, we either have to squish our data points or expand them to form the new transformed space.

Geometrically, normalization rescales each feature to the [0, 1] range, while standardization centers the data around the origin and rescales each feature to unit standard deviation. A dataset needs to be preprocessed with one of these techniques before we apply PCA.
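A minimal sketch of column standardization (again not from the original article; it assumes NumPy and a toy data matrix X whose rows are records and columns are features):

```python
import numpy as np

# Toy data matrix: rows are records, columns are features (say height, weight)
X = np.array([[150.0, 50.0],
              [160.0, 62.0],
              [172.0, 70.0],
              [181.0, 78.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # per column: mean 0, std 1

print(X_std.mean(axis=0).round(10))            # [0. 0.]
print(X_std.std(axis=0))                       # [1. 1.]
```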

Covariance between attributes in a column standardized data matrix:

The final step now is to learn about the covariance concept. Please note that whenever we talk about data matrices, we assume that the rows are individual records and the columns are different attributes or features. To start with, the variance of any one feature means the spread of data points along that feature’s axis. By that definition, the covariance between two features can be understood as how both features vary with respect to each other.

The covariance can be very useful in finding out the relationships between different attributes.

Now that we understand the basic concept of covariance, let’s see how we denote covariance mathematically for a column standardized matrix. The general formula for covariance is given by:

Cov(f1, f2) = (1/n) * Σ (xi - x̅)(yi - ȳ), where f1 and f2 are the features, xi, yi are the data points at the ith row, and x̅, ȳ are the means of all x’s and y’s.

Now if the data matrix is column standardized, we can substitute the mean values with 0, which reduces our equation to Cov(f1, f2) = (1/n) * Σ (xi)(yi). Up to the 1/n factor, this is nothing but the dot product of the f1 and f2 column vectors, which is again (f1)ᵀ f2 (f1 transpose multiplied by f2). Now instead of just two column vectors, we can prove the same argument for the entire data matrix A.

Thus, COV(A) = (1/n)(Aᵀ)(A) [A transpose is d X n and A is n X d, where n is the number of rows and d is the number of features]. The 1/n factor only scales the matrix, so it does not change the directions of its eigenvectors.
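The following sketch (not from the original article; it assumes NumPy and randomly generated data A) checks that (1/n) AᵀA on a column standardized matrix matches NumPy’s own covariance routine:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                 # n = 100 rows, d = 3 features
A = (A - A.mean(axis=0)) / A.std(axis=0)      # column standardize first

n = A.shape[0]
cov_manual = (A.T @ A) / n                    # (1/n) * A^T A, a d x d matrix

# np.cov treats columns as variables when rowvar=False;
# bias=True selects the 1/n convention used above instead of 1/(n-1).
cov_numpy = np.cov(A, rowvar=False, bias=True)

print(np.allclose(cov_manual, cov_numpy))     # True
```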

Summarizing all the equations we learnt so far: cos(A - B) = cos A cos B + sin A sin B; a · b = ǁaǁ ǁbǁ cos θ = aᵀb; Av = λv for an eigenvector v and its eigenvalue λ; and COV(A) = (1/n) AᵀA for a column standardized data matrix A.

Understanding how the above Mathematics and Statistics formulas are used in PCA:

PCA deserves a dedicated write-up of its own; however, to connect all the dots so far and to understand how we are going to apply the above knowledge, let us go through the PCA optimization problems.

In plain English, PCA is a technique to find the direction of maximum variance in data in d-dimensional space. Once we find that direction of maximum variance, we define the first principal component vector (a unit vector) in that direction, say PC1. The next step is to draw an orthogonal vector PC2 ⊥ PC1, and so on. Remember, the number of principal components is equal to the number of input features.

Now, why would we want to do this? Say we have two features x and y (as shown in the figure). The variance, or the spread, is almost equal along both the x and y axes, which means we cannot drop either of these features without losing significant information.

It’s worth mentioning here that with more and more features, we can run into the Curse of Dimensionality (read about it, if you don’t know it already). Thus, we might often need to drop features from our dataset. This process is known as dimensionality reduction.

Step 1: Find the direction of maximum variance (a unit vector) and rotate the axes along that direction

Now, coming back to the above scatter plot: if we somehow find the direction of maximum variance, which is the green line, and we rotate our x-axis to overlay the green line, then the pink line which is orthogonal (perpendicular) to the green line becomes the equivalent of our existing y-axis. So, instead of features x and y, we now have PC1 (the green line) and PC2 (the pink line). We can now drop PC2, since most of the spread, and hence most of the information, is contained in PC1 itself. Do you see what we did? We exchanged two features for two PCs and then dropped one of the PCs.

Step 2: Project all the points onto our new PC1 unit vector

Note that the points we see in the scatter plot above are not just dots; they are actually the heads of vectors in the space. The projection of all these vectors onto the PC1 unit vector can be calculated by our dot product formula.

Projection of any xi onto the unit vector u (PC1): xi` = (u · xi) / ǁuǁ

Since u is a unit vector, ǁuǁ = 1; thus, xi` = (u · xi) = uᵀxi
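As a small sketch (not from the original article; it assumes NumPy, a toy set of points X and a hypothetical direction u), this is how all the projections uᵀxi can be computed in one step:

```python
import numpy as np

X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.0, -2.0]])         # each row is a data point / vector xi

u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)            # make u a unit vector, so ǁuǁ = 1

projections = X @ u                  # xi` = u^T xi for every row at once
print(projections)
```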

Step 3: The first optimization problem is to find a unit vector that maximizes the variance

Variance{uᵀxi} for i: 1 to n = (1/n) Σ (uᵀxi - uᵀx̅)²

Note that our data should already be column standardized before we even start applying PCA. Given that, x̅ = 0, so the second term uᵀx̅ vanishes. Thus our optimization problem boils down to the following form:

Objective: Maximize Variance{xi`} = (1/n) Σ (uᵀxi)²

Constraint: uᵀu = 1 (u is a unit vector)

Step 4: The second (or alternative) optimization problem is to find a unit vector that minimizes the sum of the squared distances of the data points xi from PC1

Given any vector/data point xi, the distance di can be calculated using the Pythagoras theorem as di² = ǁxiǁ² - (uᵀxi)² (note: uᵀxi is the projection length).

This can be rewritten as:

di² = ǁxiǁ² - (uᵀxi)² = xi · xi - (uᵀxi)² = xiᵀxi - (uᵀxi)²

Objective: Minimize Σ ( xiᵀxi - (uᵀxi)² )

Constraint: uᵀu = 1 (u is a unit vector)
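To see why the two formulations agree, note that ǁxiǁ² = (uᵀxi)² + di², so the sum of squared projections and the sum of squared distances always add up to the same constant Σǁxiǁ². The sketch below (not from the original article; it uses NumPy, random column standardized data X and the hypothetical helper names variance_term and distance_term) sweeps candidate unit vectors around the circle and confirms that the same u maximizes one objective and minimizes the other:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # column standardized data

def variance_term(u):
    return np.sum((X @ u) ** 2)                  # sum of squared projections

def distance_term(u):
    return np.sum(np.sum(X ** 2, axis=1) - (X @ u) ** 2)   # sum of di^2

# Candidate unit vectors at angles between 0 and 180 degrees
angles = np.linspace(0, np.pi, 181)
units = np.column_stack([np.cos(angles), np.sin(angles)])

var_vals = np.array([variance_term(u) for u in units])
dist_vals = np.array([distance_term(u) for u in units])

# The u that maximizes the variance term also minimizes the distance term,
# because variance_term(u) + distance_term(u) = sum of ǁxiǁ² for any unit u.
print(var_vals.argmax() == dist_vals.argmin())   # True
```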

Step 5: Solution for the optimization problem

How do we find the unit vector that maximizes the variance? Well, do you remember discussing eigenvectors? Yes, finding the eigenvectors of the covariance matrix, sorted in decreasing order of eigenvalue with the greatest eigenvalue first, will give us the direction of maximum variance.

Av = λv

Here A is our covariance matrix, which is always a symmetric [d X d] matrix, where d is the number of features. Recall the equation that we just learnt:

Thus, COV(A) [d X d] = (1/n)(Aᵀ)A [A transpose is d X n and A is n X d, where n is the number of rows and d is the number of features]

Now, Eigenvectors of COV(A) = V1, V2, …….Vd

Eigenvalues of COV(A) = λ1, λ2,λ3……..λd, such that λ1 ≥ λ2 ≥ λ3……..≥λd

So basically V1 is the unit vector u (PC1) that we have been looking for. This is the case when we are looking for just one principal component, PC1. But in real scenarios you might be looking for more than one principal component. For instance, suppose our data matrix A has 100 features and we want to retain only 50% or fewer of them. In that case, what we need to do is find the smallest number of components whose eigenvalues account for most of the total variance, say [(λ1 + λ2 + λ3 + …….. + λ52) / Σλi] = 0.99. Thus, in this example, if we preserve the first 52 principal components and drop the rest, we preserve 99% of the information.
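Putting everything together, here is a hedged end-to-end sketch (not from the original article; it assumes NumPy, randomly generated data X, and a 99% variance threshold) of PCA via the eigendecomposition of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)   # add some correlation
X = (X - X.mean(axis=0)) / X.std(axis=0)                  # column standardize

n = X.shape[0]
cov = (X.T @ X) / n                                       # d x d covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder them to get lambda1 >= lambda2 >= ... >= lambda_d.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the smallest k whose eigenvalues capture 99% of the total variance
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
k = int(np.searchsorted(explained, 0.99) + 1)

X_reduced = X @ eigenvectors[:, :k]      # project the data onto the first k PCs
print(k, X_reduced.shape)
```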
