Basic intuition of PCA

Priyam Bajpai · Published in Nerd For Tech · Jul 23, 2021 · 5 min read

Whenever we come across machine learning models involving images, or have to deal with high-dimensional vectors, computation becomes a barrier to getting timely results. It is therefore intuitive to use algorithms that reduce the computational complexity of working with such vectors and help us get timely and better results. One such approach is dimensionality reduction, and today we will talk about one of its most important techniques - Principal Component Analysis, or PCA.

An illustration of the intuition behind PCA, from 365 Data Science

Dimensionality Reduction and PCA

Suppose we are working on a dataset containing images. Our aim is to train a model that classifies the images into two or more categories. So what are these images in the dataset, essentially? They are high-dimensional vectors that need to be processed, which costs a lot of computational effort and time. A good example would be a model working on human genes, which usually involves huge amounts of multidimensional data. Reducing the dimensions not only saves time and resources, but also helps do away with unnecessary complexity.

Just imagine a sheet of paper with some points marked on it, folded in some manner such that it takes up 3D space. To refer to any point on that sheet of paper, we'd need three coordinates, owing to its 3D shape. Now, if we unfold the sheet so that it occupies a 2D space, how many coordinates would we need to locate a point? Two, of course!

That's exactly what the PCA algorithm does. It helps us project data of a higher dimension into a space of lower dimension, so that the number of dimensions involved in the computation is reduced while the accuracy isn't affected much. It is an unsupervised learning algorithm, mainly used when complexity can be reduced at the cost of a small, acceptable loss of accuracy. Without going into a formal definition, the main aim of PCA is to project data of dimension d into a space of dimension p such that p < d. We do so by finding the principal components of the data. Principal axes can be defined as the axes along which the projected data points have maximum variance. By this definition, the first principal axis is the axis that ensures maximum variance of the projected points, the second principal axis is the one that ensures the second-largest variance, and so on. One interesting feature of these principal axes is that they are mutually perpendicular to each other.
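
As a quick illustration of this projection idea, here is a small sketch using numpy and scikit-learn's PCA (the folded-sheet data and the variable names are invented for this example): points that sit almost on a flat 2D sheet inside 3D space can be described by just two numbers each after PCA.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 300 points that lie almost on a flat 2D sheet embedded in 3D space:
# the third coordinate is only a little measurement noise
sheet = rng.uniform(-1, 1, size=(300, 2))
points_3d = np.column_stack([sheet, 0.01 * rng.normal(size=300)])

# Ask PCA for the two directions of maximum variance and project onto them
points_2d = PCA(n_components=2).fit_transform(points_3d)

print(points_3d.shape, "->", points_2d.shape)   # (300, 3) -> (300, 2)

Note that scikit-learn stores one point per row (an n⨉d matrix), whereas the derivation below stores one point per column (a d⨉n matrix).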

Theory

Let us assume that we have a matrix X = [x₁ x₂ x₃ x₄ … xₙ] of n points, where each point in X is a vector of dimension d, thus making X a d⨉n matrix. Our aim is to project the data onto an axis such that the variance is maximum. Let U₁ be a vector of dimension d⨉1 that is supposed to be one of our principal axes. The projected values along U₁ would then be U₁ᵀX, where U₁ᵀ is the transpose of U₁, with dimensions 1⨉d.

So our aim is to maximize Variance(U₁ᵀX). (For the multivariate data X itself we speak of a covariance matrix rather than a single variance, since vectors are involved.)
We know that, for centered data, the covariance matrix of X is XXᵀ (ignoring the 1/n scaling factor).
So, assuming X is centered,
Variance(U₁ᵀX) = (U₁ᵀX)(U₁ᵀX)ᵀ
⇒ U₁ᵀ(XXᵀ)U₁
⇒ U₁ᵀSU₁, where S is the covariance matrix of X, with dimension d⨉d
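
As a quick sanity check of this identity, here is a small numpy sketch on invented toy data (the 1/n scaling of S is assumed): the variance of the projected 1D values U₁ᵀX matches U₁ᵀSU₁ for any unit vector U₁.

import numpy as np

rng = np.random.default_rng(1)

# Toy data: d = 4 features, n = 500 points, stored as a 4 x 500 matrix as in the text
X = rng.normal(size=(4, 500))
Xc = X - X.mean(axis=1, keepdims=True)     # center each feature
S = Xc @ Xc.T / Xc.shape[1]                # covariance matrix, d x d

# Any unit-length direction u1
u1 = rng.normal(size=4)
u1 /= np.linalg.norm(u1)

# Variance of the projected (1D) values equals u1' S u1
projected = u1 @ Xc
print(np.var(projected), u1 @ S @ u1)      # the two numbers agree up to floating-point error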

We cannot maximize the quadratic form U₁ᵀSU₁ directly, since it has no upper bound: scaling U₁ scales the value. So, instead, we introduce the constraint U₁ᵀU₁ = 1. We can do this because our real aim is to find the direction of the vector, so its length can be fixed.

Now, we will use the Lagrangian form to solve this constrained problem:
L(U₁, λ₁) = U₁ᵀSU₁ - λ₁(U₁ᵀU₁ - 1)
To obtain the maximum (or saddle point), we partially differentiate this w.r.t. U₁ and equate it to 0, and, sidelining the intermediate calculations, get the following equation:
2SU₁ - 2λ₁U₁ = 0
⇒ SU₁ = λ₁U₁, which is exactly the equation defining an eigenvalue-eigenvector pair. This shows that the principal components are nothing but the eigenvectors of S. Also, we can see that U₁ᵀSU₁ = U₁ᵀλ₁U₁ = λ₁U₁ᵀU₁ = λ₁, which is a scalar.
So, with this in mind, the largest of all the eigenvalues λ corresponds to the maximum variance, i.e. to the first principal component, by definition. Thus we need to find these eigenvalues, sort them in descending order, and we obtain our principal components.
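
To make this concrete, here is a minimal numpy sketch (the toy data and variable names are invented for the example): we build the covariance matrix, take its eigendecomposition, sort the eigenvalues in descending order, and check that the largest eigenvalue really is the variance along the first principal axis.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 300))                 # d x n toy data
Xc = X - X.mean(axis=1, keepdims=True)        # center the data
S = Xc @ Xc.T / Xc.shape[1]                   # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)          # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                            # first principal axis
print(eigvals[0], np.var(u1 @ Xc))            # largest eigenvalue = variance along u1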

Method

We achieve this through SVD, or singular value decomposition.
Let us first convert our data points x₁, x₂, … into x₁ - x̄, x₂ - x̄, and so on, where x̄ is the mean of all the data points. This centers the data along each feature. Let X′ represent the new matrix thus formed.

According to the SVD, any matrix X can be written as U∑Vᵀ, where the columns of U are the eigenvectors of XXᵀ, the columns of V are the eigenvectors of XᵀX, and ∑ is a diagonal matrix of singular values.
So, in our case, the SVD of X′ yields U∑Vᵀ, where the columns of U are the eigenvectors of X′X′ᵀ. Now, using the definition of covariance, X′X′ᵀ is (up to a 1/n factor) nothing but the covariance matrix of X. And, as seen above, the eigenvectors of the covariance matrix are exactly the principal components. So this is how we get the principal components of a matrix.
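
The claimed equivalence is easy to verify numerically. Here is a small numpy sketch on invented data: the left singular vectors of the centered matrix agree (up to sign flips, which do not change the axes) with the eigenvectors of X′X′ᵀ.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 200))
Xc = X - X.mean(axis=1, keepdims=True)        # X' in the text: centered data

# Left singular vectors of the centered matrix ...
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# ... are the eigenvectors of Xc Xc^T, i.e. of the (unscaled) covariance matrix
eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Each column agrees up to a sign flip, which does not change the axis
print(np.allclose(np.abs(U), np.abs(eigvecs)))   # True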

Algorithm

  1. Convert X to X′ by subtracting the mean from every point (centering the data).
  2. Perform SVD on X′.
  3. The matrix U thus formed contains the principal components of X. Each column of U is a principal component, i.e. the first column is the first principal component, the second column is the second principal component, and so on.
  4. Take the first p columns of U (a d⨉p matrix) and transpose them to use as the first p principal components. The matrix formed is of order p⨉d. Let's name it G for now.
  5. GX′ then yields a p⨉n matrix from the d⨉n data matrix, which is obviously of a lower dimension, since p < d.

This is how a matrix is projected into a lower dimension through PCA.
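
Putting the steps together, a minimal sketch of the whole procedure in numpy could look like this (the function name pca_project and the toy data are my own, not from any library):

import numpy as np

def pca_project(X, p):
    # 1. Center the data: subtract each feature's mean
    Xc = X - X.mean(axis=1, keepdims=True)
    # 2. Perform SVD on the centered matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # 3-4. Columns of U are the principal components; keep the first p and transpose to get G (p x d)
    G = U[:, :p].T
    # 5. Project: the result is p x n
    return G @ Xc

X = np.random.default_rng(4).normal(size=(10, 100))   # d = 10, n = 100
Y = pca_project(X, p=3)
print(X.shape, "->", Y.shape)                         # (10, 100) -> (3, 100)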

Other uses

Often the images used as a dataset contain some noise that needs to be removed for accurate results. For this, we can drop the lower principal components through PCA and re-project the data back into its original dimensions. This can be done as follows:

Let U be the d⨉p matrix whose columns are the first p principal components of X.
Then y = UᵀX is of dimension p⨉n.
Now, to obtain the reverse-projected matrix Xnew, we simply multiply y by U:
Xnew = Uy, where Xnew again has dimensions d⨉n, but with different values, since it is reverse-projected using only the first p PCs out of the total d.
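
As a rough sketch of this denoising idea in numpy (the low-rank toy data is invented; real images would be flattened into the columns of X): reconstructing from only the top p components keeps the strong signal directions and throws away most of the small noise.

import numpy as np

rng = np.random.default_rng(5)

# Noisy d x n data: a strong 2D signal mixed into 8 features, plus small noise everywhere
signal = rng.normal(size=(2, 200))
mixing = rng.normal(size=(8, 2))
X = mixing @ signal + 0.05 * rng.normal(size=(8, 200))

Xc = X - X.mean(axis=1, keepdims=True)
U_full, s, Vt = np.linalg.svd(Xc, full_matrices=False)

p = 2
U = U_full[:, :p]          # d x p: the first p principal components
y = U.T @ Xc               # p x n: projection onto those components
X_new = U @ y              # d x n again, rebuilt from only the top p components

# Most of the small noise lives in the discarded components, so X_new is a cleaner copy
print(np.linalg.norm(Xc - X_new) / np.linalg.norm(Xc))   # small relative difference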

That’s all for today!! Hope you liked it.

