A Beginner’s Guide to Principal Component Analysis

Anuj Shrivastav
Published in Analytics Vidhya
5 min read · Feb 28, 2020

YOU CAN CHANGE THE WAY YOU LOOK AT THE WORLD!

Well, that’s true, if the data means the world to you.

Hi! In this post, we are going to look at one of the most beautiful topics in Machine Learning, a technique used to reduce the dimensions of our data, known as Principal Component Analysis (abbreviated as PCA).

Before beginning, let’s first understand why we need to reduce the dimensions of our data at all.

1. Curse of Dimensionality :- As the number of dimensions grows, the data becomes increasingly sparse, and models need far more samples to learn anything meaningful from it.

2. Data Visualization :- Obviously we can’t visualize data having more than 3 dimensions, and hence we need some way to represent our data, or at least its most important features, in 2D or 3D (or any other required number of dimensions) so that we can get an idea of how our data is distributed.

Let’s see the core idea that is involved in PCA:

[Source: My PC]

If I were to ask you to convert this 2D data into 1D data, which feature would you select?

You would obviously select f₂ = height. Why?

You can see that the blackness of hair doesn’t have much spread/information in it, so removing it and representing all the points just in terms of height would suffice.

[Source: Google Images]

Now if I give you the following data:

[Source: my PC]

Can you come up with a feature now?

Both of the features f₁ and f₂ have nearly equal spread, so we come up with new features as follows:

[Source: My PC]

Now it has become similar to the previous problem.

Feature f₁’ has a lot of spread/information while f₂’ doesn’t, so we select only f₁’ and drop f₂’.

Core idea: Select/extract those features along which we get high variance and drop those along which we get low variance.

Steps to reduce 2D data to 1D data:

· Find f₁’ and f₂’ such that f₁’ has maximum variance

· Drop f₂’

· Project all points xᵢ’s onto f₁’

This can be generalized to reduce our data from d dimensions to d’ (d’ < d), but for the sake of simplicity, let’s just focus on reducing 2D data to 1D data.
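To make this concrete, here is a minimal Python sketch on hypothetical toy data (the helper name `variance_along` is just an illustrative choice) that measures the spread of 2D data along a few candidate directions. The rotated direction, which plays the role of f₁’, carries far more variance than either original axis.

```python
import numpy as np

# Hypothetical toy data: most of the spread lies along the 45-degree direction,
# similar to the rotated data in the figures above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])   # wide along axis 0, narrow along axis 1
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T                                            # rotate every point by 45 degrees

def variance_along(X, direction):
    """Variance of the data projected onto the given direction (normalized to a unit vector)."""
    u = direction / np.linalg.norm(direction)
    return (X @ u).var()                               # scalar projections, then their variance

print(variance_along(X, np.array([1.0, 0.0])))         # along f1
print(variance_along(X, np.array([0.0, 1.0])))         # along f2
print(variance_along(X, np.array([1.0, 1.0])))         # along the rotated axis (f1')
```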

Before moving on to the mathematics behind PCA, make sure you are well versed in the following topics:

· Eigenvalues and Eigenvectors

· Covariance Matrix

· Solving Constrained Optimization Problems

Remember what we need to do: we need to find a unit vector û along f₁’ and project every xᵢ onto it to get the corresponding xᵢ’. û should be selected such that the variance of the xᵢ’ is maximal.

Projection of a point x₁ onto û = x₁ · û

Projecting all the points, we get

xᵢ’ = ûᵀ · xᵢ

and also x̄’ = ûᵀ · x̄ , i.e. mean(xᵢ’) = ûᵀ · mean(xᵢ)

So we need to find û such that the variance of {projection of xᵢ}ᵢ ₌ ₁ ₜₒ ₙ is maximal, i.e.

maxᵤ Var{ ûᵀ · xᵢ } = maxᵤ (1/n) Σᵢ₌₁ⁿ ( ûᵀ · xᵢ − ûᵀ · x̄ )²

After column standardizing our data, x̄ = 0, so this reduces to

maxᵤ (1/n) Σᵢ₌₁ⁿ ( ûᵀ · xᵢ )² = maxᵤ ûᵀ · [ (1/n) Σᵢ₌₁ⁿ xᵢ · xᵢᵀ ] · û

Therefore, the problem becomes

maxᵤ ûᵀ · S · û

where S = (1/n) Σᵢ₌₁ⁿ xᵢ · xᵢᵀ is the covariance matrix of the (standardized) data,

subject to the constraint that

ûᵀ · û = 1 (since û is a unit vector)
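As a quick numerical sanity check on hypothetical data, the sketch below verifies that, after column standardization, the variance of the projections ûᵀ · xᵢ equals the quadratic form ûᵀ · S · û:

```python
import numpy as np

# Hypothetical data: 200 points in 2 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Column standardize: zero mean and unit variance for every feature
X = (X - X.mean(axis=0)) / X.std(axis=0)

S = np.cov(X, rowvar=False, bias=True)        # covariance matrix S = (1/n) Σ xᵢ · xᵢᵀ
u = np.array([1.0, 1.0]) / np.sqrt(2.0)       # some unit vector û

var_of_projections = (X @ u).var()            # Var{ ûᵀ · xᵢ }
quadratic_form = u @ S @ u                    # ûᵀ · S · û

print(np.isclose(var_of_projections, quadratic_form))   # True
```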

Solving the Constrained Optimization Problem (using Lagrange Multipliers):

L(û, λ) = ûᵀ · S · û − λ ( ûᵀ · û − 1 )

Taking the partial derivative with respect to û and setting it to zero, we get

∂L/∂û = 2 · S · û − 2 λ · û = 0

which leads us to:

S · û = λ · û

Does this equation remind you of something?

[Source: Google Images]

Well, this is the equation for finding the eigenvalues and eigenvectors of a matrix (in this case, the covariance matrix S).

BAM!!! The whole problem of PCA boils down to finding the eigenvalues and eigenvectors of the covariance matrix of our data after standardization.

I’ll leave the details of finding eigenvalues and eigenvectors by hand up to you.
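If you would rather compute them in code, here is a minimal NumPy sketch on hypothetical standardized data; `np.linalg.eigh` is well suited here because the covariance matrix is symmetric:

```python
import numpy as np

# Hypothetical standardized data matrix X of shape (n_samples, n_features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)

S = np.cov(X, rowvar=False)                   # covariance matrix of the standardized data

# eigh is meant for symmetric matrices like S; it returns the eigenvalues
# in ascending order and the matching eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(S)

print(eigenvalues)                            # k eigenvalues for a k x k covariance matrix
print(eigenvectors)                           # column i is the eigenvector for eigenvalues[i]
```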

Remember, for a symmetric matrix of dimension k (and our covariance matrix S is symmetric), we get k eigenvectors which are orthogonal to each other?

[Source: Google Images]

This means we’ll get our new set of features by calculating the eigenvectors of the covariance matrix of X, but the number of eigenvectors obtained will be equal to the dimension of our data.

Then how can we reduce the dimensions of our data?

For that, we need to know what the eigenvalues λ signify.

λᵢ tells us the amount of variance/information along the eigenvector uᵢ.

Let’s take some examples:

[Source: My PC]

In this example, the fraction of variance/information preserved along u₁ would be

λ₁ / (λ₁ + λ₂)

Let’s take another case:

[Source: My PC]

Here, the fraction of variance/information preserved along u₁ would be

λ₁ / (λ₁ + λ₂) = 0.6

which means 60% of the variance/information can be retained if we drop u₂.
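In code, this fraction is just each eigenvalue divided by the sum of all eigenvalues. A tiny sketch with hypothetical eigenvalues, chosen so the numbers match the 60% above:

```python
import numpy as np

eigenvalues = np.array([6.0, 4.0])                 # hypothetical λ₁ and λ₂
explained = eigenvalues / eigenvalues.sum()        # fraction of variance along each uᵢ
print(explained)                                   # [0.6 0.4] → keeping only u₁ retains 60%
```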

Generally, we are required to convert our data from d dimensions to d’ dimensions (d’ < d), or to reduce the dimensions such that x% of the variance/information is preserved.

How do we go about that?

The answer is simple.

· We’ll first calculate the eigenvalues and eigenvectors of the covariance matrix of X after standardization.

· Sort the eigenvalues in descending order (keeping each one paired with its eigenvector).

· Choose the top d’ eigenvalues and project all the points onto the corresponding eigenvectors.

· If x% is given, select the top p eigenvalues such that

( Σᵢ₌₁ᵖ λᵢ ) / ( Σᵢ₌₁ᵈ λᵢ ) ≥ x / 100

and project all the points onto the corresponding p eigenvectors, as sketched below.
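Putting all of these steps together, here is a compact sketch of the whole procedure (the function `pca_reduce` and its parameter names are illustrative choices, not a standard API):

```python
import numpy as np

def pca_reduce(X, n_components=None, variance_to_keep=None):
    """Reduce X (n_samples, d) either to n_components dimensions, or to the
    smallest number of dimensions preserving variance_to_keep (e.g. 0.95).
    Exactly one of the two arguments should be provided."""
    # 1. Column standardize
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Eigen decomposition of the covariance matrix
    S = np.cov(X_std, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(S)

    # 3. Sort eigenvalues (and their eigenvectors) in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # 4. Decide how many eigenvectors to keep
    if n_components is None:
        cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
        n_components = np.searchsorted(cumulative, variance_to_keep) + 1

    # 5. Project all points onto the top eigenvectors
    return X_std @ eigenvectors[:, :n_components]

# Usage on hypothetical data: keep enough components to preserve 95% of the variance
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_reduced = pca_reduce(X, variance_to_keep=0.95)
print(X_reduced.shape)
```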

Woah! You now know all the basics of Principal Component Analysis.

[Source: Google Images]
