Linear Dimensionality Reduction — PCA

Haneul Kim · Published in Analytics Vidhya · May 30, 2021

Table of Contents

1. Introduction

2. Mathematical derivation

3. Conclusion

Previously, a variety of dimensionality reduction techniques (DRTs) were briefly introduced. In this chapter we will go through the mathematical details of one linear dimensionality reduction technique, and probably the most popular one: Principal Component Analysis (PCA).

Usually I tend to work in a bottom-up fashion with respect to the simplicity of a machine learning project: whether it is the model itself or a preprocessing step, I start with linear functions unless I have prior information about the underlying data structure.

Introduction

The goal of PCA is to find orthogonal bases (a.k.a. principal components/eigenvectors) that preserve the largest variance of the original data. It is a feature extraction technique: instead of selecting a subset of features to reduce dimensions, it projects the data onto new bases.

Orthogonality simply means that two vectors are at 90 degrees to each other.

Basis: a basis of a vector space is a set of linearly independent vectors that spans the full space. In other words, if removing one of the vectors reduces the dimension of the space they span, then every vector in the set is a basis vector. In 2-D Euclidean space the unit vectors pointing up and to the right form a basis, since by linear combination we can reach any point in the plane, and removing one of them reduces the space we can reach to 1-D.

For demonstration purposes we will create a very simple 2-D dataset. Keep in mind that dimensionality reduction techniques do not care about y-values (the target).
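The snippet that builds this dataframe is not reproduced here; below is a minimal sketch of one way to create such a dataset (the column names x1, x2, the random seed, and the exact values are illustrative, so the numbers later in the post will not match exactly):

```python
import numpy as np
import pandas as pd

# Two strongly correlated features, so most of the variance
# lies along a single direction.
rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=50)
x2 = 2 * x1 + rng.normal(0, 1, size=50)  # roughly linear relation plus noise
df = pd.DataFrame({"x1": x1, "x2": x2})
print(df.head())
```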

The dataframe above, plotted.

In order to find the orthogonal bases that preserve maximal variance, we will follow these steps:

  1. Shift center of mass to the origin.
  2. Compute Covariance Matrix
  3. Compute Eigenvectors and Eigenvalues.
  4. Plot Eigenvectors(Principal Components)

Mathematical Derivation

Step 1 — Shift center of mass to origin

centered at origin

Notice that the center of mass of our data has been shifted to the origin. We do this because it makes our life easier when computing the covariance matrix.
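A minimal sketch of this step, assuming the illustrative df built above (subtracting each column's mean shifts the center of mass to the origin):

```python
# Center each feature by subtracting its column mean.
df_centered = df - df.mean()
print(df_centered.mean())  # both means are now (numerically) zero
```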

Covariance formula:

Cov(x1, x2) = (1/n) * Σ_i (x1_i − x̄1)(x2_i − x̄2)

By centering the data at the origin, x̄1 and x̄2 become zero, simplifying the equation to:

Covariance formula with the data centered at the origin:

Cov(x1, x2) = (1/n) * Σ_i x1_i * x2_i

Step 2 — Compute Covariance Matrix

  • The covariance matrix is a d×d matrix, where d is the number of dimensions.
  • Each element represents the covariance between two features (whether they are positively or negatively related; note that since covariance is not correlation, a larger number does not necessarily mean a stronger relationship).
  • The diagonal elements represent the variance of each feature.

Real-world data will usually contain m dimensions (an n × m matrix of n samples and m features); in our example keep in mind that m = 2.

Representing our x1, x2 as a matrix X whose columns are the centered feature vectors:

X = [x1  x2]   (an n × 2 matrix)

Multiplying the transpose of X by X and dividing by n, we get:

(1/n) X^T X =
| (1/n) Σ x1_i^2       (1/n) Σ x1_i x2_i |
| (1/n) Σ x2_i x1_i    (1/n) Σ x2_i^2    |

which can be simplified to:

Cov(X) =
| Var(x1)        Cov(x1, x2) |
| Cov(x2, x1)    Var(x2)     |

We can see that the diagonal entries are exactly the formula for variance (when x̄ = 0), and the off-diagonal entries correspond to the covariance of x1 and x2 (when their means are 0).

So we have calculated the covariance matrix and shown that its diagonal elements represent the variance of each feature, while the other elements represent the covariance between features.

Calculating Cov(X) in Python:
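The original code image is not reproduced here; a minimal NumPy sketch, assuming the centered dataframe df_centered from Step 1, might look like this:

```python
import numpy as np

X = df_centered.to_numpy()  # n x 2 matrix of centered features
n = X.shape[0]

# Covariance matrix via the formula above (dividing by n).
cov_mat = (X.T @ X) / n

# Cross-check against NumPy's built-in (bias=True also divides by n).
print(cov_mat)
print(np.cov(X, rowvar=False, bias=True))
```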

Step 3 — Compute Eigenvectors and Eigenvalues

Eigenvectors determine the directions of the new projected space, and eigenvalues determine their magnitudes, i.e. they measure the variance of the data along the corresponding directions.

After the calculation is finished, sort the eigenvectors in decreasing order of their eigenvalues. The eigenvector with the highest eigenvalue carries the most variance of the original data when the data is projected onto it, and the one with the lowest eigenvalue carries the least variance (this is the one we should drop).
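The code that produced the output below is not shown in the original; a sketch with NumPy, assuming cov_mat from Step 2, could look like this (the numbers in the output come from the author's dataset and will not match the illustrative data above):

```python
import numpy as np

# Eigendecomposition of the symmetric covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Pair each eigenvector (a column of eig_vecs) with its eigenvalue
# and sort the pairs in decreasing order of eigenvalue.
eig_pairs = sorted(
    [(eig_vecs[:, i], eig_vals[i]) for i in range(len(eig_vals))],
    key=lambda pair: pair[1],
    reverse=True,
)
print(eig_pairs)
```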

[(array([-0.31091055, -0.95043918]), 79.42349759589935),
(array([-0.95043918, 0.31091055]), 0.30812852991522277)]

And if you want to know how much variance projecting onto the first eigenvector (PC1) preserves (a.k.a. the explained variance ratio), divide its eigenvalue by the sum of all eigenvalues:

So in our case, PC1 preserves 79.42 / (79.42 + 0.31) ≈ 0.996, i.e. about 99.6% of the variance.

There is no rule for how many PCs you must choose, so a common practice is to plot an elbow (scree) plot and pick the k that corresponds to the elbow. Another option is to select k so that it preserves the amount of variance you wish to keep. This is easy to do with sklearn's PCA: instead of giving an integer, pass a value between 0 and 1, e.g. PCA(n_components=0.95) keeps the number of components needed to preserve 95% of the variance.
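For reference, a short sketch of that scikit-learn usage, fitted on the illustrative df from above:

```python
from sklearn.decomposition import PCA

# Keep as many components as needed to preserve 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(df)

print(pca.n_components_)              # number of components actually kept
print(pca.explained_variance_ratio_)  # variance preserved by each component
```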

Step 4 — Plot Eigenvectors (Principal Components)

The projection matrix is simply the k chosen eigenvectors stacked together. Let's plot PC1 and PC2 to see how much variance is preserved when we project the data onto them.

The black lines represent the principal components. The longer one is PC1, which runs right along the linear trend of the data; if the data is projected onto it, most of the variance will be preserved. So we can project all points onto PC1 and drop PC2, and we have just reduced our data from 2-D to 1-D!
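To make the projection concrete, here is a sketch (assuming eig_pairs from Step 3 and df_centered from Step 1) that projects the centered data onto PC1 only:

```python
import numpy as np

# Projection matrix: the top-k eigenvectors as columns (here k = 1, i.e. PC1).
W = np.array([eig_pairs[0][0]]).T  # shape (2, 1)

# Project the centered n x 2 data onto PC1: the result is n x 1, i.e. 1-D.
X_projected = df_centered.to_numpy() @ W
print(X_projected.shape)
```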

Conclusion

The main objective of this tutorial was to build mathematical intuition about PCA and how it works in the background, since understanding how it operates behind the scenes will give you more insight when debugging your ML models. 2-D data was used for simplicity, but the same concept applies to any number of dimensions. The next tutorial will cover why PCA is not suitable for non-linearly separable datasets and introduce a non-linear dimensionality reduction technique called ISOMAP.

Thanks for reading, and please comment if there is any incorrect information; I would love to fix my misunderstandings :)
