Linear Dimensionality Reduction — PCA
Table of Contents
1. Introduction
2. Mathematical derivation
3. Conclusion
Previously, a variety of dimensionality reduction techniques (DRTs) were briefly introduced. In this chapter we will work through the mathematical details of one linear dimensionality reduction technique, and probably the most popular one: Principal Component Analysis (PCA).
I usually work bottom-up in terms of simplicity in a machine learning project: whether it is the model itself or a preprocessing step, I start with linear methods unless I have prior information about the underlying data structure.
Introduction
The goal of PCA is to find orthogonal basis vectors (a.k.a. principal components/eigenvectors) that preserve the largest variance of the original data. It is a feature extraction technique: instead of selecting a subset of features to reduce dimensions, it projects the data onto a different basis.
Orthogonality simply means that two vectors are at 90 degrees to each other, i.e. their dot product is zero.
Basis: a basis of a vector space is a set of linearly independent vectors that span the full space. In other words, if removing a vector from the set reduces the dimension of the space it spans, that vector is a basis vector. In 2-d Euclidean space, the unit vectors pointing up and to the right form a basis: by taking linear combinations of them we can reach any point in the plane, and removing either one reduces the spanned space to 1-d.
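As a quick illustration (a standard example, not specific to our dataset), the unit vectors pointing right and up are orthogonal and form a basis of the plane:

$$e_1 = \begin{pmatrix}1\\0\end{pmatrix},\quad e_2 = \begin{pmatrix}0\\1\end{pmatrix},\qquad e_1 \cdot e_2 = 1\cdot 0 + 0\cdot 1 = 0,\qquad \begin{pmatrix}x\\y\end{pmatrix} = x\,e_1 + y\,e_2$$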
For demonstration purposes we will create a very simple 2-d dataset. Keep in mind that dimensionality reduction techniques do not care about the y-values (target).
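A minimal sketch of such a dataset (the exact values used in the original are not shown, so the numbers below are illustrative); the variable name `X` is assumed in the later snippets:

```python
import numpy as np

# Illustrative 2-d dataset: two strongly correlated features,
# so most of the variance lies along a single direction.
rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=50)
x2 = 0.8 * x1 + rng.normal(scale=1.0, size=50)  # x2 roughly follows x1
X = np.column_stack([x1, x2])                   # shape (50, 2)
```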
To find the orthogonal basis that preserves the maximal variance, we will follow these steps:
- Shift the center of mass to the origin.
- Compute the covariance matrix.
- Compute the eigenvectors and eigenvalues.
- Plot the eigenvectors (principal components).
Mathematical Derivation
Step 1 — Shift center of mass to origin
Notice that the center of mass of the data has been shifted to the origin. We do this because it makes our life easier when computing the covariance matrix.
By centering the data at the origin, the feature means become zero, which simplifies the covariance formula

$$\mathrm{cov}(x_1, x_2) = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i1}-\bar{x}_1\right)\left(x_{i2}-\bar{x}_2\right)$$

to:

$$\mathrm{cov}(x_1, x_2) = \frac{1}{n-1}\sum_{i=1}^{n} x_{i1}\, x_{i2}$$
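In code, centering is just subtracting the per-feature mean (a minimal numpy sketch, assuming the dataset is stored in `X` as above):

```python
# Center the data: subtract each feature's mean so the center of mass
# sits at the origin and the feature means become zero.
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # approximately [0, 0]
```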
Step 2 — Compute Covariance Matrix
- The covariance matrix is a d×d matrix, where d is the number of dimensions.
- Each element represents the covariance between a pair of features (whether they vary together positively or negatively); note that since covariance is not correlation, a larger number does not necessarily mean a stronger relationship.
- The diagonal elements are the variances of the individual features.
Real-world data usually comes as an n × d matrix with d features (columns), as below; in our example, keep in mind that d = 2.

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$
Representing our centered features x1 and x2 as a matrix:

$$X = \begin{pmatrix} | & | \\ x_1 & x_2 \\ | & | \end{pmatrix} \in \mathbb{R}^{n \times 2}$$

and multiplying the transpose of X by X (scaled by 1/(n-1)) we get:

$$\frac{1}{n-1} X^{\top} X = \frac{1}{n-1}\begin{pmatrix} \sum_{i} x_{i1}^{2} & \sum_{i} x_{i1} x_{i2} \\ \sum_{i} x_{i2} x_{i1} & \sum_{i} x_{i2}^{2} \end{pmatrix}$$

which can be simplified to:

$$C = \begin{pmatrix} \mathrm{var}(x_1) & \mathrm{cov}(x_1, x_2) \\ \mathrm{cov}(x_2, x_1) & \mathrm{var}(x_2) \end{pmatrix}$$

We can see that the diagonal entries are the formula for variance (when x_bar = 0), and the off-diagonal entries correspond to the covariance of x1 and x2 (when their means are 0).
So we have calculated the covariance matrix and shown that its diagonal elements are the variances of the features, while the off-diagonal elements are the covariances between features.
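A minimal numpy sketch of this computation, assuming the centered data is in `X_centered` from the earlier snippet (np.cov is used only as a cross-check):

```python
n = X_centered.shape[0]

# Covariance matrix from the definition: (1/(n-1)) * X^T X on centered data.
cov_matrix = X_centered.T @ X_centered / (n - 1)

# Cross-check against numpy's built-in estimator (rowvar=False: columns are features).
assert np.allclose(cov_matrix, np.cov(X_centered, rowvar=False))
print(cov_matrix)
```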
Step 3 — Compute Eigenvectors and Eigenvalues
The eigenvectors of the covariance matrix determine the directions of the new projected space, and the eigenvalues determine their magnitudes, i.e. how much of the data's variance lies along each of those directions.
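Concretely, an eigenvector v and its eigenvalue λ of the covariance matrix C satisfy

$$C\,v = \lambda\,v, \qquad v \neq 0$$

and because C is symmetric, its eigenvectors are mutually orthogonal, which is exactly why the principal components form an orthogonal basis.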
After the calculation is finished, sort the eigenvectors in decreasing order of their eigenvalues. The eigenvectors with the highest eigenvalues carry most of the original data's variance when the data is projected onto them, and those with the lowest eigenvalues carry the least variance (these are the ones we should drop).
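A minimal sketch of this step with numpy, assuming `cov_matrix` from the previous snippet; the eigenvector/eigenvalue pairs listed below were presumably produced by code along these lines (np.linalg.eigh is chosen here because the covariance matrix is symmetric):

```python
# Eigendecomposition of the symmetric covariance matrix.
eig_values, eig_vectors = np.linalg.eigh(cov_matrix)

# Pair each eigenvector (a column of eig_vectors) with its eigenvalue
# and sort the pairs by eigenvalue in decreasing order.
eig_pairs = [(eig_vectors[:, i], eig_values[i]) for i in range(len(eig_values))]
eig_pairs.sort(key=lambda pair: pair[1], reverse=True)
print(eig_pairs)
```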
[(array([-0.31091055, -0.95043918]), 79.42349759589935),
(array([-0.95043918, 0.31091055]), 0.30812852991522277)]
And if you want to know how much variance projecting onto the 1st eigenvector (PC1) preserves (a.k.a. the explained variance), divide its eigenvalue by the sum of all eigenvalues:

$$\text{explained variance of PC}_k = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i}$$

So in our case PC1 preserves 79.42 / (79.42 + 0.31) ≈ 0.996, i.e. 99.6% of the variance.
There is no hard rule for how many PCs you must choose. A common practice is to draw an elbow (scree) plot and pick the k that corresponds to the elbow. Another option is to select k so that the chosen components preserve the amount of variance you wish to keep. This is easy to do with sklearn's PCA: instead of an integer, pass a value between 0 and 1, e.g. PCA(n_components=0.95) keeps as many components as are needed to preserve 95% of the variance.
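A short sketch of both options with scikit-learn, assuming `X` is the dataset defined earlier (the printed values are illustrative):

```python
from sklearn.decomposition import PCA

# Fit PCA with all components and inspect the explained variance ratios
# (useful for drawing an elbow/scree plot).
pca_full = PCA()
pca_full.fit(X)
print(pca_full.explained_variance_ratio_)   # e.g. [0.996..., 0.003...]

# Alternatively, let sklearn choose k so that 95% of the variance is preserved.
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(X_reduced.shape)                       # (n_samples, k)
```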
Step 4 — Plot Eigenvectors (Principal Components)
The projection matrix is simply the k chosen eigenvectors stacked together. Let's plot PC1 and PC2 over the data to see how much variance would be preserved if we projected the data onto them.
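A minimal matplotlib sketch of such a plot, assuming `X_centered` and the sorted `eig_pairs` from the earlier snippets; scaling each vector by the square root of its eigenvalue is a cosmetic choice so the arrow lengths reflect the spread along each direction:

```python
import matplotlib.pyplot as plt

plt.scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.5)

# Draw each principal component as an arrow from the origin,
# scaled by the standard deviation along that direction.
for vector, value in eig_pairs:
    plt.quiver(0, 0, *(vector * np.sqrt(value)),
               angles="xy", scale_units="xy", scale=1, color="black")

plt.axis("equal")
plt.title("Principal components of the centered data")
plt.show()
```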
The black lines represent the principal components. The longer one is PC1, which runs right along the linear trend of the data; if the data is projected onto it, most of the variance is preserved. So we can project all points onto PC1 and drop PC2, and we have just reduced our data from 2-d to 1-d!
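A minimal sketch of that final projection, again assuming the names from the previous snippets:

```python
# Projection matrix: the top-k eigenvectors as columns (here k = 1, just PC1).
W = np.column_stack([eig_pairs[0][0]])     # shape (2, 1)

# Project the centered 2-d data onto PC1 -> a 1-d representation.
X_projected = X_centered @ W               # shape (n_samples, 1)
print(X_projected.shape)
```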
Conclusion
The main objective of this tutorial was to build mathematical intuition about PCA and how it works in the background, since understanding what happens behind the scenes gives you more insight when debugging your ML model. 2-d data was used for simplicity, but the same concept applies to any number of dimensions. The next tutorial will cover why PCA is not suitable for non-linearly separable datasets and introduce an alternative non-linear dimensionality reduction technique called ISOMAP.
Thanks for reading, and please comment if there is any incorrect information; I would love to fix my misunderstandings :)
References:
- Span, basis vector — 3blue1brown
- Covariance Matrix — Ben Lambert
- Covariance Matrix — Luis Serrano
- The Mathematics Behind PCA — Towards Data Science
- Dimensionality Reduction PCA — Korea University course (In Korean)
- Principal Component Analysis in 3 simple steps — Sebastian Raschka
- Python Data Science handbook PCA — Jake VanderPlas