Brief Overview of PCA and Implementation of Same Using Numpy

Dhiraj Mishra · Published in The Startup · 7 min read · Sep 15, 2020

This is my first article. I hope you guys like it!

In this article, you will learn about an unsupervised machine learning technique — Principal Component Analysis (PCA). PCA is widely used to reduce high-dimensional datasets to lower-dimensional representations.

Table of Contents:

  1. Motivation
  2. What is Principal Component Analysis?
  3. Building Blocks of PCA
  4. Illustration — working of PCA
  5. PCA — Algorithm
  6. Checking it using scikit-learn pca function
  7. But how?
  8. Shortcomings of PCA
  9. Summary

Motivation:

Situation 1: A logistic regression setting where you have a lot of correlated variables (high multicollinearity). How do you handle this?

— One way would be to do variable selection (step-wise/forward/backward).

— But each time you drop a variable, aren’t you losing some information?

There must be a better way of doing this!

Situation 2: You’re doing EDA on a dataset with n records and p variables. You want to visualize this dataset.

— You could look at pairwise scatter plots.

— You’ll need to look at p*(p-1)/2 plots.

— But even if p = 20, this would mean 190 plots!

Again, there must be a better way of doing this!

What is Principal Component Analysis (PCA)?

PCA is a statistical procedure that converts observations of possibly correlated variables into principal components that:
1. are weighted linear combinations of the original variables,
2. are perpendicular to (independent of) each other, and
3. capture the maximum variance in the data and are ordered by the variance they explain.

PCA is an unsupervised technique: there is no ‘Y’ or dependent/response variable.

A very powerful technique, Principal Component Analysis, has several use cases:

  1. Dimensionality reduction.
  2. Data visualization and Exploratory Data Analysis.
  3. Creating uncorrelated features/variables that can be fed into a prediction model.
  4. Uncovering latent variables/themes/concepts.
  5. Noise reduction in the dataset.

Building Blocks of PCA:

  1. Basis vectors: If you have a set of linearly independent vectors such that any other vector in the space can be represented as a linear combination of them, that set is a basis for the space. For example, the vectors (2,3) and (3,4) can represent any other vector in 2-D as a linear combination of themselves, so they form a basis for the 2-D plane (see the short sketch after this list).
  2. Basis transformation: Basis transformation is the process of converting your data from one basis to another, i.e. representing it in new columns different from the original ones, often for convenience or efficiency. Watching a video of a 3-D scene on a 2-D screen is an everyday example of a basis transformation.
  3. Variance as information: If two variables are very highly correlated, together they add little more information than either does individually, so you can drop one of them. Variance = information!
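Before we move on, here is a tiny NumPy sketch of the basis-vector idea from point 1. The target vector (7, 9) is just an arbitrary example I picked for illustration; the code finds the weights that express it as a linear combination of (2,3) and (3,4).

import numpy as np
# Columns of B are the candidate basis vectors (2, 3) and (3, 4)
B = np.array([[2, 3],
              [3, 4]], dtype=float)
# An arbitrary target vector in 2-D
v = np.array([7, 9], dtype=float)
# B is invertible (its columns are linearly independent),
# so we can solve B @ w = v for the weights w
w = np.linalg.solve(B, v)
w
>> array([-1.,  3.])
# Check: -1*(2, 3) + 3*(3, 4) = (7, 9)
B @ w
>> array([7., 9.])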

With this, we have covered the 3 building blocks needed to understand PCA.

Illustration — working of PCA

Illustration — finding the principal components

X1 and X2 are correlated, but not perfectly correlated.

Objective: to find a direction/line on which the projected data has maximum variance. In other words, the variance in the data points should also show up in their projections.

We have several (infinite, actually) options here.

We saw that a purely horizontal or vertical axis will not suffice, as neither captures the variation in both directions. We therefore need a line that is angled.

One such line is shown in the diagram below: it is the line that is closest to the data.

  • Projections on this line will retain maximum variation in the original data points.
  • Note: in fact, PCA can also be viewed as finding the lines/planes/surfaces closest to the data points.

This line is our first Principal Component!

We still have some variance left over, in the direction perpendicular to our first PC. That direction is our 2nd principal component!
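To make the "maximum variance of the projections" idea concrete, here is a small NumPy sketch, using the same toy dataset as the implementation below, that projects the centered points onto a few candidate directions and compares the variance of the projections:

import numpy as np
# Same toy dataset used in the implementation below
X = np.array([[0, 0], [1, 2], [2, 3], [3, 6], [4, 8], [5, 9]], dtype=float)
Xc = X - X.mean(axis=0)                            # center the data
# Candidate directions: horizontal, vertical, 45 degrees, and ~62.4 degrees
# (the last one is roughly the first principal component found below)
for angle in [0, 90, 45, 62.4]:
    theta = np.deg2rad(angle)
    d = np.array([np.cos(theta), np.sin(theta)])   # unit direction vector
    proj = Xc @ d                                  # scalar projections onto d
    print(f"{angle:5.1f} deg -> variance of projections = {proj.var(ddof=1):.2f}")
# prints roughly 3.50, 12.67, 14.68 and 16.12; the angled line captures the most variance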

PCA — Algorithm

In this segment, you’ll learn about the algorithm through which PCA works. Originally, PCA used the eigendecomposition route to find the principal components, but much faster algorithms such as SVD (singular value decomposition) have since come up and are predominantly used nowadays. One thing to note, however, is that SVD is a generalization of eigendecomposition, so the two share some key similarities.

The steps involved in the eigendecomposition algorithm are as follows:

  1. From the original data matrix, compute its covariance matrix C. (You can read about the covariance matrix here.)
  2. Perform the eigendecomposition of the covariance matrix to find its eigenvalues and eigenvectors.
  3. Sort the eigenvectors in descending order of their eigenvalues.
  4. These eigenvectors are the principal components of the original matrix.
  5. The eigenvalues denote the amount of variance explained by their eigenvectors: the higher the eigenvalue, the higher the variance explained by the corresponding eigenvector.
  6. These eigenvectors are orthonormal, i.e. they are unit vectors and are perpendicular to each other.

Step 1: Initializing array

import pandas as pd
import numpy as np
# Let's take this dataset
a = [[0,0],[1,2],[2,3],[3,6],[4,8],[5,9]]
b = ['X','Y']
dat = pd.DataFrame(a,columns = b)
dat

Step 2: Calculating covariance, eigenvalues and eigenvectors

#Let's create the covariance matrix here.
# An intuitive reason as to why we're doing this is to capture the variance of the entire dataset
C = np.cov(dat.T)
eigenvalues, eigenvectors = np.linalg.eig(C)
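
As a quick sanity check, the same covariance matrix can be built by hand by centering each column and computing Xc.T @ Xc / (n - 1); the transpose in np.cov(dat.T) is there because np.cov expects each variable in a row. Continuing from the snippet above:

# Sanity check: build the covariance matrix manually and compare with np.cov
X = dat.values.astype(float)
Xc = X - X.mean(axis=0)              # center each column
C_manual = Xc.T @ Xc / (len(Xc) - 1)
C_manual
>> array([[ 3.5       ,  6.6       ],
          [ 6.6       , 12.66666667]])
np.allclose(C_manual, C)
>> True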

Step 3: Sorting values

# Let's sort them now
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Let's check them again
eigenvalues
>> array([16.11868923, 0.04797743])
eigenvectors
>> array([[-0.46346747, -0.88611393],
          [-0.88611393,  0.46346747]])

Step 4: These eigenvectors are the principal components of the original matrix

Note that the columns of the eigenvector matrix correspond to the rows of the pca.components_ matrix. Also, the direction of the second axis is reversed; this doesn’t make a difference, because even though the two vectors are antiparallel, they span the same line and hence the same 2-D space. For example, X/Y and X/-Y both cover the entire 2-D plane.
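
A quick way to verify this, once pca has been fitted in the scikit-learn section below, is to compare the two matrices up to sign:

# Columns of `eigenvectors` match rows of pca.components_ up to a sign flip
np.allclose(np.abs(eigenvectors.T), np.abs(pca.components_))
>> True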

Step 5: Check pca.explained_variance_ratio_ (in the scikit-learn section below); it matches the eigenvalues once they are normalized by their sum.
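
Concretely, dividing each eigenvalue by the sum of all eigenvalues gives the fraction of variance explained per component, which is exactly what scikit-learn reports:

# Normalized eigenvalues = fraction of variance explained by each component
eigenvalues / eigenvalues.sum()
>> array([0.99703232, 0.00296768])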

Step 6: The principal components are orthonormal, i.e. they are unit vectors and perpendicular to each other, so their dot product is zero (using the pca object fitted in the scikit-learn section below):

np.dot(pca.components_[0],pca.components_[1])
>> 0.0

Checking it using scikit-learn pca function

from sklearn.decomposition import PCA
pca = PCA(random_state=42)
pca.fit(dat)
#Let's check the components
pca.components_
>> array([[-0.46346747, -0.88611393],[ 0.88611393, -0.46346747]])
# Let's check the variance explained
pca.explained_variance_ratio_
>> array([0.99703232, 0.00296768])
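
As a quick follow-up, the fitted object can also project the data onto the principal components; keeping just the first column gives a 1-D summary of the 2-D data while retaining ~99.7% of the variance:

# Express the data in the principal-component basis
scores = pca.transform(dat)
scores.shape
>> (6, 2)
# The first column alone is a 1-D summary of the data;
# its sample variance equals the first eigenvalue (~16.12)
scores[:, 0].var(ddof=1)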

But how?

Because the spectral theorem exists! Because of this theorem, the eigendecomposition of the covariance matrix will always:

  1. Yield eigenvectors that are perpendicular to each other.
  2. Allocate the maximum variance to them in an ordered way, according to the magnitude of the eigenvalues.

Key points of the spectral theorem:

  1. When you perform the eigendecomposition of the covariance matrix, the corresponding eigenvectors are the principal components of the original matrix.
  2. These eigenvectors are orthonormal to each other, so they satisfy the requirement that principal components be perpendicular to one another.
  3. They also come in an ordered fashion: the eigenvalues dictate the variance explained by each principal component, so ordering the eigenvectors by their eigenvalues gives the resultant principal-component matrix.
  4. These eigenvectors are also linear combinations of the original variables.
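
These properties are easy to verify numerically on our toy example:

# The covariance matrix is symmetric, so the spectral theorem applies
np.allclose(C, C.T)
>> True
# The eigenvector matrix is orthonormal: V.T @ V is the identity
np.allclose(eigenvectors.T @ eigenvectors, np.eye(2))
>> True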
Working of PCA

Shortcomings of PCA:

Below are some important shortcomings of PCA:

  • PCA is limited to linear transformations; for non-linear structure, you can use techniques such as t-SNE instead.
  • PCA requires the components to be perpendicular, which in some cases may not be the best choice. An alternative technique is Independent Component Analysis (ICA).
  • PCA assumes that columns with low variance are not useful, which might not be true in prediction setups (especially classification problems with class imbalance).

Summary:

To wrap up, here are some important points to remember while using PCA:

  1. Most software packages use SVD to compute the principal components and assume that the data is scaled and centered, so it is important to standardize/normalize the data first (see the sketch after this list).
  2. PCA is a linear transformation method and works well in tandem with linear models such as linear regression and logistic regression, though it can also be used for computational efficiency with non-linear models.
  3. It should not be used to force dimensionality reduction when the features are not correlated.
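
To make point 1 concrete, here is a minimal sketch of the standardize-then-PCA workflow on the toy dat DataFrame from above, using scikit-learn's StandardScaler (the pipeline shown is just one way to do it):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Standardize each column (zero mean, unit variance) before applying PCA
pipe = make_pipeline(StandardScaler(), PCA(n_components=1, random_state=42))
reduced = pipe.fit_transform(dat)    # 1-D representation of the toy dataset
reduced.shape
>> (6, 1)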
