Understanding Principal Component Analysis (PCA) step by step.

Gursewak Singh
Published in Analytics Vidhya · Jan 7, 2020 · 4 min read

Introduction

Principal component analysis (PCA) is a statistical procedure used to reduce the dimensionality of a dataset. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Steps Involved in the PCA

Step 1: Standardize the dataset.

Step 2: Calculate the covariance matrix for the features in the dataset.

Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.

Step 4: Sort eigenvalues and their corresponding eigenvectors.

Step 5: Pick k eigenvalues and form a matrix of eigenvectors.

Step 6: Transform the original matrix.

Let's go through each step one by one.

1. Standardize the Dataset

Assume we have the dataset below, which has 4 features and a total of 5 training examples.

Dataset matrix

First, we need to standardize the dataset and for that, we need to calculate the mean and standard deviation for each feature.

Standardization formula: z = (x - mean) / standard deviation
Mean and standard deviation before standardization

After applying the formula to each feature, the dataset is transformed as below:

Standardized Dataset
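As a rough sketch of this step in NumPy (the values below are placeholders, not the exact matrix from the figure above), standardization looks like this:

```python
import numpy as np

# Placeholder 5 x 4 dataset: 5 training examples (rows), 4 features (columns).
# These are not the article's exact values, which live in the figure above.
X = np.array([[ 2.0, 8.0, 1.0, 4.0],
              [ 4.0, 6.0, 3.0, 7.0],
              [ 6.0, 1.0, 2.0, 3.0],
              [ 8.0, 3.0, 7.0, 1.0],
              [10.0, 5.0, 9.0, 5.0]])

# Standardize each feature: subtract its mean and divide by its standard deviation.
# ddof=1 uses the sample standard deviation, which is consistent with the
# standardized values quoted later in the article.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_std.round(6))
```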

2. Calculate the covariance matrix for the whole dataset

The formula to calculate the covariance matrix:

Covariance formula: cov(X, Y) = Σ (xᵢ - x̄)(yᵢ - ȳ) / n

The covariance matrix for the given dataset is calculated as below.

Since we have standardized the dataset, the mean of each feature is 0 and the (sample) standard deviation is 1. Note that the covariances below use the population formula (dividing by n = 5), which is why var(f1) comes out as 0.8 rather than exactly 1.

var(f1) = ((-1.0-0)² + (0.33-0)² + (-1.0-0)² + (0.33-0)² + (1.33-0)²)/5
var(f1) = 0.8

cov(f1,f2) =
((-1.0-0)*(-0.632456-0) +
(0.33-0)*(1.264911-0) +
(-1.0-0)*(0.632456-0) +
(0.33-0)*(0.000000-0) +
(1.33-0)*(-1.264911-0))/5
cov(f1,f2) = -0.25298

In a similar way we can calculate the other covariances, which results in the covariance matrix below.

covariance matrix (population formula)
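A minimal NumPy sketch of this step, reusing the placeholder data from the previous snippet; bias=True reproduces the population (divide-by-n) formula used here:

```python
import numpy as np

# Standardize the placeholder dataset exactly as in the previous sketch.
X = np.array([[ 2.0, 8.0, 1.0, 4.0],
              [ 4.0, 6.0, 3.0, 7.0],
              [ 6.0, 1.0, 2.0, 3.0],
              [ 8.0, 3.0, 7.0, 1.0],
              [10.0, 5.0, 9.0, 5.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# rowvar=False: rows are observations, columns are features.
# bias=True divides by n, i.e. the population formula used in the article.
cov_matrix = np.cov(X_std, rowvar=False, bias=True)
print(cov_matrix)
```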

3. Calculate eigenvalues and eigenvectors.

An eigenvector is a nonzero vector that changes at most by a scalar factor when a linear transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled.

Let A be a square matrix (in our case the covariance matrix), ν a vector, and λ a scalar that satisfies Aν = λν; then λ is called the eigenvalue associated with the eigenvector ν of A.
Rearranging the above equation:

Aν - λν = 0 ; (A - λI)ν = 0

Since we already know that ν is a non-zero vector, the only way this equation can hold is if

det(A - λI) = 0, i.e. |A - λI| = 0

Solving this equation for λ gives the eigenvalues:

λ = 2.51579324, 1.0652885, 0.39388704, 0.02503121

Eigenvectors:

Solving the equation (A - λI)ν = 0 for the vector ν with each value of λ:

For λ = 2.51579324, solving the above equation using Cramer's rule, the values of the vector ν are:

v1 = 0.16195986
v2 = -0.52404813
v3 = -0.58589647
v4 = -0.59654663

Going by the same approach, we can calculate the eigenvectors for the other eigenvalues, and we can form a matrix from these eigenvectors.

eigenvectors(4 * 4 matrix)
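In code, this whole step is one call. A sketch, again on the placeholder data (np.linalg.eigh is designed for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order):

```python
import numpy as np

# Covariance matrix of the standardized placeholder data, as in the previous sketches.
X = np.array([[ 2.0, 8.0, 1.0, 4.0],
              [ 4.0, 6.0, 3.0, 7.0],
              [ 6.0, 1.0, 2.0, 3.0],
              [ 8.0, 3.0, 7.0, 1.0],
              [10.0, 5.0, 9.0, 5.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_matrix = np.cov(X_std, rowvar=False, bias=True)

# eigh returns real eigenvalues in ascending order, one eigenvector per column.
eig_values, eig_vectors = np.linalg.eigh(cov_matrix)
print(eig_values)
print(eig_vectors)
```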

4. Sort eigenvalues and their corresponding eigenvectors.

Since the eigenvalues are already sorted in descending order in this case, there is no need to sort them again.
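If they were not sorted (np.linalg.eigh, for example, returns them in ascending order), a small sketch of the sort, using rounded stand-ins for the eigenvalues above, would be:

```python
import numpy as np

# Stand-in values so the snippet runs on its own: the eigenvalues are rounded from
# the article, in the ascending order np.linalg.eigh would return them; the identity
# matrix stands in for the real 4 x 4 eigenvector matrix.
eig_values = np.array([0.025, 0.394, 1.065, 2.516])
eig_vectors = np.eye(4)

order = np.argsort(eig_values)[::-1]   # indices that sort the eigenvalues, largest first
eig_values = eig_values[order]
eig_vectors = eig_vectors[:, order]    # reorder columns so each still matches its eigenvalue
print(eig_values)
```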

5. Pick k eigenvalues and form a matrix of eigenvectors

If we choose the top 2 eigenvectors, the matrix will look like this:

Top 2 eigenvectors(4*2 matrix)
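Picking the top k = 2 eigenvectors is just a column slice; a short sketch, with a stand-in eigenvector matrix so it runs on its own:

```python
import numpy as np

# Stand-in 4 x 4 eigenvector matrix, already sorted by descending eigenvalue;
# the real matrix comes from the previous sketches / the figure above.
eig_vectors = np.eye(4)

k = 2
projection_matrix = eig_vectors[:, :k]   # 4 x 2 matrix of the top-k eigenvectors
print(projection_matrix.shape)           # (4, 2)
```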

6. Transform the original matrix.

Standardized feature matrix (5 x 4) * top k eigenvectors (4 x 2) = Transformed data (5 x 2)

Data Transformation
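Putting all six steps together on the placeholder data from the earlier sketches, the final projection is a single matrix product:

```python
import numpy as np

# Placeholder 5 x 4 dataset from the earlier sketches (not the article's exact values).
X = np.array([[ 2.0, 8.0, 1.0, 4.0],
              [ 4.0, 6.0, 3.0, 7.0],
              [ 6.0, 1.0, 2.0, 3.0],
              [ 8.0, 3.0, 7.0, 1.0],
              [10.0, 5.0, 9.0, 5.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # step 1: standardize
cov_matrix = np.cov(X_std, rowvar=False, bias=True)    # step 2: covariance matrix
eig_values, eig_vectors = np.linalg.eigh(cov_matrix)   # step 3: eigen-decomposition
order = np.argsort(eig_values)[::-1]                   # step 4: sort, largest eigenvalue first
eig_vectors = eig_vectors[:, order]
projection_matrix = eig_vectors[:, :2]                 # step 5: top-2 eigenvectors (4 x 2)
X_pca = X_std @ projection_matrix                      # step 6: (5 x 4) @ (4 x 2) -> (5 x 2)
print(X_pca)
```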

Compare with sklearn library

code snippet for PCA using Sklearn
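The original snippet is in the screenshot above; a sketch of an equivalent check with scikit-learn, on the same placeholder data, would look like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# Same placeholder 5 x 4 dataset as in the NumPy sketches above.
X = np.array([[ 2.0, 8.0, 1.0, 4.0],
              [ 4.0, 6.0, 3.0, 7.0],
              [ 6.0, 1.0, 2.0, 3.0],
              [ 8.0, 3.0, 7.0, 1.0],
              [10.0, 5.0, 9.0, 5.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)      # 5 x 2 projected data (signs of columns may flip)
print(X_pca)
print(pca.explained_variance_)        # uses the sample (n-1) formula, not the
                                      # population formula applied by hand above
```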

The results are the same; the only change is in the direction (sign) of PC1, which makes no practical difference, as mentioned here as well. So we have successfully converted our data from 4 dimensions to 2 dimensions. PCA is most useful when the data features are highly correlated.

Since this is my first blog post, I am open to suggestions, and please do check out the code version of the above on my GitHub.
