Principal Component Analysis (PCA) — Detailed Explanation

Shilpa More · Analytics Vidhya · Nov 26, 2019

Introduction:

PCA is a basic dimensionality reduction technique that helps us reduce the number of dimensions in a dataset and exclude unwanted features. PCA is especially important when a dataset has hundreds of dimensions: we can use it to extract the features that carry the most information. And since we cannot visualize more than 3 dimensions, we can use PCA to reduce the dimensions in order to visualize the data.

In this blog we will cover the following topics -

  1. Basics of PCA.
  2. High level Maths.
  3. Implementation using Python.

What is a Principal Component?

PCA is a mathematical tool that reduces a huge number of dimensions/correlated variables/features of a dataset to a smaller number of uncorrelated variables. In this process, the features that do not explain the data well enough are discarded, and the directions that explain the data the most are kept; these kept directions are called the Principal Components.

How does PCA reduce the dimensions?

Let's look at some examples of how PCA decides which features to keep.

Example 1 - Assume there is a 2D dataset X having 2 features (F1, F2) and we want to make it 1D.

In the above image, 'x' represents the data points plotted in a 2D plot. F1 and F2 are the features in the dataset (X). The orange selection shows the spread of the data points on F1, and the red selection shows the spread of the data points on F2.

We can observe that the spread/variance of the data points on F1 is greater than the spread/variance on F2, which means F1 explains the dataset X better than F2.

Our task is to make the given dataset 1D, so the feature we can exclude is F2, as it carries less information about the dataset.
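To make this concrete, here is a minimal sketch (using made-up toy values, not data from this article) that compares the variance of the two features and keeps the one with the larger spread:

import numpy as np

# hypothetical 2D dataset: rows are points, columns are the features F1, F2
X = np.array([[2.0, 0.1],
              [4.0, 0.3],
              [6.0, 0.2],
              [8.0, 0.4]])

# variance (spread) of each feature, computed column-wise
variances = X.var(axis=0)
print("variance of F1 and F2:", variances)

# keep the feature with the larger variance to obtain a 1D representation
keep = int(np.argmax(variances))
X_1d = X[:, keep]
print("kept feature index:", keep, "reduced data:", X_1d)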

Example 2 — Assume that the dataset (X) has 2 features (F1, F2) and we have to reduce the dimension of the given dataset to 1D.

The spread of the data points is shown in the plot below,

In the above image, 'x' represents the data points plotted in a 2D plot. F1 and F2 are the features in the dataset (X).

In this example the spread of the data on both features is nearly the same. Here PCA finds a new direction F2' on which the data points have more spread, as shown in the diagram below,

Now F2' has more spread/variance than F1'. F1' is obtained by rotating F1 by an angle Ѳ (theta), and F2' is obtained by rotating F2 by the same angle Ѳ (theta).

Since F1' has less spread and F2' has more spread, we can exclude F1', project all the data points onto F2', and use F2' to visualize the data in 1D.

From the above example we can say that,
1. We rotate the axes to find F2', the direction with maximal variance (sketched below).
2. We drop F1'.
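A rough sketch of that rotation idea, assuming some hypothetical correlated data: we sweep the angle Ѳ and pick the direction along which the projected points have the largest variance.

import numpy as np

# hypothetical correlated data: F2 is roughly F1 plus some noise
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
X = np.column_stack((f1, f1 + 0.3 * rng.normal(size=200)))

# try many rotation angles; for each, project the points onto the
# unit vector at that angle and measure the variance of the projections
angles = np.linspace(0, np.pi, 180)
variances = [np.var(X @ np.array([np.cos(t), np.sin(t)])) for t in angles]

best = angles[int(np.argmax(variances))]
print("maximal-variance direction (degrees):", np.degrees(best))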

Mathematical Representation of PCA -

We do not need F2' as a whole; its direction is sufficient to project all the data points onto it. So let's call F2' as u1, a unit vector pointing in the direction of F2'.

So u1 is the direction on which, if we project each point, the spread/variance will be maximal, and its length is 1, i.e.

||u1|| = 1

In the above diagram we have plotted one of the data points, called xi, and projected it onto our newly found direction u1 as xi'.

Since the length of the unit vector u1 (||u1||) is 1, the projection of xi onto u1 is given simply by the dot product u1 · xi: the projected point is xi' = (u1 · xi) u1, and its position along u1 is the scalar u1ᵀ xi. Now, for every point xi, we can calculate xi' using u1.
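A small sketch of that projection step, assuming u1 is already known (the values below are made up): the dot product u1 · xi gives the position of xi' along u1.

import numpy as np

u1 = np.array([0.6, 0.8])   # a unit vector, ||u1|| = 1
xi = np.array([3.0, 4.0])   # one data point

# scalar projection of xi onto u1 (the length of xi'), valid because ||u1|| = 1
length = np.dot(u1, xi)

# the projected point xi' itself lies along the direction u1
xi_projected = length * u1
print(length, xi_projected)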

If we take the mean of the xi, where i varies from 1 to n, as x̄ and multiply it with u1 transpose, we get the mean of the projected points xi' as x̄' = u1ᵀ x̄.

Q1 — Find u1 such that the variance of the points xi projected onto u1 is maximal, i.e.

maximize (1/n) Σ (u1ᵀ xi − u1ᵀ x̄)²  over all u1 with ||u1|| = 1.

Now, according to the definition above, the projection of each xi onto u1 can be written as u1ᵀ xi.

Our dataset X is column standardized, i.e. the mean vector x̄ of X is zero. So our objective becomes

maximize (1/n) Σ (u1ᵀ xi)²  over all u1 with ||u1|| = 1.
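To see why the co-variance matrix appears in the "Finding u1" section further below, the standardized objective can be rewritten as follows (a standard derivation, sketched here; it is not written out in the text above):

\[
\max_{\|u_1\|=1} \frac{1}{n}\sum_{i=1}^{n}\big(u_1^{T}x_i\big)^2
= \max_{\|u_1\|=1} u_1^{T}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^{T}\Big)u_1
= \max_{\|u_1\|=1} u_1^{T} S\, u_1
\]

Maximizing u1ᵀ S u1 subject to ||u1|| = 1 is solved by the eigen vector of S with the largest eigen value, which is exactly the recipe used later in this article.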

Alternative Formulation of PCA — Distance Minimization

Unlike the previous problem, in the distance-minimization formulation PCA finds the u1 for which the distances of all the points from their projections onto u1 are smallest.

The direction that achieves the maximal variance is the same direction that has the minimal total distance of all data points from it.

di = the distance of point xi from the direction u1.

Q2 - Find u1 which minimizes the sum of the squared distances, i.e.

minimize Σ di²  over all u1 with ||u1|| = 1.

Let's plot one point and try to find its distance from u1. In the diagram below, we have taken xi as the point.

Here the diagram forms a right-angled triangle, so using the Pythagoras theorem we can calculate the distance di of each point as

di² = ||xi||² − (u1ᵀ xi)².

Using this, we can now find the u1 for which the sum of the squared distances of all points is minimal.

Substituting the value of di² into the formula above, the objective becomes

minimize Σ ( ||xi||² − (u1ᵀ xi)² )  over all u1 with ||u1|| = 1.
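Writing the two objectives side by side makes the equivalence claimed earlier explicit (a standard identity, sketched here): since ||xi||² does not depend on u1, minimizing the summed squared distances is the same as maximizing the summed squared projections.

\[
\min_{\|u_1\|=1} \sum_{i=1}^{n} d_i^{2}
= \min_{\|u_1\|=1} \sum_{i=1}^{n}\Big(\|x_i\|^{2} - \big(u_1^{T}x_i\big)^{2}\Big)
\;\Longleftrightarrow\;
\max_{\|u_1\|=1} \sum_{i=1}^{n}\big(u_1^{T}x_i\big)^{2}
\]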

Finding u1

Now we will find the solution to both of the optimization problems discussed above, i.e. we will find the direction u1 (a short NumPy sketch of the whole recipe follows the steps below),

1. Take a dataset X which is an n x d matrix and column standardize it.

2. Calculate the co-variance matrix of X and call it S, which will be a d x d matrix.

3. Find the eigen values of S, which are λ1, λ2, λ3, …, λd.

Let's assume that λ1 >= λ2 >= λ3 >= … >= λd,

which means λ1 is the maximal eigen value.

Computing eigen values and vectors is very simple: in Python, the NumPy library provides numpy.linalg.eig (and numpy.linalg.eigh for symmetric matrices such as S), which returns the eigen values and eigen vectors.

4. For every eigen value there is a corresponding eigen vector; calculate the eigen vectors v1, v2, v3, …, vd.

A very nice property that these eigen vectors hold is that every pair of eigen vectors is perpendicular to each other.

5. The eigen vector corresponding to the largest eigen value is u1, i.e.,

u1 = v1
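The five steps above can be sketched in a few lines of NumPy (a minimal illustration on random data, not the exact code used later in this article):

import numpy as np

def first_principal_direction(X):
    # 1. column standardize: zero mean and unit variance per feature
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. co-variance matrix S, which is d x d
    S = np.cov(X, rowvar=False)
    # 3 & 4. eigen values and eigen vectors of the symmetric matrix S
    #        (np.linalg.eigh returns the eigen values in ascending order)
    eigen_values, eigen_vectors = np.linalg.eigh(S)
    # 5. u1 is the eigen vector belonging to the largest eigen value
    return eigen_vectors[:, -1]

# usage with random data: 100 points, 5 features
X = np.random.rand(100, 5)
u1 = first_principal_direction(X)
print(u1, np.linalg.norm(u1))   # u1 has unit length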

Observations

1. v1 is the direction with the maximal spread, v2 corresponds to the direction with the second-most spread, and so on.

2. λi / Σj λj represents what percentage of information/spread we preserve by selecting the vector vi (a small sketch of this calculation follows below).
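A short sketch of that calculation, reusing the eigen values of S from the recipe above (random data again, purely for illustration):

import numpy as np

X = np.random.rand(100, 5)
X = (X - X.mean(axis=0)) / X.std(axis=0)
S = np.cov(X, rowvar=False)
eigen_values = np.linalg.eigh(S)[0][::-1]        # sorted descending

explained = eigen_values / eigen_values.sum()    # lambda_i / sum of all lambdas
print("percentage of spread per direction:", explained * 100)
print("preserved by the top 2 directions:", explained[:2].sum() * 100, "%")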

Real-world example of Dimensionality Reduction using PCA:

We will take the MNIST dataset from Kaggle.com and try to visualize it in 2D.

About MNIST — It is a dataset of images of handwritten digits. Each image is 28 x 28 pixels, which gives 784 features per image. The dataset has 42000 data points, of which we will be using 15000. We will use the Python library sklearn to reduce the dimensions of the MNIST dataset from 784 dimensions to 2 dimensions.

To apply PCA on MNIST the Python code goes as below,

#First read the csv (comma separated file) containing the MNIST dataset.
import pandas as pd
import numpy as np

d0 = pd.read_csv('./mnist_train.csv')
# save the labels to a Pandas series
l = d0['label']
# Drop the label feature
d = d0.drop("label", axis=1)

The labels in this dataset are the actual digits written in the images; a label is given for each data point.

# Pick first 15K data-points to work on for time-efficiency.
labels = l.head(15000)
data = d.head(15000)

We have taken 15000 points out of the total 42000 and stored them in the variable data.

from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)

# find the co-variance matrix which is: A^T * A
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data)

# finding the top two eigen-values and corresponding eigen-vectors
# for projecting onto a 2-Dim space.
from scipy.linalg import eigh
# the parameter 'eigvals' takes (low index, high index);
# eigh returns the eigen values in ascending order, so this
# generates only the top 2 (782 and 783) eigen values.
# note: newer SciPy versions use subset_by_index=[782, 783] instead of eigvals.
values, vectors = eigh(covar_matrix, eigvals=(782, 783))
# eigh returns the eigen vectors as columns, so transpose them
# to get a 2 x 784 matrix with one eigen vector per row
vectors = vectors.T

# projecting the original data sample onto the plane formed by the
# two principal eigen vectors by matrix multiplication.
import matplotlib.pyplot as plt
new_coordinates = np.matmul(vectors, sample_data.T)
print("resultant new data points' shape ", vectors.shape, "*", sample_data.T.shape, " = ", new_coordinates.shape)

# appending the label to the 2d projected data
new_coordinates = np.vstack((new_coordinates, labels)).T

# creating a new data frame for plotting the labeled points.
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())

# plotting the 2d data points with seaborn
# note: newer seaborn versions use height=6 instead of size=6.
import seaborn as sn
sn.FacetGrid(dataframe, hue="label", size=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

In the above example we computed PCA ourselves by calculating all the required terms, like the co-variance matrix, eigen values and eigen vectors. To bypass all these steps, Python has a library called sklearn, which we will use now.

Implementation of PCA using sklearn:

# initializing PCA
from sklearn import decomposition
pca = decomposition.PCA()

# set the number of components to 2
pca.n_components = 2
pca_data = pca.fit_transform(sample_data)

# pca_data contains the 2-d projections of sample_data
print("shape of pca_data = ", pca_data.shape)

# attaching the label for each 2-d data point
pca_data = np.vstack((pca_data.T, labels)).T

# creating a new data frame which helps us in plotting the result data
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
sn.FacetGrid(pca_df, hue="label", size=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

sklearn is a very useful Python library which reduces all the steps above to a few lines. Its decomposition module provides a PCA class which applies PCA to the given dataset without us having to compute any eigen value, eigen vector or co-variance matrix separately, as in the previous code. Above, we create a PCA object, set the number of components to 2, and let fit_transform(sample_data) do all the computation of the co-variance matrix, eigen values and eigen vectors in a single call; we then check the shape of the result, put it in a dataframe and plot it. The resulting plot shows the same output as the previous one, only rotated by 90 degrees.
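If we also want to know how much of the spread the two components preserve, sklearn exposes this directly through explained_variance_ratio_ (a short follow-up sketch; sample_data is the standardized MNIST matrix from the earlier code, and pca2 is just a hypothetical variable name used here):

from sklearn import decomposition

pca2 = decomposition.PCA(n_components=2)
pca2.fit(sample_data)
# fraction of the total variance captured by each of the two components
print(pca2.explained_variance_ratio_)
print("total spread preserved:", pca2.explained_variance_ratio_.sum())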

Limitations of PCA -

  1. It is not well suited to data with a circular spread, because projecting such data onto a single direction loses a lot of information.
  2. For well separated clusters lying in distinct quadrants, it may also lose a lot of information, since the top principal component can collapse the clusters onto each other.
  3. If the data follows a shape like a sine wave, PCA again loses information, because it only captures linear directions of maximal variance.

Acknowledgement -

Thanks to the Applied AI Course and team for teaching these concepts.

References -

  1. https://www.appliedaicourse.com/course/applied-ai-course-online
  2. https://www.kaggle.com/
