A Beginner's Guide to Dimensionality Reduction Using Principal Component Analysis (PCA)
Visualize your data before throwing it into your models.
Before diving into the topic at hand, I have a question for you: "Have you ever wondered why you learned mathematical concepts like the equation of a line, planes, eigen values, and eigen vectors, or what their real-world applications are, when you first met them in school or college?"
To be frank, I didn't know either at the time, but I promise we will solve this mystery together :)
Out of all the possible ways to explain these topics, today I'll try to explain them from a Data Science / Machine Learning perspective. Let's take a look at the table of contents:
- Why PCA?
- Geometric intuition of PCA
- Mathematical Objective function of PCA (Variance Maximization)
- Alternative Formulation of PCA (Distance Minimization)
- Eigen Values and Eigen Vectors
- Code for PCA
- Limitations of PCA
- Other uses of PCA
Prerequisites: I'm assuming you know some basics, like what a dataset is, what features in a dataset are, and some high school math. Nothing much!
Why PCA?
Assume you are given a very high-dimensional dataset like MNIST, which has 784 dimensions. What if I say, "Visualize this data"?
Of course, we humans can't visualize more than 3 dimensions. This is where PCA comes into play.
Apart from visualization, there are other uses of PCA, which we will see as we go through this concept.
Geometric Intuition
For explanation purposes, let's take a 2D dataset instead of 784 dimensions. Say we have a dataset with two features, f1 and f2, where f1 represents blackness of hair and f2 represents height (I know it doesn't make much sense, but please bear with me!).
This data is in 2 dimensions because we have 2 features. As you can see, the data points are spread more along feature f2 than along f1.
The mathematical term for spread is variance.
If we are forced to go from 2D to 1D, we can project these points onto feature f2 and simply call that the 1D representation of the data, since the spread/variance is higher on f2. Simple, right?
Let’s take a slightly tricky data this time:
As you can see, in such a case both f1 and f2 preserve the same spread/variance.
If you choose either f1 or f2 alone, roughly 50% of the information is lost. In this case, what would you do?
One idea is to rotate the axes f1 and f2 to f1' and f2' respectively, where f1' ⊥ f2', such that f1' has more spread than f2' (see the image below):
Now that we have found the axis with maximum spread, f1', we can drop f2' and project the Xi's (data points) completely onto f1'. Our objective is achieved: going from 2 dimensions to 1 dimension.
This is essentially what PCA does: it finds the direction that preserves the most information about the data and drops the unnecessary dimensions.
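To make this concrete, here is a tiny numerical sketch (my own illustration, not from the original post; the variable names are arbitrary) that finds the maximum-variance direction of a 2D point cloud and projects the points onto it:
import numpy as np

# 500 correlated 2D points (think of them as features f1 and f2)
rng = np.random.default_rng(0)
points = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 3]], size=500)

# eigen decomposition of the 2x2 covariance matrix; eigh returns eigen values in ascending order
cov = np.cov(points.T)
eigen_values, eigen_vectors = np.linalg.eigh(cov)
u1 = eigen_vectors[:, -1]      # the rotated axis f1' with maximum spread

# projecting every point onto u1 gives the 1D representation of the 2D data
projections_1d = points @ u1
print(projections_1d.shape)    # (500,)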
Now we’ll see the mathematics behind this great idea:
Mathematical Objective function of PCA
Notations and prerequisite formulas:
- We'll represent f1', i.e., the maximum-variance axis, as "u1", since most explanations online prefer this notation.
- So we want to find the unit vector u1, i.e., ||u1|| = 1, that preserves the maximum variance.
- The projection of a point Xi onto u1 is u1^T · Xi (this is the quantity whose spread we want to maximize; the projection formula is written out just after this list).
- We assume that the data is column standardized, i.e., each feature has mean (μ) = 0 and standard deviation (σ) = 1.
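For reference, since u1 is a unit vector, the projection of a point xi onto u1 and its length can be written as:
\text{proj}_{u_1}(x_i) = \left(u_1^{T} x_i\right) u_1, \qquad \text{length} = \frac{u_1^{T} x_i}{\lVert u_1 \rVert} = u_1^{T} x_i \quad (\text{since } \lVert u_1 \rVert = 1)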
Coming to the mathematical equation:
Formula for Variance:
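For a feature x with values x1, …, xn and mean x̄, it is:
\mathrm{Var}(x) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2}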
Now we’ll relate the above formula with our problem:
We want to find u1 such that the spread/variance of the Xi's projected onto u1 is maximal.
So the function we want to maximize is the variance of the projected points.
Since we assumed the data is column standardized (mean = 0), the u1^T · x̄ term in that expression becomes 0.
So this is the final optimization problem we will solve: find the u1 that maximizes the variance of the projections, with the constraint that u1 is a unit vector.
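Written out in LaTeX (a reconstruction from the description above), the optimization problem is:
\max_{u_1}\ \frac{1}{n} \sum_{i=1}^{n} \left(u_1^{T} x_i - u_1^{T} \bar{x}\right)^{2} \;=\; \max_{u_1}\ \frac{1}{n} \sum_{i=1}^{n} \left(u_1^{T} x_i\right)^{2} \quad \text{subject to } u_1^{T} u_1 = 1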
Alternative Formulation of PCA (Distance Minimization)
As you can see from the above visual, in this formulation we minimize the distance of the points Xi from the line along u1.
In the right triangle formed by a point Xi and the line along u1, the adjacent side is the projection of Xi onto u1 and the hypotenuse is the length of Xi, so the distance di between Xi and u1 follows directly from Pythagoras' theorem.
The final distance-minimization objective is then: find the u1 that minimizes the total squared distance of the points Xi from u1.
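In LaTeX (again a reconstruction from the description above), the squared distance and the resulting optimization problem are:
d_i^{2} = \lVert x_i \rVert^{2} - \left(u_1^{T} x_i\right)^{2}
\min_{u_1}\ \sum_{i=1}^{n} \left[\, \lVert x_i \rVert^{2} - \left(u_1^{T} x_i\right)^{2} \,\right] \quad \text{subject to } u_1^{T} u_1 = 1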
The previous optimization is variance maximization, while this one is distance minimization; both yield the same direction u1.
Eigen Values and Eigen Vectors
I'm not going to explain the definitions here (you have your friend Google for that!), but I'll explain why these are useful in PCA.
Prerequisite Concept: Covariance matrix.
The solution to our optimization problem can be obtained using the eigen values and eigen vectors of the covariance matrix, represented by (λ1, λ2, λ3, …, λd) and (V1, V2, V3, …, Vd) respectively. Remember, eigen values are scalars.
There's a simple function (eigh from the SciPy library, which we will use in the code below) to which you give the covariance matrix (S), and it returns the corresponding eigen values and eigen vectors; we order them so that λ1 > λ2 > λ3 > … > λd.
That means the vector corresponding to λ1, i.e., V1, explains the most variance, the vector corresponding to λ2, i.e., V2, explains the second most, and so on.
So the u1 which we are trying to obtain from the optimization function is nothing but V1 here.
One very nice property is that every pair of eigen vectors is perpendicular.
That means if we take V1 and V2, i.e., the top two vectors corresponding to the top two eigen values, it is just like obtaining f1' and f2'. Similarly, we can keep the top 700, 300, 100, 2, 1, or however many dimensions we want to reduce to, and we get back the top directions of the dataset.
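A related quantity worth knowing (standard, though not spelled out above) is the fraction of variance retained when we keep only the top k eigen vectors; this is exactly what the "Variance Explained vs Number of dimensions" plot later in this post shows:
\text{fraction of variance retained} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}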
Okay, enough theory! Let's dive into the code.
Code for PCA
Manual implementation:
Importing some useful libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Loading the dataset:
d0 = pd.read_csv('./mnist_train.csv')
labels = d0['label']
data = d0.drop("label", axis=1)
Checking the shape:
print(data.shape)
print(labels.shape)
(42000, 784)
(42000,)
As you can see, there are 784 dimensions, each representing one pixel of a 28x28 image.
Let's see what an image in this dataset looks like:
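Here is a small sketch (my own, reusing the data and labels variables loaded above) of how one such image can be displayed; each row of the dataset is a 28x28 image flattened into 784 pixel values:
# pick any row, reshape its 784 pixel values back into a 28x28 grid, and display it
idx = 100
grid_data = data.iloc[idx].to_numpy().reshape(28, 28)
plt.figure(figsize=(4, 4))
plt.imshow(grid_data, interpolation="none", cmap="gray")
plt.show()
print(labels[idx])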
First, we have to standardize the data, as mentioned earlier:
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
Constructing the covariance matrix:
sample_data = standardized_data

# proportional to the covariance matrix; the missing 1/n factor does not change the eigen vectors
covar_matrix = np.matmul(sample_data.T, sample_data)
print("The shape of variance matrix = ", covar_matrix.shape)
The shape of variance matrix = (784, 784)
Finding the Eigen values and vectors:
from scipy.linalg import eigh

# eigh returns eigen values in ascending order, so indices (782, 783) select the two largest
# (in newer SciPy versions this keyword is subset_by_index=[782, 783])
values, vectors = eigh(covar_matrix, eigvals=(782, 783))
print("Shape of eigen vectors = ", vectors.shape)

# converting the eigen vectors into (2, d) shape for ease of further computations
vectors = vectors.T
print("Updated shape of eigen vectors = ", vectors.shape)

# here vectors[1] is the eigen vector corresponding to the 1st principal component
# and vectors[0] is the eigen vector corresponding to the 2nd principal component
Shape of eigen vectors = (784, 2)
Updated shape of eigen vectors = (2, 784)
Projecting the original data onto the plane formed by the two principal eigen vectors using matrix multiplication:
new_coordinates = np.matmul(vectors, sample_data.T)
print("resultant new data points' shape ", vectors.shape, "X", sample_data.T.shape, " = ", new_coordinates.shape)
Resultant new data points' shape (2, 784) X (784, 15000) = (2, 15000)
# appending the label to the 2d projected data
new_coordinates = np.vstack((new_coordinates, labels)).T

# creating a new data frame for plotting the labeled points
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())
   1st_principal  2nd_principal  label
0 -5.558661 -5.043558 1.0
1 6.193635 19.305278 0.0
2 -1.909878 -7.678775 1.0
3 5.525748 -0.464845 4.0
4 6.366527 26.644289 0.0
Plotting:
# plotting the 2d data points with seaborn
import seaborn as sn
# note: in newer seaborn versions the size= argument is called height=
sn.FacetGrid(dataframe, hue="label", size=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
Sklearn’s implementation:
Instead of writing that much code, sklearn has a module called decomposition that makes our task simpler:
# initializing the PCA
from sklearn import decomposition
pca = decomposition.PCA()

# configuring the parameters: the number of components = 2
pca.n_components = 2

# pca_data will contain the 2-d projections of sample_data
pca_data = pca.fit_transform(sample_data)
print("shape of pca_data = ", pca_data.shape)
shape of pca_data = (15000, 2)
Plotting:
# attaching the label to each 2-d data point
pca_data = np.vstack((pca_data.T, labels)).T

# creating a new data frame which helps us in plotting the result data
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
sn.FacetGrid(pca_df, hue="label", size=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
Variance Explained vs Number of dimensions:
As you can see from the plot, about 90% of the variance is explained (retained) with just 200 features/dimensions. So instead of using all 784 dimensions for modeling, you can take 200 components, or even around 400 components if you want to retain roughly 99% of the information.
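The post shows this as a plot; below is a sketch (my own, reusing sample_data and the imports from earlier, not the author's original code) of how such a cumulative explained-variance curve can be generated with sklearn:
# fit PCA with all components and plot the cumulative fraction of variance explained
pca = decomposition.PCA(n_components=784)
pca.fit(sample_data)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.plot(cumulative_variance)
plt.grid()
plt.xlabel("Number of components")
plt.ylabel("Cumulative variance explained")
plt.show()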
Limitations of PCA
Some of the failure cases are:
1. PCA focuses on finding orthogonal projections of the dataset that contain the highest possible variance, in order to find hidden LINEAR correlations between the variables of the dataset.
But if the relationships between your features are not linear, say the data lies on a spiral or some other curved shape, then PCA is not your best choice.
2. If your dataset follows a nice structure like a sine wave and you project it onto V1, you lose important information about that structure, which may be useful for machine learning tasks like feature engineering.
Other Uses of PCA
- Reduce size: when we have too much data and plan to use compute-intensive algorithms like Random Forest or XGBoost, PCA helps us get rid of redundancy.
- A different perspective: Sometimes a change of perspective matters more than reduction.
If you find any mistakes in this blog, please feel free to discuss in the comment box.
I will be posting more content explaining various complex topics in a simpler manner. Until then, goodbye :)
Follow me on Linkedin: https://www.linkedin.com/in/dileep-teja-473088141/