Dimensionality Reduction and Visualization - Part-2 - PCA

Deepak Jain
Applied Machine Learning
5 min read · Jun 7, 2020

This article is a continuation of the series on Dimensionality Reduction and Visualization. In case you haven’t gone through Part 1, I would highly recommend you read it first here.

In this article, we will understand one such dimensionality reduction technique called Principal Component Analysis (PCA).

Remember the objective of dimensionality reduction is to find a low-dimensional representation of the data that retains as much information as possible.

Before diving into the mathematical representation of PCA, let's understand the geometric intuition behind it.

Consider the following example:

Imagine we collect data from a group of people, measuring 2 features for each person: their height and the blackness of their hair. Suppose we plot these 2 features using a scatter plot and get a graph something like the one below:

We want to perform dimensionality reduction on this data set. Meaning, we want to reduce d-dimensional data (2-D in this case) to d’-dimensional data (1-D in this case), where d’ < d. So basically, we want to convert this 2-D representation of the data into 1-D. We can achieve this by getting rid of one of the features. But here is the catch: we have 2 features (f1: height of people, f2: blackness of hair). How do we decide which feature to keep and which to get rid of?

Observe the scatter plot. If I were to draw perpendicular lines from the data points to the axes (red lines), it is very clear that the spread of the points (aka variance) is much higher on f1 than on f2. High variance implies more varied information in the data, and since the spread on f2 is low, dropping it loses less information than dropping f1, where the spread is high.

So, in order to convert this 2-D data to 1-D, we drop feature f2 and use feature f1.

This is essentially Dimensionality Reduction: finding the best low-dimensional representation of the data. Of course, there’ll be some error because we neglected the second feature, but this error is kept to a minimum since feature f2 has very low variance.
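As a quick illustration, here is a minimal sketch of this idea with made-up data (the feature names simply mirror the height/hair example above): compute the per-feature variance and keep the feature with the larger spread.

```python
import numpy as np

# Made-up 2-D data: f1 (height) has a large spread, f2 (hair blackness) a tiny one
rng = np.random.default_rng(0)
f1 = rng.normal(loc=165, scale=10, size=100)    # high variance
f2 = rng.normal(loc=0.9, scale=0.02, size=100)  # low variance
X = np.column_stack([f1, f2])

variances = X.var(axis=0)          # per-feature variance
keep = int(np.argmax(variances))   # index of the feature with the larger spread
X_1d = X[:, keep]                  # our 1-D representation of the data

print(variances, "-> keeping feature", keep)
```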

The Plot Thickens

What we saw above in the scatter plot is a perfect data set created for explaining this concept, but in the real world we hardly ever encounter such perfect data sets.

Consider the below example:

Let’s say we performed column standardization, which implies the mean value of each column is 0 and the standard deviation is 1.

Mean[f1] = Mean[f2] = 0 and std_dev[f1] = std_dev[f2] = 1
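As a sketch, column standardization can be done directly with NumPy (the toy data here is only for illustration): subtract each column’s mean and divide by its standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[165, 40], scale=[10, 7], size=(100, 2))  # toy 2-feature data

# Column standardization: subtract each column's mean, divide by its std. dev.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(6))  # ~0 for every column
print(X_std.std(axis=0).round(6))   # 1 for every column
```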

Suppose, on plotting the scatter plot we get a graph like above.

Now, unlike the first graph, there is enough spread of data on both f1 and f2 meaning both have a pretty good amount of variance. So how do we perform dimensionality reduction here? Should we just randomly pick a feature and drop the other? NO

Let's say I draw a line in the direction of maximum variance and call it x1, and I draw another line x2 perpendicular to x1. Now I can see that the spread of points on x1 is very high, whereas on x2 it is low. This is very similar to what we learned in our hair colour vs height example. If we think about it logically, all we did was rotate the f1 and f2 axes by an angle θ such that we get the maximum spread of points (highest variance) along x1.
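A minimal sketch of this rotation using scikit-learn (the correlated data below is synthetic, just to mimic a scatter plot like the one described; nothing here comes from the article's figures): the rotated coordinates have most of the spread along x1 and very little along x2.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=500)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # column standardization

pca = PCA(n_components=2)
X_rotated = pca.fit_transform(X)           # coordinates in the rotated axes {x1, x2}

print(pca.components_)                     # rows: directions of x1 and x2 in terms of {f1, f2}
print(X_rotated.var(axis=0))               # spread along x1 is much larger than along x2
```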

Next, we want to find the direction of x1 such that the variance of the data points projected onto x1 is maximum. To represent the direction of x1, we create a unit vector (u1) along x1. You might wonder why we use a unit vector. It is because a unit vector has magnitude (length) 1, and we are interested only in the direction of x1. Hence,

u1: unit vector that represents the direction onto which, if we project the data points, we’ll get maximum spread (variance)

||u1|| = length of u1 = 1
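To make the unit-vector point concrete, here is a tiny sketch (the direction values are arbitrary): normalizing a vector keeps its direction but forces its length to 1.

```python
import numpy as np

v = np.array([3.0, 4.0])        # some arbitrary direction
u1 = v / np.linalg.norm(v)      # unit vector in the same direction

print(u1)                       # [0.6, 0.8]
print(np.linalg.norm(u1))       # 1.0  -> ||u1|| = 1
```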

Let's consider just one data point for now to understand the calculations. We can later extend the same formulation to the rest of the points.

Imagine we have a graph like the one below, with just one data point. Everything else remains the same as above. In the graph, I have my original axes {f1, f2}, a data point xi, and a unit vector u1 in the direction of maximum variance {x1}. I project the point xi onto the unit vector and call it x’i (read as x-i-dash). Here, x’i = projection of point xi on u1.

We know that projection of a point can be represented as below:
x’i = (u1.xi) / ||u1||

The English interpretation of the above equation would be:
Projection of xi on u1 = (dot product of u1 and xi ) / (length of u1)
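A minimal sketch of this projection with NumPy (the values of xi and u1 are made up): since ||u1|| = 1, the scalar projection reduces to a simple dot product.

```python
import numpy as np

u1 = np.array([0.6, 0.8])   # unit vector, ||u1|| = 1
xi = np.array([2.0, 1.0])   # a single data point

x_i_dash = np.dot(u1, xi) / np.linalg.norm(u1)  # = u1 . xi, because ||u1|| = 1
print(x_i_dash)             # scalar projection of xi onto u1
```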

The rest of the deduction is provided in the image (as it is easier to explain it using pen and paper)
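Since the deduction itself sits in the image, here is a hedged sketch of the expression it arrives at, assuming the data has been column standardized so that the mean of the projected points is zero:

```latex
\operatorname{Var}(x'_i) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(u_1^{\top} x_i\right)^{2},
\qquad \text{with } u_1^{\top} u_1 = \lVert u_1 \rVert^{2} = 1
```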

This variance formula that we get in the end is the function that needs to be maximized. This function is nothing but the optimization objective of PCA.

The English interpretation of the above equation is:

We want to maximize the variance of x’i, and we want to find u1, the direction along which that variance is maximum.
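As a quick numerical sanity check (a sketch on synthetic data, not the article's data): the u1 that solves this optimization turns out to be the top eigenvector of the covariance matrix, and projecting onto it gives a higher variance than projecting onto other unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=1000)
X = X - X.mean(axis=0)                  # mean-center the data

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
u1 = eigvecs[:, np.argmax(eigvals)]     # candidate direction of maximum variance

proj_var_u1 = (X @ u1).var()            # variance of projections onto u1

# Compare against a few random unit directions: none should beat u1
for _ in range(3):
    r = rng.normal(size=2)
    r /= np.linalg.norm(r)
    print((X @ r).var(), "<=", proj_var_u1)
```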

In the next part, we will see the alternate formulation of PCA and its limitation.

Feel free to respond in the comments section.

Until then, Happy Learning!

Edit:
Link to Part-3 of the series
Link to Part-4 of the series
