Geometry Makes PCA and t-SNE So Easy!

Nishesh Gogia · Published in Analytics Vidhya · Jan 1, 2020 · 5 min read

In this article, we are going to see how things get easy when we wear the spectacles of geometry to understand PCA and t-SNE for dimension reduction.

WHAT IS DIMENSION REDUCTION???

We have to know the basics first, RIGHT? Otherwise, Machine Learning has the ability to kick us badly.

I have been kicked many times.

Dimension reduction simply means reducing the number of dimensions, or features in Machine Learning language, so that we can get a more interpretable model.

Let me give you an example: imagine we have 784 features. People who have dealt with the MNIST dataset will understand why I have taken 784 features.

If you don't, it's okay; just imagine any big number and think of each feature as one dimension.

Can you picture 784 dimensions in your mind?

Will you be able to visualize the scattering of points in 784 dimensions?

That is why DIMENSION REDUCTION IS IMPORTANT.

Let me give you another example: let's say you are a data scientist and you have to explain your model to clients who do not understand Machine Learning. How will you make them understand the working of 784 features or dimensions?

In simple language, the "INTERPRETABILITY" of the model.

That is the second reason why Dimension Reduction is Important.

Now let me give you a third example: let's say you are working for an internet-based company where the output of something must come back in milliseconds or less, so "Time Complexity" and "Space Complexity" matter a lot.

More features need more time, which these companies can't afford.

SIMPLE!

So in short, there are basically three reasons for DIMENSION REDUCTION:

  1. Visualization.
  2. Interpretability.
  3. Time and Space Complexity.

In this article, we are solely devoted to reducing dimensions for visualization.

Visualizing 784 dimensions would be very difficult, so is there a technique with which we can reduce these dimensions?

The first and the oldest technique is "PCA".

PCA stands for Principal Component Analysis.

To keep it simple and clear for everyone, let's convert 2D to 1D; if we are able to do that, the same linear algebra applies in higher dimensions.

So let's understand geometrically what PCA actually is.

GEOMETRIC INTUITION OF PCA

Our Aim is to convert 2D to 1D.

CASE 1

Let's take an example: we have two features, F1 (blackness of hair) and F2 (height of people).

Let's say blackness of hair is a real number; assume there is some criterion to measure blackness of hair as a real number.

In the picture, you can see the distribution or scattering; let's say this is a distribution of Indians.

So we can easily say that people with almost the same blackness of hair have a huge spread of heights; most Indians have black hair, which is why that one hair value covers all the heights.

It would be a different case if I took this distribution in America, where we can find blond hair, black hair, etc.

So the spread along the height axis is large, but the spread of blackness of hair is very small.

So it can be said that in India, blackness of hair won't give us much information because its spread is very low; it will not add any value to our model.

So I can remove this feature (blackness of hair), as its spread is very small because most Indians have black hair.

So basically, the idea is that if we are forced to skip one feature, we skip the one that gives us less information.

IN OTHER WORDS, PCA PRESERVES THE DIRECTION WITH MAXIMAL SPREAD OR VARIANCE.
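Here is a minimal sketch of this idea in Python, assuming some made-up numbers for F1 (blackness of hair) and F2 (height); the values are purely hypothetical and are only there to show how per-feature variance tells us which feature to drop.

```python
import numpy as np

# Hypothetical data: F1 = blackness of hair, F2 = height in cm.
# Almost everyone has the same F1 value (tiny spread), while heights vary a lot.
rng = np.random.default_rng(0)
f1_hair = rng.normal(loc=9.5, scale=0.1, size=100)   # blackness of hair, barely varies
f2_height = rng.normal(loc=165, scale=10, size=100)  # height, large spread
X = np.column_stack([f1_hair, f2_height])

# Spread (variance) along each axis
print(X.var(axis=0))   # roughly [0.01, 100] -> F1 carries almost no information

# Drop the low-variance feature: 2D -> 1D
X_1d = X[:, [1]]
print(X_1d.shape)      # (100, 1)
```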

CASE 2

Now in this case, the spread is roughly equal along both axes, so we can't simply drop one feature.

Now what to do?

If we can rotate our axes by some angle theta so that one axis points along the direction of maximum variance or maximum spread (refer to the image), then we can drop the feature with minimum variance.

So in our example, the first step is to find the new axes F1' and F2', and the second step is to drop F2' because the variance along F2' is low.

So the crux is: if X is our 2-dimensional dataset, WE WANT TO FIND A DIRECTION F1' SUCH THAT THE VARIANCE OF THE PROJECTIONS ONTO F1' IS MAXIMUM.

HOW TO DO IT???

It involves the mathematical objective of PCA, which we will discuss later.

But for now, just understand that it involves an optimization problem (DISTANCE MINIMIZATION, which turns out to be equivalent to maximizing the variance of the projections).
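To make the Case 2 picture concrete, here is a minimal sketch using scikit-learn's PCA on some hypothetical diagonal data; the data and parameter values are my own assumptions, just to show how PCA finds F1' and projects 2D down to 1D.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical Case-2 data: the spread lies along a diagonal direction,
# so neither original axis can simply be dropped.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t + 0.3 * rng.normal(size=200),
                     t + 0.3 * rng.normal(size=200)])

# PCA finds F1', the direction of maximal variance, and projects onto it: 2D -> 1D
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print(pca.components_)                # the direction F1' (close to [0.707, 0.707] here)
print(pca.explained_variance_ratio_)  # fraction of the total variance kept along F1'
```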

t-SNE

It was developed by Laurens van der Maaten and Geoffrey Hinton (often called the Godfather of Deep Learning).

It is a state-of-the-art technique introduced in 2008.

What is t-SNE???

T stands for the t-distribution (for now, just remember that the t-distribution, or Student's t-distribution, is used when the population variance is unknown).

In the next few articles, I will talk about the t-distribution in detail, but for now let it be.

S stands for Stochastic (probabilistic); in other words, if you apply the t-SNE algorithm to the same dataset more than once, it will give you slightly different results each time.

N stands for Neighbourhood.

Neighbourhood simply means that if there are six points x1, x2, x3, x4, x5, x6, and the distances from x1 to x2, x3, and x4 are relatively small compared to the distances to x5 and x6, then

N(x1) = {x2, x3, x4}, i.e., x2, x3, x4 are the neighbourhood of x1,

and x5 and x6 are not.
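A tiny sketch of this neighbourhood idea, with six hypothetical 2D points chosen so that x2, x3, x4 land close to x1 while x5 and x6 land far away; the coordinates and the distance threshold are arbitrary assumptions for illustration.

```python
import numpy as np

# Six hypothetical points: x2, x3, x4 are close to x1; x5 and x6 are far away.
points = np.array([
    [0.0, 0.0],   # x1
    [0.1, 0.0],   # x2
    [0.0, 0.2],   # x3
    [0.2, 0.1],   # x4
    [5.0, 5.0],   # x5
    [6.0, 4.0],   # x6
])

# Distances from x1 to every point
dists = np.linalg.norm(points - points[0], axis=1)
print(dists)  # small for x2, x3, x4; large for x5, x6

# Neighbourhood of x1: points within an arbitrary radius of 1.0
neighbours = np.where((dists > 0) & (dists < 1.0))[0]
print(neighbours)  # [1 2 3] -> x2, x3, x4
```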

E stands for Embedding.

Let's say D is a big number, for example 784.

Embedding simply means that for every point xi in D dimensions, we find a corresponding point in a lower number of dimensions (say 2).

HOW DO WE FIND THAT CORRESPONDING POINT?

We will discuss that when we explore the mathematical objective of t-SNE; right now we will focus on the geometric intuition.
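Still, if you want to see an embedding happen before we get to the math, here is a minimal sketch using scikit-learn's TSNE; it uses the small digits dataset (64 features) that ships with scikit-learn as a stand-in for the 784-feature MNIST, and the parameter values are just illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images, standing in for 784-dimensional MNIST
X, y = load_digits(return_X_y=True)

# Embed every 64-D point xi into a corresponding 2-D point.
# Because t-SNE is stochastic, a different random_state (or a repeated run)
# gives a slightly different embedding.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X.shape, "->", X_2d.shape)  # (1797, 64) -> (1797, 2)
```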

I THOUGHT OF COVERING THE t-SNE GEOMETRIC INTUITION IN THIS ARTICLE ITSELF, BUT THAT WOULD BE TOO MUCH FOR ONE ARTICLE.

So in the next article, I will cover that.

Thanks for Reading…
