PCA vs LDA vs T-SNE — Let’s Understand the difference between them!

Aman Kapri
Analytics Vidhya
9 min read · Feb 17, 2020


We come across these methods whenever we talk about high-dimensional data, or about how to visualize data that has hundreds of attributes or even more.

The solution we usually land on is one of the dimensionality reduction techniques. But here the main question arises: when should we use which one? What is the basic difference between them, and what is the intuition behind each of these techniques?

In this article, we are going to answer these questions.

Let’s get started!

Principal Component Analysis (PCA)

PCA is an unsupervised machine learning method that is used for dimensionality reduction. The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible.

This is done by transforming the variables into a new set of variables, each of which is a combination of variables or attributes from our original dataset, in such a way that maximum variation is retained. These combinations of attributes are known as Principal Components (PCs), and the component that captures the maximum variance is called the Dominant Principal Component. The amount of variance retained decreases as we move down the order, i.e. PC1 > PC2 > PC3 > … and so on.

Transforming 2D data into 1D ( PC1 contains maximum variance)
Transforming 3D data into 2D/1D ( PC1 > PC2 > PC3 )

Once we transform the data into principal components, we can choose to drop the components that capture little variance. This gives a way to reduce dimensions while focusing on the components with the larger variance.

Why do we use PCA?

Practically PCA is used for two reasons:

  1. Dimensionality Reduction: The information distributed across a large number of columns is transformed into principal components (PC) such that the first few PCs can explain a sizeable chunk of the total information (variance). These PCs can be used as explanatory variables in Machine Learning models.
  2. Visualize Classes: Visualising the separation of classes (or clusters) is hard for data with more than 3 dimensions (features). With just the first two PCs, it's usually possible to see a clear separation, as in the sketch below.
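As an example of the second use, here is a minimal scikit-learn sketch that projects the 4-dimensional iris dataset onto its first two PCs (the dataset and plotting choices are mine, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)            # 4 features, 3 classes

# Project the data onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Colour by class label only for inspection; PCA itself never sees y.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```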

Is PCA a feature selection technique?

It is not a feature selection technique; rather, it is a feature combination technique, because each PC is a weighted additive combination of all the columns in the original dataset.

PCA Methodology

Step 1: Standardize each column
If the columns are on different orders of magnitude, scale/standardize them. Also convert categorical variables into dummy numerical variables, because PCA works only on numerical data.
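A minimal sketch of this step with pandas, using a small hypothetical DataFrame (the column names and values are made up):

```python
import pandas as pd

# Hypothetical toy data: two numeric columns on very different scales
# plus one categorical column.
df = pd.DataFrame({
    "income": [45000, 54000, 61000, 38000],
    "age": [23, 35, 41, 29],
    "city": ["A", "B", "A", "C"],
})

# One-hot encode the categorical column, since PCA works only on numbers.
df = pd.get_dummies(df, columns=["city"], dtype=float)

# Standardize every column to zero mean and unit variance (z-scores).
standardized = (df - df.mean()) / df.std()
print(standardized.round(2))
```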

Step 2: Compute Covariance Matrix
Start with the analysis of the covariance matrix of the features.

Why the Covariance Matrix?
Covariance measures how two variables are related to each other, that is, whether the two variables move in the same direction with respect to each other or not. When covariance is positive, it means that if one variable increases, the other increases as well. The opposite is true when covariance is negative.

The covariance of X and Y: cov(X, Y) = Σ (x_i − mean(X)) (y_i − mean(Y)) / (n − 1)

The covariance matrix calculates the covariance of all possible combinations of columns. As a result, it becomes a square matrix with the same number of rows and columns.
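A minimal NumPy sketch of this step, using random stand-in data in place of the standardized columns from Step 1:

```python
import numpy as np

# Stand-in for the standardized data: rows are observations, columns are features.
rng = np.random.default_rng(0)
standardized = rng.standard_normal((100, 3))

# np.cov expects variables in rows by default, so pass rowvar=False.
cov_matrix = np.cov(standardized, rowvar=False)
print(cov_matrix.shape)  # (3, 3): square, one row and one column per feature
```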

The action of a matrix on a general vector can be thought of as a combination of stretch and rotation.

Covariance Matrix with Stretch and Rotation

Step 3: Compute Eigenvalues and Eigenvectors

For a given matrix there exist special directions along which its effect is only a stretch (without rotation); these special directions are called eigenvectors, or eigen-directions.

Covariance Matrix with only Stretch

The eigenvectors and eigenvalues of a matrix A are defined to be the non-zero vectors X and the scalars λ that solve

AX = λX (A only stretches X, without rotating it)

For an n × n covariance matrix, there are n eigenvectors and n corresponding eigenvalues.

Eigenvectors are the principal component directions, and eigenvalues are the magnitudes of variance along those directions.

In the above example, [-0.49, 0.87] is the principal component (eigenvector) and 5.51 is the magnitude of stretch (eigenvalue).
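A minimal NumPy sketch of this step, on an illustrative 2 × 2 covariance matrix (the numbers are my own, not the ones from the example above):

```python
import numpy as np

# A 2x2 covariance matrix with made-up values.
cov_matrix = np.array([[2.0, -1.5],
                       [-1.5, 4.0]])

# eigh is appropriate for symmetric matrices such as covariance matrices;
# it returns eigenvalues in ascending order with matching eigenvector columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort in descending order so the dominant principal component comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues)         # variances captured along each eigen-direction
print(eigenvectors[:, 0])  # PC1 direction (unit-length eigenvector)
```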

Step 4: Derive principal component features

By taking the dot product of eigenvector and standardized columns, derive the principal component features.
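Putting the four steps together, a minimal NumPy sketch might look like the following; the synthetic data is just a stand-in, and the cross-check against scikit-learn's PCA is only there to confirm that the manual steps agree with a standard implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in dataset: 200 rows, 5 columns, driven by 2 latent factors
# so that two principal components dominate.
rng = np.random.default_rng(42)
latent = rng.standard_normal((200, 2))
X = latent @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # Step 1: standardize each column

# Step 2: covariance matrix of the standardized columns.
cov = np.cov(X, rowvar=False)

# Step 3: eigen-decomposition, sorted so the dominant PC comes first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: dot product of the standardized data with the eigenvectors gives the
# principal component features; keep only the first two PCs.
pc_features = X @ eigvecs[:, :2]

# Cross-check against scikit-learn (columns may differ only by a sign flip).
pc_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(pc_features), np.abs(pc_sklearn), atol=1e-6))
```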

Linear Discriminant Analysis (LDA)

LDA is a supervised machine learning method that is used to separate two groups/classes. The main idea of linear discriminant analysis (LDA) is to maximize the separability between the two groups so that we can make the best decision to classify them. LDA is like PCA in that it helps with dimensionality reduction, but it focuses on maximizing the separability among known categories by creating a new linear axis and projecting the data points onto that axis.

LDA does not look for principal components; instead, it looks for the axis (subspace) that gives the most discrimination between the classes when the data is projected onto it.

LDA vs PCA

The objective of LDA is to find a line (a new axis) that maximizes the class separation. To do this, we need to define a good measure of separation.

Mean Vector

The mean vector μi of class ωi is simply the average of the data points belonging to that class: μi = (1/Ni) Σ x, summed over the Ni points x of class ωi.

Mean vector of each class in the X and Y feature space.

Driving force of separation

The goal is to find the projection vector w that gives the maximum separation, i.e. the largest distance between the two projected means.

Hence, a first objective function would be the distance between the projected means (the L1 Norm objective function):

J(w) = |m1 − m2| = |wᵀ(μ1 − μ2)|, where mi = wᵀμi is the projected mean of class ωi.

However, the distance between the projected means is not a very good measure since it does not take into account the standard deviation within the classes.

How do we decide which projection is better?

A projection is considered good when the variability within each class is minimum and the variability between the classes is maximum.

The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class (intra-class) variability, or the so-called scatter.

Note: scatter is essentially an unnormalized variance (the sum of squared deviations, without dividing by the number of samples).

For each class ωi, we define the scatter, an equivalent of the variance, as the sum of squared differences between the projected samples and their projected class mean: Si² = Σ (y − mi)², summed over the projected samples y of class ωi.

Si² measures the variability within class ωi after projecting it on the Y-space.

S1² + S2² measures the variability within the two classes at hand after projection, hence it is called intra-class scatter of the projected samples.

Hence, the Fisher linear discriminant is defined as the linear function that maximizes the criterion function: the distance between the projected means, normalized by the within-class scatter of the projected samples.

The objective is to maximize J(w) (the L2 Norm objective function):

J(w) = |m1 − m2|² / (S1² + S2²)
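As an illustration of this criterion, here is a minimal NumPy sketch for two synthetic 2-D classes; the data is made up, and it uses the standard closed-form maximizer of J(w), which is proportional to Sw⁻¹(μ1 − μ2):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical 2-D classes with different means (all numbers are illustrative).
shared_cov = [[1.0, 0.6], [0.6, 1.0]]
class1 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=100)
class2 = rng.multivariate_normal([3.0, 2.0], shared_cov, size=100)

m1, m2 = class1.mean(axis=0), class2.mean(axis=0)

# Within-class scatter matrices: sums of squared deviations from each class mean.
S1 = (class1 - m1).T @ (class1 - m1)
S2 = (class2 - m2).T @ (class2 - m2)
Sw = S1 + S2

# Fisher's direction: w is proportional to Sw^-1 (m1 - m2); normalize its length.
w = np.linalg.solve(Sw, m1 - m2)
w /= np.linalg.norm(w)

# Project both classes onto w and evaluate J(w) on the projected samples.
p1, p2 = class1 @ w, class2 @ w
scatter1 = np.sum((p1 - p1.mean()) ** 2)
scatter2 = np.sum((p2 - p2.mean()) ** 2)
J = (p1.mean() - p2.mean()) ** 2 / (scatter1 + scatter2)
print("w =", w, " J(w) =", J)
```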

Similarities between PCA and LDA:

  1. Both rank the new axes in the order of importance.
  • PC1 (the first new axis that PCA creates) accounts for the most variation in the data, PC2 (the second new axis) does the second-best job, and so on…
  • LD1 (the first new axis that LDA creates) accounts for the most separation between the known categories, LD2 (the second new axis) does the second-best job, and so on…

  2. Both algorithms tell us which attributes or features contribute the most to creating the new axes.

  3. LDA is like PCA — both try to reduce the dimensions.

  • PCA looks for attributes with the most variance.
  • LDA tries to maximize the separation of known categories.
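To make the contrast concrete, here is a minimal scikit-learn sketch on the iris dataset (my choice of example data); PCA ignores the labels while LDA uses them:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 4 features, 3 known classes

# PCA is unsupervised: it never sees y and keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it uses y and keeps the directions that best separate the classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # both reduce the 4 features to 2 new axes
```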

T-Distributed Stochastic Neighbour Embedding (T-SNE)

T-SNE is an unsupervised machine learning method that is used to visualize higher-dimensional data in low dimensions. T-SNE is used mainly for visualization and can bring a feature space of any dimensionality down to a 2-D (or 3-D) feature space.

Both PCA and LDA can be used for visualization and dimensionality reduction, but T-SNE is used for visualization purposes only. It is well suited for the visualization of high-dimensional datasets.

Unlike PCA and LDA, T-SNE is a non-linear technique. It does not construct a linear axis to separate the classes or to capture variance; instead, it converts the pairwise distances between points into probabilities that the points are neighbours.

Overview of working of T-SNE:

  • The algorithm starts by calculating the probability of similarity of points in the high-dimensional space and the probability of similarity of points in the corresponding low-dimensional space. The similarity of two points is calculated as the conditional probability that a point A would choose point B as its neighbour if neighbours were picked in proportion to their probability density under a Gaussian (normal) distribution centred at A. (The 'T' in T-SNE refers to the Student's t-distribution used for the low-dimensional similarities, not to a t-test.)
  • It then tries to minimize the difference between these conditional probabilities (or similarities) in the higher-dimensional and lower-dimensional spaces, so that the data points are represented faithfully in the lower-dimensional space.
  • To do this, T-SNE minimizes the sum of the Kullback-Leibler divergences over all the data points using a gradient descent method.

Note: Kullback-Leibler divergence or KL divergence is a measure of how one probability distribution diverges from a second, expected probability distribution.

D(P || Q): tells how much P diverges from Q

Relationship to Shannon Entropy

The Shannon entropy H(P) is the number of bits necessary to identify an outcome from N equally likely possibilities, less the KL divergence between the true distribution P and the uniform distribution U over those N possibilities:

H(P) = log₂ N − D(P || U)  (Shannon entropy ~ KL divergence relation)
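As a quick sanity check of this relation, here is a minimal NumPy sketch; the distribution values are made up, and natural logarithms are used (the identity holds in any base as long as it is used consistently):

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_i p_i * log(p_i / q_i), here in nats (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])   # a made-up "true" distribution over N = 3 outcomes
u = np.full(3, 1.0 / 3.0)       # the uniform distribution over the same outcomes

entropy = -float(np.sum(p * np.log(p)))   # Shannon entropy of p (in nats)

# Check the relation: H(P) = log(N) - D(P || U).
print(np.isclose(entropy, np.log(3) - kl_divergence(p, u)))  # True
```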

T-SNE gives the impression that it has classified the data by bringing it down to two dimensions, but it does not classify anything, and its output is not normally used as reduced features for a model. It is a visualizer: it shows how each class is distributed and whether there is any overlap between the classes.

In very high-dimensional space, Euclidean distances become much less informative. Points that are similar in the high-dimensional space are mapped to nearby points in the low-dimensional space (the data points of each class end up very close together).

T-SNE minimizes the sum of KL divergences over all the data points using a gradient descent method.

KL divergence objective function: C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p(j|i) log( p(j|i) / q(j|i) )

In the higher-dimensional space, the conditional probability is:

p(j|i) = exp( −||x_i − x_j||² / 2σ_i² ) / Σ_{k≠i} exp( −||x_i − x_k||² / 2σ_i² ), where σ_i is the bandwidth of the Gaussian centred at x_i.

In the lower-dimensional space, the variance of the Gaussian is fixed at 1/2 (0.5) for every point rather than tuned per point, so the conditional probability simplifies to:

q(j|i) = exp( −||y_i − y_j||² ) / Σ_{k≠i} exp( −||y_i − y_k||² )

T-SNE then adjusts the low-dimensional points so that q(j|i) matches p(j|i) as closely as possible, i.e. it maps the data points from the higher-dimensional space to the lower-dimensional space.
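To see T-SNE in practice, here is a minimal scikit-learn sketch; the digits dataset and the perplexity value are my own choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)   # 1797 images, each a 64-dimensional vector

# T-SNE embeds the 64-D points into 2-D by minimizing the KL divergence between
# the high-dimensional and low-dimensional neighbour distributions.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Colour by the true digit label only to inspect the class structure;
# T-SNE itself never uses the labels.
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5)
plt.show()
```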

That’s the basic difference between the three mentioned techniques.

Thanks for Reading. Please do share if you find this article useful :)
