Dimensionality Reduction. Untangled.

As the 12A12D series progresses, there are some miscellaneous topics to cover. One of them is dimensionality reduction, or simply, dealing with a large number of dimensions/features/variables.

Progress so far! :D

This algorithm needs no prior knowledge. It is the added spice needed to rock the world of DS.

Too many cooks spoil the broth.

What is Dimensionality Reduction?

Most of the time, we deal with datasets that have many redundant parameters which provide no significant amount of new information. Using these parameters to build our model won’t improve prediction accuracy, and may even reduce it!

One way to deal with this is to delete these parameters, but that would lead to significant data loss if there are many of them.

Hence, dimensionality reduction comes into the picture. Two common techniques are covered here:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)

Principal Component Analysis

It works by retaining as much of the variance of the data as possible.

Suppose we have N parameters in our dataset. To reduce the dimensions:

  • PCA makes N uncorrelated principal components, each of which is a linear combination of the given parameters.
    For advanced mathematics, go through this article.
  • The variance along the components follows this order:
    PCA_1 ≥ PCA_2 ≥ … ≥ PCA_N (PCA_i denotes the variance along principal component i)
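
The two points above can be checked with a minimal sketch using scikit-learn's `PCA`. The dataset here is synthetic and purely an assumption for illustration (4 parameters, two of which are noisy mixes of the other two); it is not data from this series.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 200 samples, N = 4 parameters,
# where the last two parameters are noisy linear mixes of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    0.8 * base[:, 0] + 0.1 * rng.normal(size=200),
    0.5 * base[:, 1] + 0.1 * rng.normal(size=200),
])

pca = PCA()               # keep all N components
Z = pca.fit_transform(X)  # data expressed along the principal-component axes

# Each row of components_ holds the weights of one linear combination.
print(pca.components_.shape)  # (4, 4)
# Variance along each component, sorted from largest to smallest.
print(pca.explained_variance_)
# The components are uncorrelated: off-diagonal covariances are ~0.
print(np.round(np.cov(Z, rowvar=False), 6))
```

Here the last two variances come out much smaller than the first two, because the extra parameters add almost no new information.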

How the algorithm works is easiest to see on a dataset with two dimensions.

In the graph below, we can see that the entire dataset can be plotted on two new axes: the first principal component (PCA_1) and the second principal component (PCA_2). Most of the variance lies along PCA_1 and the remaining variance lies along PCA_2.
Now, if we project all the points onto PCA_1, only the variance along PCA_2 is lost.

Dimensions get reduced with only a small loss in variance.
PCA_1 is more influential.
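
The two-dimensional case can be sketched with NumPy alone. The points below are hypothetical (stretched along a diagonal), and the principal axes are obtained from the eigenvectors of the covariance matrix, which is one common way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D dataset: points stretched along a diagonal direction.
t = rng.normal(scale=3.0, size=500)
noise = rng.normal(scale=0.5, size=500)
X = np.column_stack([t + noise, t - noise])

Xc = X - X.mean(axis=0)                    # centre the data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                        # direction of PCA_1
projected = Xc @ pc1                       # 1-D coordinates along PCA_1

retained = eigvals[0] / eigvals.sum()      # share of variance kept
lost = eigvals[1] / eigvals.sum()          # share of variance along PCA_2
print(f"variance kept on PCA_1: {retained:.3f}, lost along PCA_2: {lost:.3f}")
```

The dataset goes from two columns to one (`projected`), and the variance that disappears is exactly the variance along PCA_2.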

Implementation in Python

Sample data. :D

We see that there is no significant increase in cumulative variance beyond 33 components. So we can retain almost all of the variance of the dataset using 33 features instead of 44.
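
A minimal sketch of how such a cut-off can be chosen from the cumulative explained variance. The data here is a stand-in, not the dataset from this post: 44 synthetic features built from 10 hidden factors, and the 99% threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in data (NOT the dataset from the post): 300 samples, 44 features
# generated from 10 hidden factors plus a little noise.
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 44))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 44))

X_scaled = StandardScaler().fit_transform(X)   # same scale for every feature
pca = PCA().fit(X_scaled)

# Running total of the share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 99%.
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_keep, round(float(cumulative[n_keep - 1]), 4))
```

Plotting `cumulative` against the number of components gives the kind of curve described above: it rises quickly and then flattens out.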

  • Performing PCA on parameters with different scales heavily biases the result towards the parameters with high variance. The principal components then depend mostly on those parameters, which is undesirable. Bring all parameters to the SAME SCALE first.
  • PCA works best if the parameters have a strong linear relation with each other.
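
The scaling issue can be seen in a small sketch. The two parameters below are hypothetical: they carry nearly the same information, but one is recorded on a scale 1000 times larger. `StandardScaler` is one common way to put them on the same scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Two hypothetical, strongly related parameters on very different scales.
t = rng.normal(size=400)
small = t + 0.2 * rng.normal(size=400)
large = 1000.0 * (t + 0.2 * rng.normal(size=400))
X = np.column_stack([small, large])

# Weights of the first principal component, without and with scaling.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
X_scaled = StandardScaler().fit_transform(X)
scaled_pc1 = PCA(n_components=1).fit(X_scaled).components_[0]

print("PCA_1 weights without scaling:", np.round(raw_pc1, 3))
print("PCA_1 weights after scaling:  ", np.round(scaled_pc1, 3))
```

Without scaling, PCA_1 is almost entirely the large-scale parameter. After scaling, both parameters contribute equally.
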
PCA/LDA is easy, but mathematical!

Linear Discriminant Analysis

LDA is used for classification problems.
Unlike PCA, where our goal was to retain maximum variance, here we try to project the dataset onto a lower-dimensional space with good class separability in order to avoid over-fitting.

Consider an example of a dataset with two parameters, x1 and x2.
It is divided into two classes, as shown in the figure below.

  • Just like PCA, LDA finds a line onto which it projects the dataset. To find this line, LDA pursues two goals simultaneously.
  • First, it looks for a line on which the distance between the centers of the two classes is maximum.
  • Second, it tries to minimise the spread of each class on that line.
    In our case, spread can be calculated as the distance between the two farthest points of a class on the line.

If distance between centers of class 1 and class 2 on the line is d and spread of each class is s1 and s2 respectively, LDA tries to find a line where the fraction d² / (s1² + s2²) is maximum.
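
The fraction above can be computed for any candidate line, as in this minimal sketch. The function name `separation_score` and the two classes are hypothetical, and spread is measured as described here (distance between the two farthest projected points). Library implementations of LDA typically use the within-class variance instead.

```python
import numpy as np

def separation_score(X1, X2, direction):
    """Return d^2 / (s1^2 + s2^2) for two classes projected onto a line."""
    w = np.asarray(direction, dtype=float)
    w = w / np.linalg.norm(w)        # unit vector along the candidate line
    p1, p2 = X1 @ w, X2 @ w          # 1-D positions of each class on the line
    d = abs(p1.mean() - p2.mean())   # distance between the class centers
    s1 = p1.max() - p1.min()         # spread = distance between farthest points
    s2 = p2.max() - p2.min()
    return d**2 / (s1**2 + s2**2)

rng = np.random.default_rng(3)
# Hypothetical classes: separated along x1, widely spread along x2.
class1 = rng.normal(loc=[0.0, 0.0], scale=[0.5, 3.0], size=(100, 2))
class2 = rng.normal(loc=[4.0, 0.0], scale=[0.5, 3.0], size=(100, 2))

score_x1 = separation_score(class1, class2, [1, 0])
score_x2 = separation_score(class1, class2, [0, 1])
print("score along x1:", score_x1)
print("score along x2:", score_x2)
```

The x1 axis scores far higher than the x2 axis, so a line close to x1 is the better one to project onto.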

Projection.

Implementation in Python

The code is straightforward.

LDA is performed on the iris dataset, where 4 parameters are used to divide the data into 3 different classes. You can find more about it here.
Using LDA, the number of parameters is reduced to 2.
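
A minimal sketch of this reduction with scikit-learn follows. It may differ from the code originally shown in this post; it only assumes the copy of the iris dataset bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 150 samples, 4 parameters, 3 classes

# With 3 classes, LDA can produce at most 3 - 1 = 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)    # LDA is supervised, so it needs y

print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
```

The two columns of `X_reduced` can then be plotted against each other, coloured by `y`, to see how well the classes separate.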

Plot of different classes

PCA v/s LDA

The two methods clearly choose different lower-dimensional spaces for projecting the dataset. If we use PCA on this dataset, it will select a line similar to the decision boundary, because the variance is maximum along it.
In practice, LDA is often performed after PCA.
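
The contrast can be reproduced in a small sketch on hypothetical two-class data, where the classes differ along x1 but the data varies most along x2. Reading the LDA direction from the fitted `scalings_` attribute assumes scikit-learn's default `svd` solver.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
# Hypothetical two-class data: classes differ along x1 but vary most along x2.
class1 = rng.normal(loc=[-1.0, 0.0], scale=[0.4, 5.0], size=(200, 2))
class2 = rng.normal(loc=[1.0, 0.0], scale=[0.4, 5.0], size=(200, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 200 + [1] * 200)

# PCA ignores the labels and follows the direction of maximum variance.
pca_dir = PCA(n_components=1).fit(X).components_[0]

# LDA uses the labels and follows the direction that separates the classes.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
lda_dir = lda.scalings_[:, 0] / np.linalg.norm(lda.scalings_[:, 0])

print("PCA direction:", np.round(pca_dir, 2))  # close to the x2 axis
print("LDA direction:", np.round(lda_dir, 2))  # close to the x1 axis
```

Projecting onto the PCA direction would mix the two classes together, while projecting onto the LDA direction keeps them apart.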

Difference.

References

  1. AV blog
  2. Sklearn Decomposition module
  3. Machine Learning Mastery

Footnotes

First of all, I would request everyone to share their reviews, suggestions and comments. Thanks for following! :)

This post covered two small concepts. Hope you could follow along. Better ones await ahead. 6 to go!

Thanks for reading. :)
And, ❤ if this was a good read. Enjoy!

Co-Author: Abhinav Tripathi