A Complete Guide On Dimensionality Reduction

Chaitanyanarava · Published in Analytics Vidhya · 22 min read · Apr 18, 2020

Hi folks! The essence of this article is to give an intuition for, and a complete guide to, dimensionality reduction in Python. I hope you enjoy reading it as much as I enjoyed writing it.

Did you know that internet users generate around 2.5 quintillion bytes of data per day? This data can come from anywhere, for example:

  • Social apps like Facebook and Instagram, which collect data on what you like, share and post, the places you visited recently, the restaurants you like, etc.
  • Online platforms like YouTube and Google, which save data related to your recent searches, interests, views on posts, etc.
  • Shopping apps like Myntra, Amazon and Flipkart, which collect data on what you have purchased, your views and clicks on a product, products you kept in your cart, etc.

Internet-based applications generate a tremendous amount of data daily, and they are constantly involved in collecting, refining and extracting it. However, as data generation and extraction keep increasing, visualizing the data, controlling it and drawing patterns out of it slips out of our hands.

For example, suppose we have 1,000 features and 1 million data points. If we store it in a matrix, that is 1,000 * 1 million = 1,000 million cells, which is daunting, especially when we have memory constraints. It is good that we have collected a huge amount of data to tackle our problems, but handling such data and applying models to it is not that easy. That is where dimensionality reduction comes into play. So, without wasting time, let's get deep into the concept.

Representation of data in 4 dimensions

Table Of Contents:

Chapter-1 : Introduction to Dimensionality Reduction

Chapter-2 : Principal Component Analysis

  • Step_2–1: Introduction and need of PCA.
  • Step_2–2: Working methodology.
  • Step_2–3: Advantages and Disadvantages.
  • Step_2–4: PCA on MNIST dataset through python code snippets.

Chapter-3 : Linear Discriminant Analysis

  • Step_3–1: Introduction
  • Step_3–2: Working of LDA.
  • Step_3–3: Extensions of LDA
  • Step_3–4: Python Sklearn implementation of LDA On IRIS dataset

Chapter-4 : T-Distributed Stochastic Neighborhood Embedding

  • Step_4–1: Introduction to T-SNE.
  • Step_4–2: Working methodology.
  • Step_4–3: Handling Crowding Problem.
  • Step_4–4: Advantages and Disadvantages.
  • Step_4–5: T-SNE on MNIST dataset through python code snippets.

Chapter-5 : Data Normalization

Chapter-6 : Conclusions

Chapter-1: Introduction to Dimensionality Reduction:

What is dimensionality reduction?

In statistics, machine learning, and information theory, dimensionality reduction (or dimension reduction) is the process of reducing data from n dimensions to k dimensions, where k << n.

Dimensionality Reduction of Data.

Visualization of Data:

Data visualization makes understanding easier and increases effectiveness. The human mind learns faster from visuals than from text and tables. This applies to most people: for example, one can remember dialogues and scenes from a movie watched years ago, yet find it difficult to recall a text read only recently.

Nowadays we have a good number of data visualization tools, which are fast and effective. Data visualization creates a better selling strategy and boosts our ability to process information, making it easier and faster to compare things and draw conclusions.

Let’s have a look at the data of various dimensions.

1-Dimensional Data:

Here, we often call the dimensions features. As an example, we take a 1D array and plot its values on a number line.

1D representation of data.

2-Dimensional Data:

Now we take a 2D array and plot the data on the X and Y axes, which are orthogonal to each other.

2D representation of data.

3-Dimensional Data:

Now we work with a 3D array and plot it on the X, Y and Z axes.

3D representation of data.
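As a rough illustration (these are not the original figures), here is a minimal matplotlib sketch that produces 1D, 2D and 3D scatter plots of synthetic random data:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed only for older matplotlib)

rng = np.random.default_rng(42)
fig = plt.figure(figsize=(12, 4))

# 1D: values on a number line
x1 = rng.normal(size=50)
ax = fig.add_subplot(1, 3, 1)
ax.scatter(x1, np.zeros_like(x1))
ax.set_title("1D data")

# 2D: points on orthogonal X and Y axes
x2 = rng.normal(size=(50, 2))
ax = fig.add_subplot(1, 3, 2)
ax.scatter(x2[:, 0], x2[:, 1])
ax.set_title("2D data")

# 3D: points on X, Y and Z axes
x3 = rng.normal(size=(50, 3))
ax = fig.add_subplot(1, 3, 3, projection="3d")
ax.scatter(x3[:, 0], x3[:, 1], x3[:, 2])
ax.set_title("3D data")

plt.show()
```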

We can see that as the number of dimensions increases, visualizing the data becomes more difficult for us.

n-Dimensional Data:

For n-dimensional data we would need n axes, which we can no longer visualize. So, to visualize any data with more than 3 dimensions, we reduce it to 2 or 3 dimensions using a technique called dimensionality reduction.

Essence of Dimensionality Reduction:

It is not feasible to analyze each and every dimension of high-dimensional data at a microscopic level. Any meaningful analysis might take days or months, demanding a lot of time, money and manpower, which businesses rarely encourage. Training a model on high-dimensional data leads to problems such as:

  • The space required to store the data increases with the number of dimensions.
  • Training time grows with the number of dimensions; fewer dimensions mean lower time complexity.
  • As the number of dimensions increases, so does the possibility of overfitting the model.
  • We cannot visualize high-dimensional data directly; with dimensionality reduction we can reduce it to 2D or 3D for better visualization.
  • Dimensionality reduction also removes correlated features from the data.

Components of Dimensionality Reduction:

There are two major components of dimensionality reduction which will be discussed in detail here.

I) Feature Selection:

Most of the time, not all features are relevant to our problem. For example, suppose we are training a model to predict people's heights and we have data with the features (weight, color, moles, marital status, gender). We can see that features like color, moles and marital status have no link with people's heights, i.e., they are irrelevant to our problem. Hence we need a way of finding the features that are most useful for our task. We can achieve this in the following ways:

  • Business understanding, domain knowledge and expert advice can help us choose the predictors (features) that impact the response variable (target). However, there is a chance of losing information if we cannot find useful predictors or if we miss useful features.
  • We can build a classical ML model and select features based on their correlation with the target variable. A feature with a high degree of fit is more likely to be chosen than a feature with a low degree of fit.
  • Dimensionality reduction itself can help tackle this problem, for example PCA (discussed in a later section), which finds components that project the data into lower dimensions with little loss of information.
  • Another way is to remove all the correlated features. For example, if a feature is a linear combination of other features (f1 = 2*f2 + 3*f3), it adds no additional information to our data. Such features are therefore not useful for model training.

Feature selection means finding a subset of the original features such that there is minimal loss of information. It has the following three strategies (a short sklearn sketch of each follows the list):

  1. Filter Strategy: select features using statistics computed from the data itself (e.g., their correlation or information gain with the target).
  2. Wrapper Strategy: select features based on the accuracy of a model trained on candidate feature subsets.
  3. Embedded Strategy: let the model itself decide, based on its prediction errors or learned importances, whether to keep or remove the selected features.
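A minimal sklearn sketch of the three strategies on a toy dataset (the estimators and parameters here are illustrative choices, not prescribed by the article):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Filter: score each feature against the target (here, an ANOVA F-test) and keep the best k
filtered = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Wrapper: recursively drop features based on how a model performs without them
wrapped = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

# Embedded: let the model's own feature importances decide which features to keep
embedded = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)).fit(X, y)

print(filtered.get_support(), wrapped.get_support(), embedded.get_support())
```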

II) Feature Projection:

Feature projection, also known as feature extraction, transforms data from a high-dimensional space to a low-dimensional space. The transformation can be either linear or non-linear.

For linear transformations we have Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), and for non-linear transformations we apply T-SNE.

Chapter-2 : Principal Component Analysis:

Before getting into the content, let's appreciate the mathematicians behind this technique.

Step_2–1: Introduction and need of PCA:

PCA was invented in 1901 by Karl Pearson as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete KLT in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition, singular value decomposition (SVD) of X , eigenvalue decomposition, factor analysis , spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics.

PCA is mostly used as a tool in exploratory data analysis (EDA) and for making predictive models. It is often used to visualize genetic distance and relatedness between populations. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix.

Step_2–2: Working methodology of PCA:

To better understand the working principle of PCA, let's take 2D data.

2D representation of data.
  1. First we normalize the data so that the average value shifts to the origin and all the data lies within a unit square.
  2. Now we try to fit a line to the data. We start with a random line and rotate it until it best fits the data.

Ultimately we end up with the following best fit, which explains the maximum variance in the data.

Best fitting line.

How does PCA find the best fitting line?

Let’s work out with a single point.

Now, to quantify how well the line fits the data, PCA projects the data onto it.

i) It can then measure the distances from the data points to the line and try to find the line that minimizes those distances,

ii) or it can try to find the line that maximizes the distances from the projected points to the origin.

Mathematical Intuition :

To understand the math behind this technique, let's go back to our single data point.

After projecting the data point onto the line we get a right-angled triangle. From the Pythagorean theorem, A² = B² + C².

Since A is fixed, B and C trade off against each other: if B gets larger then C must get smaller, and vice versa.

Thus PCA can either minimize the distances to the line or maximize the distances from the projected point to the origin.

It is easier to maximize the distance from the projected point to the origin. Hence PCA finds the best fitting line by maximizing the sum of squared distances from the projected points to the origin.

maximum sum of squares of distances.
cost function of pca

Note: we take the squares of the distances so that negative values do not cancel out positive values.
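For reference, the objective sketched in the figure can be written as follows; this is the standard PCA formulation, not reproduced from the original image:

$$
\max_{\|u\| = 1} \; \sum_{i=1}^{n} \left( x_i^{\top} u \right)^2
$$

where the $x_i$ are the centered data points and the maximizing unit vector $u$ is PC1.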

Now we have the best fitting line y = mx + c. This is called PC1 (Principal Component 1). Let's assume its slope is in the proportion 4:1, meaning we go 4 units along the X-axis for every 1 unit along the Y-axis, which tells us the data is mostly spread along the X-axis.

From the Pythagorean theorem:

a² = b² + c² => a² = 4² + 1² = 17 => a = sqrt(17) ≈ 4.12. Since we want a unit vector, we divide each component by 4.12, i.e.,

F1 = 4 / 4.12 = 0.97 and

F2 = 1 / 4.12 = 0.242

The unit vector we just calculated is called the eigenvector of PC1, and the proportions of the features (0.97 : 0.242) are called the loading scores.

SS(distances for PC1) = eigenvalue of PC1.

sqrt(eigenvalue of PC1) = singular value of PC1.
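A quick numpy check of this arithmetic (illustrative only):

```python
import numpy as np

slope_vector = np.array([4.0, 1.0])       # 4 units along X, 1 unit along Y
length = np.linalg.norm(slope_vector)     # sqrt(4**2 + 1**2) = sqrt(17) ≈ 4.123
pc1 = slope_vector / length               # unit vector, i.e. the eigenvector for PC1
print(length, pc1)                        # ~4.123, [0.970, 0.243]
```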

Now we do the same for the remaining directions to get the other principal components. To project the data, we then rotate the axes so that PC1 becomes parallel (horizontal) to the X-axis.

Rotating the axis so that PC1 gets horizontal
Projecting the data based on the principal components.

For visualization, let's plot the data based on its projected points on both principal components.

We can see that it is equal to the original projection of points.

How do we calculate the variation?

We can calculate it using the eigenvalues computed by PCA.

Suppose we get a variation of 0.83 for PC1 and 0.17 for PC2.

Now, if we want to reduce the data from 2D to 1D, we keep PC1 as the final dimension, since it covers 83% of the total variation.

This is how PCA works: based on the variance captured by the principal components, it determines which dimensions can be eliminated for dimensionality reduction.

Step_2–3: Advantages and Disadvantages:

Advantages :

  • It removes correlated features.
  • Improves model efficiency.
  • Reduces overfitting.
  • Improves Visualization.

Disadvantages :

  • PCA is a linear algorithm, so it does not work very well for polynomial or other complex functions. Kernel PCA can be used for such data.
  • After PCA we may lose a lot of information if we do not choose the right number of dimensions to eliminate.
  • Less interpretability, since the original features are transformed into principal components, which are not as readable as the original features.
  • It preserves global structure rather than local structure.

Step_2–4: PCA on MNIST dataset through python code snippets:

I have downloaded the data (train.csv) from Kaggle; it contains the 784 pixel values of each image in .csv format.

loading mnist data
dimensional representation of a mnist image
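The snippet above appears as an image in the original post; a minimal sketch of the loading step might look like this, assuming the Kaggle digit-recognizer layout of train.csv (a 'label' column plus 784 pixel columns):

```python
import pandas as pd

# Kaggle digit-recognizer layout assumed: 'label' column + 784 pixel columns
data = pd.read_csv("train.csv")
labels = data["label"]
pixels = data.drop("label", axis=1)

print(pixels.shape)  # e.g. (42000, 784)

# reshape one row back into a 28x28 image to see the dimensional representation
image = pixels.iloc[0].values.reshape(28, 28)
```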

As a preprocessing step we standardize the data so that the mean shifts to the origin and all the data lies within a unit square.

Data Normalization
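The standardization snippet is also an image; a minimal sketch, continuing from the loading code above:

```python
from sklearn.preprocessing import StandardScaler

# centre each pixel column at 0 with unit variance
standardized_data = StandardScaler().fit_transform(pixels)
print(standardized_data.shape)
```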

PCA can be applied in two ways: by finding the eigenvectors ourselves, or by using the sklearn implementation. In most cases both implementations give similar results.

Method-1:

Here we find the covariance matrix, which is in turn used to find the eigenvalues and eigenvectors.

After transforming the data to 2 dimensions, we use those 2 components to visualize the data.

Principal Component Analysis through eigen vectors
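Since this snippet is an image in the original, here is a hedged reconstruction of the eigenvector approach, assuming standardized_data and labels from the earlier snippets:

```python
import numpy as np
import matplotlib.pyplot as plt

# covariance matrix of the standardized data (784 x 784)
cov_matrix = np.cov(standardized_data, rowvar=False)

# eigen decomposition; np.linalg.eigh returns eigenvalues in ascending order
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)

# take the two eigenvectors with the largest eigenvalues
top_two = eigen_vectors[:, -2:]

# project the data onto the top two principal components
projected = standardized_data.dot(top_two)

# visualize, colouring points by their digit label
plt.scatter(projected[:, 0], projected[:, 1], c=labels, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```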

Method-2:

Now we will use the sklearn implementation of PCA, which takes just a few lines.

Sklearn implementation of PCA
Principal Component Analysis through sklearn
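A minimal sketch of the sklearn version, again assuming standardized_data and labels from the snippets above:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
pca_data = pca.fit_transform(standardized_data)

plt.scatter(pca_data[:, 0], pca_data[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="digit")
plt.show()
```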

Let’s see the percentage of variance explained by each feature.

Percentage of variance explained by each feature
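A sketch of how such an explained-variance plot could be produced (variable names assumed from the snippets above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA()  # keep all components
pca.fit(standardized_data)

cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(cumulative)
plt.axhline(y=0.80, color="r", linestyle="--")  # the ~80% variance mark
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()

# smallest number of components that covers 80% of the variance
print(np.argmax(cumulative >= 0.80) + 1)
```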

If I want to preserve 80% of the information in the data, I can reduce the number of dimensions to about 110. This plot helps us choose the right number of components to keep.

Chapter-3 : Linear Discriminant Analysis(LDA):

Step_3–1: Introduction:

LDA is most commonly used as a dimensionality reduction technique in the preprocessing step of machine learning, statistics and pattern recognition pipelines. The goal of the algorithm is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting and to reduce computational cost.

The original discriminant analysis was developed by Ronald Fisher in 1936 for the two-class classification problem; it was later generalized to the multi-class case by C. R. Rao in 1948 and is now called Linear Discriminant Analysis, Normal Discriminant Analysis or Discriminant Function Analysis.

Step_3–2: Working of LDA:

Both PCA and LDA are linear reduction techniques, but unlike PCA, LDA focuses on maximizing the separability between groups.

LDA uses the features to create a new axis and projects the data onto that axis in a way that maximizes the separation of the two categories or groups. This is why LDA is a supervised learning algorithm: it makes use of the target values to find the new axis.

PCA tries to find the components that maximize the variance, while LDA tries to find a new axis that

i) maximizes the separability of the categories, and

ii) minimizes the variance within each category.

By minimizing the within-class variance we keep the clusters of the individual groups tight, which is just as important as maximizing the distance between the group means.

LDA therefore finds the new axis by maximizing the following criterion:

cost function of LDA
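For reference, the two-class criterion this figure usually shows is the Fisher criterion (a standard formulation, not reproduced from the original image):

$$
J(w) = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2} = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w}
$$

where $\mu_1, \mu_2$ are the projected class means, $s_1^2, s_2^2$ their projected within-class scatters, and $S_B$, $S_W$ the between-class and within-class scatter matrices.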

LDA for more than 2 categories:

Consider data that has more than 2 groups. In such cases LDA finds the mean of the whole data and the center (mean) of each individual group, and then tries to maximize the distance between the overall mean and each group mean. For a better understanding, look at the following data with 3 categories.

Now we can find a plane that best separates the three groups.

Algorithm:

  1. Compute the d-dimensional mean vectors for the different classes in the dataset.
  2. Compute the scatter matrices (between-class and within-class scatter matrices).
  3. Compute the eigenvectors (e1, e2, …, ed) and corresponding eigenvalues (λ1, λ2, …, λd) of the scatter matrices.
  4. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k matrix W (where every column represents an eigenvector).
  5. Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication Y = X × W (where X is the n×d matrix of n samples, and Y contains the transformed n×k samples in the new subspace). A compact numpy sketch of these steps follows the list.
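The following is an illustrative numpy sketch of these five steps, assuming a labelled dataset X (n×d numpy array) and y (numpy array of class labels); it is not the exact code behind the figures:

```python
import numpy as np

def lda_transform(X, y, k):
    """Project the n x d data X with labels y onto the top-k LDA directions."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]

    # Steps 1-2: per-class means, within-class (S_W) and between-class (S_B) scatter matrices
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)

    # Step 3: eigenvectors and eigenvalues of inv(S_W) @ S_B
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)

    # Step 4: sort by decreasing eigenvalue and keep the top k columns as the d x k matrix W
    order = np.argsort(eig_vals.real)[::-1]
    W = eig_vecs[:, order[:k]].real

    # Step 5: Y = X @ W gives the transformed n x k samples
    return X @ W
```

For example, `lda_transform(X, y, 2)` on the IRIS data would give the same kind of 2D projection as the sklearn version shown later.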
Step_3–3: Extensions of LDA:

Linear Discriminant Analysis fails when the means of the distributions are shared (groups with high variance), because it then becomes impossible for LDA to find a new axis that makes the classes linearly separable. LDA also fails when the data is not linearly separable. In such cases, we can use non-linear discriminant analysis:

  1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).
  2. Flexible Discriminant Analysis (FDA): where non-linear combinations of inputs are used, such as splines.
  3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
Step_3–4: Python Sklearn implementation of LDA on the IRIS dataset:

Let's take the IRIS dataset and use LDA as a dimensionality reduction technique.

Importing IRIS dataset.

Like PCA, LDA can also be implemented using sklearn. We have reduced the data from 4 dimensions to 2 dimensions using LDA.

LDA on IRIS data set.
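Since the snippets above appear as images in the original, here is a minimal sklearn sketch of the same idea:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)

# reduce from 4 dimensions to 2 using the class labels
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap="viridis")
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.show()
```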

To see the difference between how PCA and LDA work, look at the following plot, where PCA tries to maximize the variance while LDA tries to maximize the separability of the three categories.

PCA vs LDA.

We can see the difference between the two plots. With PCA there is some overlap in the data and it is difficult to find a line separating the groups. LDA separates the three groups well, since there is much less overlap.

Chapter-4 : T-Distributed Stochastic Neighborhood Embedding(T-SNE):

Step_4–1: Introduction to T-SNE:

T-SNE is a machine learning algorithm, often used for visualization, developed by Laurens van der Maaten and Geoffrey Hinton (the godfather of deep learning). It is a non-linear dimensionality reduction technique well suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

T-SNE has been used for visualization in a wide range of applications, including computer security research, music analysis, cancer research, bioinformatics, and biomedical signal processing. It is often used to visualize the high-level representations learned by an artificial neural network.

Step_4–2: Working methodology:

Before going into the mathematical intuition, let's learn some of the terminology involved in T-SNE.

Neighborhood:

The neighborhood of a point is defined as the cluster of points that are geometrically close to it:

N(xi) = { xj : xj is close to xi }

Embedding:

Embedding is the process of projecting a point xi from the high-dimensional space into the low-dimensional space by creating a corresponding point xi′.

Stochastic:

First, T-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked while dissimilar points have an extremely small probability of being picked. That is why T-SNE is called Stochastic (probabilistic).

Now T-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. Note that while the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this should be changed as appropriate.

So now the cost function of T-SNE is

cost function of T-SNE
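For reference, the KL-divergence cost that T-SNE minimizes is commonly written as (standard formulation, not reproduced from the original image):

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

where $p_{ij}$ are the pairwise similarities in the high-dimensional space and $q_{ij}$ the similarities in the low-dimensional map.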

Step_4–3: Handling Crowding Problem:

Crowding is the situation where we try to project points into a smaller space and everything gets squashed together because there is not enough room. For example, think of a city bus where all the seats are filled yet the conductor keeps letting people on: that is when crowding happens.

T-Distribution:

Gaussian distribution

With a Gaussian distribution, all the low-similarity values fall in the tail region of the curve, where there is very little space, so dissimilar points from different clusters end up mixed together. The t-distribution, in contrast, has heavier tails, giving these points more room to stay separated (think of the bus being extended with a few more seats). To overcome this problem, the traditional SNE algorithm was replaced with T-SNE, where the T indicates the t-distribution.

Gaussian vs T-distribution

Step_4–4: Advantages and Disadvantages:

Advantages:

  • Unlike PCA, T-SNE is a non-linear reduction technique, which means it works well with polynomial or otherwise non-linear data.
  • T-SNE is able to preserve local as well as global structure, while PCA projects high-dimensional data onto the directions that explain most of the variance, so it only cares about global structure.
  • T-SNE is widely used for visualization tasks.

Disadvantages:

  • T-SNE has quadratic time and space complexity, because it computes the similarity between every pair of points in the data (on the order of O(N²)). This is huge, especially when we have time and memory constraints.
  • It is not recommended to use T-SNE on large datasets.
  • T-SNE is a non-parametric mapping method, meaning it has no explicit function that maps a given point into the low-dimensional space; it embeds points based on their neighborhoods. So when a new test point arrives, which was not present before, we would need to train the whole T-SNE algorithm again to embed it, which is rarely done because of its quadratic time complexity.

Barnes-Hut SNE (BHTSNE):

This technique was introduced in 2014 and is very similar to T-SNE, with minor changes. It makes use of the Barnes-Hut algorithm, which astronomers use in N-body simulations to approximate the forces between bodies.

This leads to substantial computational advantages over standard T-SNE, with O(N log N) complexity, which is far better than quadratic. It can be used easily through sklearn's manifold.TSNE by setting the method='barnes_hut' parameter.

Step_4–5: T-SNE on MNIST dataset through python code snippets:

We have seen how PCA works on MNIST. Now let's try T-SNE on the same dataset. Unlike PCA, T-SNE has two important parameters: perplexity and n_iter. We will fit the data with different values of these parameters, as sketched below.
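The runs below were shown as images in the original; a minimal sketch of one such run, assuming the standardized MNIST data and labels from the PCA chapter (note that n_iter may be called max_iter in newer scikit-learn releases):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# use a 5000-point subset; full MNIST is too slow for vanilla t-SNE
subset = standardized_data[:5000]
subset_labels = labels[:5000]

tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=0)
tsne_data = tsne.fit_transform(subset)

plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=subset_labels, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```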

with perplexity = 30 and n_iter = 1000:

TSNE with perplexity =30 and maxiter=1000

with perplexity = 50 and n_iter = 1000:

TSNE with perplexity = 50 and maxiter =1000

with perplexity = 2 and n_iter =1000

TSNE with perplexity =2 and maxiter =1000

We can see that as the perplexity decreases, the clusters get mixed up with one another. Hence it is always important to choose the right parameter values.

with perplexity = 100 and n_iter =1000:

TSNE with perplexity =100 and maxiter=1000

Of all the values tried, perplexity = 100 does well on our data. Note, however, that we ran T-SNE on only 5,000 data points, since it takes a lot of time because of its time complexity.

Chapter-5 : Data Normalization:

Data normalization transforms the data from an unbounded range into a bounded one. The main purpose of data standardization is to make your data consistent and clear: consistent in the sense that the output is reliable, so that related data can be identified using common terminology and format.


Need of Data Normalization before dimensionality reduction:

Let's say there are two features: one whose values lie in the range 1 to 10 (number of buyers in a marketplace per hour), and another whose values lie in the range 50 to 1000 (number of visitors to the marketplace). The number of visitors per hour is typically much larger than the number of buyers per hour.

Since techniques like PCA are based on maximizing variance, if we do not apply normalization before using PCA to find the eigenvectors, they will focus mostly on the dimension with the larger values, and hence will fail to capture the information present in the other dimensions.

Hence feature normalization is used to get rid of scales like kg, cm, mm, litres, etc. After normalization, when we plot the data it all lies within a unit square.

data normalization

Techniques of data normalization:

Data can be transformed in many ways. Some of the most commonly used techniques are:

  1. Linear Scaling or min-max Scaler:
X' = (X - Xmin) / (Xmax - Xmin)

Where X = data point

Xmin = minimum data point

Xmax = maximum data point

We use this technique when we know the approximate upper and lower bounds of our data, with few or no outliers, and when the data is approximately uniformly distributed across that range. After rescaling, all the values lie in the range [0, 1].

2. Feature Clipping:

When the data contains extreme outliers, this technique caps feature values above or below a certain threshold at a fixed value. For example, we may clip all height values beyond 120 cm to exactly 120 cm. In other words, we squash the values into a fixed range.

3. Log Scaling:

X' = log(X)

Log scaling is used when the data follows a power law or Pareto distribution, i.e., a few values have many data points while the rest have very few. For example, take two movies, one a hit and the other a flop: the hit movie will receive far more ratings than the flop.

4. Z-Score or Standardization:

X' = (X - μ) / σ, where μ is the feature mean and σ its standard deviation

After standardization the mean becomes 0 and the standard deviation becomes 1. It is useful when there are a few outliers, but not so extreme that clipping is needed.
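To make the four techniques concrete, here is a small illustrative sketch on toy data (not from the original article):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [1000.0]])  # toy feature with one extreme outlier

# 1. Linear scaling (min-max): maps values into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# 2. Feature clipping: cap extreme values at a fixed threshold
X_clipped = np.clip(X, a_min=None, a_max=120.0)

# 3. Log scaling: useful for power-law / Pareto-like distributions
X_log = np.log(X)

# 4. Z-score standardization: zero mean, unit standard deviation
X_zscore = StandardScaler().fit_transform(X)
```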

Chapter-6: Conclusions:

Dealing with thousands and millions of features is a must-have skill for any data scientist. The amount of data we are generating each day is unprecedented and we need to find different ways to figure out how to use it. Dimensionality reduction is a very useful way to do this and has worked wonders for me, both in a professional setting as well as in machine learning hackathons.

Also, dimensionality reduction may not work well in every situation. In such cases we need to think outside the box to transform the data into a lower-dimensional space, and it is important to try different reduction algorithms on the same data.

That's all for today. I hope you have got a basic idea of using dimensionality reduction in real-world scenarios.

References:

I will update this page as soon as I find new reduction algorithms. My aspiration to become a data scientist brought me here, and this article is my final work on this concept. Thank you for reading my article.

You can check the .ipynb with the full code for this article in my Github repository. You can also check out my first case study, on the price suggestion challenge on Kaggle.

Follow me for more such articles and implementations of different real-world case studies in data science! You can also connect with me through LinkedIn and Github.

I hope you enjoyed reading my article. There is no end to learning in this area, so happy learning! Signing off, bye :)
