Getting Started with Dimensionality Reduction

Ananya Srivastava · Published in WiCDS · 7 min read · Jan 19, 2021

The amount of data being generated today is enormous and growing exponentially. This data often comprises hundreds or even thousands of columns and is commonly referred to as Big data. It has applications in practically every field one can think of, whether healthcare, retail, education, telecom, or entertainment; the list goes on.

Although large amounts of data help machine learning models learn and thereby generalize better, adding features indiscriminately can also introduce noise that slows down the learning process. This is the Curse of Dimensionality.

The Curse of Dimensionality

The Curse of Dimensionality means that error increases as the number of features grows. As dimensionality increases, performance improves up to a threshold that indicates the optimal number of features; increasing the dimensionality further causes classifier performance to drop. Ideally, a higher number of features would enable more effective learning and better results. In practice, however, they add noise and often redundant information, which also leads to overfitting. This is where the concept of Dimensionality Reduction comes into the picture.

In simple terms, Dimensionality Reduction deals with reducing the number of input features or columns in the data to reduce the complexity of the model.

How does dimensionality reduction help?

Dimensionality reduction aims at transforming a higher dimensional feature space into a lower dimensional feature space that includes all the relevant information.

Each column in the data represents a dimension in an n-dimensional feature space. Dimensionality reduction techniques reduce the number of these columns and remove misleading or redundant information, which in turn reduces the complexity of the model and improves its performance.

We want to reduce the number of features while preserving the essence of the original data. Sometimes we can simply drop columns that we know are irrelevant to the problem we are trying to solve; when the relevance of a feature is not obvious, we turn to dimensionality reduction techniques.
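For example, a minimal pandas sketch of the "just drop it" case, using a small hypothetical DataFrame in which an identifier column is obviously irrelevant:

```python
import pandas as pd

# Hypothetical dataset: customer_id carries no predictive information
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [25, 40, 31],
    "monthly_spend": [120.0, 80.5, 200.3],
})

# Drop the column we already know is irrelevant to the problem
df_reduced = df.drop(columns=["customer_id"])
print(df_reduced.columns.tolist())  # ['age', 'monthly_spend']
```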

Breaking the Curse of Dimensionality

Dimensionality Reduction can be performed through both feature engineering and feature selection methods.

Feature engineering and Feature Selection (Source)

In Feature Selection, we identify and select relevant features based on our intuition, or let a model find the best features on its own. It includes techniques like filter, wrapper and embedded methods. You can learn more about Feature Selection here.
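As a quick taste of the filter approach, here is a hedged scikit-learn sketch (SelectKBest with an ANOVA F-test is my choice of example; the linked article covers these methods in more depth):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target and keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (150, 2) -- a subset of the original columns
```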

Feature Engineering or Feature Transformation methods generate new features by transforming the data from a higher dimensional space to a lower dimensional space. Some feature engineering approaches are PCA, LDA, t-SNE, Auto-encoders, UMAP, Matrix Factorisation, etc.

The difference between feature selection and feature engineering is that feature selection keeps a subset of the original features while feature engineering makes new ones.

In this blog, we will be diving deeper into the following four popular feature engineering techniques for dimensionality reduction:

  1. PCA
  2. LDA
  3. t-SNE
  4. Auto-encoders

PCA: Principal Component Analysis

PCA is an unsupervised learning technique that performs dimensionality reduction by maximizing the variance retained in the data. It is a projection-based method that transforms the data by projecting it onto a set of orthogonal axes.

The objective of PCA is to maximize the variance of the data when it is mapped into a lower dimensional space from a higher dimensional space. The features which result in maximum variance are the principal components.

Principal Component Analysis (Source)

These principal components are orthogonal, i.e., they are uncorrelated, and they are ranked in order of the variance they explain. The first principal component (PC1) explains the highest variance in the dataset, the second principal component (PC2) explains the second-highest variance, and so on.

The algorithm involves the following simple steps (a NumPy sketch follows the list):

  1. Standardize the higher dimensional dataset. (This is an important step as the transformation is dependent on scale and we don’t want features with a larger numeric range to dominate the new principal components)
  2. Compute the covariance matrix.
  3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
  4. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors, and select the k eigenvectors corresponding to the k largest eigenvalues, where k is the dimensionality of the new feature subspace.
  5. Construct a projection matrix from the top ‘k’ eigenvectors.
  6. Transform the input dataset using the projection matrix to obtain the new lower dimensional feature subspace.
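Here is a minimal NumPy sketch of these six steps, assuming a small toy data matrix X (purely illustrative, not a dataset from this article):

```python
import numpy as np

# Toy data: 200 samples, 5 features (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Decompose it into eigenvalues and eigenvectors (eigh: symmetric matrix)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 4. Sort eigenvalues in decreasing order and keep the top k eigenvectors
k = 2
order = np.argsort(eig_vals)[::-1]

# 5. The top-k eigenvectors form the projection matrix
W = eig_vecs[:, order[:k]]          # shape: (5, k)

# 6. Project the data onto the new k-dimensional subspace
X_pca = X_std @ W                   # shape: (200, k)
print(X_pca.shape)
```

In practice you would usually reach for scikit-learn's PCA class, which wraps these steps and also reports the explained variance of each component.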

PCA is most suitable when variables share a linear relationship. It becomes hard to interpret principal components when we have a large corpus of data.

LDA: Linear Discriminant Analysis

LDA is a supervised technique that maximizes the separability between classes. It is suitable for data that lies on a linear subspace.

Good Projection separates the classes well and bad projection causes undesirable overlapping of classes (Source)

PCA, as we discussed, tries to find the axes that maximize the variance, which in the figure corresponds to the “Bad Projection”. This shows that a feature having high variance does not mean it will be predictive of the classes. The “Good Projection”, on the other hand, gives the best separation between the classes. LDA considers projections like both of these but ultimately selects the one with maximum class separability.

Similar to PCA, LDA is also dependent on scale and therefore, it is always better to standardize our dataset before performing the transformation. LDA works better than PCA for multi-class classification and can be used for comparatively larger datasets.
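As an illustration, a short scikit-learn sketch on the Iris dataset (my choice of example, not one from the article), standardizing first as suggested above:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA is supervised, so it needs the class labels y
X, y = load_iris(return_X_y=True)

# Standardize the features, since LDA is also scale-sensitive
X_std = StandardScaler().fit_transform(X)

# With 3 classes, LDA can produce at most 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_std, y)

print(X_lda.shape)  # (150, 2)
```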

t-SNE: t-distributed Stochastic Neighbor Embedding

t-SNE is another unsupervised algorithm for performing dimensionality reduction. Why do we need t-SNE when we already have PCA?

As we know, PCA seeks to maximize variance and preserves large pairwise distances, which means that points that are dissimilar end up far apart. This leads to poor visualizations when the data is non-linear. This is where t-SNE comes in: it preserves small pairwise distances, or local similarities, and is therefore well suited for non-linear data.

t-SNE vs PCA for non-linear dimensionality reduction (Source)

The algorithm models the probability distribution of the neighbors around each point, where the neighbors are its closest set of points. It first creates a probability distribution over pairs of points in the higher dimensional space such that similar objects are assigned a higher probability and dissimilar objects a lower one. Then, t-SNE tries to replicate the same probability distribution in the lower dimensional space. This process is carried out iteratively until the difference between the two distributions (measured by the Kullback-Leibler divergence) is minimized.
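For a concrete feel, here is a small hedged sketch with scikit-learn's TSNE on the digits dataset (my choice of example); the perplexity parameter roughly controls how many neighbors are treated as "similar":

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images: a classic non-linear example
X, y = load_digits(return_X_y=True)

# Embed into 2-D while preserving local neighborhood structure
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (1797, 2)
```

Note that t-SNE is intended mainly for visualization rather than as a general-purpose feature transform: it has no transform method for new, unseen data.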

AutoEncoders

AutoEncoders are unsupervised artificial neural networks that are known to give amazing results when used for dimensionality reduction. They compress the data to lower dimensions and then reconstruct the input again. The bottleneck hidden layer, whose output is also known as the latent vector, is what allows AutoEncoders to perform dimensionality reduction.

This network is fully connected although all connections are not shown. (Source)

AutoEncoders consist of two parts:

  1. Encoder: It maps the input space into a lower dimension latent space.
  2. Decoder: It takes the data from the latent space and maps it back to the reconstruction space, where the dimensionality of the output matches the dimensionality of the input.

AutoEncoders are trained by minimizing the reconstruction error and are capable of modeling complex relationships between features. Thus, by selecting an appropriate activation function, they can be used for both linear and non-linear mappings of the data. They perform well when we have a large corpus of data.
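To make the encoder/decoder split concrete, here is a minimal Keras sketch (my own illustrative architecture and layer sizes, assuming TensorFlow is installed), trained by minimizing the reconstruction error:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 64    # e.g. flattened 8x8 images (illustrative)
latent_dim = 2    # size of the bottleneck (latent vector)

# Encoder: maps the input space into the lower-dimensional latent space
inputs = keras.Input(shape=(input_dim,))
hidden = layers.Dense(32, activation="relu")(inputs)
latent = layers.Dense(latent_dim, activation="relu")(hidden)

# Decoder: maps the latent vector back to the original input dimensionality
hidden_dec = layers.Dense(32, activation="relu")(latent)
outputs = layers.Dense(input_dim, activation="sigmoid")(hidden_dec)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # reconstruction error

# Train the network to reproduce its own input (random data as a stand-in)
X = np.random.rand(1000, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

# The encoder alone performs the dimensionality reduction
encoder = keras.Model(inputs, latent)
X_reduced = encoder.predict(X)
print(X_reduced.shape)  # (1000, 2)
```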

Dimensionality Reduction makes it easier to analyze and visualize data. However, we must also note that reducing the dimensionality involves a trade-off between the performance of the model and its computational efficiency.

Thanks for the read. I welcome your feedback, comments and recommendations. I’ll be posting more beginner friendly blogs in future. You can reach out to me on LinkedIn Ananya Srivastava. Happy Learning!

