PCA for Dimensionality Reduction and Visualization: An Intuitive Explanation

Abraham Vensaslas
9 min read · Jul 28, 2023

Hello, and thank you for reading my blog. In this comprehensive guide, I will share my knowledge and experience on one of the most essential and widely discussed topics in data science: PCA for Dimensionality Reduction and Visualization.

What is Dimensionality Reduction?

Dimensionality reduction is a fundamental technique used to simplify complex datasets by reducing the number of features or variables while retaining essential information. This technique is particularly useful for high-dimensional datasets where a large number of features can cause problems such as overfitting, computational inefficiencies, and difficulties in visualization. By reducing the number of features, dimensionality reduction can help to improve the accuracy of machine learning models and speed up computation time. Moreover, it can also aid in identifying the most important features in a dataset, which can help to improve interpretability and provide insights into the underlying relationships between the variables.

Essentially, the goal of dimensionality reduction is to transform high-dimensional data into a lower-dimensional space, while preserving the underlying structure of the data. By doing so, we can extract the most meaningful and relevant features of the data, which can help improve performance, reduce computation time, and simplify the data analysis process.

Humans can easily visualize data when it is presented in 2D or 3D using simple scatter plots. But when the dimensionality of the data increases, it becomes difficult to visualize and comprehend the information. For instance, when dealing with 4D to 6D data, we can leverage pair plots to gain a sense of the underlying patterns. When dealing with even higher dimensions, these techniques become insufficient.

So, we need a technique that enables us to transform n-dimensional data points into an m-dimensional space, where m < n, making the data easier to handle and interpret. Popular dimensionality reduction techniques include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and UMAP.

Where will it be useful?

Consider the example of the MNIST dataset, which consists of 784 dimensions. Do you think it’s possible to visualize it with human eyes?

If someone claims so, they are likely misleading you.

But with dimensionality reduction techniques, we can transform the dataset into 2 or 3 dimensions, making it easier to visualize and interpret.

Reducing dimensionality is not just limited to visualization purposes. There are two key areas where dimensionality reduction is extremely beneficial. First, when dealing with high-dimensional data, we may encounter the “curse of dimensionality,” where the number of data points required for accurate modeling increases exponentially with each additional dimension. By reducing the number of dimensions, we can mitigate this problem and enable more efficient and accurate modeling.

Second, in machine learning, reducing dimensionality can improve the performance of models by reducing noise. This is especially true for models that are prone to overfitting, where high dimensionality can lead to fitting noise in the training data and poor generalization to new data.

Curse of Dimensionality:

As the dimensionality of a dataset increases, the distances between data points also increase, making distance calculations less meaningful and potentially leading to decreased model generalization. High dimensionality can also cause models to overfit, and it affects different models to varying degrees.

Distance-based algorithms, such as K-Nearest Neighbors (KNN) and K-Means, are particularly susceptible to issues arising from high dimensionality, and other models such as Decision Trees can also suffer. While one solution to this problem is to acquire more data, this is not always feasible or practical. As such, reducing the dimensionality of the data can be a useful practice to mitigate these issues, leading to faster computation and improved model performance.

As the number of dimensions increases, the sense of the relative distances and “neighborhood” vanishes.
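
To make this concrete, here is a small sketch (my own illustration, not part of the original discussion) that samples random points and measures how the gap between a query point's nearest and farthest neighbors shrinks as the dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(42)

# For a few dimensionalities, sample random points and compare the
# nearest and farthest neighbor distances of a random query point.
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))       # 1000 uniform points in [0, 1]^d
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    ratio = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative gap between farthest and nearest: {ratio:.3f}")
```

As d increases, the ratio shrinks toward zero: every point becomes roughly equally far from every other point, which is exactly why distance-based reasoning breaks down.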

All of the phenomena below can occur because of high-dimensional data:

  • Exponentially increasing computational complexity and processing time.
  • Difficulty in visualizing and understanding the data.
  • Higher risk of overfitting and loss of information due to sparsity.
  • Reduced accuracy and performance, and increased noise/outliers in the dataset.
  • Limited interpretability of results and difficulty in drawing meaningful conclusions.
  • Poor generalization, and the model may fail to utilize important features because of the influence of other features.

At the end of the day, modeling is all about finding the balance, the right threshold where the model is both simple and has strong predictive power.

So, to address these issues, there are three commonly used techniques:

  • PCA — Principal Component Analysis.
  • t-SNE — t-Distributed Stochastic Neighbor Embedding.
  • UMAP — Uniform Manifold Approximation and Projection.

Principal Component Analysis

First, let's see how to reduce dimensions from 2D to 1D.

As a simple example, suppose you have data points with two features: human hair color (on the x-axis) and height (on the y-axis). Heights vary quite a bit from person to person, but in India hair color is mostly black, with only a few people having other colors or different shades of black. So if you plot this data, the spread (variance) along the x-axis is very small. The more important feature here is height, because it has more variance, and I want to preserve the direction with more spread; more spread means more information, which is very useful. So I can drop the feature x and convert the data from 2D to 1D.

Let's see a slightly different scenario. We again have a two-dimensional dataset, and it is column standardized: each feature has mean = 0 and variance = 1, so the variance is the same along both axes. Now how do I get rid of one dimension?

The spread on x and y is roughly the same, so you cannot simply drop one feature. But if you look at the directions in the data and rotate the axes by a certain angle, the new axes X' and Y' remain perpendicular, just like X and Y, and the spread along Y' is much smaller than along X'.

If I can somehow find this angle, project the data points onto X', and drop Y', I can successfully convert the data from 2D to 1D. So this is just a technique of rotating the axes: we want to find the direction X' along which the variance is maximal, so that we can drop Y'.

There are two ways to solve this problem,

  1. Variance maximization: find a direction that maximizes the variance of the projections.
  2. Distance minimization: find a direction that minimizes the distances from the points to the line they are projected onto.
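
Before deriving the solution formally, here is a small toy sketch (an illustration I am adding, with a synthetic 2D dataset) of the first idea: a brute-force search over rotation angles for the direction X' whose projections have maximal variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D dataset with correlated features, then column standardization.
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
X = np.column_stack([x, y])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Try many candidate angles; project onto the unit vector u(theta)
# and keep the angle whose projections have maximal variance.
best_angle, best_var = None, -np.inf
for theta in np.linspace(0, np.pi, 1800, endpoint=False):
    u = np.array([np.cos(theta), np.sin(theta)])   # unit vector
    proj = X @ u                                   # scalar projections
    if proj.var() > best_var:
        best_var, best_angle = proj.var(), theta

print(f"Best direction X': {np.degrees(best_angle):.1f} degrees, variance: {best_var:.3f}")
```

PCA finds this same direction analytically, as derived in the next sections, rather than by searching over angles.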

Variance maximization

The task is to find u1 such that the variance of the points xi projected onto u1 is maximal.

Assume that the direction u1 is a unit vector (a unit vector has length equal to 1), and xi is the point we are projecting onto u1. Let xi' be the projection of xi onto u1. Applying this projection to every point results in a new dataset D'. The mathematical expression for the projection is as follows:

xi' = (u1^T * xi) * u1, where u1^T * xi is the (scalar) length of the projection of xi onto u1.

D' = { u1^T * x1, u1^T * x2, …, u1^T * xn }

Since the data is column standardized (mean 0), the variance of the projected points is (1/n) * Σ (u1^T * xi)^2, subject to the constraint u1^T * u1 = 1.

D’ is the new dataset we will get after this projection. The task here is to find u1 such that the variance of the datapoints projected onto u1 is maximal.

Here, maximizing the variance of the projections is our objective function, and u1^T * u1 = 1 is the constraint. Therefore, this is called a constrained optimization problem in PCA. Without the constraint, we could scale u1 arbitrarily and make the objective as large as we like, but the constraint forces u1 to be a unit vector.

Distance minimization:

For each xi, we compute di, the perpendicular distance from xi to the line along u1. So we have to find this distance di.

As per trigonometry (the Pythagorean theorem), di^2 = ||xi||^2 - (u1^T * xi)^2 for a unit vector u1. Since ||xi||^2 is fixed for each point, minimizing the total squared distance Σ di^2 is equivalent to maximizing Σ (u1^T * xi)^2, i.e., the variance of the projections.

So we can either maximize the variance or minimize the distances. These are two different formulations, but intuitively it is the same u1 that we need to find in both cases.

The solution to our optimization problem lies in the concept of eigenvalues and eigenvectors. Assuming our dataset X is column standardized, the covariance matrix S of X can be written as S = (1/n) * X^T * X, where X is an n×d matrix and X^T is a d×n matrix, making S a d×d matrix.

To maximize the variance, we need to find the direction u1. This is where eigenvalues and eigenvectors come into play. The eigenvalues of S, denoted λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λd (since S is a d×d matrix), each have a corresponding eigenvector: V1, V2, V3, …, Vd.

We can establish a relationship with our symmetric matrix ‘S’ as follows: λ1 * V1 = S * V1 (where λ1 is the eigenvalue of S, and V1 is the eigenvector of S corresponding to λ1).

Since S is a symmetric matrix, its eigenvectors are perpendicular (orthogonal) to each other: every pair Vi, Vj with i ≠ j satisfies Vi^T * Vj = 0.

By employing this property, we can determine that u1, which maximizes the variance, is equal to V1, i.e., the eigenvector of S corresponding to the largest eigenvalue, λ1.

Basically, you can determine the number of dimensions needed to cover most of the variance of your dataset. For instance, in a 100-dimensional dataset, you can calculate this using the eigenvalues as follows:

Percentage of variance explained by component i = λi / (λ1 + λ2 + … + λd)

For example, if λ1 = 3 and λ2 = 1, and you choose to use λ1, then the calculation would be: 3 / (3 + 1) = 3 / 4 = 75%. This means that by using just one dimension, you can retain 75% of the information present in the original dataset.

So, now if I have a 2D dataset, I can convert it into 1D simply using the maximal variance method as follows:

The new 1D dataset xi' = V1^T * xi retains the maximum variance, captured along the eigenvector V1, so we can drop the component along V2.
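
Before moving on to MNIST, here is a quick numerical sanity check (a toy example I am adding, with synthetic data) that the eigenvector V1 with the largest eigenvalue of the covariance matrix is indeed the direction we project onto for a 2D to 1D conversion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Column-standardized 2D toy dataset with correlated features.
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
X = np.column_stack([x, y])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix S = (1/n) * X^T * X for mean-centered data.
S = (X.T @ X) / len(X)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
lam2, lam1 = eigvals                       # smallest and largest eigenvalue
V1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue

X_1d = X @ V1                              # project each xi onto V1
print("Variance explained by V1:", lam1 / (lam1 + lam2))
print("Shape after reduction:", X_1d.shape)   # (500,) -> a 1D dataset
```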

Let’s now see some code to understand the steps that we have seen so far theoretically.

First, let's consider our MNIST dataset, which consists of 784 dimensions, making it impossible for the human eye to visualize directly. However, we can still gain a better understanding by applying the PCA concept manually: by using the top 2 eigenvectors, we can reduce the dimensions from 784 to 2, allowing us to create a visualization.

Implementing PCA without scikit-learn’s inbuilt package.
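
Here is a minimal sketch of how such a manual implementation could look. It assumes MNIST is loaded via scikit-learn's fetch_openml ("mnist_784") and subsampled to 10,000 images for speed; the loading and scaling use scikit-learn helpers purely for convenience, while the PCA math itself is plain NumPy.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml       # used only to load MNIST
from sklearn.preprocessing import StandardScaler

# Assumption: MNIST fetched as the 'mnist_784' OpenML dataset,
# subsampled to keep the eigen-decomposition fast.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data[:10000], mnist.target[:10000].astype(int)

# Step 1: column-standardize the 784 features.
X_std = StandardScaler().fit_transform(X)

# Step 2: covariance matrix S = (1/n) * X^T * X  -> shape (784, 784).
S = (X_std.T @ X_std) / len(X_std)

# Step 3: eigen-decomposition of the symmetric matrix S.
eigvals, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order

# Step 4: take the top 2 eigenvectors (largest eigenvalues).
top2 = eigvecs[:, -2:][:, ::-1]                 # shape (784, 2)

# Step 5: project the 784-D points onto the 2-D subspace and plot.
X_2d = X_std @ top2

plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=4)
plt.colorbar(label="digit")
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.title("MNIST projected onto its top 2 eigenvectors")
plt.show()
```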

In the steps above, I manually implemented PCA on the MNIST dataset, even though an inbuilt package is available. This exercise provides a valuable understanding of how the top 2 components can help visualize the dataset. Now, let's explore how this process is simplified using sklearn's PCA.

Implementing PCA using scikit-learn’s inbuilt package.
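
A minimal sketch of the equivalent using sklearn's PCA, under the same assumptions about loading and subsampling MNIST as in the manual version above:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same assumed loading and subsampling as in the manual version above.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data[:10000], mnist.target[:10000].astype(int)
X_std = StandardScaler().fit_transform(X)

# Let sklearn's PCA find the top 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())

plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=4)
plt.colorbar(label="digit")
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.title("MNIST in 2D using sklearn's PCA")
plt.show()
```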

PCA is a powerful technique for dimensionality reduction. To illustrate this, let's use the MNIST dataset as an example and apply PCA. By plotting the graph between "% of variance explained" and "number of components," we can determine how many components are needed to capture the majority of the dataset's variance. For instance, if the first 100 components account for 96% of the variance, dropping the remaining features is computationally efficient while still retaining most of the information for model interpretation.
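
One way to produce such a curve is the following sketch, which fits PCA with all 784 components and plots the cumulative explained variance (again assuming MNIST is loaded via fetch_openml and subsampled):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same assumed loading and subsampling as before.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X_std = StandardScaler().fit_transform(mnist.data[:10000])

# Fit PCA with all 784 components and accumulate the explained variance.
pca = PCA(n_components=784)
pca.fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_) * 100

plt.figure(figsize=(8, 5))
plt.plot(range(1, 785), cumulative)
plt.axhline(95, color="red", linestyle="--", label="95% of variance")
plt.xlabel("Number of components")
plt.ylabel("% of variance explained (cumulative)")
plt.legend()
plt.grid(True)
plt.show()
```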

% of variance explained vs. number of components

If you observe the elbow curve, you’ll notice that at around 350 components, we can capture over 95% of the variance. Based on this insight, let’s proceed to build two models: one utilizing all 784 features and the other using only the top 312 features.
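
Which classifier to use for this comparison is a design choice; the sketch below uses logistic regression purely as an illustrative stand-in, so the exact accuracy numbers will depend on the model and the train/test split, but the structure of the comparison is the same:

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same assumed loading and subsampling as before.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data[:10000], mnist.target[:10000]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

# Model 1: all 784 features (the classifier choice here is illustrative).
clf_full = LogisticRegression(max_iter=1000)
clf_full.fit(X_train_std, y_train)
print("Accuracy with 784 features:", clf_full.score(X_test_std, y_test))

# Model 2: only the top 312 principal components.
pca = PCA(n_components=312).fit(X_train_std)
clf_pca = LogisticRegression(max_iter=1000)
clf_pca.fit(pca.transform(X_train_std), y_train)
print("Accuracy with 312 components:", clf_pca.score(pca.transform(X_test_std), y_test))
```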

So, even with just half of the features, we observe only a 0.03% reduction in accuracy, which is acceptable. It wouldn’t be practical to use an additional 472 dimensions for just a marginal 0.03% improvement. Working with such a large number of dimensions would be computationally expensive, especially for large datasets.

Therefore, while utilizing PCA for visualization is indeed useful, it’s essential to mention that techniques like t-SNE and UMAP can do a phenomenal job in this regard, as I’ll explore in my upcoming blogs. However, for dimensionality reduction, PCA provides a valuable trade-off between computational efficiency and maintaining a high level of accuracy.

I hope I have provided you with valuable insights and a better understanding of this concept. Thank you :)
