Dimensionality Reduction and Visualization - Part-1

Deepak Jain
Applied Machine Learning
3 min read · May 31, 2020

In this series of articles, we will understand the concept of dimensionality reduction and visualization. We will start with understanding what this concept is and why it is needed, followed by the 2 most widely used (and popular) dimensionality reduction techniques: PCA and t-SNE.


I work in a company where a major part of my job is to capture the necessary data points from various data sources and clean them. Once done with the cleaning (which takes a pretty good amount of my time), I spend considerable time analyzing the data.

Now, I have tried multiple ways of analyzing data.

One way is to stare at the numbers and try to understand what they are telling me. Another (and more interesting) way is to create graphs and plots using visualization tools (I work on Tableau), and trust me, the kind of story you can build around the data using some form of visualization is far better than just using those numbers. Even my management team loves to hear a story that is told through charts and graphs.

We humans are visual creatures. So, when I have a data set with more than three dimensions, it becomes impossible to see what’s going on with our eyes, simply because we cannot plot data having more than 3 dimensions.

In real-world applications, we can (and we will) come across hundreds or even thousands of dimensions for a given data set, and our limitation (as human beings) makes it impossible to visualize that data. But who said that all these extra dimensions are always necessary? Isn’t there a way to somehow reduce them to one, two, or three humanly visualizable dimensions? It turns out there is.

This process of reducing the number of dimensions for having a better visualization or removing redundant dimensions is known as Dimensionality Reduction.

Now, the next big question that pops into my mind is: how do I decide which dimensions are important and which are not? Which ones to keep and which ones to remove?

To do so we have several techniques, but the ones we will be covering as part of this series are:

  • Principal Component Analysis
  • T-distributed Stochastic Neighbor Embedding (t-SNE)
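As a small taste of what is coming in the later parts, here is a minimal sketch (not from this article) of both techniques in action. It assumes scikit-learn is installed and uses its built-in 64-dimensional “digits” dataset, projecting it down to 2 dimensions so it could be plotted:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 1797 handwritten-digit images, each flattened to 64 dimensions
X, y = load_digits(return_X_y=True)

# PCA: a linear projection onto the 2 directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: a non-linear embedding that tries to preserve local neighborhoods
# (run on a subset here, since t-SNE is comparatively slow)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X[:500])

print(X_pca.shape)   # (1797, 2)
print(X_tsne.shape)  # (500, 2)
```

The 2-column outputs can be fed straight into any scatter-plot tool, coloring each point by its digit label `y`.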

Before we dive deep into the concepts of the above 2 techniques, let's understand why we perform dimensionality reduction in the first place.

As discussed above, one major reason is to visualize data in a humanly understandable form. Apart from that, dimensionality reduction has the following advantages as well:
• As the number of dimensions decreases, the space needed to store the data also decreases
• Fewer dimensions lead to less computation
• Some algorithms perform better when the number of dimensions is low
• It takes care of multicollinearity, i.e., it gets rid of redundant dimensions
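The storage and redundancy points above can be seen concretely. The following sketch (my own illustration, assuming scikit-learn) keeps only 10 of the 64 dimensions of the digits dataset with PCA, then checks how much storage was saved and how much of the data's variance survived:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 samples, 64 dimensions each

pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)               # same samples, only 10 dimensions

# Storage shrinks in proportion to the number of dimensions kept
print(X.nbytes, X_reduced.nbytes)

# ...yet most of the variance (information) in the data is retained,
# because the dropped dimensions were largely redundant
print(pca.explained_variance_ratio_.sum())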

I will conclude this part of the series here. In the next part (Part-2) we will understand PCA in great detail. We will understand the geometric intuition behind it and the optimization function of PCA.

Until then, Happy Learning!

Edit:
Link to Part-2 of this series
Link to Part-3 of this series
Link to Part-4 of this series
