The what, the why and the how
In this article, we will embark on a journey into the lands of Dimensionality Reduction. The first station on our journey is the town of WHY. From there, we will take the ferry to the town of WHAT, and we will end our journey in the town of HOW. Also, before saying adieu, there’s a little surprise for you all, and I really hope you will like it.
It often happens that if we first explore the factors that led to the emergence of a concept, we get a better intuition and a deeper understanding of the concept itself. That is exactly what we will be doing.
Many of the real-world datasets that we encounter in day-to-day life are high dimensional, sometimes consisting of thousands or even millions of features. When we have a high dimensional dataset, we encounter a ton of problems while processing it. Some of them include:
- One of the strongest virtues of human beings is the ability to visualize. Higher dimensional datasets rob us of that virtue, as human beings can only visualize things in up to 3 dimensions.
- In many ML models, the space and time required at train time and run time grow with the number of features in the dataset. In simple words, the higher the dimensionality of the dataset, the higher the space and time complexities.
- A high-dimensional dataset gives the model more room to fit noise, which increases the variance of the model and can lead to over-fitting, i.e. the model will perform well on the training dataset but poorly on the test dataset.
- In many real-world applications, we value speed over precision. In such applications as well, a high dimensional dataset can be a big hurdle.
- The features in a high dimensional dataset are often multi-collinear, which can make model estimates unstable and degrade the predictions.
- Minkowski distances (which can be considered a generalization of Euclidean distance and Manhattan distance) lose their meaning in higher dimensional datasets, because the nearest and the farthest points end up at almost the same distance. So, all the models which rely heavily on distance-based metrics suffer a severe blow when they encounter a high-dimensional dataset.
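The last point in the list can be checked numerically. The snippet below is only a rough sketch on synthetic uniform data (the helper name `relative_contrast` and the point counts are my own choices): it measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min of the Euclidean distances from one query point."""
    points = rng.random((n_points, n_dims))   # synthetic data in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

The contrast should shrink as the dimensionality goes up, which is why "nearest" neighbours stop being meaningfully nearer than everything else.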
The above key points highlight some of the major issues that we face when dealing with a high dimensional dataset, and they are collectively known as the Curse of Dimensionality. Here steps in our knight in shining armor 💂🏼, the notion of dimensionality reduction. Without any further ado, let’s move on to it.
In order to tackle all the above issues and many more similar issues, we use dimensionality reduction. According to Wikipedia, it is defined as the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
Lost, huh? 🤔 Let’s define it once more. It is simply the transformation of our data from a higher-dimensional space into a lower-dimensional space while preserving as much information as possible.
Now the next natural question that arises is, “All that sounds amazing, but how is it done? Is it some sort of magic?” ✴️ And that is the exact question we will be answering in the very next section.
Now comes the most important question, “How?”. Some of you might wonder why it is the most important question. The reason is simply that there is no single answer to it.
The Machine Learning community boasts a serious wealth of dimensionality reduction techniques. So many, in fact, that if I described each and every one of them, this article would transform itself from a low-dimensional article into a high-dimensional thesis 😂.
Hence, in this article, I have decided to just explore the outer areas of many of those techniques. But as I mentioned previously, there’s a surprise waiting for you all and so, it’s SURPRISE TIME.
I will be writing detailed articles on some of these techniques soon, and for the others, I will be including the best possible resources for you all to grasp each of them in detail.
Before commencing our tour, though, I would like to mention one very important thing. The list of techniques below is not exhaustive. I have tried to include as many techniques as possible, but there are many more out there. Now, let’s commence the marathon.
Missing Value Ratio
The idea behind it is pretty simple. For each of our features, we calculate the Missing Value Ratio as (Number of missing values / Total number of observations) * 100, and then we set a threshold. Now, we simply eliminate those features which have a missing value ratio higher than the threshold. For a more detailed understanding and implementation of missing value ratio, refer to this article.
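As a small illustration of the formula above, here is a sketch with pandas. The toy DataFrame and the 50% threshold are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "income" is missing in 4 out of 5 rows.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "income": [np.nan, np.nan, np.nan, 52000, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai", "Pune"],
})

# (number of missing values / total number of observations) * 100, per feature
missing_ratio = df.isna().mean() * 100

threshold = 50.0  # arbitrary cut-off, in percent
kept = missing_ratio[missing_ratio <= threshold].index.tolist()
reduced = df[kept]

print(missing_ratio)
print("kept features:", kept)
```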
Low Variance Filter
This technique is pretty simple too. We calculate the variance of every feature in our dataset, and then we drop all those features whose variance is below a certain threshold. Since variance depends on the scale of a feature, the features should be on comparable scales before they are compared. The choice of the threshold is completely subjective. For a more detailed understanding and implementation of low variance filter, refer to this article.
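One possible way to try this out is scikit-learn's `VarianceThreshold`. The tiny matrix and the threshold of 0.1 below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the first column is constant, so its variance is 0.
X = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
])

selector = VarianceThreshold(threshold=0.1)  # arbitrary threshold
X_reduced = selector.fit_transform(X)
kept_mask = selector.get_support()

print("per-feature variance:", selector.variances_)
print("kept mask:", kept_mask)
```

Note that the third column survives mostly because of its larger scale, which is why scaling matters before applying this filter.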
High Correlation Filter
In this technique, we compute the correlation between every pair of numerical features. If the absolute value of the correlation coefficient between two features crosses a certain threshold, we can drop one of the two. The choice of which feature to drop is completely subjective. For more details, refer to this article.
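A minimal sketch of this filter with pandas could look as follows. The synthetic columns and the 0.9 threshold are assumptions, and the rule "drop the later column of each correlated pair" is just one of many possible choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a duplicate of "a"
    "b": rng.normal(size=200),                           # unrelated feature
})

corr = df.corr().abs()
# Keep only the upper triangle so that each pair is looked at once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("dropped:", to_drop)
```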
Random Forest
Although it is a tree-based model used for regression and classification tasks on non-linear data, it turns out that it can also be used for feature selection through its built-in feature_importances_ attribute, which holds an importance score for each feature. For a detailed understanding of the random forest model, refer to this article.
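Here is a short sketch of feature selection with a random forest, assuming scikit-learn and its bundled Iris dataset. Using `SelectFromModel` with a median threshold is my own choice for the example, not the only way to use the scores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # one score per feature, summing to 1

# Keep the features whose importance is at least the median importance.
selector = SelectFromModel(forest, threshold="median", prefit=True)
X_reduced = selector.transform(X)

print("importances:", importances.round(3))
print("shape before/after:", X.shape, X_reduced.shape)
```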
Principal Component Analysis (PCA)
In PCA, we extract new variables from the existing variables. These are known as Principal Components, and each principal component is a linear combination of the original features. They are extracted in decreasing order of the variance explained by each of them. We use the eigenvectors and eigenvalues of the covariance matrix of the data to calculate the principal components.
For an unforgettable understanding of PCA, refer to this Stack Exchange thread. This is one of the best explanations that I have come across for any topic, and not just PCA. For its implementation, refer to this article.
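To make the eigenvalue connection concrete, here is a small sketch that assumes scikit-learn's `PCA` and the Iris dataset. It also computes the eigenvalues of the covariance matrix by hand, to show that they match the variance explained by each component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
ratios = pca.explained_variance_ratio_  # sorted in decreasing order

# Same quantities by hand: eigenvalues of the covariance matrix, largest first.
cov = np.cov(X_scaled, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

print("explained variance ratio:", ratios.round(3))
print("top eigenvalues:", eigvals[:2].round(3))
```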
Independent Component Analysis (ICA)
It is a widely used technique, and it is based on information theory. The major difference between PCA and ICA is that PCA looks for uncorrelated factors while ICA looks for statistically independent factors. For its implementation, refer to this article.
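The classic toy use of ICA is un-mixing signals. The sketch below assumes scikit-learn's `FastICA` (one particular ICA algorithm). The two source signals and the mixing matrix are invented for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)               # source 1: sine wave
s2 = np.sign(np.sin(3 * t))      # source 2: square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])       # hypothetical mixing matrix
X = S @ A.T                      # what we actually observe: mixtures of the sources

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # estimated independent components

print(S_est.shape)
```

The recovered components come back in arbitrary order and scale, so they should be compared with the sources by shape rather than by value.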
Kernel PCA
Kernel PCA is an extension of PCA that uses kernel methods. It works well with non-linear datasets where normal PCA cannot be used effectively. For a detailed understanding of Kernel PCA, refer to this article.
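A quick way to see the difference is the two-concentric-circles dataset, which no linear projection can untangle. This is only a sketch: the RBF kernel and `gamma=10` are assumptions that happen to suit this toy data.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: a textbook non-linear dataset.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                       # just rotates the data
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)
```

Plotting `X_kpca` coloured by `y` is a good way to check whether the two circles have been pulled apart.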
Factor Analysis (FA)
This technique works on the concept of correlations. The variables are put into groups such that all variables in a particular group have a high correlation among themselves, but a low correlation with variables of other groups. We refer to each of the groups as a factor. For a detailed understanding of Factor Analysis, refer to this article.
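The idea of groups of correlated variables can be simulated directly. In this sketch (assuming scikit-learn's `FactorAnalysis`), six observed variables are generated from two hidden factors. The loadings and the noise level are invented.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))          # two hidden factors

# Variables 0-2 depend on factor 0, variables 3-5 on factor 1 (hypothetical loadings).
loadings = np.array([[1, 0], [1, 0], [1, 0],
                     [0, 1], [0, 1], [0, 1]], dtype=float)
X = factors @ loadings.T + 0.1 * rng.normal(size=(500, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)                 # 6 features reduced to 2 factor scores

print(scores.shape)
print(fa.components_.round(2))               # estimated loadings, one row per factor
```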
Linear Discriminant Analysis (LDA)
Although it is typically used for multi-class classification, it can also be used for dimensionality reduction. It is a supervised algorithm, i.e. it takes class labels into account. For a detailed understanding of Linear Discriminant Analysis, refer to this article.
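Here is a brief sketch with scikit-learn's `LinearDiscriminantAnalysis` on Iris (my choice of dataset). Because LDA is supervised, the labels `y` are passed in, and the number of components can be at most the number of classes minus one.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 4 features, 3 classes

# With 3 classes, LDA can produce at most 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)     # note: labels are required

print(X.shape, "->", X_lda.shape)
```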
Correspondence Analysis (CA)
Also known as reciprocal averaging, it’s a technique that is traditionally applied to contingency tables. Although it is conceptually similar to PCA, it applies to categorical data rather than continuous data. As of now, that’s it for CA, but soon I will be writing an article on CA and MCA, in which I will throw some light on contingency tables as well.
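For the curious, a bare-bones version of CA can be written with NumPy alone. This is only a sketch of the standard textbook formulation (SVD of the standardized residuals of a contingency table), and the table itself is invented.

```python
import numpy as np

# Hypothetical 3x3 contingency table (rows and columns are two categorical variables).
N = np.array([
    [20, 5, 5],
    [4, 18, 8],
    [6, 7, 27],
], dtype=float)

P = N / N.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses

# Standardized residuals: deviation from independence, scaled by the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * sigma) / np.sqrt(r)[:, None]      # principal coordinates of rows
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]   # principal coordinates of columns
total_inertia = (sigma ** 2).sum()                  # equals chi-square statistic / N.sum()

print(row_coords[:, :2].round(3))
print("total inertia:", round(total_inertia, 4))
```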
Multiple Correspondence Analysis (MCA)
It can be simply defined as an extension of CA to more than 2 categorical features. It is used to detect and represent underlying structures in a dataset, which it does by representing the data as points in a low-dimensional Euclidean space. As of now, that’s it for MCA, but soon I will be writing an article on CA and MCA.
Multiple Factor Analysis (MFA)
It seeks the common structures present across groups of features. It is used when the observations are described by several groups of features, where each group may be numerical or categorical. It may be seen as an extension of PCA, MCA and FAMD. As of now, that’s it for MFA, but soon I will be writing an article solely on this technique.
Factor Analysis of Mixed Data (FAMD)
It is used for reducing the dimensions of datasets containing both quantitative (numerical) and qualitative (categorical) features. It can be seen as a mix of PCA and MCA. As of now, that’s it for FAMD, but soon I will be writing an article solely on this technique.
Singular Value Decomposition (SVD)
Although it is used in digital signal processing for noise reduction and image compression, it can also be used for dimensionality reduction. It is a concept borrowed from the sea of linear algebra. For a detailed understanding of Singular Value Decomposition, refer to this article.
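Here is a minimal NumPy sketch of using SVD for dimensionality reduction. The random matrix and the choice of `k = 5` are assumptions: we decompose the centered data and keep only the directions with the largest singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # synthetic data: 100 samples, 20 features
X_centered = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 5                                      # number of dimensions to keep (arbitrary)
X_reduced = U[:, :k] * s[:k]               # same as projecting onto the top-k right singular vectors
X_approx = X_reduced @ Vt[:k]              # best rank-k approximation of the centered data

print(X_reduced.shape, X_approx.shape)
```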
Truncated Singular Value Decomposition (Truncated SVD)
Truncated SVD differs from regular SVD in that it computes only the largest singular values and their vectors, so the reduced data has as many columns as the specified truncation. It works well with sparse data, in which most of the values are zero. As of now, that’s it for Truncated SVD, but soon I will be writing an article on SVD and Truncated SVD.
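A small sketch with scikit-learn's `TruncatedSVD` on a random sparse matrix follows. The matrix size, the density and the number of components are all assumptions for the example.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix: only about 5% of the entries are non-zero.
X = sparse_random(200, 50, density=0.05, format="csr", random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)   # works on the sparse matrix directly, no densifying needed

print(X.shape, "->", X_reduced.shape)
```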
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Just like Kernel PCA, it is a non-linear dimensionality reduction technique, and it is mostly used for data visualization. In addition to that, it is also widely used in image processing and NLP. For a detailed understanding of t-Distributed Stochastic Neighbor Embedding, refer to this article.
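Here is a short sketch for visualization purposes, assuming scikit-learn's `TSNE` and a 300-image subset of its digits dataset. The perplexity value is an arbitrary but common choice.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64 features per image
X_small, y_small = X[:300], y[:300]   # small subset to keep the run short

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_small)

print(X_small.shape, "->", X_2d.shape)
```

A scatter plot of `X_2d` coloured by `y_small` is the usual next step.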
Multidimensional Scaling (MDS)
Just like t-SNE, MDS is another non-linear dimensionality reduction technique. It tries to preserve the distances between instances while reducing the dimensionality of non-linear data. For a detailed understanding of Multidimensional Scaling, refer to this article.
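A minimal sketch with scikit-learn's `MDS` on Iris (my choice of dataset) is shown below. By default, this implementation looks for a 2D layout whose pairwise distances stay as close as possible to the original ones.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, _ = load_iris(return_X_y=True)

mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)

print(X.shape, "->", X_2d.shape)
print("stress (lower means distances are better preserved):", round(mds.stress_, 2))
```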
Uniform Manifold Approximation and Projection (UMAP)
It is a dimensionality reduction technique that can be used for data visualization just like t-SNE. However, it can also be used for general-purpose dimensionality reduction of non-linear datasets. As of now, that’s it for UMAP, but soon I will be writing an article solely on this technique.
Isometric Feature Mapping (Isomap)
Just like t-SNE, it is also used for dimensionality reduction of non-linear datasets. It can be seen as an extension of MDS or Kernel PCA. For a detailed understanding of Isometric Feature Mapping, refer to this article.
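The swiss roll is the usual playground for this method. The sketch below assumes scikit-learn's `Isomap`, and the number of neighbours is an arbitrary choice.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 2D sheet rolled up in 3D: a classic non-linear dataset.
X, color = make_swiss_roll(n_samples=600, random_state=0)

isomap = Isomap(n_neighbors=12, n_components=2)
X_2d = isomap.fit_transform(X)   # tries to "unroll" the sheet

print(X.shape, "->", X_2d.shape)
```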
Locally Linear Embedding (LLE)
It is an unsupervised method for dimensionality reduction. It tries to reduce the number of features while preserving the local geometry of the original non-linear structure. For a detailed understanding of Locally Linear Embedding, refer to this article.
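Here is a short sketch with scikit-learn's `LocallyLinearEmbedding` on a synthetic S-curve. The dataset and the neighbour count are assumptions for the example.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_s_curve(n_samples=600, random_state=0)   # 3D points on an S-shaped sheet

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_2d = lle.fit_transform(X)

print(X.shape, "->", X_2d.shape)
print("reconstruction error:", lle.reconstruction_error_)
```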
Hessian Eigenmapping (HLLE)
It projects data to a lower dimension while preserving the local neighborhood like LLE, but it uses the Hessian operator to better achieve this result, hence the name. As of now, that’s it for HLLE, but soon I will be writing an article solely on this technique.
Spectral Embedding (Laplacian Eigenmaps)
It uses spectral techniques to perform dimensionality reduction by mapping nearby inputs to nearby outputs. It preserves locality rather than local linearity. For a detailed understanding of Spectral Embedding, check out my article on the same.
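A minimal sketch with scikit-learn's `SpectralEmbedding` follows. The nearest-neighbour affinity and the neighbour count are my own choices for this toy S-curve.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import SpectralEmbedding

X, color = make_s_curve(n_samples=600, random_state=0)

embedding = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",   # build the graph from each point's nearest neighbours
    n_neighbors=10,
    random_state=0,
)
X_2d = embedding.fit_transform(X)

print(X.shape, "->", X_2d.shape)
```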
Backward Feature Elimination
This technique removes features from a dataset through a recursive feature elimination (RFE) process. The algorithm starts with the full set of features and keeps on eliminating the least useful feature for as long as the change in the performance score is negligible. For a detailed understanding of backward feature elimination, refer to this article.
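Here is a sketch using scikit-learn's `RFE`. Two caveats about this sketch: `RFE` decides what to drop from the model's coefficients rather than by re-scoring the model, and it stops at a fixed number of features (2 here, an arbitrary choice) instead of watching the performance score.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from all 4 features and drop the weakest one at each step until 2 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
X_reduced = rfe.fit_transform(X, y)

print("kept mask:", rfe.support_)
print("ranking (1 = kept):", rfe.ranking_)
```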
Forward Feature Selection
This method can be considered the opposite of backward feature elimination. Instead of eliminating features one at a time, it adds them one at a time. It starts with individual features and keeps on adding features until the improvement in the performance score becomes negligible. For a detailed understanding of forward feature selection, refer to this article.
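A possible sketch with scikit-learn's `SequentialFeatureSelector` is shown below. For simplicity it stops at a fixed number of features (2, an arbitrary choice) rather than at a negligible score change, and the estimator and the 3-fold cross-validation are also assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start with no features and greedily add the one that improves the CV score the most.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=3,
)
X_reduced = sfs.fit_transform(X, y)

print("selected mask:", sfs.get_support())
```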
Auto Encoders
These are a type of artificial neural network that aims to copy its inputs to its outputs. They compress the input into a latent-space representation, and then reconstruct the output from this representation. For a detailed understanding of Auto Encoders, refer to this article.
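Real autoencoders are usually built with a deep learning framework, but the idea can be sketched with scikit-learn alone. Below, an `MLPRegressor` with a 2-unit hidden layer is trained to reproduce its own input, and the hidden layer is then used as the latent representation. This is a toy stand-in, and the layer size and activation are my assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# 4 inputs -> 2 hidden units (the bottleneck) -> 4 outputs; the target is the input itself.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X_scaled, X_scaled)

# Encoder half: apply the first layer by hand to get the latent representation.
latent = np.tanh(X_scaled @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])
# Decoder half: the full forward pass gives the reconstruction.
reconstruction = autoencoder.predict(X_scaled)

print(latent.shape, reconstruction.shape)
```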
Phew! Before writing this article, I never imagined in my wildest dreams that there would be so many dimensionality reduction techniques. And I will definitely update this article whenever I encounter a dimensionality reduction technique that is not included in the above list.
And if any of you come across any dimensionality reduction technique which is not included in the above list, do let me know either in the comments section, or you can ping me directly.
Also, I would like to mention one additional thing. As can be seen from the above list, I have mentioned that I will be writing articles on many of these techniques individually. However, if I find a resource that explains any of these techniques in depth, then instead of writing an article myself, I will just include the resource so that all of us can benefit.
A little about ME 👋
You can safely skip this section, if you have no interest in knowing the author, or you already know me. I promise that there is no hidden treasure in this section 😆.
I am a Machine Learning and Deep Learning enthusiast, and this is my first piece of content on the subject. If you like it, do put your hands together 👏, and if you would like to read further articles on Machine Learning and Deep Learning, #StayTuned.
Thanks a lot, guys, for making this journey possible. It was a fun one indeed, and I will be coming back soon with another adventurous trip.