Dimensionality reduction methods

--

In data science, it is common to work with high-dimensional datasets, a problem often referred to as the “Curse of Dimensionality.” In this article, I will share with you my main tools and methods for working with such datasets.

First, let’s briefly recall when this is the case. In data science jargon, high-dimensional data refers to a problem where the number of attributes (columns) is greater than or close to the number of samples (rows). For example, in natural-science data (biological and chemical), the number of attributes can reach tens of thousands, while the number of samples may be only about 300.

P.S. Throughout the article I will use the following terms interchangeably: attributes == features == factors == columns; samples == observations == rows.

Why is that a problem?

The main concern is overfitting, which results in terrible out-of-sample performance. In other words, the algorithm “learns by heart” the training data and is not able to apply the same logic to the test data. This problem can be visualized as shown in the figures below, where figure b) is the least desirable (“learn by heart”) case for us in the context of high-dimensional data.

Having tens of thousands of columns does not give us the option to know each and every attribute, and we often have to go in blind, since studying and understanding the whole dataset is very time-consuming. For that reason, high-dimensional data can bring other issues:

  • Duplicated attributes: for example, we might have temperature in Celsius and Fahrenheit as two different columns, but in fact they carry the same information.
  • Mathematical transformations of one or several columns: let’s say column “x_6” contains a continuous variable, and column “x_67389” is a log transformation of “x_6”. One of the columns must be eliminated.
  • Factors that are redundant or not related to the task must be dropped.

As you can see, there can be lots of different situations and “not-seen-before” caveats, since every dataset is unique. Below, I will show you my plan and set of tools 🛠️ for tackling this problem.

Strategy

For simplicity, let’s split the methods into baseline and advanced ones. We will start with the baseline techniques and smoothly move to the hardcore part.

Along the way, there will be an implementation of each method in R on a real data set: the Parkinson’s disease classification data set [1] from the UCI data repository.

The data set has 756 observations and 754 features (remember this number), which is exactly the problem described above! In order to start working with the data we need to do some preparation: convert all columns to the numeric class (the raw data set contains only character columns).
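A minimal sketch of this preparation step, assuming the UCI file has been downloaded as a CSV (the file name below is a placeholder to adjust to your copy):

```r
# Read the raw file (file name is a placeholder; adjust the path to your copy)
df <- read.csv("pd_speech_features.csv", stringsAsFactors = FALSE)

# All columns arrive as character, so convert every column to numeric
df[] <- lapply(df, as.numeric)

dim(df)  # should show 756 rows
```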

Let’s Start!!

Delete NA columns

Deleting NA columns might sound too simple, but sometimes it really helps to get started and reduce the number of attributes. An extension of this method is so-called thresholding, applied when NA values dominate a column. For example, one can set a condition to delete a column if the share of NA values is greater than 90%. Unfortunately, this method has some caveats. The remaining 10% can be extremely (or not at all) valuable for the future ML model, and it is important to spend some time investigating whether deleting them is a good decision. Another problem is that one has to somehow choose the 90% threshold (smaller or greater) and should have a reasonable argument for that choice.
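A possible implementation of this thresholding, using the 90% cutoff from the example (the cutoff itself is an assumption you need to justify for your data):

```r
# Share of NA values in each column
na_share <- colMeans(is.na(df))

# Drop columns where more than 90% of the values are missing
df <- df[, na_share <= 0.90, drop = FALSE]
```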

Delete zero-value columns (not always applicable)

The second approach is not applicable in all cases because “0” does not mean NA and vice versa. Thus, this approach can only be used if “0” values are atypical for the given goal and data set. For example, we can confidently delete instances where height = “0”. Each data set is unique and should be carefully investigated first.

As with the previous method, we can set a threshold and drop a column if the number of “0” values in it exceeds a certain limit, but we should consider the possible importance of the remaining instances and have a reasonable argument for the chosen threshold.

Let’s assume for the sake of simplicity that this is a reasonable tactic.
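Under that assumption, a sketch of the zero-value filter might look like this (again, the 90% cutoff is illustrative):

```r
# Share of zero values in each column (ignoring NAs)
zero_share <- sapply(df, function(x) mean(x == 0, na.rm = TRUE))

# Drop columns dominated by zeros
df <- df[, zero_share <= 0.90, drop = FALSE]
```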

In our example, this method didn’t work and we still have 754 features.

Drop the column if the standard deviation is very small

In other words — drop the redundant columns. How does it work? — we simply need to write a condition like: delete the column if its standard deviation is equal to or less than 10% (can be any number) of its mean.

Is it workable? — yes. The reason lies at the core of the dimensionality reduction idea: we reduce the number of features in order to obtain a final data frame with ONLY the most valuable attributes. If the standard deviation is small, most of the instances are similar or even the same, which is pretty much the same as having exactly one value throughout the samples in a data set. Thus, the model won’t benefit from it → it is redundant → drop the feature.

As you might have already noticed, in this method we again have the ‘threshold’ parameter, which is specified by a person and is therefore immediately biased. Similarly to the above methods, we need to be careful while setting the threshold.

The code is similar to the above one.
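A rough sketch of such a low-variation filter, using the 10% ratio mentioned above as the illustrative threshold:

```r
# Ratio of standard deviation to mean (coefficient of variation) per column
cv <- sapply(df, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

# Keep columns whose sd exceeds 10% of the mean; drop the near-constant ones
# (the target column easily passes this filter, but you may exclude it explicitly)
keep <- !is.na(cv) & abs(cv) > 0.10
df   <- df[, keep, drop = FALSE]
```

The nearZeroVar() function from the caret package offers a similar ready-made filter if you prefer not to write the condition yourself.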

This step helped us to omit 32 columns. Now df has 722 features.

Delete highly correlated columns

High-dimensional data usually implies having duplicated and very similar attributes that are not necessary and, moreover, can interfere with the modeling process. This is also known as the multicollinearity problem, where two attributes are more strongly correlated with each other than either of them is with the target variable.

To understand it better, let’s imagine two factors, x1 and x2, and a target variable y. We might have the following situation:

The correlation between the two factors (0.90) is higher than the correlation of either of them with the target variable. This can be a problem when it comes to modeling, so there are two options for handling the situation:

  • delete both factors, but in this case we might lose a piece of information
  • delete the factor that has a lower correlation with the target variable (x_2 in our example)

I personally prefer the second option, and it is usually considered best practice, but it is always an open question since each data set is unique.
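A minimal sketch of this rule, assuming the target column is named `class` and using a 0.90 cutoff (both are assumptions to adapt to your data):

```r
target   <- df$class
features <- df[, setdiff(names(df), "class"), drop = FALSE]

cor_mat  <- cor(features, use = "pairwise.complete.obs")
cor_targ <- abs(cor(features, target, use = "pairwise.complete.obs"))[, 1]

# For each highly correlated pair, mark for removal the feature
# that is less correlated with the target
high_pairs <- which(abs(cor_mat) > 0.90 & upper.tri(cor_mat), arr.ind = TRUE)
to_drop    <- character(0)
for (k in seq_len(nrow(high_pairs))) {
  i <- rownames(cor_mat)[high_pairs[k, "row"]]
  j <- colnames(cor_mat)[high_pairs[k, "col"]]
  to_drop <- c(to_drop, if (cor_targ[i] < cor_targ[j]) i else j)
}

df <- df[, !(names(df) %in% unique(to_drop)), drop = FALSE]
```

The findCorrelation() function from the caret package implements a related filter based on mean absolute correlation, if you prefer a ready-made tool.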

And magic happened! The data frame has 387 features instead of 722.

The four methods above can help you reduce the dimensions and pick the most valuable features most of the time, but there might be cases when this is still not enough and one needs more advanced methods. So we are moving on to the advanced methods that I personally use in my daily work. Please note that we won’t go deep into the details (each method deserves its own article) but rather scratch the surface so that you can implement them immediately.

Feature selection with Random Forest (RF)

This algorithm is amazing and I use it all the time, since it helps not only to find non-linear relationships but also to show which attributes are the most important ones.

If you would like to dive deeper and understand the idea behind the feature selection method, check this out [2]. Also check [3], if you want to refresh your knowledge about Random Forest in general.

The step-by-step algorithm:

  1. Run RF and print the accuracy results;
  2. Normalize the variable importance values (0–100) and plot them. Normalization is needed only to better visualize the factors and their importance;
  3. Set a threshold and omit certain attributes (for example, delete attributes whose importance is less than 60%);
  4. Run RF again to make sure that the accuracy didn’t decrease (a code sketch of steps 1–3 is shown below).
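Here is a minimal sketch of steps 1–3 with the randomForest package, assuming the target column is named `class` and using the 5% cutoff discussed further down (both are assumptions to adapt to your data):

```r
library(randomForest)

# Step 1: fit RF and print the OOB accuracy and confusion matrix
df$class <- as.factor(df$class)
rf <- randomForest(class ~ ., data = df, importance = TRUE)
print(rf)

# Step 2: rescale variable importance to 0-100 for easier comparison and plot it
imp        <- importance(rf, type = 2)[, 1]   # mean decrease in Gini
imp_scaled <- 100 * (imp - min(imp)) / (max(imp) - min(imp))
varImpPlot(rf)

# Step 3: keep only the features above the chosen threshold (5% here)
keep <- names(imp_scaled)[imp_scaled > 5]
df   <- df[, c(keep, "class"), drop = FALSE]
# Step 4 would be re-fitting RF on the reduced df and comparing accuracy
```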

After getting the accuracy and confusion matrix we can look at the “importance” of the features:

The column ‘cuts’ represents the importance ranges, where 0 means not important and 100 means very important. The column ‘n’ shows the number of features that belong to each importance range. Thus, 372 features, which is 96% (!), have ONLY 0–5% importance, and only 13 columns have importance greater than 5%. In our case, I propose setting the threshold at 5%: in other words, delete all columns with “importance” lower than 5%.

The RF algorithm helped us drop 96% of the irrelevant columns, but one caveat remains: the threshold must be specified manually and is therefore already biased, so we should be careful when defining this parameter.

The next step is to run RF again with the reduced number of columns to make sure that the data manipulations did not affect the accuracy results. If they did → play with different thresholds and check the accuracy again.

Boruta

The good news about the Boruta algorithm is that it was created to avoid this issue with the threshold parameter in RF and thus provides a more robust solution.

In short, the Boruta algorithm creates shadow features (shuffled copies of the original features). For example, if the original feature is ‘volume’, its shadow is ‘volume_shadow’. So the algorithm doubles the number of features, and then an RF is fitted.

How does it decide which attributes to keep? After fitting the RF and obtaining the variable importance results, Boruta scans for the highest importance value among the shadow features and sets it as the threshold (threshold = highest importance among shadow features). An original attribute is considered ‘important’ only if its importance is greater than this threshold.

Again, if you want to familiarize yourself with the algorithm and get a more detailed explanation, please follow [4].

Thus, it can suggest which attributes to drop and which to keep. Moreover, the threshold parameter is no longer biased, and we no longer have to manually set conditions or provide justifications for our selections. However, one potential issue with this method is that it can be time-consuming, as Boruta doubles the number of parameters and RF may require hyper-parameter tuning.

When dealing with tentative attributes, there are two options:

  1. we can either drop them
  2. or use the TentativeRoughFix function, which automatically decides which of the tentative attributes are important and which are not (used in the sketch below).
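A minimal sketch with the Boruta package, again assuming the target column is named `class`:

```r
library(Boruta)

set.seed(42)
boruta_res <- Boruta(class ~ ., data = df, doTrace = 1)

# Option 2: resolve the tentative attributes automatically
boruta_fixed <- TentativeRoughFix(boruta_res)

# Keep only the confirmed attributes plus the target
keep <- getSelectedAttributes(boruta_fixed, withTentative = FALSE)
df   <- df[, c(keep, "class"), drop = FALSE]
```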

Thus, by applying the Boruta algorithm we reduced the data set by 67% and have only 127 columns.

But note that it took about 9 minutes to complete the algorithm with only 756 rows and 387 columns.

Clusters (k-means clustering)

This algorithm is a type of unsupervised learning: it can group attributes by their similarities without requiring labeled data.

The idea of using k-means is very simple: it identifies the centroid of each cluster in such a way that the attributes within a cluster are similar to each other, which gives us a way to pick the most suitable and important parameters from each cluster.

For example, in the figure above we see 30 clusters, each of them containing many attributes. From each cluster we can take, for example, only 4 attributes and thus reduce the number of columns.
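A minimal sketch of this idea: run k-means on the transposed (attribute-wise) data and keep a few attributes per cluster. The 30 clusters and 4 attributes per cluster are illustrative, and the target column is again assumed to be named `class`:

```r
features <- df[, setdiff(names(df), "class"), drop = FALSE]

# Cluster the attributes, not the rows: transpose the scaled feature matrix
set.seed(42)
km <- kmeans(t(scale(features)), centers = 30, nstart = 10)

# Naive pick: the first 4 attributes of each cluster
# (better: pick the ones closest to the centroid, or ask domain experts)
keep <- unlist(lapply(split(names(features), km$cluster),
                      function(cols) head(cols, 4)))

df <- df[, c(keep, "class"), drop = FALSE]
```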

k-means is also a very powerful method, but one has to determine the best number of attributes to keep from each cluster, which is not an easy task. In this case, one can use a non-analytical approach and discuss the best set of features from the k-means clustering analysis with domain experts (for example, if it is a marketing data set, have a talk with the marketing department; if it is chemical data, talk with the lab technicians, and so on…).

Conclusion

There are many useful techniques for dimensionality reduction that can be used independently or in combination with one another. Additionally, in data science we can combine the benefits of a purely data-driven approach and a non-analytical one.

The basic and advanced techniques described above reflect my personal experience in dealing with high-dimensional data. The data-driven methodology is more prevalent in the baseline techniques, while the results of the more sophisticated techniques can be discussed with experts (the non-analytical approach) to arrive at the best answers, choices, and results.

Sources:

  1. Sakar, C.O., Serbes, G., Gunduz, A., Tunc, H.C., Nizam, H., Sakar, B.E., Tutuncu, M., Aydin, T., Isenkul, M.E. and Apaydin, H. (2018). A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing. DOI: [Web Link]
  2. Reif, David & Motsinger, Alison & Mckinney, Brett & Crowe, James & Moore, Jason. (2006). Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types. 1–8. 10.1109/CIBCB.2006.330987.
  3. Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. doi: 10.1023/A:1010933404324
  4. Kursa, M.B., Jankowski, A., & Rudnicki, W.R. (2010). Boruta — A System for Feature Selection. Fundam. Informaticae, 101, 271–285.
