DATA SCIENCE THEORY | DIMENSIONALITY REDUCTION | KNIME ANALYTICS PLATFORM

Three More Techniques for Data Dimensionality Reduction in ML

Implementing LDA, Neural Autoencoder, and t-SNE in a codeless fashion

Maarit Widmann
Low Code for Data Science

--

As first published in The New Stack. Co-author: Rosaria Silipo


The big data explosion has convinced us that more is better. While it is of course true that a large amount of training data helps a machine learning model to learn more rules and generalize better to new data, it is also true that an indiscriminate addition of low-quality data and input features might introduce too much noise and, at the same time, considerably slow down the training algorithm.

So, in the presence of a dataset with a very high number of data columns, it is good practice to ask how many of these features are actually informative for the model. A number of data-dimensionality reduction techniques are available to estimate how informative each column is and, if needed, to remove it from the dataset.

In “Seven Techniques for Data Dimensionality Reduction”, we provided a review of the seven most commonly used techniques for data-dimensionality reduction, including:

  • Ratio of missing values
  • Low variance in the column values
  • High correlation between two columns
  • Principal component analysis (PCA)
  • Candidates and split columns in a Random Forest
  • Backward feature elimination
  • Forward feature construction

Those are traditional techniques, commonly applied to reduce the dimensionality of a dataset by removing all of the columns that bring either little information or no new information. Since then, we have started to use three additional techniques, also quite commonly used, and have decided to add them to the list as well:

  1. Linear discriminant analysis (LDA)
  2. Neural autoencoder
  3. t-distributed stochastic neighbor embedding (t-SNE)

The Dataset

In our first review of data dimensionality reduction techniques, we used the two datasets from the 2009 KDD Challenge — the large dataset and the small dataset. The particularity of the large dataset is its very high dimensionality with 15,000 data columns. Most data mining algorithms are implemented column-wise, which makes them slower and slower as the number of data columns increases. This dataset definitely brings out the slowness of a number of machine learning algorithms.

The 2009 KDD Challenge small dataset is definitely lower dimensional than the large dataset but is still characterized by a considerable number of columns: 230 input features and three possible target features. The number of data rows is the same as in the large dataset: 50,000. In this review, for computational reasons, we will focus on the small dataset to show just how effective the proposed techniques are in reducing dimensionality. The dataset is big enough to prove the point in data-dimensionality reduction and small enough to do so in a reasonable amount of time.
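Although the article itself implements everything codelessly in KNIME, the small dataset is easy to inspect with a few lines of Python. The sketch below assumes the tab-separated files of the public KDD Cup 2009 release (adjust file names and header handling to your copy) and is only meant to fix the dimensions mentioned above:

    import pandas as pd

    # 230 anonymized input features, 50,000 rows; missing values appear as empty cells
    X = pd.read_csv("orange_small_train.data", sep="\t")

    # One of the three possible targets (appetency, churn, up-selling); here: churn
    y = pd.read_csv("orange_small_train_churn.labels", header=None, names=["churn"])

    print(X.shape)                    # expected: (50000, 230)
    print(y["churn"].value_counts())  # labels are encoded as -1 / +1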

Let’s proceed now with the (re)implementation and comparison of 10 state-of-the-art dimensionality reduction techniques, all currently available and commonly used in the data analytics landscape.

Three More Techniques for Data Dimensionality Reduction

Let’s implement the three newly added techniques:

  1. Linear Discriminant Analysis (LDA)
  2. Neural autoencoder
  3. t-distributed Stochastic Neighbor Embedding (t-SNE)

Linear Discriminant Analysis (LDA)

A number m of linear combinations (discriminant functions) of the n input features, with m < n, are produced to be uncorrelated and to maximize class separation. These discriminant functions become the new basis for the dataset. All numeric columns in the dataset are projected onto these linear discriminant functions, effectively moving the dataset from an n-dimensional to an m-dimensional space.

In order to apply the LDA technique for dimensionality reduction, the target column has to be selected first. The maximum number of reduced dimensions m is the number of classes in the target column minus one, or if smaller, the number n of numeric columns in the data. Notice that linear discriminant analysis assumes that the target classes follow a multivariate normal distribution with the same variance but with a different mean for each class.
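In KNIME this is a codeless operation; purely as an illustration, an equivalent sketch with scikit-learn, reusing the X and y loaded above and assuming churn as the binary target, could look like this:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.impute import SimpleImputer

    # LDA needs numeric, non-missing inputs; mean imputation is a simple (illustrative) choice
    X_num = SimpleImputer(strategy="mean").fit_transform(X.select_dtypes("number"))

    # n_components is capped at min(number of classes - 1, n);
    # for a binary target this means a single discriminant function
    lda = LinearDiscriminantAnalysis(n_components=1)
    X_reduced = lda.fit_transform(X_num, y["churn"])

    print(X_reduced.shape)  # (50000, 1)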

Autoencoder

An autoencoder is a neural network with as many output units (n) as input units and at least one hidden layer with m units, where m < n, trained with the backpropagation algorithm to reproduce the input vector on the output layer. It reduces the numeric columns in the data by using the output of the hidden layer to represent the input vector.

The first part of the autoencoder — from the input layer to the hidden layer of m units — is called the encoder. It compresses the n dimensions of the input dataset into an m-dimensional space. The second part of the autoencoder — from the hidden layer to the output layer — is known as the decoder. The decoder expands the data vector from an m-dimensional space into the original n-dimensional dataset and brings the data back to their original values.
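The workflow builds this network codelessly via the KNIME Keras integration. Purely as a sketch, a stand-alone Keras equivalent (with illustrative layer sizes and training settings, continuing from the imputed X_num above) might look as follows:

    from tensorflow import keras
    from tensorflow.keras import layers

    n = X_num.shape[1]   # number of (imputed) numeric input features
    m = 16               # size of the bottleneck layer, m < n

    # Encoder: compress the n inputs into m hidden units
    inputs = keras.Input(shape=(n,))
    encoded = layers.Dense(m, activation="relu")(inputs)

    # Decoder: reconstruct the n inputs from the m hidden units
    decoded = layers.Dense(n, activation="linear")(encoded)

    autoencoder = keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")

    # Train the network to reproduce its own (normalized) input on the output layer
    X_scaled = (X_num - X_num.mean(axis=0)) / (X_num.std(axis=0) + 1e-9)
    autoencoder.fit(X_scaled, X_scaled, epochs=20, batch_size=256, verbose=0)

    # The encoder alone yields the reduced, m-dimensional representation
    encoder = keras.Model(inputs, encoded)
    X_reduced = encoder.predict(X_scaled)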

t-distributed Stochastic Neighbor Embedding (t-SNE)

This technique reduces the n numeric columns in the dataset to fewer dimensions m (m < n) based on nonlinear local relationships among the data points. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points in the new lower dimensional space.

In the first step, the data points are modeled through a multivariate normal distribution of the numeric columns. In the second step, this distribution is replaced by a lower dimensional t-distribution, which follows the original multivariate normal distribution as closely as possible. The t-distribution gives the probability of picking another point in the dataset as a neighbor to the current point in the lower dimensional space. The perplexity parameter controls the density of the data as the “effective number of neighbors for any point.” The greater the value of the perplexity, the more global structure is considered in the data. The t-SNE technique works only on the current dataset. It is not possible to export the model to apply it to new data.
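Again as an illustration outside the codeless workflow, the scikit-learn version is essentially one call to fit_transform; the absence of a separate transform step mirrors the fact that the embedding cannot be applied to new data (the perplexity below is just the library default):

    from sklearn.manifold import TSNE

    # Reduce the (scaled) numeric columns to m = 2 dimensions, e.g., for visualization;
    # perplexity is the "effective number of neighbors" discussed above
    tsne = TSNE(n_components=2, perplexity=30.0, random_state=42)
    X_embedded = tsne.fit_transform(X_scaled)

    print(X_embedded.shape)  # (50000, 2) -- there is no .transform() for unseen rows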

Comparison in Terms of Accuracy and Reduction Rate

We implemented all 10 dimensionality reduction techniques described in “Seven Techniques for Data Dimensionality Reduction” and in this article, applying them to the small dataset of the 2009 KDD Cup corpus. Finally, we compared them in terms of reduction rate and classification accuracy. For dimensionality reduction techniques that are based on a threshold, the optimal threshold was selected by an optimization loop.
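As a sketch of what such an optimization loop does, the snippet below sweeps the threshold of the missing value ratio filter and keeps the value with the best hold-out accuracy of a simple model; the threshold grid, the 70/30 split, and the decision tree are illustrative choices, not those of the KNIME workflow:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X_num_raw = X.select_dtypes("number")    # numeric columns, missing values included
    missing_ratio = X_num_raw.isna().mean()  # fraction of missing cells per column
    target = y["churn"]

    def accuracy_at(threshold):
        """Keep the columns below the missing value threshold, impute, and score a simple model."""
        kept = X_num_raw.loc[:, missing_ratio <= threshold]
        if kept.shape[1] == 0:               # nothing left to train on
            return 0.0
        X_imp = SimpleImputer(strategy="mean").fit_transform(kept)
        X_tr, X_te, y_tr, y_te = train_test_split(X_imp, target, test_size=0.3, random_state=0)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        return accuracy_score(y_te, model.predict(X_te))

    best_acc, best_thr = max((accuracy_at(t), t) for t in np.arange(0.1, 1.0, 0.1))
    print(f"best accuracy {best_acc:.3f} at missing value threshold {best_thr:.1f}")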

For some techniques, final accuracy and degradation depend on the selected classification model. Therefore, the classification model is chosen from a bag of three basic models as the best-performing model:

  • Multilayer feedforward neural networks
  • Naive Bayes
  • Decision tree

For such techniques, the final accuracy is obtained by applying all three classification models to the reduced dataset and adopting the one that performs best.
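A code equivalent of this selection step, assuming accuracy on a 30% hold-out partition as the criterion and X_reduced as any of the reduced datasets from the sketches above, could be:

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X_tr, X_te, y_tr, y_te = train_test_split(X_reduced, y["churn"], test_size=0.3, random_state=0)

    # The bag of three basic classification models
    candidates = {
        "multilayer feedforward neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
        "naive Bayes": GaussianNB(),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }

    # Train every candidate on the reduced dataset and keep the best-performing one
    scores = {}
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, model.predict(X_te))

    best_name = max(scores, key=scores.get)
    print(f"best model: {best_name} (accuracy {scores[best_name]:.3f})")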

Overall accuracy and area under the curve (AUC) statistics are reported for all techniques in Table 1. We compare these statistics with the performance of the baseline algorithm that uses all columns for classification.

Table 1: Number of input columns, reduction rate, overall accuracy, and AUC value for the 7 + 3 dimensionality reduction techniques, based on the best classification model trained on the 2009 KDD Challenge small dataset.

A graphical comparison of the accuracy of each reduction technique is shown in Figure 1 below. Here all reduction techniques are reported on the x-axis and the corresponding classification accuracy on the y-axis, as obtained from the best-performing model of the three basic models proposed above.

Fig. 1: Accuracies of the best-performing models trained on the datasets that were reduced using the 10 selected data dimensionality reduction techniques.

The receiver operating characteristic (ROC) curves in Figure 2 show a group of best-performing techniques: missing value ratio, high correlation filter and the ensemble tree methods.

Fig. 2: ROC curves showing the performance of the best classification model trained on the reduced datasets; each dataset was reduced by a different dimensionality reduction technique.

Implementation of the 7+3 Techniques

The workflow that implements and compares the 10 dimensionality reduction techniques described in this review is shown in Figure 3. In the workflow, we see 10 parallel branches plus one at the top: each of the 10 lower branches implements one of the described data-dimensionality reduction techniques, while the top branch trains the bag of classification models on the whole original dataset with all 230 input features as the baseline.

Each workflow branch produces the overall accuracy and the probabilities for the positive class by the best-performing classification model trained on the reduced dataset. Finally, the positive class probabilities and actual target class values are used to build the ROC curves, and a bar chart visualizes the accuracies produced by the best-performing classification model for each dataset.
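Outside of KNIME, the ROC comparison can be drawn from the positive class probabilities with scikit-learn and matplotlib; the sketch below reuses the best model and the test partition from the selection sketch above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # Positive class probabilities of the best-performing model on the test partition
    best_model = candidates[best_name]
    proba = best_model.predict_proba(X_te)[:, 1]

    fpr, tpr, _ = roc_curve(y_te, proba, pos_label=1)
    auc = roc_auc_score(y_te, proba)

    plt.plot(fpr, tpr, label=f"best model (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()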

You can inspect and download the workflow from the KNIME Hub.

Fig. 3: Implementation of the ten selected dimensionality reduction techniques. Each branch of this workflow outputs the overall accuracy and positive class probabilities produced by the best-performing classification model. An ROC curve and a bar chart then compare the performance of the classification models trained on the reduced datasets. The workflow can be downloaded and inspected from the KNIME Hub.

Summary and Conclusions

In this article, we have presented a review of 10 popular techniques for data dimensionality reduction. We have expanded a previously published article describing seven of them (ratio of missing values, low variance in the values of a column, high correlation between two columns, principal component analysis (PCA), candidates and split columns in a Random Forest, backward feature elimination, forward feature construction) with three additional techniques.

We trained a few basic machine learning models on the reduced datasets and compared the best-performing ones with each other via reduction rate, accuracy and area under the curve.

Notice that dimensionality reduction is not only useful to speed up algorithm execution but also to improve model performance.

In terms of overall accuracy and reduction rate, the Random Forest based technique proved to be the most effective in removing uninteresting columns and retaining most of the information for the classification task at hand. Of course, the evaluation, reduction and consequent ranking of the ten described techniques are applied here to a classification problem; we cannot generalize to effective dimensionality reduction for numerical prediction or even visualization.

Some of the techniques used in this article are complex and computationally expensive. However, as the results show, even just counting the number of missing values, measuring the column variance and the correlation of pairs of columns — and combining them with more sophisticated methods — can lead to a satisfactory reduction rate while keeping performance unaltered with respect to the baseline models.
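To give an idea of how lightweight those simple measures are outside of the codeless KNIME implementation, a minimal pandas sketch of the three filters could look like the following; the thresholds are purely illustrative, not those used in our evaluation:

    import numpy as np
    import pandas as pd

    def simple_filters(df, missing_thr=0.4, variance_thr=0.01, corr_thr=0.9):
        """Return the columns kept by the three simplest dimensionality reduction filters."""
        numeric = df.select_dtypes(include=[np.number])

        # 1. Ratio of missing values: drop columns with too many missing cells
        keep = numeric.columns[numeric.isna().mean() <= missing_thr]

        # 2. Low variance: drop columns whose normalized variance is negligible
        col_range = numeric[keep].max() - numeric[keep].min()
        normalized = (numeric[keep] - numeric[keep].min()) / col_range
        keep = normalized.columns[normalized.var() > variance_thr]

        # 3. High correlation: drop one column of every highly correlated pair
        corr = numeric[keep].corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [col for col in upper.columns if (upper[col] > corr_thr).any()]
        return [col for col in keep if col not in to_drop]

    kept_columns = simple_filters(X)   # X as loaded from the KDD Cup 2009 small dataset above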

Indeed, in the era of big data, when more is axiomatically better, we have rediscovered that too many noisy or even faulty input data columns often lead to unsatisfactory model performance. Removing uninformative — or even worse — misinformative input columns might help to train a machine learning model on more general data regions, with more general classification rules, and overall with better performances on new, unseen data.

--

Maarit Widmann
Low Code for Data Science

I am a data scientist on the evangelism team at KNIME, the author behind the KNIME self-paced courses, and a teacher at KNIME.