Dimensional Reduction with LDA

Shih-Chieh Chen
Published in The Startup · Feb 13, 2021

Maybe you are a data scientist given the task of predicting different types of tumor cells based on gene expression or maybe you are looking at data from multiple sensors across multiple time periods in a hydraulic system.

These are just two of many situations where a data scientist is faced with a data set containing massive numbers of features. Not only are these data sets hard to analyze, but there may also be unnecessary variables mixed in that could hinder your analyses.

Problems with higher dimensions

One problem with higher dimensional data is how difficult it is to visualize. We can easily plot a dataset with 2 or 3 features, but with only three dimensions to our world, anything more requires multiple plots. Imagine graphing a data set with 100 features!

Another issue with higher dimensional data is that as the number of features grows, the number of data points required to properly populate the feature space grows with it. To visualize this: with 2 features, your data points only have to populate a 2-dimensional space; with 3 features, the number of points required to populate the space greatly increases. This is further exacerbated in higher dimensions.

Therefore, when you have high dimensional data but not enough data points to populate those dimensions, it is extremely important that your analysis or model utilizes the most important features.

Lastly, not all our features are useful. Sometimes, features are just noise that hinders our analyses and model building. If we can reduce the number of dimensions, or even extract more useful features from the ones we already have, it could greatly improve our analyses and model building.

What is LDA?

LDA, or linear discriminant analysis, is the method of finding features that maximize the variability between 2 or more classes of objects.

For example, imagine that you have collected a series of data with 2 classes, red and blue. Each data point has two features so we can plot them out as a scatter plot.

Now, imagine that we draw a line and project the data points onto that line. The new line becomes a new feature and each data point gets projected onto the line.

While there are many possible lines, only some of them separate the dots well.

In this case, it is the right plot whose line best separates the dots. In practice there are infinitely many candidate lines, and LDA looks for the one that best separates the dots; in other words, it maximizes the variability between the two classes. But maximizing the between-class variability is not enough if the variability within each class is high. So LDA seeks a balance between within-class variability and between-class variability.
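To make this concrete, here is a minimal sketch, assuming scikit-learn and a synthetic two-class, two-feature data set that stands in for the red and blue dots described above:

# Project 2-feature, 2-class toy data onto the single LDA direction.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)  # one discriminant line for two classes
X_1d = lda.fit_transform(X, y)                    # each point projected onto that line

print(lda.scalings_.ravel())  # direction of the discriminant axis
print(X_1d[:5])               # first few projected values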

Mathematical Assumptions of LDA

LDA is a generalization of Fisher’s linear discriminant, so its assumptions are based on those of Fisher’s linear discriminant. The assumptions of LDA are:

  • Multivariate normality: the features follow a multivariate Gaussian distribution within each class
  • Homogeneity of variance-covariance: the variance-covariance matrix is the same across classes
  • Multicollinearity: predictive power decreases as the correlation between features increases
  • Independence: data points are independent of each other and the sample is randomly selected

LDA and Naive Bayes

Based on the assumptions above, we can borrow from Naive Bayes to compute LDA. In Naive Bayes, all variables are independent, so the class densities are products of marginal densities, which makes the posterior probability easy to compute. The formula for calculating the posterior probability is:

$P(\pi_h \mid X) = \dfrac{P(\pi_h)\,P(X \mid \pi_h)}{P(X)}$

The Naive Bayes classification problem is then to find the class h that maximizes

$P(\pi_h)\,P(X \mid \pi_h)$

since the denominator P(X) is the same for every class.

Mathematical Magic behind LDA

Assume we have a population with p dimensions, where the probability density function of every class h follows a Gaussian distribution. The formula for the multivariate Gaussian distribution is:

(1)   $P(X \mid \pi_h) = \dfrac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(X - \mu_h)^{T}\Sigma^{-1}(X - \mu_h)\right)$

Based on the Naive Bayes classification above, we want to find the class that maximizes P(πₕ)P(X|πₕ). Since the log function is increasing, maximizing the log of this product is equivalent and makes the calculation easier, so we take the log transformation:

(2)   $\log\!\big(P(\pi_h)\,P(X \mid \pi_h)\big) = \log P(\pi_h) - \tfrac{p}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(X - \mu_h)^{T}\Sigma^{-1}(X - \mu_h)$

The terms $-\tfrac{p}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma|$ are constant across classes, so we can remove them without affecting the final outcome; we then only need to maximize the rest of function (2):

(3)   $\log P(\pi_h) - \tfrac{1}{2}(X - \mu_h)^{T}\Sigma^{-1}(X - \mu_h)$

Note that the quadratic part of function (3) can be expanded as below:

(4)   $(X - \mu_h)^{T}\Sigma^{-1}(X - \mu_h) = X^{T}\Sigma^{-1}X - 2\,\mu_h^{T}\Sigma^{-1}X + \mu_h^{T}\Sigma^{-1}\mu_h$

Dropping the term $X^{T}\Sigma^{-1}X$, which is the same for every class, gives the linear discriminant function:

(5)   $d_h(X) = \log P(\pi_h) + \mu_h^{T}\Sigma^{-1}X - \tfrac{1}{2}\,\mu_h^{T}\Sigma^{-1}\mu_h$

Notice that the unknown parameters we need for LDA are the prior probability 𝜋ₕ, the population mean 𝜇ₕ, and the variance-covariance matrix Σ.
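In practice these parameters are estimated from the training data. A standard set of plug-in estimates, using $n_h$ for the number of training points in class h, $n$ for the total number of points, and $K$ for the number of classes (notation introduced here for the sketch), is:

$\hat{\pi}_h = \dfrac{n_h}{n}, \qquad \hat{\mu}_h = \dfrac{1}{n_h}\sum_{i:\,y_i = h} x_i, \qquad \hat{\Sigma} = \dfrac{1}{n - K}\sum_{h=1}^{K}\sum_{i:\,y_i = h}(x_i - \hat{\mu}_h)(x_i - \hat{\mu}_h)^{T}$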

LDA applications

Now that we have covered the principle behind LDA, let’s go through some application cases to get a more concrete sense of what it can do. The applications of LDA fall mainly into four areas: data visualization, classification, feature extraction, and data compression (code: GitHub).

LDA for data visualization

Dataset

The dataset here we used is UCI Iris Data Set, in which each class refers to a type of iris plant. Four features were measured from each sample: the length and the width of the sepals and petals.

When a dataset has more than three features, we can normally only plot 2D or 3D views, so we need multiple plots to show all the features, which is unintuitive. As shown below, in the case of the Iris Data Set we need six 2D plots and four 3D plots to cover all the features.

Experiments

Now we apply LDA to reduce the dimensionality of the Iris Data Set from four to two, so the data can be displayed in a single 2D plot. As we can see from the plot, the result of LDA is fairly good: data points from different classes are well separated.
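A short sketch of this step, assuming scikit-learn’s bundled copy of the Iris data and matplotlib for the scatter plot:

# Reduce the four Iris features to two LDA components and plot them.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(iris.data, iris.target)

for label, name in enumerate(iris.target_names):
    plt.scatter(X_lda[iris.target == label, 0], X_lda[iris.target == label, 1], label=name)
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.legend()
plt.show()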

LDA for classification

Dataset

Here we used a part of MNIST, a large database of handwritten digits.

Experiments

As we mentioned in the data visualization part, unlike a random projection, the LDA projection keeps data points from different classes separated. Taking advantage of this strength, we can use LDA for classification tasks such as handwritten digit recognition, with the help of a simple classifier like k-Nearest Neighbors (KNN).
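A hedged sketch of that pipeline, using scikit-learn’s small built-in digits data set as a stand-in for the MNIST subset used here:

# Use LDA to reduce the 64 pixel features, then classify with KNN.
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=9),  # at most n_classes - 1 = 9 components
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))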

LDA for feature extraction

Dataset

In this case, we used UCI Image Segmentation Data Set. The images were hand-segmented to create a classification for every pixel.

Experiments

There are 19 features in this dataset. Firstly, let’s train the Random Forest model with all the features.

applyrandomforest(trainX, testX, trainY, testY)

Time Elapsed: 2.372649990000001 secs
Classification Report after applying Random Forest:
----------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       0.97      1.00      0.98        30
           2       0.97      1.00      0.98        30
           3       1.00      1.00      1.00        30
           4       1.00      1.00      1.00        30
           5       1.00      1.00      1.00        30
           6       1.00      0.93      0.97        30

    accuracy                           0.99       210
   macro avg       0.99      0.99      0.99       210
weighted avg       0.99      0.99      0.99       210
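The applyrandomforest helper is not shown in the article; the body below is a plausible sketch of what it might do (the timing and report formatting mirror the output above, while the hyperparameters are assumptions):

# Hypothetical implementation of the applyrandomforest helper used in this section.
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def applyrandomforest(trainX, testX, trainY, testY):
    start = time.process_time()
    forest = RandomForestClassifier(n_estimators=100, random_state=42)  # assumed settings
    forest.fit(trainX, trainY)
    predictions = forest.predict(testX)
    print(f"Time Elapsed: {time.process_time() - start} secs")
    print("Classification Report after applying Random Forest:")
    print("----------------------------------------------------")
    print(classification_report(testY, predictions))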

Aiming to train a Random Forest model that performs well on the classification task, do we really need all 19 features? If not, how many do we need at least? LDA can help us find the answer. The explained_variance_ratio_ attribute of scikit-learn’s LinearDiscriminantAnalysis, an ndarray of shape (n_components,), gives the percentage of variance explained by each of the selected components; if n_components is not set, all components are kept and the explained variances sum to 1.0. By plotting explained_variance_ratio_, we can see that in this case the original 19 features can be combined into 4 new features that describe the dataset.
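A sketch of that check, assuming the segmentation features and labels are already loaded into trainX and trainY (the variable names are reused from the article’s snippets):

# Fit LDA with n_components unset so every component is kept,
# then inspect how much variance each component explains.
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(trainX, trainY)

ratios = lda.explained_variance_ratio_
plt.bar(range(1, len(ratios) + 1), ratios)
plt.xlabel("LDA component")
plt.ylabel("explained variance ratio")
plt.show()

Once the plot shows how many components carry almost all of the variance, the data can be transformed with, for example, LinearDiscriminantAnalysis(n_components=4).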

From the histograms, we can see that the features generated by LDA roughly follow a normal distribution.

Then, we trained the Random Forest model with the new dataset. From the report, the performance with the new features was as good as with the original dataset.

applyrandomforest(trainX_lda, testX_lda, trainY, testY)

Time Elapsed: 1.9859107740000006 secs
Classification Report after applying Random Forest:
----------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       1.00      1.00      1.00        30
           2       0.94      1.00      0.97        30
           3       1.00      1.00      1.00        30
           4       1.00      1.00      1.00        30
           5       1.00      1.00      1.00        30
           6       1.00      0.93      0.97        30

    accuracy                           0.99       210
   macro avg       0.99      0.99      0.99       210
weighted avg       0.99      0.99      0.99       210

LDA for data compression

As the feature extraction example above shows, since LDA reduces the number of features, it also compresses the dataset. The more features it removes, the more space it saves.

In the Image Segmentation Data Set, LDA reduced the data size by 73.68% without affecting the classification performance.
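One way to check the saving on your own copy of the data (trainX and trainY are assumed to be loaded as before; the exact percentage depends on how many components you keep and on the dtype):

# Compare the memory footprint of the original features and the LDA features.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.asarray(trainX, dtype=np.float64)   # original 19-feature matrix
X_lda = LinearDiscriminantAnalysis(n_components=4).fit_transform(X, trainY)

saved = 1 - X_lda.nbytes / X.nbytes        # fraction of storage saved at equal dtype
print(f"size reduced by {saved:.2%}")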

How Does LDA Compare?

When talking about dimensional reduction, one of the most popular methods is PCA. So how does LDA compare to PCA?

PCA

First, let's do a quick refresher on PCA. The main idea of PCA is a linear mapping. Say we have a data set with n features; in other words, a dataset in an n-dimensional space. PCA linearly maps that n-dimensional space to a lower-dimensional space.

The process of PCA is to sequentially find a set of mutually orthogonal coordinate axes in the original space. The choice of the new coordinate axes is closely tied to the data itself.

The first coordinate axis selected is the direction with the largest variance in the original data. From the plot below, we can see that the red line does just that. It is spanning across the space that contains the most variance in our data. This is the first coordinate axis.

The second coordinate axis is selected to maximize the variance in the plane orthogonal to the first coordinate axis. In our two dimensional space, this means that the second axis is perpendicular to the first axis. The second axis spans across the space that contains the most variance while being orthogonal to the first axis. Similar steps are taken when dealing with higher dimensional datasets.

For any given data set, n such coordinate axes can be obtained. Looking at the new coordinate axes, we will find that some of them contain most of the variance while others contain little to no variance. Therefore, we can focus on the axes that contain the most variance. In other words, we retain the dimensions that best explain the data and ignore the ones that explain little to nothing.
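A brief sketch of this idea on the Iris data, showing how the variance is distributed across the new axes:

# Fit PCA with one axis per original dimension and inspect the variance per axis.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # the first axis dominates, the later ones add little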

LDA VS PCA

So how does LDA compare? PCA and LDA have their similarities: they both use matrix eigendecomposition to generate their axes, and they both assume that the data follow a Gaussian distribution.

They also differ in a number of ways that set them apart.

Overall, LDA usually stands as an algorithm in its own right: given the training data, it produces a set of discriminant functions that can then be used to predict new inputs. PCA is more of a preprocessing method, which reduces the dimensionality of the original data while maximizing the variance along each retained dimension. The picture below shows a comparison of LDA and PCA.
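The scikit-learn example linked in the references produces this kind of side-by-side picture; a condensed sketch of it:

# Side-by-side 2D projections of Iris: unsupervised PCA vs supervised LDA.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, data, title in [(axes[0], X_pca, "PCA"), (axes[1], X_lda, "LDA")]:
    ax.scatter(data[:, 0], data[:, 1], c=y)
    ax.set_title(title)
plt.show()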

LDA is usually a good way to transform data, and its output can be visualized with scatter plots, but the way it computes differences between classes limits its usefulness. There is therefore a class of visualization algorithms called manifold learning algorithms, which allow more complex mappings and often provide better visualizations. One particularly useful method is t-distributed stochastic neighbor embedding (t-SNE). If we apply t-SNE to n-dimensional data, it intelligently maps the data to 3D or even 2D while preserving the relative similarity of the original points very well.

t-SNE

Now let us talk about what t-SNE is. t-SNE is a non-linear feature extraction and data visualization method proposed by van der Maaten and Hinton in 2008. It is widely used in image processing, natural language processing, and speech processing.

The working principle of t-SNE is as follows. First, it creates a probability distribution by selecting a data point and computing its Euclidean distance to the other data points: points near the selected point get a high similarity value, while points farther away get a low one. Doing this for every data point yields a similarity matrix (S1), and the similarity distances are converted into joint probabilities according to a normal distribution.

t-SNE then randomly places all the data points in the desired lower dimension and performs the same calculations on the low-dimensional points, but this time it assigns probabilities according to a t-distribution; the t-distribution is used to reduce the crowding problem (remember, for the high-dimensional data the probabilities are based on a normal distribution). This yields a second similarity matrix (S2). Finally, the algorithm compares S1 and S2 and minimizes the difference between them by running gradient descent with the Kullback-Leibler divergence between the two distributions as the loss function. Minimizing the KL divergence with respect to the positions of the points helps t-SNE preserve the local structure of the data, so the resulting low-dimensional points remain relatively similar to the original high-dimensional data.
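In code, all of these steps are wrapped inside scikit-learn’s TSNE estimator; a minimal usage sketch on the digits data (the perplexity and random_state values are arbitrary choices):

# Map the 64-dimensional digits data down to 2D with t-SNE for visualization.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")  # labels are only used for coloring; t-SNE itself is unsupervised
plt.show()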

LDA VS t-SNE

Unlike PCA, t-SNE works well on data sets with both linear and non-linear cluster structure and produces more meaningful clusters. On the other hand, the computational complexity of t-SNE is much higher than that of PCA. The figure below compares the results of LDA and t-SNE on the same data set: the data points are plotted using either LDA- or t-SNE-generated features and are drawn as the numbers 1 to 5 to represent their labels. Looking at the plot, LDA can only separate the data points of labels 3 and 5, while the other points are clustered in the top-left corner. t-SNE, on the other hand, shows remarkable results. It is worth noting that t-SNE does not require class labels and is completely unsupervised, yet it still finds a clear two-dimensional representation of the classes based on neighboring points in the original space.

Conclusion

Dimensionality reduction is an important tool that all data scientists should learn to exploit. It can help eliminate many of the problems associated with high-dimensional data sets and can extract informative features that improve a model's predictive power. There are all sorts of dimensionality reduction methods, each with its pros and cons. What is most important is to understand how the methods compare with one another and to pick the right one for the job.

References

Comparison of LDA and PCA 2D projection of Iris dataset
Manifold learning on handwritten digits: Locally Linear Embedding, Isomap…
Feature Extraction Techniques: PCA, LDA and t-SNE
9.2 — Discriminant Analysis (psu.edu)
Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA
PCA vs LDA vs T-SNE — Let’s Understand the difference between them!
