Linear Discriminant Analysis

Ana Belén Manjavacas
5 min read · Aug 21, 2023


Linear Discriminant Analysis, or LDA, is another dimensionality reduction technique for feature extraction. LDA operates in classification problems, where the objective of the feature extraction transformation is to highlight the discriminative class information in a lower-dimensional space.

Photo by the author

The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the dispersion within the classes.

Let’s assume we have a dataset with n data points and d features. Each data point belongs to one of k classes.

The mean for each class:
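$$\mu_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i, \qquad j = 1, \dots, k$$

where nⱼ is the number of points in class Cⱼ.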

Then we will need to calculate two scatter matrices.

The Within-Class Scatter Matrix S𝓌 measures the spread of the data within each class and is computed as the sum of the individual scatter matrices of each class:
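$$S_W = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^T$$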

The Between-Class Scatter Matrix SB quantifies the separation between the class means and is computed as the sum of the outer products of the differences between the class means and the overall mean, weighted by the class sizes:
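$$S_B = \sum_{j=1}^{k} n_j \, (\mu_j - \mu)(\mu_j - \mu)^T$$

where μ is the overall mean of the data.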

We are looking for a projection that maximizes inter-class dispersion and minimizes intra-class dispersion.

Using Fisher’s criterion, the optimal projection directions are found by solving an optimization problem:
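$$W^{*} = \arg\max_{W} \frac{\lvert W^T S_B W \rvert}{\lvert W^T S_W W \rvert}$$

which leads to the generalized eigenvalue problem

$$S_W^{-1} S_B \, w = \lambda \, w$$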

where λ are the eigenvalues and the eigenvectors w become the columns of W.

The number of non-zero eigenvalues is at most (k − 1), where k is the number of classes. These eigenvalues represent the directions of maximum class separability.

Next, we sort the eigenvectors by their corresponding eigenvalues in descending order and choose the top (k − 1) eigenvectors to form the transformation matrix W.

Multiplying the original data matrix X (dimensions (n x d)) by the transformation matrix W (dimensions (d x (k − 1))) gives the transformed data matrix X(LDA) (dimensions (n x (k − 1))).

The transformed data X(LDA) is the reduced representation obtained by LDA, which maximizes the class separability while minimizing the number of dimensions. This new representation can then be used for classification or visualization purposes in supervised learning tasks.
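To make the recipe concrete, here is a rough NumPy sketch of the whole procedure (the function and variable names are illustrative, not part of the example that follows later):

```python
import numpy as np

def lda_transform(X, y, n_components=None):
    """Project X onto the top discriminant directions (plain NumPy sketch)."""
    classes = np.unique(y)
    n, d = X.shape
    mu = X.mean(axis=0)                        # overall mean

    S_W = np.zeros((d, d))                     # within-class scatter
    S_B = np.zeros((d, d))                     # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        S_W += (X_c - mu_c).T @ (X_c - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)

    # Generalized eigenvalue problem: S_W^{-1} S_B w = lambda w
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]     # sort by decreasing eigenvalue
    if n_components is None:
        n_components = len(classes) - 1        # at most (k - 1) useful directions
    W = eigvecs[:, order[:n_components]].real  # transformation matrix, d x (k - 1)
    return X @ W                               # transformed data, n x (k - 1)
```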

LDA limitations

  • LDA fails when the discriminatory information is not in the mean but in the variance of the data.
  • It requires that S𝓌 is nonsingular, which means the number of data points must be large compared to the number of features (at least n − k ≥ d); otherwise S𝓌 cannot be inverted and some form of regularization is needed.
  • LDA is a parametric method as it implicitly assumes unimodal Gaussian distributions. If the distributions deviate from being Gaussian, LDA projections may not be able to preserve any complex structure in the data, which could be necessary for classification.
  • LDA produces at most (k − 1) projected features. If the estimated classification error is too high, we will need more features, and we will have to use another method that provides those additional features.

Python Example

We are going to use the wine dataset. We’ll start by loading the packages and the data.
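A minimal version of this step, using load_wine from scikit-learn and pandas, could look like this:

```python
import pandas as pd
from sklearn.datasets import load_wine

# Load the wine dataset into a DataFrame and keep the class labels in a 'target' column
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
```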

We’ll follow with some information about the data and descriptions.
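A couple of standard pandas calls cover this:

```python
# Basic structure and summary statistics of the dataset
df.info()
print(df.describe())
```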

We separate the target column from the rest of the predictor variables.
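Assuming the labels were stored in a target column as above:

```python
# Predictor variables and target
X = df.drop(columns='target')
y = df['target']
```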

Now we are going to calculate the mutual information between the features and the target variable using mutual_info_classif from scikit-learn’s feature_selection module.

This function is commonly used in feature selection to estimate the mutual information between each feature and the target variable. Mutual information measures the dependence between two variables and indicates how much information about one variable can be obtained from the other.

These scores can be used to determine the relevance and importance of each feature in predicting the target variable, aiding in feature selection or model building. Then we’ll obtain a ranked list of features based on their mutual information scores.
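A sketch of this ranking, using the X and y defined above, could be:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the (categorical) target
mi_classif = mutual_info_classif(X, y, random_state=0)

# Rank the features by decreasing mutual information
order = np.argsort(mi_classif)[::-1]
for idx in order:
    print(f"{idx:2d}  {mi_classif[idx]:.3f}  {X.columns[idx]}")
```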

Now we are going to repeat the same analysis but using mutual_info_regression instead. It serves the same purpose but is more suitable for regression problems; we’ll check it just in case we are missing any information.
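The regression variant only changes the estimator:

```python
from sklearn.feature_selection import mutual_info_regression

# Same idea, but treating the target as a continuous variable
mi_regress = mutual_info_regression(X, y, random_state=0)

order = np.argsort(mi_regress)[::-1]
for idx in order:
    print(f"{idx:2d}  {mi_regress[idx]:.3f}  {X.columns[idx]}")
```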

The outcomes are really similar, but we can see that the variable proline is more important in the classification approach than in the regression one.

We’ll create a few plots to see if the results from the test make sense.
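For example, a scatter plot of two of the top-ranked features, colored by class (the exact pair to plot depends on the scores obtained above):

```python
import matplotlib.pyplot as plt

# Two of the highest-scoring features, colored by class
fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(X['flavanoids'], X['color_intensity'], c=y, cmap='viridis', alpha=0.8)
ax.set_xlabel('flavanoids')
ax.set_ylabel('color_intensity')
ax.legend(*scatter.legend_elements(), title='class')
plt.show()
```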

We will do the same but using the chi-square test instead. We sort the chi-squared statistic scores in descending order and display the index, the chi-squared statistic, and the name of each feature. Features with higher chi-squared statistics have a stronger association with the target variable.
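A sketch with chi2 from sklearn.feature_selection (which requires non-negative features, as the wine variables are):

```python
from sklearn.feature_selection import chi2

# Chi-squared statistic between each feature and the target
chi2_scores, p_values = chi2(X, y)

order = np.argsort(chi2_scores)[::-1]
print(order)
for idx in order:
    print(f"{idx:2d}  {chi2_scores[idx]:10.2f}  {X.columns[idx]}")
```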

Linear Discriminant Analysis (LDA) in Python with sklearn
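A minimal fit could look like this; with k = 3 classes the wine data allows at most 2 discriminant components:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fit LDA and project the data onto the (k - 1) = 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                     # (178, 2)
print(lda.explained_variance_ratio_)   # share of between-class variance per component
```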

Subsequently, we will create plots to visualize how the components separate the data based on the different classes.
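For example, plotting the two discriminant components obtained above:

```python
# Scatter plot of the two LDA components, colored by class
fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis', alpha=0.8)
ax.set_xlabel('LD1')
ax.set_ylabel('LD2')
ax.legend(*scatter.legend_elements(), title='class')
plt.show()
```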

With this we can also verify that LDA is invariant to the scale of the features; a sketch of such a check is shown below.
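One way to check this, using the X, y and X_lda from the previous steps, is to standardize the features, refit, and compare the two projections, which should agree up to the sign of each component:

```python
from sklearn.preprocessing import StandardScaler

# Refit LDA on standardized features and compare with the unscaled projection
X_scaled = StandardScaler().fit_transform(X)
X_lda_scaled = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

for j in range(2):
    corr = np.corrcoef(X_lda[:, j], X_lda_scaled[:, j])[0, 1]
    print(f"component {j + 1}: |correlation| = {abs(corr):.4f}")
```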

Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method that uses the known class labels.

References

  1. G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning. Springer Texts in Statistics. 2013.
  2. T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics.
