Linear discriminant analysis

Reduce dimensionality and increase class separation

Risto Hinno
Nerd For Tech
5 min read · May 1, 2021


Intro

Linear discriminant analysis (LDA) is a rather simple method for finding a linear combination of features that distinctively characterizes members of the same class while separating different classes (source). This tutorial gives a brief motivation for using LDA, shows the steps for calculating it, and implements the calculations in Python. Examples are available here.

LDA

Briefly, LDA tries to achieve two things simultaneously:

  • group samples from the same class as close together as possible (reduce within-class variance)
  • separate samples from different classes as far as possible (increase between-class variance)

This property is useful because it lets us cluster data, classify samples, and/or reduce their dimensionality. Another popular dimensionality reduction algorithm, principal component analysis (PCA), cares only about explaining as much of the variance of the whole data as possible, without regard to class membership or to the variance between/within classes.

Figure: comparison of PCA and LDA projections (source)

As seen from the previous graph, PCA would be a good choice for representing the whole data in a lower dimensionality (for example, to save memory or to find the features that describe the most variance), while LDA, on the other hand, is useful for clustering/classification.

Parameter learning

The LDA model consists of a single matrix v which projects the data into a low-dimensional space that has maximum between-class separation and minimum within-class separation. To calculate v we need to (source):

  • compute the within-class and between-class scatter matrices. The example implementation uses a pandas DataFrame. The within-class scatter matrix captures information about the spread of the data within each class: S_W = Σ_c Σ_{x in class c} (x − μ_c)(x − μ_c)ᵀ, where μ_c is the mean of class c.

The between-class scatter matrix captures information about how the classes are spread relative to each other: S_B = Σ_c n_c (μ_c − μ)(μ_c − μ)ᵀ, where μ is the overall mean and n_c is the number of samples in class c. Its Python implementation is similar to that of the within-class scatter matrix; a sketch of both is shown below.
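A minimal sketch of both computations, assuming the features are in a pandas DataFrame X and the labels in a Series y (the names and helper functions here are my own, not the original gist):

```python
import numpy as np
import pandas as pd


def within_class_scatter(X: pd.DataFrame, y: pd.Series) -> np.ndarray:
    """S_W: scatter of each class around its own mean, summed over classes."""
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))
    for c in y.unique():
        X_c = X[y == c].values
        centered = X_c - X_c.mean(axis=0)
        S_W += centered.T @ centered
    return S_W


def between_class_scatter(X: pd.DataFrame, y: pd.Series) -> np.ndarray:
    """S_B: scatter of the class means around the overall mean, weighted by class size."""
    n_features = X.shape[1]
    overall_mean = X.values.mean(axis=0)
    S_B = np.zeros((n_features, n_features))
    for c in y.unique():
        X_c = X[y == c].values
        diff = (X_c.mean(axis=0) - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)
    return S_B
```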

  • compute the eigenvectors and eigenvalues of the scatter matrices. Since we now have information about the spread of the data between and within the classes, we can use it to find a matrix v that maximizes the between-class spread and minimizes the within-class spread. We need to maximize the following criterion (source): J(v) = (vᵀ S_B v) / (vᵀ S_W v).

To find such a v we can solve a generalized eigenvalue problem of the form S_B v = S_W v Λ (source and source for a more detailed derivation),

where:

  • v contains the eigenvectors (as columns)
  • Λ is a diagonal matrix containing the corresponding eigenvalues

And so finding those values reduces to finding the eigenvectors and eigenvalues of the matrix S_W⁻¹ S_B (source).

By finding the eigenvectors, we find the axes of a new subspace where our life gets simpler: the classes are more separated and the data within each class has lower variance.

The computation in Python is straightforward:
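For example, reusing the scatter-matrix helpers sketched above (again an assumption about how the linked example is organized):

```python
import numpy as np

S_W = within_class_scatter(X, y)
S_B = between_class_scatter(X, y)

# Generalized eigenvalue problem S_B v = S_W v Λ, solved via S_W⁻¹ S_B.
# pinv is used instead of inv to guard against a singular S_W.
eigen_values, eigen_vectors = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
```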

  • sort the eigenvalues and select the top n. We'll keep only the most informative axes (n can be given as a parameter to the model). The eigenvalues come in handy for finding the most informative axes:
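A small continuation of the sketch, assuming eigen_values and eigen_vectors from the previous snippet and n_components as the number of axes we want to keep:

```python
import numpy as np

n_components = 2  # model parameter: how many axes to keep (an assumed value)

# np.linalg.eig returns eigenpairs in no particular order,
# so sort by eigenvalue magnitude, largest (most informative) first.
order = np.argsort(np.abs(eigen_values))[::-1]
top_indices = order[:n_components]
```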

Note that the number of useful eigenvalues depends on the number of classes and the number of features: n can be at most min(n_classes - 1, n_features) (source).

  • create a new matrix v containing the eigenvectors that correspond to the top n eigenvalues. This way we simply create a matrix for changing the basis of the data.
  • obtain the new features by taking the dot product of the data and the matrix v. This way the data is transformed into a space that has the desired within-class and between-class separation properties.

A simple implementation to perform LDA:
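A minimal sketch of such an implementation (my own naming, with a scikit-learn-style fit/transform interface; the gist in the linked notebook may differ):

```python
import numpy as np


class SimpleLDA:
    """Minimal LDA: fit learns the projection matrix v, transform applies it."""

    def __init__(self, n_components: int):
        self.n_components = n_components
        self.v = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> "SimpleLDA":
        n_features = X.shape[1]
        overall_mean = X.mean(axis=0)
        S_W = np.zeros((n_features, n_features))
        S_B = np.zeros((n_features, n_features))
        for c in np.unique(y):
            X_c = X[y == c]
            mean_c = X_c.mean(axis=0)
            centered = X_c - mean_c
            S_W += centered.T @ centered              # within-class scatter
            diff = (mean_c - overall_mean).reshape(-1, 1)
            S_B += X_c.shape[0] * (diff @ diff.T)     # between-class scatter
        # Eigen decomposition of S_W⁻¹ S_B (pinv in case S_W is singular).
        eigen_values, eigen_vectors = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
        order = np.argsort(np.abs(eigen_values))[::-1]
        self.v = np.real(eigen_vectors[:, order[: self.n_components]])
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # Project the data onto the learned axes (change of basis).
        return X @ self.v
```

Usage mirrors scikit-learn: lda = SimpleLDA(n_components=2).fit(X_train, y_train) learns v, and lda.transform(X_test) projects new data onto the same axes.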

A more detailed tutorial can be found here. Once we have learned the matrix v, we can use it to transform test data into the lower-dimensional space.

Example

To demonstrate the usefulness of LDA, let's use the wine data set. The full example is visible here.

After reading in the data, I used principal component analysis (PCA) to plot the data using 2 components.
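Roughly along these lines, using scikit-learn's wine loader (the plotting details in the linked notebook may differ):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project the 13 wine features onto the first two principal components.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("component 0")
plt.ylabel("component 1")
plt.title("Wine data PCA in 2 dimensions")
plt.show()
```

Swapping the PCA line for LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y) from sklearn.discriminant_analysis produces the LDA plot shown further below.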

Wine data PCA in 2 dimensions

As we can see from the plot, PCA cares about the total variance: component 0 explains more variance than component 1. It doesn't take into account information about class membership or even the number of classes. It is good for representing most of the data variance. With LDA we get a different picture.

Wine data LDA in 2 dimensions

LDA uses information about the classes and does what it is supposed to do: it reduces the variance inside the classes and increases the distance between the classes. It is now very easy to divide examples from different classes into separate clusters. This may be less useful for representing the variance of the data as a whole.

We can see the usefulness of LDA in classification. If we train a model with and without LDA using multiple seeds for the train-test split, we can evaluate whether LDA increases accuracy on average. Just for comparison, I'll also use PCA in one example to transform the features. The full example is available here.
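A hedged sketch of this comparison, using scikit-learn's LDA transformer and a logistic regression classifier as stand-ins (the linked example may use different models, seeds, and averaging):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
scores = {"without LDA": [], "with LDA": []}

for seed in range(20):  # multiple seeds for the train-test split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pipelines = {
        "without LDA": make_pipeline(StandardScaler(),
                                     LogisticRegression(max_iter=1000)),
        "with LDA": make_pipeline(StandardScaler(),
                                  LinearDiscriminantAnalysis(n_components=2),
                                  LogisticRegression(max_iter=1000)),
    }
    for name, pipe in pipelines.items():
        pipe.fit(X_tr, y_tr)
        scores[name].append(f1_score(y_te, pipe.predict(X_te), average="macro"))

for name, vals in scores.items():
    print(name, round(float(np.mean(vals)), 3))
```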

The average F1-score without LDA is 0.978, and with LDA it is 0.985. It is not a huge increase, but LDA still improves classification performance on average. With PCA the F1-score is much lower: 0.660.

Conclusion

LDA is useful for reducing data dimensionality in a way that increases the separation between different classes. The calculations behind the algorithm are not very complicated. Despite its simplicity, it is still a useful method in data science and can be used, for example, in a classification pipeline.

