Basic intuition of LDA

Priyam Bajpai
Published in Nerd For Tech
Aug 1, 2021 · 5 min read

Whenever we build machine learning models that classify image data, or otherwise have to deal with high-dimensional vectors, computation becomes a barrier to getting timely results. It is therefore natural to reach for algorithms that reduce this computational burden and help us get faster, better results. In this post we will talk about one such classification technique that relies on dimensionality reduction: Linear Discriminant Analysis, or LDA.

Figure: classification of data into clusters (image courtesy: Edureka)

Dimensionality Reduction and LDA

Suppose we are working on a dataset of images, and our aim is to train a model that classifies them into two or more categories. What are these images, essentially? They are high-dimensional vectors, and working with them directly costs a lot of computation and time. Another good example is a model working on human gene data, which usually involves huge amounts of multidimensional information. Reducing the dimensionality not only saves time and resources, but also does away with unnecessary complexity.
Linear Discriminant Analysis (LDA) is a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. The aim is to find an axis (or axes) such that, when the data is projected onto it, the distance between the classes is as large as possible and the scatter (variance) within each class is as small as possible, which is intuitively exactly what a good classification model needs.
LDA does almost the same job as PCA, with some important differences:

  1. PCA is an unsupervised technique whereas LDA is supervised, i.e. it uses class labels for training.
  2. PCA finds an axis that maximizes the variance of the data whereas LDA finds an axis that maximizes the distance between clusters while minimizing the variance in each class.

In short, our aim is to find axes that reduce the dimensionality of a complex dataset and, at the same time, let us separate the data into the desired clusters efficiently. It is worth noting that for a k-class classification problem we can find at most k-1 such axes; for a three-class dataset, only the first two axes exist. We will see why later.
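
As a quick, concrete contrast between the two, here is a minimal sketch using scikit-learn's PCA and LinearDiscriminantAnalysis on the Iris dataset; the dataset choice is purely illustrative. Note that scikit-learn expects samples as rows (n⨉d), unlike the d⨉n convention used in the derivation below.

```python
# Minimal sketch: PCA (unsupervised) vs. LDA (supervised) on a 3-class dataset.
# With k = 3 classes, LDA can produce at most k - 1 = 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 150 samples, 4 features, 3 classes

pca = PCA(n_components=2)                         # ignores y, maximizes variance
X_pca = pca.fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)  # uses y, separates the classes
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)                   # (150, 2) (150, 2), but different axes
```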

Theory and Method

Let us take a binary classification dataset with a matrix X = [x₁ x₂ x₃ … xₙ] of n points, where each point is a vector of dimension d, making X a d⨉n matrix. We also have a 1⨉n matrix Y = [y₁ y₂ y₃ … yₙ] containing the class encodings, i.e. yᵢ ∈ {0, 1}, where 0 and 1 represent the two classes.
Our aim is to project the points onto an axis such that the distance between the classes is maximum and the variance within each class is minimum. Let w₁ be a vector of dimension d⨉1 that will serve as one of our axes. The projected value of x₁ along w₁ is then w₁ᵀx₁, where w₁ᵀ is the transpose of w₁ with dimensions 1⨉d, making the product a scalar. In the same way, we project all the points of X onto w₁.

By definition, the mean of class 0, written μ₀, is (1/n₀)∑xᵢ over the points with yᵢ = 0, where n₀ is the number of class-0 points. Similarly, μ₁ = (1/n₁)∑xᵢ over the points with yᵢ = 1. If we multiply each point by w₁ᵀ, the mean gets multiplied by the same factor, so the projected means are w₁ᵀμ₀ and w₁ᵀμ₁.
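
To make the projection step concrete, here is a small NumPy sketch with made-up data, following the d⨉n convention above (columns of X are points); the names X, y, w1, mu0 and mu1 simply mirror the symbols in the text and are not from the original article.

```python
# Sketch of projecting d-dimensional points onto a single axis w1.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.normal(size=(d, n))                      # d x n data matrix (columns are points)
y = np.repeat([0, 1], n // 2)                    # binary labels in {0, 1}
w1 = rng.normal(size=(d, 1))                     # candidate axis, d x 1

# Class means in the original d-dimensional space
mu0 = X[:, y == 0].mean(axis=1, keepdims=True)   # d x 1
mu1 = X[:, y == 1].mean(axis=1, keepdims=True)   # d x 1

# Each projection w1^T x_i is a scalar; projecting a mean gives w1^T mu
proj = w1.T @ X                                  # 1 x n row of projected values
proj_mu0 = (w1.T @ mu0).item()
proj_mu1 = (w1.T @ mu1).item()
print(proj.shape, proj_mu0, proj_mu1)
```
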
So, now that we have the means of both clusters, our aim is to maximize the distance between them. To avoid sign issues (the difference could be negative), we maximize (w₁ᵀμ₀-w₁ᵀμ₁)² instead. Expanding this:

(w₁ᵀμ₀-w₁ᵀμ₁)² = (w₁ᵀμ₀-w₁ᵀμ₁)ᵀ(w₁ᵀμ₀-w₁ᵀμ₁)
= (μ₀-μ₁)ᵀ(w₁)(w₁ᵀ)(μ₀-μ₁)
= w₁ᵀ(μ₀-μ₁)(μ₀-μ₁)ᵀw₁
= w₁ᵀ Sₑ w₁, where Sₑ = (μ₀-μ₁)(μ₀-μ₁)ᵀ is the between-class scatter (a d⨉d matrix)

NOTE: The reordering above is possible because the expression is a scalar, and a scalar equals its own trace; since the trace is invariant under cyclic permutation of a matrix product, the factors can be rotated.
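
As a sanity check on this identity, here is a tiny NumPy sketch with made-up vectors; it only verifies numerically that (w₁ᵀμ₀-w₁ᵀμ₁)² equals w₁ᵀSₑw₁ when Sₑ is the outer product (μ₀-μ₁)(μ₀-μ₁)ᵀ.

```python
# Numerical check of the between-class identity on made-up vectors.
import numpy as np

rng = np.random.default_rng(1)
d = 4
mu0 = rng.normal(size=(d, 1))                    # made-up class means
mu1 = rng.normal(size=(d, 1))
w1 = rng.normal(size=(d, 1))                     # made-up axis

diff = mu0 - mu1
Se = diff @ diff.T                               # d x d between-class scatter, rank 1

lhs = ((w1.T @ mu0 - w1.T @ mu1) ** 2).item()    # squared distance of projected means
rhs = (w1.T @ Se @ w1).item()                    # quadratic form with S_e
print(np.isclose(lhs, rhs))                      # True
```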

We also know that the covariance of a (mean-centered) data matrix X has the form XXᵀ, up to the 1/n factor. So,
Variance(w₁ᵀX) = (w₁ᵀX)(w₁ᵀX)ᵀ
= w₁ᵀ(XXᵀ)w₁
= w₁ᵀ∑w₁, where ∑ is the covariance matrix of X, with dimension d⨉d
So the projected variances for class 0 and class 1 are w₁ᵀ∑₀w₁ and w₁ᵀ∑₁w₁ respectively, where ∑₀ and ∑₁ are the two class covariance matrices. To keep both small, we minimize their sum:
w₁ᵀ(∑₀+∑₁)w₁ = w₁ᵀ(Sᵢ)w₁ is to be minimized,
where Sᵢ = ∑₀+∑₁ is the within-class (in-class) scatter matrix.
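
The same can be checked for the within-class term. The sketch below uses made-up data; class_cov is an illustrative helper (not from the article) that computes a class covariance normalized by the class size, and the check confirms that the variance of the projected class equals w₁ᵀ∑₀w₁.

```python
# Sketch of the within-class term S_i = Sigma0 + Sigma1 on made-up data.
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 200
X = rng.normal(size=(d, n))                      # d x n, columns are points
y = np.repeat([0, 1], n // 2)                    # first half class 0, second half class 1
w1 = rng.normal(size=(d, 1))                     # candidate axis

def class_cov(Xc):
    """Covariance of a d x m class block: mean-centered (1/m) Xc Xc^T."""
    Xc = Xc - Xc.mean(axis=1, keepdims=True)
    return Xc @ Xc.T / Xc.shape[1]

Sigma0 = class_cov(X[:, y == 0])                 # class-0 covariance
Sigma1 = class_cov(X[:, y == 1])                 # class-1 covariance
Si = Sigma0 + Sigma1                             # within-class scatter S_i

# Variance of the 1-D projections of class 0 matches w1^T Sigma0 w1
proj0 = (w1.T @ X[:, y == 0]).ravel()
print(np.isclose(proj0.var(), (w1.T @ Sigma0 @ w1).item()))  # True
print((w1.T @ Si @ w1).item())                   # the quantity LDA wants to be small
```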

Now we have the between-class variance w₁ᵀ Sₑ w₁, which is to be maximized, and the in-class variance w₁ᵀ Sᵢ w₁, which is to be minimized. Collectively, we can maximize their ratio. This can be done by fixing w₁ᵀ Sᵢ w₁ = 1 and maximizing w₁ᵀ Sₑ w₁.

Now we use the Lagrangian form to solve this constrained problem:
L(w₁, λ₁) = (w₁ᵀ Sₑ w₁) - λ₁(w₁ᵀ Sᵢ w₁ - 1)
To find the maximum (a stationary point), we partially differentiate with respect to w₁ and set the result to 0, which, skipping the intermediate algebra, gives:
2 Sₑ w₁ - 2λ₁ Sᵢ w₁ = 0
⇒ Sₑ w₁ = λ₁ Sᵢ w₁
⇒ (Sᵢ⁻¹ Sₑ) w₁ = λ₁ w₁, where Sᵢ⁻¹ is the inverse of Sᵢ

which is the standard eigenvalue/eigenvector equation. The catch here is that Sₑ = (μ₀-μ₁)(μ₀-μ₁)ᵀ is the outer product of a single d-dimensional vector with itself, so it has rank 1, and therefore Sᵢ⁻¹Sₑ also has rank 1.
As a result, only one eigenvalue is non-zero, and only one useful eigenvector (axis) exists.
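
Here is an illustrative NumPy sketch of solving this eigenproblem on made-up data. Because Sₑ has rank 1, the single useful eigenvector comes out proportional to Sᵢ⁻¹(μ₀-μ₁), the well-known closed form for the binary case; the data and helper names are assumptions for the example.

```python
# Sketch of the binary-case solution: the only non-zero eigenvector of S_i^{-1} S_e.
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 300
y = np.repeat([0, 1], n // 2)                    # half the points in each class
X = rng.normal(size=(d, n)) + 1.5 * y            # shift class 1 so the classes differ

mu0 = X[:, y == 0].mean(axis=1, keepdims=True)   # class means, d x 1
mu1 = X[:, y == 1].mean(axis=1, keepdims=True)

def class_cov(Xc):
    Xc = Xc - Xc.mean(axis=1, keepdims=True)
    return Xc @ Xc.T / Xc.shape[1]

Se = (mu0 - mu1) @ (mu0 - mu1).T                 # between-class scatter, rank 1
Si = class_cov(X[:, y == 0]) + class_cov(X[:, y == 1])

# Eigen-decomposition of S_i^{-1} S_e: only one eigenvalue is (numerically) non-zero
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Si) @ Se)
w1 = eigvecs[:, np.argmax(eigvals.real)].real

# Closed form for the binary case gives the same direction up to scale
w_closed = (np.linalg.inv(Si) @ (mu0 - mu1)).ravel()
cos = abs(w1 @ w_closed) / (np.linalg.norm(w1) * np.linalg.norm(w_closed))
print(np.round(eigvals.real, 4), np.isclose(cos, 1.0))
```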

Extension to k classes

Now we extend this to a k-class dataset. We replace w₁ with W = [w₁ w₂ w₃ … wₖ₋₁], which has k-1 columns as expected (for k classes, only k-1 axes are available). So the dimensions of W are d⨉(k-1).

We need a way to use this matrix W to maximize the ratio. One way is to use the traces of (Wᵀ Sₑ W) and (Wᵀ Sᵢ W), which is equivalent to summing up the individual variances along each axis.
So Trace(Wᵀ Sₑ W)/Trace(Wᵀ Sᵢ W) is to be maximized.
Using the same Lagrangian idea, we get
Trace(Wᵀ Sₑ W) - λ(Trace(Wᵀ Sᵢ W) - 1), and this eventually gives us k-1 eigenvectors, since (Sᵢ⁻¹ Sₑ) has rank at most k-1.

It is easy to compute Sᵢ, since it is just the sum of the individual in-class scatter matrices. Finding the between-class scatter Sₑ with more than two classes looks harder, but there is a simple way out: the total scatter Sₜ is the sum of the in-class and between-class scatter, and Sₜ is easy to compute. So, by reverse calculation, Sₑ = Sₜ - Sᵢ.

Sₜ = (1/n)∑(xᵢ-μ)(xᵢ-μ)ᵀ, where μ is the mean of all the xᵢ.

Now, we have all the required values and we can easily find out the first k-1 axes!
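
To put the k-class recipe together, here is an illustrative sketch on made-up three-class data. It uses unnormalized scatter matrices so that Sₜ = Sᵢ + Sₑ holds exactly, recovers Sₑ by subtraction, and keeps the top k-1 eigenvectors of Sᵢ⁻¹Sₑ; the names and constants are assumptions for the example, not the article's own code.

```python
# Sketch of k-class LDA: S_e = S_t - S_i, then keep the top k-1 eigenvectors.
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 5, 300, 3
y = np.repeat([0, 1, 2], n // k)                 # 100 points per class
X = rng.normal(size=(d, n)) + 0.8 * y            # shift columns so the classes differ

def scatter(Xc):
    """Unnormalized scatter: sum of (x - mean)(x - mean)^T over the columns."""
    Xc = Xc - Xc.mean(axis=1, keepdims=True)
    return Xc @ Xc.T

St = scatter(X)                                  # total scatter, uses the global mean
Si = sum(scatter(X[:, y == c]) for c in range(k))  # within-class scatter
Se = St - Si                                     # between-class scatter by subtraction

# Top k-1 eigenvectors of S_i^{-1} S_e give the projection matrix W (d x (k-1))
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Si) @ Se)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[: k - 1]].real

X_lda = W.T @ X                                  # reduced data, (k-1) x n
print(W.shape, X_lda.shape)
print(np.round(eigvals.real[order], 4))          # only the first k-1 values are non-zero
```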
