Conventional guide to Supervised learning with scikit-learn — Dimensionality reduction using Linear Discriminant Analysis: Linear and Quadratic Discriminant Analysis (18)

Venali Sonone
Jul 22 · 3 min read

This is the eighteenth part of a 92-part series, a conventional guide to supervised learning with scikit-learn, written with the aim of becoming skillful at putting algorithms to productive use and being able to explain the logic underlying them. Links to all parts are in the first article.

Linear Discriminant Analysis and Quadratic Discriminant Analysis are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.
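
As a quick illustration, both classifiers can be fitted side by side in scikit-learn. The sketch below is a minimal example on synthetic two-class data; the data, the random seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Synthetic two-class data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# LDA assumes one covariance matrix shared by all classes (linear boundary);
# QDA estimates one covariance matrix per class (quadratic boundary).
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

lda_acc = lda.score(X, y)
qda_acc = qda.score(X, y)
print(f"LDA training accuracy: {lda_acc:.3f}")
print(f"QDA training accuracy: {qda_acc:.3f}")
```

Because both classes here share the same spread, the two models should behave almost identically; QDA tends to pay off only when the class covariances really differ.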

In general, discriminant analysis works by creating one or more linear predictors that are not the original features themselves but are derived from them.
Put another way, LDA creates new latent variables.

Discriminant functions

Let’s consider groups j, each with a region R(j) of the feature space. A discriminant rule assigns an observation x to group j when x falls in R(j), and LDA looks for the rule that classifies in such a way as to minimize the classification error.
Each rule comes with a discriminant score that measures how well the prediction is made.

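A fitted discriminant function is usually read through the three quantities below:
Structure Correlation Coefficients: The correlation between each original feature and the discriminant score. These are easiest to interpret when the features are not strongly correlated with one another.
Standardized Coefficients: The coefficients, or constant weights, in the linear equation. Each feature has its own weight.
Group Centroids: The mean discriminant score of each class. The further apart the centroids are, the better the classes are separated.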
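
These quantities can be inspected on a fitted scikit-learn model. The sketch below assumes the Iris dataset bundled with scikit-learn (the dataset choice is an assumption). It computes the structure correlations from total correlations, which is a simplification, since some textbooks use pooled within-group correlations instead.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

# Discriminant scores: the new latent variables, one column per discriminant.
scores = lda.transform(X)

# Structure correlations: correlation of each original feature with each
# discriminant score (total correlations, a simplifying assumption).
structure = np.array([
    [np.corrcoef(X[:, j], scores[:, k])[0, 1] for k in range(scores.shape[1])]
    for j in range(X.shape[1])
])

# Raw (unstandardized) weights of the linear combinations.
raw_weights = lda.scalings_

# Group centroids in the discriminant space: mean score per class.
centroids = np.array([scores[y == c].mean(axis=0) for c in np.unique(y)])

print("structure correlations:\n", structure.round(2))
print("raw weights shape:", raw_weights.shape)
print("group centroids:\n", centroids.round(2))
```

The class means in the original feature space are also available directly as `lda.means_`.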

Discrimination rules

Maximum likelihood: A new observation should be assigned to the group under which its population density (likelihood) is maximum.
Bayes Discriminant Rule: A new observation should be assigned to the group whose conditional (posterior) probability, given the observation, is maximum.
Fisher’s linear discriminant rule: Find the linear combination of the features that maximizes the ratio of the between-group sum of squares (SS) to the within-group sum of squares, and use it to predict the group.
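
The first two rules can be sketched by hand with NumPy. The example below is only an illustration under stated assumptions: synthetic data with unequal class sizes, Gaussian classes, and a single pooled covariance matrix shared by both classes. All names in it are hypothetical.

```python
import numpy as np

# Synthetic, imbalanced two-class data (an assumption, for illustration only).
rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2))
X1 = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 150 + [1] * 50)

classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])

# Pooled within-class covariance, shared by all classes as LDA assumes.
centered = X - means[y]
pooled_cov = centered.T @ centered / (len(X) - len(classes))
cov_inv = np.linalg.inv(pooled_cov)


def log_density(points, mean):
    """Gaussian log-density up to a constant that is equal for every class."""
    diff = points - mean
    return -0.5 * np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


log_lik = np.column_stack([log_density(X, m) for m in means])

# Maximum likelihood rule: pick the class with the highest density.
ml_pred = classes[np.argmax(log_lik, axis=1)]

# Bayes rule: pick the class with the highest posterior (density times prior).
bayes_pred = classes[np.argmax(log_lik + np.log(priors), axis=1)]

ml_acc = np.mean(ml_pred == y)
bayes_acc = np.mean(bayes_pred == y)
print(f"maximum likelihood accuracy: {ml_acc:.3f}")
print(f"Bayes rule accuracy:         {bayes_acc:.3f}")
```

The two rules differ only by the `np.log(priors)` term, so they give the same answer whenever the classes are equally frequent.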

Let’s look at a more detailed explanation:

Our task is to obtain a linear representation with LDA so that we can find the separation in a linear space. Consider the 2-d graph on the left. We want to derive the 1-d linear representation on the right, so that a single number (a threshold on the number line) can separate the two classes. What’s the best way to reduce the dimensions?

Let’s start with the ways that are not recommended. One is to project everything onto the X-axis.

Another could be to project it onto a third axis.

This would look like a linear space after projection…

Mathematically, the new axis is obtained using the equations below.
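
For two classes, one common way to write this projection is Fisher's criterion: choose the direction w that maximizes the squared distance between the projected class means divided by the projected within-class scatter. Its solution is proportional to S_W^(-1) (m1 - m0), where S_W is the within-class scatter matrix and m0, m1 are the class means. The NumPy sketch below, on made-up correlated data, compares that direction with the plain X-axis projection; the data, the names, and the midpoint threshold (which assumes equally likely classes) are assumptions.

```python
import numpy as np

# Two correlated classes (synthetic data, an assumption for illustration).
rng = np.random.default_rng(2)
cov = [[3.0, 2.5], [2.5, 3.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, -2.0], cov, size=200)

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter matrix S_W.
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

# Fisher direction: w is proportional to S_W^{-1} (m1 - m0).
w = np.linalg.solve(S_w, m1 - m0)
w /= np.linalg.norm(w)


def fisher_ratio(direction):
    """Between-class over within-class sum of squares after projection."""
    z0, z1 = X0 @ direction, X1 @ direction
    between = (z1.mean() - z0.mean()) ** 2
    within = ((z0 - z0.mean()) ** 2).sum() + ((z1 - z1.mean()) ** 2).sum()
    return between / within


ratio_fisher = fisher_ratio(w)
ratio_x_axis = fisher_ratio(np.array([1.0, 0.0]))

# A single threshold on the 1-d number line: the midpoint of the projected
# class means (assumes equally likely classes).
threshold = 0.5 * (m0 @ w + m1 @ w)
pred0 = X0 @ w > threshold  # True means "predicted class 1"
pred1 = X1 @ w > threshold
accuracy = (np.sum(~pred0) + np.sum(pred1)) / (len(X0) + len(X1))

print(f"Fisher ratio along w:      {ratio_fisher:.4f}")
print(f"Fisher ratio along X-axis: {ratio_x_axis:.4f}")
print(f"accuracy with threshold:   {accuracy:.3f}")
```

On data like this, the X-axis projection mixes the classes heavily, while the Fisher direction keeps them apart with one threshold.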

Let’s get straight into coding to understand the concepts we have discussed.
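
A minimal sketch of what such code might look like is given here, assuming the Iris dataset that ships with scikit-learn. The dataset, the split ratio, and the random seed are assumptions, not the author's choices.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# With 3 classes, LDA can produce at most n_classes - 1 = 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)

# Supervised dimensionality reduction: 4 features -> 2 latent variables.
X_train_2d = lda.fit_transform(X_train, y_train)
X_test_2d = lda.transform(X_test)

# The same fitted model also works as a classifier.
test_accuracy = lda.score(X_test, y_test)

print("reduced training shape:", X_train_2d.shape)
print("reduced test shape:", X_test_2d.shape)
print(f"test accuracy: {test_accuracy:.3f}")
```

The reduced arrays can be passed to a scatter plot or to another estimator, while `predict` and `score` continue to work on the original four features.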


All credits go to Scikit-learn documentation and all references are as per official user guide.

Also thanks to my friend who believes that “success to me is if I’ve created enough impact so that the world’s a better place” which motivates me to start from scratch so as to create a difference at some point.

About the Author

I am Venali Sonone, a data scientist by profession and also a management student, aspiring to advance my career in the financial industry.

Data Driven Investor

from confusion to clarity, not insanity

Thanks to Justin Chan
