How To Implement Linear Discriminant Analysis in R?
Linear Discriminant Analysis is a very popular Machine Learning technique that is used to solve classification problems. In this article, we will try to understand the intuition and mathematics behind this technique. An example of the implementation of LDA in R is also provided.
- Linear Discriminant Analysis Assumption
- Intuition
- Mathematical Description of LDA
- Learning the Model Parameters
- Example in R
So let us get started.
Linear Discriminant Analysis Assumption
Linear Discriminant Analysis is based on the following assumptions:
- The dependent variable Y is discrete. In this article, we will assume that the dependent variable is binary and takes class values {+1, -1}. The probability of a sample belonging to class +1, i.e. P(Y = +1), is p. Therefore, the probability of a sample belonging to class -1 is 1-p.
- The independent variable(s) X come from Gaussian distributions. The mean of the Gaussian distribution depends on the class label Y, i.e. if Yi = +1, then the mean of Xi is μ+1, else it is μ-1. The variance σ² is the same for both classes. Mathematically speaking, X|(Y = +1) ~ N(μ+1, σ²) and X|(Y = -1) ~ N(μ-1, σ²), where N denotes the normal distribution.
With this information, it is possible to construct a joint distribution P(X, Y) for the independent and dependent variables. Therefore, LDA belongs to the class of Generative Classifier Models. A closely related generative classifier is Quadratic Discriminant Analysis (QDA). It makes the same assumptions as LDA, except that the class variances (covariance matrices) are allowed to differ.
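To make the generative view concrete, the joint distribution P(X, Y) = P(Y) P(X | Y) can be sampled by first drawing the class label Y and then drawing X from the corresponding class-conditional Gaussian. A minimal sketch in R, using illustrative parameter values (the variable names here are my own):

```r
# Sample from the joint P(X, Y) = P(Y) * P(X | Y) under the LDA assumptions
set.seed(1)
p     <- 0.4       # P(Y = +1), an assumed value for illustration
mu_p1 <- 10        # mean of X given Y = +1
mu_m1 <- 2         # mean of X given Y = -1
sigma <- sqrt(2)   # common standard deviation for both classes
Y <- ifelse(runif(1000) < p, 1, -1)   # draw the class label first
X <- rnorm(1000, mean = ifelse(Y == 1, mu_p1, mu_m1), sd = sigma)  # then X given Y
```

The empirical class proportions and class-wise means of such a sample will be close to p, μ+1 and μ-1 respectively.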
Let us continue with this Linear Discriminant Analysis article and build some intuition.
Intuition
Consider the class-conditional Gaussian distributions for X given the class Y. The figure below shows the density functions of the distributions. In this figure, if Y = +1, then the mean of X is 10, and if Y = -1, the mean is 2. The variance is 2 in both cases.
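The two density functions can be sketched in base R (a rough reproduction of the figure described above, with the same means and variance):

```r
# Class-conditional densities: X | Y = -1 ~ N(2, 2) and X | Y = +1 ~ N(10, 2)
curve(dnorm(x, mean = 2, sd = sqrt(2)), from = -3, to = 16,
      col = "red", xlab = "x", ylab = "density")
curve(dnorm(x, mean = 10, sd = sqrt(2)), add = TRUE, col = "blue")
abline(v = 6, lty = 2)  # midpoint of the two means, where the densities cross
```

Note that the two curves intersect exactly at the midpoint x = 6, which foreshadows the decision boundary derived below.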
Now suppose a new value of X is given to us. Let's just denote it as xi. The task is to determine the most likely class label for this xi, i.e. yi. For simplicity, assume that the probability p of the sample belonging to class +1 is the same as that of belonging to class -1, i.e. p = 0.5.
Intuitively, it makes sense to say that if xi is closer to μ+1 than it is to μ-1, then it is more likely that yi = +1. More formally, yi = +1 if:
|xi − μ+1| < |xi − μ-1|
Normalizing both sides by the standard deviation:
|xi − μ+1|/σ < |xi − μ-1|/σ
Squaring both sides:
(xi − μ+1)²/σ² < (xi − μ-1)²/σ²
Expanding the squares:
xi²/σ² + μ+1²/σ² − 2xiμ+1/σ² < xi²/σ² + μ-1²/σ² − 2xiμ-1/σ²
Cancelling the common xi² term and rearranging:
2xi(μ-1 − μ+1)/σ² − (μ-1²/σ² − μ+1²/σ²) < 0
Multiplying both sides by −1:
−2xi(μ-1 − μ+1)/σ² + (μ-1²/σ² − μ+1²/σ²) > 0
The above expression is of the form bxi + c > 0, where b = −2(μ-1 − μ+1)/σ² and c = (μ-1²/σ² − μ+1²/σ²).
It is apparent that the form of the equation is linear, hence the name Linear Discriminant Analysis.
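Plugging in the values from the running example (μ+1 = 10, μ-1 = 2, σ² = 2) gives concrete values for b and c. A quick check in R:

```r
# 1-D LDA decision rule: classify as +1 when b * x + c > 0
mu_p1 <- 10; mu_m1 <- 2; s2 <- 2
b <- -2 * (mu_m1 - mu_p1) / s2   # = 8
c <- (mu_m1^2 - mu_p1^2) / s2    # = -48
# The boundary b*x + c = 0 sits at x = 6, the midpoint of the two means
classify <- function(x) ifelse(b * x + c > 0, 1, -1)
classify(7)  # closer to mu_p1, classified as +1
classify(3)  # closer to mu_m1, classified as -1
```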
Let us continue with this Linear Discriminant Analysis article and look at the mathematical description of LDA.
Mathematical Description of LDA
The mathematical derivation of the expression for LDA is based on concepts like Bayes Rule and Bayes Optimal Classifier.
We will provide the expression directly for our specific case where Y takes two classes {+1, -1}. We will also extend the intuition shown in the previous section to the general case where X can be multidimensional. Let's say that there are k independent variables. In this case, the class means μ-1 and μ+1 would be vectors of dimension k×1 and the variance-covariance matrix Σ would be a matrix of dimension k×k.
The classifier function is given as
Y = h(X) = sign(bᵀX + c)
Where,
b = −2Σ⁻¹(μ-1 − μ+1)
c = μ-1ᵀΣ⁻¹μ-1 − μ+1ᵀΣ⁻¹μ+1 − 2 ln{(1-p)/p}
The sign function returns +1 if the expression bᵀx + c > 0; otherwise it returns -1. The natural log term in c adjusts for the fact that the class probabilities need not be equal for the two classes, i.e. p can be any value in (0, 1), not just 0.5.
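The classifier function can be written out directly once the parameters are known. The sketch below uses a hypothetical helper name (`lda_classify` is not part of any package) and assumes the parameters are given:

```r
# h(x) = sign(b^T x + c), with b and c as defined above
lda_classify <- function(x, mu_m1, mu_p1, Sigma, p) {
  Sinv <- solve(Sigma)                 # Sigma^{-1}
  b <- -2 * Sinv %*% (mu_m1 - mu_p1)
  c <- drop(t(mu_m1) %*% Sinv %*% mu_m1) -
       drop(t(mu_p1) %*% Sinv %*% mu_p1) -
       2 * log((1 - p) / p)
  ifelse(drop(t(b) %*% x) + c > 0, 1, -1)
}
# A point at the class +1 mean is classified as +1
lda_classify(c(6, 6), mu_m1 = c(2, 2), mu_p1 = c(6, 6), Sigma = diag(2), p = 0.5)
```

With Σ = I and p = 0.5, this reduces to the nearest-class-mean rule from the intuition section.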
Learning the Model Parameters
Given a dataset with N data-points (x1, y1), (x2, y2), … (xN, yN), we need to estimate p, μ-1, μ+1 and Σ. A statistical estimation technique called Maximum Likelihood Estimation is used to estimate these parameters. The expressions for the above parameters are given below.
μ+1 = (1/N+1) * ∑i:yi=+1 xi
μ-1 = (1/N-1) * ∑i:yi=-1 xi
p = N+1/N
Σ = (1/N) * ∑i=1:N (xi − μyi)(xi − μyi)ᵀ, where μyi is the mean of the class that sample i belongs to
Where N+1 = number of samples where yi = +1 and N-1 = number of samples where yi = -1.
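These estimators are straightforward to compute by hand. A sketch on simulated data (the data and variable names here are my own, for illustration only):

```r
# Maximum likelihood estimates of p, the class means, and the pooled covariance
set.seed(7)
X <- rbind(matrix(rnorm(200, mean = 6), ncol = 2),   # 100 samples for class +1
           matrix(rnorm(300, mean = 2), ncol = 2))   # 150 samples for class -1
y <- c(rep(1, 100), rep(-1, 150))
N <- nrow(X)
mu_p1 <- colMeans(X[y == 1, ])    # class +1 mean
mu_m1 <- colMeans(X[y == -1, ])   # class -1 mean
p_hat <- sum(y == 1) / N          # estimate of p = N+1 / N
# Pooled covariance: subtract each sample's own class mean before averaging
centered <- X
centered[y == 1, ]  <- sweep(X[y == 1, ],  2, mu_p1)
centered[y == -1, ] <- sweep(X[y == -1, ], 2, mu_m1)
Sigma_hat <- crossprod(centered) / N
```

Since the data was generated with identity covariance, `Sigma_hat` should be close to the 2×2 identity matrix.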
With the above expressions, the LDA model is complete. One can estimate the model parameters using the above expressions and use them in the classifier function to get the class label of any new input value of independent variable X.
Let us continue with this Linear Discriminant Analysis article and work through an example in R.
Example in R
The following code generates a dummy data set with two independent variables X1 and X2 and a dependent variable Y. For X1 and X2, we will generate samples from two multivariate Gaussian distributions with means μ-1 = (2, 2) and μ+1 = (6, 6). 40% of the samples belong to class +1 and 60% belong to class -1, therefore p = 0.4.
library(ggplot2)
library(MASS)     # provides lda()
library(mvtnorm)  # provides rmvnorm()
# Variance-covariance matrix for the random bivariate Gaussian samples
var_covar = matrix(data = c(1.5, 0.3, 0.3, 1.5), nrow = 2)
# Random bivariate Gaussian samples for class +1
Xplus1 <- rmvnorm(400, mean = c(6, 6), sigma = var_covar)
# Random bivariate Gaussian samples for class -1
Xminus1 <- rmvnorm(600, mean = c(2, 2), sigma = var_covar)
# Samples for the dependent variable
Y_samples <- c(rep(1, 400), rep(-1, 600))
# Combine the independent and dependent variables into a data frame
dataset <- as.data.frame(cbind(rbind(Xplus1, Xminus1), Y_samples))
colnames(dataset) <- c("X1", "X2", "Y")
dataset$Y <- as.character(dataset$Y)
# Plot the samples, coloured by class label
ggplot(data = dataset) +
  geom_point(aes(X1, X2, color = Y))
In the above figure, the blue dots represent samples from class +1 and the red ones represent samples from class -1. There is some overlap between the samples, i.e. the classes cannot be separated completely with a simple line. In other words, they are not perfectly linearly separable.
We will now train an LDA model using the above data.
# Train the LDA model using the above dataset
lda_model <- lda(Y ~ X1 + X2, data = dataset)
# Print the LDA model
lda_model
Output:
Prior probabilities of groups:
-1 1
0.6 0.4
Group means:
X1 X2
-1 1.928108 2.010226
1 5.961004 6.015438
Coefficients of linear discriminants:
LD1
X1 0.5646116
X2 0.5004175
As one can see, the class means learnt by the model are (1.928108, 2.010226) for class -1 and (5.961004, 6.015438) for class +1. These means are very close to the class means we had used to generate the random samples. The prior probability for group +1 is the estimate of the parameter p, and the coefficients of linear discriminants correspond (up to scale) to the b vector of the classifier function.
We will now use the above model to predict the class labels for the same data.
# Predict the class of each sample in the above dataset using the LDA model
y_pred <- predict(lda_model, newdata = dataset)$class
# Add the predictions as another column in the data frame
dataset$Y_lda_prediction <- as.character(y_pred)
# Plot the samples, coloured by actual and predicted class labels
dataset$Y_actual_pred <- paste(dataset$Y, dataset$Y_lda_prediction, sep = ",")
ggplot(data = dataset) +
  geom_point(aes(X1, X2, color = Y_actual_pred))
In the above figure, the purple samples are from class +1 that were classified correctly by the LDA model. Similarly, the red samples are from class -1 that were classified correctly. The blue ones are from class +1 but were classified incorrectly as -1. The green ones are from class -1 which were misclassified as +1. The misclassifications are happening because these samples are closer to the other class mean (center) than their actual class mean.
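These four groups can be summarized in a confusion matrix. A self-contained sketch (the data here is regenerated with simpler parameters, so the exact counts will differ from the plots above):

```r
library(MASS)  # provides lda()
set.seed(10)
X <- rbind(matrix(rnorm(200, mean = 6), ncol = 2),   # class +1
           matrix(rnorm(300, mean = 2), ncol = 2))   # class -1
d <- data.frame(X1 = X[, 1], X2 = X[, 2],
                Y = c(rep("1", 100), rep("-1", 150)))
fit  <- lda(Y ~ X1 + X2, data = d)
pred <- predict(fit, newdata = d)$class
# Rows: actual class; columns: predicted class
conf_mat <- table(actual = d$Y, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

With class means this far apart, the diagonal of the confusion matrix dominates and accuracy is close to 1.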
Do look out for other articles in this series which will explain the various other aspects of Data Science.
Originally published at www.edureka.co on July 24, 2019.