Illustrative Example of Principal Component Analysis (PCA) vs Linear Discriminant Analysis (LDA): Is PCA a Good Guy or a Bad Guy?

gopi sumanth · Analytics Vidhya · Dec 11, 2019 · 7 min read

Header image source: sebastian

Hi, in this post I am going to explain how dimensionality reduction techniques affect a prediction model. We will use the Iris dataset and a K-NN classifier, and compare PCA and LDA on this dataset.

Before jumping into the experiments, it is worth brushing up on the concepts of PCA and LDA, so I will explain each of them briefly. Let us start with PCA.

Principal component analysis (PCA)

PCA is a statistical tool often used for dimensionality reduction. It converts higher-dimensional data into a lower-dimensional representation before any ML model is applied, and it is an unsupervised learning algorithm. Let me start with an example of what PCA does. In the image below we see an object from the top, bottom and side views; note that an object can be viewed from any angle around it. If this mug (object) is the data, then PCA helps us find the views (directions) in which the largest portion of the mug is seen. For example, if only the side and bottom views are available, PCA gives us the side view, because the largest area of the mug is visible there. This side view is the first principal component. After finding the first principal component, we rotate the mug in directions perpendicular to it; the direction that covers the largest portion while being perpendicular to the first principal component is the second principal component. The third, fourth and further components are found in the same way.

Image 1: Different views of an object (source: google search)

Steps to perform PCA:

Normalize the data and compute the covariance matrix. Then find its eigenvectors and the corresponding eigenvalues. The first principal component is simply the eigenvector with the largest eigenvalue, the second principal component is the eigenvector with the second-largest eigenvalue, and so on.
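To make these steps concrete, here is a minimal NumPy sketch (not from the original notebook; the function and variable names are mine) that recovers the leading principal components of a data matrix by eigendecomposition of its covariance matrix:

import numpy as np

def pca_via_eigendecomposition(X, n_components=2):
    # 1. normalize the data (zero mean, unit variance per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. eigenvectors and eigenvalues (eigh, since the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. sort by decreasing eigenvalue: the eigenvector with the largest
    #    eigenvalue is the first principal component, and so on
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # project the data onto the selected components
    return X_std @ components

On the standardized Iris features this should agree (up to sign) with what sklearn's PCA produces further below.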

Linear Discriminant Analysis (LDA):

LDA is a supervised dimensionality reduction technique. It makes assumptions about the data (classically, that each class is normally distributed and that the classes share a common covariance matrix). It is a generalization of Fisher's Linear Discriminant. Unlike PCA, LDA does not look for the directions of maximum variance; instead, it finds directions that increase the inter-class distance and decrease the intra-class distance. A detailed explanation of LDA can be found here.
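To give a rough idea of what "increasing inter-class and decreasing intra-class distance" means computationally, here is a small sketch (my own illustration, not sklearn's internal routine): the discriminant directions are the leading eigenvectors of S_W⁻¹ S_B, where S_W is the within-class scatter matrix and S_B the between-class scatter matrix.

import numpy as np

def lda_directions(X, y, n_components=2):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_W += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(Xc) * diff @ diff.T
    # directions that maximize between-class spread relative to within-class spread
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:n_components]].real
    return X @ W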

Load necessary libraries

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn
from sklearn.metrics.pairwise import euclidean_distances
import warnings
warnings.filterwarnings("ignore")

Load the IRIS data and perform standardization

dataset = pd.read_csv('iris.csv') # read the data into a dataframe
X = dataset.iloc[:, :-1].values # store the independent features (the four measurements) in X
y = dataset.iloc[:, 4].values # store the dependent variable (species label) in y
X = StandardScaler().fit_transform(X)

Perform PCA and visualize the data

# initializing the pca
from sklearn import decomposition
pca = decomposition.PCA()
# configuring the parameters:
# number of components = 2, since 2-d data is easy to visualize
pca.n_components = 2
# pca_data will contain the 2-d projections of the standardized data
pca_data = pca.fit_transform(X)
print("shape of pca_data = ", pca_data.shape)
#>>> shape of pca_data =  (150, 2)
# attaching the label for each 2-d data point
pca_data = np.vstack((pca_data.T, y)).T
# creating a new dataframe which helps us in plotting the result data
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
# make sure the component columns are numeric for plotting
pca_df[["1st_principal", "2nd_principal"]] = pca_df[["1st_principal", "2nd_principal"]].astype(float)
sn.FacetGrid(pca_df, hue="label", height=4).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
PCA for visualizing IRIS data using two principal components

Plot the number of principal components vs the cumulative explained variance

# PCA for dimensionality reduction (not visualization)
pca.n_components = 4
pca_data = pca.fit_transform(X)
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)
# Plot the PCA spectrum
plt.figure(1, figsize=(6, 4))
plt.xticks(np.arange(0, 4, step=1), (1, 2, 3, 4))
plt.plot(cum_var_explained, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()

If we keep 1 dimension, approximately 72% of the variance is explained, and if we keep 2 dimensions, approximately 95% of the variance is explained.

Plot depicting the cumulative variance explained by the principal components
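As a side note, if the goal is only to keep enough components to reach a target fraction of explained variance, sklearn's PCA also accepts a float between 0 and 1 for n_components; a short sketch (the 0.95 threshold is just an example):

from sklearn import decomposition

# keep the smallest number of components explaining at least 95% of the variance
pca_95 = decomposition.PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())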

Perform LDA and visualize the data

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
lda_data = lda.fit(X, y).transform(X)
# attaching the label for each 2-d data point
lda_data = np.vstack((lda_data.T, y)).T
# creating a new dataframe which helps us in plotting the result data
lda_df = pd.DataFrame(data=lda_data, columns=("1st_discriminant", "2nd_discriminant", "label"))
# make sure the discriminant columns are numeric for plotting
lda_df[["1st_discriminant", "2nd_discriminant"]] = lda_df[["1st_discriminant", "2nd_discriminant"]].astype(float)
sn.FacetGrid(lda_df, hue="label", height=4).map(plt.scatter, '1st_discriminant', '2nd_discriminant').add_legend()
plt.show()
LDA for visualizing IRIS data using two linear discriminants
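One detail worth noting: LDA can produce at most (number of classes − 1) discriminants, so for the three Iris classes two components is already the maximum. If you want to check how much of the between-class variance each discriminant captures, the fitted sklearn object exposes it:

# at most n_classes - 1 = 2 discriminants exist for the 3 Iris classes
print(lda.explained_variance_ratio_)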

Applying K-NN on original IRIS DATA

def divide_training_dataset_to_k_folds(x_train, y_train, folds):
    # split the training data, in order and without shuffling, into `folds` groups
    temp = len(x_train) / folds
    x_train = x_train.tolist()
    y_train = y_train.tolist()
    group = []
    label = []
    end = 0.0
    while end < len(x_train):
        group.append(x_train[int(end):int(end + temp)])
        label.append(y_train[int(end):int(end + temp)])
        end += temp
    return group, label
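For reference, the same in-order split can be produced with sklearn's KFold; a brief sketch (x_train and y_train here stand for the same arrays passed to the helper above, and shuffle=False keeps the behaviour comparable):

from sklearn.model_selection import KFold

kf = KFold(n_splits=3, shuffle=False)
for train_idx, val_idx in kf.split(x_train):
    # index arrays for the training folds and the held-out fold
    x_folds, x_val = x_train[train_idx], x_train[val_idx]
    y_folds, y_val = y_train[train_idx], y_train[val_idx]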

Define the cross-validation routine (RandomSearchCV):

from sklearn.metrics import accuracy_score

def RandomSearchCV(x_train, y_train, classifier, param_range, folds):
    # x_train: numpy array of shape (n, d)
    # y_train: numpy array of shape (n,) or (n, 1)
    # classifier: typically KNeighborsClassifier()
    # param_range: a tuple (a, b) with a < b, giving the range of k values to try
    # folds: an integer, the number of folds into which we divide the data
    params = list(range(param_range[0], param_range[1] + 1))
    # 1. divide the rows of x_train into `folds` groups
    #    e.g. folds=3 and len(x_train)=99: group 1: rows 0-32, group 2: rows 33-65, group 3: rows 66-98
    groups, labels = divide_training_dataset_to_k_folds(x_train, y_train, folds)
    # 2. for each hyperparameter k, do cross-validation as follows:
    #    keep all groups except one as train data and the held-out group as test data,
    #    repeat so that every group is held out exactly once,
    #    then store the mean train accuracy in "train_scores"
    #    and the mean test accuracy in "test_scores"
    train_scores = []
    test_scores = []
    for k in tqdm(params):
        trainscores_folds = []
        testscores_folds = []
        for i in range(folds):
            X_train = [groups[g] for g in range(folds) if g != i]
            X_train = [j for sublist in X_train for j in sublist]
            Y_train = [labels[g] for g in range(folds) if g != i]
            Y_train = [j for sublist in Y_train for j in sublist]
            X_test = groups[i]
            Y_test = labels[i]
            classifier.n_neighbors = k
            classifier.fit(X_train, Y_train)
            Y_predicted = classifier.predict(X_test)
            testscores_folds.append(accuracy_score(Y_test, Y_predicted))
            Y_predicted = classifier.predict(X_train)
            trainscores_folds.append(accuracy_score(Y_train, Y_predicted))
        train_scores.append(np.mean(np.array(trainscores_folds)))
        test_scores.append(np.mean(np.array(testscores_folds)))
    # 3. return "train_scores", "test_scores" and the list of k values tried
    return train_scores, test_scores, params
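For comparison, the same sweep over k can be written with sklearn's built-in GridSearchCV; a hedged sketch, assuming X_train and y_train are the 70% training split created in the next section (results may differ slightly because GridSearchCV chooses its own folds):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 51))},
                    cv=3, scoring="accuracy", return_train_score=True)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)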

K-NN classifier

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
classifier = KNeighborsClassifier()
param_range = (1, 50)
folds = 3
X = dataset.iloc[:, :-1].values # independent features
y = dataset.iloc[:, 4].values   # dependent variable (species label)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.30)
trainscores, testscores, params = RandomSearchCV(X_train, y_train, classifier, param_range, folds)
# plot the hyperparameter vs accuracy curves and choose the best hyperparameter
plt.plot(params, trainscores, label='train curve')
plt.plot(params, testscores, label='test curve')
plt.title('Hyper-parameter VS accuracy plot')
plt.legend()
plt.show()
Maximum test accuracy is approx. 97%
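To turn this curve into a final model, one option (a sketch I am adding, not part of the original post) is to take the k with the best cross-validated test score and evaluate it once on the held-out 30% split:

best_k = params[int(np.argmax(testscores))]
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(X_train, y_train)
print("best k:", best_k)
print("hold-out accuracy:", accuracy_score(y_test, final_knn.predict(X_test)))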

Applying K-NN on modified IRIS DATA using PCA

X = pca_df.iloc[:, :-1].values # independent features: the two principal components
y = pca_df.iloc[:, -1].values  # dependent variable (species label)
# training data = 70% and test data = 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.30)
trainscores, testscores, params = RandomSearchCV(X_train, y_train, classifier, param_range, folds)
# plot the hyperparameter vs accuracy curves and choose the best hyperparameter
plt.plot(params, trainscores, label='train curve')
plt.plot(params, testscores, label='test curve')
plt.title('Hyper-parameter VS accuracy plot')
plt.legend()
plt.show()
Maximum test accuracy is approx. 91%

Applying K-NN on modified IRIS DATA using LDA

X = lda_df.iloc[:, :-1].values # independent features: the two linear discriminants
y = lda_df.iloc[:, -1].values  # dependent variable (species label)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.30)
trainscores, testscores, params = RandomSearchCV(X_train, y_train, classifier, param_range, folds)
# plot the hyperparameter vs accuracy curves and choose the best hyperparameter
plt.plot(params, trainscores, label='train curve')
plt.plot(params, testscores, label='test curve')
plt.title('Hyper-parameter VS accuracy plot')
plt.legend()
plt.show()
Maximum test accuracy is approx. 97%

Conclusion:

To sum up, we can observe from the above results that PCA performed poorly on this labelled data. LDA, on the other hand, did not decrease the performance of the K-NN model while still reducing the dimensionality of the dataset. Since PCA is an unsupervised technique, it does not take the class labels into account. Therefore, for labelled data, LDA is the better dimensionality reduction technique of the two.

Link for code: github

Note: the dataset reduced with LDA gave the same accuracy as the original dataset, i.e. about 97%, whereas the dataset reduced with PCA gave an accuracy of only about 91%.


Gopi Sumanth is a Data Scientist currently working at Semantic Web Tech, Bengaluru. His interests lie in the areas of healthcare and AI in its entirety.