Comparing the Performance of Logistic Regression, LDA, QDA, Naive Bayes, and KNN

Sanjana Garg


In this post, I would like to share some thoughts on the empirical performance of several classification methods: logistic regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), naive Bayes, and k-Nearest Neighbours (KNN). These methods are examined across six different scenarios involving binary (two-class) classification problems. In three scenarios, the Bayes decision boundary is linear, while in the remaining three, it is non-linear. For each scenario, we generate 100 random training data sets, fit each method on every one, and compute the test error rate on a single large test set. The results for the linear scenarios are shown in Figure 1, and the results for the non-linear scenarios are in Figure 2.
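To make the setup concrete, here is a minimal sketch of this kind of experiment in Python using scikit-learn. The simulation parameters (means, sample sizes, the single fixed-K KNN) are illustrative stand-ins, not the exact values used in the book:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def simulate(n_per_class, mean_shift=1.0):
    """Two uncorrelated normal predictors with different means per
    class (an illustrative stand-in for Scenario 1)."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, 2))
    X1 = rng.normal(mean_shift, 1.0, size=(n_per_class, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
}

# 100 random training sets; error is measured on one large test set.
X_test, y_test = simulate(5000)
errors = {name: [] for name in models}
for _ in range(100):
    X_train, y_train = simulate(20)
    for name, model in models.items():
        model.fit(X_train, y_train)
        errors[name].append(1 - model.score(X_test, y_test))

for name, errs in errors.items():
    print(f"{name:12s} mean test error = {np.mean(errs):.3f}")
```

The distribution of the 100 test error rates per method is exactly what the boxplots below summarize.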

Figure 1: Boxplots of the test error rates for each of the linear scenarios

Scenarios and Results

Linear Scenarios

  • Scenario 1: 20 observations per class, with uncorrelated random normal predictors and different means in each class. Results: LDA and logistic regression performed well. KNN performed poorly because of its high variance, and QDA underperformed because it fits a more flexible model than the linear boundary requires. Naive Bayes did slightly better than QDA, since its independence assumption actually holds here.
  • Scenario 2: Same as Scenario 1, but with a correlation of -0.5 between the two predictors. Results: Performance was similar to Scenario 1, except that naive Bayes performed poorly because its independence assumption is now violated.
  • Scenario 3: Predictors drawn from the t-distribution with substantial negative correlation, 50 observations per class. Results: Logistic regression outperformed LDA, and both were superior to the other methods. QDA broke down because of the non-normality, and naive Bayes was again hurt by the violated independence assumption. (A data-generation sketch for these scenarios follows this list.)
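For readers who want to simulate data like this, below is a hedged sketch of generators in the spirit of Scenarios 2 and 3. The means, correlation, and degrees of freedom are assumptions for illustration; in particular, the book's Scenario 3 also imposes negative correlation between the t-distributed predictors, which this sketch omits for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def scenario2_data(n_per_class, rho=-0.5):
    """Correlated bivariate normal predictors (spirit of Scenario 2)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X0 = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_class)
    X1 = rng.multivariate_normal([1.0, 1.0], cov, size=n_per_class)
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

def scenario3_data(n_per_class, df=5):
    """Heavy-tailed t-distributed predictors (spirit of Scenario 3).
    Unlike the book, no correlation is imposed here."""
    X0 = rng.standard_t(df, size=(n_per_class, 2))
    X1 = rng.standard_t(df, size=(n_per_class, 2)) + 1.0
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)
```

Either generator can be dropped into the comparison loop above in place of simulate().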
Figure 2: Boxplots of the test error rates for each of the non-linear scenarios

Non-Linear Scenarios

  • Scenario 4: Normally distributed predictors, with a correlation of 0.5 between them in one class and -0.5 in the other, giving a quadratic Bayes decision boundary. Results: QDA outperformed all of the other methods. Naive Bayes performed poorly because of the violated independence assumption.
  • Scenario 5: Normally distributed, uncorrelated predictors, with responses sampled from a non-linear logistic function of the predictors. Results: QDA and naive Bayes gave better results than the linear methods, but KNN-CV (KNN with K chosen by cross-validation) performed best of all. KNN with K = 1 performed worst, highlighting how much the smoothness level matters in non-parametric methods (see the cross-validation sketch after this list).
  • Scenario 6: Normally distributed predictors with a different diagonal covariance matrix in each class, and a very small sample size (n = 6 per class). Results: Naive Bayes performed very well because its assumptions were met. LDA and logistic regression performed poorly because the true decision boundary is non-linear, while QDA and KNN suffered from high variance given the small sample size.
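The contrast in Scenario 5 between KNN-CV and KNN with K = 1 underlines that the smoothness level of a non-parametric method has to be tuned rather than fixed. Here is a minimal sketch of choosing K by cross-validation with scikit-learn's GridSearchCV; the data-generating step is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Illustrative non-linear problem: the class depends on a quadratic
# function of the two predictors, so no linear boundary fits well.
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2
     + rng.normal(scale=0.5, size=200) > 2.0).astype(int)

# Choose K by 5-fold cross-validation instead of fixing K = 1.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 26))},
    cv=5,
)
grid.fit(X, y)
print("best K:", grid.best_params_["n_neighbors"])
print(f"CV accuracy: {grid.best_score_:.3f}")
```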

Conclusion

These scenarios illustrate that no single method outperforms the others in every situation. Linear methods like LDA and logistic regression excel when the decision boundary is linear. For moderately non-linear boundaries, QDA or naive Bayes may do better. For more complex decision boundaries, non-parametric approaches like KNN can be superior, provided the smoothness level is chosen correctly.

Source: An Introduction to Statistical Learning: With Applications in Python
