Data Science :: Performance of Each Classification Model

This post is a quick refresher on the “advantages and disadvantages of each Classification Model”, so it assumes you are already familiar with the material. You can also treat it as a set of FAQs or interview questions.

What is a false positive and false negative?
A false positive is a prediction of true (positive) when the actual value is false (negative).
A false negative is a prediction of false (negative) when the actual value is true (positive).
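
To make the two definitions concrete, here is a minimal sketch in plain Python. The function name `outcome` and the sample pairs are made up for illustration.

```python
def outcome(actual: bool, predicted: bool) -> str:
    """Name the outcome of a single binary prediction."""
    if predicted and actual:
        return "true positive"
    if predicted and not actual:
        return "false positive"   # predicted true, actual value is false
    if not predicted and actual:
        return "false negative"   # predicted false, actual value is true
    return "true negative"

# Hypothetical (actual, predicted) pairs, made up for illustration.
pairs = [(True, True), (False, True), (True, False), (False, False)]
labels = [outcome(actual, predicted) for actual, predicted in pairs]
print(labels)
```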

*******************************************

Can you explain confusion matrix with a simple diagram?
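Rows are the actual class and columns are the predicted class, each ordered negative then positive:
[
(True Negative, False Positive),
(False Negative, True Positive)
]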
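
As a rough sketch, the same layout can be built by counting (actual, predicted) pairs. This assumes 0/1 labels with 1 as the positive class. The function name `confusion_matrix` here is a plain-Python helper written for this example, and the labels are made up.

```python
def confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels.

    Rows are indexed by the actual class, columns by the predicted class.
    """
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels, made up for illustration.
actual    = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_matrix(actual, predicted)
print(tn, fp, fn, tp)  # prints: 3 1 1 3
```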

*******************************************

What is CAP?
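CAP stands for Cumulative Accuracy Profile, and it is used for comparing different models. The curve plots the share of samples, ranked from highest to lowest predicted probability, against the share of actual positives captured so far. A model whose CAP curve rises further above the random prediction line (a larger area between the two) is considered the better model.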
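
A minimal sketch of how a CAP curve and its gap over the random line could be computed, assuming 0/1 labels and one model score per sample. The helper names `cap_curve` and `area_under` and the data are made up for illustration.

```python
def cap_curve(actual, scores):
    """Return CAP points: (fraction of samples, fraction of positives found).

    Samples are ranked from the highest to the lowest model score.
    """
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_positives = sum(actual)
    points = [(0.0, 0.0)]
    found = 0
    for i, (_, label) in enumerate(ranked, start=1):
        found += label
        points.append((i / total, found / total_positives))
    return points

def area_under(points):
    """Trapezoidal area under a list of (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and model scores, made up for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

points = cap_curve(actual, scores)
# The random prediction line is the diagonal, whose area is 0.5.
gap = area_under(points) - 0.5
print(round(gap, 3))
```

When comparing two models on the same data, the one with the larger `gap` has the CAP curve that sits further above the random line.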

*******************************************

How do you analyze CAP?
After drawing the model's CAP curve, read off its value at 50% on the horizontal axis, that is, the share of actual positives captured within the top-ranked 50% of the samples. Call this value X.
If X < 60%, then it’s a Useless model.
If 60% < X < 70%, then it’s a Poor model.
If 70% < X < 80%, then it’s a Good model.
If 80% < X < 90%, then it’s a Very Good model.
If 90% < X < 100%, then it’s a Too Good to be true model. It might be an overfitting model.
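
The bands above can be sketched as a small lookup. This assumes 0/1 labels and one score per sample. The text leaves the exact boundaries (60%, 70%, and so on) open, so assigning each boundary to the higher band is an assumption of this sketch, and the names `cap_at_half` and `rate_model` are made up for illustration.

```python
def cap_at_half(actual, scores):
    """Fraction of all positives found in the top-scoring 50% of samples."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    half = len(ranked) // 2
    found = sum(label for _, label in ranked[:half])
    return found / sum(actual)

def rate_model(x):
    """Map the CAP reading X (0.0 to 1.0) to a rule-of-thumb label.

    Assumption: a value exactly on a boundary falls into the higher band.
    """
    if x < 0.6:
        return "useless"
    if x < 0.7:
        return "poor"
    if x < 0.8:
        return "good"
    if x < 0.9:
        return "very good"
    return "too good, possibly overfitting"

# Hypothetical labels and model scores, made up for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

x = cap_at_half(actual, scores)
print(x, rate_model(x))  # prints: 0.75 good
```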

*******************************************

How do we know which model to choose?
From a mathematical point of view ::
If the problem is linear, you should go for Logistic Regression or SVM (with a linear kernel).
If the problem is non-linear, you should go for K-NN, Naive Bayes, Decision Tree or Random Forest.
From a business point of view ::
- Logistic Regression or Naive Bayes when you want to rank your predictions by their probability. For example, you may want to rank your customers from the highest to the lowest probability of buying a certain product, which lets you target your marketing campaigns. For this type of business problem, use Logistic Regression if your problem is linear and Naive Bayes if it is non-linear.
- SVM when you want to predict which segment your customers belong to. Segments can be of any kind, for example market segments you identified earlier with clustering.
- Decision Tree when you want a clear interpretation of your model results.
- Random Forest when you are just looking for high performance with less need for interpretation.
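
The rules of thumb above can be written down as a small helper. This is only a sketch of the guidance in this post, not a substitute for comparing models on your own data. The function name `suggest_model` and the goal keywords are made up for illustration.

```python
def suggest_model(linear: bool, goal: str = "") -> str:
    """Return the model family suggested by the rules of thumb above.

    `goal` is a hypothetical keyword: "rank_by_probability",
    "predict_segment", "interpretability" or "performance".
    Any other value falls back to the mathematical point of view.
    """
    if goal == "rank_by_probability":
        return "Logistic Regression" if linear else "Naive Bayes"
    if goal == "predict_segment":
        return "SVM"
    if goal == "interpretability":
        return "Decision Tree"
    if goal == "performance":
        return "Random Forest"
    if linear:
        return "Logistic Regression or SVM"
    return "K-NN, Naive Bayes, Decision Tree or Random Forest"

print(suggest_model(linear=False, goal="rank_by_probability"))  # prints: Naive Bayes
print(suggest_model(linear=True))  # prints: Logistic Regression or SVM
```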

*******************************************

Next :: Data Science (Python) :: K-Means Clustering

Prev :: Data Science (Python) :: Decision Tree Classification & Random Forest Classification
