Binary Classification
To be, or not to be: that is the question
Originally published on my blog: Binary Classification
Introduction
Binary classification, as the name suggests, is the task of classifying elements into one of two classes/groups. Some applications of binary classification are:
- Testing if a person has a particular disease or not
- Classifying email as spam or not spam
- Credit card fraud detection, etc.
It is a form of supervised learning where:
- we are given a set of labelled observations,
- a model is trained on those observations,
- and the trained model can then classify new observations into one of the two categories.
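The steps above can be sketched as follows, here with scikit-learn and a synthetic dataset (both the library choice and the data are illustrative assumptions, not the only way to do this):

```python
# Minimal supervised-learning workflow for binary classification.
# scikit-learn and the synthetic dataset are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. A set of labelled observations (synthetic here)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 2. Train a model on those observations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)

# 3. Classify new observations into one of the two categories
predictions = model.predict(X_test)  # each prediction is 0 or 1
```

Any other classifier could be swapped in at step 2 without changing the overall workflow.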
Methods
Several methods are commonly used for binary classification, e.g., logistic regression, k-nearest neighbors, decision trees, support vector machines, and naive Bayes. None of these is universally better than the others; the right choice depends entirely on the problem/use case and the available data. The no-free-lunch theorem tells us that any two optimization algorithms are equivalent when their performance is averaged across all possible problems. That said, it is recommended to start with something simple and make it more complicated if and only if necessary.
Evaluation
The simplest and most common evaluation metric for a binary classification problem is accuracy.
Accuracy = (# Correct Predictions)/(# Observations)
Although accuracy seems like a very good evaluation metric, it may not be desirable for every use case. Say we are trying to detect whether a person has cancer, we classify 1000 people, and we achieve 95% accuracy. That may seem like a very good model, but here is the catch: if most of the samples are negative (no cancer) and the model predicts them as negative, the accuracy will be high even if some of the positive samples are predicted as negative. This is undesirable, since we would not want to tell a person with cancer that they do not have cancer, whereas we can ask a person without cancer to undergo further tests if required. So we would prefer a model that tries not to predict positive cases as negative, i.e., one with a higher recall.
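The cancer example can be made concrete with a few lines of plain Python. The numbers below are illustrative: a degenerate model that always predicts "no cancer" still reaches 95% accuracy on an imbalanced sample, while missing every actual cancer case:

```python
# Why accuracy misleads on imbalanced data: a model that always
# predicts "no cancer". The 50/950 split is illustrative.
actual = [1] * 50 + [0] * 950   # 50 positive (cancer), 950 negative
predicted = [0] * 1000          # model always says "no cancer"

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)       # 950/1000 = 0.95

true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)  # 0/50 = 0.0: every cancer case missed
```

High accuracy, zero recall: exactly the failure mode described above.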
Let’s say that the observations have both positive (P) and negative (N) samples. The model gives two types of predictions: predicted positive (P’) and predicted negative (N’). Based on the predictions, we can create a confusion matrix:
- True Positive (TP) : Positive and Predicted Positive
- True Negative (TN) : Negative and Predicted Negative
- False Positive (FP) : Negative and Predicted Positive
- False Negative (FN) : Positive and Predicted Negative
It can be seen that:
- P + N = # Observations
- TP + TN = True (Correct) Predictions
- FP + FN = False (Incorrect) Predictions
- TP + FN = P
- TN + FP = N
- TP + FP = P’
- TN + FN = N’
Therefore, Accuracy (ACC) = (TP + TN)/(P + N)
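A short snippet can compute the four counts from raw labels and check the identities above (the toy labels are illustrative):

```python
# Confusion-matrix counts from raw labels, plus the identities above.
# The eight toy labels are purely illustrative.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

TP = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # 2
TN = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # 4
FP = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # 1
FN = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # 1

P = sum(actual)               # actual positives
N = len(actual) - P           # actual negatives
assert TP + FN == P           # identity from the list above
assert TN + FP == N           # identity from the list above

accuracy = (TP + TN) / (P + N)  # 6/8 = 0.75
```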
We can derive 8 more useful metrics based on TP, FP, TN, FN. These are:
- Recall/Sensitivity/Hit Rate/True Positive Rate (TPR) = TP/P
- Miss Rate/False Negative Rate (FNR) = FN/P
- Specificity (SPC)/True Negative Rate (TNR) = TN/N
- Fall-out/False Positive Rate (FPR) = FP/N
- Precision/Positive predictive value (PPV) = TP/P’
- False Discovery Rate (FDR) = FP/P’
- Negative predictive value (NPV) = TN/N’
- False Omission Rate (FOR) = FN/N’
Further Derivations:
- Positive Likelihood ratio (LR+) = TPR/FPR
- Negative Likelihood ratio (LR-) = FNR/TNR
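All of these derived metrics follow mechanically from the four counts. A sketch, with illustrative counts:

```python
# The eight derived metrics and the two likelihood ratios,
# computed from illustrative confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10
P, N = TP + FN, TN + FP      # actual positives / negatives
Pp, Np = TP + FP, TN + FN    # predicted positives (P') / negatives (N')

TPR = TP / P    # recall / sensitivity / hit rate  = 40/50 = 0.8
FNR = FN / P    # miss rate                        = 10/50 = 0.2
TNR = TN / N    # specificity                      = 45/50 = 0.9
FPR = FP / N    # fall-out                         =  5/50 = 0.1
PPV = TP / Pp   # precision                        = 40/45
FDR = FP / Pp   # false discovery rate             =  5/45
NPV = TN / Np   # negative predictive value        = 45/55
FOR = FN / Np   # false omission rate              = 10/55

LR_pos = TPR / FPR   # positive likelihood ratio
LR_neg = FNR / TNR   # negative likelihood ratio
```

Note that each pair (TPR/FNR, TNR/FPR, PPV/FDR, NPV/FOR) sums to 1, since each pair shares a denominator.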
Another commonly used evaluation metric is the AUC (area under the curve) score of the ROC (receiver operating characteristic) curve.
An ROC space is defined with FPR and TPR as the x and y axes, respectively, and depicts the relative trade-off between true positives (benefits) and false positives (costs). The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. AUC is the area under the ROC curve; a higher AUC score signifies a better model.
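In code, this usually means scoring each sample with a probability and sweeping the threshold. A sketch using scikit-learn (an assumed library choice, with four illustrative samples):

```python
# ROC curve points and AUC score with scikit-learn.
# The four labels/scores are illustrative.
from sklearn.metrics import roc_curve, roc_auc_score

actual = [0, 0, 1, 1]            # true labels
scores = [0.1, 0.4, 0.35, 0.8]   # model's predicted probabilities

# (FPR, TPR) points at various threshold settings
fpr, tpr, thresholds = roc_curve(actual, scores)

auc = roc_auc_score(actual, scores)  # area under that curve
```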
If you want to see a working approach for a binary classification problem, check this out.
If you want to jump to an algorithm, it'll be good to start with the k-nearest neighbors algorithm.
Stay tuned as I learn and share more of my learnings on Learning Machine Learning.