Binary Classification
To be, or not to be: that is the question
Originally published on my blog: Binary Classification
Introduction
Binary classification, as the name suggests, is the task of classifying elements into one of two classes/groups. Some applications of binary classification are:
- Testing if a person has a particular disease or not
- Classifying email as spam or not spam
- Credit card fraud detection, etc.
It is a form of supervised learning where:
- we are given a set of labelled observations,
- a model is trained on those observations,
- and the trained model can then classify new observations into one of the two categories.
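The steps above can be sketched as follows, here with scikit-learn and a synthetic dataset (both the library choice and the data are illustrative assumptions, not the only way to do this):

```python
# Minimal supervised-learning workflow for binary classification.
# scikit-learn and the synthetic dataset are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. A set of labelled observations (synthetic here)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 2. Train a model on those observations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)

# 3. Classify new observations into one of the two categories
predictions = model.predict(X_test)  # each prediction is 0 or 1
```

Any other classifier could be swapped in at step 2 without changing the overall workflow.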
Methods
Several methods are commonly used for binary classification, e.g., logistic regression, k-nearest neighbors, decision trees, support vector machines, and naive Bayes. None of these is universally better than the others; the right choice depends entirely on the problem/use case and the available data. The no-free-lunch theorem tells us that any two optimization algorithms are equivalent when their performance is averaged across all possible problems. That said, it is recommended to start with something simple and make it more complicated if and only if necessary.
Evaluation
The simplest and most common evaluation metric for a binary classification problem is accuracy.
Accuracy = (# Correct Predictions)/(# Observations)
Although accuracy seems like a very good evaluation metric, it may not be desirable for every use case. Say we are trying to detect whether a person has cancer, we classify 1000 people, and we achieve 95% accuracy. That may seem like a very good model, but here is the catch: if most of the samples are negative (no cancer) and the model predicts them as negative, the accuracy will be high even if some of the positive samples are predicted as negative. This is undesirable, since we would not want to tell a person with cancer that they do not have cancer, whereas we can ask a person without cancer to undergo further tests if required. So we would prefer a model that tries not to predict positive cases as negative, i.e., one with a higher recall.
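The cancer example can be made concrete with a few lines of plain Python. The numbers below are illustrative: a degenerate model that always predicts "no cancer" still reaches 95% accuracy on an imbalanced sample, while missing every actual cancer case:

```python
# Why accuracy misleads on imbalanced data: a model that always
# predicts "no cancer". The 50/950 split is illustrative.
actual = [1] * 50 + [0] * 950   # 50 positive (cancer), 950 negative
predicted = [0] * 1000          # model always says "no cancer"

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)       # 950/1000 = 0.95

true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)  # 0/50 = 0.0: every cancer case missed
```

High accuracy, zero recall: exactly the failure mode described above.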
Let’s say that the observations have both positive (P) and negative (N) samples. The model gives two types of predictions: predicted positive (P’) and predicted negative (N’). Based on the predictions, we can create a confusion matrix:
- True Positive (TP) : Positive and Predicted Positive
- True Negative (TN) : Negative and Predicted Negative
- False Positive (FP) : Negative and Predicted Positive
- False Negative (FN) : Positive and Predicted Negative
It can be seen that:
- P + N = # Observations
- TP + TN = True (Correct) Predictions
- FP + FN = False (Incorrect) Predictions
- TP + FN = P
- TN + FP = N
- TP + FP = P’
- TN + FN = N’
Therefore, Accuracy (ACC) = (TP + TN)/(P + N)
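A short snippet can compute the four counts from raw labels and check the identities above (the toy labels are illustrative):

```python
# Confusion-matrix counts from raw labels, plus the identities above.
# The eight toy labels are purely illustrative.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

TP = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # 2
TN = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # 4
FP = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # 1
FN = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # 1

P = sum(actual)               # actual positives
N = len(actual) - P           # actual negatives
assert TP + FN == P           # identity from the list above
assert TN + FP == N           # identity from the list above

accuracy = (TP + TN) / (P + N)  # 6/8 = 0.75
```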
We can derive 8 more useful metrics based on TP, FP, TN, FN. These are:
- Recall/Sensitivity/Hit Rate/True Positive Rate (TPR) = TP/P
- Miss Rate/False Negative Rate (FNR) = FN/P
- Specificity (SPC)/True Negative Rate (TNR) = TN/N
- Fall-out/False Positive Rate (FPR) = FP/N
- Precision/Positive predictive value (PPV) = TP/P’
- False Discovery Rate (FDR) = FP/P’
- Negative predictive value (NPV) = TN/N’
- False Omission Rate (FOR) = FN/N’
Further Derivations:
- Positive Likelihood ratio (LR+) = TPR/FPR
- Negative Likelihood ratio (LR-) = FNR/TNR
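All of these derived metrics follow mechanically from the four counts. A sketch, with illustrative counts:

```python
# The eight derived metrics and the two likelihood ratios,
# computed from illustrative confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10
P, N = TP + FN, TN + FP      # actual positives / negatives
Pp, Np = TP + FP, TN + FN    # predicted positives (P') / negatives (N')

TPR = TP / P    # recall / sensitivity / hit rate  = 40/50 = 0.8
FNR = FN / P    # miss rate                        = 10/50 = 0.2
TNR = TN / N    # specificity                      = 45/50 = 0.9
FPR = FP / N    # fall-out                         =  5/50 = 0.1
PPV = TP / Pp   # precision                        = 40/45
FDR = FP / Pp   # false discovery rate             =  5/45
NPV = TN / Np   # negative predictive value        = 45/55
FOR = FN / Np   # false omission rate              = 10/55

LR_pos = TPR / FPR   # positive likelihood ratio
LR_neg = FNR / TNR   # negative likelihood ratio
```

Note that each pair (TPR/FNR, TNR/FPR, PPV/FDR, NPV/FOR) sums to 1, since each pair shares a denominator.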
Another commonly used evaluation metric is the AUC (area under the curve) score of the ROC (receiver operating characteristic) curve.
An ROC space is defined with FPR and TPR as the x and y axes, respectively, and depicts the relative trade-off between true positives (benefits) and false positives (costs). The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. AUC is the area under the ROC curve; a higher AUC score signifies a better model.
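In code, this usually means scoring each sample with a probability and sweeping the threshold. A sketch using scikit-learn (an assumed library choice, with four illustrative samples):

```python
# ROC curve points and AUC score with scikit-learn.
# The four labels/scores are illustrative.
from sklearn.metrics import roc_curve, roc_auc_score

actual = [0, 0, 1, 1]            # true labels
scores = [0.1, 0.4, 0.35, 0.8]   # model's predicted probabilities

# (FPR, TPR) points at various threshold settings
fpr, tpr, thresholds = roc_curve(actual, scores)

auc = roc_auc_score(actual, scores)  # area under that curve
```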
If you want to see a working approach for a binary classification problem, check this out.
If you want to jump to an algorithm, it'll be good to start with the k-nearest neighbors algorithm.
Stay tuned as I learn and share more of my learnings on Learning Machine Learning.