Binary Classification vs Hypothesis Testing Explained Using Real-Life Covid-19 Use-Cases

Pankaj Agarwal
Published in Analytics Vidhya · 6 min read · Jul 12, 2021
A person making a binary choice.

Binary classification is typically used for prediction tasks in Machine Learning, whereas hypothesis testing is the standard tool for inference tasks in statistics. Both tasks involve making a binary decision, but they are used to solve very different problems, and the input data and algorithms differ in each case. However, since both ultimately come down to a binary decision, they share quite a few evaluation metrics, even though their terminologies differ. In this blog, I'll explain both techniques using recent Covid-19 use-cases and attempt to drive home the similarities and differences between the two theories.

Binary Classification

Objective: Given an n-dimensional feature vector (x), classify it into one of two categories (C1 or C2).
Example — Suppose one builds a model that takes in a person's blood parameters and predicts whether that person is Covid-infected or not. This is analogous to the popular RT-PCR test performed in laboratories.

Classification task: Sample Dataset for Covid19 prediction

Model Building: There is a plethora of choices available for building this model, such as Logistic Regression, Decision Trees, Neural Networks, etc.

Prediction: We pass the features (x) of a person's blood to the model (f(x)), and it generates an output of 1 or 0.
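As a minimal sketch of this pipeline, here is a Logistic Regression classifier f(x) trained on synthetic data. The feature values and labels are made up for illustration; a real model would be trained on actual blood-parameter measurements.

```python
# Sketch of a binary classifier f(x): blood parameters -> 1 (positive) / 0 (negative).
# All data below is synthetic, generated only to make the example runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                    # 200 people, 3 blood parameters
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic 0/1 labels

model = LogisticRegression().fit(X, y)           # this is our f(x)
prediction = model.predict(X[:1])[0]             # outputs 1 or 0 for one person
```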

Evaluation: Let’s assume the class “Covid positive” is denoted by the integer 1 and the class “Covid negative” by the integer 0. During prediction, our model can make two kinds of errors —

a) False Positive: The model classifies a person as Covid-positive when they are actually negative.

b) False Negative: The model classifies a person as Covid-negative when they are actually positive.

Binary Classification: Confusion Matrix and some important measures

c) Sensitivity or Recall or True Positive Rate: A very popular measure of how well our model catches all the actual positives in the dataset: TP / (TP + FN).

d) Specificity or True Negative Rate: The ratio of true negatives to actual negatives in the dataset: TN / (TN + FP). The False Positive Rate is 1 - Specificity.

e) Precision or Positive Predictive Value: Again a very popular metric; it is the proportion of true positives out of all predicted positives: TP / (TP + FP).

No single metric is good enough for imbalanced datasets. Hence, a combination of (Precision & Recall) or (Sensitivity & Specificity) is usually used to describe the performance of a classification model.
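These metrics can all be computed directly from the four confusion-matrix cells. The counts below are made up purely for illustration:

```python
# Confusion-matrix cells (illustrative counts, not from any real test)
tp, fp, fn, tn = 80, 10, 15, 95

sensitivity = tp / (tp + fn)       # recall / true positive rate
specificity = tn / (tn + fp)       # true negative rate
precision   = tp / (tp + fp)       # true positives among predicted positives
fpr         = 1 - specificity      # false positive rate
```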

Let’s understand these metrics using the real-life performance of the RT-PCR test. According to the paper [1], below is the performance of an RT-PCR test:

RT-PCR Tests performance metrics

These metrics can be interpreted in different ways by different people —

Individual person — If I test positive in an RT-PCR test, there is an 11.1% (100% - 88.9%) probability that the result is a false positive. On the other hand, if I test negative, there is only a 1.6% (100% - 98.4%) probability that I am actually infected with Covid.

Government — Out of all the positive cases in my area, RT-PCR testing will detect 84.2% of the cases but will miss 15.8% (100% - 84.2%) of Covid-positive cases. A similar, though less useful, interpretation can be made for detecting negative cases.

Hence, we can see that each metric has its own purpose and interpretation. There are further metrics like Accuracy, ROC-AUC, and F1 score, which we will not discuss here.

Hypothesis Testing

Objective: Loosely speaking, given two hypotheses, we need to decide which one seems more likely given the observed experimental data. The first hypothesis is called the “Null Hypothesis (H0)” and the second the “Alternate Hypothesis (H1)”. Here, instead of features at a person or object level, we are given a set of observations under each hypothesis.

Hypothesis Testing: Sample Dataset where the values can be boolean like 0/ 1 or continuous floats.

Example — If we are evaluating whether a vaccine is effective against Covid-19, then H0 states that it is not effective and H1 states that it is effective. We aim to reject the null hypothesis.

Algorithm: Two popular tests for hypothesis testing are the Z-test (used for proportion metrics) and the T-test (used for continuous metrics). For testing a vaccine’s effectiveness, we can use a two-sample proportion test (Z-test).
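To illustrate the T-test branch, here is a sketch with synthetic continuous observations (say, some blood metric under placebo vs vaccine). The group means, spreads, and sizes are invented for the example:

```python
# Two-sample T-test on synthetic continuous data (all numbers invented)
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
placebo = rng.normal(loc=10.0, scale=2.0, size=50)   # synthetic control group
treated = rng.normal(loc=11.5, scale=2.0, size=50)   # synthetic treatment group

t_stat, p_value = ttest_ind(treated, placebo)        # two-sample T-test
reject_h0 = p_value < 0.05                           # compare p-value to alpha
```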

Inference: If the p-value obtained is less than alpha (equivalently, if the z-value exceeds the critical value), then we reject H0; otherwise we fail to reject H0. Inability to reject H0 doesn’t mean that H0 is true; it may simply be that the experiment doesn’t have enough power (sample size) to reject H0.

Evaluation: We can make two kinds of errors here as well.

a) Type I Error: Rejecting the null hypothesis (H0) when in reality it is true. This is analogous to a False Positive in classification.

b) Type II Error: Failing to reject the null hypothesis (H0) when in reality it is false. This is analogous to a False Negative in classification.

Performance measure of Hypothesis Testing

c) Alpha: The probability that we incorrectly reject H0 (the orange area in the graph above). 1 - alpha is the same as “Specificity” in classification. Typically this value is set at 5%.

d) Beta: The probability that we fail to reject H0 when it is actually false. The purple area in the graph represents beta.

e) Power: Defined as 1 - Beta, which is equivalent to Sensitivity in classification. It denotes the probability of correctly rejecting the null hypothesis; the white striped area represents power. Typically we choose the sample size so that power is at least 80%.
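The link between power and sample size can be sketched with the standard normal-approximation formula for a two-sample proportion z-test. The infection rates (10% vs 5%) below are assumed purely for illustration:

```python
# Approximate per-group sample size for a two-sided two-sample
# proportion z-test, using the standard normal approximation.
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)        # critical value for alpha
    z_beta = norm.ppf(power)                 # quantile for desired power
    var = p1 * (1 - p1) + p2 * (1 - p2)      # sum of Bernoulli variances
    return (z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2

# e.g. detecting a drop in infection rate from 10% to 5% at 80% power
n = n_per_group(0.10, 0.05)                  # roughly 430 people per group
```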

Let’s understand these terms using the real-life example of Moderna’s vaccine efficacy results.

Clinical trial results for Moderna’s mRNA -1273 Vaccine

As per this report, scientists wanted to measure the efficacy of the mRNA-1273 vaccine. For this purpose, they gave 2 doses of vaccine to 14,550 people and 2 doses of placebo to 14,598 people. After a follow-up period, they observed how many participants got infected with Covid-19. The table above contains the results.

p_placebo   = (185 + 30) / 14598 = 0.014728
p_mRNA-1273 = (11 + 0) / 14550 = 0.000756
Efficacy = delta% = (p_placebo - p_mRNA-1273) / p_placebo ≈ 94%

Note: The confidence-interval and p-value computation for delta% is a somewhat involved procedure and hence not shown in this blog.

The low p-value reported suggests that the efficacy of 94% is significant; in other words, the probability of a Type-I error is < 0.001. Interested readers can check out this Stack Overflow link to understand the p-value and CI computation for a delta% metric.
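As a rough check, a pooled two-sample proportion z-test on the trial counts already shows the result is highly significant. This is a simplification of the trial's actual analysis, using only the raw counts from the table:

```python
# Pooled two-sample proportion z-test on the Moderna trial counts
from math import sqrt
from scipy.stats import norm

x_placebo, n_placebo = 185 + 30, 14598     # infected / total, placebo arm
x_vaccine, n_vaccine = 11 + 0, 14550       # infected / total, vaccine arm

p1, p2 = x_placebo / n_placebo, x_vaccine / n_vaccine
efficacy = (p1 - p2) / p1                  # the "~94%" efficacy above

p_pool = (x_placebo + x_vaccine) / (n_placebo + n_vaccine)   # under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_placebo + 1 / n_vaccine))
z = (p1 - p2) / se                         # very large z-statistic
p_value = norm.sf(z)                       # one-sided p-value, far below 0.001
```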

Summary and Conclusion

Side by Side comparison between Classification and Hypothesis Test

The above table provides a summary comparison that can help us understand the relationship between these two very popular techniques. Hopefully, this article clears up some doubts that may arise from reading these two topics separately in the literature.

Open Questions [Please put your thoughts in comments]

  1. Is there a metric analogous to accuracy in hypothesis testing?
  2. Is there a metric analogous to ROC-AUC in hypothesis testing?

References

  1. https://www.sciencedirect.com/science/article/pii/S2666389920301562#:~:text=In%20this%20aspect%2C%20hypothesis%20testing,known%20binary%20answers%20in%20data.
  2. https://medium.com/swlh/how-to-remember-all-these-classification-concepts-forever-761c065be33
