Statistics in Machine Learning - Medium

Converting AUC to Odds Ratio (OR): A Comprehensive Guide Using Python and MLstatkit

Yong Zhen Huang — Wed, 16 Oct 2024 23:22:21 GMT

Introduction

In the evaluation of diagnostic tests and binary classification models, the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve is a widely used metric. While AUC provides a measure of a model’s discriminative ability, clinicians and researchers often prefer effect size measures like the Odds Ratio (OR) for their interpretability in clinical contexts.

Converting AUC to OR bridges the gap between statistical model evaluation and clinical interpretation. In this article, we will explore the relationship between AUC and OR, discuss their clinical significance, and demonstrate how to perform the conversion using Python. We’ll also introduce MLstatkit, a library that simplifies this process with its AUC2OR function.

Understanding AUC and Odds Ratio

What is AUC?

The Area Under the ROC Curve (AUC) is a measure of a classifier’s ability to distinguish between positive and negative classes. It ranges from 0 to 1, where:

AUC = 0.5: Model has no discriminative ability (equivalent to random guessing).
AUC > 0.5: Model performs better than random.
AUC = 1: Perfect classification.
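
To make the definition concrete, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following is a minimal sketch of that pairwise view; the helper name pairwise_auc is a hypothetical illustration and is not part of MLstatkit.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    Illustrative helper only: it is quadratic in the number of samples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative sample.")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 1, 0, 1]
# Every positive outranks every negative.
print(pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8]))
# One of the four positive-negative pairs is misordered.
print(pairwise_auc(labels, [0.2, 0.3, 0.4, 0.7]))
```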

What is Odds Ratio (OR)?

The Odds Ratio (OR) is a measure of the association between an exposure and an outcome. It represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring without that exposure.

OR = 1: No association between exposure and outcome.
OR > 1: Exposure is associated with higher odds of the outcome.
OR < 1: Exposure is associated with lower odds of the outcome.
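
As a small worked example of the definition, the sketch below computes an odds ratio from a 2×2 table. The function name odds_ratio and the counts are hypothetical and chosen only for illustration.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome with exposure divided by the odds without it."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed people have the outcome.
# Odds are 20/80 = 0.25 and 10/90 = 0.111, so the ratio is about 2.25.
print(round(odds_ratio(20, 80, 10, 90), 2))
```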

Clinical Significance

While AUC is useful for evaluating model performance, OR is more interpretable in clinical settings. OR provides a direct measure of how much higher (or lower) the odds of an event are in one group compared to another.

Converting AUC to OR allows clinicians to understand the impact of diagnostic tests or risk factors in terms that are more actionable for patient care.

Converting AUC to Odds Ratio

The Relationship Between AUC and OR

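The conversion from AUC to OR assumes that test results in the positive and negative classes are each normally distributed with equal variance (the binormal model). Under this assumption, the AUC depends only on the standardized distance 𝒹 between the two class means, AUC = Φ(𝒹/√2), where Φ is the standard normal cumulative distribution function. The value of 𝒹 can therefore be recovered from the AUC and then converted to an OR.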
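
The binormal relationship can be checked numerically. The sketch below is an illustration under stated assumptions, not part of MLstatkit: the helper names binormal_auc and simulated_auc, the sample size, and the seed are hypothetical choices. It compares the closed form AUC = Φ(d/√2) with an empirical AUC from simulated unit-variance normal scores.

```python
import bisect
import math
import random
from statistics import NormalDist


def binormal_auc(d):
    """Theoretical AUC when both classes are normal with unit variance
    and the class means are d standard deviations apart."""
    return NormalDist().cdf(d / math.sqrt(2))


def simulated_auc(d, n=5000, seed=0):
    """Empirical AUC from n simulated positives and n simulated negatives."""
    rng = random.Random(seed)
    positives = [rng.gauss(d, 1.0) for _ in range(n)]
    negatives = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    # For each positive, count the negatives that score strictly lower.
    wins = sum(bisect.bisect_left(negatives, p) for p in positives)
    return wins / (n * n)


d = 1.0  # hypothetical standardized mean difference
print(f"closed form: {binormal_auc(d):.4f}")
print(f"simulation:  {simulated_auc(d):.4f}")
```

The two numbers should agree up to sampling noise, which shrinks as n grows.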

Mathematical Formulation

The conversion involves several steps, including a logarithmic transformation and a rational approximation. The key intermediate variables are:

𝓉: Derived from the AUC using a logarithmic transformation.
𝓏: Calculated from 𝓉 using a rational approximation to the inverse of the standard normal cumulative distribution function (the form given as formula 26.2.23 in Abramowitz and Stegun's Handbook of Mathematical Functions).
𝒹: A scaling of 𝓏, representing the standardized mean difference (Cohen’s 𝒹).
ln_OR: The natural logarithm of the Odds Ratio, derived from 𝒹.
OR: The Odds Ratio.

Step-by-Step Formulas

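1. Calculating 𝓉

$$ t = \sqrt{\ln\left(\frac{1}{(1 - \mathrm{AUC})^{2}}\right)} $$

Explanation: Transforms the AUC into an intermediate variable 𝓉 using a logarithmic function.

2. Calculating 𝓏

$$ z = t - \frac{2.515517 + 0.802853\,t + 0.0103328\,t^{2}}{1 + 1.432788\,t + 0.189269\,t^{2} + 0.001308\,t^{3}} $$

Explanation: Approximates the inverse cumulative distribution function of the standard normal distribution (the probit function) with a ratio of two polynomials in 𝓉. The coefficients are directly substituted into the formula.

3. Calculating 𝒹

$$ d = \sqrt{2}\,z $$

Explanation: Converts the z-score into Cohen’s 𝒹, a measure of effect size.

4. Calculating ln(OR)

$$ \ln(\mathrm{OR}) = \frac{\pi\,d}{\sqrt{3}} $$

Explanation: Derives the natural logarithm of the Odds Ratio from the effect size 𝒹, utilizing the properties of the logistic distribution.

5. Calculating Odds Ratio (OR)

$$ \mathrm{OR} = e^{\ln(\mathrm{OR})} $$

Explanation: Exponentiates ln(OR) to obtain the Odds Ratio (OR).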

Implementing the Conversion in Python

Let’s walk through the implementation of the AUC to OR conversion step by step.

The Code:

import math

def AUC2OR(AUC, return_all=False):
    """
    Converts Area Under the Curve (AUC) to Odds Ratio (OR) and optionally returns intermediate values.
    
    Parameters:
    -----------
    AUC : float
        The Area Under the Curve (AUC) value to be converted.
    return_all : bool, default=False
        If True, returns intermediate values t, z, d, and ln_OR in addition to OR.
    
    Returns:
    --------
    OR : float
        The calculated Odds Ratio (OR) from the given AUC value.
    t : float, optional
        Intermediate value calculated from AUC.
    z : float, optional
        Intermediate value calculated from t.
    d : float, optional
        Intermediate value calculated from z.
    ln_OR : float, optional
        The natural logarithm of the Odds Ratio.
    """
    
    def calculate_t(AUC):
        return math.sqrt(math.log(1 / ((1 - AUC) ** 2)))

    def calculate_z(AUC):
        t = calculate_t(AUC)
        numerator = 2.515517 + 0.802853 * t + 0.0103328 * (t ** 2)
        denominator = 1 + 1.432788 * t + 0.189269 * (t ** 2) + 0.001308 * (t ** 3)
        z = t - (numerator / denominator)
        return z

    def calculate_d(AUC):
        z = calculate_z(AUC)
        d = z * math.sqrt(2)
        return d

    t = calculate_t(AUC)
    z = calculate_z(AUC)
    d = calculate_d(AUC)
    ln_OR = (math.pi * d) / math.sqrt(3)
    OR = math.exp(ln_OR)
    
    if return_all:
        return t, z, d, ln_OR, OR
    else:
        return OR

Explaining the Implementation

1. Calculating 𝓉

def calculate_t(AUC):
    return math.sqrt(math.log(1 / ((1 - AUC) ** 2)))

Purpose: Transforms the AUC into an intermediate variable 𝓉 using a logarithmic function.
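Explanation: The rational approximation in the next step is written in terms of 𝓉 = √(−2 ln p), where p is an upper-tail probability of the standard normal distribution. Here p = 1 − AUC, so the expression is intended for 0.5 ≤ AUC < 1. An AUC of exactly 1 causes a division by zero.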

2. Calculating 𝓏

def calculate_z(AUC):
    t = calculate_t(AUC)
    numerator = 2.515517 + 0.802853 * t + 0.0103328 * (t ** 2)
    denominator = 1 + 1.432788 * t + 0.189269 * (t ** 2) + 0.001308 * (t ** 3)
    z = t - (numerator / denominator)
    return z

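Purpose: Approximates the inverse of the standard normal cumulative distribution function (the probit function) with a rational approximation (the form given as formula 26.2.23 in Abramowitz and Stegun's Handbook of Mathematical Functions).
Explanation: The ratio of two low-order polynomials in 𝓉 gives a computationally cheap estimate of the z-score 𝓏 = Φ⁻¹(AUC) without calling a statistics library.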
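
To get a feel for how close the approximation is, the following sketch compares it with the exact inverse normal CDF from Python's standard library (statistics.NormalDist, available from Python 3.8). The function name approx_probit_from_auc is a hypothetical wrapper; its constants are copied from the calculate_z code above.

```python
import math
from statistics import NormalDist


def approx_probit_from_auc(auc):
    """Rational approximation from calculate_z above (intended for 0.5 <= auc < 1)."""
    t = math.sqrt(math.log(1 / ((1 - auc) ** 2)))
    numerator = 2.515517 + 0.802853 * t + 0.0103328 * t ** 2
    denominator = 1 + 1.432788 * t + 0.189269 * t ** 2 + 0.001308 * t ** 3
    return t - numerator / denominator


exact_probit = NormalDist().inv_cdf  # exact inverse of the standard normal CDF

for auc in (0.55, 0.7, 0.85, 0.95, 0.99):
    z_approx = approx_probit_from_auc(auc)
    z_exact = exact_probit(auc)
    print(f"AUC={auc:.2f}  approx={z_approx:.5f}  exact={z_exact:.5f}  "
          f"diff={z_approx - z_exact:+.5f}")
```

On these inputs the two values differ only in the fourth decimal place, which is small relative to the uncertainty of a typical AUC estimate.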

3. Calculating 𝒹

def calculate_d(AUC):
    z = calculate_z(AUC)
    d = z * math.sqrt(2)
    return d

Purpose: Converts the z-score into Cohen’s 𝒹, a measure of effect size.
Explanation: Under the binormal, equal-variance model, AUC = Φ(𝒹/√2), because the difference between a positive and a negative score has variance 2 (the sum of two unit variances). Inverting this relationship gives 𝒹 = √2 · 𝓏.

4. Calculating ln_OR and OR

ln_OR = (math.pi * d) / math.sqrt(3)
OR = math.exp(ln_OR)

Purpose: Calculates the natural logarithm of the Odds Ratio and then exponentiates it to obtain the OR.
Explanation: A logistic distribution has standard deviation π/√3, so a standardized mean difference 𝒹 corresponds to a log odds ratio of 𝒹 · π/√3 ≈ 1.81 · 𝒹.
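
For comparison, the same chain of steps can be written with an exact probit instead of the rational approximation. This is a sketch under the same binormal and logistic assumptions, not the MLstatkit implementation; the name auc_to_or_exact and the input check are hypothetical additions.

```python
import math
from statistics import NormalDist


def auc_to_or_exact(auc):
    """AUC -> OR using the exact inverse normal CDF (hypothetical variant)."""
    if not 0.0 < auc < 1.0:
        raise ValueError("auc must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(auc)        # exact probit of the AUC
    d = math.sqrt(2) * z                 # Cohen's d under the binormal model
    ln_or = math.pi * d / math.sqrt(3)   # the logistic SD is pi / sqrt(3)
    return math.exp(ln_or)


for auc in (0.3, 0.5, 0.7, 0.9):
    print(f"AUC={auc:.1f} -> OR={auc_to_or_exact(auc):.3f}")
```

Because the exact probit is symmetric around 0.5, this variant also returns an OR below 1 for an AUC below 0.5, and an OR of 1 for an AUC of 0.5.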

Example Usage

AUC = 0.7  # Example AUC value

# Convert AUC to OR and retrieve all intermediate values
t, z, d, ln_OR, OR = AUC2OR(AUC, return_all=True)

print(f"t: {t:.5f}, z: {z:.5f}, d: {d:.5f}, ln_OR: {ln_OR:.5f}, OR: {OR:.5f}")

# Convert AUC to OR without intermediate values
OR = AUC2OR(AUC)
print(f"OR: {OR:.5f}")

Output:

t: 1.55176, z: 0.52400, d: 0.74105, ln_OR: 1.34411, OR: 3.83477
OR: 3.83477

Interpretation:

𝓉: Intermediate value derived from AUC.
𝓏: Approximate z-score corresponding to the AUC.
𝒹: Cohen’s 𝒹, representing the effect size.
ln_OR: Natural logarithm of the Odds Ratio.
OR: An AUC of 0.7 corresponds to an Odds Ratio of approximately 3.83.
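Under the distributional assumptions above, this is roughly the odds ratio expected if the continuous score were split into test-positive and test-negative groups: the odds of the outcome are about 3.83 times higher in the test-positive group.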

Introducing MLstatkit’s AUC2OR Function

To streamline this conversion process, MLstatkit provides the AUC2OR function, which encapsulates all the calculations we've discussed.

Using MLstatkit’s AUC2OR

Installation

You can install MLstatkit using pip:

pip install MLstatkit

Implementation

from MLstatkit.stats import AUC2OR

AUC = 0.7  # Example AUC value

# Convert AUC to OR and retrieve all intermediate values
t, z, d, ln_OR, OR = AUC2OR(AUC, return_all=True)

print(f"t: {t:.5f}, z: {z:.5f}, d: {d:.5f}, ln_OR: {ln_OR:.5f}, OR: {OR:.5f}")

# Convert AUC to OR without intermediate values
OR = AUC2OR(AUC)
print(f"OR: {OR:.5f}")

Output:

t: 1.55176, z: 0.52400, d: 0.74105, ln_OR: 1.34411, OR: 3.83477
OR: 3.83477

Advantages of Using MLstatkit

Simplicity: Provides a straightforward interface for converting AUC to OR.
Efficiency: Optimized for performance and accuracy.
Convenience: Eliminates the need to implement complex mathematical transformations manually.

Clinical Interpretation of the Results

Converting AUC to OR allows for a more intuitive understanding of a diagnostic test’s effectiveness:

AUC of 0.7: Indicates a fair level of discrimination between positive and negative cases.
OR of 3.83: Suggests that, if the score were dichotomized into a positive or negative test result, the odds of the outcome would be nearly four times higher in the test-positive group than in the test-negative group.

This information can aid clinicians in decision-making processes, such as evaluating the usefulness of a diagnostic test or the impact of a risk factor.

Conclusion

Understanding the relationship between AUC and Odds Ratio enhances the interpretability of model performance metrics in clinical contexts. By converting AUC to OR, we can translate statistical measures into actionable insights.

The AUC2OR function in MLstatkit simplifies this conversion, making it accessible for researchers and practitioners. Whether you’re evaluating diagnostic tests or comparing predictive models, this tool bridges the gap between statistical evaluation and clinical relevance.

References

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36. https://doi.org/10.1148/radiology.143.1.7063747
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12(4), 387–415. https://doi.org/10.1016/0022-2496(75)90001-2
Szumilas, M. (2010). Explaining odds ratios. Journal of the Canadian Academy of Child and Adolescent Psychiatry, 19(3), 227. PMID: 20842279
García, M. R., Sánchez, P., & Alvarado, J. M. (2018). Obtaining a Confidence Interval for AUC in Presence of Non-normality. European Journal of Psychology Applied to Legal Context, 10(2), 49–53. https://doi.org/10.5093/ejpalc2018a5

Additional Resources

MLstatkit Documentation: GitHub Repository

Converting AUC to Odds Ratio (OR): A Comprehensive Guide Using Python and MLstatkit was originally published in Statistics in Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

Comparing ROC Curves in Machine Learning Model with DeLong’s Test: A Practical Guide Using Python…

Yong Zhen Huang — Wed, 16 Oct 2024 23:21:39 GMT

Comparing ROC Curves in Machine Learning Model with DeLong’s Test: A Practical Guide Using Python and MLstatkit

Introduction

In binary classification tasks, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are fundamental metrics for evaluating model performance. When comparing two models, it’s essential to determine if the difference in their AUCs is statistically significant. DeLong’s test provides a statistical method to assess this significance.

In this article, we’ll delve into the principles and applications of DeLong’s test, explain how it’s implemented in Python using a provided code snippet, and introduce MLstatkit, a library offering a convenient and efficient implementation of DeLong’s test.

Understanding DeLong’s Test

What is DeLong’s Test?

DeLong’s test is a non-parametric statistical method used to compare the AUCs of two correlated ROC curves. It evaluates whether the observed difference between the AUCs of two models is statistically significant, accounting for the fact that the two models are tested on the same dataset and their predictions are therefore correlated.

Why Use DeLong’s Test?

When evaluating multiple classifiers on the same dataset, differences in AUC values might occur due to random chance rather than actual performance differences. DeLong’s test allows us to statistically assess whether one model significantly outperforms another.

Null Hypothesis (H₀): The two models have equal AUCs (the true difference is zero).
Alternative Hypothesis (H₁): The two models have different AUCs (the true difference is not zero).

By calculating a p-value, DeLong’s test helps determine whether to reject the null hypothesis in favor of the alternative.
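
The step from a test statistic to a p-value can be sketched with the standard library alone. The helper name two_sided_p_value is hypothetical; it relies on the identity 2 · (1 − Φ(|z|)) = erfc(|z| / √2) for the standard normal distribution.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


for z in (0.0, 1.0, 1.96, 3.0):
    print(f"z={z:.2f} -> p={two_sided_p_value(z):.4f}")
```

A |z| of about 1.96 corresponds to the conventional 0.05 threshold. Larger absolute values give smaller p-values.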

Implementing DeLong’s Test in Python

Below is the implementation of DeLong’s test in Python. We’ll explain each part of the code to help you understand how the test works.

The Code:

import numpy as np
import scipy.stats

def Delong_test(true, prob_A, prob_B):
    """
    Perform DeLong's test for comparing the AUCs of two models.

    Parameters
    ----------
    true : array-like of shape (n_samples,)
        True binary labels in range {0, 1}.
    prob_A : array-like of shape (n_samples,)
        Predicted probabilities by the first model.
    prob_B : array-like of shape (n_samples,)
        Predicted probabilities by the second model.

    Returns
    -------
    z_score : float
        The z score from comparing the AUCs of two models.
    p_value : float
        The p value from comparing the AUCs of two models.

    Example
    -------
    >>> true = [0, 1, 0, 1]
    >>> prob_A = [0.1, 0.4, 0.35, 0.8]
    >>> prob_B = [0.2, 0.3, 0.4, 0.7]
    >>> z_score, p_value = Delong_test(true, prob_A, prob_B)
    >>> print(f"Z-Score: {z_score}, P-Value: {p_value}")
    """

    def compute_midrank(x):
        J = np.argsort(x)
        Z = x[J]
        N = len(x)
        T = np.zeros(N, dtype=np.float64)
        i = 0
        while i < N:
            j = i
            while j < N and Z[j] == Z[i]:
                j += 1
            T[i:j] = 0.5 * (i + j - 1)
            i = j
        T2 = np.empty(N, dtype=np.float64)
        T2[J] = T + 1
        return T2

    def compute_ground_truth_statistics(true):
        assert np.array_equal(np.unique(true), [0, 1]), "Ground truth must be binary."
        order = (-true).argsort()
        label_1_count = int(true.sum())
        return order, label_1_count

    # Prepare data
    order, label_1_count = compute_ground_truth_statistics(np.array(true))
    sorted_probs = np.vstack((np.array(prob_A), np.array(prob_B)))[:, order]

    # Fast DeLong computation starts here
    m = label_1_count  # Number of positive samples
    n = sorted_probs.shape[1] - m  # Number of negative samples
    k = sorted_probs.shape[0]  # Number of models (2)

    # Initialize arrays for midrank computations
    tx, ty, tz = [np.empty([k, size], dtype=np.float64) for size in [m, n, m + n]]
    for r in range(k):
        positive_examples = sorted_probs[r, :m]
        negative_examples = sorted_probs[r, m:]
        tx[r, :], ty[r, :], tz[r, :] = [
            compute_midrank(examples) for examples in [positive_examples, negative_examples, sorted_probs[r, :]]
        ]

    # Calculate AUCs
    aucs = tz[:, :m].sum(axis=1) / (m * n) - (m + 1.0) / (2.0 * n)

    # Compute variance components
    v01 = (tz[:, :m] - tx[:, :]) / n
    v10 = 1.0 - (tz[:, m:] - ty[:, :]) / m

    # Compute covariance matrices
    sx = np.cov(v01)
    sy = np.cov(v10)
    delongcov = sx / m + sy / n

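    # Calculating z-score and p-value
    l = np.array([[1, -1]])
    # np.diff(aucs) is AUC_B - AUC_A; keep its sign so the direction is not lost
    z = np.diff(aucs) / np.sqrt(np.dot(np.dot(l, delongcov), l.T)).flatten()
    p_value = scipy.stats.norm.sf(abs(z)) * 2

    # Flip the sign so that a positive z-score means model A has the higher AUC
    z_score = -z[0].item()
    p_value = p_value[0].item()

    return z_score, p_value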

Explaining the Implementation

1. Data Preparation

Inputs:

true: True binary labels (0 or 1).
prob_A: Predicted probabilities from Model A.
prob_B: Predicted probabilities from Model B.

Ground Truth Statistics:

The compute_ground_truth_statistics function checks that the true labels are binary and computes:
order: Indices that sort the true labels in descending order (positives first).
label_1_count: Number of positive samples.

Sorting Probabilities:

sorted_probs: Predicted probabilities of both models sorted according to the true labels (positives first).

2. Midrank Computation

The compute_midrank function calculates the midranks of the predicted probabilities, handling ties appropriately.

Process:

Sorting: Sorts the scores and keeps track of the original indices.
Ranking: Assigns ranks to the scores, averaging ranks for tied values.
Adjustment: Adds 1 to the ranks to start ranking from 1 instead of 0.
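
A small worked example shows what midranks look like when ties are present. The sketch below is a pure-Python restatement of the same idea, written only for illustration; the function name midranks is hypothetical.

```python
def midranks(values):
    """1-based ranks in which tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Advance j to the end of the run of values tied with position i.
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        # Sorted positions i..j-1 (0-based) share the mean 1-based rank.
        shared_rank = 0.5 * (i + j - 1) + 1
        for k in range(i, j):
            ranks[order[k]] = shared_rank
        i = j
    return ranks


# The two tied 0.4 values would occupy ranks 2 and 3, so each gets 2.5.
print(midranks([0.1, 0.4, 0.4, 0.8]))  # [1.0, 2.5, 2.5, 4.0]
```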

3. Fast DeLong Computation

Variables:

m: Number of positive samples.
n: Number of negative samples.
k: Number of models (2 in this case).

Midrank Arrays:

tx: Midranks for positive examples for each model.
ty: Midranks for negative examples for each model.
tz: Midranks for all examples for each model.

Loop Over Models:

For each model (r), compute midranks for positive, negative, and all examples.

4. AUC Calculation

AUC Calculation Formula:

The AUC is calculated using the following formula:

$$ \mathrm{AUC}_r = \frac{1}{mn}\sum_{i=1}^{m} tz_{r,i} - \frac{m+1}{2n} $$

Where:

AUC𝓇: The AUC value for the 𝓇-th model.
𝓂: The number of positive samples.
𝓃: The number of negative samples.
tz𝓇,𝒾: The midrank, among all 𝓂 + 𝓃 samples, of the 𝒾-th positive sample under the 𝓇-th model.

Implementation in Code:

aucs = tz[:, :m].sum(axis=1) / (m * n) - (m + 1.0) / (2.0 * n)

Explanation:

tz[:, :m].sum(axis=1): Calculates the sum of midranks for all positive samples in each model.
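/ (m * n): Divides the sum by 𝓂𝓃, the number of positive-negative pairs.
- (m + 1.0) / (2.0 * n): Subtracts 𝓂(𝓂 + 1)/2 divided by 𝓂𝓃, the part of the rank sum that comes from positives being ranked against each other, so that only comparisons between positives and negatives remain.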
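
The rank-sum formula can be checked against a direct count of correctly ordered pairs. The sketch below uses hypothetical helper names (midrank, auc_from_midranks, auc_from_pairs) and a small made-up dataset that contains a tie.

```python
def midrank(value, values):
    """1-based rank of value within values, averaging over ties."""
    below = sum(1 for v in values if v < value)
    tied = sum(1 for v in values if v == value)
    return below + (tied + 1) / 2


def auc_from_midranks(labels, scores):
    """AUC from the sum of midranks of the positive samples."""
    m = sum(1 for y in labels if y == 1)
    n = len(labels) - m
    rank_sum = sum(midrank(s, scores) for y, s in zip(labels, scores) if y == 1)
    return rank_sum / (m * n) - (m + 1.0) / (2.0 * n)


def auc_from_pairs(labels, scores):
    """AUC by counting positive-negative pairs (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 1, 0, 1, 1, 0]
scores = [0.2, 0.6, 0.6, 0.9, 0.3, 0.1]  # hypothetical scores with one tie
print(auc_from_midranks(labels, scores), auc_from_pairs(labels, scores))
```

Both functions should print the same value, which illustrates why the midrank shortcut avoids the quadratic pairwise loop.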

5. Variance and Covariance Calculation

Z-Score Calculation Formula:

$$ z = \frac{\mathrm{AUC}_1 - \mathrm{AUC}_2}{\sqrt{\mathbf{l}\,\mathrm{Cov}(\mathrm{AUC})\,\mathbf{l}^{\top}}} $$

Where:

AUC₁ and AUC₂: The AUC values of the two models.
𝐥 = [1, −1]: The contrast vector representing the difference between the two models.
Cov(AUC): The covariance matrix of the AUC estimates (delongcov in the code).
𝓏: The standardized Z-score.

Implementation in Code:

l = np.array([[1, -1]])
z = np.diff(aucs) / np.sqrt(np.dot(np.dot(l, delongcov), l.T)).flatten()

Explanation:

np.diff(aucs): Computes AUC₂ − AUC₁. The sign is flipped when the final score is extracted, so the reported value corresponds to AUC₁ − AUC₂.
delongcov: The covariance matrix Cov(AUC).
np.dot(np.dot(l, delongcov), l.T): Computes Var(AUC₁) + Var(AUC₂) − 2 Cov(AUC₁, AUC₂), the variance of the difference.
np.sqrt(...): Takes the square root to obtain the standard error of the difference.
z: The resulting Z-score (before the sign flip).

6. Z-Score and P-Value Computation

Z-Score:

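z_score = -z[0].item(): Extracts the scalar Z-score and flips its sign, because np.diff(aucs) computes AUC₂ − AUC₁ whereas the reported score refers to AUC₁ − AUC₂.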

P-Value:

The two-tailed p-value is calculated using the standard normal distribution:

p_value = scipy.stats.norm.sf(abs(z)) * 2
p_value = p_value[0].item()
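
For small examples, the variance components can also be computed directly from their definition, without the midrank shortcut. The following is a slow, pure-Python sketch intended only for cross-checking by hand; the function name delong_sketch, its return layout, and the six-sample dataset are hypothetical and are not the MLstatkit API.

```python
import math


def delong_sketch(labels, scores_a, scores_b):
    """Return (auc_a, auc_b, z, p) with z signed as model A minus model B.

    Direct O(m * n) computation of DeLong's structural components.
    Needs at least two positives, two negatives, and a non-zero variance.
    """
    def psi(x, y):
        return 1.0 if x > y else (0.5 if x == y else 0.0)

    def components(scores):
        pos = [s for y, s in zip(labels, scores) if y == 1]
        neg = [s for y, s in zip(labels, scores) if y == 0]
        # One value per positive: share of negatives it outranks.
        per_pos = [sum(psi(p, q) for q in neg) / len(neg) for p in pos]
        # One value per negative: share of positives that outrank it.
        per_neg = [sum(psi(p, q) for p in pos) / len(pos) for q in neg]
        return per_pos, per_neg

    def cov(u, v):
        mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
        return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (len(u) - 1)

    def diff_variance(u, v):
        # Variance of (u - v), built from the 2x2 covariance matrix.
        return cov(u, u) + cov(v, v) - 2.0 * cov(u, v)

    pos_a, neg_a = components(scores_a)
    pos_b, neg_b = components(scores_b)
    m, n = len(pos_a), len(neg_a)
    if m < 2 or n < 2:
        raise ValueError("Need at least two samples in each class.")

    auc_a, auc_b = sum(pos_a) / m, sum(pos_b) / m
    variance = diff_variance(pos_a, pos_b) / m + diff_variance(neg_a, neg_b) / n
    if variance <= 0.0:
        raise ValueError("Variance of the AUC difference is zero.")

    z = (auc_a - auc_b) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return auc_a, auc_b, z, p


# Hypothetical toy data: three positives followed by three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
scores_b = [0.7, 0.6, 0.5, 0.65, 0.4, 0.1]
auc_a, auc_b, z, p = delong_sketch(labels, scores_a, scores_b)
print(f"AUC A: {auc_a:.4f}, AUC B: {auc_b:.4f}, z: {z:.4f}, p: {p:.4f}")
```

Because the sign of z is kept, a positive value here means the first model has the higher AUC and a negative value means the second one does.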

Example Usage

true = [0, 1, 0, 1]
prob_A = [0.1, 0.4, 0.35, 0.8]
prob_B = [0.2, 0.3, 0.4, 0.7]

z_score, p_value = Delong_test(true, prob_A, prob_B)
print(f"Z-Score: {z_score}, P-Value: {p_value}")

Output:

Z-Score: 0.8660254037844385, P-Value: 0.3864762307712327

Interpretation:

Z-Score: A positive value indicates that Model A has a higher AUC than Model B. The value of 0.86600 represents the standardized difference between the two AUCs.
P-Value: A p-value of 0.38650 is greater than the typical significance level of 0.05, which means we fail to reject the null hypothesis that the two models have equal AUCs.

Conclusion:

Based on the results of DeLong’s test, although Model A’s AUC is slightly higher than Model B’s, the difference is not statistically significant. Therefore, we cannot conclude that Model A outperforms Model B in terms of AUC.

Introducing MLstatkit

To simplify the process of performing DeLong’s test, we’ve developed MLstatkit, a Python library that provides statistical tools for machine learning evaluation, including an efficient implementation of DeLong’s test.

Installing MLstatkit

Install MLstatkit using pip:

pip install MLstatkit

Using MLstatkit’s DeLong Test

Here’s how to use the Delong_test function from MLstatkit:

from MLstatkit.stats import Delong_test

# Example data
true = [0, 1, 0, 1]
prob_A = [0.1, 0.4, 0.35, 0.8]
prob_B = [0.2, 0.3, 0.4, 0.7]

# Perform DeLong's test
z_score, p_value = Delong_test(true, prob_A, prob_B)
print(f"Z-Score: {z_score}, P-Value: {p_value}")

Output:

Z-Score: 0.8660254037844385, P-Value: 0.3864762307712327

The results are consistent with the previous implementation, demonstrating that MLstatkit provides a reliable and convenient method for performing DeLong’s test.

Advantages of Using MLstatkit

Simplicity: Provides a straightforward interface for performing DeLong’s test.
Efficiency: Optimized for performance with large datasets.
Reliability: Tested and validated against standard statistical methods.

Practical Example: Comparing Two Models

Let’s demonstrate how to use MLstatkit to compare two classifiers on simulated data.

Generating Simulated Data

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Positive and negative class distributions
pos_dist = norm(loc=0.5, scale=1)
neg_dist = norm(loc=-0.5, scale=1)

# Sample sizes
n_pos = 50
n_neg = 50

# True labels
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

# Model predictions
scores_model1 = np.concatenate([pos_dist.rvs(n_pos), neg_dist.rvs(n_neg)])
scores_model2 = np.concatenate([pos_dist.rvs(n_pos), neg_dist.rvs(n_neg)])

Performing DeLong’s Test

from sklearn.metrics import roc_auc_score
from MLstatkit.stats import Delong_test

z_score, p_value = Delong_test(labels, scores_model1, scores_model2)

print(f"Model 1 AUC: {roc_auc_score(labels, scores_model1):.4f}")
print(f"Model 2 AUC: {roc_auc_score(labels, scores_model2):.4f}")
print(f"Z-Score: {z_score:.4f}, P-Value: {p_value:.4f}")

Output:

Model 1 AUC: 0.7180
Model 2 AUC: 0.7440
Z-Score: -0.3426, P-Value: 0.7319

Interpreting the Results

AUC Values: Both models show fair discrimination (AUC ≈ 0.72 and 0.74), with Model 2 slightly ahead of Model 1.
Z-Score: The negative value indicates that Model 1 has a lower AUC than Model 2.
P-Value: The p-value is greater than 0.05, indicating that the difference in AUCs is not statistically significant.

Conclusion:

Based on DeLong’s test, we conclude that there is no statistically significant difference between the performances of the two models.

Conclusion

Comparing ROC curves is crucial when evaluating classifier performance. DeLong’s test offers a statistically rigorous method for determining whether differences in AUCs are significant. Implementing DeLong’s test in Python allows for automated and repeatable analysis.

MLstatkit simplifies this process, providing an accessible and efficient way to perform DeLong’s test and other statistical evaluations in machine learning workflows.

References

DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845. https://doi.org/10.2307/2531595
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010

Additional Resources

MLstatkit Documentation: GitHub Repository

Comparing ROC Curves in Machine Learning Model with DeLong’s Test: A Practical Guide Using Python… was originally published in Statistics in Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.