Mastering ML Metrics: Definitions, Mathematics, and Implementation Part 1

Ebad Sayed
13 min read · Jul 6, 2024


In the realm of machine learning, the efficacy of models is assessed through a diverse set of metrics tailored to specific tasks. These metrics serve as quantitative measures of performance, guiding the evaluation and refinement of ML algorithms across different domains. Whether classifying images, predicting numerical values, grouping data points, ranking items, forecasting time-dependent trends, or analyzing natural language, understanding and applying appropriate metrics is crucial. This article delves into the definitions, mathematical formulations, and practical implementations of metrics for classification, regression, clustering, ranking, time series analysis, and NLP, elucidating their roles in assessing model accuracy, robustness, and applicability in real-world scenarios.

Metrics for Classification Task

1. Accuracy

Accuracy measures the proportion of correctly predicted instances among the total number of instances. Suppose we want to classify the sentiment of sentences into two classes, ‘positive’ and ‘negative’. For this task, accuracy is:

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)

import numpy as np

def accuracy(y_true, y_pred):
    # fraction of predictions that exactly match the true labels
    return (y_true == y_pred).mean()

The same formula applies to multiclass classification as well.
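As a quick sanity check, here is a toy usage of the function above on made-up NumPy label arrays:

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
print(accuracy(y_true, y_pred))  # 4 of the 5 labels match -> 0.8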

2. Precision

Precision measures the proportion of true positive predictions among the positive predicted instances:

Precision = TP / (TP + FP)

def binary_precision(y_true, y_pred):
    tp = ((y_true == 1) & (y_pred == 1)).sum()  # true positives
    fp = ((y_true == 0) & (y_pred == 1)).sum()  # false positives
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def multiclass_precision(y_true, y_pred):
    # macro-averaged precision: compute precision per class, then average
    classes = np.unique(y_true)
    precision_scores = []
    for cls in classes:
        tp = ((y_true == cls) & (y_pred == cls)).sum()
        fp = ((y_true != cls) & (y_pred == cls)).sum()
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        precision_scores.append(precision)
    return np.mean(precision_scores)

3. Recall

Recall, also known as sensitivity, measures the proportion of true positive predictions among the actual positive instances:

Recall = TP / (TP + FN)

def binary_recall(y_true, y_pred):
    tp = ((y_true == 1) & (y_pred == 1)).sum()  # true positives
    fn = ((y_true == 1) & (y_pred == 0)).sum()  # false negatives
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def multiclass_recall(y_true, y_pred):
    # macro-averaged recall: compute recall per class, then average
    classes = np.unique(y_true)
    recall_scores = []
    for cls in classes:
        tp = ((y_true == cls) & (y_pred == cls)).sum()
        fn = ((y_true == cls) & (y_pred != cls)).sum()
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        recall_scores.append(recall)
    return np.mean(recall_scores)

4. F1 Score

F1 Score is the harmonic mean of Precision and Recall, providing a single score that balances both metrics:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

def binary_f1_score(y_true, y_pred):
    prec = binary_precision(y_true, y_pred)
    rec = binary_recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec) if (prec + rec) > 0 else 0.0


def multiclass_f1_score(y_true, y_pred):
    # macro-averaged F1: compute F1 per class, then average
    classes = np.unique(y_true)
    f1_scores = []
    for cls in classes:
        tp = ((y_true == cls) & (y_pred == cls)).sum()
        fp = ((y_true != cls) & (y_pred == cls)).sum()
        fn = ((y_true == cls) & (y_pred != cls)).sum()
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        f1_scores.append(f1)
    return np.mean(f1_scores)

The above four metrics are commonly used in ML because they collectively offer a comprehensive evaluation of a model’s performance with minimal redundancy.
Accuracy measures overall correctness of predictions.
Precision measures accuracy of positive predictions.
Recall measures how many actual positives were predicted correctly.
The F1 Score synthesizes Precision and Recall into a single metric, offering a balanced assessment that considers both types of errors. These metrics are crucial in differentiating model performance across various tasks and datasets, ensuring robust evaluations that are interpretable and relevant to real-world applications.
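To make these definitions concrete, the short sketch below runs the binary versions of all four metrics on one small, made-up set of labels (the values are chosen purely for illustration):

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print(accuracy(y_true, y_pred))          # 6 of 8 labels match -> 0.75
print(binary_precision(y_true, y_pred))  # 3 TP / (3 TP + 1 FP) = 0.75
print(binary_recall(y_true, y_pred))     # 3 TP / (3 TP + 1 FN) = 0.75
print(binary_f1_score(y_true, y_pred))   # harmonic mean of 0.75 and 0.75 = 0.75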

5. AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

AUC-ROC measures the area under the ROC curve, which illustrates the performance of a binary classification model across various thresholds.

TPR = TP / (TP + FN),  FPR = FP / (FP + TN); AUC-ROC is the area under the curve of TPR plotted against FPR over all thresholds.

def calculate_tpr_fpr(y_true, y_scores, thresholds):
    tpr = []
    fpr = []

    P = sum(y_true)      # total number of positive samples
    N = len(y_true) - P  # total number of negative samples

    for threshold in thresholds:
        TP = sum((y_scores >= threshold) & (y_true == 1))
        FP = sum((y_scores >= threshold) & (y_true == 0))

        tpr.append(TP / P)
        fpr.append(FP / N)

    return np.array(tpr), np.array(fpr)


def auc_roc(y_true, y_scores):
    thresholds = np.linspace(0, 1, 100)
    tpr, fpr = calculate_tpr_fpr(y_true, y_scores, thresholds)

    # FPR decreases as the threshold increases, so take the absolute value
    # of the trapezoidal integral to obtain a positive area
    auc = abs(np.trapz(tpr, fpr))
    return auc

AUC-ROC is essential in ML for summarizing a binary classifier’s performance across all possible thresholds. It provides a single metric to compare models, indicating how well the model distinguishes between classes regardless of the decision threshold. Implementation involves plotting the ROC curve using model predictions and calculating the area under it, where higher values signify better model discrimination. This metric is robust in evaluating classifiers, especially in scenarios with imbalanced datasets and varying threshold requirements.
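As a minimal usage sketch, here is the function applied to made-up probability scores; with a finite threshold grid the result is an approximation of the exact area:

y_true = np.array([0, 0, 1, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
print(auc_roc(y_true, y_scores))  # ≈ 0.89: 8 of the 9 positive-negative pairs are ranked correctly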

6. Logarithmic Loss (Log Loss)

Log Loss measures the performance of a classification model where the output is a probability value between 0 and 1.

Log Loss = −(1/N) Σ [yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ)]

def log_loss(y_true, y_prob):
    epsilon = 1e-15  # small value to prevent log(0)
    y_prob = np.clip(y_prob, epsilon, 1 - epsilon)  # clip probabilities to prevent log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

Log Loss is essential for evaluating the accuracy of probabilistic predictions made by a classifier. Unlike accuracy, which only considers the correctness of predictions, Log Loss penalizes models based on the confidence of their predictions. This metric is particularly useful in binary and multi-class classification tasks where probabilities are predicted, providing a continuous measure that quantifies the divergence between predicted probabilities and actual labels.
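A small, made-up example shows how confident correct predictions keep the loss low, while the under-confident prediction for the last positive example raises it:

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.3])
print(log_loss(y_true, y_prob))  # ≈ 0.41, dominated by the -log(0.3) term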

7. Confusion Matrix

A Confusion Matrix tabulates true positive, true negative, false positive, and false negative predictions of a classification model.

For binary classification the matrix has four cells: TN and FP in the row for actual negatives, FN and TP in the row for actual positives.

def confusion_matrix(y_true, y_pred):
    classes = np.unique(np.concatenate([y_true, y_pred]))
    class_to_index = {cls: i for i, cls in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)

    # rows correspond to actual classes, columns to predicted classes
    for actual, predicted in zip(y_true, y_pred):
        matrix[class_to_index[actual], class_to_index[predicted]] += 1

    return matrix

It provides a detailed breakdown of the model’s predictions, allowing us to see not just the overall accuracy but also the types of errors the model makes, such as false positives and false negatives. This detailed insight helps in understanding the strengths and weaknesses of the model and in fine-tuning it for better performance.
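For example, with toy binary labels:

y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]  -> 2 TN, 1 FP, 1 FN, 2 TP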

8. Matthews Correlation Coefficient (MCC)

MCC is a correlation coefficient between observed and predicted binary classifications.

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

def matthews_corrcoef(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    numerator = (tp * tn) - (fp * fn)
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return numerator / denominator if denominator != 0 else 0

It is used for evaluating the quality of binary classifiers, especially when dealing with imbalanced datasets. MCC takes into account true and false positives and negatives and is generally regarded as a balanced measure, providing a more informative and truthful score than accuracy in such scenarios. It is a correlation coefficient between the observed and predicted binary classifications, ranging from -1 (total disagreement) to +1 (perfect prediction).
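A toy example, with the counts spelled out in the comment:

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])
print(matthews_corrcoef(y_true, y_pred))  # (2*4 - 1*1) / sqrt(3*3*5*5) ≈ 0.47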

9. Balanced Accuracy

Balanced Accuracy is the arithmetic mean of sensitivity and specificity, adjusted for class imbalance.

Balanced Accuracy = (Sensitivity + Specificity) / 2, where Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP)

def balanced_accuracy(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    sensitivity = tp / (tp + fn) if (tp + fn) != 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) != 0 else 0

    return (sensitivity + specificity) / 2

It is used for evaluating models on imbalanced datasets. Unlike regular accuracy, it accounts for both sensitivity (true positive rate) and specificity (true negative rate), providing a more equitable measure of performance across different classes.
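A deliberately imbalanced, made-up example shows why this matters: a model that always predicts the majority class looks accurate but achieves a balanced accuracy of only 0.5:

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.zeros(10, dtype=int)          # always predict the majority class
print(accuracy(y_true, y_pred))           # 0.8, deceptively high
print(balanced_accuracy(y_true, y_pred))  # 0.5, reveals that no positives were found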

10. Cohen’s Kappa

Cohen’s Kappa measures inter-rater agreement for categorical items.

κ = (pₒ − pₑ) / (1 − pₑ), where pₒ is the observed agreement and pₑ is the agreement expected by chance

def cohen_kappa(y_true, y_pred):
    # build a confusion matrix over all observed categories (not just binary labels)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    class_to_index = {cls: i for i, cls in enumerate(classes)}
    confusion_matrix = np.zeros((len(classes), len(classes)))
    for t, p in zip(y_true, y_pred):
        confusion_matrix[class_to_index[t], class_to_index[p]] += 1

    total = np.sum(confusion_matrix)
    po = np.trace(confusion_matrix) / total  # observed agreement
    pe = (np.sum(confusion_matrix, axis=0) * np.sum(confusion_matrix, axis=1)).sum() / (total ** 2)  # chance agreement

    kappa = (po - pe) / (1 - pe) if (1 - pe) != 0 else 0
    return kappa

It is used for evaluating the agreement between two raters (or a model and ground truth) while accounting for agreement occurring by chance. It provides a more robust measure than simple accuracy, especially in imbalanced datasets or when agreement by chance is high.
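A made-up example with two raters labeling the same eight items:

rater_a = np.array([0, 1, 1, 0, 1, 0, 1, 1])
rater_b = np.array([0, 1, 0, 0, 1, 0, 1, 0])
print(cohen_kappa(rater_a, rater_b))  # observed agreement 0.75, chance agreement ≈ 0.47, kappa ≈ 0.53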

Metrics for Regression Task

1. Mean Absolute Error (MAE)

MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It is the average of the absolute differences between predicted and actual values.

MAE = (1/n) Σ |yᵢ − ŷᵢ|

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

It provides a straightforward measure of the average magnitude of prediction errors without considering their direction. MAE is intuitive and easy to interpret, making it a widely used metric for assessing model accuracy, especially when you want to understand the typical size of errors in predictions.

2. Mean Squared Error (MSE)

MSE measures the average of the squares of the errors — that is, the average squared difference between the estimated values and the actual value. It gives more weight to larger errors, as they are squared.

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

It penalizes larger errors more heavily than smaller ones, making it useful for highlighting significant prediction errors. This property also makes MSE sensitive to outliers, which is helpful when large errors are particularly undesirable.

3. Root Mean Squared Error (RMSE)

RMSE is the square root of the average of squared differences between prediction and actual observation. It provides a measure of the average magnitude of the error, and it is in the same units as the original data.

RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )

def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

It provides an intuitive measure of the average prediction error magnitude in the same units as the original data. RMSE is particularly useful for understanding the typical size of prediction errors and is sensitive to larger errors, making it a valuable metric for assessing the overall performance of a model.

4. Mean Absolute Percentage Error (MAPE)

MAPE measures the accuracy of a forecasting method by calculating the average absolute percentage difference between the actual values and the predicted values. It expresses accuracy as a percentage.

MAPE = (100/n) Σ |(yᵢ − ŷᵢ) / yᵢ|

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

MAPE is useful for understanding prediction errors in percentage terms, which makes it easy to interpret in practice. Note that it is undefined when any actual value is zero, since the percentage error would require dividing by zero.
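The sketch below exercises MAE, MSE, RMSE, and MAPE on the same small set of made-up values, so the different error scales can be compared side by side:

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print(mean_absolute_error(y_true, y_pred))             # (0.5 + 0 + 1.5 + 1) / 4 = 0.75
print(mean_squared_error(y_true, y_pred))              # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
print(root_mean_squared_error(y_true, y_pred))         # sqrt(0.875) ≈ 0.94
print(mean_absolute_percentage_error(y_true, y_pred))  # ≈ 22.7 %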

5. R-squared (Coefficient of Determination)

R-squared (R²) measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It provides an indication of the goodness of fit of a model.

R² = 1 − SS_res / SS_tot, where SS_res = Σ (yᵢ − ŷᵢ)² and SS_tot = Σ (yᵢ − ȳ)²

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)

R² indicates the proportion of variance explained by the model, useful for assessing the overall fit.

6. Adjusted R-squared

Adjusted R-squared adjusts the R-squared value based on the number of predictors in the model. It accounts for the model complexity by including a penalty for adding more variables that do not improve the model significantly.

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1), where n is the number of samples and k is the number of predictors

def adjusted_r_squared(y_true, y_pred, n, k):
    r2 = r_squared(y_true, y_pred)
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

It provides a more accurate measure of model fit by penalizing the addition of unnecessary predictors, ensuring a more robust model evaluation.
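A toy usage sketch, assuming the model behind y_pred was fitted with two predictors (so k = 2):

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.1, 2.9, 6.6, 4.2])

n, k = len(y_true), 2
print(r_squared(y_true, y_pred))                 # ≈ 0.96
print(adjusted_r_squared(y_true, y_pred, n, k))  # ≈ 0.93, slightly lower due to the penalty for k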

7. Median Absolute Error (MedAE)

MedAE is a robust metric for measuring the average magnitude of errors in a set of predictions. Unlike Mean Absolute Error (MAE), which computes the average of absolute errors, MedAE calculates the median of absolute errors. It is less sensitive to outliers in the data.

MedAE = median(|yᵢ − ŷᵢ|)

def median_absolute_error(y_true, y_pred):
    return np.median(np.abs(y_true - y_pred))

Because it takes the median rather than the mean of the absolute errors, MedAE gives a robust summary of the typical prediction error that is not skewed by a few extreme outliers.
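The contrast with MAE is easiest to see when the made-up data contains one gross outlier:

y_true = np.array([3.0, 5.0, 2.5, 7.0, 100.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0, 40.0])   # the last prediction is badly off

print(mean_absolute_error(y_true, y_pred))    # 12.6, dominated by the outlier
print(median_absolute_error(y_true, y_pred))  # 1.0, unaffected by it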

Metrics for Clustering Task

1. Silhouette Score

The Silhouette Score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to +1, where a higher value indicates that clusters are well-separated, and objects are well-matched to their own cluster.

s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance from point i to the points of the nearest other cluster; the score is the mean of s(i) over all points.

from sklearn.metrics import pairwise_distances

def silhouette_score(X, labels):
    n = len(X)
    unique_labels = np.unique(labels)
    a = np.zeros(n)
    b = np.zeros(n)

    for i in range(n):
        own_label = labels[i]
        own_cluster = X[labels == own_label]

        # mean distance to the other points in the same cluster (excluding the point itself)
        dists_own = pairwise_distances(X[i].reshape(1, -1), own_cluster).ravel()
        a[i] = dists_own.sum() / max(len(own_cluster) - 1, 1)

        # smallest mean distance to any other cluster
        min_dist = np.inf
        for other_label in unique_labels:
            if other_label == own_label:
                continue
            other_cluster = X[labels == other_label]
            dist = np.mean(pairwise_distances(X[i].reshape(1, -1), other_cluster))
            min_dist = min(min_dist, dist)
        b[i] = min_dist

    sil_scores = (b - a) / np.maximum(a, b)
    return np.mean(sil_scores)

It provides insights into the separation and compactness of clusters in your data.
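A quick sketch using scikit-learn's make_blobs to generate clearly separated synthetic clusters (the dataset is illustrative only):

from sklearn.datasets import make_blobs

X, labels = make_blobs(n_samples=60, centers=3, random_state=42)
print(silhouette_score(X, labels))  # well-separated blobs give a high score, close to 1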

2. Davies-Bouldin Index

DBI is a metric used to evaluate the quality of clustering in unsupervised learning. It measures the average similarity between each cluster and its most similar cluster, taking into account both the scatter (variance) within clusters and the separation between clusters. A lower DBI indicates better clustering.

DB = (1/k) Σᵢ maxⱼ≠ᵢ (σᵢ + σⱼ) / d(cᵢ, cⱼ), where σᵢ is the average distance of the points in cluster i to its centroid cᵢ and d(cᵢ, cⱼ) is the distance between the centroids of clusters i and j

def davies_bouldin_index(X, labels):
    unique_labels = np.unique(labels)
    n_clusters = len(unique_labels)
    cluster_centers = np.array([np.mean(X[labels == lbl], axis=0) for lbl in unique_labels])
    sigma = np.zeros(n_clusters)
    R = np.zeros((n_clusters, n_clusters))

    # average distance of each cluster's points to its own centroid
    for i, lbl in enumerate(unique_labels):
        cluster_points = X[labels == lbl]
        sigma[i] = np.mean(pairwise_distances(cluster_points, [cluster_centers[i]]))

    # pairwise similarity between clusters: scatter relative to centroid separation
    for i in range(n_clusters):
        for j in range(n_clusters):
            if i != j:
                distance_ij = np.linalg.norm(cluster_centers[i] - cluster_centers[j])
                R[i, j] = (sigma[i] + sigma[j]) / distance_ij

    # average, over clusters, of the worst-case similarity
    db_index = np.mean(np.max(R, axis=1))

    return db_index

A lower index indicates that clusters are more compact and better separated.
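Reusing the synthetic blobs from the silhouette example above:

print(davies_bouldin_index(X, labels))  # lower is better; well-separated blobs give a small value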

3. Calinski-Harabasz Index (Variance Ratio Criterion)

The Calinski-Harabasz Index measures the ratio of the sum of between-cluster dispersion to within-cluster dispersion. It provides a measure of cluster compactness and separation.

CH = (SS_B / (k − 1)) / (SS_W / (n − k)), where SS_B is the between-cluster sum of squared distances of the (size-weighted) cluster centroids to the overall centroid and SS_W is the within-cluster sum of squared distances of points to their own centroid

def calinski_harabasz_index(X, labels):
    unique_labels = np.unique(labels)
    k = len(unique_labels)
    n = len(X)
    overall_centroid = np.mean(X, axis=0)

    ss_within = 0.0   # within-cluster sum of squared distances
    ss_between = 0.0  # between-cluster sum of squared distances, weighted by cluster size
    for lbl in unique_labels:
        cluster_points = X[labels == lbl]
        centroid = np.mean(cluster_points, axis=0)
        ss_within += np.sum((cluster_points - centroid) ** 2)
        ss_between += len(cluster_points) * np.sum((centroid - overall_centroid) ** 2)

    ch_index = (ss_between / (k - 1)) / (ss_within / (n - k))
    return ch_index

Higher values indicate denser and better-separated clusters.
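Again reusing the synthetic blobs from the silhouette example:

print(calinski_harabasz_index(X, labels))  # higher is better; compact, well-separated clusters give a large value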

4. Adjusted Rand Index (ARI)

The ARI is a measure of the similarity between two data clusterings. It adjusts the Rand Index to account for the chance grouping of elements, providing a more accurate evaluation of clustering performance. The ARI ranges from -1 to 1:
An ARI of 1 indicates perfect agreement between the two clusterings.
An ARI of 0 indicates that the clustering is random.
An ARI less than 0 indicates that the clustering is worse than random.

ARI = (Index − Expected Index) / (Max Index − Expected Index), where Index = Σᵢⱼ C(nᵢⱼ, 2), Expected Index = [Σᵢ C(aᵢ, 2) · Σⱼ C(bⱼ, 2)] / C(n, 2), and Max Index = ½ [Σᵢ C(aᵢ, 2) + Σⱼ C(bⱼ, 2)]

from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    # build the contingency table between the two labelings
    classes_true = np.unique(labels_true)
    classes_pred = np.unique(labels_pred)
    contingency = np.zeros((len(classes_true), len(classes_pred)), dtype=int)
    for i, label_true in enumerate(classes_true):
        for j, label_pred in enumerate(classes_pred):
            contingency[i, j] = np.sum((labels_true == label_true) & (labels_pred == label_pred))

    # row sums, column sums, and total
    a = np.sum(contingency, axis=1)
    b = np.sum(contingency, axis=0)
    n = np.sum(contingency)

    # pair counts used in the ARI formula
    sum_comb_nij = sum(comb(int(nij), 2) for nij in contingency.flatten())
    sum_comb_a = sum(comb(int(ai), 2) for ai in a)
    sum_comb_b = sum(comb(int(bj), 2) for bj in b)
    comb_n = comb(int(n), 2)

    expected_index = (sum_comb_a * sum_comb_b) / comb_n
    max_index = (sum_comb_a + sum_comb_b) / 2

    if max_index == expected_index:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_comb_nij - expected_index) / (max_index - expected_index)

It provides insights into the similarity between different clustering solutions.
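A toy comparison of two labelings of the same six points:

labels_true = np.array([0, 0, 0, 1, 1, 1])
labels_pred = np.array([1, 1, 0, 0, 0, 0])
print(adjusted_rand_index(labels_true, labels_true))  # identical labelings -> 1.0
print(adjusted_rand_index(labels_true, labels_pred))  # partial agreement -> ≈ 0.32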

5. Homogeneity, Completeness, and V-measure

Homogeneity measures whether each cluster contains only members of a single class. A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
Completeness measures whether all members of a given class are assigned to the same cluster. A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
V-measure is the harmonic mean of homogeneity and completeness. It provides a balanced assessment by combining both metrics.

Homogeneity h = 1 − H(C|K) / H(C), Completeness c = 1 − H(K|C) / H(K), V-measure = 2·h·c / (h + c), where C denotes the true class labels and K the predicted cluster labels

def homogeneity_completeness_vmeasure(labels_true, labels_pred):
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        probs = counts / counts.sum()
        return -np.sum(probs * np.log(probs))

    def conditional_entropy(labels_a, labels_b):
        # H(A | B): entropy of labels_a within each group defined by labels_b
        cond_entropy = 0.0
        for c in np.unique(labels_b):
            sub_labels_a = labels_a[labels_b == c]
            cond_entropy += (len(sub_labels_a) / len(labels_a)) * entropy(sub_labels_a)
        return cond_entropy

    H_classes = entropy(labels_true)                                           # H(C)
    H_clusters = entropy(labels_pred)                                          # H(K)
    H_classes_given_clusters = conditional_entropy(labels_true, labels_pred)   # H(C|K)
    H_clusters_given_classes = conditional_entropy(labels_pred, labels_true)   # H(K|C)

    homogeneity = 1 - H_classes_given_clusters / H_classes if H_classes != 0 else 1.0
    completeness = 1 - H_clusters_given_classes / H_clusters if H_clusters != 0 else 1.0
    v_measure = (2 * homogeneity * completeness / (homogeneity + completeness)
                 if (homogeneity + completeness) != 0 else 0.0)

    return homogeneity, completeness, v_measure

This gives insights into the quality and structure of the clusters.
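A made-up example in which one true class is split across two clusters:

labels_true = np.array([0, 0, 0, 1, 1, 1])
labels_pred = np.array([0, 0, 1, 1, 2, 2])
h, c, v = homogeneity_completeness_vmeasure(labels_true, labels_pred)
print(h, c, v)  # splitting classes across clusters hurts completeness more than homogeneity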

Next Article: Mastering ML Metrics Part 2
