Mastering ML Metrics: Definitions, Mathematics, and Implementation Part 2

Ebad Sayed
13 min read · Jul 6, 2024


In the previous article we covered the metrics used for classification, regression, and clustering tasks. In this article we will look at ranking, time-series, and NLP metrics.

Previous Article: Mastering ML Metrics Part 1

Ranking Metrics

Ranking metrics are used to evaluate the performance of algorithms that produce a ranked list of items, such as search engines or recommendation systems. These metrics help measure how well the ranking matches the relevance of the items to the user or query, ensuring that the most relevant items appear at the top.

1. Mean Average Precision (MAP)

MAP is a metric used to evaluate the performance of ranking algorithms. It is the mean of the average precision scores over all queries. Average precision (AP) for a single query is the average of the precision values computed at each position in the ranked list where a relevant item appears.

MAP = (1/|Q|) · Σ_q AP(q), where AP = (1/R) · Σ_k P@k · rel(k), R is the number of relevant items for the query and rel(k) is 1 if the item at rank k is relevant, else 0.

import numpy as np

def precision_at_k(r, k):
    # Precision over the top-k entries of a binary relevance list
    assert k >= 1
    r = np.asarray(r)[:k]
    return np.mean(r)

def average_precision(r):
    # Average of P@k taken at every rank k that holds a relevant item
    r = np.asarray(r)
    out = [precision_at_k(r, k + 1) for k in range(len(r)) if r[k]]
    if not out:
        return 0.
    return np.mean(out)

def mean_average_precision(rs):
    # MAP: mean of the average precision over all queries
    return np.mean([average_precision(r) for r in rs])

It provides insights into the accuracy and relevance of the ranked results for multiple queries.
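As a quick, illustrative check (the relevance lists below are made up), the functions can be exercised like this:

# Hypothetical binary relevance lists for two queries (1 = relevant, 0 = not relevant)
rs = [
    [1, 0, 1, 0, 0],   # query 1: relevant results at ranks 1 and 3
    [0, 1, 0, 0, 1],   # query 2: relevant results at ranks 2 and 5
]

print(mean_average_precision(rs))   # (0.8333 + 0.45) / 2 ≈ 0.6417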

2. Normalized Discounted Cumulative Gain (NDCG)

NDCG is a metric used to evaluate the performance of ranking algorithms, particularly in information retrieval. It measures the usefulness, or gain, of a document based on its position in the result list, discounting the gain logarithmically as the position increases.

DCG@k = Σ_{i=1..k} (2^rel_i − 1) / log2(i + 1), and NDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG of the ideally ordered list.

def dcg_at_k(r, k):
    # Discounted cumulative gain over the top-k graded relevance scores
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        return np.sum((2**r - 1) / np.log2(np.arange(1, r.size + 1) + 1))
    return 0.0

def ndcg_at_k(r, k):
    # Normalize DCG by the DCG of the ideally ordered list
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    if not dcg_max:
        return 0.0
    return dcg_at_k(r, k) / dcg_max

def mean_ndcg(rs, k):
    return np.mean([ndcg_at_k(r, k) for r in rs])

It provides insights into the relevance and usefulness of the ranked results for multiple queries.
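For example, with made-up graded relevance scores (higher means more relevant) for two queries:

# Hypothetical graded relevance scores for two queries
rs = [
    [3, 2, 3, 0, 1],
    [2, 0, 1, 0, 0],
]

print(mean_ndcg(rs, k=5))   # ≈ 0.96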

3. Mean Reciprocal Rank (MRR)

MRR is a metric used to evaluate the performance of a ranking algorithm, particularly in information retrieval and question answering systems. MRR is the average of the reciprocal ranks of the first relevant item for a set of queries. The reciprocal rank is the multiplicative inverse of the rank of the first relevant item.

MRR = (1/|Q|) · Σ_q 1/rank_q, where rank_q is the position of the first relevant item for query q.

def reciprocal_rank(relevance_scores):
    for i, score in enumerate(relevance_scores):
        if score == 1:
            return 1 / (i + 1)
    return 0.0

def mean_reciprocal_rank(relevance_scores_list):
    rr_scores = [reciprocal_rank(scores) for scores in relevance_scores_list]
    return sum(rr_scores) / len(rr_scores)

It provides insights into the effectiveness of the ranking algorithm in retrieving the first relevant item across multiple queries.
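A small illustrative example, where the first relevant item appears at ranks 1, 3 and 2 for the three queries:

relevance_lists = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]

print(mean_reciprocal_rank(relevance_lists))   # (1 + 1/3 + 1/2) / 3 ≈ 0.611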

4. Precision at K

Precision at K (P@K) is a metric used to evaluate the performance of a ranking algorithm, specifically measuring the precision of the top K items in the ranked list. It is defined as the proportion of relevant items among the top K items returned by the algorithm.

P@K = (number of relevant items in the top K results) / K

def precision_at_k(relevance_scores, k):
    relevance_scores = relevance_scores[:k]  # Consider only the top K items
    return sum(relevance_scores) / k

It provides insights into how accurately the algorithm ranks the top K items.
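For instance, with an illustrative ranked list whose relevant items sit at ranks 1, 3 and 4:

relevance = [1, 0, 1, 1, 0, 0]

print(precision_at_k(relevance, k=3))   # 2/3 ≈ 0.667
print(precision_at_k(relevance, k=5))   # 3/5 = 0.6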

5. F-beta Score

The F-beta score combines precision and recall into a single metric, weighted by the parameter β. It is useful when there is an uneven class distribution and either precision or recall is more important to optimize. β sets the relative weight of recall: β > 1 weights recall more heavily than precision, β < 1 favours precision, and β = 1 gives the harmonic mean of the two, commonly known as the F1 score.

F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)

def f_beta_score(y_true, y_pred, beta):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    beta_squared = beta ** 2
    f_beta = (1 + beta_squared) * (precision * recall) / (beta_squared * precision + recall) if (beta_squared * precision + recall) > 0 else 0.0

    return f_beta
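A quick illustration with made-up labels shows how β shifts the balance between precision and recall (here precision is 2/3 and recall is 1/2):

y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1])

print(f_beta_score(y_true, y_pred, beta=0.5))   # ≈ 0.625 (favours precision)
print(f_beta_score(y_true, y_pred, beta=1))     # ≈ 0.571 (F1)
print(f_beta_score(y_true, y_pred, beta=2))     # ≈ 0.526 (favours recall)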

6. Hamming Loss

The Hamming Loss measures the fraction of labels that are incorrectly predicted. It is used for multi-label classification problems, where each instance can be associated with multiple labels, and is averaged over all instances and labels.

Hamming Loss = (1 / (N · L)) · Σ_i Σ_j 1[y_ij ≠ ŷ_ij], over N instances and L labels.

def hamming_loss(y_true, y_pred):
    N = len(y_true)
    L = len(y_true[0])
    hamming_loss = 0.0

    for i in range(N):
        for j in range(L):
            if y_true[i][j] != y_pred[i][j]:
                hamming_loss += 1

    hamming_loss /= (N * L)
    return hamming_loss
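For example, with three made-up instances and four labels each:

y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]

print(hamming_loss(y_true, y_pred))   # 3 wrong labels out of 12 = 0.25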

Time-Series Metrics

Time-series metrics are crucial for evaluating predictive models on time-dependent data. They capture performance nuances like trend accuracy, forecasting errors, and seasonality adjustments, which traditional metrics might overlook in static datasets.

1. Mean Absolute Scaled Error (MASE)

MASE measures the relative accuracy of a forecasting method by comparing its forecast errors to the forecast errors of a naive method (often the seasonal naive method). It is robust to scaling and measures accuracy relative to a baseline.

MASE = MAE_model / MAE_naive, the mean absolute error of the model divided by the mean absolute error of the naive forecast.

def mean_absolute_scaled_error(y_true, y_pred, y_naive):
    # Calculate forecast errors
    forecast_errors = [abs(y_true[i] - y_pred[i]) for i in range(len(y_true))]

    # Calculate mean absolute forecast error of the model
    mean_absolute_error = sum(forecast_errors) / len(y_true)

    # Calculate mean absolute error of the naive method
    naive_errors = [abs(y_true[i] - y_naive[i]) for i in range(1, len(y_true))]  # exclude the first value

    mean_naive_error = sum(naive_errors) / (len(y_true) - 1)

    # Calculate MASE
    mase = mean_absolute_error / mean_naive_error

    return mase

This metric is valuable because it provides a standardized way to compare the accuracy of different forecasting models across different time-series datasets, taking into account the nature of the data and the forecasting horizon.
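As an illustration, with a short made-up series and a naive forecast that simply repeats the previous observation:

y_true  = [112, 118, 132, 129, 121]
y_pred  = [110, 120, 128, 131, 123]
y_naive = [112, 112, 118, 132, 129]   # y_naive[t] = y_true[t-1]

print(mean_absolute_scaled_error(y_true, y_pred, y_naive))   # ≈ 0.31, i.e. better than the naive forecast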

2. Symmetric Mean Absolute Percentage Error (SMAPE)

SMAPE measures the accuracy of forecasts relative to the magnitude of the actual and predicted values. Because each error is normalized by the average magnitude of the actual and predicted value, it treats small and large values on an equal footing and gives a balanced view of forecast accuracy.

SMAPE = (100/n) · Σ_t |y_t − ŷ_t| / ((|y_t| + |ŷ_t|) / 2)

def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    assert len(y_true) == len(y_pred), "Length of y_true and y_pred must be the same."

    n = len(y_true)
    total = 0.0

    for i in range(n):
        true_val = y_true[i]
        pred_val = y_pred[i]

        # Normalize each absolute error by the mean magnitude of actual and predicted
        denominator = (abs(true_val) + abs(pred_val)) / 2
        if denominator > 0:
            total += abs(true_val - pred_val) / denominator

    smape = (total / n) * 100.0

    return smape
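A quick sanity check with made-up values:

y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 370]

print(symmetric_mean_absolute_percentage_error(y_true, y_pred))   # ≈ 8.0 (percent)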

3. Mean Squared Logarithmic Error (MSLE)

MSLE measures the average squared difference between the natural logarithm of (1 + predicted value) and the natural logarithm of (1 + true value). It penalizes underestimations more than overestimations and emphasizes relative rather than absolute errors.

MSLE = (1/n) · Σ_i (log(1 + y_i) − log(1 + ŷ_i))²

def mean_squared_logarithmic_error(y_true, y_pred):
    assert len(y_true) == len(y_pred), "Length of y_true and y_pred must be the same."

    n = len(y_true)
    sum_sq_log_diff = 0.0

    for i in range(n):
        true_val = y_true[i]
        pred_val = y_pred[i]

        log_diff = np.log(true_val + 1) - np.log(pred_val + 1)
        sum_sq_log_diff += (log_diff ** 2)

    msle = sum_sq_log_diff / n

    return msle
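For instance, with made-up values spanning different scales (MSLE cares about relative rather than absolute error):

y_true = [3, 10, 50, 200]
y_pred = [2, 12, 40, 250]

print(mean_squared_logarithmic_error(y_true, y_pred))   # ≈ 0.05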

4. Dynamic Time Warping (DTW)

DTW measures the similarity between two sequences by finding the optimal alignment that minimizes the cumulative distance between corresponding points in the sequences, allowing for local variations in the alignment.

D(i, j) = |a_i − b_j| + min(D(i−1, j), D(i, j−1), D(i−1, j−1)), with DTW(A, B) = D(n_A, n_B).

def dtw_distance(A, B):
    # Length of sequences A and B
    n_A = len(A)
    n_B = len(B)

    # Compute the distance matrix
    D = np.zeros((n_A, n_B))

    # Initialize the first row and first column of the distance matrix
    D[0, 0] = np.abs(A[0] - B[0])
    for i in range(1, n_A):
        D[i, 0] = D[i-1, 0] + np.abs(A[i] - B[0])
    for j in range(1, n_B):
        D[0, j] = D[0, j-1] + np.abs(A[0] - B[j])

    # Fill the rest of the distance matrix
    for i in range(1, n_A):
        for j in range(1, n_B):
            cost = np.abs(A[i] - B[j])
            D[i, j] = cost + min(D[i-1, j], D[i, j-1], D[i-1, j-1])

    # Return the DTW distance (bottom-right corner of the matrix)
    dtw_dist = D[n_A-1, n_B-1]

    return dtw_dist
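As a small illustration, a sequence compared with a slightly shifted copy of itself aligns perfectly:

A = [1, 2, 3, 4, 2, 0]
B = [1, 1, 2, 3, 4, 2, 0]   # same shape as A, with the first point repeated

print(dtw_distance(A, B))   # 0.0 — the warping absorbs the repeated leading value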

5. Dimensionality Reduction Metrics

Dimensionality reduction metrics assess the quality of reduced-dimensional representations of data compared to the original data. Two common metrics are the Variance Explained Ratio (VER) and the Reconstruction Error (RE). VER measures the proportion of variance in the original data that is retained in the reduced-dimensional representation and is often used with Principal Component Analysis (PCA) and related techniques. RE measures the average squared difference between the original data and its reconstruction from the reduced representation.

VER = Var(X_reduced) / Var(X), the share of the original variance retained; RE = (1/n) · ‖X − X_reconstructed‖², the mean squared reconstruction error.

def variance_explained_ratio(X, X_reduced):
    # Sum the per-feature variances so that representations with different
    # numbers of dimensions are compared on the same footing
    var_total = np.var(X, axis=0).sum()
    var_reduced = np.var(X_reduced, axis=0).sum()
    ver = var_reduced / var_total
    return ver


def reconstruction_error(X, X_reconstructed):
    n = X.shape[0]
    squared_diff = np.sum((X - X_reconstructed) ** 2)
    re = squared_diff / n
    return re
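To see both metrics in action, here is an illustrative sketch that assumes scikit-learn is available for the PCA step:

from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)

print(variance_explained_ratio(X, X_reduced))    # proportion of variance kept by 2 components
print(reconstruction_error(X, X_reconstructed))  # mean squared reconstruction error per sample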

NLP Metrics

1. BLEU (Bilingual Evaluation Understudy)

BLEU is a metric used to evaluate the quality of machine-translated text relative to one or more reference translations. It compares the n-grams (contiguous sequences of n words) of the candidate translation with those of the references, computes a clipped (modified) precision for each n-gram order, and combines these precisions with a brevity penalty that accounts for candidates that are shorter than the references.

BLEU = BP · exp(Σ_{n=1..N} (1/N) · log p_n), where p_n is the clipped n-gram precision and BP = 1 if c > r, else exp(1 − r/c), for candidate length c and closest reference length r.

import collections
import math

def calculate_bleu(candidate, references, max_n=4):
    def ngrams(tokens, n):
        ngram_counts = collections.Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
        return ngram_counts

    def clipped_precision(candidate_ngrams, reference_ngrams_list):
        # Clip each candidate n-gram count by its maximum count over all references
        clipped_counts = {ngram: min(count, max(ref[ngram] for ref in reference_ngrams_list))
                          for ngram, count in candidate_ngrams.items()}
        return sum(clipped_counts.values()) / max(1, sum(candidate_ngrams.values()))

    candidate_ngrams = {n: ngrams(candidate, n) for n in range(1, max_n + 1)}
    reference_ngrams = [{n: ngrams(ref, n) for n in range(1, max_n + 1)} for ref in references]

    precision_scores = []
    for n in range(1, max_n + 1):
        candidate_ngram = candidate_ngrams[n]
        reference_ngram = [ref_ngram[n] for ref_ngram in reference_ngrams]

        if sum(candidate_ngram.values()) == 0:
            precision_scores.append(0.0)
        else:
            precision_scores.append(clipped_precision(candidate_ngram, reference_ngram))

    # If any modified precision is zero, the geometric mean (and hence BLEU) is zero
    if min(precision_scores) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precision_scores) / max_n)

    # Brevity penalty, based on the reference length closest to the candidate length
    candidate_length = len(candidate)
    closest_ref_length = min((len(ref) for ref in references),
                             key=lambda ref_len: (abs(ref_len - candidate_length), ref_len))

    if candidate_length > closest_ref_length:
        brevity_penalty = 1
    else:
        brevity_penalty = math.exp(1 - closest_ref_length / candidate_length)

    bleu_score = brevity_penalty * geo_mean

    return bleu_score


candidate = ["the", "cat", "is", "on", "the", "mat"]
references = [
    ["the", "cat", "is", "on", "the", "rug"],
    ["there", "is", "a", "cat", "on", "the", "mat"]
]

bleu = calculate_bleu(candidate, references)
print(f"BLEU score: {bleu:.4f}")

2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is a set of metrics used to evaluate the quality of summaries by comparing them to reference summaries. It focuses on recall, measuring how much of the reference summary is captured by the candidate summary. It measures the overlap between the candidate summary and reference summaries based on n-grams and longest common subsequences (LCS).

ROUGE-N = (Σ matched n-grams) / (Σ n-grams in the references); ROUGE-L = LCS(candidate, reference) / (length of the reference).

def rouge_n(candidate, references, n=1):
    def ngrams(tokens, n):
        ngram_counts = collections.Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
        return ngram_counts

    candidate_ngrams = ngrams(candidate, n)
    reference_ngrams = [ngrams(ref, n) for ref in references]

    total_matched_ngrams = sum(min(candidate_ngrams[ngram], max(ref_ngrams[ngram] for ref_ngrams in reference_ngrams)) for ngram in candidate_ngrams)
    total_reference_ngrams = sum(sum(ref_ngrams.values()) for ref_ngrams in reference_ngrams)

    rouge_score = total_matched_ngrams / total_reference_ngrams if total_reference_ngrams > 0 else 0.0
    return rouge_score


def rouge_l(candidate, references):
    def lcs_length(tokens1, tokens2):
        dp = [[0] * (len(tokens2) + 1) for _ in range(len(tokens1) + 1)]

        for i in range(1, len(tokens1) + 1):
            for j in range(1, len(tokens2) + 1):
                if tokens1[i - 1] == tokens2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

        return dp[len(tokens1)][len(tokens2)]

    total_lcs = sum(lcs_length(candidate, ref) for ref in references)
    total_reference_length = sum(len(ref) for ref in references)

    rouge_score = total_lcs / total_reference_length if total_reference_length > 0 else 0.0
    return rouge_score


candidate_summary = ["the", "cat", "is", "on", "the", "mat"]
reference_summaries = [
    ["the", "cat", "is", "on", "the", "rug"],
    ["there", "is", "a", "cat", "on", "the", "mat"]
]

# Calculate ROUGE-N (unigrams)
rouge_n_score = rouge_n(candidate_summary, reference_summaries, n=1)
print(f"ROUGE-N score: {rouge_n_score:.4f}")

# Calculate ROUGE-L
rouge_l_score = rouge_l(candidate_summary, reference_summaries)
print(f"ROUGE-L score: {rouge_l_score:.4f}")

3. METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR is a metric used for evaluating the quality of machine translation outputs. It computes a weighted harmonic mean of precision and recall based on exact and stemmed word matches, and the full metric additionally applies a penalty for word-order differences. The implementation below is a simplified, stem-based version that omits the word-order penalty.

Simplified score used below: F_α = (P · R) / (α · P + (1 − α) · R), where P and R are stem-level precision and recall.

import nltk
from nltk.stem import PorterStemmer

def calculate_meteor(candidate, references, alpha=0.5):
    stemmer = PorterStemmer()

    def calculate_precision(candidate_tokens, reference_tokens):
        candidate_stems = set(stemmer.stem(token) for token in candidate_tokens)
        reference_stems = set(stemmer.stem(token) for token in reference_tokens)

        common_stems = candidate_stems.intersection(reference_stems)
        precision = len(common_stems) / len(candidate_stems) if len(candidate_stems) > 0 else 0.0
        return precision

    def calculate_recall(candidate_tokens, reference_tokens):
        candidate_stems = set(stemmer.stem(token) for token in candidate_tokens)
        reference_stems = set(stemmer.stem(token) for token in reference_tokens)

        common_stems = candidate_stems.intersection(reference_stems)
        recall = len(common_stems) / len(reference_stems) if len(reference_stems) > 0 else 0.0
        return recall

    # word_tokenize requires NLTK's 'punkt' tokenizer data (e.g. nltk.download('punkt'))
    candidate_tokens = nltk.word_tokenize(candidate.lower())
    reference_tokens_list = [nltk.word_tokenize(ref.lower()) for ref in references]

    # Score the candidate against each reference separately and keep the best match
    best_score = 0.0
    for reference_tokens in reference_tokens_list:
        precision = calculate_precision(candidate_tokens, reference_tokens)
        recall = calculate_recall(candidate_tokens, reference_tokens)

        if precision + recall > 0:
            score = (precision * recall) / ((1 - alpha) * recall + alpha * precision)
        else:
            score = 0.0
        best_score = max(best_score, score)

    return best_score


candidate_translation = "the cat is on the mat"
reference_translations = [
    "the cat is lying on the mat",
    "there is a cat on the mat"
]

score = calculate_meteor(candidate_translation, reference_translations)
print(f"METEOR score: {score:.4f}")

4. NIST (National Institute of Standards and Technology)

NIST is a machine translation metric that, like BLEU, compares n-gram matches between the candidate and reference translations across several n-gram orders. Unlike BLEU, it weights matches by how informative each n-gram is, so rarer n-grams contribute more to the score, and it normalizes the result based on the reference translations. The implementation below is a simplified, precision-based approximation of this idea.

def calculate_nist(candidate, references, max_n=4):
    def ngrams(tokens, n):
        ngram_counts = collections.Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
        return ngram_counts

    candidate_ngrams = {n: ngrams(candidate, n) for n in range(1, max_n + 1)}
    reference_ngrams = [{n: ngrams(ref, n) for n in range(1, max_n + 1)} for ref in references]

    precisions = []
    for n in range(1, max_n):
        candidate_ngram = candidate_ngrams[n]
        next_candidate_ngram = candidate_ngrams[n + 1]

        # Clip candidate n-gram counts by their maximum count over the references
        total_matched_ngrams = sum(min(candidate_ngram[ngram], max(ref_ngrams[n][ngram] for ref_ngrams in reference_ngrams)) for ngram in candidate_ngram)
        next_total_matched_ngrams = sum(min(next_candidate_ngram[ngram], max(ref_ngrams[n + 1][ngram] for ref_ngrams in reference_ngrams)) for ngram in next_candidate_ngram)

        precision_n = total_matched_ngrams / max(1, sum(candidate_ngram.values()))
        precision_next = next_total_matched_ngrams / max(1, sum(next_candidate_ngram.values()))

        if precision_n > 0 and precision_next > 0:
            precisions.append(math.log(precision_n / precision_next))

    if len(precisions) > 0:
        nist_score = math.exp(sum(precisions) / (max_n - 1))
    else:
        nist_score = 0.0

    return nist_score


candidate_translation = ["the", "cat", "is", "on", "the", "mat"]
reference_translations = [
    ["the", "cat", "is", "lying", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"]
]

nist_score = calculate_nist(candidate_translation, reference_translations)
print(f"NIST score: {nist_score:.4f}")

5. Word Error Rate (WER) & Character Error Rate (CER)

WER measures the number of errors (substitutions, deletions, and insertions) made by an automatic speech recognition or text generation system compared to a reference text, normalized by the number of words in the reference. CER is the same measure computed at the character level, normalized by the number of characters in the reference.

WER = (S + D + I) / N_words and CER = (S + D + I) / N_chars, where S, D, and I are the substitutions, deletions, and insertions needed to turn the candidate into the reference.

def calculate_wer(candidate, reference):
    # Word Error Rate: edit distance over words, normalized by the reference length in words
    reference = reference.split()
    candidate = candidate.split()

    n = len(reference)
    m = len(candidate)

    dp = [[0] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        dp[i][0] = i

    for j in range(1, m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + 1)

    wer = dp[n][m] / n
    return wer


def calculate_cer(candidate, reference):
    # Character Error Rate: edit distance over characters, normalized by the reference length in characters
    n = len(reference)
    m = len(candidate)

    dp = [[0] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        dp[i][0] = i

    for j in range(1, m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + 1)

    cer = dp[n][m] / n
    return cer


candidate_text = "the cat is on the mat"
reference_text = "the cat is lying on the mat"

wer_score = calculate_wer(candidate_text, reference_text)
print(f"WER score: {wer_score:.4f}")

cer_score = calculate_cer(candidate_text, reference_text)
print(f"CER score: {cer_score:.4f}")

6. Cosine Similarity

Cosine Similarity measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them: 1 means the vectors point in the same direction, 0 means they are orthogonal, and −1 means they are diametrically opposed. In NLP it is commonly applied to term-frequency or embedding vectors.

cos(θ) = (A · B) / (‖A‖ · ‖B‖)

def cosine_similarity(vector1, vector2):
    dot_product = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a ** 2 for a in vector1))
    norm2 = math.sqrt(sum(b ** 2 for b in vector2))

    similarity = dot_product / (norm1 * norm2)

    return similarity
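For example, with two made-up bag-of-words count vectors:

v1 = [1, 1, 0, 2]
v2 = [1, 0, 1, 2]

print(cosine_similarity(v1, v2))   # 5 / 6 ≈ 0.833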

Summary

In this article and in the previous one, we explored fundamental evaluation metrics used in machine learning. Each metric provides a mathematical framework to assess model performance objectively. By understanding these metrics and their implementations in Python, practitioners gain valuable tools to optimize and interpret machine learning models effectively.


Ebad Sayed

I am currently a final year undergraduate at IIT Dhanbad, looking to help out aspiring AI/ML enthusiasts with easy AI/ML guides.