Different Loss Methods for Recommender Systems

Mandeep Singh
Jul 4, 2024

Image credit: https://realpython.com/build-recommendation-engine-collaborative-filtering/

What is Loss in Machine Learning?

Loss methods in machine learning are ways to measure how wrong a model’s predictions are. They act like a report card, giving lower scores (losses) for better predictions and higher scores for worse ones. These methods guide the learning process by giving the model feedback on where it’s making mistakes and how big those mistakes are. By trying to minimize this loss score, the model gradually improves its performance. Different problems (like classification or recommendation) use different loss methods tailored to their specific goals. Ultimately, loss methods are the model’s compass, directing it towards better performance through continuous feedback and adjustment.

In this post, we will learn about the different loss methods used in recommender systems and implement them in Python using NumPy.

Loss Methods for Recommendation Models

The following are the most commonly used loss methods for recommender systems:

1) Point-wise Loss

Point-wise loss functions treat each user-item interaction as an independent prediction problem: the model aims to predict the exact rating or preference score for each user-item pair. This is useful when we get explicit feedback from the user (e.g., the number of ⭐️ given after watching a YouTube video). Mean Squared Error (MSE) is the most commonly used loss method here: MSE = (1/n) * sum_i (y_i - ŷ_i)², where y_i is the true rating and ŷ_i is the predicted one.

Suppose we have the following predicted & actual ratings:

  • Predicted ratings: [4.2, 3.8, 2.5]
  • Actual ratings: [4.0, 3.5, 3.0]

We would calculate the MSE as:

import numpy as np

def pointwise_mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Computes the Mean Squared Error (MSE) loss.
    """
    return np.mean((y_true - y_pred) ** 2)

# Example usage
y_true_ratings = np.array([4.0, 3.5, 3.0])
y_pred_ratings = np.array([4.2, 3.8, 2.5])

loss = pointwise_mse_loss(y_true_ratings, y_pred_ratings)
print(f"Pointwise MSE Loss: {loss}")

2) Pair-wise Loss

Pair-wise loss functions focus on the relative ordering of item pairs. They aim to rank items relative to each other for a user, rather than predicting absolute scores.

2.1) Logistic Loss

The pair-wise logistic loss is used when the user has provided explicit feedback for items (e.g., movies) and we want to learn a model that can correctly rank pairs of movies. Instead of predicting absolute scores, it focuses on the relative order of items.

In the context of recommender systems:

  • We consider pairs of items (i, j) for a given user
  • Item i is preferred over item j (i.e., should be ranked higher)
  • The model predicts scores for both items
  • The loss function encourages the score of item i to be higher than the score of item j

The intuition behind this loss function is as follows:

We only consider pairs where y_i > y_j, i.e., where item i should be ranked higher than item j according to the ground truth. For these pairs, we want the predicted scores to satisfy s_i > s_j (our model should predict a higher score for the item that should be ranked higher).

If s_i is much larger than s_j, then exp(-(s_i - s_j)) will be close to 0, and log(1 + exp(-(s_i - s_j))) will be close to 0, resulting in a small loss. If s_i is close to or less than s_j, then exp(-(s_i - s_j)) will be close to or greater than 1, resulting in a larger loss. The log function dampens extremely large losses and makes the optimization more stable.
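
For example, if s_i - s_j = 2, the pair contributes log(1 + e⁻²) ≈ 0.127 to the loss; if the order is inverted so that s_i - s_j = -2, it contributes log(1 + e²) ≈ 2.127.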

[Figure: model loss vs. the difference in predicted scores (s_i - s_j)]

This loss function encourages the model to learn to rank items correctly by penalizing incorrectly ordered pairs.

Let’s break down the formula:

loss = sum_i sum_j I[y_i > y_j] * log(1 + exp(-(s_i - s_j)))

Explanation:

  • sum_i sum_j: This shows a double summation over all pairs of items i and j in the dataset.
  • I[y_i > y_j]: This is an indicator function. It equals 1 if y_i > y_j, and 0 otherwise. Here, y_i and y_j represent the true relevance scores or rankings of items i and j.
  • s_i and s_j: These are the predicted scores for items i and j from your ranking model.
  • exp(-(s_i - s_j)): This term computes the exponential of the negative difference between the predicted scores.
  • log(1 + exp(-(s_i - s_j))): This is the logistic loss for a single pair of items.

import numpy as np

def pairwise_logistic_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Calculate the pairwise logistic loss for a ranking problem.
    It penalizes incorrectly ordered pairs of items based on their
    true and predicted scores.

    The loss is calculated as:
    loss = sum_i sum_j I[y_true_i > y_true_j] * log(1 + exp(-(y_pred_i - y_pred_j)))

    where I[condition] is an indicator function that equals 1 when the
    condition is true and 0 otherwise.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    n = len(y_true)
    loss = 0.0
    print("\nPair-wise contributions:")
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                pair_loss = np.log(1 + np.exp(-(y_pred[i] - y_pred[j])))
                loss += pair_loss
                print(f"Pair ({i}, {j}) with preds ({y_pred[i]},{y_pred[j]}): Loss = {pair_loss:.4f}")

    return loss

# Example data
y_true = np.array([10, 2, 4, 1])          # True relevance scores
s_pred = np.array([12.5, 2.0, 3.5, 1.5])  # Predicted scores

# Calculate loss
loss = pairwise_logistic_loss(y_true, s_pred)

print(f"Pairwise Logistic Loss: {loss}")

Pair-wise contributions:
Pair (0, 1) with preds (12.5,2.0): Loss = 0.0000
Pair (0, 2) with preds (12.5,3.5): Loss = 0.0001
Pair (0, 3) with preds (12.5,1.5): Loss = 0.0000
Pair (1, 3) with preds (2.0,1.5): Loss = 0.4741
Pair (2, 1) with preds (3.5,2.0): Loss = 0.2014
Pair (2, 3) with preds (3.5,1.5): Loss = 0.1269
Pairwise Logistic Loss: 0.8025859130271021
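
The double loop above makes the pair-wise structure explicit, but the same computation can also be vectorized in NumPy. Below is a sketch that builds all pairwise score differences at once and should produce the same total loss as the loop (the helper name pairwise_logistic_loss_vec is just an illustrative choice):

import numpy as np

def pairwise_logistic_loss_vec(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # indicator[i, j] is True where item i should outrank item j
    indicator = y_true[:, None] > y_true[None, :]
    # diff[i, j] = s_i - s_j for every pair of predicted scores
    diff = y_pred[:, None] - y_pred[None, :]
    # Logistic loss summed over the violating pairs only
    return float(np.sum(np.log1p(np.exp(-diff)) * indicator))

y_true = np.array([10, 2, 4, 1])
s_pred = np.array([12.5, 2.0, 3.5, 1.5])
print(pairwise_logistic_loss_vec(y_true, s_pred))  # ≈ 0.8026, matching the loop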

2.2) Bayesian Personalized Ranking (BPR) Loss

Bayesian Personalized Ranking loss is another pair-wise ranking loss function commonly used in recommender systems. It’s useful for implicit feedback scenarios, where we don’t have explicit ratings but implicit indications of user preferences (e.g., clicks, views, purchases).

BPR optimizes ranking rather than predicting absolute ratings. It assumes that the observed (positive) items should be ranked higher than unobserved items for a user. It’s based on a Bayesian analysis of the pair-wise ranking problem.

The basic idea of BPR is to maximize the probability that a user prefers an observed item over an unobserved item. Mathematically, for a user u, an observed item i, and an unobserved item j, the model produces a preference score x̂_uij (typically the difference between the predicted scores for i and j), and σ(x̂_uij) is interpreted as the probability that u prefers i over j.

The BPR loss function is then defined as:

L = -ln(σ(x̂_uij)) + λ||Θ||²

where λ is a regularization parameter and ||Θ||² is the squared L2 norm of the model parameters.

Why is it called Bayesian?

The “Bayesian” in BPR refers to the use of a Bayesian analysis of the ranking problem, rather than the use of full Bayesian inference techniques. The authors [1] formulate the personalized ranking task as a maximum posterior estimation problem.

  • Posterior Probability: BPR aims to maximize the posterior probability of the personalized rankings. In Bayesian terms, it’s trying to find the most probable ranking given the observed data.
  • Prior and Likelihood: The method implicitly defines a prior over the parameters and a likelihood function for the observed pair-wise preferences.
  • Maximum A Posteriori (MAP) Estimation: BPR uses a maximum a posteriori (MAP) estimation approach, which is a Bayesian concept. MAP estimation aims to find the mode of the posterior distribution, rather than computing the full posterior distribution.
  • Probabilistic Interpretation: The sigmoid function used in BPR can be interpreted as the probability of one item being ranked higher than another, which aligns with Bayesian probabilistic thinking.

Mathematical Formulation:

Let Θ be the model parameters and >u be the personalized total ranking for a user u. The goal is to maximize:

p(Θ | >u) ∝ p(>u | Θ) p(Θ)

Where:

  • Θ: The model parameters (e.g., latent factors in matrix factorization)
  • >u: The personalized total ranking for user u
  • p(Θ | >u): The posterior probability of the parameters given the observed rankings
  • p(>u | Θ): The likelihood of the observed rankings given the parameters
  • p(Θ): The prior probability of the parameters
  • ∝: Proportional to (we often ignore the normalizing constant)

The goal is to find the parameters Θ that maximize the posterior probability p(Θ | >u).

  1. Taking the logarithm: To simplify calculations, we often work with the log of this probability. Taking logs of both sides:
    log p(Θ | >u) = log p(>u | Θ) + log p(Θ)
  2. Likelihood term: The likelihood p(>u | Θ) is modeled as a product of individual pair-wise preferences: p(>u | Θ) = ∏(u,i,j)∈DS p(i >u j | Θ)
    Where DS is the set of all (user, positive item, negative item) triples.
  3. Modeling individual preferences: Each pair-wise preference is modeled using the sigmoid function: p(i >u j | Θ) = σ(x̂_uij(Θ)), where x̂_uij(Θ) is, e.g., the dot product between the user’s latent factors and the difference between the positive and negative items’ latent factors.
  4. Prior term: The prior p(Θ) is typically modeled as a normal distribution, which in log form becomes proportional to the negative L2 norm of the parameters.
  5. The log posterior becomes:
    log p(Θ | >u) ∝ ∑(u,i,j)∈DS log σ(x̂_uij(Θ)) - λ||Θ||²

To convert this to a loss function (which we want to minimize), we negate it:

L(Θ) = -∑(u,i,j)∈DS log σ(x̂_uij(Θ)) + λ||Θ||²

Below is a simplified implementation for the BPR loss:

import numpy as np

def sigmoid(x: float) -> float:
    """
    Function to compute the sigmoid.
    """
    return 1 / (1 + np.exp(-x))

def bpr_loss(user: np.ndarray, pos_item: np.ndarray, neg_item: np.ndarray) -> float:
    """
    Function to compute the BPR loss for a single (user, positive, negative) triple.

    Params
    -------
    user: user latent factors learned during training.
    pos_item: latent factors of the positive (observed) item.
    neg_item: latent factors of the negative (unobserved) item.
    """
    x_uij = np.dot(user, pos_item - neg_item)
    return -np.log(sigmoid(x_uij))

def l2_reg(params: np.ndarray) -> float:
    """
    Function to compute the L2 regularization term (squared L2 norm).
    """
    return np.sum(params ** 2)

def compute_loss(user, items, pos_item, neg_item, lambda_reg):
    pos_factors = items[pos_item]
    neg_factors = items[neg_item]
    loss = bpr_loss(user, pos_factors, neg_factors)
    # Add the L2 penalty for each parameter vector
    reg = lambda_reg * np.sum([l2_reg(p) for p in [user, pos_factors, neg_factors]])
    return loss + reg

lambda_reg = 0.01

# User and item factors learned during the training phase
user = np.array([0.1, 0.3])
items = {
    'M': np.array([0.04, 0.05]),  # The Matrix
    'I': np.array([0.03, 0.02]),  # Inception
    'T': np.array([0.01, 0.02])   # Titanic
}

# Training data: (positive_item, negative_item) pairs for this user
training_data = [('M', 'I'), ('M', 'T'), ('I', 'T')]

total_loss = 0
for pos_item, neg_item in training_data:
    # Compute loss for this triple
    loss = compute_loss(user, items, pos_item, neg_item, lambda_reg)
    total_loss += loss
print(f"Loss: {total_loss}")

2.3) Weighted Approximate-Rank Pair-wise (WARP) Loss

WARP is another popular loss method used in recommender systems when we have implicit feedback from users and the primary goal is to optimize the top-K recommendations for each user. It’s similar to BPR in that it also uses (user, positive, negative) triplets to compute a pair-wise loss. But unlike BPR, it estimates the rank of the positive item based on how many sampling attempts were needed to find a violating negative item.

Basically, it samples a positive item, then keeps sampling negative items until it finds one that violates the desired ranking (i.e., pos_score - neg_score < 1, where 1 is the margin).

This is the formula for approximating rank and weight:

rank (r) ≈ floor((|I| - 1) / N), where |I| is the total number of items and N is the number of sampling attempts

weight = Φ(r) = log(r + 1)

Intuitively, this means that errors for positive items that the model currently ranks poorly (e.g., at position 8 out of 10) get more weight.

Let’s take this example to understand: Consider, we have 1000 items in total.

Scenario 1: It takes 2 samples to find the negative item.

  • rank = floor((1000 - 1) / 2) = 499
  • weight = log(499 + 1) ≈ 6.2

Scenario 2: It takes 100 samples to find the negative item.

  • rank = floor((1000 - 1) / 100) = 9
  • weight = log(9 + 1) ≈ 2.3

In Scenario 1, it only took a few samples to find a violating negative item, which suggests the positive item is currently ranked poorly, so the weight is higher. By assigning a higher weight to this positive item, we push the model to adjust its parameters to give it a higher score. In Scenario 2, it took many samples to find a violating negative item, so the weight is lower. Intuitively, if the predicted score for the positive item is already high, it will most likely take many samples to find a violating negative item, which means we don’t need to improve this item’s score much.

Here’s the simplified Numpy implementation:

import numpy as np

np.random.seed(1)

def warp_loss(positive_scores: np.ndarray, negative_scores: np.ndarray, max_trials: int = 100) -> float:
    n_pos = len(positive_scores)
    n_neg = len(negative_scores)
    tot_items = n_pos + n_neg
    loss = 0.0
    for pos_score in positive_scores:
        # Counter for how many negative items have been sampled
        count = 0
        for _ in range(max_trials):
            # Sample a negative item
            neg_idx = np.random.randint(n_neg)
            neg_score = negative_scores[neg_idx]

            if pos_score - neg_score < 1:
                # Violation found: estimate the rank from the number of trials
                rank = (tot_items - 1) // (count + 1)
                # Hinge loss weighted by log(rank + 1).
                # This encourages the model to rank positive items higher than
                # negative items by at least the specified margin (1 in this case).
                loss += np.log(rank + 1) * max(0, 1 - (pos_score - neg_score))
                break

            count += 1

        if count == max_trials:
            # No violation found after max_trials; the estimated rank is
            # low, so this contributes only a small weighted penalty
            rank = (tot_items - 1) // max_trials
            loss += np.log(rank + 1)

    return loss / n_pos


# Positive (relevant) items
positive_scores = np.array([0.9, 0.8, 0.7])

# Negative (irrelevant) items
negative_scores = np.array([0.6, 0.5, 0.4, 0.3, 0.2])

loss = warp_loss(positive_scores, negative_scores)
print(f"WARP Loss: {loss}")

References

  1. Rendle et al., “BPR: Bayesian Personalized Ranking from Implicit Feedback”
  2. TensorFlow Ranking, PairwiseLogisticLoss: https://www.tensorflow.org/ranking/api_docs/python/tfr/keras/losses/PairwiseLogisticLoss
  3. LightFM WARP loss example: https://making.lyst.com/lightfm/docs/examples/warp_loss.html
  4. Weston et al., “Learning to Rank Recommendations with the k-Order Statistic Loss”
  5. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf
