Understanding Recommender System Metrics: A Deep Dive into Python Implementation
Recommender systems have become an integral part of many applications, from e-commerce to streaming services.
But how do we measure the effectiveness of these systems?
Today, we’ll explore a Python class called RecommenderMetrics that implements various metrics for evaluating recommender systems.
Overview of the RecommenderMetrics Class
The RecommenderMetrics class provides a set of static methods to calculate different performance metrics for recommender systems. These metrics help us understand various aspects of our recommendations, such as accuracy, relevance, and diversity.
Let’s break down each method and understand its purpose and implementation:
1. MAE (Mean Absolute Error)
def MAE(predictions):
    return accuracy.mae(predictions, verbose=False)
MAE measures the average absolute difference between predicted ratings and actual ratings. It gives us an idea of how far off our predictions are on average. The method wraps the mae function from the surprise.accuracy module; Surprise is a popular Python library for building recommender systems.
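The formula itself is simple. Here is a minimal, dependency-free sketch of the quantity that accuracy.mae computes, using a hypothetical list of (actual, predicted) rating pairs rather than Surprise's prediction objects:

```python
# Hypothetical (actual, predicted) rating pairs for illustration
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.0), (2.0, 3.0)]

# MAE: the average of the absolute prediction errors
mae = sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)
print(mae)  # 0.625
```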
2. RMSE (Root Mean Square Error)
def RMSE(predictions):
    return accuracy.rmse(predictions, verbose=False)
RMSE is similar to MAE but gives more weight to larger errors. It’s calculated by taking the square root of the average of squared differences between predicted and actual ratings. Like MAE, it uses a function from the surprise.accuracy module.
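A minimal sketch of the calculation, again on a hypothetical list of (actual, predicted) pairs, shows how the squaring step penalizes the two 1.0-point errors more heavily than MAE would:

```python
# Hypothetical (actual, predicted) rating pairs for illustration
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.0), (2.0, 3.0)]

# RMSE: square root of the mean squared prediction error
mse = sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs)
rmse = mse ** 0.5
print(rmse)  # 0.75
```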
3. GetTopN
def GetTopN(predictions, n=10, minimumRating=4.0):
    topN = defaultdict(list)
    for userID, movieID, actualRating, estimatedRating, _ in predictions:
        if estimatedRating >= minimumRating:
            topN[int(userID)].append((int(movieID), estimatedRating))
    for userID, ratings in topN.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topN[int(userID)] = ratings[:n]
    return topN
This method processes the predictions to create a “Top-N” list of recommendations for each user. It filters predictions based on a minimum rating threshold and returns the top N items for each user, sorted by predicted rating.
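To make the input and output shapes concrete, here is the same logic reproduced as a standalone function, applied to toy tuples in the (userID, movieID, actualRating, estimatedRating, details) shape that Surprise predictions unpack to — all IDs and ratings here are made up:

```python
from collections import defaultdict

def get_top_n(predictions, n=10, minimum_rating=4.0):
    # Same filtering/sorting logic as GetTopN, reproduced for illustration
    top_n = defaultdict(list)
    for user_id, movie_id, actual, estimated, _ in predictions:
        if estimated >= minimum_rating:
            top_n[int(user_id)].append((int(movie_id), estimated))
    for user_id, ratings in top_n.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[user_id] = ratings[:n]
    return top_n

# Toy predictions mimicking Surprise's (uid, iid, r_ui, est, details) tuples
predictions = [
    ("1", "101", 5.0, 4.8, None),
    ("1", "102", 3.0, 3.2, None),  # filtered out: estimate below the threshold
    ("1", "103", 4.0, 4.5, None),
    ("2", "101", 4.0, 4.1, None),
]
top_n = get_top_n(predictions, n=2)
print(dict(top_n))  # {1: [(101, 4.8), (103, 4.5)], 2: [(101, 4.1)]}
```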
4. HitRate
def HitRate(topNPredicted, leftOutPredictions):
    hits = 0
    total = 0
    # For each left-out rating
    for leftOut in leftOutPredictions:
        userID = leftOut[0]
        leftOutMovieID = leftOut[1]
        # Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if int(leftOutMovieID) == int(movieID):
                hit = True
                break
        if hit:
            hits += 1
        total += 1
    # Compute the overall hit rate
    return hits / total
HitRate calculates the proportion of left-out items that appear in the top-N recommendations. It’s a measure of how often the system can successfully recommend items that the user actually rated highly.
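A dependency-free sketch of the same counting logic, using a toy top-N dictionary and hypothetical held-out (userID, movieID) pairs:

```python
# Hypothetical top-N lists and left-out (userID, movieID) pairs
top_n = {1: [(101, 4.8), (103, 4.5)], 2: [(205, 4.2)]}
left_out = [(1, 103), (2, 204)]  # user 2's held-out movie was not recommended

# A "hit" means the held-out movie appears anywhere in that user's top-N list
hits = sum(
    1 for user_id, movie_id in left_out
    if any(movie_id == rec_id for rec_id, _ in top_n[user_id])
)
hit_rate = hits / len(left_out)
print(hit_rate)  # 0.5
```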
5. CumulativeHitRate
def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
    hits = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Only look at ability to recommend things the users actually liked...
        if actualRating >= ratingCutoff:
            # Is it in the predicted top N for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if int(leftOutMovieID) == movieID:
                    hit = True
                    break
            if hit:
                hits += 1
            total += 1
    # Compute the overall hit rate
    return hits / total
This is similar to HitRate but only considers items above a certain rating cutoff. It helps measure how well the system recommends items that users actually liked.
6. RatingHitRate
def RatingHitRate(topNPredicted, leftOutPredictions):
    hits = defaultdict(float)
    total = defaultdict(float)
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if int(leftOutMovieID) == movieID:
                hit = True
                break
        if hit:
            hits[actualRating] += 1
        total[actualRating] += 1
    # Print the hit rate for each rating value
    for rating in sorted(hits.keys()):
        print(rating, hits[rating] / total[rating])
RatingHitRate breaks down the hit rate by rating value. It shows how the system performs for different rating levels, which can be useful for understanding if the system is better at predicting highly-rated items or lower-rated ones.
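A sketch of the per-rating breakdown on toy data (all IDs and ratings here are hypothetical): held-out items are tallied into buckets keyed by their actual rating, and each bucket gets its own hit rate.

```python
from collections import defaultdict

# Hypothetical top-N lists and left-out (userID, movieID, actualRating) triples
top_n = {1: [(101, 4.8)], 2: [(205, 4.2)], 3: [(301, 4.4)]}
left_out = [(1, 101, 5.0), (2, 204, 5.0), (3, 301, 4.0)]

hits = defaultdict(float)
total = defaultdict(float)
for user_id, movie_id, actual in left_out:
    if any(movie_id == rec_id for rec_id, _ in top_n[user_id]):
        hits[actual] += 1
    total[actual] += 1

# One hit rate per actual-rating value
for rating in sorted(hits.keys()):
    print(rating, hits[rating] / total[rating])
# 4.0 1.0
# 5.0 0.5
```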
7. AverageReciprocalHitRank
def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
    summation = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Find the rank of the hit (if any) in the predicted top N
        hitRank = 0
        rank = 0
        for movieID, predictedRating in topNPredicted[int(userID)]:
            rank = rank + 1
            if int(leftOutMovieID) == movieID:
                hitRank = rank
                break
        if hitRank > 0:
            summation += 1.0 / hitRank
        total += 1
    return summation / total
This metric takes into account the position of the hit in the top-N list. A hit at the top of the list is worth more than a hit further down. It’s useful for evaluating the ranking quality of the recommendations.
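A sketch on toy data showing how the hit's position determines its contribution: a rank-1 hit adds 1.0, a rank-2 hit adds 0.5, and a miss adds nothing.

```python
# Hypothetical top-N lists; list position determines the reciprocal rank
top_n = {1: [(101, 4.8), (103, 4.5), (107, 4.2)], 2: [(205, 4.2), (208, 4.0)]}
left_out = [(1, 103), (2, 205), (1, 999)]  # rank-2 hit, rank-1 hit, miss

summation = 0.0
for user_id, movie_id in left_out:
    for rank, (rec_id, _) in enumerate(top_n[user_id], start=1):
        if rec_id == movie_id:
            summation += 1.0 / rank
            break

arhr = summation / len(left_out)
print(arhr)  # (1/2 + 1/1 + 0) / 3 = 0.5
```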
8. UserCoverage
def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
    hits = 0
    for userID in topNPredicted.keys():
        hit = False
        for movieID, predictedRating in topNPredicted[userID]:
            if predictedRating >= ratingThreshold:
                hit = True
                break
        if hit:
            hits += 1
    return hits / numUsers
UserCoverage measures the percentage of users who receive at least one recommendation with a predicted rating above the threshold. It tells us how broadly the system can serve its user base.
9. Diversity
def Diversity(topNPredicted, simsAlgo):
    n = 0
    total = 0
    simsMatrix = simsAlgo.compute_similarities()
    for userID in topNPredicted.keys():
        pairs = itertools.combinations(topNPredicted[userID], 2)
        for pair in pairs:
            movie1 = pair[0][0]
            movie2 = pair[1][0]
            innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
            innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
            similarity = simsMatrix[innerID1][innerID2]
            total += similarity
            n += 1
    S = total / n
    return 1 - S
Diversity measures how different the recommended items are from each other. It uses a similarity algorithm to compute pairwise similarities between recommended items and returns a score where higher values indicate more diverse recommendations.
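The method above requires a trained Surprise similarity algorithm, but the underlying idea can be sketched without one. Here is a dependency-free version using a hypothetical precomputed item-item similarity matrix, where indices double as item IDs:

```python
import itertools

# Hypothetical item-item similarity matrix (indices serve as item IDs)
sims = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
top_n = {1: [(0, 4.8), (1, 4.5), (2, 4.2)]}  # one user's recommendations

# Average the similarity over every pair of recommended items
total, n = 0.0, 0
for user_id in top_n:
    for (item1, _), (item2, _) in itertools.combinations(top_n[user_id], 2):
        total += sims[item1][item2]
        n += 1

diversity = 1 - total / n
print(round(diversity, 3))  # 1 - (0.9 + 0.1 + 0.2) / 3 = 0.6
```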
10. Novelty
def Novelty(topNPredicted, rankings):
    n = 0
    total = 0
    for userID in topNPredicted.keys():
        for rating in topNPredicted[userID]:
            movieID = rating[0]
            rank = rankings[movieID]
            total += rank
            n += 1
    return total / n
Novelty measures how “unusual” or unpopular the recommended items are. It uses a pre-computed ranking of items (presumably by popularity) and calculates the average rank of recommended items. A higher score indicates more novel (less popular) recommendations.
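A sketch with hypothetical popularity rankings, where rank 1 means the most popular item — recommending the obscure movie 103 pulls the average rank (and thus the novelty score) up sharply:

```python
# Hypothetical popularity rankings: rank 1 = most popular item
rankings = {101: 1, 103: 250, 205: 40}
top_n = {1: [(101, 4.8), (103, 4.5)], 2: [(205, 4.2)]}

# Average the popularity rank over all recommended items
total, n = 0, 0
for user_id in top_n:
    for movie_id, _ in top_n[user_id]:
        total += rankings[movie_id]
        n += 1

novelty = total / n
print(novelty)  # (1 + 250 + 40) / 3 = 97.0
```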
Full code
import itertools
from surprise import accuracy
from collections import defaultdict


class RecommenderMetrics:

    def MAE(predictions):
        return accuracy.mae(predictions, verbose=False)

    def RMSE(predictions):
        return accuracy.rmse(predictions, verbose=False)

    def GetTopN(predictions, n=10, minimumRating=4.0):
        topN = defaultdict(list)
        for userID, movieID, actualRating, estimatedRating, _ in predictions:
            if estimatedRating >= minimumRating:
                topN[int(userID)].append((int(movieID), estimatedRating))
        for userID, ratings in topN.items():
            ratings.sort(key=lambda x: x[1], reverse=True)
            topN[int(userID)] = ratings[:n]
        return topN

    def HitRate(topNPredicted, leftOutPredictions):
        hits = 0
        total = 0
        # For each left-out rating
        for leftOut in leftOutPredictions:
            userID = leftOut[0]
            leftOutMovieID = leftOut[1]
            # Is it in the predicted top N for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if int(leftOutMovieID) == int(movieID):
                    hit = True
                    break
            if hit:
                hits += 1
            total += 1
        # Compute the overall hit rate
        return hits / total

    def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
        hits = 0
        total = 0
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Only look at ability to recommend things the users actually liked...
            if actualRating >= ratingCutoff:
                # Is it in the predicted top N for this user?
                hit = False
                for movieID, predictedRating in topNPredicted[int(userID)]:
                    if int(leftOutMovieID) == movieID:
                        hit = True
                        break
                if hit:
                    hits += 1
                total += 1
        # Compute the overall hit rate
        return hits / total

    def RatingHitRate(topNPredicted, leftOutPredictions):
        hits = defaultdict(float)
        total = defaultdict(float)
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Is it in the predicted top N for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if int(leftOutMovieID) == movieID:
                    hit = True
                    break
            if hit:
                hits[actualRating] += 1
            total[actualRating] += 1
        # Print the hit rate for each rating value
        for rating in sorted(hits.keys()):
            print(rating, hits[rating] / total[rating])

    def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
        summation = 0
        total = 0
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Find the rank of the hit (if any) in the predicted top N
            hitRank = 0
            rank = 0
            for movieID, predictedRating in topNPredicted[int(userID)]:
                rank = rank + 1
                if int(leftOutMovieID) == movieID:
                    hitRank = rank
                    break
            if hitRank > 0:
                summation += 1.0 / hitRank
            total += 1
        return summation / total

    # What percentage of users have at least one "good" recommendation
    def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
        hits = 0
        for userID in topNPredicted.keys():
            hit = False
            for movieID, predictedRating in topNPredicted[userID]:
                if predictedRating >= ratingThreshold:
                    hit = True
                    break
            if hit:
                hits += 1
        return hits / numUsers

    def Diversity(topNPredicted, simsAlgo):
        n = 0
        total = 0
        simsMatrix = simsAlgo.compute_similarities()
        for userID in topNPredicted.keys():
            pairs = itertools.combinations(topNPredicted[userID], 2)
            for pair in pairs:
                movie1 = pair[0][0]
                movie2 = pair[1][0]
                innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
                innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
                similarity = simsMatrix[innerID1][innerID2]
                total += similarity
                n += 1
        S = total / n
        return 1 - S

    def Novelty(topNPredicted, rankings):
        n = 0
        total = 0
        for userID in topNPredicted.keys():
            for rating in topNPredicted[userID]:
                movieID = rating[0]
                rank = rankings[movieID]
                total += rank
                n += 1
        return total / n
Conclusion
The RecommenderMetrics class provides a comprehensive set of tools for evaluating recommender systems. By using these metrics, developers can gain insights into various aspects of their recommender system's performance, from basic accuracy to more nuanced concepts like diversity and novelty.
When implementing a recommender system, it’s crucial to consider multiple metrics, as each provides a different perspective on the system’s performance. A well-rounded evaluation using these metrics can help in fine-tuning the recommender system and ensuring it provides value to users.
Remember, the choice of which metrics to prioritize depends on your specific use case and goals. For example, an e-commerce site might prioritize accuracy and hit rate, while a music streaming service might place more emphasis on diversity and novelty.
By understanding and utilizing these metrics, you can create more effective and user-friendly recommender systems that keep your users engaged and satisfied.