Understanding Recommender System Metrics: A Deep Dive into Python Implementation
Recommender systems have become an integral part of many applications, from e-commerce to streaming services.
But how do we measure the effectiveness of these systems?
Today, we’ll explore a Python class called RecommenderMetrics that implements various metrics for evaluating recommender systems.
Overview of the RecommenderMetrics Class
The RecommenderMetrics class provides a set of static methods to calculate different performance metrics for recommender systems. These metrics help us understand various aspects of our recommendations, such as accuracy, relevance, and diversity.
Let’s break down each method and understand its purpose and implementation:
1. MAE (Mean Absolute Error)
def MAE(predictions):
    return accuracy.mae(predictions, verbose=False)
MAE measures the average absolute difference between predicted ratings and actual ratings. It gives us an idea of how far off our predictions are on average. The method wraps the mae function from the surprise.accuracy module; Surprise is a popular Python library for building recommender systems.
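The formula itself is simple. Here is a minimal, dependency-free sketch of the quantity that accuracy.mae computes, using a hypothetical list of (actual, predicted) rating pairs rather than Surprise's prediction objects:

```python
# Hypothetical (actual, predicted) rating pairs for illustration
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.0), (2.0, 3.0)]

# MAE: the average of the absolute prediction errors
mae = sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)
print(mae)  # 0.625
```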
2. RMSE (Root Mean Square Error)
def RMSE(predictions):
    return accuracy.rmse(predictions, verbose=False)
RMSE is similar to MAE but gives more weight to larger errors. It’s calculated by taking the square root of the average of squared differences between predicted and actual ratings. Like MAE, it uses a function from the surprise.accuracy module.
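A minimal sketch of the calculation, again on a hypothetical list of (actual, predicted) pairs, shows how the squaring step penalizes the two 1.0-point errors more heavily than MAE would:

```python
# Hypothetical (actual, predicted) rating pairs for illustration
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.0), (2.0, 3.0)]

# RMSE: square root of the mean squared prediction error
mse = sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs)
rmse = mse ** 0.5
print(rmse)  # 0.75
```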
3. GetTopN
def GetTopN(predictions, n=10, minimumRating=4.0):
    topN = defaultdict(list)
    for userID, movieID, actualRating, estimatedRating, _ in predictions:
        if estimatedRating >= minimumRating:
            topN[int(userID)].append((int(movieID), estimatedRating))
    for userID, ratings in topN.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topN[int(userID)] = ratings[:n]
    return topN
This method processes the predictions to create a “Top-N” list of recommendations for each user. It filters predictions based on a minimum rating threshold and returns the top N items for each user, sorted by predicted rating.
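To make the input and output shapes concrete, here is the same logic reproduced as a standalone function, applied to toy tuples in the (userID, movieID, actualRating, estimatedRating, details) shape that Surprise predictions unpack to — all IDs and ratings here are made up:

```python
from collections import defaultdict

def get_top_n(predictions, n=10, minimum_rating=4.0):
    # Same filtering/sorting logic as GetTopN, reproduced for illustration
    top_n = defaultdict(list)
    for user_id, movie_id, actual, estimated, _ in predictions:
        if estimated >= minimum_rating:
            top_n[int(user_id)].append((int(movie_id), estimated))
    for user_id, ratings in top_n.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[user_id] = ratings[:n]
    return top_n

# Toy predictions mimicking Surprise's (uid, iid, r_ui, est, details) tuples
predictions = [
    ("1", "101", 5.0, 4.8, None),
    ("1", "102", 3.0, 3.2, None),  # filtered out: estimate below the threshold
    ("1", "103", 4.0, 4.5, None),
    ("2", "101", 4.0, 4.1, None),
]
top_n = get_top_n(predictions, n=2)
print(dict(top_n))  # {1: [(101, 4.8), (103, 4.5)], 2: [(101, 4.1)]}
```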
4. HitRate
def HitRate(topNPredicted, leftOutPredictions):
    hits = 0
    total = 0
    # For each left-out rating
    for leftOut in leftOutPredictions:
        userID = leftOut[0]
        leftOutMovieID = leftOut[1]
        # Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if int(leftOutMovieID) == int(movieID):
                hit = True
                break
        if hit:
            hits += 1
        total += 1
    # Compute the overall hit rate
    return hits / total
HitRate calculates the proportion of left-out items that appear in the top-N recommendations. It’s a measure of how often the system can successfully recommend items that the user actually rated highly.
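A dependency-free sketch of the same counting logic, using a toy top-N dictionary and hypothetical held-out (userID, movieID) pairs:

```python
# Hypothetical top-N lists and left-out (userID, movieID) pairs
top_n = {1: [(101, 4.8), (103, 4.5)], 2: [(205, 4.2)]}
left_out = [(1, 103), (2, 204)]  # user 2's held-out movie was not recommended

# A "hit" means the held-out movie appears anywhere in that user's top-N list
hits = sum(
    1 for user_id, movie_id in left_out
    if any(movie_id == rec_id for rec_id, _ in top_n[user_id])
)
hit_rate = hits / len(left_out)
print(hit_rate)  # 0.5
```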
5. CumulativeHitRate
def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
    hits = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Only look at ability to recommend things the users actually liked...
        if actualRating >= ratingCutoff:
            # Is it in the predicted top N for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if int(leftOutMovieID) == movieID:
                    hit = True
                    break
            if hit:
                hits += 1
            total += 1
    # Compute the overall hit rate
    return hits / total
This is similar to HitRate but only considers items above a certain rating cutoff. It helps measure how well the system recommends items that users actually liked.
6. RatingHitRate
def RatingHitRate(topNPredicted, leftOutPredictions):
    hits = defaultdict(float)
    total = defaultdict(float)
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if int(leftOutMovieID) == movieID:
                hit = True
                break
        if hit:
            hits[actualRating] += 1
        total[actualRating] += 1
    # Print the hit rate for each rating value
    for rating in sorted(hits.keys()):
        print(rating, hits[rating] / total[rating])
RatingHitRate breaks down the hit rate by rating value. It shows how the system performs for different rating levels, which can be useful for understanding if the system is better at predicting highly-rated items or lower-rated ones.
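A sketch of the per-rating breakdown on toy data (all IDs and ratings here are hypothetical): held-out items are tallied into buckets keyed by their actual rating, and each bucket gets its own hit rate.

```python
from collections import defaultdict

# Hypothetical top-N lists and left-out (userID, movieID, actualRating) triples
top_n = {1: [(101, 4.8)], 2: [(205, 4.2)], 3: [(301, 4.4)]}
left_out = [(1, 101, 5.0), (2, 204, 5.0), (3, 301, 4.0)]

hits = defaultdict(float)
total = defaultdict(float)
for user_id, movie_id, actual in left_out:
    if any(movie_id == rec_id for rec_id, _ in top_n[user_id]):
        hits[actual] += 1
    total[actual] += 1

# One hit rate per actual-rating value
for rating in sorted(hits.keys()):
    print(rating, hits[rating] / total[rating])
# 4.0 1.0
# 5.0 0.5
```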
7. AverageReciprocalHitRank
def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
    summation = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Find the rank of the hit (if any) in the predicted top N
        hitRank = 0
        rank = 0
        for movieID, predictedRating in topNPredicted[int(userID)]:
            rank = rank + 1
            if int(leftOutMovieID) == movieID:
                hitRank = rank
                break
        if hitRank > 0:
            summation += 1.0 / hitRank
        total += 1
    return summation / total
This metric takes into account the position of the hit in the top-N list. A hit at the top of the list is worth more than a hit further down. It’s useful for evaluating the ranking quality of the recommendations.
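A sketch on toy data showing how the hit's position determines its contribution: a rank-1 hit adds 1.0, a rank-2 hit adds 0.5, and a miss adds nothing.

```python
# Hypothetical top-N lists; list position determines the reciprocal rank
top_n = {1: [(101, 4.8), (103, 4.5), (107, 4.2)], 2: [(205, 4.2), (208, 4.0)]}
left_out = [(1, 103), (2, 205), (1, 999)]  # rank-2 hit, rank-1 hit, miss

summation = 0.0
for user_id, movie_id in left_out:
    for rank, (rec_id, _) in enumerate(top_n[user_id], start=1):
        if rec_id == movie_id:
            summation += 1.0 / rank
            break

arhr = summation / len(left_out)
print(arhr)  # (1/2 + 1/1 + 0) / 3 = 0.5
```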
8. UserCoverage
def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
    hits = 0
    for userID in topNPredicted.keys():
        hit = False
        for movieID, predictedRating in topNPredicted[userID]:
            if predictedRating >= ratingThreshold:
                hit = True
                break
        if hit:
            hits += 1
    return hits / numUsers
UserCoverage measures the percentage of users who receive at least one recommendation with a predicted rating above the threshold. It tells us how broadly the system can serve its user base.
9. Diversity
def Diversity(topNPredicted, simsAlgo):
    n = 0
    total = 0
    simsMatrix = simsAlgo.compute_similarities()
    for userID in topNPredicted.keys():
        pairs = itertools.combinations(topNPredicted[userID], 2)
        for pair in pairs:
            movie1 = pair[0][0]
            movie2 = pair[1][0]
            innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
            innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
            similarity = simsMatrix[innerID1][innerID2]
            total += similarity
            n += 1
    S = total / n
    return 1 - S
Diversity measures how different the recommended items are from each other. It uses a similarity algorithm to compute pairwise similarities between recommended items and returns a score where higher values indicate more diverse recommendations.
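The method above requires a trained Surprise similarity algorithm, but the underlying idea can be sketched without one. Here is a dependency-free version using a hypothetical precomputed item-item similarity matrix, where indices double as item IDs:

```python
import itertools

# Hypothetical item-item similarity matrix (indices serve as item IDs)
sims = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
top_n = {1: [(0, 4.8), (1, 4.5), (2, 4.2)]}  # one user's recommendations

# Average the similarity over every pair of recommended items
total, n = 0.0, 0
for user_id in top_n:
    for (item1, _), (item2, _) in itertools.combinations(top_n[user_id], 2):
        total += sims[item1][item2]
        n += 1

diversity = 1 - total / n
print(round(diversity, 3))  # 1 - (0.9 + 0.1 + 0.2) / 3 = 0.6
```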
10. Novelty
def Novelty(topNPredicted, rankings):
    n = 0
    total = 0
    for userID in topNPredicted.keys():
        for rating in topNPredicted[userID]:
            movieID = rating[0]
            rank = rankings[movieID]
            total += rank
            n += 1
    return total / n
Novelty measures how “unusual” or unpopular the recommended items are. It uses a pre-computed ranking of items (presumably by popularity) and calculates the average rank of recommended items. A higher score indicates more novel (less popular) recommendations.
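A sketch with hypothetical popularity rankings, where rank 1 means the most popular item — recommending the obscure movie 103 pulls the average rank (and thus the novelty score) up sharply:

```python
# Hypothetical popularity rankings: rank 1 = most popular item
rankings = {101: 1, 103: 250, 205: 40}
top_n = {1: [(101, 4.8), (103, 4.5)], 2: [(205, 4.2)]}

# Average the popularity rank over all recommended items
total, n = 0, 0
for user_id in top_n:
    for movie_id, _ in top_n[user_id]:
        total += rankings[movie_id]
        n += 1

novelty = total / n
print(novelty)  # (1 + 250 + 40) / 3 = 97.0
```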
Full code
import itertools
from surprise import accuracy
from collections import defaultdict


class RecommenderMetrics:

    def MAE(predictions):
        return accuracy.mae(predictions, verbose=False)

    def RMSE(predictions):
        return accuracy.rmse(predictions, verbose=False)

    def GetTopN(predictions, n=10, minimumRating=4.0):
        topN = defaultdict(list)
        for userID, movieID, actualRating, estimatedRating, _ in predictions:
            if estimatedRating >= minimumRating:
                topN[int(userID)].append((int(movieID), estimatedRating))
        for userID, ratings in topN.items():
            ratings.sort(key=lambda x: x[1], reverse=True)
            topN[int(userID)] = ratings[:n]
        return topN

    def HitRate(topNPredicted, leftOutPredictions):
        hits = 0
        total = 0
        # For each left-out rating
        for leftOut in leftOutPredictions:
            userID = leftOut[0]
            leftOutMovieID = leftOut[1]
            # Is it in the predicted top N for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if int(leftOutMovieID) == int(movieID):
                    hit = True
                    break
            if hit:
                hits += 1
            total += 1
        # Compute the overall hit rate
        return hits / total

    def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
        hits = 0
        total = 0
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Only look at ability to recommend things the users actually liked...
            if actualRating >= ratingCutoff:
                # Is it in the predicted top N for this user?
                hit = False
                for movieID, predictedRating in topNPredicted[int(userID)]:
                    if int(leftOutMovieID) == movieID:
                        hit = True
                        break
                if hit:
                    hits += 1
                total += 1
        # Compute the overall hit rate
        return hits / total

    def RatingHitRate(topNPredicted, leftOutPredictions):
        hits = defaultdict(float)
        total = defaultdict(float)
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Is it in the predicted top N for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if int(leftOutMovieID) == movieID:
                    hit = True
                    break
            if hit:
                hits[actualRating] += 1
            total[actualRating] += 1
        # Print the hit rate for each rating value
        for rating in sorted(hits.keys()):
            print(rating, hits[rating] / total[rating])

    def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
        summation = 0
        total = 0
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Find the rank of the hit (if any) in the predicted top N
            hitRank = 0
            rank = 0
            for movieID, predictedRating in topNPredicted[int(userID)]:
                rank = rank + 1
                if int(leftOutMovieID) == movieID:
                    hitRank = rank
                    break
            if hitRank > 0:
                summation += 1.0 / hitRank
            total += 1
        return summation / total

    # What percentage of users have at least one "good" recommendation
    def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
        hits = 0
        for userID in topNPredicted.keys():
            hit = False
            for movieID, predictedRating in topNPredicted[userID]:
                if predictedRating >= ratingThreshold:
                    hit = True
                    break
            if hit:
                hits += 1
        return hits / numUsers

    def Diversity(topNPredicted, simsAlgo):
        n = 0
        total = 0
        simsMatrix = simsAlgo.compute_similarities()
        for userID in topNPredicted.keys():
            pairs = itertools.combinations(topNPredicted[userID], 2)
            for pair in pairs:
                movie1 = pair[0][0]
                movie2 = pair[1][0]
                innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
                innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
                similarity = simsMatrix[innerID1][innerID2]
                total += similarity
                n += 1
        S = total / n
        return 1 - S

    def Novelty(topNPredicted, rankings):
        n = 0
        total = 0
        for userID in topNPredicted.keys():
            for rating in topNPredicted[userID]:
                movieID = rating[0]
                rank = rankings[movieID]
                total += rank
                n += 1
        return total / n
Conclusion
The RecommenderMetrics class provides a comprehensive set of tools for evaluating recommender systems. By using these metrics, developers can gain insights into various aspects of their recommender system's performance, from basic accuracy to more nuanced concepts like diversity and novelty.
When implementing a recommender system, it’s crucial to consider multiple metrics, as each provides a different perspective on the system’s performance. A well-rounded evaluation using these metrics can help in fine-tuning the recommender system and ensuring it provides value to users.
Remember, the choice of which metrics to prioritize depends on your specific use case and goals. For example, an e-commerce site might prioritize accuracy and hit rate, while a music streaming service might place more emphasis on diversity and novelty.
By understanding and utilizing these metrics, you can create more effective and user-friendly recommender systems that keep your users engaged and satisfied.