Basics of Content Based and Collaborative Based Recommendation Engines

Jun 13, 2023

Introduction

Recommendation systems play a vital role in today’s digital landscape by providing personalized suggestions to users, helping them discover relevant content and products. Two popular approaches used in recommendation systems are content-based filtering and collaborative filtering. In this blog, we will delve into these two techniques, understand their working principles, and explore their strengths and weaknesses.

Content-Based Filtering:

Content-based filtering is a popular approach in recommendation systems that utilizes the characteristics and properties of items to make personalized recommendations. It relies on the idea that if a user has liked or shown interest in certain items in the past, they are likely to be interested in similar items in the future.

The basic concept behind content-based filtering is to analyze the content or attributes of items and find similarities or patterns among them. These attributes can vary depending on the domain or type of items being recommended. For example, in the case of movies, attributes could include genre, actors, directors, plot keywords, and user reviews. In the case of articles, attributes could include the text content, author, category, and tags.

1.1 How Content-Based Filtering Works: Content-based filtering follows these steps to generate recommendations:

Step 1: Item Feature Extraction: The system extracts relevant features or attributes from each item. For movies, features could include genres, actors, directors, and plot keywords. For news articles, features could include the title, tags, and content.

Step 2: User Profile Creation: The system builds a user profile by analyzing the items the user has interacted with in the past. This profile represents the user’s preferences based on the features of those items.

Step 3: Similarity Computation: The system calculates the similarity between items using a similarity measure such as cosine similarity or Euclidean distance. It compares the feature vectors of items to determine how similar they are.

Step 4: Recommendation Generation: Based on the user profile and item similarities, the system selects items that are most similar to the ones the user has interacted with and recommends them to the user.
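
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: the toy catalogue, the `liked` list, and the choice to build the user profile as the plain average of the liked items' TF-IDF vectors are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue: item title -> short description
items = {
    "Space Quest": "astronauts explore a distant planet in space",
    "Galaxy Wars": "rebels fight an empire in space battles",
    "Love in Paris": "a romantic story of two strangers in Paris",
    "Paris Nights": "a couple finds love and romance in Paris",
}
titles = list(items)

# Step 1: extract a feature vector for every item
vectorizer = TfidfVectorizer(stop_words="english")
item_vectors = vectorizer.fit_transform(list(items.values()))

# Step 2: build the user profile as the mean vector of the liked items
# (averaging is an assumption; weighted schemes are also possible)
liked = ["Space Quest"]
liked_idx = [titles.index(t) for t in liked]
user_profile = np.asarray(item_vectors[liked_idx].mean(axis=0)).reshape(1, -1)

# Step 3: score every item against the profile with cosine similarity
scores = cosine_similarity(user_profile, item_vectors).ravel()

# Step 4: rank the items the user has not interacted with yet
candidates = [(t, s) for t, s in zip(titles, scores) if t not in liked]
recommended = sorted(candidates, key=lambda pair: pair[1], reverse=True)
print(recommended)
```

Because "Galaxy Wars" is the only other item that shares a term ("space") with the liked item, it ends up at the top of the ranking, while the two Paris titles score zero.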

1.2 Advantages of Content-Based Filtering:

Content-based filtering is capable of providing personalized recommendations even for new or niche items because it focuses on item features rather than relying on user feedback.
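It can mitigate the cold start problem for new items, which have no interaction data yet, by using item features to make recommendations; a brand-new user still needs a few interactions before a useful profile can be built.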

1.3 Limitations of Content-Based Filtering:

Content-based filtering tends to recommend similar items, which can lead to limited diversification in recommendations.
It relies heavily on accurate item feature extraction, and if the features are not well-defined or incomplete, it may result in poor recommendations.
It does not consider the opinions or preferences of other users, which can limit the discovery of serendipitous recommendations.

Let’s see how to apply content-based filtering to a movie dataset in Python.

Data:

movieId,title,genres,overview
1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,"Toy Story is about the 'secret life of toys' when people are not around. The film focuses on the relationship between Woody and Buzz Lightyear, two rival toys."
2,Jumanji,Adventure|Children|Fantasy,"In Jumanji, a magical board game unleashes jungle-based hazards upon its players with every turn they take."
3,Grumpier Old Men,Comedy|Romance,"Grumpier Old Men is a romantic comedy about the reunion between two aging neighbors, Max and John, who team up against a beautiful woman who moves into the neighborhood."
4,Waiting to Exhale,Comedy|Drama,"Waiting to Exhale follows the lives of four African-American women in Phoenix, Arizona, as they navigate love, friendship, and their personal struggles."
5,Father of the Bride Part II,Comedy,"Father of the Bride Part II is a comedy that explores the challenges faced by a father when he learns that his daughter and wife are both pregnant."

We have the movieId, title, genres, and overview columns. We will recommend movies by analysing the overview feature of the dataset.

I will use TF-IDF to build a feature vector from each movie’s overview.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the movie data
data = pd.read_csv('movies.csv')  # Replace 'movies.csv' with the path to your movie dataset file

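# Select relevant columns for content-based filtering
# (.copy() avoids pandas' SettingWithCopyWarning when the columns are modified below)
selected_columns = ['movieId', 'title', 'genres', 'overview']
data = data[selected_columns].copy()

# Clean the data (optional)
# regex=False treats '|' as a literal character rather than a regex "or"
data['genres'] = data['genres'].str.replace('|', ' ', regex=False)
data['overview'] = data['overview'].fillna('')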

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Apply TF-IDF vectorization on movie overviews
tfidf_matrix = tfidf_vectorizer.fit_transform(data['overview'])

# Calculate the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Function to get top N similar movies based on cosine similarity
def get_similar_movies(movie_title, N=5):
    # Map each title to its row index, keeping the first row if a title appears twice
    indices = pd.Series(data.index, index=data['title'])
    indices = indices[~indices.index.duplicated(keep='first')]
    idx = indices[movie_title]

    # Calculate the similarity scores for all movies
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top N similar movies
    top_movies_indices = [i[0] for i in sim_scores[1:N+1]]
    top_movies = data['title'].iloc[top_movies_indices]

    return top_movies

# Example usage: Get top 5 similar movies to "Toy Story"
similar_movies = get_similar_movies("Toy Story", N=5)
print(similar_movies)

Here, I assume that a person likes the movie ‘Toy Story’ and recommend the top 5 movies that are most similar to it. With only the five sample rows shown above, the function returns the other four movies.
This is how a content-based recommendation engine works.
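
The example above uses only the overview text, even though the genres column is cleaned as well. One possible variation, sketched below, is to concatenate genres and overview into a single text field before vectorizing. The inline DataFrame (with shortened overviews) and the `combined` column name are assumptions made so the snippet runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small inline sample mirroring the columns of the dataset (overviews shortened)
movies = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
    "overview": ["Rival toys come to life when people are not around.",
                 "A magical board game unleashes jungle hazards on its players.",
                 "Two aging neighbors reunite in a romantic comedy."],
})

# Turn "A|B|C" into "A B C" so that each genre becomes its own token
genres_text = movies["genres"].str.replace("|", " ", regex=False)

# Concatenate genres and overview into one text field per movie
movies["combined"] = genres_text + " " + movies["overview"].fillna("")

tfidf = TfidfVectorizer(stop_words="english")
combined_sim = cosine_similarity(tfidf.fit_transform(movies["combined"]))

# Similarity of every other movie to "Toy Story", highest first
toy_story = movies.index[movies["title"] == "Toy Story"][0]
ranking = (pd.Series(combined_sim[toy_story], index=movies["title"])
           .drop("Toy Story")
           .sort_values(ascending=False))
print(ranking)
```

In this tiny sample, the shared genres (Adventure, Children, Fantasy) pull "Jumanji" ahead of "Grumpier Old Men", which shares only Comedy with "Toy Story".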

Collaborative Filtering:

Collaborative filtering is a recommendation approach that relies on user-item interactions and similarities between users or items. It assumes that users with similar preferences in the past will have similar preferences in the future. Let’s explore how collaborative filtering works.

2.1 How Collaborative Filtering Works: Collaborative filtering follows these steps to generate recommendations:

Step 1: User-Item Matrix: The system creates a user-item matrix that captures user interactions with items. Each entry in the matrix represents a user’s rating, purchase history, or any other relevant interaction with an item.

Step 2: Similarity Computation: The system calculates the similarity between users or items based on their interactions. Similarity measures such as cosine similarity or Pearson correlation are commonly used.

Step 3: Nearest Neighbors: For a target user or item, the system identifies the nearest neighbors based on similarity scores. These neighbors are users or items with similar preferences or characteristics.

Step 4: Recommendation Generation: The system recommends items that the target user has not interacted with but have been interacted with by the nearest neighbors. It assumes that if the neighbors liked those items, the target user might also like them.
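
As a rough illustration of these steps, the sketch below predicts a missing rating with user-based collaborative filtering on a tiny hand-made matrix. The user names, item names, and ratings are hypothetical, and the prediction is a plain similarity-weighted average of neighbour ratings (no mean-centering), which is only one of several common choices.

```python
import numpy as np
import pandas as pd

# Step 1: hypothetical user-item matrix (ratings 1-5); NaN means "not rated"
ratings = pd.DataFrame(
    {"item_a": [5, 4, 1, np.nan],
     "item_b": [4, 5, 2, 5],
     "item_c": [1, 2, 5, 1],
     "item_d": [5, np.nan, 1, np.nan]},
    index=["alice", "bob", "carol", "dave"],
)

# Step 2: Pearson correlation between users, computed over co-rated items
user_sim = ratings.T.corr(method="pearson", min_periods=2)

def predict(user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated `item`."""
    raters = ratings[item].dropna().index.drop(user, errors="ignore")
    # Step 3: nearest neighbours = positively correlated raters, best first
    neighbours = user_sim.loc[user, raters].dropna()
    neighbours = neighbours[neighbours > 0].nlargest(k)
    if neighbours.empty:
        return np.nan
    # Step 4: weight each neighbour's rating by its similarity to the user
    weighted = (neighbours * ratings.loc[neighbours.index, item]).sum()
    return float(weighted / neighbours.sum())

print(predict("bob", "item_d"))   # only alice is a positively correlated rater
print(predict("dave", "item_a"))  # alice and bob both agree with dave's tastes
```

Here carol's tastes are negatively correlated with bob's, so her low rating of item_d is ignored and the prediction follows alice.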

2.2 Advantages of Collaborative Filtering:

Collaborative filtering can capture complex user preferences and generate serendipitous recommendations by leveraging the collective wisdom of users.
It does not rely on explicit item features or domain knowledge, making it applicable to a wide range of domains.
Collaborative filtering can provide diverse recommendations by considering the preferences of different users.

2.3 Limitations of Collaborative Filtering:

Collaborative filtering suffers from the cold start problem, where new users or items have limited or no interaction data, making it challenging to generate accurate recommendations.
It requires a substantial amount of user-item interaction data to identify meaningful patterns and similarities, which can be a challenge for new platforms or sparse datasets.
Collaborative filtering may result in the “popular item” problem, where highly popular items tend to dominate the recommendations, neglecting niche or personalized preferences.

Problem of Sparsity in User-Item Matrix:

In collaborative filtering, one common approach to dealing with the sparse user-item matrix is to apply matrix factorization techniques. Matrix factorization is a machine learning method that aims to decompose the original matrix into two lower-dimensional matrices, typically referred to as the user matrix and the item matrix.

Here’s an example of how matrix factorization can be used to address the sparsity problem in collaborative filtering:

Data Preparation: Start with a user-item matrix where each entry represents the interaction or rating of a user for an item. This matrix is often sparse, meaning most of the entries are missing because users have not interacted with or rated all items.
Matrix Factorization: Apply matrix factorization techniques, such as Singular Value Decomposition (SVD) or Alternating Least Squares (ALS), to decompose the user-item matrix into two matrices: a user matrix and an item matrix. The dimensions of these matrices are typically chosen to be lower than the original matrix, capturing the latent factors or features that describe users and items.
Latent Factor Calculation: The latent factors represent the underlying characteristics or preferences of users and items. These factors are learned during the matrix factorization process and aim to capture the relationships between users and items based on their interactions.
Missing Value Estimation: Since the original user-item matrix is sparse, some entries are missing. To fill in these missing values, the learned user and item matrices can be used to estimate the ratings or interactions between users and items. This is done by taking the dot product of the corresponding user and item vectors from the decomposed matrices.
Recommendation Generation: Once the missing values are estimated, recommendations can be generated by identifying the items with the highest predicted ratings for a particular user. These recommended items are those that the user is most likely to be interested in based on their past interactions and the preferences learned from other users.
Model Evaluation: The performance of the collaborative filtering model can be evaluated using metrics such as precision, recall, or Mean Average Precision (MAP). These metrics assess how well the model predicts user-item interactions compared to the actual interactions in the test or validation set.
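
The factorization and missing-value estimation steps above can be sketched with plain NumPy. Note that this sketch trains the two matrices with stochastic gradient descent on the observed ratings only, rather than with SVD or ALS as named in the list; the rating matrix, the number of latent factors, and the other hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix; 0 marks a missing rating
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
observed = np.argwhere(R > 0)  # (user, item) pairs with a known rating

# Illustrative hyperparameters: latent factors, learning rate, L2 penalty, epochs
n_factors, lr, reg, n_epochs = 2, 0.01, 0.02, 2000

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))  # user matrix
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))  # item matrix

def rmse():
    """Root mean squared error on the observed ratings only."""
    pred = P @ Q.T
    return np.sqrt(np.mean([(R[u, i] - pred[u, i]) ** 2 for u, i in observed]))

initial_rmse = rmse()
for _ in range(n_epochs):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]      # error of the dot-product prediction
        p_u = P[u].copy()                # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
final_rmse = rmse()

# Missing value estimation: the dot products fill in every cell of the matrix
R_hat = P @ Q.T

# Recommendation: the unrated item with the highest predicted rating for user 0
user = 0
unseen = np.where(R[user] == 0)[0]
best_item = unseen[np.argmax(R_hat[user, unseen])]
print(f"RMSE on observed ratings: {initial_rmse:.3f} -> {final_rmse:.3f}")
print(f"Recommended item index for user {user}: {best_item}")
```

Training only on the observed entries is what lets the method cope with sparsity: the zeros are never treated as real ratings.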

Matrix factorization techniques in collaborative filtering offer a powerful way to handle sparsity in the user-item matrix. By learning latent factors that capture user preferences and item characteristics, these methods can make accurate predictions for missing values and provide personalized recommendations to users.
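
To make the evaluation step more concrete, here is a minimal sketch of precision@k, recall@k, and average precision for a single user. The function names and the example item IDs are hypothetical, and average precision is normalized here by the number of relevant items, which is one of several conventions in use. Averaging the last metric over all users gives MAP.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(recommended, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

# Hypothetical ranked recommendations and held-out items the user actually liked
recommended = [10, 42, 7, 3, 99]
relevant = {42, 3, 8}

precision, recall = precision_recall_at_k(recommended, relevant, k=3)
ap = average_precision(recommended, relevant)
print(precision, recall, ap)
```

In this example one of the top three recommendations is relevant, so precision@3 and recall@3 are both 1/3.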

It’s worth noting that there are variations and extensions of matrix factorization in collaborative filtering, such as weighted matrix factorization, probabilistic matrix factorization, or deep learning-based approaches like neural networks. These techniques aim to further enhance the performance and accuracy of collaborative filtering models in dealing with sparse user-item matrices.

Let’s implement collaborative filtering in Python:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load the MovieLens dataset
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Merge movies and ratings data
movie_ratings = pd.merge(movies, ratings, on='movieId')

# Create a user-item matrix
user_item_matrix = movie_ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Calculate the similarity between users using cosine similarity
# (unrated movies are filled with 0, a simplification that treats "not rated" as a rating of 0)
user_similarity = cosine_similarity(user_item_matrix.fillna(0))

# Function to generate recommendations for a user based on collaborative filtering
# num_neighbors controls how many similar users are consulted (default chosen for illustration)
def generate_recommendations(user_id, num_recommendations, num_neighbors=10):
    # Map the userId label to its row position in the similarity matrix
    user_pos = user_item_matrix.index.get_loc(user_id)

    # Rank all other users by similarity and keep the closest neighbors
    ranked_users = np.argsort(user_similarity[user_pos])[::-1]
    similar_users = ranked_users[ranked_users != user_pos][:num_neighbors]

    # Get the movies rated by the similar users
    similar_users_ratings = user_item_matrix.iloc[similar_users].dropna(axis=1, how='all')

    # Calculate the average rating of each movie among the neighbors
    movie_avg_ratings = similar_users_ratings.mean()

    # Filter out the movies already rated by the user
    watched_movies = user_item_matrix.loc[user_id].dropna().index
    movie_avg_ratings = movie_avg_ratings[~movie_avg_ratings.index.isin(watched_movies)]

    # Sort the movies based on average rating in descending order
    recommendations = movie_avg_ratings.sort_values(ascending=False)[:num_recommendations]

    # Look up the movie details, preserving the ranking order
    recommended_movies = movies.set_index('movieId').loc[recommendations.index].reset_index()

    return recommended_movies[['movieId', 'title', 'genres']]

# Generate recommendations for a specific user
user_id = 1
num_recommendations = 5
recommendations = generate_recommendations(user_id, num_recommendations)

print(f"Top {num_recommendations} recommendations for User {user_id}:")
print(recommendations)

Reference:

https://chaitanyabelhekar.medium.com/recommendation-systems-a-walk-trough-33587fecc195