Getting Started with Recommender Systems: Content-Based Filtering

7 min read · Feb 15, 2023

Introduction

You’re sitting on your couch, scrolling through Netflix’s endless list of movies and TV shows, trying to decide what to watch. Suddenly, a thought hits you: “I wish this site could just know what I like and recommend something I’d enjoy!” Well, that’s exactly what Recommender Systems do!

Recommender Systems are a type of artificial intelligence that predicts how likely a user is to be interested in a particular item. They analyze data about users’ preferences, behaviors, and interactions to make recommendations tailored to each individual. They have become increasingly popular in recent years and can be found in many different applications, from online shopping to music and video streaming.

Recommender Systems are essential in today’s world of information overload. With so much content available, it’s nearly impossible for people to sift through it all and find what they’re looking for. Recommender Systems help users find what they’re interested in, and also help companies recommend products and services that are likely to be successful.

How Do Recommender Systems Work?

A Recommender System is a software system designed to predict user preferences or interests based on past behavior. It uses Machine Learning algorithms to capture user habits, preferences, or trends in order to recommend items or services of interest, such as movies, music, restaurants, or products to buy. The system typically uses data mining methods together with techniques such as collaborative or content-based filtering to generate these suggestions. Its main purpose is to predict what each user will like and to make recommendations based on those predictions, which provides an efficient and personalized user experience by maximizing the value of the items recommended to each user.

In this blog post, I’ll provide an intuitive overview of the recommendation system architecture. Typically, a recommendation engine processes data through the following four phases.

1. Data Collection: The first step in the recommendation engine process is to collect the data. This data is typically collected from interactions and user behavior, such as product ratings, past purchases, and browsing history.

2. Data Preparation: Once the data has been collected, it then needs to be prepared for analysis. This includes cleaning and formatting the data, removing any outliers or invalid points, categorizing data, and so on.

3. Building Recommender: Once the data has been prepared for analysis, a model is built or selected. Common models include collaborative filtering, content-based filtering, and hybrid models.

4. Evaluation and Deployment: The model is then evaluated using various metrics and techniques, such as accuracy, precision, recall, or F1 score. If the results are satisfactory, the model is put into production and deployed for use.
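To make the evaluation phase a little more concrete, here is a minimal sketch of how precision, recall, and F1 could be computed for a single user's recommendation list. The function name `precision_recall_f1` and the item ids are hypothetical and only illustrate the metrics; this is not the evaluation code used in the project.

```python
def precision_recall_f1(recommended, relevant):
    """Score recommended item ids against the items the user actually liked."""
    recommended = list(dict.fromkeys(recommended))  # drop duplicates, keep order
    relevant = set(relevant)

    # A "hit" is a recommended item that the user actually liked
    hits = sum(1 for item in recommended if item in relevant)

    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 5 recommendations, 4 held-out liked movies, 2 in common
precision, recall, f1 = precision_recall_f1(
    recommended=[10, 20, 30, 40, 50],
    relevant=[20, 40, 60, 80],
)
print(precision, recall, round(f1, 3))
```

Here precision is the share of recommended items that were relevant (2 of 5), and recall is the share of relevant items that were recommended (2 of 4).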

In this article, I will explain how I worked through the entire life cycle of this project. The full code is available in the GitHub repository.

1. Data Collection:

Overview of the MovieLens dataset

The MovieLens datasets are among the most widely used datasets for evaluating recommendation algorithms. They are collected by the GroupLens research group at the University of Minnesota. The version used here was generated in 2018 and contains 100836 ratings (on a 0.5–5 star scale) of 9742 movies by 610 MovieLens users. The dataset is available for free online and has been used in many research papers and projects.

I will use two tables for the project: movies.csv and ratings.csv.

2. Data Preparation:

Since the data was already clean, I didn’t spend much time on data preparation. I simply dropped the unnecessary columns and combined the two tables. And of course I did some exploratory data analysis (univariate analysis) to answer questions like: Which movies have the most reviews? Which users provide the most reviews? What does the distribution of ratings look like?
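As a rough illustration of this step, the sketch below joins two tiny stand-in tables and answers the three questions with `value_counts`. The toy rows are made up for the example, and joining on `movieId` with a left merge is an assumption about how the tables are combined; the column names follow the MovieLens files.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (rows are illustrative only)
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 6],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Action|Crime|Thriller"],
})
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 6, 1, 2, 1],
    "rating": [4.0, 4.0, 5.0, 3.5, 4.5],
    "timestamp": [0, 0, 0, 0, 0],
})

# Drop a column that is not needed here, then attach movie metadata to each rating
df = ratings_toy.drop(columns=["timestamp"]).merge(movies_toy, on="movieId", how="left")

# Which movies have the most reviews?
most_reviewed = df["title"].value_counts()

# Which users provide the most reviews?
most_active = df["userId"].value_counts()

# What does the distribution of ratings look like?
rating_dist = df["rating"].value_counts().sort_index()

print(most_reviewed.head(), most_active.head(), rating_dist, sep="\n\n")
```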

3. Building Recommender:

Types of Recommender Systems

There are several types of recommender systems, including:

Content-Based Filtering: These systems recommend items based on the attributes of the items the user has liked in the past. For example, if you’ve rated several action movies highly, a content-based recommender system might recommend another action movie to you.
Collaborative Filtering: These systems recommend items based on the behavior of other users with similar preferences. For example, if several users who enjoy romantic comedies also like a particular movie, a collaborative filtering recommender system might recommend that movie to you.
Hybrid Recommendations: As the name suggests, these systems combine aspects of both content-based and collaborative filtering recommendations.

1. Content-Based Filtering

A content-based recommendation system for movies works like this:

The system collects information about the movies a user has liked, such as their genre, director, and actors.
The system compares these movies to other movies in its database and identifies similar movies based on their attributes.
The system then generates recommendations for the user by presenting them with similar movies that they may enjoy.
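The three steps above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the catalogue, the movie names, and the function `recommend_for_user` are hypothetical, each movie is described by a simple multi-hot genre vector, and the user's taste is summarized by averaging the vectors of the movies they liked.

```python
import numpy as np

# Hypothetical catalogue: multi-hot vectors over [Action, Comedy, Romance]
catalogue = {
    "Movie A": np.array([1.0, 0.0, 0.0]),
    "Movie B": np.array([1.0, 1.0, 0.0]),
    "Movie C": np.array([0.0, 1.0, 1.0]),
    "Movie D": np.array([0.0, 0.0, 1.0]),
    "Movie E": np.array([1.0, 0.0, 0.0]),
}


def recommend_for_user(liked_titles, catalogue, top_n=2):
    # Step 1: collect the attributes of liked movies into one profile vector
    profile = np.mean([catalogue[t] for t in liked_titles], axis=0)

    # Step 2: compare the profile with every movie the user has not liked yet
    scores = {}
    for title, vec in catalogue.items():
        if title in liked_titles:
            continue
        norm = np.linalg.norm(profile) * np.linalg.norm(vec)
        scores[title] = float(profile @ vec / norm)

    # Step 3: present the most similar movies first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recs = recommend_for_user(["Movie A", "Movie B"], catalogue)
print(recs)
```

A user who liked two action-heavy movies gets the remaining action movie first, followed by the movie that shares the comedy genre.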

You can check out my blog post where I explain TF-IDF.

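import pandas as pd
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# `movies` is the DataFrame loaded from movies.csv
# Define a TF-IDF Vectorizer Object.
# The custom analyzer turns every single genre, pair and triple of genres into a token
tfidf_genres = TfidfVectorizer(analyzer=lambda s: (c for i in range(1,4)
                                             for c in combinations(s.split('|'), r=i)))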

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf_genres.fit_transform(movies['genres'])

# I've encoded each movie's genres into their TF-IDF representation (only a subset of the columns and rows is sampled)
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_genres.get_feature_names_out(), index=movies.title).sample(5, axis=1).sample(10, axis=0)

# Compute the cosine similarity matrix
cosine_sim_movies = cosine_similarity(tfidf_matrix)

cosine_sim_df = pd.DataFrame(cosine_sim_movies, index=movies['title'], columns=movies['title'])
cosine_sim_df.sample(5, axis=1).round(2)

The cosine similarity matrix in a content-based movie recommender system measures how similar each pair of movies is based on their attributes. In general these can include genre, director, and actors; in this project the feature vectors are built from genres only. The matrix compares the feature vectors of the movies: a score of 1 means two movies have the same feature profile (here, the same genres), and a score of 0 means they share no features at all.
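For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The small sketch below computes it by hand with NumPy on made-up genre vectors; the vectors and the helper `cosine` are illustrative assumptions, not the TF-IDF vectors from the project.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical vectors over [Action, Comedy, Crime, Romance]
a = np.array([1.0, 0.0, 1.0, 0.0])  # Action + Crime
b = np.array([2.0, 0.0, 2.0, 0.0])  # same genres as a, larger weights
c = np.array([1.0, 1.0, 0.0, 0.0])  # Action + Comedy
d = np.array([0.0, 1.0, 0.0, 1.0])  # Comedy + Romance

sim_ab = cosine(a, b)  # same direction, so close to 1
sim_ac = cosine(a, c)  # one shared genre out of two, so about 0.5
sim_ad = cosine(a, d)  # no shared genres, so 0

print(round(sim_ab, 3), round(sim_ac, 3), round(sim_ad, 3))
```

Note that `a` and `b` score as fully similar even though their weights differ: cosine similarity depends on the direction of the vectors, not on their magnitude.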

When a user has previously liked a movie with a certain genre, the system can use the cosine similarity matrix to recommend other movies with the same genre. By using this matrix, the system can make more accurate and personalized recommendations based on the user’s past preferences.

def get_recommendations_based_on_genres(movie_title, cosine_sim_movies=cosine_sim_movies):
    """
    Returns the 10 movies whose genres are most similar to those of the given movie.
    :param movie_title: title of the movie to use as the base of the recommendation
    :param cosine_sim_movies: cosine similarity matrix between movies
    :return: DataFrame with titles and genres of the recommended movies
    """
    # Get the index of the movie that matches the title
    # (assumes `movies` has a default RangeIndex aligned with the matrix rows)
    matches = movies.index[movies['title'] == movie_title]
    if len(matches) == 0:
        raise ValueError(f"Movie not found: {movie_title}")
    idx_movie = matches[0]

    # Get the cosine similarity scores of all movies with that movie
    sim_scores_movies = list(enumerate(cosine_sim_movies[idx_movie]))

    # Drop the movie itself: with ties at 1.0 it is not guaranteed to be sorted first
    sim_scores_movies = [s for s in sim_scores_movies if s[0] != idx_movie]

    # Sort the movies based on the similarity scores
    sim_scores_movies = sorted(sim_scores_movies, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores_movies = sim_scores_movies[:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores_movies]

    # Return the top 10 most similar movies
    return movies[['title','genres']].iloc[movie_indices]

movies[movies.title.eq('Pulp Fiction (1994)')]

get_recommendations_based_on_genres('Pulp Fiction (1994)')

Why use content-based filtering?

Content-based filtering does not require data from other users to generate recommendations, which makes it different from collaborative filtering. This means that once a user has interacted with a few items by searching and browsing or making purchases, a content-based filtering system can start recommending relevant items without needing data from other users. This makes content-based filtering a good option for businesses with a small pool of users or for sellers with few user interactions in specific categories or niches. Essentially, content-based filtering is more self-sufficient than other filtering methods, allowing it to provide recommendations with a smaller amount of data.
One advantage of content-based filtering over collaborative filtering is that it is less affected by the “cold start” problem that arises when there are few users. When a new website or community has few users, collaborative filtering does not have enough data points to make accurate recommendations. Content-based filtering still requires some initial input from each user, but its early recommendations are typically better than those of a collaborative system that has not yet gathered enough interaction data. This means that content-based filtering can start providing useful recommendations to users more quickly, without the need for a large amount of data.

Limitations of content-based filtering

Limited Scope: Content-based filtering only recommends items based on the attributes of the items a user has liked, which can limit the range of recommendations and miss items that the user may like but that do not resemble anything they have interacted with.
Cold Start: For new users with no interaction history, or for new items whose attributes are missing, the system cannot make recommendations until enough information is available.
Over-specialization: Content-based filtering can create a “filter bubble” where users only receive recommendations for similar items, which can limit exposure to new and diverse content.