Enter the world of personalized experiences with Recommender Systems
In this article, we are going to learn how to build a hybrid recommendation system
By: Daksh Bhatnagar
INTRODUCTION
A recommender system, or a recommendation system (sometimes replacing ‘system’ with a synonym such as a platform or an engine), is a subclass of information filtering systems that provide suggestions for items that are most relevant to a particular user. Typically, the suggestions refer to various decision-making processes, such as what product to purchase, what music to listen to, or what online news to read.
Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.
Recommender systems are used in a variety of areas, with commonly recognized examples taking the form of playlist generators for video and music services, product recommenders for online stores, content recommenders for social media platforms, and open web content recommenders.
E-commerce and delivery companies such as Amazon, Flipkart, Swiggy, and HealthKart most likely use recommender systems, since this is exactly how a user ends up discovering a product or a dish they might like but have never heard of among so many options.
TYPES OF RECOMMENDER SYSTEMS
1. Collaborative Filtering Recommender Systems
Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items.
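As a rough illustration of using only rating profiles, the sketch below ranks items by how similarly users rated them. The users, movies, ratings, and function names are all made up for the example, and cosine similarity is just one common choice of similarity measure.

```python
from math import sqrt

# Toy rating profiles: user -> {item: rating}. All names and values are made up.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":   {"Movie A": 4, "Movie B": 5, "Movie C": 2},
    "carol": {"Movie A": 1, "Movie B": 2, "Movie C": 5},
}

def item_vector(item):
    """Ratings for one item across all users (0 when a user has not rated it)."""
    return [ratings[user].get(item, 0) for user in sorted(ratings)]

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_items(item, top_n=2):
    """Rank the other items by how similarly users rated them."""
    items = {i for profile in ratings.values() for i in profile}
    scores = {other: cosine(item_vector(item), item_vector(other))
              for other in items if other != item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(similar_items("Movie A"))  # ['Movie B', 'Movie C']
```

Nothing about the movies themselves (genre, cast, description) is used here; the ranking comes purely from who rated what.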
2. Popularity-Based Recommender Systems
A popularity-based recommender system works on the principle of popularity, recommending whatever is currently trending. It checks which products or movies are trending or most popular among users and recommends those directly.
For example, if most people purchase a particular product, the system learns that it is the most popular one and recommends it to every new user who has just signed up, and the chances are high that the new user will purchase it too.
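A popularity-based recommender can be as simple as counting. The sketch below assumes a hypothetical purchase log of (user, product) pairs and returns the most purchased products; the data and the function name are illustrative only.

```python
from collections import Counter

# Toy purchase log of (user, product) pairs; the data is illustrative only.
purchases = [
    ("u1", "phone"), ("u2", "phone"), ("u3", "phone"),
    ("u1", "laptop"), ("u2", "laptop"),
    ("u3", "headphones"),
]

def most_popular(purchase_log, top_n=2):
    """Return the top_n most purchased products, most popular first."""
    counts = Counter(product for _, product in purchase_log)
    return [product for product, _ in counts.most_common(top_n)]

# A brand-new user has no history, so every new user gets the same list.
print(most_popular(purchases))  # ['phone', 'laptop']
```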
3. Content-Based Recommender Systems
Another common approach when designing recommender systems is content-based filtering. Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user.
Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on an item’s features.
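The sketch below shows the idea in its simplest form. Instead of training a classifier, it uses a simpler stand-in: a user profile built by averaging the features of the items the user liked, with unseen items ranked by cosine similarity to that profile. The item names, the three genre flags, and the function names are assumptions made for the example.

```python
from math import sqrt

# Item features as binary genre flags: [Action, Comedy, Romance]. Toy data.
items = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 1, 0],
    "Movie C": [0, 0, 1],
    "Movie D": [1, 0, 1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(liked, top_n=2):
    """Rank unseen items by similarity to the average of the liked items."""
    profile = [sum(col) / len(liked) for col in zip(*(items[t] for t in liked))]
    candidates = {t: cosine(profile, f) for t, f in items.items() if t not in liked}
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

print(recommend(["Movie A", "Movie B"]))  # ['Movie D', 'Movie C']
```

Only the item features and this one user's likes are needed, which is why the approach works even when little is known about other users.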
4. Hybrid recommendations
Most recommender systems now use a hybrid approach, combining collaborative filtering, content-based filtering, and other approaches. Hybrid approaches can be implemented in several ways:
a. by making content-based and collaborative-based predictions separately and then combining them;
b. by adding content-based capabilities to a collaborative-based approach (and vice versa); or
c. by unifying the approaches into one model.
Several studies that compare the performance of hybrid methods with pure collaborative and content-based methods have demonstrated that hybrid methods can provide more accurate recommendations than the pure approaches.
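Option (a), making the two predictions separately and then combining them, could look like the weighted blend below. This is only a sketch: the score dictionaries, the `weight` parameter, and the function name are hypothetical, and real systems often tune or learn the weight.

```python
def combine_scores(content_scores, collab_scores, weight=0.5):
    """Blend two {item: score} dicts; an item missing from one side scores 0 there."""
    items = set(content_scores) | set(collab_scores)
    blended = {
        item: weight * content_scores.get(item, 0.0)
              + (1 - weight) * collab_scores.get(item, 0.0)
        for item in items
    }
    # Highest blended score first; ties are broken alphabetically.
    return sorted(blended, key=lambda item: (-blended[item], item))

content = {"Movie A": 0.9, "Movie B": 0.4}
collab = {"Movie B": 0.8, "Movie C": 0.6}
print(combine_scores(content, collab))  # ['Movie B', 'Movie A', 'Movie C']
```

With `weight=1.0` the result falls back to the content-based ranking, and with `weight=0.0` to the collaborative one.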
NEED FOR A HYBRID RECOMMENDER SYSTEM
Cons of Content-Based Recommender Systems
- Difficulty in understanding user preferences: Content-based recommendation systems require a lot of input from the user in order to make accurate recommendations. This can be difficult for users who are unfamiliar with the product or service, as they may not know how to properly express their preferences.
- Difficulty in capturing user context: Content-based systems can be limited in their ability to capture user context. They are unable to take into account user behavior or mood, which can be important factors in predicting user preferences.
- Over-personalization: Content-based recommendation systems can result in a “filter bubble” effect, where users keep receiving recommendations that closely resemble what they have already consumed. This can lead to a lack of variety and limit users’ exposure to new content.
Cons of Collaborative Recommender Systems
- Dependency on User Input: Collaborative recommendation systems rely heavily on user input, which can be unreliable and inaccurate. If a user provides incorrect information, it can lead to inaccurate recommendations.
- Privacy Concerns: Collaborative recommendation systems can also raise privacy concerns. Depending on what type of information is being collected, users may be concerned about how their data is being used.
- Difficulty of Implementation: Implementing a collaborative recommender system can be complex and time-consuming. It requires a lot of data processing and analysis to create accurate recommendations.
FILTERING TO BUILD A ROBUST RECOMMENDER SYSTEM
Filtering by AvgRating
- Filtering data based on average ratings helps to identify which items are most popular and/or highly rated. This allows a recommender system to better personalize its recommendations to a user by suggesting items that are likely to be of higher quality or more likely to be enjoyed by the user.
- By filtering data based on the average rating, a recommender system can ensure that it is only suggesting items to a user that have some level of popularity or quality.
Filtering by Number of Ratings
- Filtering data based on the number of ratings is important in recommender systems because it helps to ensure that the recommendations given are of high quality.
- Having a minimum number of ratings for an item helps to ensure that the item is being recommended based on real user feedback, and not just on the basis of some random factors.
- This also helps to reduce the potential for bias in the system, as items with fewer ratings may be more likely to be recommended on the basis of their own attributes, rather than on the basis of user feedback.
BUILDING A HYBRID RECOMMENDER SYSTEM
We will be working with the MovieLens 100K movie data, using the urlretrieve function (from urllib.request) and the zipfile module to download and extract it. Once merged, our data contains 100,000 rows and 34 columns.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from urllib.request import urlretrieve
import zipfile
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
zip_ref = zipfile.ZipFile('movielens.zip', "r")
zip_ref.extractall()
# Load each data set (users, movies, and ratings).
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=users_cols, encoding='latin-1')
ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t',
                      names=ratings_cols, encoding='latin-1')
# The movies file contains a binary feature for each genre.
genre_cols = ["genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
              "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
              "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]
movies_cols = ['movie_id', 'title', 'release_date', "video_release_date", "imdb_url"] + genre_cols
movies = pd.read_csv('ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')
# Since the ids start at 1, we shift them to start at 0.
users["user_id"] = users["user_id"].apply(lambda x: str(x-1))
movies["movie_id"] = movies["movie_id"].apply(lambda x: str(x-1))
movies["year"] = movies['release_date'].apply(lambda x: str(x).split('-')[-1])
ratings["movie_id"] = ratings["movie_id"].apply(lambda x: str(x-1))
ratings["user_id"] = ratings["user_id"].apply(lambda x: str(x-1))
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))
# Compute the number of movies to which a genre is assigned.
genre_occurences = movies[genre_cols].sum().to_dict()
# Since some movies can belong to more than one genre, we create different
# 'genre' columns as follows:
# - all_genres: all the active genres of the movie.
# - genre: randomly sampled from the active genres.
def mark_genres(movies, genres):
    def get_random_genre(gs):
        active = [genre for genre, g in zip(genres, gs) if g==1]
        if len(active) == 0:
            return 'Other'
        return np.random.choice(active)
    def get_all_genres(gs):
        active = [genre for genre, g in zip(genres, gs) if g==1]
        if len(active) == 0:
            return 'Other'
        return '-'.join(active)
    movies['genre'] = [
        get_random_genre(gs) for gs in zip(*[movies[genre] for genre in genres])]
    movies['all_genres'] = [
        get_all_genres(gs) for gs in zip(*[movies[genre] for genre in genres])]
mark_genres(movies, genre_cols)
# Create one merged DataFrame containing all the movielens data.
movielens = ratings.merge(movies, on='movie_id').merge(users, on='user_id')
Before proceeding, it is also a good idea to get rid of any movies that have an average rating below 2 and have received fewer than 100 ratings. Doing this helps ensure that only quality recommendations are made to users.
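The filtering step itself is not shown in the article. A minimal sketch follows, assuming the merged data has `title` and `rating` columns and reading the sentence above as two separate thresholds, so a movie is kept only if it clears both. The helper name `filter_movies` and its parameters are hypothetical, and the toy frame uses a lower count threshold so the effect is visible.

```python
import pandas as pd

def filter_movies(data, min_avg_rating=2.0, min_num_ratings=100):
    """Keep only the ratings of movies that clear both thresholds."""
    stats = data.groupby("title")["rating"].agg(["mean", "count"])
    keep = stats[(stats["mean"] >= min_avg_rating)
                 & (stats["count"] >= min_num_ratings)].index
    return data[data["title"].isin(keep)].reset_index(drop=True)

# Toy stand-in for the merged `movielens` frame.
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 1.0, 2.0, 1.0, 5.0],
})

# "B" is dropped for its low average, "C" for having too few ratings.
df = filter_movies(toy, min_avg_rating=2.0, min_num_ratings=3)
print(df["title"].unique())  # ['A']
```

On the real data the call would presumably be `df = filter_movies(movielens)` with the default thresholds of 2 and 100.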
We will now build the functions that create the hybrid recommendations (a mixture of content-based and collaborative filtering recommendations).
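The functions that follow read three objects that this article never defines: `df`, `pivot`, and `similarity_scores`. One plausible way to build the last two is sketched here, assuming `df` is the filtered ratings frame with `user_id`, `title`, and `rating` columns, that `pivot` has users as rows and movie titles as columns, and that `similarity_scores` is an item-to-item cosine similarity matrix. The toy data is made up, and the similarity is computed with NumPy to keep the sketch self-contained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filtered ratings frame `df` used by the functions below.
df = pd.DataFrame({
    "user_id": ["0", "0", "0", "1", "1", "1", "2", "2", "2"],
    "title":   ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 1.0, 2.0, 5.0],
})

# Users as rows, movie titles as columns, unrated cells filled with 0.
pivot = df.pivot_table(index="user_id", columns="title", values="rating").fillna(0)

# Item-item cosine similarity: one row and one column per movie title.
item_matrix = pivot.to_numpy().T
norms = np.linalg.norm(item_matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid division by zero for movies with no ratings
unit = item_matrix / norms
similarity_scores = unit @ unit.T

print(pd.DataFrame(similarity_scores, index=pivot.columns, columns=pivot.columns).round(2))
```

scikit-learn's `cosine_similarity(pivot.T)` should produce the same matrix. Because `pivot.columns` holds the titles, row `i` of `similarity_scores` lines up with `pivot.columns[i]`.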
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_recommendation(title):
    """
    Returns a list of content recommendations based on the provided title.
    The recommendations are determined by calculating the cosine similarity between the genre of the provided title and
    the genres of other content in the dataframe, df. The top 100 most similar rows are selected, and duplicates are
    removed to return a list of at most 10 content recommendations.
    """
    # Look up the genre of the provided title; without a match there is nothing to compare against
    matches = df.loc[df["title"] == title, "genre"]
    if matches.empty:
        return []
    # Initialize TfidfVectorizer and fit it to the genres in the dataframe
    vectorizer = TfidfVectorizer(ngram_range=(1,2))
    tfidf = vectorizer.fit_transform(df["genre"])
    # Transform the genre of the provided title into a vector using the vectorizer
    query_vec = vectorizer.transform([matches.iloc[0]])
    # Calculate the cosine similarity between the query vector and the genre vectors
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # Select the indices of the top 100 most similar rows, in descending order of similarity
    indices = np.argsort(similarity)[-100:][::-1]
    # Select the rows of the dataframe corresponding to the selected indices
    results = df.iloc[indices]
    # Remove duplicates based on the 'title' column
    results = results.drop_duplicates(subset=['title'])
    # Fetch only the title values, keep at most 10, and convert them to a list
    content_reco = results.title.values.tolist()[:10]
    # Return the list of the titles of the content-based recommendations
    return content_reco
def collaborative_recommendation(title):
    """
    This function leverages the collaborative filtering approach for recommending movies based on the ratings given to them
    by the users. The index of the movie in the pivot table is looked up first, that index is used to read its similarity
    scores, and the top 5 results are returned.
    """
    # Fetch the index of the movie; return no recommendations if the title is unknown
    matches = np.where(pivot.columns == title)[0]
    if len(matches) == 0:
        return []
    index = matches[0]
    # Find the similar movies using the similarity score and fetch the top 5 results,
    # skipping the first entry, which is normally the movie itself
    similar_items = sorted(list(enumerate(similarity_scores[index])), key=lambda x: x[1], reverse=True)[1:6]
    # Initialize the list of recommendations
    data = []
    # Loop through the similar items
    for i in similar_items:
        # Create a temporary dataframe holding the rows whose title matches the similar movie
        temp_df = df[df['title'] == pivot.columns[i[0]]]
        # Add the title to the list after dropping duplicates based on the title column in the temporary dataframe
        data.extend(temp_df.drop_duplicates('title')['title'].values.tolist())
    # Return the list of collaborative filtering recommended movies
    return data
def hybrid_recommendation(title):
    """
    This function utilizes the capabilities of the earlier functions to provide a single set of unique recommendations
    utilizing content and collaborative filtering (hybrid recommendations). The output is a list of those recommendations.
    """
    # Get the list of recommended movies from the content-based system
    content_recommended_movies = np.unique(content_recommendation(title)).tolist()
    # Get the list of recommended movies from the collaborative filtering system
    collaborative_recommended_movies = np.unique(collaborative_recommendation(title)).tolist()
    # Combine the two lists, dropping titles that appear in both while keeping their order
    recommended_movies = list(dict.fromkeys(content_recommended_movies + collaborative_recommended_movies))
    # Return the combined list of recommended movies
    return recommended_movies
CONCLUSION
- A recommender system, or a recommendation system (sometimes replacing ‘system’ with a synonym such as a platform or an engine), is a subclass of information filtering systems that provide suggestions for items that are most relevant to a particular user.
- There are various types of recommender systems, such as Popularity-Based Recommender Systems, Content-Based Recommender Systems, Collaborative Filtering Recommender Systems, and Hybrid Recommender Systems.
- There are several challenges with Content-Based and Collaborative Filtering Recommender Systems, which give rise to the need for Hybrid Recommender Systems.
- Filtering by the average rating and the number of ratings a product has received is a good way to ensure that only quality recommendations are made to users.
The link to the full code is here.