Recommendation Engine Explained

Mohamed Fawzy

Published in

tajawal

12 min read · Aug 9, 2018

The source code for the examples can be found on GitHub:
https://github.com/MohamedFawzy/recommendation-engine

Introduction:

Have you ever wondered how Amazon recommends items to you, or how Netflix, Spotify and YouTube recommend content? Here I will summarize things as much as possible.

Recommendation engines, a branch of information retrieval and artificial intelligence, are powerful tools and techniques for analyzing huge volumes of data, especially product information and user information, and then providing relevant suggestions based on data-mining approaches.

In technical terms, the recommendation engine problem is to develop a mathematical model or objective function that can predict how much a user will like an item.

Predicting how much a user will like an item

Recommender systems collect information on the preferences of their users for a set of items (e.g. movies, songs, books, jokes, gadgets). The information can be acquired explicitly (typically by collecting users' ratings) or implicitly (typically by monitoring users' behavior, such as songs heard, websites visited and books read).

Recommender systems may use demographic features of users (age, nationality, gender, etc.). Social information such as followers, tweets and posts is also commonly used.

There is a growing trend toward the use of information from the Internet of Things (e.g. GPS locations, RFID, real-time signals). Recommender systems make use of these different sources of information to provide users with predictions and recommendations of items. They try to balance factors like accuracy, novelty, dispersity and stability.

Collaborative filtering (CF) methods play an important role in recommendation, although they are often used along with other techniques such as content-based, knowledge-based or social ones.

CF is based on the way in which humans have made decisions throughout history. Most research papers focus on movie recommendation; however, a great volume of literature on recommender systems (RS) is centered on other topics such as music, e-commerce, books, web search and others.

Types of recommender systems:

Collaborative filtering
Content-based recommender systems
Hybrid recommender systems
Context-aware recommender systems

Collaborative filtering:

Collaborative filtering recommender systems are the basic form of recommendation engine. In this type of engine, filtering items from a large set of alternatives is done collaboratively, using the preferences of users.

The basic assumption in collaborative filtering is that if two users shared the same interests in the past, they will also have similar tastes in the future. For example, if user A and user B have similar movie preferences, and user A recently watched Titanic, which user B has not yet seen, then the idea is to recommend this unseen movie to user B.

Types of collaborative filtering:

User-based
Item-based

User-based collaborative filtering: in user-based CF, recommendations are generated by considering the preferences in the user's neighborhood. User-based CF is done in two steps:

1- Identify similar users based on similar user preferences.

2- Recommend new items to an active user based on the ratings given by similar users to items not yet rated by the active user.

Item-based collaborative filtering: in item-based CF, recommendations are generated using the neighborhood of items. Unlike user-based CF, we first find similarities between items and then recommend non-rated items that are similar to the items the active user has rated in the past. Item-based recommender systems are constructed in two steps:

1- Calculate the item similarity based on the item preferences.

2- Find the items not rated by the active user that are most similar to the items they have rated, and recommend them.
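The two item-based steps above can be sketched in a few lines of NumPy. This is only a minimal illustration: the toy ratings matrix is made up, the helper names item_similarity and recommend_items are hypothetical, and cosine similarity is an assumed choice of similarity measure.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = items); 0 means "not rated".
# The values are invented purely for illustration.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_similarity(r):
    """Step 1: cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items nobody rated
    return (r.T @ r) / np.outer(norms, norms)

def recommend_items(r, user, top_n=2):
    """Step 2: score unrated items by their similarity to the user's rated items."""
    sim = item_similarity(r)
    rated = r[user] > 0
    # similarity-weighted average of the ratings the user has already given
    scores = sim[:, rated] @ r[user, rated]
    weights = np.abs(sim[:, rated]).sum(axis=1)
    weights[weights == 0] = 1.0
    scores = scores / weights
    order = np.argsort(scores)[::-1]
    # keep only items the user has not rated yet
    return [int(i) for i in order if not rated[i]][:top_n]

print(recommend_items(ratings, user=0))
```

Because item similarities depend only on the item columns, they can be computed once and reused for every user, which is one reason item-based CF is often preferred when there are many more users than items.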

Pros:

Very simple

Cons:

Breaks down when faced with the cold start problem.

Example:

import numpy as np
import pandas as pd
import os
import matplotlib as mpl
if os.environ.get('DISPLAY','') == '':
    print('no display found. Using non-interactive Agg backend')
    mpl.use('Agg')

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import sklearn
# set path for data
current_working_dir = os.getcwd()
print(current_working_dir)
path = current_working_dir + "/ml-100k/u.data"
print("Reading file ================================>>>\n")
print(path)
column_names = ['userId' , 'itemId' , 'ratings' , 'timestamp']
df   = pd.read_csv(path, sep ="\t", header= None, names = column_names)

print(type(df))
print("===================\n")
# look at the first five rows of the data frame to see what the data looks like
print(df.head())
# print columns for dataframe
print(df.columns)
# print shape for dataframe
print(df.shape)
# plot dataframe for rating
plt.hist(df['ratings'])
plt.show()

# counts ratings
count_ratings = df.groupby(['ratings'])['userId'].count()
print(count_ratings)
# distribution of the number of ratings per movie
plt.hist( df.groupby(['itemId'])['itemId'].count() )
plt.show()

# total number of unique users in dataset
n_users = df.userId.unique().shape[0]
# total number of unique movies in dataset
n_movies = df['itemId'].unique().shape[0]

print(str(n_users) + ' users')
print(str(n_movies) + ' movies')

# create an n_users x n_movies matrix of zeros to hold the ratings
ratings = np.zeros((n_users, n_movies))
print(ratings)
# for each row of the dataframe, store the rating in cell (userId - 1, itemId - 1); ids in the file start at 1
for row in df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]

print(type(ratings))
# get the shape of the ratings matrix
print(ratings.shape)
# sample data for how ratings looks like
print(ratings)

# measure how sparse the dataset is
# Hint: the value printed below is the percentage of cells that hold a rating, e.g. 6.3% means only 6.3% of the user-item cells have ratings and the rest are zeros
# Hint: a zero means the rating is missing
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print('Sparsity: {:4.2}%'.format(sparsity))
# split the users (rows) into a training set and a test set: 33% of the users go to the test set, and random_state=42 fixes the random seed
ratings_train, ratings_test = train_test_split(ratings, test_size= 0.33, random_state=42)
# dimensions of the train set, test set
print("Training set shape")
print(ratings_train.shape)

print("Testing set shape")
print(ratings_test.shape)
# the predicted rating of a user for an item is given by the weighted sum of all other users' ratings for that item.
print("""
###########################################################
# User Based CF                                           #
# 1- Creating similarity matrix between n_users using     #
#    cosine similarity.                                   #
#                                                         #
# 2- Predicting unknown rating for item i                 #
#    for an active user u by calculating                  #
#    the weighted sum of all the users for the item       #
#                                                         #
# 3- Recommending the new items to the user               #
###########################################################
""")

# calculate the similarity as 1 - cosine distance
dist_out = 1 - sklearn.metrics.pairwise.cosine_distances(ratings_train)
# the similarity matrix has the same type as the rating matrix
print(type(dist_out))
# the matrix is square, with size equal to the number of users in the training set
print(dist_out.shape)
print(""" \n\n <<<<<<<<< sample dataset of the distance matrix >>>>>>>>>>>>>> """)
print(dist_out)
# Predicting unknown ratings for a user
user_pred = dist_out.dot(ratings_train) / np.array([np.abs(dist_out).sum(axis=1)]).T
print("\n\n user prediction")
print(user_pred)
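# error function for the model
def get_mse(pred, actual):
    # keep only the entries that have an actual (nonzero) rating
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return mean_squared_error(actual, pred)
# MSE on the training users
print("mse for Training")
print(get_mse(user_pred, ratings_train))
# user_pred only has rows for the training users, so predict for the test users
# from their similarity to the training users before scoring them
sim_test = 1 - sklearn.metrics.pairwise.cosine_distances(ratings_test, ratings_train)
test_pred = sim_test.dot(ratings_train) / np.array([np.abs(sim_test).sum(axis=1)]).T
print("mse for Testing")
print(get_mse(test_pred, ratings_test))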

Content-Based:

In this type of recommender system, instead of considering only user-based or item-based preferences (accurate as they are), it makes more sense to also consider user properties and item properties while building the recommendation engine. The content information of the items is used to build the recommender model.

A content-based recommender system typically contains a user-profile-generation step, an item-profile-generation step and a model-building step to generate recommendations for an active user.

The content-based recommender system recommends items to users by using the content or features of items together with user profiles. As an example, if you have searched for videos of Lionel Messi on YouTube, then the content-based recommender system will learn your preference and recommend other videos related to Lionel Messi and other videos related to football.

In simpler terms, the system recommends items similar to those that the user has liked in the past. The similarity of items is calculated based on the features associated with the other compared items and is matched with the user’s historical preferences.
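As a rough sketch of this idea, the snippet below builds item profiles, derives a user profile from the items the user liked, and ranks the remaining items by cosine similarity to that profile. The video titles, the three feature columns and the helper names are all hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical item profiles: one row per video,
# columns = [football, messi, cooking].
item_names = ["Messi top goals", "World Cup highlights", "Pasta recipe", "Messi interview"]
item_features = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

def build_user_profile(features, liked):
    """User profile = mean of the feature vectors of the liked items."""
    return features[liked].mean(axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def recommend(features, liked, top_n=2):
    """Rank the items the user has not seen by similarity to the user profile."""
    profile = build_user_profile(features, liked)
    scores = [(cosine(profile, f), i) for i, f in enumerate(features) if i not in liked]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_n]]

liked = [0]  # assume the user watched "Messi top goals"
for i in recommend(item_features, liked):
    print(item_names[i])
```

No ratings from other users are involved here, which is why this approach can still recommend an item that nobody has rated yet.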

Example of a content-based recommender system for movies

Hybrid system:

Hybrid systems are built by combining various recommender systems. By combining them, we can offset the disadvantages of one system with the advantages of another and thus build a more robust system. For example, by combining collaborative filtering methods, which fail on the cold start problem, with content-based systems, where feature information about the items is available, new items can be recommended more accurately and efficiently.

For example, if you are a frequent reader of news on Google News, the underlying recommendation engine recommends news articles to you by combining popular news articles read by people similar to you and using your personal preferences, calculated using your previous click information.

With this type of recommendation system, collaborative filtering recommendations are combined with content-based recommendations before pushing recommendations.
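One simple way to combine the two is a weighted blend of their scores. The sketch below assumes each recommender already produced a score per item, that a missing CF score is marked as NaN, and that the weight alpha is a tunable parameter; all of these are illustrative assumptions, not a description of how any particular product works.

```python
import numpy as np

def hybrid_scores(cf_scores, content_scores, alpha=0.5):
    """Blend CF and content-based scores; use content only where CF has no score."""
    cf = np.asarray(cf_scores, dtype=float)
    content = np.asarray(content_scores, dtype=float)
    blended = alpha * cf + (1 - alpha) * content
    # a brand-new item has no ratings, so CF cannot score it (NaN)
    return np.where(np.isnan(cf), content, blended)

cf_scores = [0.9, 0.2, np.nan]   # the third item is new and has no ratings
content_scores = [0.4, 0.6, 0.8]
scores = hybrid_scores(cf_scores, content_scores, alpha=0.7)
ranking = np.argsort(-scores)    # item indices, best first
print(scores, ranking)
```

The new item still gets ranked thanks to its content score, which is exactly the cold start gap that the hybrid approach is meant to close.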

Context-aware:

Personalized recommender systems, such as content-based recommender systems, can fall short because they fail to take context into account. For example, assume a lady is very fond of ice cream, and that she travels to a cold place.

There is a high chance that a personalized recommender system will suggest a popular ice-cream brand.

Does it make sense to suggest ice cream to her in a cold place, or should the system suggest coffee? A recommender system that is both personalized and context-aware is called a context-aware recommender system.

User preferences may differ with the context such as:

Time
Day
Season
Mood
Place
Location
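To make the ice-cream example concrete, here is a small sketch in which personal preference scores are rescaled by how well each item fits the current context. The preference values, the context table and the function name recommend_in_context are invented for illustration.

```python
# Hypothetical personal preference scores for one user.
user_preference = {"ice cream": 0.9, "coffee": 0.6, "hot chocolate": 0.5}

# Hypothetical table of how well each item fits a given context.
context_fit = {
    "cold": {"ice cream": 0.1, "coffee": 0.9, "hot chocolate": 1.0},
    "hot": {"ice cream": 1.0, "coffee": 0.4, "hot chocolate": 0.1},
}

def recommend_in_context(preferences, fit, weather, top_n=1):
    """Rescale personal scores by the context fit, then take the best items."""
    adjusted = {
        item: score * fit[weather].get(item, 0.0)
        for item, score in preferences.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)[:top_n]

print(recommend_in_context(user_preference, context_fit, "cold"))
print(recommend_in_context(user_preference, context_fit, "hot"))
```

Without the context table the user's favorite, ice cream, would always win; with it, the same user gets a different suggestion in a cold place.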

Example of a map with users in a recommender system

Foundation:

This section presents the most relevant concepts on which traditional recommender systems are based. It provides general descriptions of the classical taxonomies, algorithms, methods, filtering approaches, databases, etc.

After that, it describes the cold start problem, which illustrates the difficulty of making collaborative filtering recommendations when the recommender system has only a small amount of data.

Fundamentals:

The process for generating an RS recommendation is based on a combination of the following considerations:

The type of data available in the database (e.g. ratings, user registration information, features and content of the items that can be ranked, social relationships among users and location-aware information).
The filtering algorithm used (demographic, content-based, collaborative, social-based, context-aware or hybrid).
The mode chosen (based on direct use of the data, "memory-based", or on a model generated from the data, "model-based").
The techniques employed: probabilistic approaches, Bayesian networks, nearest-neighbor algorithms, bio-inspired algorithms such as neural networks and genetic algorithms, fuzzy models, singular value decomposition techniques to reduce sparsity levels, etc.
The sparsity level of the database and the desired scalability.
The performance of the system (time and memory consumption).
The desired quality of the results (e.g. novelty, coverage and precision).

Datasets:

Through public databases, the scientific community can replicate experiments to validate and improve its techniques. The public databases referenced most often in research are:

Last.fm
Delicious
MovieLens
Netflix
Jester
EachMovie
Book-Crossing

The internal functioning of a recommender system is characterized by its filtering algorithm. The most widely used classification divides filtering into:

Collaborative filtering
Demographic filtering
Content-based filtering
Hybrid filtering

Content-based filtering: makes recommendations based on choices the user made in the past (e.g. in a web-based e-commerce setting).

Demographic filtering: is justified on the principle that individuals with certain common personal attributes (sex, age, country, etc.) will also have common preferences.

Collaborative filtering: lets users rate a set of elements (e.g. videos, songs, films) and uses those ratings to make recommendations.

The most widely used algorithm for collaborative filtering is the k nearest neighbors (kNN) algorithm.

In the user-to-user version, kNN executes the following three tasks to generate recommendations for an active user:

1- Determine the k nearest neighbors (the neighborhood) of the active user a.
2- Aggregate the neighborhood's ratings for the items not rated by a.
3- Extract the predictions from step 2 and select the top N recommendations.

Hybrid filtering commonly uses a combination of CF with demographic filtering, or CF with content-based filtering, to exploit the merits of each of these techniques.
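The three kNN tasks above can be sketched as follows. The ratings matrix is a toy example, cosine similarity and a similarity-weighted average are assumed choices for the neighborhood and aggregation steps, and knn_recommend is a hypothetical helper name.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = items); 0 means "not rated".
ratings = np.array([
    [5, 3, 0, 0],   # active user a = 0
    [5, 3, 4, 1],
    [4, 3, 5, 0],
    [1, 5, 1, 5],
], dtype=float)

def cosine_sim(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def knn_recommend(r, active, k=2, top_n=1):
    # Task 1: find the k users most similar to the active user.
    sims = np.array([
        cosine_sim(r[active], r[u]) if u != active else -np.inf
        for u in range(r.shape[0])
    ])
    neighbours = np.argsort(sims)[::-1][:k]
    # Task 2: aggregate the neighbours' ratings for items the active user has not rated.
    predictions = {}
    for item in np.where(r[active] == 0)[0]:
        raters = [u for u in neighbours if r[u, item] > 0]
        if raters:
            w = sims[raters]
            predictions[int(item)] = float(w @ r[raters, item] / w.sum())
    # Task 3: keep the top N predicted items.
    top = sorted(predictions, key=predictions.get, reverse=True)[:top_n]
    return top, predictions

top, predictions = knn_recommend(ratings, active=0)
print(top, predictions)
```

Restricting the aggregation to k neighbors, rather than all users, keeps dissimilar users from diluting the prediction.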

Recommendation categories based on mode:

Memory based: methods that act only on the matrix of user ratings for items and can use any rating generated before the referral process. They usually use similarity metrics to obtain the distance between two users, or two items, based on their ratings.
Model based: methods that use the data to generate a model, which is then used to produce the recommendations.
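As a rough illustration of the model-based mode, the sketch below fits a low-rank model to a toy ratings matrix with singular value decomposition, one of the techniques listed in the fundamentals above. Centering each user's ratings on their mean and keeping two singular values are assumptions made for the example, and svd_predict is a hypothetical name.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = items); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def svd_predict(r, rank=2):
    """Fit a rank-limited SVD model and return a dense matrix of predicted ratings."""
    mask = r > 0
    counts = np.maximum(mask.sum(axis=1), 1)
    user_means = r.sum(axis=1) / counts
    # center the known ratings on each user's mean; missing cells become 0
    centered = np.where(mask, r - user_means[:, None], 0.0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # keep only the strongest `rank` latent factors
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return low_rank + user_means[:, None]

predicted = svd_predict(ratings, rank=2)
print(np.round(predicted, 2))
```

Unlike the memory-based approach, the ratings are only needed to fit the factors; predictions afterwards come from the compact model.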

Foundation — Cold start:

The cold start problem occurs when it is not possible to make reliable recommendations due to an initial lack of ratings.

Types of cold start:

New community.
New item.
New user.

New community: refers to the difficulty, when starting up a new RS, of obtaining a sufficient amount of rating data to make reliable recommendations. Two common ways of tackling this problem are:
- Encourage users to rate items, for example through a gamification model.
- Use CF-based recommendations only once there are enough users and ratings.

New item: arises because new items added to an RS usually have no initial ratings, and are therefore unlikely to be recommended. As a result, many users will never see these items. A common solution is to have a set of motivated users who are responsible for rating each new item in the system.

New user: one of the greatest difficulties faced by an RS. Since new users have not yet provided any ratings, they cannot receive personalized recommendations from memory-based CF. The common strategy for tackling this problem is to turn to information beyond the set of ratings, so that recommendations can be made from the data available for each user.
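One possible form of such additional information is demographic data collected at sign-up. The sketch below, which assumes an age group is known for every user, ranks items for a new user by their mean rating within the same group. The data and the function name recommend_for_new_user are hypothetical.

```python
from collections import defaultdict

# Hypothetical ratings as (user, item, rating) triples.
ratings = [
    ("u1", "item_a", 5), ("u1", "item_b", 2),
    ("u2", "item_a", 4), ("u2", "item_c", 5),
    ("u3", "item_b", 5), ("u3", "item_c", 1),
]
# Hypothetical demographic attribute collected at registration.
age_group = {"u1": "18-25", "u2": "18-25", "u3": "40-60"}

def recommend_for_new_user(group, top_n=2):
    """Rank items by their mean rating among existing users of the same group."""
    totals = defaultdict(lambda: [0.0, 0])
    for user, item, rating in ratings:
        if age_group[user] == group:
            totals[item][0] += rating
            totals[item][1] += 1
    means = {item: total / count for item, (total, count) in totals.items()}
    return sorted(means, key=means.get, reverse=True)[:top_n]

print(recommend_for_new_user("18-25"))
print(recommend_for_new_user("40-60"))
```

Once the new user has rated enough items, the system can switch from this fallback to regular collaborative filtering.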

Similarity measure:

A metric or a Similarity Measure (SM) determines the similarity between pairs of users (user to user CF) or the similarity between pairs of items (item to item CF).

For this purpose, we compare the ratings of all the items rated by two users (user to user) or the ratings of all users who have rated two items (item to item) .

The KNN algorithm is based essentially on the use of traditional similarity metrics of statistical origin. These metrics require, as the only source of information, the set of votes made by the users on the items (memory-based CF).

Techniques used to measure Similarity:

Pearson correlation (CORR)
Cosine (COS)
Adjusted cosine (ACOS)
Constrained correlation (CCORR)
Mean Squared Differences (MSD)
Euclidean (EUC)
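A few of these measures are sketched below for two users' rating vectors, comparing only the items both users rated. The variants shown are assumptions for illustration: Pearson uses the mean over co-rated items, MSD is turned into a similarity on a 1 to 5 scale, and Euclidean is returned as a distance. Adjusted cosine and constrained correlation are left out to keep the example short.

```python
import numpy as np

def co_rated(u, v):
    """Keep only the positions both users rated (0 means "not rated")."""
    mask = (u > 0) & (v > 0)
    return u[mask], v[mask]

def pearson(u, v):
    x, y = co_rated(u, v)
    if len(x) < 2:
        return 0.0
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom else 0.0

def cosine(u, v):
    x, y = co_rated(u, v)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

def msd_similarity(u, v, r_min=1, r_max=5):
    """1 minus the mean squared difference of ratings normalized to [0, 1]."""
    x, y = co_rated(u, v)
    if len(x) == 0:
        return 0.0
    diff = (x - y) / (r_max - r_min)
    return float(1.0 - np.mean(diff ** 2))

def euclidean_distance(u, v):
    x, y = co_rated(u, v)
    return float(np.linalg.norm(x - y))

a = np.array([5, 3, 4, 0], dtype=float)
b = np.array([4, 2, 3, 5], dtype=float)
print(pearson(a, b), cosine(a, b), msd_similarity(a, b), euclidean_distance(a, b))
```

Here user b rates every shared item exactly one point lower than user a, so Pearson treats them as perfectly correlated while MSD and Euclidean still register the offset.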

Evaluation:

The most commonly used quality measurements are the following:

Evaluation of predictions.
Evaluation of recommendations as sets.
Evaluation of recommendations as ranked lists.

Evaluation metrics can be classified as:

Prediction metrics: accuracy metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Normalized Mean Absolute Error (NMAE), plus coverage.
Set recommendation metrics: such as Precision, Recall and Receiver Operating Characteristic (ROC).
Rank recommendation metrics: such as the half-life and the discounted cumulative gain.
Diversity metrics: such as the diversity and the novelty of the recommended items.
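Several of these metrics fit in a few lines each. The sketch below uses plain Python and made-up inputs; the function names are hypothetical, and the DCG variant shown (relevance divided by log2 of rank + 1) is one common formulation among several.

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_recall(recommended, relevant):
    """Precision and recall of a recommended set against the relevant set."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

print(mae([4, 3, 5], [5, 3, 3]))
print(rmse([4, 3, 5], [5, 3, 3]))
print(precision_recall([1, 2, 3, 4], {2, 4, 6}))
print(dcg([3, 2, 0, 1]))
```

RMSE penalizes large errors more heavily than MAE, while precision, recall and DCG look at the recommended list rather than at individual rating predictions.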

References:

Papers:

YOUR PRIVACY PROTECTOR: A RECOMMENDER SYSTEM FOR PRIVACY SETTINGS IN SOCIAL NETWORKS — International Journal of Security, Privacy and Trust Management ( IJSPTM) Vol 2, No 4, August 2013.
The Netflix Recommender System: Algorithms, Business Value, and Innovation — Carlos A. Gomez-Uribe and Neil Hunt. 2015.
Social Network and Tag Sources Based Augmenting Collaborative Recommender System — IEICE TRANS. INF. & SYST., VOL.E98–D, NO.4 APRIL 2015.
Wide & Deep Learning for Recommender Systems — Google Inc. — DLRS ’16 September 15–15, 2016, Boston, MA, USA.
Health-aware Food Recommender System — RecSys 2015, Vienna Austria.
Recommender systems survey — Knowledge-Based Systems — 2013.
Collaborative Filtering and Deep Learning Based Recommendation System For Cold Start Items — 2016.
Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning — 2017.
Deep Neural Networks for YouTube Recommendations — 2016.

Books:

Building Recommendation Engine.
Machine learning for web.

Links:

https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3