Recommendation Systems on a Streaming App: Rank-Based, User-User, and Matrix Factorization

Mohammed Alrashidan
15 min read · Feb 24, 2022


Business Understanding

Have you ever wondered how you and your friends ended up watching the same movie even though none of you recommended it to each other? Or how seemingly everyone on the streaming app ended up watching it?

Take Squid Game, for example. We could say it's an incredible show, but what else? What drove so many users to decide to watch it? The short answer: a recommendation system.

Intro to Recommendation Systems:

A recommendation system is a machine learning algorithm that combines users' preferences and historical decisions to output recommendations for each user. Simply put, if you like action movies, you are more likely to be recommended more action movies that other action fans have watched.

It is by far one of the best features for helping users make faster decisions and discover related content based on what they previously liked.

Types of Recommendation Systems:

1. Knowledge-Based Recommendation System:

Knowledge-based systems are an informative technique that helps users make decisions based on explicit information about item characteristics, such as filtering Amazon results to show only Prime Delivery. Another example is rank-based recommendation, such as Netflix's Top 10 movies.
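To make the idea concrete, here is a minimal sketch of knowledge-based filtering (the catalog dataframe and its views column are hypothetical, though program_genre and hd mirror columns we load later in this post):

import pandas as pd

# hypothetical catalog; program_genre and hd mirror the real dataset's columns
catalog = pd.DataFrame({
    'original_name': ['Show A', 'Movie B', 'Movie C'],
    'program_genre': ['Action', 'Action', 'Drama'],
    'hd':            [1, 0, 1],
    'views':         [900, 400, 700]})

# knowledge-based: apply the user's explicit criteria first,
# then rank whatever is left by popularity (rank-based)
action_hd = catalog[(catalog['program_genre'] == 'Action') & (catalog['hd'] == 1)]
top_picks = action_hd.sort_values('views', ascending=False)['original_name'].tolist()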

2. Collaborative Filtering Recommendation System:

This technique is based on users' interactions with items and similarities between users. There is also a model-based variant, which is more advanced: it uses modeling and machine learning to validate the recommendations given to users and to measure how accurately the model performs.
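A toy example of the intuition, assuming binary watch vectors (1 = watched): the dot product counts items two users share, so the highest-scoring pairs are the "neighbors" whose unseen items become candidate recommendations.

import numpy as np

# rows of a user-item matrix; 1 means the user watched that item
user_a = np.array([1, 1, 0, 1])
user_b = np.array([1, 1, 1, 0])
user_c = np.array([0, 0, 1, 1])

# dot product = number of items watched in common
print(user_a @ user_b)  # 2 -> user_b is the closer neighbor
print(user_a @ user_c)  # 1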

3. Content-Based Recommendation System:

Here, recommendations are derived from the metadata of users and items, by leveraging item information. For example, if user X bought a book, we could use the book's content, such as its author, description, or genre, to find similar books for user X, because we can assume this user is likely to buy other books matching those criteria.
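A minimal content-based sketch with scikit-learn, using made-up book descriptions (the Jawwy dataset used below has no description field, so this is purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical item metadata
descriptions = ['space opera adventure with aliens',
                'galactic adventure among alien worlds',
                'cookbook of italian pasta recipes']

# turn text metadata into vectors, then compare items by cosine similarity
tfidf = TfidfVectorizer().fit_transform(descriptions)
sims = cosine_similarity(tfidf[0], tfidf).ravel()
# item 1 scores highest (after item 0 itself), so it would be recommended to user X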

Data Understanding

Now we will implement a recommendation system by applying these techniques to real streaming data.

Data

The data comes from the Saudi Telecom Company (STC), which recently launched an Open Data initiative for the public: https://lab.stc.com.sa/dataset/en/

Data Collection:

STC offers many tech services, including Jawwy, a movie and TV show streaming service. The data contains over 3 million records of user activity on the Jawwy service.

Data Overview

Now, less text, more code!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import matplotlib.image as image
from matplotlib.offsetbox import TextArea, DrawingArea, OffsetImage, AnnotationBbox
%matplotlib inline

stc = pd.read_csv('Final_Dataset.txt', delimiter=",")
stc = stc.drop('Unnamed: 0', axis=1)

# creating a unique id for videos
df = stc.copy()
df['vid_id'] = df.groupby(['original_name']).ngroup()
df.head()

Data Processing

# choosing the columns we need for recommendation
df = df[['user_id_maped', 'vid_id', 'program_name', 'program_class', 'season', 'episode', 'program_genre', 'hd', 'date_', 'original_name']]
# rename cols
df.columns = ['user_id', 'vid_id', 'program_name', 'program_class', 'season', 'episode', 'program_genre', 'hd', 'date', 'original_name']
# create year and month columns from the date string
df['year'] = [int(x[:4]) for x in df['date']]
df['month'] = [int(x[5:7]) for x in df['date']]
df['date'] = pd.to_datetime(df.date, format='%Y/%m/%d')
# fixing some names in the dataframe
df.program_class = df.program_class.replace('MOVIE', "Movie")
df.program_class = df.program_class.replace('SERIES/EPISODES', "TV Show")
df.program_genre = df.program_genre.replace('SERIES_NOT_ADDED_UNDER_ANY_GENRE', "Others")
df.program_genre = df.program_genre.replace('NOT_DEFINED_IN_UMS', "Others")

Exploratory Data Analysis

# check for nulls
df.isna().sum()
# check for duplicates
print('Number of Duplicate Records {}'.format(df.duplicated().sum()))
print('Percentage of Duplicates on Full Data {}%'.format(round(df.duplicated().sum()/df.shape[0]*100, 2)))

Number of Duplicate Records 1027754
Percentage of Duplicates on Full Data 28.56%
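Roughly 29% of the records are exact duplicates. Whether to drop them is a judgment call, since a repeated row may also represent a legitimate repeated view; if we did want to remove them, a one-liner would do it:

# optional: drop exact duplicate records
df = df.drop_duplicates().reset_index(drop=True)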

Outliers

import seaborn as sns

# count interactions per user
user_act = df.groupby(by='user_id')['vid_id'].count()
before_outlier = user_act.values
ax = sns.distplot(before_outlier)

Let's discuss this outlier, because the maximum number of views per user tells us something important. Jawwy is similar to Netflix, and we know users watch a lot of content, but 11,944 views is a very large number for one person. There are two possible explanations:

  1. It could be duplicated records in the database for the same user, possibly a testing account, or more generally a bug.
  2. I would also assume that Jawwy lacks an important feature: limiting users who stream continuously for long periods. Let me explain further: there is a feature that automatically plays the next episode, right? How long is the countdown, 3 seconds? If so, users jump quickly to the next episode. Where is the problem? The problem is viewers who fall asleep and let the show run on; the service will loop endlessly through content if the user never touches the remote. The solution is simple: "Are you still watching?" A simple prompt can save you from useless records in the database, keep this kind of noise out of your models, and keep your numbers honest.
# removing outliers using the IQR rule
def remove_outlier_IQR(df):
    outs = df.describe()
    Q1 = outs['25%']
    Q3 = outs['75%']
    IQR = Q3 - Q1
    df_final = df[~((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR)))]
    return df_final

user_act = remove_outlier_IQR(user_act)
after_re_outlier = user_act.values
ax = sns.distplot(after_re_outlier)

Data Visualization

# Palette
sns.palplot(['#6A01BB', '#390065', '#FF375D','#e8591b','#f5f5f1' ])
plt.title("STC brand palette ",loc='left',fontsize=15,y=1)
plt.show()

1. Percentage of Program Type

Processing

type_ratio = round(df.program_class.value_counts(normalize=True).to_frame().transpose(),2)

Visualization

fig, ax = plt.subplots(1, 1, figsize=(6.5, 2.5))
ax.barh(type_ratio.index, type_ratio['Movie'],
        color='#FF375D', alpha=0.9)
ax.barh(type_ratio.index, type_ratio['TV Show'], left=type_ratio['Movie'],
        color='#6A01BB', alpha=0.9)
ax.set_xlim(0, 1)
ax.set_xticks([])
ax.set_yticks([])

# movie percentage
for i in type_ratio.index:
    ax.annotate(f"{int(type_ratio['Movie'][i]*100)}%",
                xy=(type_ratio['Movie'][i]/2, i),
                va='center', ha='center', fontsize=40, fontweight='light', fontfamily='serif',
                color='white')
    ax.annotate("Movie",
                xy=(type_ratio['Movie'][i]/2, -0.25),
                va='center', ha='center', fontsize=15, fontweight='light', fontfamily='serif',
                color='white')

# tv show percentage
for i in type_ratio.index:
    ax.annotate(f"{int(type_ratio['TV Show'][i]*100)}%",
                xy=(type_ratio['Movie'][i] + type_ratio['TV Show'][i]/2, i),
                va='center', ha='center', fontsize=40, fontweight='light', fontfamily='serif',
                color='white')
    ax.annotate("TV Show",
                xy=(type_ratio['Movie'][i] + type_ratio['TV Show'][i]/2, -0.25),
                va='center', ha='center', fontsize=15, fontweight='light', fontfamily='serif',
                color='white')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

# -------------------- Title, Insight, Source --------------------
n = 0.5
fig.text(0, 1.2+n, 'STC Jawwy Movie & TV Show Distribution', fontsize=22, fontweight='bold', fontfamily='serif', color='grey')
fig.text(0, 1.1+n, "Percentage of Content Type in Catalog", fontsize=15, fontweight='light', fontfamily='serif', color='grey')
fig.text(0, 0.8+n, "[Insight]", fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0, 0.5+n, """
We can see that users stream Movies and
TV Shows at fairly close rates, although TV Shows are higher
""", fontsize=15, fontweight='light', fontfamily='serif', color='grey')

import matplotlib.lines as lines
l1 = lines.Line2D([0, 1], [0.95, 0.95], transform=fig.transFigure, figure=fig, color='black', lw=0.3)
fig.lines.extend([l1])
fig.text(0, -0.5+n, "Source: STC Lab OpenData", fontsize=11, fontweight='bold', fontfamily='serif', color='#e8591b')
# ------------------------------------------------------------------
plt.show()

2. Genre Percentages in the Jawwy Collection

Processing

genre_total = df.groupby(by=['program_class', 'program_genre'], as_index=False)['user_id'].count().sort_values('program_genre', ascending=False)
# keep only genres that appear in more than one program class
dups = genre_total['program_genre'].value_counts()
genre_total = genre_total[genre_total['program_genre'].isin(dups[dups > 1].index)]
genre_total = genre_total.pivot_table(index='program_genre', columns='program_class', values='user_id').reset_index()
genre_total.columns = [''.join(col) for col in genre_total.columns]
genre_total['Total'] = genre_total['Movie'] + genre_total['TV Show']
genre_total = genre_total.sort_values('Movie', ascending=True)
# convert counts to percentages of each genre's total
genre_total['Movie'] = round(genre_total['Movie']/genre_total['Total']*100, 2)
genre_total['TV Show'] = round(genre_total['TV Show']/genre_total['Total']*100, 2)
genre_total = genre_total.set_index('program_genre')
genre_total = genre_total.sort_values('Movie', ascending=False)

Visualization

fig, ax = plt.subplots(1, 1, figsize=(15, 8))
ax.barh(genre_total.index, genre_total['Movie'],
        color='#ff375e', alpha=0.8, label='Movie')
ax.barh(genre_total.index, genre_total['TV Show'], left=genre_total['Movie'],
        color='#390065', alpha=0.8, label='TV Show')
#ax.set_xticks([])
ax.set_yticklabels(genre_total.index, fontfamily='serif', fontsize=11)
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off
    top=False,          # ticks along the top edge are off
    labelbottom=False)

# movie percentage
for i in genre_total.index:
    ax.annotate(f"{genre_total['Movie'][i]:.4}%",
                xy=(genre_total['Movie'][i]/2, i),
                va='center', ha='center', fontsize=12, fontweight='bold', fontfamily='serif',
                color='white')

# tv show percentage
for i in genre_total.index:
    ax.annotate(f"{genre_total['TV Show'][i]:.4}%",
                xy=(genre_total['Movie'][i] + genre_total['TV Show'][i]/2, i),
                va='center', ha='center', fontsize=12, fontweight='bold', fontfamily='serif',
                color='white')

# ------------------------ Title, Insight, Source -----------------
n = 0.01
fig.text(0, 1.24+n, 'STC Jawwy Genre Collection', fontsize=22, fontweight='bold', fontfamily='serif', color='grey')
fig.text(0, 1.2+n, "Total TV Shows and Movies Watched Combined", fontsize=15, fontweight='light', fontfamily='serif', color='grey')
fig.text(0, 1.1+n, "[Insight]", fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0, 0.97+n, """
The TV Show side of Jawwy is dominated by Sci-Fi, Family and Animation content with more than 50%.
Movies have a broad selection across all other genres. We noticed that the 'Others' genre takes a high
portion of movies, which means a lot of movies were not categorized
""", fontsize=15, fontweight='light', fontfamily='serif', color='grey')

import matplotlib.lines as lines
l1 = lines.Line2D([0, 0.9], [0.95, 0.95], transform=fig.transFigure, figure=fig, color='black', lw=0.3)
fig.lines.extend([l1])
fig.text(0.755, 1.14, "STC Lab OpenData", fontsize=11, fontweight='bold', fontfamily='serif', color='#e8591b')
# ------------------------------------------------------------------
fig.text(0.7, 0.86, "Movie", fontweight="bold", fontfamily='serif', fontsize=15, color='#FF375D')
fig.text(0.75, 0.86, "|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.76, 0.86, "TV Show", fontweight="bold", fontfamily='serif', fontsize=15, color='#390065')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)
ax.tick_params(axis='both', which='major', labelsize=12)
ax.tick_params(axis='both', which='both', length=0)

im = image.imread('STC.png')
newax = fig.add_axes([0.75, 1.16, 0.12, 0.12], anchor='NE', zorder=1)
newax.imshow(im)
newax.axis('off')
plt.show()

3. High Definition Support

Processing

# Number of HD titles in the catalog
hd_size = df[['hd', 'vid_id']].drop_duplicates()
hd_size = hd_size.groupby(by="hd", as_index=False)['vid_id'].count()
hd_size['hd'] = np.where(hd_size['hd'] == 1, 'HD', 'SD')

Visualization

# create variables for data, labels, and colors
pie_data = hd_size.loc[:, 'vid_id']
pie_labels = hd_size.loc[:, 'hd']
pie_color = ['#390065', '#ff375e']

# make figure and axes
fig, ax = plt.subplots(1, 1, figsize=(12, 5))

# adapt radius and text size for a smaller pie
patches, texts, autotexts = ax.pie(pie_data,
                                   colors=pie_color,
                                   autopct='%.0f%%',
                                   textprops={'size': 30, 'color': 'black'},
                                   shadow=True,
                                   radius=2,
                                   startangle=70,               # angle the small slice to the right side
                                   wedgeprops=dict(width=1.2),  # create donut chart
                                   explode=(0, 0.4))            # make the small slice stand out

fig.text(0.44, 0.5, "SD", fontweight="bold", fontfamily='serif', fontsize=30, color='#390065')
fig.text(0.5, 0.5, "|", fontweight="bold", fontfamily='serif', fontsize=30, color='black')
fig.text(0.52, 0.5, "HD", fontweight="bold", fontfamily='serif', fontsize=30, color='#FF375D')

# --------------- Title, Insight, Source -------------------------
n = 0.3
plt.setp(autotexts, size=22, weight="bold", color="white")
fig.text(0, 1.35+n, 'STC Jawwy High Definition Support', fontsize=22, fontweight='bold', fontfamily='serif', color='grey')
fig.text(0, 1.28+n, "Total TV Shows and Movies Combined", fontsize=15, fontweight='light', fontfamily='serif', color='grey')
fig.text(0, 1.15+n, "Insight:", fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0, 0.95+n, """
We notice that SD content makes up a larger share of the
Jawwy IPTV catalog than HD. SD has about 1642 titles while
HD has only 549 movies and TV shows
""", fontsize=15, fontweight='light', fontfamily='serif', color='grey')

import matplotlib.lines as lines
l1 = lines.Line2D([0, 0.9], [0.9+n, 0.9+n], transform=fig.transFigure, figure=fig, color='black', lw=0.3)
fig.lines.extend([l1])
fig.text(0.744, 1.45, "STC Lab OpenData", fontsize=11, fontweight='bold', fontfamily='serif', color='#e8591b')
# ------------------------------------------------------------------
ax.tick_params(axis='both', which='major', labelsize=12)
ax.tick_params(axis='both', which='both', length=0)
im = image.imread('STC.png')
newax = fig.add_axes([0.7, 1.5, 0.2, 0.2], anchor='NE', zorder=1)
newax.imshow(im)
newax.axis('off')
plt.show()

4. User Views Over Time by Type

Processing

df['date_column'] = df['date'].dt.to_period('M')
user_interactions = df.groupby(by=['date_column', 'program_class'], as_index=False).agg({"user_id": 'count'}).set_index('date_column')
user_interactions = pd.pivot_table(user_interactions, index='date_column', columns='program_class', values='user_id')
user_interactions.index = user_interactions.index.strftime('%y-%b')
date = user_interactions.index.astype(str)

Visualization

fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(date, user_interactions['Movie'], color='#FF375D', marker='o', lw=0.5)
ax.plot(date, user_interactions['TV Show'], color='#390065', marker='o', lw=0.5)

# -------------------------------------- Title, Insight, Source --------------------------------------
n = 0.3
fig.text(0, 0.94+n, 'STC Jawwy User Streaming By Months', fontsize=22, fontweight='light', fontfamily='serif', color='grey')
fig.text(0, 0.9+n, "Monthly TV Shows and Movies Activity", fontsize=15, fontweight='light', fontfamily='serif', color='grey')
fig.text(0, 0.84+n, "Insight:", fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0, 0.72+n, """
Both Movies and TV Shows decreased in June; then TV Shows alone
spiked within 3 months, topping the views for the remaining months.
""", fontsize=15, fontweight='light', fontfamily='serif', color='grey')

import matplotlib.lines as lines
l1 = lines.Line2D([0, 0.9], [0.7+n, 0.7+n], transform=fig.transFigure, figure=fig, color='black', lw=0.3)
fig.lines.extend([l1])
fig.text(0.77, 1.11, "STC Lab OpenData", fontsize=11, fontweight='bold', fontfamily='serif', color='#e8591b')
# -----------------------------------------------------------------------------------------------
fig.text(0.68, 0.9, "TV Shows", fontweight="bold", fontfamily='serif', fontsize=30, color='#390065')
fig.text(0.68, 0.44, "Movies", fontweight="bold", fontfamily='serif', fontsize=30, color='#FF375D')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)
ax.set_yticklabels(['80K', '100K', '120K', '140K', '160K', '180K', '190K', '200K'], fontsize=20, color='grey')
ax.set_ylabel('Views', fontsize=20, color='grey')
ax.set_xticklabels(list(user_interactions.index), fontsize=15, color='grey')
im = image.imread('STC.png')
newax = fig.add_axes([0.75, 1.13, 0.15, 0.15], anchor='NE', zorder=1)
newax.imshow(im)
newax.axis('off')

Recommendation System

Part I: Rank-Based Recommendations

In rank-based recommendations, we care about how many times each piece of content was interacted with by users. Building a function that lists the top n items can be insightful!

def get_top_items(n, df=df):
    '''
    INPUT:
    n - (int) the number of top items to return
    df - (pandas dataframe) df as defined at the top of the notebook

    OUTPUT:
    top_items - (list) A list of the top 'n' item titles
    '''
    # count interactions per title and take the n most-watched
    top_items = list(df.groupby('original_name').count()['user_id'].sort_values(ascending=False).index[:n])

    return top_items  # Return the top item titles from df

def get_top_items_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top items to return
    df - (pandas dataframe) df as defined at the top of the notebook

    OUTPUT:
    top_items - (list) A list of the top 'n' item ids
    '''
    # same as above, but grouped by the video id instead of the title
    top_items = list(df.groupby('vid_id').count()['user_id'].sort_values(ascending=False).index[:n])

    return top_items  # Return the top item ids

top_10 = list(get_top_items(10))
top_10

Part II: User-User Based Collaborative Filtering

We will use the function below to reformat the df dataframe to be shaped with users as the rows and items as the columns.

  • Each user should only appear in each row once.
  • Each item should only show up in one column.
  • If a user has interacted with an item, then place a 1 where the user-row meets that item-column. It does not matter how many times a user has interacted with the item; all entries where a user has interacted with an item should be a 1.
  • If a user has not interacted with an item, then place a 0 where the user-row meets that item-column.
# I am subsetting the dataset since computing similarity on the full data would take forever on my machine
df = df.head(100000)

# create the user-item matrix with 1's and 0's
def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with item_id, title, user_id columns

    OUTPUT:
    user_item - user item matrix

    Description:
    Return a matrix with user ids as rows and item ids as the columns, with 1 where a user
    interacted with an item and 0 otherwise
    '''
    df = df[['user_id', 'original_name', 'vid_id']]
    # count interactions per (user, item), then binarize: any interaction becomes 1
    user_item = df.groupby(['user_id', 'vid_id'])['vid_id'].count().unstack().notnull().astype(int)

    return user_item  # return the user_item matrix

user_item = create_user_item_matrix(df)

Calculating User Similarities

def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by items:
                1's when a user has interacted with an item, 0 otherwise

    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first

    Description:
    Computes the similarity of every pair of users based on the dot product
    and returns an ordered list of user ids
    '''
    # compute similarity of each user to the provided user
    similarity = user_item.dot(np.transpose(user_item))
    # sort by similarity
    similarity_sort = similarity.sort_values(by=user_id, axis=1, ascending=False)
    # remove the user's own id
    similarity_sort = similarity_sort.drop(columns=user_id)
    # create a list of just the ids
    most_similar_users = similarity_sort.columns.tolist()

    return most_similar_users  # return a list of the users in order from most to least similar

# Do a spot check of the function
print("The 10 most similar users to user 15 are: {}".format(find_similar_users(15)[:10]))
print("The 5 most similar users to user 34254 are: {}".format(find_similar_users(34254)[:5]))
print("The 3 most similar users to user 57 are: {}".format(find_similar_users(57)[:3]))
df['vid_id'] = df['vid_id'].astype(str)

def get_item_names(item_ids, df=df):
    '''
    INPUT:
    item_ids - (list) a list of item ids
    df - (pandas dataframe) df as defined at the top of the notebook

    OUTPUT:
    item_names - (list) a list of item names associated with the list of item ids
                 (identified by the original_name column)
    '''
    is_items = df['vid_id'].isin(item_ids)
    item_names = list(set(df[is_items]['original_name'].values))
    return item_names

def get_user_items(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by items:
                1's when a user has interacted with an item, 0 otherwise

    OUTPUT:
    items_ids - (list) a list of the item ids seen by the user
    items_names - (list) a list of item names associated with the list of item ids

    Description:
    Provides the item ids and item titles that have been seen by a user
    '''
    # select the user's row and keep only the items marked 1
    user_row = user_item.loc[user_id]
    items_ids = list(user_row[user_row == 1].index.astype('str'))
    items_names = get_item_names(items_ids)

    return items_ids, items_names

Once we have these four functions ready (convert df to a matrix, find similar users, get item names, and get a user's items), we can build the final recommendation function, which takes a user id and outputs a list of movies/TV shows to recommend.

def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook
    user_item - (pandas dataframe) matrix of users by items:
                1's when a user has interacted with an item, 0 otherwise

    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                   user_id - a neighbor's user_id
                   similarity - measure of the similarity of each user to the provided user_id
                   num_interactions - the number of items viewed by that user

    Other Details - sort neighbors_df by similarity and then by number of interactions,
    with the highest of each at the top of the dataframe
    '''
    # compute the similarity of user_id to all users
    similarity = user_item.dot(user_item.loc[user_id]).sort_values(ascending=False)
    # drop the first row (the user themselves) and move the index into a column
    similarity = pd.DataFrame(similarity[1:]).reset_index()
    # rename columns
    similarity.columns = ['user_id', 'similarity']
    # second metric: the number of interactions of each user in df
    num_interactions = df.groupby('user_id').count()['vid_id']
    num_interactions = pd.DataFrame(num_interactions).reset_index()
    num_interactions.columns = ['user_id', 'num_interactions']
    # join num_interactions to the similarity dataframe
    neighbors_df = similarity.merge(num_interactions, how='outer', on='user_id')
    neighbors_df.columns = ['user_id', 'similarity', 'num_interactions']
    neighbors_df = neighbors_df.sort_values(by=['similarity', 'num_interactions'], ascending=False)

    return neighbors_df  # Return the dataframe specified in the docstring

def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user

    OUTPUT:
    recs - (list) a list of recommendations for the user by item id
    rec_names - (list) a list of recommendations for the user by item title

    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds items the user hasn't seen before and provides them as recs
    Does this until m recommendations are found

    Notes:
    * Choose the users that have the most total item interactions
      before choosing those with fewer item interactions.
    * Choose the items with the most total interactions
      before choosing those with fewer total interactions.
    '''
    # the target user's item ids
    the_user_items = list(set(get_user_items(user_id)[0]))
    recs = []
    # get the dataframe of all neighbor users
    neighbors_df = get_top_sorted_users(user_id)
    # loop through each neighbor and store the item ids our target user hasn't seen
    for user in neighbors_df['user_id']:
        user_items_ids = get_user_items(user)[0]
        for item in user_items_ids:
            if item not in the_user_items and item not in recs:
                recs.append(item)
            if len(recs) >= m:
                break
        if len(recs) >= m:
            break
    # if we have fewer than m, fill in with the top items overall
    if len(recs) < m:
        for id in get_top_items_ids(m):
            if id not in the_user_items and id not in recs:
                recs.append(id)
            if len(recs) >= m:
                break
    rec_names = get_item_names(recs)

    return recs, rec_names

As we can see, we just input the user id and the number of recommendations we want, and get back a list of movies/TV shows.
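For example (user 15 is one of the ids we spot-checked above; the exact output depends on the subset used):

# 10 recommendations for user 15
recs, rec_names = user_user_recs_part2(15, m=10)
rec_names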

Part III: Matrix Factorization

# split the data into train and test sets (70/30)
df_train = df.sample(frac=0.7, random_state=25)
df_test = df.drop(df_train.index)
print(f"No. of training examples: {df_train.shape[0]}")
print(f"No. of testing examples: {df_test.shape[0]}")
def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe

    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe
                      (unique users for each row and unique items for each column)
    user_item_test - a user-item matrix of the testing dataframe
                     (unique users for each row and unique items for each column)
    test_idx - all of the test user ids
    test_items - all of the test item ids
    '''
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)

    test_idx = user_item_test.index.values
    test_items = user_item_test.columns.values

    return user_item_train, user_item_test, test_idx, test_items

user_item_train, user_item_test, test_idx, test_items = create_test_and_train_user_item(df_train, df_test)

Once we split the data, we can decompose the training user-item matrix with SVD, reconstruct it from the dot product of the factors, and then validate on the test data.

In the code below, our goal is to vary the number of latent features and see how accuracy behaves on train versus test.
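As a quick refresher on what SVD gives us: the user-item matrix is factorized into user factors, singular values, and item factors, and keeping only the top k singular values yields a rank-k approximation:

$$A_{m \times n} \approx U_{m \times k}\, \Sigma_{k \times k}\, V^{T}_{k \times n}$$

Here A is the user_item matrix, the rows of U are user latent features, the rows of V^T are item latent features, and Σ holds the k largest singular values; multiplying the three back together gives the predicted interactions.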

# fit SVD on the training user-item matrix (full SVD, since the matrix has no missing values)
u_train, s_train, vt_train = np.linalg.svd(user_item_train)

# grab user ids and item ids
train_users_ids = user_item_train.index.values
test_users_ids = user_item_test.index.values
train_users_items = user_item_train.columns.values
test_users_items = user_item_test.columns.values

# users and items that appear in both test and train
common_users = [x for x in list(test_users_ids) if x in train_users_ids]
common_items = [x for x in list(test_users_items) if x in train_users_items]
common_users_items = user_item_test.loc[common_users, common_items]
common_users_items

# subset the training factor matrices to the common users/items,
# so we can make predictions for the test set
train_user_idx = user_item_train.index.isin(common_users)
train_item_idx = user_item_train.columns.isin(common_items)
u_test = u_train[train_user_idx, :]
vt_test = vt_train[:, train_item_idx]

num_latent_feats = np.arange(10, len(s_train), 20)
sum_errs_train = []
sum_errs_test = []
for k in num_latent_feats:
    # restructure with k latent features
    train_s_new, train_u_new, train_vt_new = np.diag(s_train[:k]), u_train[:, :k], vt_train[:k, :]
    test_u_new, test_vt_new = u_test[:, :k], vt_test[:k, :]

    # take the dot product to rebuild the (predicted) user-item matrices
    train_user_item_est = np.around(np.dot(np.dot(train_u_new, train_s_new), train_vt_new))
    test_user_item_est = np.around(np.dot(np.dot(test_u_new, train_s_new), test_vt_new))

    # compute the error of each prediction against the actual value
    train_diffs = np.subtract(user_item_train, train_user_item_est)
    test_diffs = np.subtract(common_users_items, test_user_item_est)

    # total the errors and keep track of them
    train_err = np.sum(np.sum(np.abs(train_diffs)))
    test_err = np.sum(np.sum(np.abs(test_diffs)))
    sum_errs_train.append(train_err)
    sum_errs_test.append(test_err)

Visualization of Matrix Factorization

plt.plot(num_latent_feats, 1 - np.array(sum_errs_train)/df.shape[0], label='Train');
plt.plot(num_latent_feats, 1 - np.array(sum_errs_test)/df.shape[0], label='Test');
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
plt.legend();

Analysis

We notice that on the test set, accuracy drops as the number of latent features grows, while on the larger training set, accuracy keeps rising with more latent features.

Discussion and Future Improvements for STC Jawwy

  • Accuracy is a reasonable way to measure performance for a given number of latent features, but we could also look at other similarity methods such as Pearson's correlation coefficient, Spearman's correlation coefficient, Kendall's Tau, Euclidean distance, and Manhattan distance (see the sketch after this list).
  • A purely random split is not sufficient, since we may pick many training users who do not exist in the test set; we could develop a better framework to check for that. Also, the number of test users is fairly large for smaller environments, but the obvious point is that any recommendation system will scale and should have more users.
  • Can we use an online evaluation technique like A/B testing here? Yes. To make A/B testing effective, we need more data points for segmentation, so that we understand how to recommend using geographic, demographic, and personalized signals based on user characteristics; it is always better to combine methods such as similarity and num_interactions with other variables to give users better recommendations.
  • From the given data, I only used a subset of 100k records to reduce computation, but the full 3M could be used on a better server.
  • The data needs more variables to optimize the recommendation system's accuracy for users. A few examples would be adding user ratings, program year, item id, and descriptions to enable a content-based recommender.
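As a rough sketch of those alternative similarity measures using SciPy (comparing two user rows from the user_item matrix; user ids 15 and 57 are the ones spot-checked earlier and are assumed to exist in the subset):

from scipy.stats import pearsonr, spearmanr, kendalltau
from scipy.spatial.distance import euclidean, cityblock

u1 = user_item.loc[15].values
u2 = user_item.loc[57].values

print(pearsonr(u1, u2)[0])    # Pearson's correlation coefficient
print(spearmanr(u1, u2)[0])   # Spearman's correlation coefficient
print(kendalltau(u1, u2)[0])  # Kendall's Tau
print(euclidean(u1, u2))      # Euclidean distance (lower = more similar)
print(cityblock(u1, u2))      # Manhattan distance (lower = more similar)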

Thanks for reading this post; I hope you liked the findings. I do a lot of data analysis, and you can find this notebook posted on GitHub. Part of my data journey is to (learn > apply > explain).

Let's connect!

Github: https://github.com/mohammedar95

Linkedin: https://www.linkedin.com/in/mohammedar/
