Movie Recs

Published in

Web Mining [IS688, Spring 2022]

4 min read · May 3, 2022

Although Netflix has recently lost subscribers (and expects to lose more), its recommendation system has, to a large extent, inspired similar algorithms throughout 21st-century technology. What Netflix popularized as a “what should we watch next?” question-answer system has turned into a “what should I buy next?” system and even a “what should I eat today?” system. Some could argue that these systems have even seeped into social media, with an alleged Twitter “what would I probably hate and become an expert in today?” system. In honor of these recommendation systems, I go back to their roots. Using a dataset of user ratings and movies, I build a movie recommendation system that should interest anyone who watches movies and wants to know what to see next.

My dataset contains 4,465 movies and 2,244 users. Each user has either a rating for a movie or no rating at all, which helps determine whether the user has seen that movie. To build the recommendation system, I wrote my code in a Jupyter notebook and used Pandas and sklearn to manipulate the data and produce my results. The results should be considered limited, since there are various ways to determine how users are related through the movies they have seen. My approach was to use the movies seen and their ratings to group users into clusters, and then to determine, from the movies watched by the other users in a cluster, what might be highly recommended for a given user. Instead of clustering, I could have used cosine similarity, for example, to find which users are similar.
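As an illustration of the cosine similarity alternative mentioned above, here is a minimal sketch. It is not the approach used in this post. The toy rating matrix, the user and movie names, and the helper functions `cosine_similarity_matrix` and `most_similar_users` are all made up for the example, and 0 is assumed to mean "not rated".

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie rating matrix; 0 is assumed to mean "not rated".
ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 1],
     [0, 0, 5, 4]],
    index=["user_a", "user_b", "user_c"],
    columns=["movie_1", "movie_2", "movie_3", "movie_4"],
)


def cosine_similarity_matrix(user_movie: pd.DataFrame) -> pd.DataFrame:
    """Return a user-by-user table of cosine similarities."""
    values = user_movie.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users with no ratings
    unit = values / norms    # scale each user's rating vector to length 1
    return pd.DataFrame(unit @ unit.T,
                        index=user_movie.index,
                        columns=user_movie.index)


def most_similar_users(user_movie: pd.DataFrame, user: str, k: int = 1) -> list:
    """Return the k users whose rating vectors are closest to `user`."""
    sims = cosine_similarity_matrix(user_movie)[user].drop(user)
    return list(sims.sort_values(ascending=False).head(k).index)


similarity = cosine_similarity_matrix(ratings)
nearest = most_similar_users(ratings, "user_a", k=1)
print(nearest)
```

With this kind of table, recommendations for a user could be drawn from the movies rated by their nearest neighbors rather than from everyone in a cluster. sklearn also provides `sklearn.metrics.pairwise.cosine_similarity` for the same computation.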

Testing the system is difficult, but I was able to compare my results with those of other similar systems and look for any overlap. This is another limitation. If I could deploy this system, I would collect the ratings users give to their newly recommended movies and see whether their watching experience improves over time.
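One simple way to put a number on the overlap between two recommendation lists is the Jaccard index: the number of shared titles divided by the number of distinct titles across both lists. The sketch below is only an assumption about how such a comparison could be scored, not the comparison I actually ran, and the function name and placeholder titles are made up.

```python
def recommendation_overlap(list_a: list, list_b: list) -> float:
    """Jaccard index of two recommendation lists (0.0 = disjoint, 1.0 = identical)."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists: define the overlap as zero
    return len(set_a & set_b) / len(union)


# Placeholder titles, not real output from either system.
my_system = ["title_1", "title_2", "title_3"]
other_system = ["title_1", "title_2", "title_4"]

score = recommendation_overlap(my_system, other_system)
print(score)
```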

I ran the system three times, each time randomly selecting one of the thousands of users and returning that user’s top 5 recommended movies.

Interestingly, Eternal Sunshine and Memento were both on all three of these users’ recommendation lists. Running the system again for three more random users gives a different set of recommendations for each user, yet the lists still have many movies in common.

These results can indicate one of two things. The first possibility is that the recommendation system is very good and these users have more in common than we might expect from this dataset. The other possibility is that some of these movies, for some reason, carry a lot of weight and appear constantly. This could be valid, as many of the movies were critically acclaimed and are therefore enjoyed by many people. Many high ratings can push these movies to the top of anyone’s list. But does it follow that a randomly selected user should be recommended these movies just because a large group of people liked them? Even for a critically acclaimed movie like Shrek, if an individual user has not seen it, it may be because they don’t like green ogres. In fact, 3 of the 5 recommended movies for user 89668 are animated movies. Maybe user 89668 hates animated movies, which is why they have not seen them even though they share other interests with users in their cluster.
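To see how widely rated movies can dominate, note that the ranking step in my code sums the ratings within a cluster, so a movie rated by many users can outrank one rated higher by fewer users. The sketch below uses a made-up two-movie cluster to contrast ranking by sum with ranking by mean rating. The mean-based ranking is shown only for comparison and is not part of my system.

```python
import pandas as pd

# Hypothetical cluster: rows are users, columns are movies, 0 means "not rated".
cluster_ratings = pd.DataFrame(
    {"popular_movie": [3, 3, 3, 3], "niche_movie": [5, 0, 0, 5]},
    index=["u1", "u2", "u3", "u4"],
)

# Keep only actual ratings; unrated cells become NaN and are skipped below.
rated = cluster_ratings[cluster_ratings > 0]

# Ranking by summed ratings rewards movies that many users have rated.
by_sum = rated.sum(axis=0).sort_values(ascending=False)

# Ranking by mean rating ignores how many users rated the movie.
by_mean = rated.mean(axis=0).sort_values(ascending=False)

print(by_sum.index[0], by_mean.index[0])
```

Neither ranking is obviously right: the sum favors broadly watched movies, while the mean can be swayed by a handful of ratings.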

Below is the code that produced the above results:

import random

import pandas as pd
from sklearn.cluster import KMeans

# NOTE: movie_actor_map (movie id -> {'movie': title, ...}) is assumed to be defined elsewhere in the notebook.

def get_user(df: pd.DataFrame) -> str:
    # Pick one user at random from the index
    return random.choice(list(df.index))

def get_user_cluster(df: pd.DataFrame, selected_user: str) -> int:
    # Look up the cluster label assigned to the user by run_kmeans
    return df.loc[selected_user, 'predicted_cluster']

def run_kmeans(df: pd.DataFrame, clusters: int) -> pd.DataFrame:
    # Transpose so that each row is a user, then cluster the users
    new_df = df.T.copy()
    kmeans = KMeans(n_clusters=clusters, random_state=42).fit(new_df)
    new_df['predicted_cluster'] = kmeans.labels_
    return new_df

def get_not_watched_series(df: pd.DataFrame, selected_user: str) -> list:
    user_series = df.loc[selected_user].drop('predicted_cluster')
    return list(user_series[user_series == 0].index)  # A rating of 0 means not watched

def get_watched_series(df: pd.DataFrame, selected_user: str) -> list:
    user_series = df.loc[selected_user].drop('predicted_cluster')
    return list(user_series[user_series > 0].index)  # Any rating above 0 means watched

def from_group_user_not_seen_movies(new_df: pd.DataFrame, seen_movies: list, user_cluster: int, selected_user: str) -> pd.Series:
    group = new_df[new_df['predicted_cluster'] == user_cluster]  # Select the user's cluster
    group = group.drop(index=[selected_user])  # Drop the target user from the group
    group = group.drop(columns=['predicted_cluster'] + seen_movies)  # Drop the cluster label and movies the target user has already seen
    # Sum the remaining ratings per movie and rank the movies from highest to lowest total
    return group[group > 0].sum(axis=0).sort_values(ascending=False)

def find_movies(recommendation_listing_ids: list) -> list:
    # Map movie ids to movie titles
    return [movie_actor_map[movie_id]['movie'] for movie_id in recommendation_listing_ids]

def get_recommendations_for_user(df: pd.DataFrame, clusters: int) -> list:
    df = run_kmeans(df, clusters)
    selected_user = get_user(df)
    print(f"For user {selected_user}: ")
    user_cluster = get_user_cluster(df, selected_user)
    not_watched_series = get_not_watched_series(df, selected_user)  # Computed for reference; not used below
    seen_movies = get_watched_series(df, selected_user)
    recommendation_listing_ids = list(from_group_user_not_seen_movies(df, seen_movies, user_cluster, selected_user).index)
    movie_lists = find_movies(recommendation_listing_ids)
    return movie_lists[:5]  # Top 5 recommendations

Written by Data Science Derek (Leckner)