Published in The Startup

Recommendation System for Movies — MovieLens | Grouplens

Source: Thibault Penin

Movie Ratings Database

All the files in the MovieLens 25M Dataset archive, extracted (unzipped) in July 2020.

Warning!

What happens to your computer when large Pandas DataFrames are merged together…

Data Wrangling

MOVIES.CSV

In this image of the processed movies.csv, I extracted the year from the title and computed a few statistics on the ratings given to each movie, along with the number of times each movie was rated. However, this information was not used in the machine learning steps. In the genre columns, “0” means that the movie was not categorized in that genre and “1” means that it was.
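As a rough illustration of the wrangling described above, here is a minimal pandas sketch. It assumes the standard MovieLens layout for movies.csv (`movieId`, `title`, `genres`, with genres separated by `|`); the toy rows, the `processed` name, and the regular expression are my own assumptions, not the article's actual code.

```python
import pandas as pd

# Toy stand-in for movies.csv (hypothetical rows in the MovieLens layout).
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)", "Some Untitled Film"],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
        "Drama",
    ],
})

# Pull a four-digit year in parentheses from the end of the title.
# Titles without a year become <NA> instead of raising an error.
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

# One column per genre: 1 if the movie is in that genre, 0 otherwise.
genre_flags = movies["genres"].str.get_dummies(sep="|")

processed = pd.concat([movies.drop(columns="genres"), genre_flags], axis=1)
print(processed)
```

Each genre becomes its own 0/1 column, which matches the encoding described in the caption; per-movie rating statistics would be joined on `movieId` in a similar way.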

TAGS.CSV

“so bad it’s good” is now gone from the filtered tags.

RATINGS.CSV: Defining Like and Dislike

Feature Engineering: Who Liked What and Disliked What?

Genres Model: Scaling Genres Interests

The top image shows the liked-genre profiles and the bottom image shows the disliked-genre profiles for a few users.

Tags Model: Phrase Vectorization

A common tag was considered to be a tag assigned 35 or more times to any movie.
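The filter in that caption could be sketched as follows. This is only an assumption about how the threshold was applied: it counts each tag across all movies and keeps tags that occur at least 35 times. The toy data and the names `MIN_TAG_COUNT` and `filtered_tags` are hypothetical.

```python
import pandas as pd

MIN_TAG_COUNT = 35  # threshold stated in the article

# Toy stand-in for tags.csv: one row per (tag, movie) assignment.
rows = (
    [("sci-fi", m) for m in range(40)]
    + [("atmospheric", m) for m in range(35)]
    + [("rare tag", m) for m in range(34)]
)
tags = pd.DataFrame(rows, columns=["tag", "movieId"])

# Count how many times each tag was assigned overall.
tag_counts = tags["tag"].value_counts()

# Keep only the rows whose tag meets the threshold.
common_tags = tag_counts[tag_counts >= MIN_TAG_COUNT].index
filtered_tags = tags[tags["tag"].isin(common_tags)]

print(sorted(filtered_tags["tag"].unique()))
```

With this toy data, the tag used 34 times is dropped while the one used exactly 35 times is kept.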
The DataFrame contains the top 20 tags with the highest counts among the movies each user rated, computed separately for liked and disliked movies. The values in the cells are the tag vectors, with the most liked/disliked tag at position “X_0” and the counts descending towards “X_19”. (There were too many columns to fit in the image above, but the naming scheme is the same for the “LIKE_X” and “DISLIKE_X” columns.)
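One way such a per-user table could be assembled is sketched below. It is a guess at the procedure, not the article's code: the `liked` and `movie_tags` frames are hypothetical toy inputs, `TOP_N` is set to 3 to keep the output small (the article uses 20), and the cells hold raw tag strings rather than the tag vectors shown in the image.

```python
import pandas as pd

TOP_N = 3  # the article keeps 20; 3 keeps this toy example readable

# Hypothetical inputs: which movies each user liked, and the tags on each movie.
liked = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})
movie_tags = pd.DataFrame({
    "movieId": [10, 10, 10, 20, 20],
    "tag": ["sci-fi", "space", "sci-fi", "sci-fi", "robots"],
})

# Count how often each tag occurs across the movies a user liked,
# then order tags within each user from most to least frequent.
counts = (
    liked.merge(movie_tags, on="movieId")
    .groupby(["userId", "tag"])
    .size()
    .reset_index(name="n")
    .sort_values(["userId", "n", "tag"], ascending=[True, False, True])
)

# Rank 0 is the most frequent tag for that user.
counts["rank"] = counts.groupby("userId").cumcount()
top = counts[counts["rank"] < TOP_N]

# Spread the ranks into columns LIKE_0, LIKE_1, ...
profile = top.pivot(index="userId", columns="rank", values="tag")
profile.columns = [f"LIKE_{r}" for r in profile.columns]
print(profile)
```

Running the same steps on the disliked movies, with a `DISLIKE_` prefix, would give the second half of the table. Users with fewer than `TOP_N` distinct tags end up with missing values in the trailing columns.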
Like the user tag profiles, the movie tag profiles only consider the top five tags, because many movies do not have enough tags to consider more than five.

Model Training

Genres Model: Neural Network/Deep Learning

A decaying learning rate was implemented because the output is in the range of 0–1 (the loss is small as a result and less intuitively informative). Validation wasn’t necessary but was implemented for an “academic feel”.
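The article does not say which decay schedule or framework was used, so the following is only a sketch of one common choice, exponential decay, written in plain Python. The function name and all parameter values are assumptions for illustration.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` updates, shrinking by `decay_rate`
    every `decay_steps` updates (smoothly, not in a staircase)."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Hypothetical settings: start at 0.01 and halve every 10 steps.
schedule = [exponential_decay(0.01, 0.5, 10, s) for s in (0, 10, 20)]

for step, lr in zip((0, 10, 20), schedule):
    print(f"step {step:>2}: lr = {lr:.4f}")
```

Starting with larger steps and shrinking them over time lets the optimizer keep making fine adjustments once the loss is already small. Deep learning frameworks generally ship equivalent built-in schedules, for example Keras's `ExponentialDecay`, which uses the same formula when its staircase option is off.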

Tags Model: Random Forest

WARNING!

Combined Model: Linear Regression

Results: Statistical Analysis of Models

Top 10 Movie Recommendations: User 6550

The top 10 movie recommendations for user 6550.
The genres liked (upper row) and disliked (lower row) by user 6550.

Conclusion
