Recommendation Systems: Movie Recommendation

Andrew Dziedzic
Web Mining [IS688, Spring 2022]
4 min read · Apr 16, 2022


A recommendation system suggests services or products that a user is likely to be interested in. Such systems appear in many settings, most commonly on social media, video and music platforms, and online shopping sites. The motivation for the system described here is to let a user see which movies are similar to a movie they enter as input. If a user is curious about what movies resemble one they have seen, or wants to watch more movies of the same type right away, this recommendation system generates that information. The user receives a list of similar movies, together with a score showing how close each one is to the movie that was entered.

The recommendation system used here is a content-based filtering recommender. Recommendations are generated from the similarity of item contents, where “content” refers to the attributes of the things a user likes. The system uses those attributes to recommend other items the user might enjoy. The data source is a single raw CSV file from a Kaggle dataset that contains metadata for over 45,000 movies released on or before July 2017. Data points include cast, crew, budget, revenue, languages, and countries.

The recommendation system is written in Python, using Jupyter Notebook in an Anaconda environment. The procedure has three steps: create a TF-IDF matrix, compute the cosine similarity matrix from it, and make suggestions based on the similarity scores.
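The three steps can be sketched on a tiny made-up dataset. This is a minimal illustration rather than the notebook's actual code: the toy titles, the `overview` text column, and the `recommend` helper are assumptions introduced here, since the article does not state which column was vectorized.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the Kaggle CSV; the "overview" column is an assumption.
df = pd.DataFrame({
    "title": ["War Story", "War Story 2", "Puppet Tale", "Dog Days"],
    "overview": [
        "A young soldier faces the horrors of war in the jungle",
        "The soldier returns to the jungle for one more war",
        "A wooden puppet dreams of becoming a real boy",
        "A boy rescues a dog and fights to keep him",
    ],
})

# Step 1: TF-IDF matrix (rows = movies, columns = vocabulary terms).
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(df["overview"].fillna(""))

# Step 2: pairwise cosine similarity (rows = movies, columns = movies).
cosine_sim = cosine_similarity(tfidf_matrix)

# Step 3: map each title to its row position, then rank the other movies.
indices = pd.Series(df.index, index=df["title"])

def recommend(title, n=10):
    """Return the n movies most similar to `title`, with their scores."""
    movie_index = indices[title]
    scores = pd.DataFrame({"title": df["title"], "score": cosine_sim[movie_index]})
    scores = scores.drop(index=movie_index)  # skip the query movie itself
    return scores.sort_values(by="score", ascending=False).head(n)

top = recommend("War Story", n=2)
print(top)
```

With unique titles, `indices[title]` returns a single row position. If a dataset contains duplicate titles, the lookup returns several positions and would need to be disambiguated first.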

The queries below are based on the following three (3) movie titles: Platoon, Shiloh, and Pinocchio. For each query, I return the top ten (10) recommended movies along with their similarity scores. I judge the recommendation system to be performing well because the movies with the highest cosine similarity scores are returned in the correct order for the input movie. The proposed recommendations do in fact look very similar to these three inputs, as well as to other titles that I tested and experimented with.
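The ordering claim can be checked mechanically. The sketch below, on a hypothetical four-document corpus, verifies a few properties that any TF-IDF cosine similarity matrix should satisfy and that a ranked result list is sorted from highest to lowest score. These checks only confirm that the ranking is internally consistent; whether the recommendations feel relevant is still judged by inspection.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus standing in for the movie text.
docs = [
    "soldiers fight a war in the jungle",
    "a war film about young soldiers",
    "a wooden puppet wants to be a real boy",
    "a boy and his dog",
]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
cosine_sim = cosine_similarity(tfidf_matrix)

# Every non-empty document is maximally similar to itself.
assert np.allclose(np.diag(cosine_sim), 1.0)
# Similarity does not depend on the order of the pair.
assert np.allclose(cosine_sim, cosine_sim.T)
# TF-IDF weights are non-negative, so scores fall between 0 and 1.
assert cosine_sim.min() >= 0.0 and cosine_sim.max() <= 1.0 + 1e-9

# Rank all other documents against document 0, best match first.
query = 0
ranked = np.argsort(-cosine_sim[query])
ranked = ranked[ranked != query]      # drop the query itself
scores = cosine_sim[query][ranked]

# The returned scores must never increase down the list.
assert np.all(np.diff(scores) <= 0)
print(list(zip(ranked.tolist(), scores.round(3).tolist())))
```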

The main tools used while programming this recommendation system were pandas, TfidfVectorizer, and cosine_similarity. TfidfVectorizer in scikit-learn converts a collection of raw documents to a matrix of TF-IDF features; it is equivalent to CountVectorizer followed by TfidfTransformer. cosine_similarity computes the cosine similarity between the rows of two arrays (or sparse matrices), so each movie must first be represented as a vector. Understanding the initial data frame and making the correct adjustments to it throughout the code were extremely important as well.
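Both statements about scikit-learn can be verified on a small example. The sketch below uses a hypothetical three-sentence corpus to confirm that TfidfVectorizer produces the same matrix as CountVectorizer followed by TfidfTransformer (with default settings), and that cosine_similarity matches the textbook formula, the dot product of two vectors divided by the product of their norms.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus; any list of strings would do.
docs = [
    "soldiers fight a war in the jungle",
    "a war film about young soldiers",
    "a wooden puppet wants to be a real boy",
]

# One step: raw text -> TF-IDF matrix.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> term counts -> TF-IDF matrix.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same_tfidf = np.allclose(one_step.toarray(), two_step.toarray())

# Cosine similarity by hand for the first two documents.
a = one_step[0].toarray().ravel()
b = one_step[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The same pair through scikit-learn (accepts sparse rows directly).
library = cosine_similarity(one_step[0], one_step[1])[0, 0]

print(same_tfidf, manual, library)
```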

As with previous projects, the data has limitations; with this dataset, however, I see few to none. The dataset is very large and contains many fields, both quantitative and qualitative. There are many more experiments that could be run for further analysis, such as KNN, K-Means, Pearson correlation, logistic regression, and decision trees.

The main takeaways from this analysis are the recommendations for movies similar to Platoon, Shiloh, and Pinocchio. A user can enter a specific movie title and receive a list of the top ten (10) recommendations. The system can be used over time, and as the dataset grows, more recommendations can be provided. The top movie recommendation most similar to Platoon is All Quiet on the Western Front. The top movie recommendation most similar to Pinocchio is Pinocchio’s Christmas. The top movie recommendation most similar to Shiloh is Shiloh 2: Shiloh Season.

Please find below sections of Python code used and the corresponding output:

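import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Print wide data frames on one line instead of wrapping the columns
pd.set_option('display.expand_frame_repr', False)
# Preview the first rows of the loaded data frame
df.head()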
Above is the view of the data in Jupyter Notebook
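cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape
(45466, 75827)
# Note: 45466 x 75827 is movies x TF-IDF terms, which matches the shape of tfidf_matrix;
# cosine_similarity(tfidf_matrix, tfidf_matrix) itself is square (movies x movies).
indices["Pinocchio"]
27686
similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
similarity_scores.sort_values(by='score', ascending=False).head(10)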
Above is the output of the similarity scores
df['title'][movie_indices]
Above is the list of top ten (10) recommendations for the movie: Pinocchio
Above is the list of top ten (10) recommendations for the movie: Platoon
Above is the list of top ten (10) recommendations for the movie: Shiloh
