Movie Recommendation System
Creating a content-based movie recommendation system.
- Introduction: Movies have always been a substantial part of entertainment, and especially now, in the middle of a pandemic, they have played a large role in helping people relax. But once you finish a movie, the next big question is “WHAT NEXT?” It is a question that confuses almost everyone, and I created this project to help answer it. In this project we are going to create a movie recommendation system based on the content the user watches. The model uses content-based filtering to recommend movies that are similar to the ones the user already prefers.
- Types of filtering techniques used: When making a recommendation engine you have to choose a filtering technique that decides how items are matched to users. There are two main filtering techniques used in recommendation engines:
2.1 Collaborative Filtering: Suppose there are two similar users, U1 and U2. U1 buys an iPhone and, along with it, a pair of earphones. When U2 later buys an iPhone, U2 is recommended the same earphones. To sum up, collaborative filtering is a technique that picks out items a user might like on the basis of the reactions of similar users.
2.2 Content Based Filtering: Suppose there are two users, U1 and U2. U1 has watched movies M1 (Action), M2 (Adventure) and M3 (Action) and rated them 5 stars, 4 stars and 3 stars respectively. Now suppose U2 has watched M4 (Action) and M2 (Adventure). U2 will be recommended M1, because it is an action movie, like M4, and has the highest rating. To sum up, content-based filtering recommends items whose attributes (here, the genre) resemble those of items the user has already watched or liked.
3. Importing the Libraries and data set: We will import the libraries Pandas and NumPy. Pandas is used for data manipulation and analysis, and NumPy provides support for large multidimensional arrays.
DATASET: I imported the dataset from Kaggle https://www.kaggle.com/tmdb/tmdb-movie-metadata . As shown in the image there are two datasets : 1. Credits and 2. Movies
4. Merging the datasets:
If we look at both datasets, we can see that the “movie_id” column of the credits dataset holds the same values as the “id” column of the movies dataset. So we can rename “movie_id” to “id” in the credits dataset and merge the two datasets on the column “id”, as shown in image 2.
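The rename-and-merge step described above can be sketched as follows. This is a minimal illustration on two tiny hand-made tables, not the real Kaggle files; the rows and the variable names (`credits`, `movies`, `merged`) are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two Kaggle tables (hypothetical rows, not real data).
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "cast": ["cast A", "cast B", "cast C"],
})
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "original_title": ["Movie One", "Movie Two", "Movie Three"],
    "overview": ["A heist goes wrong.", None, "Two friends cross a desert."],
})

# Rename the key column so both tables share the name "id", then merge on it.
credits_renamed = credits.rename(columns={"movie_id": "id"})
merged = movies.merge(credits_renamed, on="id")

print(merged.columns.tolist())
```

After the merge, every row carries the columns of both tables for the same movie.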
5. Data Cleaning: Drop all the columns of the dataset that are not useful for prediction, which is almost every column except “overview”, “original_title”, “id”, “genres”, and “original_language”.
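One way to do this cleaning is to select the columns to keep rather than listing every column to drop. The sketch below assumes a merged table named `merged`; the two extra columns (`budget`, `homepage`) only stand in for the many columns that get discarded.

```python
import pandas as pd

# Hypothetical merged table with two columns we do not need.
merged = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["Movie One", "Movie Two"],
    "original_language": ["en", "fr"],
    "genres": ["Action", "Drama"],
    "overview": ["A heist goes wrong.", "A family reunites."],
    "budget": [100, 200],
    "homepage": [None, None],
})

# Keep only the columns that are useful for the recommendations.
keep = ["id", "original_title", "original_language", "genres", "overview"]
cleaned = merged[keep].copy()

print(cleaned.columns.tolist())
```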
6. Creating the matrix of vectors: We will now use the overview column (a summary of the plot) to pick up keywords, so that we can recommend movies with similar plots. The overview column holds the content we will extract to make the recommendations. To compare movies, each overview has to be turned into a numeric vector; stacked together, these vectors form a matrix with one row per movie. In this case we will build it using TfidfVectorizer.
6.1 TfidfVectorizer: TF-IDF is an NLP technique used to convert text to vectors. We will import TfidfVectorizer from sklearn.feature_extraction.text . It will create a document-term matrix from the overview column. The parameters used here fall into three groups:
- ngram_range: with a range of 1–3, the model uses single words as well as sequences of two and three consecutive words from the overview column as features.
- stop_words = “english”: this removes very common English words such as pronouns, articles and conjunctions.
- strip_accents, token_pattern, analyzer: these control how the text is split into tokens. strip_accents normalizes accented characters, analyzer selects word-level (rather than character-level) features, and token_pattern defines what counts as a word, which is how punctuation marks get dropped.
6.2 Treating NaN values: The overview column can contain NaN values (for example, for movies with no summary). The vectorizer expects text, so before vectorizing we replace them with blank values using .fillna(' ') .
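To see what these settings do, the sketch below configures a vectorizer and runs only its text-analysis step on a single sentence. Only the 1–3 n-gram range and the English stop-word list come from the description above; the values chosen for `strip_accents`, `analyzer` and `token_pattern`, and the sample text, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

overviews = pd.Series([
    "The hero saves the city.",
    np.nan,  # a missing overview
])

# Replace missing overviews with a blank string so every entry is text.
overviews = overviews.fillna(" ")

tfv = TfidfVectorizer(
    ngram_range=(1, 3),       # single words, pairs and triples of words
    stop_words="english",     # drop very common English words
    strip_accents="unicode",  # normalize accented characters (assumed value)
    analyzer="word",          # word-level features (assumed value)
    token_pattern=r"\w{1,}",  # a token is one or more word characters (assumed value)
)

# build_analyzer() returns the preprocessing, tokenizing and n-gram pipeline.
analyze = tfv.build_analyzer()
tokens = analyze(overviews.iloc[0])
print(tokens)
```

The printed list contains "hero", "saves" and "city" plus their two- and three-word combinations; "the" and the full stop are gone.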
7. Converting it into a sparse matrix: We will now convert this column into a sparse matrix using the fit_transform function. A sparse matrix is a matrix in which most values are zero and only some are non-zero. The non-zero values are the TF-IDF weights computed by TfidfVectorizer: a term's weight grows with how often it appears in an overview and shrinks with how many overviews contain it. With the default row normalization, every non-zero value lies between 0 and 1.
Looking at the shape of the matrix, we observe that there are more than 4500 records and more than 10000 columns, each of which is a word or combination of words created using ngram_range.
8. Finding similarity between different movies: We will import a function known as sigmoid_kernel from sklearn.metrics.pairwise. It takes the dot product of two vectors and passes it through a sigmoid function. A sigmoid function squashes its input into a bounded range, here between 0 and 1, using the simple formula given in the figure below. Note that scikit-learn's sigmoid_kernel uses the hyperbolic tangent, tanh(gamma * <x, y> + coef0), as its sigmoid; with non-negative TF-IDF vectors and the default settings, its output also stays between 0 and 1.
8.1 Applying Sigmoid: The kernel is used to measure the similarity of one movie's overview/summary to another's. After passing through the sigmoid, each pair of movies gets a similarity value between 0 and 1; the higher the value, the greater the similarity. In the code, we pass the same matrix as both arguments to the sigmoid kernel in order to get the similarities between all pairs of movies. The similarity is calculated from the TF-IDF vector values. As shown in image 6, sig[0] holds the similarity of the first overview to the overviews of all the movies.
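A small sketch of this step, assuming three made-up overviews and the default kernel settings, looks like this. The first two overviews share the words "detective" and "city", so they should score higher with each other than with the third.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up overviews standing in for the real column.
overviews = [
    "A detective hunts a killer in the city.",
    "A detective falls in love in the city.",
    "Robots fight aliens on a distant planet.",
]
tfv_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# Passing the same matrix twice compares every overview with every other one.
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)

print(sig.shape)  # one row and one column per movie
print(sig[0])     # similarity of the first overview to all overviews
```

With the default settings the scores sit close together, so it is their ordering, not their absolute size, that matters for the recommendations.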
9. Creating Indices: We will build an index that maps every movie title in the dataset to its row position, dropping duplicate titles so that each title has a unique index value. This will be very useful in the upcoming part of the code.
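One possible way to build such a lookup is a pandas Series keyed by title, as sketched below. The titles are made up, and keeping the first occurrence of a repeated title is an assumption of this sketch, not necessarily what the original code does.

```python
import pandas as pd

# Hypothetical titles; "Movie One" appears twice on purpose.
movies = pd.DataFrame({
    "original_title": ["Movie One", "Movie Two", "Movie One", "Movie Three"],
})

# Map each title to its row position in the DataFrame.
indices = pd.Series(movies.index, index=movies["original_title"])

# Keep only the first row for any title that appears more than once.
indices = indices[~indices.index.duplicated(keep="first")]

print(indices["Movie Two"])
```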
10. Getting recommendations: We have a function here that finds the most similar movies; let us see how it works. The function takes a movie title and looks up its index value (using the indices). That index selects the movie's row of the sigmoid matrix, which holds its similarity to every movie. The model converts these values into a list of (index, score) pairs using the list(enumerate(sig[])) expression. The list is then sorted in descending order of score using the sorted() function. Finally we pick the top 7 similarity scores and, using the movie_indices, look up the original titles of those movies.
11. Result: The top 7 movies with the highest similarity according to our model are displayed.
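Putting the steps together, a recommendation function along these lines could look like the sketch below. The function name `give_rec`, its `top_n` parameter and the four toy movies are hypothetical, and skipping the queried movie itself is a choice made for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Hypothetical movies, not real data.
movies = pd.DataFrame({
    "original_title": ["City Cop", "City Love", "Robot War", "Alien Planet"],
    "overview": [
        "A detective hunts a killer in the city.",
        "A detective falls in love in the city.",
        "Robots fight aliens on a distant planet.",
        "Aliens invade a distant planet.",
    ],
})

tfv_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
indices = pd.Series(movies.index, index=movies["original_title"])


def give_rec(title, top_n=7):
    """Return the titles of the top_n movies most similar to `title`."""
    idx = indices[title]                     # row position of the movie
    sig_scores = list(enumerate(sig[idx]))   # (movie index, similarity) pairs
    sig_scores = sorted(sig_scores, key=lambda pair: pair[1], reverse=True)
    # Skip the queried movie itself, then keep the top_n best matches.
    sig_scores = [pair for pair in sig_scores if pair[0] != idx][:top_n]
    movie_indices = [i for i, _ in sig_scores]
    return movies["original_title"].iloc[movie_indices]


print(give_rec("City Cop", top_n=2))
```

For "City Cop" the first suggestion is "City Love", the only other overview that shares words with it.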
Source Code : https://github.com/vishnavchhabra/content-based-movie-reccomendation