Ranking Movies by Similarity

Published in

Web Mining [IS688, Spring 2021]

6 min read · Apr 24, 2021

Movies have become hugely popular over the last century and are a favorite pastime for many people across the world. Because they span so many genres, almost everyone can find something interesting. Many movies are produced every year, but some are far more popular than others. Movies also influence the masses, which makes them an effective way of spreading information about things that affect the world.

I have always been intrigued by how Google and other search engines show movies that are similar to the one I searched for. Let’s take an example:

Google search results for my movie search.

When I searched for “Extraction” on Google, it also showed me a list of movies with similar genres, actors, plots, directors, and so on. Isn’t it amazing how a search is narrowed down to show only the things similar to it?

Another example is Amazon: the majority of their sales are achieved by showing users items similar to what they searched for.

In this article, I discuss how similarity functions in Python can be used to determine the similarity between movies. Before we begin, let’s define what a similarity function is and how it works.

Similarity Functions:

Similarity functions measure the ‘distance’ between two objects, such as vectors, numbers, or pairs. The distance indicates how similar the two objects are: they are deemed similar if the distance between them is small, and dissimilar if it is large. Common measures are:

Euclidean Distance: Straight-line distance between two points; used for dense or continuous data.
Manhattan Distance: Sum of the absolute differences between the coordinates of two points.
Minkowski Distance: Generalized metric that includes Euclidean distance and Manhattan distance as special cases.
Cosine Similarity: Normalized dot product of two vectors; the cosine distance is 1 minus this value.
Jaccard Similarity: Size of the intersection of two sets divided by the size of their union; used to find similarities between sets.
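To make these measures concrete, here is a minimal sketch that computes each of them with NumPy. The vectors `a` and `b`, the two genre sets, and the helper names `minkowski` and `jaccard` are made up for illustration and are not part of the movie dataset.

```python
import numpy as np

# Two illustrative vectors (not taken from the movie data).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))


euclidean = float(np.linalg.norm(a - b))    # sqrt(3^2 + 4^2 + 0^2)
manhattan = float(np.sum(np.abs(a - b)))    # 3 + 4 + 0

# Cosine similarity is the normalized dot product; the distance is 1 minus it.
cosine_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
cosine_dist = 1.0 - cosine_sim


def jaccard(s, t):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(s & t) / len(s | t)


genres_a = {"action", "thriller", "crime"}
genres_b = {"action", "crime", "drama"}
jaccard_sim = jaccard(genres_a, genres_b)   # 2 shared genres out of 4 distinct

print(euclidean, manhattan, cosine_sim, jaccard_sim)
```

Setting `p=1` or `p=2` in `minkowski` reproduces the Manhattan and Euclidean values, which is what "generalized" means in the list above.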

We will use the cosine similarity metric to determine the similarity between movies.

Cosine Similarity Metrics:

The cosine similarity metric is the normalized dot product of two vectors. By computing it, we are effectively finding the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.

The code below shows an example of how cosine similarity is computed in Python.
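A minimal sketch of such a computation, assuming plain NumPy vectors, could look like this. The helper name `cosine_sim` and the example vectors are illustrative choices, not necessarily what the original snippet used.

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Same orientation, different magnitude.
same_direction = cosine_sim([1, 2], [2, 4])
# Vectors at 90 degrees to each other.
orthogonal = cosine_sim([1, 0], [0, 3])
# Diametrically opposed vectors.
opposite = cosine_sim([1, 2], [-1, -2])

print(same_direction, orthogonal, opposite)
```

The three cases correspond to the similarities of 1, 0 and -1 described above, up to floating-point rounding.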

Data Source:

We use data directly available on the “GroupLens Research” website. The downloaded data is stored in .csv format and is read in Python using the pandas library. Let’s have a glance at our data:

Some of the attributes in this dataset are index, genre, name, language, url, id and so on. The dataset provides a lot of extra information about each movie that we do not need, so we drop most of it and use only the keywords, cast, genres and director columns.
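As a rough sketch of this loading step, the snippet below reads a small in-memory CSV with pandas so that it runs on its own. The rows are invented placeholders, and the file name `movie_dataset.csv` in the comment is hypothetical; substitute the path of the file you downloaded.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. With the real data this would be something
# like pd.read_csv("movie_dataset.csv"), where the file name is hypothetical.
sample_csv = io.StringIO(
    "index,title,genres,keywords,cast,director\n"
    "0,Movie A,Action Thriller,hero city,Actor One Actor Two,Director X\n"
    "1,Movie B,Drama,,Actor Three,Director Y\n"
)

df = pd.read_csv(sample_csv)

# A quick look at the first rows and the available columns.
print(df.head())
print(df.columns.tolist())
```

The empty keywords field in the second row is read as NaN, which is why missing values have to be handled before the text features are combined.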

Data Cleaning and Pre-Processing:

It is important to address the NaN values in our dataset to avoid errors, so we fill all NaN values with an empty string. We then define a function that combines the selected features (keywords, cast, genres, director) into a single string per movie, which is what the similarity is computed on. This function is applied to each row of the dataset.
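These two steps can be sketched as follows. The small DataFrame is a made-up stand-in for the real data, and the names `combine_features` and `combined_features` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the movie data, with some missing values.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "keywords": ["hero city", None],
    "cast": ["Actor One", "Actor Two"],
    "genres": ["Action Thriller", "Drama"],
    "director": ["Director X", None],
})

features = ["keywords", "cast", "genres", "director"]

# Replace missing values with an empty string so string joining does not fail.
for feature in features:
    df[feature] = df[feature].fillna("")


def combine_features(row):
    """Join the selected text columns of one row into a single string."""
    return " ".join(row[feature] for feature in features)


# axis=1 applies the function to each row rather than to each column.
df["combined_features"] = df.apply(combine_features, axis=1)
print(df["combined_features"].tolist())
```

Without the `fillna("")` step, joining a NaN value with strings would raise a `TypeError`.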

Scikit-Learn: Machine Learning Software Library

Wikipedia: Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Scikit-learn (sklearn) is a simple and efficient set of machine learning tools for predictive analysis. To identify the similarities between the movies, we import CountVectorizer and cosine_similarity from sklearn.

CountVectorizer():

Scikit-learn’s CountVectorizer is used to convert a collection of text documents to a matrix of term/token counts, with one row per document. It also enables the pre-processing of text data prior to generating the vector representation. This functionality makes it a highly flexible feature representation module for text.

code for vectorizing the data and obtaining count matrix
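A minimal sketch of this step, assuming the combined feature strings are available as a list, might look like the following. The three strings are invented examples rather than rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for the combined feature strings of three movies.
combined_features = [
    "batman gotham action christian bale",
    "batman joker action christian bale",
    "romance paris drama",
]

cv = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
count_matrix = cv.fit_transform(combined_features)

# vocabulary_ maps each term to its column index; sort the terms by that index.
vocabulary = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocabulary)
print(count_matrix.toarray())
```

Each row of `count_matrix` is one movie and each column is one token, so movies that share many tokens end up with similar rows.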

After obtaining the count matrix, we apply the cosine similarity explained earlier to it.

Cosine = cosine_similarity(count_matrix)
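To illustrate what this call returns, here is a small self-contained sketch on invented feature strings. Only `CountVectorizer` and `cosine_similarity` come from scikit-learn; the `docs` list is an assumption for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented feature strings: the first two share four of their five tokens.
docs = [
    "batman gotham action christian bale",
    "batman joker action christian bale",
    "romance paris drama",
]

count_matrix = CountVectorizer().fit_transform(docs)

# With a single argument, every row is compared against every other row,
# giving a square matrix where entry [i, j] is the similarity of movies i and j.
Cosine = cosine_similarity(count_matrix)
print(Cosine.round(2))
```

The result is symmetric with ones on the diagonal, since every movie is identical to itself, and movies with no tokens in common score 0.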

When we observe our dataset, we notice that every title is assigned a particular index. We need to look up a movie title from its index and an index from a movie title, so we define two Python functions that map between the two.

Functions to define index and title of the movie.
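One plausible way to write these two helpers, assuming the DataFrame has `index` and `title` columns, is sketched below. The function names and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the movie data; only the two columns needed here.
df = pd.DataFrame({
    "index": [0, 1, 2],
    "title": ["The Dark Knight Rises", "Toy Movie A", "Toy Movie B"],
})


def get_title_from_index(index):
    """Return the title of the movie stored at the given index."""
    return df[df["index"] == index]["title"].values[0]


def get_index_from_title(title):
    """Return the index of the movie with the given title."""
    return df[df["title"] == title]["index"].values[0]


print(get_index_from_title("The Dark Knight Rises"))
print(get_title_from_index(1))
```

Both helpers raise an `IndexError` if nothing matches, so the title has to be spelled exactly as it appears in the data.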

The next step in our analysis is to take the title of the movie we are interested in and get the corresponding index.

Now, we will access the row corresponding to “The Dark Knight Rises” and obtain the similarity score of other movies.

similar_movies = list(enumerate(Cosine[movie_index]))

The above code enumerates the similarity scores, giving us a list of tuples of movie index and similarity score. We sort these in descending order of similarity score so that the most similar movies come first. Since the most similar movie to the current movie is the movie itself, we skip that first entry.

Let’s obtain the top 10 movies similar to “The Dark Knight Rises”.
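The sorting and selection can be sketched as follows. The titles and the similarity matrix are toy values invented so the snippet runs on its own, so the printed ranking says nothing about the real dataset.

```python
import numpy as np

# Toy titles and a hand-written similarity matrix (not real results).
titles = ["The Dark Knight Rises", "Toy Movie A", "Toy Movie B", "Toy Movie C"]
Cosine = np.array([
    [1.00, 0.80, 0.00, 0.45],
    [0.80, 1.00, 0.10, 0.30],
    [0.00, 0.10, 1.00, 0.20],
    [0.45, 0.30, 0.20, 1.00],
])

movie_index = titles.index("The Dark Knight Rises")

# Pair every movie index with its similarity to the chosen movie.
similar_movies = list(enumerate(Cosine[movie_index]))

# Highest similarity first.
sorted_similar = sorted(similar_movies, key=lambda pair: pair[1], reverse=True)

# Drop the chosen movie itself, then keep at most the ten best matches.
top_matches = [
    (titles[i], float(score)) for i, score in sorted_similar if i != movie_index
][:10]

for title, score in top_matches:
    print(f"{title}: {score:.2f}")
```

Filtering on `i != movie_index` instead of slicing off the first element is a small design choice: it still works if another movie ties with a score of 1.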

Conclusion:

This article shows how similarity metrics can be used to determine the similarity between movies based on various features. We could also apply Natural Language Processing (NLP) to the data and compute similarity between plots through text analysis.

These similarity metrics can be used in various fields, such as finding similar songs, books, etc. Companies such as Amazon, Netflix and YouTube use this kind of feature. It can be further expanded into a recommendation engine for web-based platforms so that companies can provide a better service to their users.

Github:

rohit131991/Movies-Similarity

github.com