Movie Watcher generated with Disco Diffusion

How AI recommends movies for you: a look under the hood using TF-IDF and Cosine Similarity.

Iva @ Tesla Institute
Artificialis

--

In this blog post, we will explore the use of two powerful natural language processing techniques, TF-IDF and Cosine Similarity, to build a movie recommendation system using the IMDB dataset. We will first discuss the concept of TF-IDF, which stands for Term Frequency-Inverse Document Frequency, and how it is used to represent text data in a numerical format. Next, we will delve into the concept of Cosine Similarity, a measure of similarity between two non-zero vectors, and how it can be used to compare the similarity of two pieces of text. Finally, we will put these techniques into practice by building a movie recommendation system that suggests similar movies to a given input movie based on their plot descriptions.

In the first part of the experiment, we will use a dataset from Kaggle; you can grab the link here:

On this dataset, we will test the TF-IDF algorithm and Cosine Similarity, and if it works, we will scrape more data from the IMDb site and test our recommendation system.

First, let’s import the modules we will use:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

Let’s load and explore the CSV file from Kaggle:

# Load the Kaggle IMDb Top 1000 dataset
data = pd.read_csv('/content/IMDB WITH BERT/imdb_top_1000.csv')
data.head()

The result is packed into a DataFrame and should look like this after data cleaning, which consists of dropping the columns we don’t need:

We are keeping just Genre, Overview, and Series_Title
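That cleaning step isn’t shown above; a minimal sketch of it (assuming the standard column names from the Kaggle file) could look like this:

# Keep only the columns the recommender needs
data = data[['Series_Title', 'Genre', 'Overview']]

With the columns trimmed down, we can vectorize the overviews: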
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore common English stop words
tfidf = TfidfVectorizer(stop_words='english')

# Replace missing overviews with an empty string
data['Overview'] = data['Overview'].fillna('')

# Build the sparse document-term matrix of TF-IDF scores
tfidf_matrix = tfidf.fit_transform(data['Overview'])

tfidf_matrix.shape

This code imports the TfidfVectorizer class from the sklearn.feature_extraction.text module.

It then creates an instance of the class called tfidf and sets the stop_words parameter to ‘english’, which tells the vectorizer to ignore common English words that contain little meaningful information (e.g. “the”, “and”, “is”).

It then replaces any missing values in the ‘Overview’ column of the data DataFrame with an empty string.

It then calls the fit_transform method on the tfidf object, passing in the ‘Overview’ column of the data DataFrame as the input. This creates a sparse matrix representation of the input text, where each row represents a document (in this case, a movie’s overview) and each column represents a word. Each element in the matrix is the tf-idf value of that word in that document.

The shape attribute of the resulting matrix is then printed, which returns a tuple of the number of rows and columns: for this dataset, 1,000 rows (one per movie) and one column per unique term in the vocabulary.

In summary, this code uses the TfidfVectorizer to create a sparse matrix representation of the text in the ‘Overview’ column of the data DataFrame, where each row represents a document and each column represents a word. Tf-idf is a measure of the importance of a word in a document, and is typically used to extract features from text for natural language processing and machine learning tasks.
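To make this concrete, here is a tiny, self-contained example (toy sentences of my own, not from the dataset) showing what the vectorizer produces:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'a mafia don hands power to his son',
    'a banker is sentenced to prison',
    'a mafia family fights a rival family',
]

vec = TfidfVectorizer(stop_words='english')
matrix = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # the learned vocabulary
print(matrix.shape)                 # (3 documents, number of unique terms)

Words shared between documents (like ‘mafia’) receive lower weights than words unique to a single document, which is exactly the behaviour we want when comparing plots.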

Next, we will use Scikit-Learn to compute the Cosine Similarity:

from sklearn.metrics.pairwise import linear_kernel

# Similarity between every pair of movie overviews
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

This code imports the linear_kernel function from the sklearn.metrics.pairwise module.

It then uses this function to compute the cosine similarity between all the rows of a matrix called tfidf_matrix and assigns the result to a variable called cosine_sim.

linear_kernel is a function that computes the pairwise dot products between the rows of two matrices. In this case, it computes the dot product of every row of tfidf_matrix with every other row, effectively measuring the similarity between every pair of overviews. This works as a cosine similarity because TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows equals the cosine of the angle between them; linear_kernel is simply a faster equivalent of cosine_similarity here.

Since tfidf_matrix is passed in twice, the result is a square matrix with the same number of rows and columns as the input matrix, where the element at position (i, j) is the cosine similarity between the i-th and j-th rows of the input matrix.
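A quick self-contained check of that equivalence (toy documents of my own):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

docs = ['the cat sat', 'the cat ran', 'dogs bark loudly']
X = TfidfVectorizer().fit_transform(docs)

# TF-IDF rows are unit-length, so dot products equal cosine similarities
print(np.allclose(linear_kernel(X, X), cosine_similarity(X, X)))  # True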

Cosine Similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. So, a cosine similarity of 1 means that the vectors point in the same direction, and a similarity of 0 means that the vectors are orthogonal (a 90° angle). Since tf-idf vectors are non-negative, our scores always fall between 0 and 1.

scheme of dot products for finding cosine similarity
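In plain numpy, the formula cos(θ) = (A · B) / (||A|| · ||B||) looks like this (arbitrary example vectors):

import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 5.0])  # orthogonal to a

print(cosine(a, b))  # 1.0
print(cosine(a, c))  # 0.0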

Let’s build a lookup from each Series_Title to its row index, dropping any duplicate titles:

# Map each title to its row index so movies can be looked up by name
indices = pd.Series(data.index, index=data['Series_Title']).drop_duplicates()
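For example (assuming the title exists in the CSV):

indices['The Godfather']  # returns the integer row position of that movie

That row position is exactly what the recommendation function below needs.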

And now let’s explain the recommendation function:

def recommendations(series_title, cosine_sim=cosine_sim):
    # Look up the row index of the input title
    idx = indices[series_title]

    # Pair every movie index with its similarity to the input movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip position 0 (the movie itself) and keep the top 10 matches
    sim_scores = sim_scores[1:11]

    movie_indices = [i[0] for i in sim_scores]

    return data['Series_Title'].iloc[movie_indices]

This is a function called recommendations which takes in two arguments:

series_title: a string with the title of the movie for which the function will generate recommendations.

cosine_sim: the cosine similarity matrix used to measure the similarity between movies.

The function starts by finding the index of the input title in the indices series. It then creates a list of tuples, where each tuple contains the index of a movie and its similarity score to the input movie. The list is sorted in descending order by similarity score and sliced to keep only the top 10 most similar movies (excluding the input movie itself, which is always its own best match).

Next, the function creates a list of indices of the top 10 most similar movies by extracting the first element of each tuple from the sliced list.

Finally, the function returns the ‘Series_Title’ column of the data DataFrame for the indices above.

So, overall, this function returns a list of movie titles that are most similar to the input title based on the cosine similarity score.

the recommendation result based on the movie ‘The Godfather’
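To reproduce the result above, you would simply call:

recommendations('The Godfather')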

Now let’s build a scraper. I found a great method for doing this in this article:

Here’s the snippet that calls the scraper and packs our scraped data into a DataFrame:

# Call the download function with the array of URLs called imageArr
download_stories(imageArr)

# Attach all the data to a pandas DataFrame. You can optionally write it to a CSV file as well
movieDf = pd.DataFrame({
    "Title": movie_title_arr,
    "Release_Year": movie_year_arr,
    "Genre": movie_genre_arr,
    "Synopsis": movie_synopsis_arr,
    "image_url": image_url_arr,
    "image_id": image_id_arr,
})

print('--------- Download Complete CSV Formed --------')

# movieDf.to_csv('file.csv', index=False)  # if you want to store the file
movieDf.head()

Moment of truth! Let’s now see if our algorithm works on data from the wild:
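To apply the algorithm to the scraped data, the same pipeline has to be refit on the new synopses. Here is a minimal sketch of that step, assuming movieDf from the snippet above is populated (the _wild names are mine):

# Fit TF-IDF on the scraped synopses instead of the Kaggle overviews
tfidf_wild = TfidfVectorizer(stop_words='english')
tfidf_matrix_wild = tfidf_wild.fit_transform(movieDf['Synopsis'].fillna(''))
cosine_sim_wild = linear_kernel(tfidf_matrix_wild, tfidf_matrix_wild)

# Title -> row index lookup for the scraped data
indices_wild = pd.Series(movieDf.index, index=movieDf['Title']).drop_duplicates()

def recommendations_wild(title, cosine_sim=cosine_sim_wild):
    idx = indices_wild[title]
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movieDf['Title'].iloc[movie_indices]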

The recommendation result on the scraped data

Quite neat, right? We got recommendations for 10 new movies based on the movie we picked!

If you want to replicate the experiment, here’s the Colab Notebook so you can try this out for yourself:

If you liked the article, follow me and subscribe for more content like this.

Cheers!
