Recommendation Systems: Content-Based Filtering

zeynep beyza ayman
Dec 6, 2022

In this article, I will discuss Content-Based Filtering, the second topic in my series on Recommendation Systems. You can find the first article in the series, on Association Rule Learning, at the following link:

Content-Based Filtering is one of the main methods used to build a Recommendation System. Similarities between products are calculated over their metadata, and the products most similar to a given product are recommended.

Metadata describes the features of a product or service: for example, the director, cast, and screenwriter of a movie; the author, back-cover text, or translator of a book; or the category information of a product.

Suppose we have the description of a movie that a user liked. To suggest another movie based on it, the description must be put into a mathematical form; that is, the text must be made measurable so that it can be compared with the descriptions of other movies.

We have various movies and a description for each one. To compare these descriptions, they must be vectorized. To do this, we build a matrix over all the movies (say m) and all the unique words appearing in their descriptions (say n): the unique words form the columns, the movies form the rows, and each cell holds how many times a given word occurs in a given movie's description. In this way, the texts become vectors.

Steps of Content-Based Filtering:

1. Represent Texts Mathematically (Text Vectorization):
  • Count Vector
  • TF-IDF

2. Calculate Similarities

1. Text Vectorization

Text vectorization is one of the most important topics underlying text processing, text mining, and natural language processing. Converting texts to vectors and computing similarities and distances over them forms the basis of much of the analytical world: once texts are represented as vectors, mathematical operations can be performed on them.

The two common ways to represent text as vectors are the Count Vector and TF-IDF.

- Count Vector:

  • Step 1: All unique terms are placed in the columns, and all documents in the rows.
  • Step 2: The frequency of each term in each document is placed in the cell at their intersection (see the sketch below).
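
As a quick illustration, here is a minimal sketch of the Count Vector using scikit-learn's CountVectorizer; the two toy sentences are made up and stand in for movie descriptions:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string plays the role of one movie description.
corpus = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(corpus)  # rows: documents, columns: unique terms

print(vectorizer.get_feature_names_out())  # the unique terms
print(count_matrix.toarray())              # term frequencies per document
```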

- TF-IDF:

TF-IDF normalizes word frequencies both within each document and across the whole corpus, that is, across all of the text we are working on. In other words, it standardizes the word vectors over the document-term matrix by taking into account all of the documents and how frequently each term appears in them. In this way, it eliminates some of the bias that the raw counts of the Count Vector can introduce.

  • Step 1: Calculate the count matrix (the frequency of each word in each document)
  • Step 2: Calculate TF (Term Frequency)

TF(t, d) = (frequency of term t in document d) / (total number of terms in document d)

  • Step 3: Calculate IDF (Inverse Document Frequency)

IDF(t) = 1 + ln((total number of documents + 1) / (number of documents containing term t + 1))

For example, in a corpus of 4 documents, a term that appears in 2 of them gets IDF = 1 + ln((4 + 1) / (2 + 1)) = 1 + ln(5/3) ≈ 1.51.

If a term occurs very frequently across the whole corpus, it carries little power to distinguish one document from another. IDF therefore down-weights such terms, normalizing the frequencies both within the documents and across the corpus.

  • Step 4: Calculate TF * IDF
  • Step 5: L2 Normalization

Take the square root of the sum of the squares of each row, and divide every cell in that row by the value found. Each document vector then has unit length.

L2 normalization puts all the document vectors on the same scale, so that words are not over- or under-weighted simply because some documents are longer or shorter than others.
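
All of these steps come out of the box in scikit-learn's TfidfVectorizer, whose defaults match the smoothed IDF formula and the L2 step above. A minimal sketch on the same toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# smooth_idf=True and norm='l2' are the defaults, matching the
# IDF formula and the L2 normalization described above.
tfidf = TfidfVectorizer(smooth_idf=True, norm="l2")
tfidf_matrix = tfidf.fit_transform(corpus)

print(tfidf.idf_)              # 1 + ln((n_docs + 1) / (df(t) + 1)) per term
print(tfidf_matrix.toarray())  # rows are unit-length TF-IDF vectors
```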

2. Calculate Similarities

Suppose we have m movies and n unique words across their descriptions. Before finding the content-based similarity of these films programmatically, let's see how it works in principle:

We can use Euclidean distance or cosine similarity to measure the similarity of the vectorized films.

- Euclidean Distance:

By calculating the Euclidean distance between two movie vectors, we obtain a value that expresses how similar the movies are: as the distance decreases, the similarity increases. Recommendations can then be made from the closest movies.

- Cosine Similarity:

Whereas Euclidean distance is built on the concept of distance, cosine similarity is built on the concept of similarity; distance and closeness on the one hand, and similarity and dissimilarity on the other, express the same relationship from opposite directions. Cosine similarity measures the angle between two vectors, so values closer to 1 mean the movies are more alike.
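
Both measures are available in scikit-learn. Here is a minimal sketch on two made-up vectors standing in for rows of a TF-IDF matrix:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Two toy "movie" vectors, e.g. rows of a TF-IDF matrix.
a = np.array([[1.0, 0.0, 2.0]])
b = np.array([[2.0, 1.0, 1.0]])

print(euclidean_distances(a, b))  # smaller distance -> more similar
print(cosine_similarity(a, b))    # closer to 1      -> more similar
```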

PROJECT

Now that we’ve covered the logic of content-based filtering, we can dive into the project.

Problem:

A newly established online movie platform wants to make movie recommendations to its users. Because the login rate of users is very low, the users' habits are unknown. However, which movies the users watch can be recovered from the traces in their browsers. Based on this information, movie recommendations are to be made to the users.

About Dataset:

We will use the main Movies Metadata file, which contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, and production countries and companies.

You can reach the data set here.

Creating the TF-IDF Matrix:

The necessary libraries are imported at the beginning of the project, and the dataset is read:
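
A minimal sketch of this setup; the file name movies_metadata.csv is an assumption based on the Kaggle release of the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# File name assumed from the Kaggle "The Movies Dataset" release.
df = pd.read_csv("movies_metadata.csv", low_memory=False)
df["overview"].head()
```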

The first step is to apply the TF-IDF method. For this, the TfidfVectorizer imported at the beginning of the project is used. The stop_words='english' parameter removes words that are very common in the language and carry no measurement value (and, the, at, on, etc.). This avoids the problems such uninformative terms would cause in the TF-IDF matrix to be created.
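
A sketch of this step; filling missing overviews with empty strings is an extra safeguard not mentioned in the prose:

```python
tfidf = TfidfVectorizer(stop_words="english")

# Missing overviews would break fit_transform, so fill them with
# empty strings first (an assumption about the raw data).
df["overview"] = df["overview"].fillna("")

tfidf_matrix = tfidf.fit_transform(df["overview"])
tfidf_matrix.shape  # (45466, 75827)
```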

The shape of the tfidf_matrix is (45466, 75827), where 45466 is the number of overviews and 75827 is the number of unique words. To work more comfortably with data of this size, I will convert the values at the intersections of the tfidf_matrix to float32 and proceed accordingly.
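
The conversion itself is one line:

```python
# Roughly halve the memory footprint by storing the scores as float32.
tfidf_matrix = tfidf_matrix.astype(np.float32)
```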

Now that we have the scores at the intersections of the tfidf_matrix, we can construct the cosine similarity matrix and observe the similarity between the films.

Creating Cosine Similarity Matrix:

Using the cosine_similarity method imported at the beginning of the project, the similarity of each movie to every other movie is found:
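
A sketch of this step; note that a 45466 × 45466 matrix is memory-hungry, so enough RAM is assumed:

```python
# cosine_sim[i, j] holds the similarity between movie i and movie j.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape  # (45466, 45466)
```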

For example, we can find the similarity scores of the movie at index 1 with all the other movies as follows:
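
```python
cosine_sim[1]  # similarity of the movie at index 1 to every other movie
```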

Making Suggestions Based on Similarities:

Similarities were calculated with cosine similarity, but the names of the movies are needed to interpret these scores. For this, a pandas Series recording which movie sits at which index is created: indices = pd.Series(df.index, index=df['title']).

As can be seen below, some movie titles appear more than once.

We need to keep one of these duplicates and eliminate the rest; the sensible choice is to keep the most recent one. This can be accomplished as follows:
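
A sketch of this step, assuming the rows are ordered so that the last occurrence of a title is the most recent release:

```python
indices = pd.Series(df.index, index=df["title"])

# keep="last" retains the final occurrence of each duplicated title,
# assumed here to be the most recent one.
indices = indices[~indices.index.duplicated(keep="last")]
```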

As a result of these operations, each title becomes unique and can be accessed with a single piece of index information.

Let's assume we want to find 10 movies similar to Sherlock Holmes. First, the index of Sherlock Holmes is used to select the corresponding row of cosine_sim, which holds the scores expressing the similarity between this movie and all the others.

To put these scores in a more readable format, a DataFrame named similarity_scores is created, and the similarities selected with cosine_sim[movie_index] are saved in its 'score' column:
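
A sketch of these steps, assuming the title 'Sherlock Holmes' exists in the data exactly as written:

```python
movie_index = indices["Sherlock Holmes"]

# Wrap the similarity row in a DataFrame for readability.
similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])

# Take the 10 highest scores, skipping position 0: the movie itself.
movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
```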

The indexes of the 10 movies most similar to Sherlock Holmes have been selected above. The movie titles corresponding to these indexes can be accessed as follows:
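
```python
df["title"].iloc[movie_indices]
```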

These 10 movies are the ones most similar to Sherlock Holmes in terms of their descriptions, so they can be recommended to a user who watched Sherlock Holmes. You can also try different movies and observe the results.
