# Exploring Movie Recommender Systems, part 2: Content-Based

Using movie metadata to find thematically-linked films with natural language processing.

Click here to read part 1, which deals with popularity recommenders.

Have you ever browsed a website that’s recommended something to you? Whether it’s Facebook suggesting you add someone as a friend, Amazon proclaiming that they think you’ll enjoy a specific product, or Hulu suggesting a new TV show, recommender systems are widely used by websites to drive traffic and sales.

In a previous post, I detailed why I chose to make a movie recommender. The short and sweet version is that I wanted to recommend myself movies, but didn’t want to be limited by one site’s catalogue, à la Netflix. Using the 27,000 movies included in the MovieLens 20M dataset from GroupLens as a starting point, I got the entire list of movie IDs from IMDB’s available datasets and then used OMDBAPI to scrape the metadata for over 250,000 movies.

# The Process

At its simplest, a content-based recommender works like this: a user likes movie A and wants to find a similar movie. Movie A is similar to movie B, so movie B is recommended to the user.

But how do we find something *similar*? What does similar mean, and how can someone turn that into an automated process? As a person, I know that *Toy Story* and *Brave* are both animated movies created by Pixar, and therefore somewhat similar. I also know that the 1988 movie *Child’s Play* is about a doll named Chucky (given to a boy named Andy) who comes to life. *Toy Story* is also a movie about a boy named Andy whose toys come to life. But while *Toy Story* is an animated family film suitable for all ages, *Child’s Play* is described as a ‘supernatural slasher film’. If I were looking for movies similar to *Toy Story*, *Child’s Play* is not the result I would want.

So how is this problem solved? The answer is through the use of natural language processing (NLP). NLP, in brief, helps computers understand, interpret, and — in some cases — manipulate human language.

First, though, the metadata has to be processed. Here’s how I did it:

# Text Processing and Cleaning

The first issue that I encountered early on with my recommender system was that when you have 250,000+ movies, some of those movies share the same title. This can be due to remakes (*Alice in Wonderland* is a 1951 movie as well as a 2010 movie). Thus, the very first thing I did was change each movie’s title to include its release year, turning ‘Toy Story’ into ‘Toy Story (1995)’.
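A sketch of that step with pandas; the DataFrame and column names here are my own illustration, not the original schema:

```python
import pandas as pd

# Hypothetical metadata frame -- column names are illustrative.
movies = pd.DataFrame({
    "title": ["Toy Story", "Alice in Wonderland", "Alice in Wonderland"],
    "year": [1995, 1951, 2010],
})

# Append the release year so remakes no longer collide on title alone.
movies["title"] = movies["title"] + " (" + movies["year"].astype(str) + ")"

print(movies["title"].tolist())
# ['Toy Story (1995)', 'Alice in Wonderland (1951)', 'Alice in Wonderland (2010)']
```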

I knew as well that I wanted to return quality movies, and not necessarily just the movies that were ‘closest’ to the movie I was searching. Because of this, I implemented a weighted rating using the total number of votes the movie had received on IMDB and its average IMDB score using the following equation:

WR = (v ÷ (v + m)) × R + (m ÷ (v + m)) × C

Where:

- *R* = average rating for the movie
- *v* = the number of votes cast for the movie
- *m* = the minimum vote threshold required
- *C* = the mean rating of **all** movies in the dataset
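In code, the formula might look like this (the vote threshold *m* and the sample numbers are illustrative, not values from the dataset):

```python
def weighted_rating(R, v, m, C):
    """IMDB-style weighted rating: blends a movie's own average rating
    with the global mean, in proportion to how many votes it has."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A heavily-voted film keeps almost all of its own average...
print(round(weighted_rating(R=8.3, v=900_000, m=1_000, C=6.9), 2))  # 8.3

# ...while a sparsely-voted film is pulled toward the global mean.
print(round(weighted_rating(R=9.8, v=12, m=1_000, C=6.9), 2))  # 6.93
```

This is what keeps an obscure movie with ten perfect votes from outranking a classic with a million votes.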

Next, I had to decide on which features I wanted to keep. I decided on the following features:

- The release year
- The MPAA rating (if available)
- The genres
- The director
- The writer(s)
- Notable actors
- The plot
- The country of production
- Languages used within the movie

Another issue was at hand: John Lasseter directed several Pixar movies. John Waters is a director known, according to Wikipedia, for transgressive cult films like *Pink Flamingos*. John Wayne was an American actor who mostly performed in Westerns. However, because of the NLP techniques I used, all three of those instances of ‘John’ (along with every other shared name) would skew how ‘close’ those movies appeared to one another, simply due to the frequency of the word ‘John’. So I deleted the spaces from their names, making John Lasseter into johnlasseter, John Waters into johnwaters, and John Wayne into johnwayne.
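The transformation itself is trivial (the function name here is my own):

```python
def collapse_name(name):
    # "John Lasseter" -> "johnlasseter": one token per person, so the
    # vectorizer can't conflate every 'John' in the dataset.
    return name.replace(" ", "").lower()

directors = ["John Lasseter", "John Waters", "John Wayne"]
print([collapse_name(d) for d in directors])
# ['johnlasseter', 'johnwaters', 'johnwayne']
```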

An important note on this: these fused name tokens threw a theoretical wrench in the gears when it came to trying to implement more advanced NLP like doc2vec or word2vec, which work by using a shallow neural net to give you a numerical representation of each word that can also encompass relations like Queen → woman and King → man. These representations are called embeddings, and the fused names on their own didn’t provide useful embedding information. I likely would have had to go through and create embeddings of my own, such as linking johnlasseter → pixar, or johnwayne → westerns.

As it is, I used a simple bag of words (BoW) method. It’s simplistic: it doesn’t account for grammar or context. But since a lot of what I needed didn’t actually require context, I wanted to see how it would do.

A given movie would then have info that originally looked like this:

I went through and manually tokenized all of the information. I took out any information between parentheses in the ‘writer’ section, made everything lowercase, and removed the word ‘nan’ universally.
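A sketch of that cleaning step, assuming simple regular expressions (the function and sample input are illustrative, not the original code):

```python
import re

def clean_field(text):
    # Strip parenthetical credits like "(screenplay)", drop remaining
    # punctuation, lowercase, and remove literal 'nan' tokens left over
    # from missing values.
    text = re.sub(r"\([^)]*\)", "", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return [t for t in text.lower().split() if t != "nan"]

print(clean_field("Joss Whedon (screenplay), Andrew Stanton (screenplay) nan"))
# ['joss', 'whedon', 'andrew', 'stanton']
```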

For ‘plot’, which was the only area of text that required cleaning, I used NLTK-RAKE — short for ‘Rapid Automatic Keyword Extraction’. It works by trying to determine key phrases in the document. It also filters out stopwords — or common words in English that occur too frequently to have any real meaning — such as ‘a’, ‘an’, ‘the’, ‘she’, etc. All in all, after tokenizing everything, I was left with something that looked like this for every movie:
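RAKE itself ships as a package, but its core idea is simple enough to sketch in a few lines: stopwords act as phrase delimiters, and whatever falls between them becomes a candidate key phrase. The stopword list below is a tiny illustrative subset, and this sketch skips RAKE’s phrase-scoring step:

```python
import re

# A tiny illustrative subset of an English stopword list.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "his", "her",
             "who", "by", "in", "she", "he", "it", "with", "on"}

def rake_phrases(text):
    # Split the text wherever a stopword occurs; each run of non-stopwords
    # between delimiters becomes one candidate key phrase.
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(" ".join(current))
    return phrases

plot = "A cowboy doll is profoundly threatened by a new spaceman figure"
print(rake_phrases(plot))
# ['cowboy doll', 'profoundly threatened', 'new spaceman figure']
```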

Which I could then put together into a bag of words, and it would become a long, grammatically incorrect sentence:

# Determining ‘Distance’

Once we get into the mathematical concept of ‘distance’, it becomes important to talk about what distance means. At its simplest, we usually mean Euclidean distance: the straight-line distance from point A to point B in 2D or 3D space. While it is possible to use Euclidean distance in machine learning (such as in k-nearest neighbors algorithms), in this case I’ll be using **cosine similarity** as the distance function.

Cosine similarity is a mathematical formula that spits out a number between -1 and 1. If you think of vectors on a two-dimensional plane, a cosine similarity value of 1 indicates that the vectors point in exactly the same direction. A cosine similarity value of 0 indicates the vectors are orthogonal (in 2D space, we call this perpendicular). A value of -1 indicates an opposite orientation, meaning the vectors point in opposite directions.

In my project, I only got values between 0 and 1: count vectors have no negative entries, so opposite orientations can’t occur. A value of -1 wouldn’t be very useful anyway; does it mean *opposite*, or just *not correlated*? It’s easier to interpret cosine similarity simply as how close to orthogonal the two vectors are in n-dimensional space.

Cosine similarity is calculated by taking the dot product of the two vectors over the product of the magnitude (length) of the vectors, thereby normalizing it in terms of vector length. If that’s confusing, you can learn more about dot products here.
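That formula is a one-liner with NumPy; the vectors here are toy examples:

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product of the vectors, normalized by the product of their lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(round(cosine_sim(a, b), 3))  # 1.0
print(round(cosine_sim(a, c), 3))  # 0.0
```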

# Implementation — Count Vectorization

The first way to do this is to create a **count vector** and then use cosine similarity to determine how similar the vectors are.

Count vectors turn documents into very, very long vectors that contain only non-negative integers. All together, a dataframe of count vectors will create a huge sparse matrix, where the majority of the entries are 0.

However, we can take these vectors and use cosine similarity to determine their distance, using our cosine similarity formula:

The way that I did this was all in one calculation. First, I vectorized the metadata into count vectors. Then I computed the cosine similarity between every pair of movies, creating one giant similarity matrix.
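With sklearn, that whole pipeline is only a few lines. The toy ‘soups’ below are illustrative stand-ins for the real metadata strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative bag-of-words 'soups'; the real ones concatenated year,
# rating, genres, crew, plot keywords, country, and language.
soups = [
    "toystory1995 animation family johnlasseter pixar toys",
    "brave2012 animation family pixar princess",
    "childsplay1988 horror slasher doll chucky",
]

count_matrix = CountVectorizer().fit_transform(soups)  # sparse counts
sim = cosine_similarity(count_matrix)                  # n x n similarity matrix

# The two Pixar films score far closer to each other than to the slasher.
print(sim.round(2))
```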

**This turned out to be a bit of a problem** when it came to optimization and scalability. My total dataset had 250,000 movies. If I had used all of them, I would have ended up with a 250k × 250k matrix, which holds 62,500,000,000 cells. That’s sixty-two and a half billion.

At the time of my project, I didn’t know how to solve this, so I once again scaled down my data to 52,000-something movies. This still gave me a matrix with 2,704,000,000 cells. But my computer could handle it.

# The Results

The final step was to write a function that sorted movies by their cosine similarity to a queried movie. I made it so ‘users’ could sort either by pure cosine similarity or by the previously-implemented weighted rating. That way, they could get ‘certified good’ movies (high weighted ratings) or just the pure cosine similarity results.
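A minimal sketch of such a function; the name, signature, and toy similarity matrix are my own illustration, not the original code:

```python
import numpy as np

def recommend(title, titles, sim_matrix, ratings=None, top_n=5):
    """Return the top_n titles most similar to `title`; optionally
    re-rank those candidates by a precomputed weighted rating."""
    idx = titles.index(title)
    # Most-similar first, excluding the query movie itself.
    order = [i for i in np.argsort(sim_matrix[idx])[::-1] if i != idx][:top_n]
    if ratings is not None:
        order = sorted(order, key=lambda i: ratings[i], reverse=True)
    return [titles[i] for i in order]

titles = ["Toy Story (1995)", "Brave (2012)", "Child's Play (1988)"]
sim = np.array([[1.0, 0.55, 0.0],
                [0.55, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

print(recommend("Toy Story (1995)", titles, sim, top_n=1))
# ['Brave (2012)']
```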

When I ran the function, I found something interesting had happened: the algorithm was picking up on similar movie *themes*. While I’m not sure all the movies would have been appreciated by those using the theoretical website, it was interesting to see. Let’s take a look at *The Lord of the Rings: The Fellowship of the Ring* results:

Looking at the above results, something interesting has happened. We get movies we might have expected: the other two movies in the LoTR franchise, as well as *The Hobbit*. *Harry Potter* is in there (but only the one that heavily features ‘war’), which makes sense. But what about the other movies? As it turns out, all of the movies listed have similar themes. They all feature other worlds, whether it’s Tolkien’s or Rowling’s fantasy realms, the world of *Thor*, or places like the afterlife. And they all share themes such as power attracting the corruptible, inevitable loss, and emotional struggle in the face of sacrifice, and how it’s worth it anyway.

*Lord of the Rings* might be an imperfect example, anyway: its style and genre are fairly unique, leading to few movies that can capture the same essential feelings. As humans, we can easily think about the nuance of the word *drama* as it might be applied to LoTR versus a drama like *The King’s Speech*.

I think this could be improved upon with user input. For example, genres or keywords could be user-submitted and voted upon instead of relying on the movie’s official metadata, which has to conform to industry standards.

Here are the results for *Spirited Away*, the acclaimed 2001 movie by Japanese animator Hayao Miyazaki:

Mostly we get other movies by Studio Ghibli and Miyazaki. At the bottom, we do get *The Rugrats Go Wild*, which again, does have a similar theme: children getting lost from their parents and discovering fantastical worlds.

These results specifically could also arise out of the fact that many of Miyazaki’s movies feature the same themes: strong female leads who often face difficulty or are separated from their parents or family. They’re all coming of age stories in their own ways.

*The Rugrats Go Wild* also gave me peace of mind for one reason: it showed that the algorithm was pulling movies ‘across the aisle’, so to speak; the results for *Spirited Away* were not *only* Miyazaki or Studio Ghibli movies. I did consider how to improve this, and I think the answer lies again in embeddings. It would have been smart to somehow link Studio Ghibli to Disney (its American distributor) and Disney to Pixar, perhaps then widening the available pools of ‘similar’ films.

And finally, for a more obscure movie, let’s look at the 1987 German film *Der Himmel über Berlin* (English: *Wings of Desire*):

Again, we get thematically-similar movies. *Faraway, So Close!* is thematically identical. In terms of general themes, many of the movies deal with individual choices only mattering to the individual and dabble in the same sense of melancholy. Seeing as the movie I picked is a German homage art film directed by Wim Wenders that’s basically a think piece with a guest role by Peter Falk who is playing an angel pretending to be Peter Falk playing Columbo, shot entirely in black and white…these results do not disappoint!

# The Issue of Scaling

As previously mentioned, the amount of data here is an issue. Cosine similarity doesn’t scale well: as the size of the matrix increases, it requires much more space and memory than any reasonable computer can offer. But how do we address it?

One way that I found was to use **term frequency-inverse document frequency (tf-idf)** vectors instead of count vectors. These vectors come out *already normalized* (sklearn’s TfidfVectorizer L2-normalizes each row by default), which means that calculating ‘cosine similarity’ reduces to calculating the dot product of the two vectors. With this, I was able to use sklearn’s pairwise_kernels to give the same information as the count vectors paired with pure cosine similarity.
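A sketch of that variant, using `linear_kernel` (equivalent to `pairwise_kernels` with the linear metric); the toy ‘soups’ again stand in for the real metadata strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

soups = [
    "toystory1995 animation family johnlasseter pixar toys",
    "brave2012 animation family pixar princess",
    "childsplay1988 horror slasher doll chucky",
]

# TfidfVectorizer L2-normalizes rows by default, so a plain dot product
# (linear kernel) already equals cosine similarity -- no extra norm pass.
tfidf = TfidfVectorizer().fit_transform(soups)
sim = linear_kernel(tfidf)

print(sim.round(2))
```

Skipping the explicit normalization step is what makes this cheaper than running full cosine similarity on count vectors.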

It is of note that this method was much less effective on a heavily scaled-down dataset (~25,000 movies); there, the results appeared essentially random. When using a dataset of the same size as the above process (~52,000 movies), I ran into the issue of R-rated movies being recommended for G-rated movies. This was easily solved by up-weighting G and PG movies (including the rating G or PG two or three times in the bag of words instead of just once).

The solution to scalability might lie in using Locality-Sensitive Hashing (LSH), which hashes similar vectors into the same buckets so that only candidates sharing a bucket need an exact comparison.

# Evaluation and Wrap Up

Overall, the recommender performs well. This, coupled with the popularity charts from my previous blog post (read it here if you haven’t already), is enough to start a movie recommendation website. It’s important to note, though, that while these recommender results are decent, they still aren’t personalized. Join me next time to learn about building a collaborative recommender system that predicts what a user would rate a certain movie.

Thanks for reading! Leave any questions or comments down below, and see you next time!