Using Cosine Similarity Rankings for TV Series Recommendations

Megan Resurreccion
Web Mining [IS688, Spring 2021]
13 min read · Mar 3, 2021
Source: https://medium.com/@vibesads101/10-recommended-business-tv-shows-for-entrepreneurs-2020-2738496769e8

It’s the weekend and you just finished binge-watching your favorite TV show. You could rewatch it all over again or find something new, but among the hundreds of shows on streaming services, how do you pick one? The goal of streaming services is to keep you watching their available programs (and make a profit, of course), yet with a seemingly limitless number of shows to choose from, how do you decide? There might not be an exact replica of the show you were just watching, but maybe something else just as good will come close. Streaming services likely have their own recommendation systems based on what their users watch, continuously recommending new programs for customers to stream. Recommendation systems are also common among retailers such as Amazon (recommending other products when you purchase an item) and social networks (recommending other accounts when you follow one).

In this article, I will discuss how I conducted a similarity ranking of several popular TV shows across four genres, along with the results. I use Genre and Overview, as well as other information, as features for ranking similarity, and I use cosine similarity as the distance metric (more on that later). I’m certainly not the first person to do this kind of study, nor will I likely be the last; I largely drew inspiration from Sharma [1]. Natural Language Processing (NLP) is a huge subfield, and there are many ways to analyze texts (and then rank them by similarity).

The dataset I will be using is from Kaggle (linked here) and contains data on the 2000 most popular shows scraped from IMDB. For each TV series, it includes the runtime of each episode, the genre, the IMDB rating, the years the series aired, an overview of the series, and the four top (main) stars. This dataset is already fairly clean, so I didn’t have to do much data cleaning. For this study, I used Python in a Jupyter Notebook.

Let’s Walk Through The Process

  1. Before doing anything with the dataset, I imported these libraries: pandas and sklearn (also called scikit-learn). They are used to manipulate the dataset as a data frame and to compute the similarity rankings.
  2. Once I loaded the dataset, I sorted it by IMDB rating to see what the best-rated shows are. IMDB ratings are a weighted average of how users on IMDB rate a show. At the top of this list are The Chosen (rating of 9.7), The Filthy Frank Show (9.5), Breaking Bad (9.5), Koombiyo (9.5), and Scam 1992: The Harshad Mehta Story (9.4).
Data for 3 TV series with the highest ratings.
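A minimal sketch of these first two steps (the CSV file name is a placeholder for whatever your Kaggle download is called):

import pandas as pd

# Load the dataset; adjust the file name to your download
df = pd.read_csv('imdb_top_2000_series.csv')

# Sort by IMDB rating to see the best-rated shows first
df.sort_values('IMDB_Rating', ascending=False).head()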

3. Additionally, I computed some descriptive statistics to get a sense of the average, minimum, and maximum ratings in this dataset. On average, this set of TV series has a rating of roughly 7.6, with the lowest rating being 1.0 and the highest being 9.7.

Mean: 7.5913
Standard Deviation: 0.898731
Minimum: 1.0
25th percentile: 7.2
50th percentile: 7.7
75th percentile: 8.2
Maximum: 9.7
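These numbers come straight from pandas:

df['IMDB_Rating'].describe()  # count, mean, std, min, quartiles, max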

4. The goal is to calculate similarity rankings for a few TV series, so I first needed to pick some. I decided to filter the series by genre. In doing so, I realized that most shows are categorized under more than one genre (often three), so I filtered for all series containing a given genre. I also looked at the highest-rated shows in each genre and wanted to pick from there: the best of each category. I considered the many available genres but settled on Action, Horror, Romance, and Sci-Fi. From each genre, I then selected one series that I was familiar with (familiar meaning I’ve seen the series all the way through at least once). I will use these series as my query items for similarity rankings.

  • Action: Avatar: The Last Airbender
  • Horror: Stranger Things
  • Romance: Friends
  • Sci-Fi: Black Mirror

I chose series that I’m familiar with so I could better evaluate the accuracy of the similarity rankings; it would obviously be harder to evaluate TV series recommendations based on programs I’ve never seen.

Top 3 Rated Action Series. All series are also classified as Adventure.
Top 3 Rated Romance Series. All 3 series are also classified as Comedy.
Top 3 Rated Horror Series. All 3 series are also classified as Drama.
Top 3 Rated Sci-Fi Series. All 3 series are also classified as Drama.
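A minimal sketch of how these per-genre views can be produced (assuming, as in this dataset, that Genre holds comma-separated genre strings; the helper name is my own):

# Best-rated series whose Genre string contains the given genre
def top_rated_in_genre(genre, n=3):
    mask = df['Genre'].str.contains(genre, na=False)
    return df[mask].sort_values('IMDB_Rating', ascending=False).head(n)

top_rated_in_genre('Horror')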

5. Finally, time to start calculating the similarity rankings. The features I wanted to use for ranking the four series were genre and overview. I wasn’t sure whether one, the other, both, or all of the given information for a series would work best, so I tried them all.

5a. I started simply, looking at just the genres and building similarity rankings based on similar genres. The first step was selecting just the series title and genre columns.

5b. Next is where the sklearn library finally comes into play. From it, I imported TfidfVectorizer. It tokenizes the terms in a list of strings and builds TF-IDF-weighted term-frequency vectors [2]. It also removes any English stop words in the strings. (This matters more for later steps.) The last line of code below shows the resulting matrix dimensions; in this case, (2000, 28), where 2000 is the number of series and 28 is the number of distinct genre terms in the vocabulary.

from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF vectors from the Genre strings, dropping English stop words
tfidf = TfidfVectorizer(stop_words='english')
df['Genre'] = df['Genre'].fillna('')  # the vectorizer can't handle NaN
df_tfidf = tfidf.fit_transform(df['Genre'])
df_tfidf.shape  # (2000, 28)

5c. I can now apply a distance metric to this new matrix; I use cosine similarity. I chose it primarily because it is a common distance metric for measuring similarity between documents: it scores documents based on the tokens they share, is computationally efficient, and is applicable to texts of any length [3].

Without getting too deep into the mathematics of it all, cosine similarity works for documents of any size and measures the cosine of the angle between two vectors in a multi-dimensional space. For the non-negative TF-IDF vectors used here, cosine values fall between 0 and 1, where values closer to 0 mean the documents are less similar and values closer to 1 mean they are more similar [4].
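Concretely, for two term vectors a and b, the score is their dot product divided by the product of their lengths; a few lines of numpy make this plain:

import numpy as np

def cos_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cos_sim(np.array([1, 0, 1]), np.array([1, 1, 0]))  # 0.5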

from sklearn.metrics.pairwise import cosine_similarity

# Similarity of every series to every other series: a 2000 x 2000 matrix
co_sim = cosine_similarity(df_tfidf, df_tfidf)

5d. Using the computed cosine similarity scores, I can finally obtain a “list of recommendations” for a given TV series. Before doing so, I first create a reverse map of indices and TV series titles [1]. This lets me look up a series’ row index by its title, so that recommendations come back as titles rather than as index numbers of the original data frame.

# Map each series title to its row index in the data frame
indices = pd.Series(df.index, index=df['Series_Title']).drop_duplicates()

5e. Below is the code block that outputs a list of series most similar to a given TV series; much of this is thanks to Sharma [1]. The code first obtains the index of the given TV series title, pulls the cosine similarity scores associated with that series, sorts those scores in descending order, and outputs the ten most similar TV series.

def get_recs(title, co_sim=co_sim):
    # Row index of the query series
    idx = indices[title]
    # Pair each series index with its similarity score to the query
    sims = list(enumerate(co_sim[idx]))
    # Sort by score, highest first
    sims = sorted(sims, key=lambda x: x[1], reverse=True)
    sims = sims[1:11]  # index 0 is the inputted series itself
    series_indices = [i[0] for i in sims]
    return df['Series_Title'].iloc[series_indices]

get_recs('Stranger Things')

5f. The outputted results for the given series, Stranger Things, are below in the Similarity Ranking Results section. As mentioned at the beginning of step 5, however, I wasn’t sure these recommendations were the best and most accurate. At first glance, inputting multiple TV series, this seems to work. But upon looking up some of the outputted TV series (both in the data frame and online), I wasn’t so sure, which is why I repeated the whole process with different documents.

6. I go through this whole process (steps 5a to 5f) again, but instead of the Genre column of the original data frame, I use Overview. The Overview is a summary of what the series is about and may give better insight into which series resemble each other. The outputted lists were very different, some of which made sense and others less so.

7. Hence, I decided to combine the two columns into one and output new lists (as shown below). These lists resembled the Overview-based lists more than the Genre-based ones.

# Don't forget to add the space in the middle!
df['gen_over'] = df['Genre'] + ' ' + df['Overview']
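The same pipeline from steps 5b through 5e then simply runs again on this combined column; a minimal sketch (variable names are my own):

# Re-fit TF-IDF on the combined Genre + Overview text and recompute similarities
gen_over_tfidf = tfidf.fit_transform(df['gen_over'].fillna(''))
co_sim_gen_over = cosine_similarity(gen_over_tfidf, gen_over_tfidf)
get_recs('Stranger Things', co_sim=co_sim_gen_over)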

8. Still, I wasn’t quite satisfied with the output and couldn’t help wondering whether something was missing. Given the other information available in the data, I combined not only the Genre and Overview columns, but also the four main stars, the IMDB rating (converted into a string first), and the years the series aired. These lists differed from the other three sets of results and are likely the most accurate of the four.

# IMDB_Rating is numeric, so convert it to a string before concatenating
df['all_data'] = (df['Genre'] + ' ' + df['Overview'] + ' '
                  + df['Star1'] + ' ' + df['Star2'] + ' '
                  + df['Star3'] + ' ' + df['Star4'] + ' '
                  + df['IMDB_Rating'].astype(str) + ' ' + df['Runtime_of_Series'])

Similarity Ranking Results

Below are tables for each of the inputted TV series and their corresponding outputted lists for each trial of the cosine similarity rankings. As previously stated, I believe the final trial was the most accurate. For each table, I will lay out what I observed about each list: what made sense and what didn’t.

List of Outputted Recommendations for Avatar: The Last Airbender

Avatar: The Last Airbender (ATLA) (rating of 9.2) is an animated series built on elemental magic, action and adventure, and an overarching story of a team of teens saving the world from the Fire Nation (the series’ villain). Many of the series listed in the table are also animated, making many of these recommendations reasonable. The plot summaries of Trollhunters and The Dragon Prince also appear similar to that of ATLA, so they are relevant too. The Legend of Korra is actually a spinoff of ATLA and realistically should have ranked with a high similarity score to ATLA. For the most part, these series are cartoons featuring characters (usually children or teens) involved in action and adventure. Unexpected entries include The Vampire Diaries and Blackadder Goes Forth.

List of Outputted Recommendations for Stranger Things

Stranger Things (rating of 8.7) is categorized here as horror but is also science fiction, mystery, and thriller. It follows the disappearance of a boy in 1980s Indiana and a girl with supernatural psychokinetic abilities. Other characters include the boy’s friends (who befriend the girl), a police officer searching for the boy alongside the boy’s mother, and the older sister of one of the boys and her romantic interests. The Gates, Grimm, Shadowhunters: The Mortal Instruments, Salem, and Emergence each appear among the outputted lists at least twice. Upon further research, these shows do bear some resemblance to Stranger Things, featuring supernatural elements and/or a police officer or detective as one of the main characters. Unexpected titles here include Leave It to Beaver (a sitcom that aired between 1957 and 1963). The majority of the titles have horror or supernatural elements in them.

List of Outputted Recommendations for Friends

Friends (rating of 8.9) is a sitcom that aired in the late ’90s and early 2000s about six adult friends in New York City: their personal lives, professional lives, and shenanigans. The majority of these titles are also sitcoms (that likely involve romance as well). The series Joey is a spinoff of Friends and was captured in both the Genre list and the final list. One of Cougar Town’s main stars is actually Courteney Cox, who also appears on Friends as one of the main stars. Master of None and Ally McBeal also appear to be relevant recommendations, as both are comedies about adult characters navigating their lives. Unexpected titles here are Grey’s Anatomy, 6Teen, and The Muppets.

List of Outputted Recommendations for Black Mirror

Black Mirror (rating of 8.8) is an anthology series that I often describe as “the modern-day version of The Twilight Zone”: unimaginable technology goes wrong in a world like ours, with commentary on how we use technology and how it impacts society. The tech ranges anywhere from robotic bees that can pollinate to having your consciousness transported into a video game. These results, in my opinion, were probably the least accurate among my query items. Since this series was chosen as the Sci-Fi entry for this study, I recognize that sci-fi can cover many topics in many situations; Black Mirror has some dark themes, and not all sci-fi series will be alike. Among these titles, I believe Electric Dreams and The Outer Limits are probably the best recommendations for Black Mirror. Unexpected titles include Clone High, Duck Dynasty, All American, and Wonder Woman.

Bonus — Color Mapping Distance Matrices

When you have a distance matrix, you can plot it in Python with a color map. Instead of reading the raw numbers, we can color-code them to better spot patterns and similarities among documents. Plotting the colormaps requires the matplotlib and sklearn libraries; there is more detail on color mapping in Python in [5]. In the plots shown below, squares that are more purple/blue represent shorter distances and squares that are more green/yellow represent larger distances. Elements on the purple diagonal correspond to comparing a document with itself. The plot on the left is the distance matrix for all of the documents; the plot on the right is the distance matrix for the first 10 elements (displayed to make the color mapping easier to read).

Distance metrics other than cosine similarity can also be used, including Jaccard similarity and Euclidean distance. I wanted to see whether anything significant showed up in the cosine similarity rankings.
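For illustration, a sketch of how those alternative metrics could be computed on the same TF-IDF matrix (note that Jaccard expects boolean occurrence vectors rather than TF-IDF weights; the variable names are my own):

from sklearn.metrics import pairwise_distances

# Euclidean distances between the TF-IDF vectors
eucl = pairwise_distances(df_tfidf, metric="euclidean")
# Jaccard needs boolean vectors, so threshold the TF-IDF weights first
jac = pairwise_distances((df_tfidf > 0).toarray(), metric="jaccard")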

Unfortunately, I didn’t find these plots particularly useful in my study but have included them here regardless, largely because the matrix is large and involves many elements. Most of the plot is green, meaning most pairs have a distance that is neither very close to 0 (purple) nor very close to 1 (yellow).

import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

plt.imshow(pairwise_distances(co_sim, metric="cosine"))        # full matrix
plt.imshow(pairwise_distances(co_sim[0:10], metric="cosine"))  # first 10 rows
Left: the colormap for the whole distance matrix | Right: the colormap for the first 10 elements of the distance matrix

Limitations of the Study

  1. There’s something to be said about weighting some tokens more than others. In my final trial of cosine similarity rankings (where I include all of a TV series’ information in addition to Genre and Overview), I wonder whether tokens such as the ones in Genre should be weighted more heavily. Under the first trial (Genre only), some series were accurate recommendations but went unmentioned in the outputted lists that came after. As set up here, all tokens are weighted against each other by TF-IDF alone, with no way to emphasize a specific field (to my knowledge, at least); a crude workaround is sketched after this list.
  2. Other information could be useful here, including but not limited to the full cast list; the directors, creators, and producers; and episode titles and episode plot summaries. This information could be folded into the similarity rankings and make recommendations more accurate (or even the opposite).
  3. 2000 TV series seems like a lot, but in reality there are likely thousands more available for streaming. These are just some of the most popular and highest-rated ones; there are definitely hidden gems worth watching elsewhere. The functions can only ever recommend shows that are in the dataset.
  4. IMDB ratings may be skewed and may not accurately reflect how ‘good’ a series is to the general audience, since the users who bother to vote often rate a show as either very good or very bad.
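As a crude sketch of the weighting idea in the first limitation (the column name and weight below are my own, hypothetical choices): repeating a field’s text before vectorizing multiplies its term frequencies, which boosts those tokens in the TF-IDF vectors.

# Hypothetical field weighting: repeating Genre triples its term counts
GENRE_WEIGHT = 3
df['weighted'] = (df['Genre'] + ' ') * GENRE_WEIGHT + df['Overview']
weighted_tfidf = tfidf.fit_transform(df['weighted'].fillna(''))
weighted_sim = cosine_similarity(weighted_tfidf, weighted_tfidf)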

Conclusions

In conclusion, I found that folding more information into the vectors behind the distance matrix was fairly effective for computing similarity rankings. Cosine similarity was also an efficient distance metric for this study, as it’s suited to natural language documents of any size. The results aren’t perfect, but there were still a handful of accurate TV series recommendations for the query items I was familiar with. A future study would likely weight some tokens more than others, to emphasize the importance of Genre, for example, since tokens are currently weighted uniformly across fields.

On the other hand, natural language processing can be handled in a variety of ways (that’s why it has its own field), and there’s never a guarantee that what’s recommended on your favorite streaming service is something you’d actually watch. It’s still just a recommendation based on your previous activity. Similar doesn’t mean the same, so you’ll never find something exactly like your favorite show, but that’s what I like about it: we can watch something similar, yet nothing will come as close as the series we love to binge.

References

[1] https://www.datacamp.com/community/tutorials/recommender-systems-python

[2] https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

[3] https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html

[4] https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a

[5] https://matplotlib.org/stable/tutorials/colors/colormaps.html
