Building A Recommendation System Using Disney+ TV Shows and Films
In 2019, Disney released its own streaming service, staking its claim as an active competitor in the streaming industry. Alongside services people already had, such as Netflix or Hulu, Disney+ became another that many have subscribed to. Disney owns many beloved franchises and brands: its classic Disney films, Disney Channel original movies and shows, Pixar, Marvel, Star Wars, and National Geographic. Like other streaming services, Disney+ has a built-in recommendation system that suggests titles to users based on what they watch. If a user watches a Pixar film, another Pixar film may be recommended afterward. If a show about superheroes is viewed, another related TV show may be recommended soon after.
The goal of a streaming service is to keep users streaming whatever programs that service provides, so surfacing similar shows and films is vital. Recommendation systems are widely used across the web: online retailers like Amazon recommend items for purchase, social media sites like Twitter recommend users to follow, and music streaming services like Spotify recommend artists and songs to listen to.
In this article, I will be discussing my attempt to build a basic content-based recommendation system for films and shows from a dataset of films and shows available on Disney+. The dataset I used for this recommendation system can be found here.
The Process
1. I first prepared a Jupyter notebook where I loaded the dataset and imported the following libraries: pandas, sklearn, and nltk. From sklearn, I imported the cosine_similarity function and the CountVectorizer class, and from nltk, I imported the stopwords list. From the loaded dataset, I removed several columns I wasn’t planning on using, including IMDB ID, awards, language, runtime, and release date. A screenshot of the first few rows of the dataset is below in Figure 1. The dataset features 992 rows (programs), which seems fairly small for a streaming service. (I’m assuming Netflix and Hulu have many more options.) Since Disney+ is still a fairly new streaming service, I excused the small dataset.
2. The features I intended to use for a content-based recommendation system are the plot, genre, director, writer, and actors. To use these columns, it was necessary to clean them first. I started with the plot column. To clean the plot summaries, I lower-cased each summary and then removed any stopwords (using nltk) so that only the most important keywords remained. Most plot summaries contained character names, locations, or other essential keywords that helped differentiate the programs. A screenshot of what the resulting data looked like is in Figure 2.
3. The next step was to clean the genre, director, writer, and actors columns. This was also fairly simple: I lower-cased each term, replaced all commas with a space, and then appended these features into a new column. I then combined the plot and features columns into one new column so that all the necessary information would be together in one place.
4. After this, I used CountVectorizer to transform the combined column into a count matrix so that I could later use cosine similarity as the similarity measure to find programs similar to an inputted one. The shape of the count matrix ended up being (992, 7095), that is, 992 programs by 7,095 distinct terms. The reason I used CountVectorizer as opposed to a TF-IDF vectorizer is that the latter re-weights each term by how common it is across all documents, down-weighting terms that appear in many programs. Since I had already extracted the most important terms from each category, I believed that plain term counts without this re-weighting would be more efficient and accurate.
5. I then applied the cosine_similarity function to the count matrix and built a recommendation function that takes the name of a program and outputs the most similar titles. The code block below displays the function I used, largely inspired by this article. The function takes a title as input (the title must already be in the dataset; otherwise the function raises an error), looks up that title’s similarity score against every other title, and sorts the titles by similarity. The top five titles most similar to the inputted title are then outputted.
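def get_recs(title, cosim=cosim):
    # Look up the row position of the requested title (errors if it is absent)
    idx = indices[title]
    # Pair every row position with its similarity to the requested title
    simscore = list(enumerate(cosim[idx]))
    # Sort so the most similar programs come first
    simscore = sorted(simscore, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the title itself) and keep the next five
    simscore = simscore[1:6]
    program_indices = [i[0] for i in simscore]
    return df['title'].iloc[program_indices]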
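The text preparation in steps 2 and 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the notebook's actual code: the small STOPWORDS set stands in for nltk's much longer English stopword list, the helper names clean_plot and build_features and the lower-case column names are hypothetical, and the sample record is made up.

```python
# Stand-in for nltk's English stopword list (assumption: the real list,
# from nltk.corpus.stopwords, is much longer than this handful of words).
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "on", "the", "to", "with"}

# Hypothetical column names for the genre and people features.
FEATURE_COLUMNS = ("genre", "director", "writer", "actors")


def clean_plot(plot):
    """Step 2: lower-case a plot summary and drop stopwords."""
    return " ".join(w for w in plot.lower().split() if w not in STOPWORDS)


def build_features(record):
    """Step 3: lower-case each feature and replace commas with spaces."""
    parts = [record[col].lower().replace(",", " ") for col in FEATURE_COLUMNS]
    # Splitting and re-joining collapses the double spaces left by ", "
    return " ".join(" ".join(parts).split())


# A made-up record, not a row from the real dataset.
record = {
    "plot": "A shy inventor and the clever robot of the family travel to a hidden island",
    "genre": "Animation, Comedy",
    "director": "Jane Doe",
    "writer": "John Roe",
    "actors": "Alex Poe, Sam Low",
}

# Combine the cleaned plot and the features into one string per program.
combined = clean_plot(record["plot"]) + " " + build_features(record)
print(combined)
```

In this sketch, replacing commas with spaces splits each name into separate tokens ("jane" and "doe"), so two different people who share a first or last name would count as a partial match. Punctuation inside the plot text is also left untouched here.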
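The get_recs function above relies on three names defined elsewhere in the notebook: df, indices, and cosim. The sketch below shows one way steps 4 and 5 could fit together on a toy dataset. The four rows, the placeholder titles, the column name combined, and the top_n parameter are all assumptions made for illustration, not the original data or code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the 992-row Disney+ dataset.
df = pd.DataFrame({
    "title": ["Show A", "Show A 2", "Film B", "Doc C"],
    "combined": [
        "toys cowboy spaceman animation comedy",
        "toys cowboy spaceman rescue animation comedy",
        "toys holiday comedy",
        "birds rainforest documentary",
    ],
})

# Step 4: one row per program, one column per distinct term.
count_matrix = CountVectorizer().fit_transform(df["combined"])

# Step 5: pairwise cosine similarity between all programs.
cosim = cosine_similarity(count_matrix)

# Map each title to its row position (assumes titles are unique).
indices = pd.Series(df.index, index=df["title"])


def get_recs(title, cosim=cosim, top_n=2):
    idx = indices[title]  # raises KeyError for an unknown title
    simscore = sorted(enumerate(cosim[idx]), key=lambda x: x[1], reverse=True)
    # Drop the queried title explicitly instead of assuming it sorts first.
    simscore = [s for s in simscore if s[0] != idx][:top_n]
    return df["title"].iloc[[i for i, _ in simscore]]


print(get_recs("Show A").tolist())
```

Filtering out the queried row by position, rather than slicing off the first result, is a small design choice: if another program had an identical combined string, it would tie at the top score and the queried title might not be the first entry after sorting.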
Results
In Figure 3, I’ve constructed a table of some titles (top row) I inputted into my recommendation system, with the recommended titles listed below each of them. These titles are Phineas and Ferb (Disney Channel animated series), High School Musical (Disney Channel Original Movie), Frozen (Disney animated film), Toy Story (Pixar film), and Avengers: Endgame (Marvel film). I opted out of inputting Star Wars and National Geographic titles since I personally don’t watch them, whereas I am familiar enough with the other titles to judge the results. I would say overall that the recommendation system is rather successful in outputting similar titles.
- For Phineas and Ferb, other Phineas and Ferb programs are recommended. Sharpay’s Fabulous Adventure is recommended because Ashley Tisdale is a lead actress in both that film and Phineas and Ferb.
- It’s evident under High School Musical (HSM) that the other HSM films are recommended along with its spinoff series. Kim Possible and Geek Charming are less related but still revolve around the setting of teenagers in high school.
- Frozen’s top three results are definitely similar programs. Encore! is a series by Kristen Bell (who is featured as a lead actress in Frozen), which is also relevant. However, Winged Seduction: Birds of Paradise is a documentary I was not expecting, and I couldn’t figure out why it was included.
- Under Toy Story, all titles are related. The first four are sequels to the first Toy Story or shorts developed by Pixar with the Toy Story characters. The fifth recommendation is a film starring Tim Allen, who also stars in Toy Story.
- Finally for Avengers: Endgame (I haven’t seen this film in its entirety but know enough about the Marvel Cinematic Universe to understand the results), all films recommended are also films that take place in the Marvel Universe.
Limitations
There’s likely always more information that can be included to help make a recommendation. For my recommendation system, I primarily wanted to focus on text-based features, but I could have included the MPAA rating, IMDB rating, Metascore, or information on film budgets.
Additionally, I believe that some “obscure” or unrelated titles could be recommended for a given input if I expand the outputted list long enough. At just five recommendations, the documentary Winged Seduction: Birds of Paradise was suggested for Frozen, which is likely not a good recommendation. Other Disney animated films should have been recommended first. This means that the two programs share some terms that lead them to be considered similar. The same is likely true for other titles as well.
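One way to investigate an unexpected recommendation like this is to list the terms two programs share and see what cosine score they produce. The sketch below does this in plain Python on made-up strings. It does not use the real Frozen or Winged Seduction text, and the helper names are hypothetical. It also splits on whitespace only, whereas CountVectorizer's default tokenizer additionally lower-cases text and drops single-character tokens.

```python
import math
from collections import Counter


def count_vector(text):
    """Bag-of-words counts for one program's combined text."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_terms(a, b):
    """Terms present in both vectors, sorted alphabetically."""
    return sorted(a.keys() & b.keys())


# Made-up combined strings, not taken from the dataset.
film = count_vector("princess ice kingdom sister journey animation family")
documentary = count_vector("birds rainforest journey documentary family")

print(shared_terms(film, documentary))
print(round(cosine(film, documentary), 3))
```

Here two generic terms are enough to produce a nonzero score, which is the kind of overlap that could surface an unrelated title once no closer matches remain.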
Conclusions
While this is a fairly basic recommendation system, it is still a rather effective one. The plot and genres provide some context as to what the program is about, and the director, writer, and actors provide some context as to who was involved in producing that program. Users may watch other programs by these people if they enjoy the first program. (For example, if you like both Phineas and Ferb and Ashley Tisdale, Sharpay’s Fabulous Adventure is likely a good recommendation.) This is no Netflix Prize, but the algorithm here is still able to provide similar enough recommendations for a user.