Machine Learning and Content Clustering of disney.com
What does Disney mean to you? Your answer is probably different from mine, and both are likely different from the answer of the next random person on the street. Along with the diverse set of personalities in the Disney audience, there are also many types of content (movies, shows, characters, etc.) with varying levels of appeal to each of those personalities. Because of this wide range of personality types and content types, there is a need to discover categories of taste and how these categories relate to the various types of guests and content. A better understanding of these relationships will help us deliver a better experience to the diverse world of Disney guests.
Teaching Robots to Understand Us
Traditional methods of categorizing content have often considered demographic breakdowns such as age, gender, or location. There certainly is value in looking at those characteristics, but we wanted to try something a little different and take a fresh approach which does not rely on preconceived notions of category breakdowns. Instead, our goal was to consider actual patterns of content popularity and allow these real-world observations to guide the creation of categories.
Our starting point was a large set of data about the popularity of pages and videos on disney.com. To make the analysis simpler, we rolled up the individual pages and videos to the level of properties, which are things like movies and shows. A machine learning technique was then used to automatically discover features that describe the properties, along with a measure of how closely each feature is related to each property. This information can then be used to understand similarities between properties and to cluster properties that appeal to similar personality types.
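The roll-up step described above can be sketched in a few lines of Python. This is only an illustration: the page paths, view counts, the page-to-property lookup, and the function name roll_up_to_properties are all made up for the example and are not taken from the actual disney.com data.

```python
from collections import Counter

def roll_up_to_properties(page_views, page_to_property):
    """Sum page- and video-level view counts into property-level totals."""
    totals = Counter()
    for page, views in page_views:
        prop = page_to_property.get(page)
        if prop is None:
            continue  # page is not tied to any property, so skip it
        totals[prop] += views
    return totals

# Hypothetical input: (page path, view count) pairs.
page_views = [
    ("/movies/tangled", 120),
    ("/video/tangled-trailer", 80),
    ("/shows/lab-rats", 50),
    ("/about", 10),
]

# Hypothetical lookup from a page or video to the property it belongs to.
page_to_property = {
    "/movies/tangled": "Tangled",
    "/video/tangled-trailer": "Tangled",
    "/shows/lab-rats": "Lab Rats",
}

totals = roll_up_to_properties(page_views, page_to_property)
for prop, views in totals.most_common():
    print(prop, views)
```

Working at the property level keeps the number of items small and pools the traffic from a property's many pages and videos, which is presumably what makes the later analysis simpler.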
If you’re really curious about the process, we used Latent Dirichlet Allocation (LDA), which produced a set of features and an affinity rating between each feature and each property. The affinity ratings for a property can be interpreted as its coordinates in a vector space, and the similarity between two properties can be found by calculating the cosine similarity of their vectors. We’ll leave it to Wikipedia to give the more detailed lectures on those topics.
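For a concrete feel of the similarity calculation, here is a minimal sketch of cosine similarity in plain Python. The affinity ratings below are invented for illustration (three made-up features per property) and are not output from our LDA run.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical affinity ratings: one value per discovered feature.
affinities = {
    "Cinderella": [0.90, 0.05, 0.05],
    "Tangled": [0.80, 0.10, 0.10],
    "Lab Rats": [0.05, 0.15, 0.80],
}

princess_pair = cosine_similarity(affinities["Cinderella"], affinities["Tangled"])
mixed_pair = cosine_similarity(affinities["Cinderella"], affinities["Lab Rats"])
print("Cinderella vs Tangled: ", round(princess_pair, 3))
print("Cinderella vs Lab Rats:", round(mixed_pair, 3))
```

Two properties whose ratings point in nearly the same direction score close to 1, while properties that load on different features score close to 0. Because cosine similarity ignores vector length, it compares the mix of features rather than the overall size of the ratings.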
Clustering Results
This process creates a number of clusters of content, and we can look at the content to guess a cluster’s topic. We can see a number of interesting topics in the resulting clusters. The important point to note here is that expert knowledge is not used to create the clusters. For example, when we look at the cluster with Disney Princess type content, we didn’t need to manually input knowledge about which movies and characters are considered Disney Princesses. Instead, an unguided process automatically discovers the coherent group of content by using similarities in the patterns of popularity. When we look at the groups after they are created, we see that one contains a lot of Disney Princess type content.
Samples of the created clusters:
Topic: Disney Princesses
The Little Mermaid, Cinderella, Tangled, Beauty and the Beast, Snow White and the Seven Dwarfs
Topic: Tweens
High School Musical, Hannah Montana, Selena Gomez, Bridgit Mendler, Teen Beach Movie
Topic: Disney Channel
Gravity Falls, Wizards of Waverly Place, Fish Hooks, Austin & Ally, Shake It Up
Topic: Disney XD
Avengers Assemble, Kick Buttowski, Randy Cunningham: 9th Grade Ninja, Lab Rats, Spider-Man
Topic: Animated Movies
The Lion King, Fantasia, Toy Story, Bambi, Finding Nemo, The Secret World of Arrietty
Topic: Live Action Movies
National Treasure, Bedknobs and Broomsticks, Glory Road, The Parent Trap, The Mighty Ducks
Topic: Really Popular Stuff
Wreck-It Ralph, Monsters University, Where’s My Water?, UP, Iron Man
(Possibly a quirk of the data set and not really a content cluster. Content that was very popular or heavily promoted during the time covered by the data set could share the same patterns of popularity and therefore appear to be very similar.)
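To illustrate how groups like the ones listed above can emerge without anyone labeling the content, here is one very simple grouping rule: put each property in the group of its strongest feature. This is a sketch under our own assumptions and not necessarily the rule used to produce the clusters above; the ratings and the function name cluster_by_dominant_feature are hypothetical.

```python
from collections import defaultdict

def cluster_by_dominant_feature(affinities):
    """Group property names by the index of their highest-rated feature."""
    clusters = defaultdict(list)
    for name, ratings in affinities.items():
        # Index of the largest rating; ties go to the lowest index.
        dominant = max(range(len(ratings)), key=ratings.__getitem__)
        clusters[dominant].append(name)
    return dict(clusters)

# Hypothetical affinity ratings: one value per discovered feature.
affinities = {
    "Cinderella": [0.90, 0.05, 0.05],
    "Tangled": [0.80, 0.10, 0.10],
    "Lab Rats": [0.05, 0.15, 0.80],
    "Kick Buttowski": [0.10, 0.20, 0.70],
}

clusters = cluster_by_dominant_feature(affinities)
for feature, names in sorted(clusters.items()):
    print("Feature", feature, "->", ", ".join(names))
```

Note that the code only ever sees numbers. A name such as "Disney Princesses" is something a person assigns afterwards by looking at which properties ended up together.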
Further Work
The results suggest that this method could be a valuable tool for better understanding our content and how it relates to our guests. Because the method is automated and based on real-world data rather than on ideas of demographic trends that can become outdated, it can also adapt to the changing tastes of our audience. However, these methods should not be considered a replacement for human intelligence, since parts of the results are still open to interpretation. This can be seen in the last cluster, “really popular stuff”, which could be the result of oddities in the data set or the analysis method rather than a true content grouping. We could likely improve the results with a more complete data set or by further dividing the clusters into smaller groups. In short, the results above are a starting point for thinking about how to apply machine learning to the understanding of our content.