Data Clustering Using Unsupervised Learning: What Type of Movies Are in the IMDB Top 250?

Han Man
Mar 17, 2017 · 13 min read
There’s been a mistake…and it’s Steve Harvey’s fault

I was watching the end of the 2017 Oscars and, as painful as it was, I was happy to see Moonlight win. What's so great about La La Land anyways? A scrappy, young, beautiful, but mediocre-at-singing actress makes it big in Hollywood…it seems like this story has been done before. It got me thinking: what types of movies end up highly rated and widely acclaimed? Maybe this story has been done before, but the academy could just have a type. Is this type a musical, or Ryan Gosling?

I decided to look at the IMDB top 250 movies to see if I could find some relationship between these movies. This was also a good chance to test out some new techniques I recently picked up, both in unsupervised learning and in dimensionality reduction.

My goal was to:

1. Use unsupervised learning to understand what type of natural clusters exist within the top 250 movies of all time (according to IMDB)

2. Use dimensionality reduction techniques to try to improve my clustering output on a small dataset where the features are very numerous relative to the number of samples

The winners

Data Acquisition and Processing

I pulled the data for the IMDB top 250 movies, with details on each movie conveniently accessed through the OMDb API. The IMDB website lists the top 250 rated movies:

Top 250 movies on IMDB.

The API required the movie id as an input, and would return relevant details about the movie. I first scraped the top 250 movies page on IMDB to acquire a list of movie IDs.

URL = "http://www.imdb.com/chart/top"
r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")
entries=soup.findAll('div', class_="wlb_ribbon")
movie_ids=[]
for a in entries:
movie_ids.append(a['data-tconst'])

Then, using this list of movie IDs, I queried the API.

import pandas as pd

# Query the OMDb API for each movie id and collect the returned fields.
header = 'http://www.omdbapi.com/?i='
movie_info = []
for i in movie_ids:
    url = header + i + "&"
    r = requests.get(url).json()
    movie = []
    for a in r.keys():
        movie.append(r[a])
    movie_info.append(movie)
columns = list(r.keys())
df = pd.DataFrame(movie_info, columns=columns)

This returned the following data on every film:

Data from the API pull.

I also pulled the budget and revenue of each movie to see if I could discern which type of movies generated more profit from the top 250.

import numpy as np

# Pull each movie's full IMDB page so we can parse budget and gross.
content = []
n = 0
for a in movie_ids:
    URL = "http://www.imdb.com/title/" + a
    r = requests.get(URL)
    content.append(r.content)
    print("done: " + str(n))
    n += 1

contentsoup = [BeautifulSoup(c, "lxml") for c in content]
budget = []
revenue = []
for soups in contentsoup:
    entries = soups.findAll('div', class_="txt-block")
    # Budget and gross sit at fixed positions on the page; fall back to NaN
    # for movies that do not report them.
    try:
        budget.append(float(entries[9].text.split(":")[1].replace(",", "").replace("$", "").split("(")[0].replace(" ", "")))
    except (IndexError, ValueError):
        budget.append(np.nan)
    try:
        revenue.append(float(entries[10].text.split(":")[1].replace(",", "").replace("$", "").split("(")[0].replace(" ", "")))
    except (IndexError, ValueError):
        revenue.append(np.nan)

Then I had to clean each column. I decided to include:

Plot, Country, Writer, Director, Actors, Year, Genre, Runtime

The rest of the columns were either too obvious, such as IMDB Rating, or not relevant, such as IMDB ID. For the numerical columns, I first converted Runtime into a float in minutes. Then I converted Year into a binary flag: 1 for recent and 0 for not, with 1990 as the cutoff for a recent movie.
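A minimal sketch of that conversion, assuming the OMDb fields arrive as strings like "142 min" and "1994" (the column names RuntimeMin and Recent match the features referenced later):

# Sketch: parse runtime minutes and flag movies from 1990 onward as recent.
df['RuntimeMin'] = df['Runtime'].apply(lambda x: float(x.split(" ")[0]))
df['Recent'] = df['Year'].apply(lambda y: 1 if int(y) >= 1990 else 0)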

For the rest of the text heavy columns, I had to pull out the most frequent occurrences for words and create dummy columns for those. The plot was the most challenging column because it contained many different words.

import operator

# Concatenate all plots, tokenize, and count word frequencies.
plots = list(df['Plot'])
temp = ""
for p in plots:
    temp = temp + p
temp = temp.replace(".", " ").replace(",", "")
temp = temp.split(" ")
temp = [x.lower() for x in temp]
freq = {i: temp.count(i) for i in set(temp)}
sort = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)
sort

Viewing the most frequently occurring words, most were stopwords like "the" and "a". I combed through the list and chose the top words, roughly 20, that related to types of plot.

movie_words= ["young", "man", "help", "find", "life", "against", "war", "police", "family", "journey", 
"jewish", "son", "boy", "world", "love", "save", "dark", "friends", "murder"]

All of these words occur at least 5 times throughout the plots of the top 250 movies. The usual suspects for intriguing movies are there: "murder", "love", "war." I also immediately notice another pattern: the frequent occurrence of masculine words such as "man", "son", and "boy". This echoes some of the backlash circling the academy in the months leading up to the academy awards over its neglect of movies focused on minorities, amplifying the win for Moonlight. If the movies naturally group around this male connotation, we will hopefully see it appear in our clustering exercise.

def plotcounts(a):
    # Tokenize a plot and return the chosen keywords it contains.
    temp = a.replace(".", " ").replace(",", "")
    temp = temp.split(" ")
    temp = [x.lower() for x in temp]
    words = []
    for b in movie_words:
        if b in temp:
            words.append(b)
    return words

df['plotNew'] = df['Plot'].apply(plotcounts)

# One dummy column per plot keyword.
for m in movie_words:
    df[m] = df['plotNew'].apply(lambda words, m=m: 1 if m in words else 0)

For the rest of the columns, such as director or actor, I took the top handful of names that appeared and created dummy columns for them. For example, with actors, I pulled the 25 that appeared most frequently. Multiple actors/actresses appeared for each movie:

# Flatten every comma-separated cast list into one list of actor names,
# then keep the 25 most frequent.
actors = []
for cast in df['Actors']:
    actors.extend(cast.split(", "))
freq = {i: actors.count(i) for i in set(actors)}
sort = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)
top_actors = [a[0] for a in sort[0:25]]
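A minimal sketch of the dummy-column step for the actors (naming each column after the actor is my choice):

# Sketch: a binary column per top actor, flagging movies whose cast list
# contains that actor.
for actor in top_actors:
    df[actor] = df['Actors'].apply(lambda cast, actor=actor: 1 if actor in cast.split(", ") else 0)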

This was repeated for all of the other text columns. That left me with a dataframe with 250 entries and 141 features. This is daunting for modeling: with so many features relative to data points, it is difficult to extract relevant information, thanks to the curse of dimensionality. As the feature set grows, the feature space grows exponentially, pushing the relative separation between data points toward parity. The feature set then ceases to provide predictive power for grouping similar points, because all points become roughly equally dissimilar. You can think of this phenomenon as the difference between stomping ants on your floor and catching flies in your room: there is simply more space for flies to occupy in three dimensions than there are places for ants to go in two. Similarly, in n-dimensional space the distances between points grow vast, and the relative difference between the nearest and farthest points shrinks toward zero, making points indistinguishable from one another. I would have to use dimensionality reduction to effectively analyze this dataset.
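A quick numerical sketch (my addition, separate from the movie data) illustrates the effect: sample random points in increasing dimension and watch the relative gap between the nearest and farthest point collapse.

import numpy as np

# Relative contrast between the nearest and farthest neighbor shrinks as the
# number of dimensions grows.
rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    pts = rng.random((250, d))                        # 250 points in d dims
    dists = np.linalg.norm(pts[1:] - pts[0], axis=1)  # distances to point 0
    print(d, round((dists.max() - dists.min()) / dists.min(), 3))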

Modeling the Data

The Shawshank Redemption: the highest-rated movie of all time

Now that the data had been gathered and compiled, I clustered the movie types. The Shawshank Redemption is rated the best movie of all time, but does it stand out relative to the rest of the top 250?

Principal Component Analysis:

Before fitting any clustering algorithm, I reduced the number of features using Principal Component Analysis, or PCA. This method decomposes a large matrix into orthogonal component vectors using the eigenvectors and eigenvalues of its covariance matrix.

from sklearn.decomposition import PCA

# Fit PCA on the full feature matrix and count the components that each
# explain more than 1% of the variance.
pca = PCA().fit(X)
top_PCA = ["%.2f" % a for a in pca.explained_variance_ratio_ if a > 0.01]
len(top_PCA)

sumall = sum(pca.explained_variance_ratio_)
pca24 = PCA(n_components=24).fit(X)
sum24 = sum(pca24.explained_variance_ratio_)
print(sum24 / sumall)
Top 24 components and the % of variance accounted for.

The first two components accounted for >25% of the variance, but the amount explained dropped off quickly from there. I decided to keep every component that explained more than 1% of the variance, which left 24 components out of the 141 features I started with. These top 24 components explained ~75% of the total variance in the data.
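A cumulative explained-variance plot is a quick way to sanity-check a cutoff like this (a sketch I've added, assuming matplotlib):

import matplotlib.pyplot as plt
import numpy as np

# Cumulative share of variance explained as components are added.
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar)
plt.axvline(24, linestyle='--')  # the 24-component cutoff chosen above
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()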

I decided to look further into the first component: which variables were most important to it?

# Pair each feature with its weight in the first component and sort by
# absolute weight.
first_comp = pca.components_[0]
first_comps = pd.DataFrame(list(zip(first_comp, X.columns)), columns=['weights', 'features'])
first_comps['abs_weights'] = first_comps['weights'].abs()
first_comps.sort_values('abs_weights', ascending=False)
The top feature contributors to the first principal component.

RuntimeMin appears first with by far the largest weight. This is troubling: it is the only continuous variable, so it naturally carries more variance than any dummy column, but the fact that its weight is so much higher than every other feature suggests that PCA does not work that well on a dataset that is predominantly dummy columns. This is a good concept to keep in mind for future uses of PCA. Converting runtime into dummy columns as well, instead of leaving one feature continuous, might also help.
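Another possible remedy, sketched here but not tried in this analysis, is to standardize the features before PCA so the lone continuous column cannot dominate:

from sklearn.preprocessing import StandardScaler

# Put RuntimeMin on the same footing as the dummy columns before decomposing.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=24).fit(X_scaled)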

The other variables that come up are interesting: it is important to be a Drama, History, or Biography, but not important to be a Comedy. This aligns with our understanding that most top rated movies are in fact serious, not humorous.

KMeans Clustering:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Project the features onto the 24 components, then scan n_clusters.
Xpca24 = pca24.transform(X)
for n in range(2, 50):
    KM = KMeans(n_clusters=n)
    KM.fit(Xpca24)
    print(str(n) + ": " + str(silhouette_score(Xpca24, KM.labels_, metric='euclidean')))

KM5 = KMeans(n_clusters=5)
KM5.fit(Xpca24)
kmeanlabels = KM5.labels_
Silhouette score for various n_clusters values.

Now that I had a reduced feature space with 24 components, I could try some clustering algorithms.

I first tested a range of values for the number of clusters in the KMeans algorithm; the n_clusters parameter defines how many clusters the algorithm partitions the data into. Watching the silhouette score as I cycled over the values of n, it did not improve significantly for any particular n.

This makes choosing a proper n difficult, and the KMeans algorithm depends heavily on this parameter. I decided to try DBScan instead, an algorithm that determines the appropriate number of clusters on its own.

DB Scan Clustering:

from sklearn.cluster import DBSCAN

# Scan eps and min_samples, reporting silhouette scores for fits that yield
# a sane number of clusters.
for eps in [0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    for min_samples in range(1, 20):
        db = DBSCAN(eps=eps, min_samples=min_samples)
        db.fit(Xpca24)
        if len(set(db.labels_)) > 4 and len(set(db.labels_)) < 249:
            print(str(eps) + " " + str(min_samples) + ": " + str(silhouette_score(Xpca24, db.labels_, metric='euclidean')))

dbOPT = DBSCAN(eps=2, min_samples=5)
dbOPT.fit(Xpca24)
dbscanlabels = dbOPT.labels_
X['kmean'] = kmeanlabels
X['dbscan'] = dbscanlabels
Silhouette score for DB Scan clusters using various eps, min_samples

Similarly, I cycled over various values of eps and min_samples looking for a good silhouette score. Unfortunately, the silhouette scores were not as high as I had hoped. I took the label output of the best-performing model, with eps=2 and min_samples=5.

Note: min_samples does not set the number of clusters the model returns; it is the minimum number of samples that must fall within eps of a point for that point to anchor a cluster.

It turns out that this best-performing DBScan model clustered the data into 5 classes. I went back to the KMeans algorithm and used the model with n_clusters=5. Using the labels from both models, I was able to reach a number of conclusions.

Key Takeaways and Conclusion

The challenge with dimensionality reduction and clustering is visualization and interpretability. It is difficult to understand the output when so many features are used. Further complicating matters, PCA transforms the feature space, making it hard to trace what each component decomposes into relative to the original set of features.

I decided to look at the columns in groups based on the original features: directors, actors, and so on. I plotted the average value of each feature across the different labels obtained from DBScan. Comparing these averages characterizes each class: if the average value of the Drama column is higher for class 1 than for class 2, class 1 is the group heavily weighted toward drama movies. Similarly, if class 1 has a high mean for both Drama and Martin Scorsese, I can infer this group to be Scorsese dramas and other similar movie types.
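Computing those per-class means is a one-liner with pandas; a sketch using an illustrative subset of the genre dummy columns:

# Mean of each dummy column within each DBScan label; a value near 1 means
# the cluster is dominated by that feature.
genre_cols = ['Drama', 'Comedy', 'Adventure']  # illustrative subset
X.groupby('dbscan')[genre_cols].mean()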

Plot features: average value by class
Actors: average value by class
Director: average value by class
Country: average value by class
Writer: average value by class
Genre: average value by class
Rating, Runtime, and Recent: average value by class

The dbscan labels range from 0 to 4, creating 5 classes; points labeled -1 are outliers that belong to no cluster. For the write-up, I renumbered the groups: labels 1 through 4 stay as clusters 1 through 4, the outliers (-1) become cluster 5, and label 0 becomes cluster 6.
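In code, that renumbering is just a label mapping (a sketch; the cluster column name is my own):

# Map raw DBScan labels to the cluster numbers used in the write-up.
relabel = {1: 1, 2: 2, 3: 3, 4: 4, -1: 5, 0: 6}
X['cluster'] = X['dbscan'].map(relabel)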

Let's see what movies fall into each cluster. The movies in clusters 1 through 6:

Attributes for Group 1:

Plot: Help, Save, Dark

Director: Christopher Nolan

Actors: Christian Bale, Leo DiCaprio, Tom Hardy, Michael Caine, Hugh Jackman

Genre: Drama, Adventure, Action, Sci-Fi

Rating and Era: PG-13, Recent

Attributes for Group 2:

Plot: Help, Jewish, Love

Actors: Charlie Chaplin

Genre: Comedy

Attributes for Group 3:

Genre: Adventure, Animation, Family

Plot: Young, Find, Journey, World

Director: Hayao Miyazaki

Actor: Toshiro Mifune

Country: Japan

Attributes for Group 4:

Plot: Man, War, World

Director: Stanley Kubrick

Actor: William Holden

Genre: Adventure, Biography, History

Country: UK

Runtime: >1.5 SDs over mean

Attributes for Group 5:

Plot: Against, War, World

Director: Peter Jackson, Clint Eastwood, Sergio Leone, Akira Kurosawa

Genre: Adventure, Western, Action, Fantasy

Runtime: >1 SD over mean

Attributes for Group 6:

The rest of the movies not included in previous clusters

__________________________________________________________________

This is not easy to break down, but examining these groups starts to give us an understanding of the properties within each grouping. By viewing their characteristics, I can more or less name these groupings:

Group 1- Christopher Nolan's Universe: He tends to work with the same actors (Bale), and makes recent, PG-13 movies with drama and sci-fi. These speak to the Batman movies, but the group also includes his other films like Inception and Interstellar.

Group 2- Charlie Chaplin Comedies: Chaplin accounts for most of the comedies in the top 250.

Group 3- Hayao Miyazaki Animations: Adventure and family movies from Japan, with plots about finding, journeys, and the world.

Group 4- Long, Biographical Epics: A variety of biographical movies, typically during war, that are extremely long in runtime.

Group 5- Action/Adventure Conflicts: A variety of samurai, western, and war movies, whether real or fantasy in nature.

Group 6- All Other Movies: Everything else not falling into the clusters above is sufficiently similar to be lumped together.

There are actually some significant conclusions I can draw from these groupings. First of all, three dominant directors have monopolized the top movies and carved out very specific niches: Nolan, Chaplin, and Miyazaki. These can be considered three of the most prolific and beloved directors/writers of their eras. Each has his own cluster within the top 250, separated from the other movies by subject matter (Nolan's universes, characterized by darkness and a hero saving the world), type (Miyazaki's adventure animations), and genre (Chaplin's comedies). Other directors, such as Steven Spielberg and Francis Ford Coppola, have been similarly prolific within the top 250, but the other features of their movies are not sharply defined enough to cluster together.

Group 4 consists of long biographical epics, and group 5 contains action/adventure movies. Group 4 includes biographies with plots concerning "man", "war", and "world." These movies average >1.5 SDs above the mean runtime, separating themselves as the longest-running movies of the top 250. The cluster brings together movies from different eras and different countries that are all long and biographical in nature, such as Gandhi, Barry Lyndon, and Lawrence of Arabia. This is encouraging: the clustering is able to identify similar movie types across diverse subject matter.

Average runtime across movie clusters. Clusters labeled as 4 from the clustering algorithm have significantly higher runtimes compared with other clusters.
Average value of plot word feature columns by cluster label. Clusters labeled as 4 have “man”, “war”, and “world” as the most common appearing plot words.

Group 5 includes action/adventure movies spanning fantasy (the Lord of the Rings series, Mad Max), WW2 (The Pianist, Inglourious Basterds), Westerns (Once Upon a Time in the West, The Good, the Bad and the Ugly), and Japanese samurai movies (Seven Samurai, Yojimbo). These two clusters are interesting because they group seemingly diverse movies and directors together. Group 5 movies were considered "outliers" by the DB Scan model, but even this result provided useful insight: as outliers, they separated themselves from Group 6, the rest of the movies not included in other clusters. These seemingly disparate movies within cluster 5 also reveal some hidden connections: the works of Peter Jackson, Clint Eastwood, Sergio Leone, and Akira Kurosawa are more similar than they appear at first blush.

In fact, Kurosawa is often credited as the father of several later subgenres, including modern westerns like The Magnificent Seven and A Fistful of Dollars. It is encouraging that the clustering algorithm was able to tease out the underlying connections between westerns and samurai movies, and the influences between these major directors.

Average runtime across movie clusters. Clusters labeled as -1 from the clustering algorithm, which fall into cluster 5, have high runtimes, averaging >1 SD from the mean.
Average value of director feature columns by cluster label. Clusters labeled as -1 group together movies directed by Akira Kurosawa, Sergio Leone, Quentin Tarantino, Clint Eastwood, and Peter Jackson.

There are also some lessons to be gleaned from this analysis. A data set that 1) contains a large set of features relative to data points and 2) is full of dummy columns poses a challenge to creating coherent, interpretable clusters. These are important considerations for future analysis. It would be useful to cross-reference this output against a larger database of films to see which other films fall into these categories and what new clusters might emerge with new data.

However, using this methodology has been surprisingly insightful and rewarding. This output suggests the outline of the kind of methodology we see in movie recommendation algorithms such as Netflix's. To recommend similar movies to a user, clustering can reveal similarities both obvious (Christopher Nolan has made a ton of amazing Batman movies) and less obvious (there is a group of amazing action/adventure movies from different directors and countries that are very similar to one another). Unsupervised learning and dimensionality reduction are two powerful tools in the machine learning toolkit with which we can draw conclusions from a data set, even when there is no clear target variable to predict.
