Most Popular Movie Genre Combinations

--

The Danish director of the movie Nymphomaniac revealed that he wants to introduce a new genre called Digressionism. Taking the word at its definition, genre is the “term for any category of literature or other forms of art or entertainment, e.g. music, whether written or spoken, audio or visual, based on some set of stylistic criteria.” The term dates back to ancient Greek literature. But, for writers, artists, and filmmakers, it’s usually the simplest, most practical way to categorize different styles of stories and content. We see genres while browsing through video stores or scrolling through Netflix, giving us a rough idea of what the stories are like or similar to.

Movie genres or film genres group different movies based on setting, characters, plot, story, tone, styles, syntax, templates, paradigms, motifs, rules, and themes. Sub-genres is a smaller category that may combine different elements from multiple genres.

It’s important to understand, though, that what we consider film genres today are, more often than not, hardly pure film genres, as they were in the early days of film. The majority of content produced in the last several decades are often genre hybrids, using the rules of genre theory to produce new, unique, and different stories.

So for this assignment I decided to analyze the hybrid of genres that are typically mainstream. I decided to consider genres of the movies that are top rated since they are critically acclaimed and can be considered substantial, rather than some insignificant ones. What better place to look for other than IMDB, The Internet Movie Database, an online database of information related to films, television programs, actors, production crew personnel, video games and fictional characters featured in visual entertainment media.so I went up and searched for the top rated movies of all time in their website.

1. Find and scrape data

I will be using web scraping process to gather the data that I need. I took the first page of IMDb “Top 1000” (Sorted by IMDb Rating Descending) as the source for my analysis. To make things easier I only considered the top 200 of it. I started the analysis by inspecting the html structure of the webpage. Each movie listed in the website were contained in an individual div tag lister-item mode-advanced. So to get the details of each movie, I need to loop through each of these div tags. Diving in deeper, I found out that the information I needed were between a span tag with a class genre which distinctly identifies the strings that I am looking for.

After clearly identifying the data I needed, I proceeded with gathering them. I imported all the libraries that I would need to perform this analysis.

  • Requests will allow us to send HTTP requests to get HTML files
  • BeautifulSoup will help us parse the HTML files
  • Pandas will help us assemble the data into a DataFrame to clean and analyze it
  • NetworkX provides classes for graphs
  • Matplotlib to plot the graphs

Next, I got the contents of the page by requesting the URL:

Make the content of the website I grabbed easy to read by using BeautifulSoup:

This got the whole content of the webpage into a readable format. But the content that I need resides in a particular div tag and only want that. So to get that we do:

This will get only the content inside the lister-item mode-advanced container.

Next I created an array arr = [] where I loop the content inside the div tag and store it here.

This for loop will go inside each div tag and looks for the span tag with the class of genre. Once it finds it, it will gather all the data inside it and appends the values inside arr[]

Now, when I print this array, I will get something like this

Using pandas, I transformed this raw data into a dataframe.

But some cleaning needs to be done in order to make this data in to much more usable form by removing the newline and empty spaces. I did that by

2. Create the Graph

Now we have something proper to work on. I compared the occurrence of the combination of the strings to find the number of times these genres combined in the top movie list.

So in my graph, the nodes will be the genres and the vertices will be the movies that had the combination of those genres. the weight of these vertices reflects the number of times that particular combination of genres happened over the list of movies. Since the vertices represent the time of combination, the direction of those vertices do not matter. Hence this graph would be called undirected graph.

For this analysis I decided to not include every genre that came up the list because some of them would not make sense to perform graph analysis over them. Hence I shortlisted the top 10 genre out of the list. I also excluded Drama, since any fictional movie would fall under this genre and it also makes no sense. by importing the networkx library, I was able to add the nodes, which are genres in this case.

The chosen genres are, comedy, Action, Crime, Adventure, Thriller, Horror, War, Sci-fi, Fantasy and Family.

The edges are added between the nodes. Here, the weight means the number of times this combination of genre happened among the top rated movies.

Plotting the layout in the form a circle would get us a network graph like this,

So our network graph shows the different combination of genres the top rated movies had. Here, the nodes are the genres and the weight of the vertices are the number of times the combination of the genres took place.

3. Graph Analysis

Sorting

By sorting the vertices based on their weight we can see that Action and Adventure were the two most famous combination that topped the list. Any combination with Drama would have topped the list but I excluded it since every movie, with an exception of a couple of the movies, are all fictional so they will fall under Drama. The next combination to top the list is Action and Fantasy. This is because of Star Wars, Lord of the Rings and Harry Potter franchises. The top 10 combinations always include Action or Adventure showing that serious movies are always critically acclaimed by the critics. The least and the most unexpected combinations were with Comedy, contributing very few. These surprising movies include The Great Dictator which is a combination of Comedy and War. Another movie is Snatch with a combination of Comedy and Crime. There are a lot of Comedy and Crime flicks made but only Snatch has been critically acclaimed.

Betweenness Centrality

Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another. So from the output, we can see that a large number of movies in the list were Action followed by Crime, Adventure and Thriller. As I said before, serious movies are the ones to top the list.

4. Limitations

The major limitation I had with this assignment is, I initially planned to create a Co-stardom network. The graph would have been very similar to what I created with the genres. The Actors would have been the node and the number of times they co starred together would have been the weight of the vertices between them. But the problem I faced is that, while scraping the texts I wanted to scrap were under similar div tags with similar classes and I couldn’t figure out a way to identify the tag I wanted.

The limitation with this network is that I couldn’t come up with an efficient way to group the genres and increment them based on their combinations.

--

--