Analyzing IMDb’s Top 250 Movies, Part 1: Let’s Scrape Some Data

S Dhanush · Published in Analytics Vidhya · 9 min read · Jan 20, 2021

Web scraping the IMDb Top 250 movies for statistical analysis

Photo by Denise Jans on Unsplash

Introduction 📔

Last month, while out with a friend for a few drinks, we started discussing the many movies released this year, as one does in these situations. Somewhere during our ramble through the movies, my friend asked me something that intrigued me:

What factors determine if a movie is successful?

This intrigued me enough that I wanted to perform a statistical analysis of the movies. Getting the movie data is quite easy thanks to IMDb (Internet Movie Database). As a side note, it is worth mentioning that IMDb provides rather huge public datasets covering almost 7 million titles. But that is a can of worms I do not wish to open right now (I will delve into it in the future). Here, I will use the IMDb chart for the Top Rated Movies (IMDb Top 250).

Let’s Scrape that data 🔍

The first step in analyzing the IMDb Top 250 data was to scrape it from the website and build a sensible dataset out of it. Python provided me with some useful packages, such as BeautifulSoup and Requests, that allowed me to easily scrape useful data about these movies. Let’s begin by first analyzing the page and its underlying HTML code to find the required data.

For those who don’t want to get into the nitty-gritty of the whole web scraping process, here is the GitHub link to the Python Jupyter Notebook. (Do ⭐️ it if you found it useful 💛)

IMDb top-rated movies (imdb.com/chart/top) and the corresponding source code

The page comprises a table showing the Top 250 movies. Each row in the table contains a poster of the movie, its rank and title, and its IMDb rating (it also has some other columns, but these were the ones of interest to me). The title of the movie is a hyperlink that routes us to the details page of that particular movie. It was this link I was interested in acquiring. Once I had these links, I could drill into the individual page for each movie and gather a lot more data.

import requests
from bs4 import BeautifulSoup

# Fetch and parse the Top 250 chart page
url = 'https://www.imdb.com/chart/top'
url_text = requests.get(url).text
url_soup = BeautifulSoup(url_text, 'html.parser')
# Each title cell links to the movie's details page; build absolute URLs
template = 'https://www.imdb.com%s'
title_links = [template % a.attrs.get('href') for a in url_soup.select('td.titleColumn a')]

With this simple code, I gained links to all 250 movies. Now it was just a matter of going through each of these pages (with a for-loop, of course) and scraping the data I required.
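Since every snippet below operates on a page_soup object for the current movie, here is a minimal sketch of what that loop looks like. The enumerate index and the one-second delay are my additions (to be gentle on IMDb’s servers); the rest mirrors the chart-scraping code above:

import time

for i, page_url in enumerate(title_links):
    # Fetch and parse the details page for one movie;
    # page_soup is what all the snippets below operate on.
    page_text = requests.get(page_url).text
    page_soup = BeautifulSoup(page_text, 'html.parser')
    time.sleep(1)  # assumption: a small polite delay between requests
    # ... the per-movie scraping from the following sections goes here ...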

The Shawshank Redemption movie page, showcasing the div with class = title_bar_wrapper

Now to the meat and potatoes of the whole web scraping operation. The first and foremost set of data comes from the title bar. All this data is bundled up inside a div with the class name title_bar_wrapper.

From this, we can derive the following fields:

🚀 Name of the movie
🚀 Date and year of release
🚀 IMDb movie rating
🚀 The censor board rating that’s given to the movie (U, U/A, R, etc.)
🚀 Its total runtime
🚀 Its Genre(s)

# The title text holds both the name and the year, e.g. "Movie Name(1994)"
title_text = page_soup.find("div", {"class": "title_wrapper"}).get_text(strip=True).split('|')[0]
movie_name = title_text.split('(')[0]
year = title_text.split('(')[1].split(')')[0]
rating = page_soup.find("span", {"itemprop": "ratingValue"}).text
vote_count = page_soup.find("span", {"itemprop": "ratingCount"}).text
subtext = page_soup.find("div", {"class": "subtext"}).get_text(strip=True).split('|')

The above code scrapes all the above-mentioned fields plus the subtext, which contains the rest of the data. But there is a catch: a few of the movies are unrated by the censor board, and when a movie is unrated, the subtext array simply doesn’t have a field for it. So I came up with the code below as a workaround: in that case, I just assign those movies a censor rating of No rating.

if len(subtext) < 4:
    # Setting values when the movie is unrated
    censor_rating = "No rating"
    movie_length = subtext[0]
    genre_list = subtext[1].split(',')
    while len(genre_list) < 4:
        genre_list.append(' ')
    release_date_and_country = subtext[2].split('(')
    release_date = release_date_and_country[0]
else:
    censor_rating = subtext[0]
    movie_length = subtext[1]
    genre_list = subtext[2].split(',')
    while len(genre_list) < 4:
        genre_list.append(' ')
    release_date_and_country = subtext[3].split('(')
    release_date = release_date_and_country[0]
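To make the branching clearer, here is roughly what subtext looks like in each case. These are illustrative values based on the pre-redesign IMDb layout this post scrapes, not actual scraped output:

# Rated movie: four fields, censor rating first
# ['R', '2h22min', 'Drama', '14 October 1994(USA)']
# Unrated movie: only three fields, the censor rating is simply absent
# ['2h22min', 'Drama', '14 October 1994(USA)']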

Below the poster and the videos and images from the movie lies the next set of data points I could scrape: the plot summary (I really have no plans for this; I am just scraping it now since I can’t be bothered to do it later if I decide to perform NLP or something with it), the name of the director, the writers, and the leading stars. The plot summary is the easiest, since it can be scraped from the text in a div with the class name summary_text.

Plot summary and other plot-related credits for the movie The Shawshank Redemption

Scraping the rest of the details was slightly trickier. Each of the other three details was wrapped in a separate div with the same class name, credit_summary_item. This meant I had to find a way to scrape them all as a single object and then separate them. This is where BeautifulSoup’s find_all() method came in handy: it allowed me to scrape all the divs at once and make a list out of them. Once I had this list, I could pop each item off the list, and just like that, I had my desired data.

import re

# Getting the movie summary
summary = page_soup.find("div", {"class": "summary_text"}).get_text(strip=True)
# Getting the credits for the director, writers, and stars
credit_summary = []
for summary_item in page_soup.find_all("div", {"class": "credit_summary_item"}):
    credit_summary.append(re.split(r',|:|\|', summary_item.get_text(strip=True)))
# The divs appear in the order director, writers, stars, so pop() returns them in reverse
stars = credit_summary.pop()[1:4]
writers = credit_summary.pop()[1:3]
director = credit_summary.pop()[1:]
while len(stars) < 3:
    stars.append(" ")
while len(writers) < 2:
    writers.append(" ")
writer_1, writer_2 = writers
writer_1 = writer_1.split('(')[0]
writer_2 = writer_2.split('(')[0]
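For reference, each credit_summary entry after the re.split looks roughly like the lists below (illustrative values, not actual output), which is why the pop() calls slice from index 1 onward to skip the ‘Director’/‘Writers’/‘Stars’ label:

# ['Director', 'Frank Darabont']
# ['Writers', 'Stephen King(short story)', 'Frank Darabont(screenplay)']
# ['Stars', 'Tim Robbins', 'Morgan Freeman', 'Bob Gunton', 'See full cast & crew»']

The splits on ‘(’ at the end then trim the parenthesized role notes off the writer names.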

The last section I was interested in was the extremely data-heavy section at the bottom of the page called Movie Details. This section contains a treasure trove of data points such as:

🚀 Countries where the movie was shot.
🚀 Languages used in the movie.
🚀 Box office details.
🚀 Production company details, among others.

Movie details for the movie The Shawshank Redemption

Now, this section posed the same dilemma as the section containing the director, writers, and stars: all of the items were in separate divs with the same class name, txt-block. So, while the previous approach would have worked, I found a better workaround: use a dictionary and store all these scraped fields in it. This way, I could easily look up the data I wanted by its label, process it, and get my desired data points. This was a much better approach here, as there were many fields with the same class name and I wanted several of these data points.

box_office_details = []
# The dictionary keys match the labels IMDb uses in the Details section
box_office_dictionary = {'Country': '', 'Language': '', 'Budget': '',
                         'Opening Weekend USA': '', 'Gross USA': '',
                         'Cumulative Worldwide Gross': '', 'Production Co': ''}
for details in page_soup.find_all("div", {"class": "txt-block"}):
    detail = details.get_text(strip=True).split(':')
    if detail[0] in box_office_dictionary:
        box_office_details.append(detail)
for detail in box_office_details:
    if detail[0] in box_office_dictionary:
        box_office_dictionary.update({detail[0]: detail[1]})
country = box_office_dictionary['Country'].split("|")
while len(country) < 4:
    country.append(' ')
language = box_office_dictionary['Language'].split("|")
while len(language) < 5:
    language.append(' ')
budget = box_office_dictionary['Budget'].split('(')[0]
opening_week_usa = ','.join((box_office_dictionary['Opening Weekend USA'].split(' ')[0]).split(',')[:-1])
gross_usa = box_office_dictionary['Gross USA']
gross_worldwide = box_office_dictionary['Cumulative Worldwide Gross'].split(' ')[0]
production_list = box_office_dictionary['Production Co'].split('See more')[0]
production = production_list.split(',')
while len(production) < 4:
    production.append(" ")
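One thing to note: the money fields come out as raw strings like ‘$25,000,000’. Since the end goal is statistical analysis, a small helper to coerce them into plain integers may come in handy later. This is a sketch of my own, not part of the original notebook, and it assumes simple US-dollar formatting:

def money_to_int(raw):
    # Keep only the digits ('$25,000,000' -> '25000000') and convert;
    # return None when the field was empty or unparseable.
    digits = ''.join(ch for ch in raw if ch.isdigit())
    return int(digits) if digits else None

# e.g. money_to_int(budget) -> 25000000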

Saving the data 💾

Huzzah! I had finally scraped all the data from these movies. Now I can store this data somewhere and then not do anything with it… just kidding (use humor in my blog: ✅😆). I wanted to store the data in two forms: as a JSON object and as a DataFrame. The DataFrame allowed me to easily extract the features and correlations I needed using Python, while storing the data as a JSON object would make it easy to use with other programming languages. So let’s start with the easy one.

1. Storing as a JSON object

Now that I had scraped all the data I desired, I created a dictionary of these data points for each movie and appended it to a list, building up an array of movie details for all 250 movies.

imdb_movie_list = []
# ... the for-loop over all 250 movie pages begins here ...
movie_dict = {
    'ranking': i + 1, 'movie_name': movie_name, 'url': page_url,
    'year': year, 'rating': rating, 'vote_count': vote_count,
    'summary': summary, 'production': production, 'director': director,
    'writers': [writer_1, writer_2], 'stars': stars, 'genres': genre_list,
    'release_date': release_date, 'censor_rating': censor_rating,
    'movie_length': movie_length, 'country': country, 'language': language,
    'budget': budget, 'gross_worldwide': gross_worldwide,
    'gross_usa': gross_usa, 'opening_week_usa': opening_week_usa
}
# The append has to happen inside the loop so every movie is kept
imdb_movie_list.append(movie_dict)
# ... the for-loop ends here ...

I also added a timestamp into the mix so I can keep track of when I scraped this data. Next, to store this as a JSON object, I used Python’s json package and dumped the data into the imdb_movies_data.json file.

import json
from datetime import datetime

timestamp = datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f')
imdb_list = {
    "timestamp": timestamp,
    "imdb_movies": imdb_movie_list
}
with open('imdb_movies_data.json', 'w') as file:
    json.dump(imdb_list, file)
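As a quick sanity check (or as a template for consuming the file elsewhere), reading the dump back in is just as short:

with open('imdb_movies_data.json') as file:
    data = json.load(file)
print(data['timestamp'], len(data['imdb_movies']))  # expect 250 movies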

2. Storing as a DataFrame into a CSV

A Pandas DataFrame is a data structure that holds two-dimensional data along with its corresponding labels. It is fast, easy to use, and more powerful than plain tables or spreadsheets because it is an integral part of the Python and NumPy ecosystems. Externally, I could save the DataFrame as a CSV (comma-separated values) file to use at any point later on.

To begin with, I needed to create the DataFrame. I opted to set its columns with a list of properties. I ended up with 36 columns, since I separated certain properties (genres, stars, countries, languages, and so on) into multiple columns instead of saving them as list objects. This reduces my workload down the line when I have to extract data with a desired set of properties.

import pandas as pd

dataframe_columns = ['ranking', 'movie_name', 'url', 'year', 'rating',
                     'vote_count', 'summary', 'production_1', 'production_2',
                     'production_3', 'director', 'writer_1', 'writer_2',
                     'star_1', 'star_2', 'star_3', 'genre_1', 'genre_2',
                     'genre_3', 'genre_4', 'release_date', 'censor_rating',
                     'movie_length', 'country_1', 'country_2', 'country_3',
                     'country_4', 'language_1', 'language_2', 'language_3',
                     'language_4', 'language_5', 'budget', 'gross_worldwide',
                     'gross_usa', 'opening_week_usa']
dataframe = pd.DataFrame(columns=dataframe_columns)
Empty DataFrame with columns set using the list

The final step in creating the DataFrame was to store the data in it. This was very easy to implement, given the brainy move of storing all the data as a list of dictionaries. All I had to do now was loop through the list and assign each dictionary value to its corresponding column.

for i in range(0, len(imdb_movie_list)):
    dataframe.at[i, 'ranking'] = imdb_movie_list[i]['ranking']
    dataframe.at[i, 'movie_name'] = imdb_movie_list[i]['movie_name']
    dataframe.at[i, 'url'] = imdb_movie_list[i]['url']
    dataframe.at[i, 'year'] = imdb_movie_list[i]['year']
    dataframe.at[i, 'rating'] = imdb_movie_list[i]['rating']
    dataframe.at[i, 'vote_count'] = imdb_movie_list[i]['vote_count']
    dataframe.at[i, 'summary'] = imdb_movie_list[i]['summary']
    dataframe.at[i, 'production_1'] = imdb_movie_list[i]['production'][0]
    dataframe.at[i, 'production_2'] = imdb_movie_list[i]['production'][1]
    dataframe.at[i, 'production_3'] = imdb_movie_list[i]['production'][2]
    dataframe.at[i, 'director'] = imdb_movie_list[i]['director'][0]
    dataframe.at[i, 'writer_1'] = imdb_movie_list[i]['writers'][0]
    dataframe.at[i, 'writer_2'] = imdb_movie_list[i]['writers'][1]
    dataframe.at[i, 'star_1'] = imdb_movie_list[i]['stars'][0]
    dataframe.at[i, 'star_2'] = imdb_movie_list[i]['stars'][1]
    dataframe.at[i, 'star_3'] = imdb_movie_list[i]['stars'][2]
    dataframe.at[i, 'genre_1'] = imdb_movie_list[i]['genres'][0]
    dataframe.at[i, 'genre_2'] = imdb_movie_list[i]['genres'][1]
    dataframe.at[i, 'genre_3'] = imdb_movie_list[i]['genres'][2]
    dataframe.at[i, 'genre_4'] = imdb_movie_list[i]['genres'][3]
    dataframe.at[i, 'release_date'] = imdb_movie_list[i]['release_date']
    dataframe.at[i, 'censor_rating'] = imdb_movie_list[i]['censor_rating']
    dataframe.at[i, 'movie_length'] = imdb_movie_list[i]['movie_length']
    dataframe.at[i, 'country_1'] = imdb_movie_list[i]['country'][0]
    dataframe.at[i, 'country_2'] = imdb_movie_list[i]['country'][1]
    dataframe.at[i, 'country_3'] = imdb_movie_list[i]['country'][2]
    dataframe.at[i, 'country_4'] = imdb_movie_list[i]['country'][3]
    dataframe.at[i, 'language_1'] = imdb_movie_list[i]['language'][0]
    dataframe.at[i, 'language_2'] = imdb_movie_list[i]['language'][1]
    dataframe.at[i, 'language_3'] = imdb_movie_list[i]['language'][2]
    dataframe.at[i, 'language_4'] = imdb_movie_list[i]['language'][3]
    dataframe.at[i, 'language_5'] = imdb_movie_list[i]['language'][4]
    dataframe.at[i, 'budget'] = imdb_movie_list[i]['budget']
    dataframe.at[i, 'gross_worldwide'] = imdb_movie_list[i]['gross_worldwide']
    dataframe.at[i, 'gross_usa'] = imdb_movie_list[i]['gross_usa']
    dataframe.at[i, 'opening_week_usa'] = imdb_movie_list[i]['opening_week_usa']
dataframe = dataframe.set_index(['ranking'], drop=False)
dataframe.to_csv('imdb_movies_data.csv')
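And a quick round-trip check that the CSV came out as expected; again a sketch of mine, not part of the original notebook:

check = pd.read_csv('imdb_movies_data.csv', index_col=0)
print(check.shape)                             # expect (250, 36)
print(check[['movie_name', 'rating']].head())  # spot-check the first few rows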

This concludes the first part of my analysis of the IMDb Top 250 movies. Having scraped all this data, the next step was to extract the datasets I fancied and perform statistical analysis on them. I will share all my findings with you once that analysis is ready.

Once again, the complete code is available on GitHub as a Python Jupyter Notebook. Drop a ⭐️ on it if you liked it.

I hope you enjoyed this article and learned something interesting about scraping data. Thank you for coming along with me on this part of my journey. If you have questions, doubts, or thoughts on this, please feel free to 👏 and comment. Thanks!
