Which movie should I watch?

Scraping Top 50 Movies on IMDb using BeautifulSoup, Python

Nishant Sahoo
May 6, 2018

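In this article, I will show how to scrape the list of the 50 most popular movies for each year from 1898 to the present, as listed on IMDb in 2018.

Before getting to the code, here are a few guidelines worth following when scraping any website: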

  1. Iterative: Make your code as iterative and dynamic as possible, and avoid hard-coding static values. This helps when the website changes the number of items on a page while keeping the structure the same.
  2. Compliant with robots.txt and Terms & Conditions: Don’t breach the implied contract, limits, permissions, or prohibitions on web scraping found in the terms and conditions and/or the robots.txt file.
  3. Don’t Overburden the Website: Querying a website excessively interferes with its normal processes and slows down its performance. Make sure your queries aren’t excessive.
  4. Use an API: If a site offers its data through an API, obtain the data that way rather than by scraping (even if there is a fee involved).
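
Guidelines 2 and 3 can be sketched in code. The snippet below is a minimal illustration, not a complete crawling policy: it uses Python's standard `urllib.robotparser` module to check URLs against a set of robots.txt rules, and it pauses between requests. The robots.txt text, the example.com URLs, and the helper names `allowed_urls` and `fetch_politely` are hypothetical and exist only for this example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Return only the URLs that the given robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between calls to limit load."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


candidates = [
    "http://example.com/search/title",
    "http://example.com/private/data",
]
permitted = allowed_urls(candidates, SAMPLE_ROBOTS_TXT)
print(permitted)
```

In a real scraper, `fetch` would be the function that downloads a page, and the delay would be chosen with the site's terms in mind.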

About BeautifulSoup -

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. It is available for Python 2.6+ and Python 3.

Let’s have a look at the basic structure of any website -

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
.
.
.
</body>
</html>

Every page starts and ends with an html tag, which has two components: head and body. The head section contains the title of the page, and the body section holds the page's content.

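import urllib3
from bs4 import BeautifulSoup
url = "enter_url_here"  # placeholder: replace with the page you want to scrape
# Download the raw HTML of the page
ourUrl = urllib3.PoolManager().request('GET', url).data
# Parse it with the lxml parser (requires the lxml package)
soup = BeautifulSoup(ourUrl, "lxml")
# Print the text inside the <title> tag
print(soup.find('title').text)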

The above code prints the title of a website.

An example of accessing a particular HTML tag with specific attributes in BeautifulSoup, say <div class='upper'> ... </div>, using the find function -

htmlEle = soup.find('div', attrs={'class': 'upper'})

Explore the findChildren and findAll functions in the official BeautifulSoup documentation.
Bonus: the findAll function returns a list of all matching HTML elements, whereas find returns only the first match.
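
To make the difference between find and findAll concrete, here is a small sketch that runs on a made-up HTML string instead of a live page. The HTML and the variable names are assumptions for illustration only. It uses Python's built-in "html.parser" so that lxml is not required. Note that current Beautiful Soup 4 releases spell findAll as find_all; the older name is used here to match the rest of this article.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<html><body>
  <div class="upper">first</div>
  <div class="upper">second</div>
  <div class="lower">third</div>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element (or None if nothing matches).
first_match = soup.find('div', attrs={'class': 'upper'})

# findAll returns every matching element as a list-like result.
all_matches = soup.findAll('div', attrs={'class': 'upper'})

print(first_match.text)
print([div.text for div in all_matches])
```

Because find returns None when nothing matches, it is worth checking the result before reading attributes such as .text from it.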

I will be using the following URL template to access the page listing the top 50 movies for a particular year.

url = "http://www.imdb.com/search/title?release_date=" + year + "," + year + "&title_type=feature"
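
The template concatenates strings, so year has to be a string rather than an integer. The sketch below wraps the template in a helper and fills it in for a few years; the function name `build_url` and the three-year range are assumptions chosen to keep the example short.

```python
def build_url(year):
    """Fill the URL template for one release year."""
    year = str(year)  # the template concatenates strings, so convert ints first
    return ("http://www.imdb.com/search/title?release_date="
            + year + "," + year + "&title_type=feature")


# Hypothetical range, chosen only to keep the example short.
urls = [build_url(year) for year in range(2016, 2019)]
for url in urls:
    print(url)
```

Looping over a range like this, instead of typing each URL by hand, is one way to keep the scraper iterative.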

Let’s have a look at the basic HTML structure of the list of movies for the above URL -
(You can view it in the browser by opening the developer tools with Ctrl+Shift+I.)

<div class = "lister-list">
<div class = "lister-item mode-advanced">
.
.
<div class = "lister-item-content">
<h3 class = "lister-item-header">
...
<a href = "/title/..."></a> --> Movie title
...
</h3>
</div>
.
.
</div>
<div class = "lister-item mode-advanced">
.
.
.
</div>
<div class = "lister-item mode-advanced">
.
.
.
</div>
.
.
.
</div>

Code used to scrape the title of each movie for the above HTML structure -

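from tqdm import tqdm  # progress bar for the loop below
movieList = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})
for i, div_item in enumerate(tqdm(movieList), start=1):
    div = div_item.find('div', attrs={'class': 'lister-item-content'})
    header = div.findChildren('h3', attrs={'class': 'lister-item-header'})
    # The first <a> tag inside the header holds the movie title
    title = header[0].findChildren('a')[0].contents[0]
    # Drop any non-ASCII characters from the title before printing
    print(str(i) + '. Movie: ' + str(title.encode('utf-8').decode('ascii', 'ignore')))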

Example output as of May 6, 2018, for the top 50 movies of 2018 -

Most Popular Feature Films Released 2018-01-01 to 2018-12-31 : 
1. Movie: Avengers: Infinity War
2. Movie: Venom
3. Movie: A Quiet Place
4. Movie: Black Panther
5. Movie: I Feel Pretty
6. Movie: Deadpool 2
7. Movie: Ready Player One
8. Movie: Super Troopers 2
9. Movie: Rampage
10. Movie: Den of Thieves
11. Movie: Truth or Dare
12. Movie: Red Sparrow
13. Movie: Pacific Rim: Uprising
14. Movie: Halloween
15. Movie: Jurassic World: Fallen Kingdom
16. Movie: Blockers
17. Movie: Traffik
18. Movie: Isle of Dogs
19. Movie: 12 Strong
20. Movie: Solo: A Star Wars Story
21. Movie: Peter Rabbit
22. Movie: Dude
23. Movie: Maze Runner: The Death Cure
24. Movie: Crazy Rich Asians
25. Movie: Deep Blue Sea 2
26. Movie: The Commuter
27. Movie: The Week Of
28. Movie: Tully
29. Movie: Bohemian Rhapsody
30. Movie: Annihilation
31. Movie: The Equalizer 2
32. Movie: Winchester
33. Movie: Love, Simon
34. Movie: The Predator
35. Movie: Overboard
36. Movie: Aquaman
37. Movie: Bharat Ane Nenu
38. Movie: The Meg
39. Movie: Don't Worry, He Won't Get Far on Foot
40. Movie: Ant-Man and the Wasp
41. Movie: Mile 22
42. Movie: The Guernsey Literary and Potato Peel Pie Society
43. Movie: Hereditary
44. Movie: I Can Only Imagine
45. Movie: The Titan
46. Movie: Incredibles 2
47. Movie: Forever My Girl
48. Movie: Wildling
49. Movie: Life of the Party
50. Movie: Ocean's 8
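
The title-extraction logic can also be tried offline, without querying IMDb. The sketch below runs on a made-up snippet that mirrors the lister-item structure shown earlier. The sample HTML, the placeholder titles "Movie A" and "Movie B", and the function name `extract_titles` are all assumptions for illustration; it also uses Python's built-in "html.parser" rather than lxml.

```python
from bs4 import BeautifulSoup

# Made-up page mirroring the lister-item structure; titles are placeholders.
SAMPLE_HTML = """
<div class="lister-list">
  <div class="lister-item mode-advanced">
    <div class="lister-item-content">
      <h3 class="lister-item-header"><a href="/title/a/">Movie A</a></h3>
    </div>
  </div>
  <div class="lister-item mode-advanced">
    <div class="lister-item-content">
      <h3 class="lister-item-header"><a href="/title/b/">Movie B</a></h3>
    </div>
  </div>
</div>
"""


def extract_titles(html):
    """Return the movie titles found in lister-item blocks, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.findAll('div', attrs={'class': 'lister-item mode-advanced'}):
        header = item.find('h3', attrs={'class': 'lister-item-header'})
        link = header.find('a') if header is not None else None
        if link is not None:  # skip items that lack the expected tags
            titles.append(link.get_text(strip=True))
    return titles


for rank, title in enumerate(extract_titles(SAMPLE_HTML), start=1):
    print(str(rank) + '. Movie: ' + title)
```

Returning a list instead of printing inside the loop makes the result easier to reuse, and the None checks keep the function from failing if an item is missing its header or link.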

To run this program yourself -
1. Clone this repo - https://github.com/nishantsahoo/IMDB_Top50_Scrape

2. Navigate to the project folder and run the following in the command line

python IMDB_Top50_Scraper.py

I hope you found this article useful. :3
Give me a clap, or two, or forty (👏) if you want to read more such stuff from me.