Which movie should I watch?

Scraping Top 50 Movies on IMDb using BeautifulSoup, Python

Nishant Sahoo
4 min read · May 6, 2018


In this article, I will show how to scrape the list of the 50 most popular movies for each year from 1898 to the present, as listed on IMDb in 2018.

Web scraping is data scraping used for extracting data from websites. The legality of web scraping varies across the world. In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear. Always read the terms of use of the website you want to gather information from before scraping it.

Best Practices in Web Scraping -

  1. Iterative: Keep your code as iterative and dynamic as possible, and avoid hard-coding static values. This helps when the website changes the number of items on a page while keeping the same structure.
  2. Compliant with Robots.txt and Terms & Conditions: Don’t breach the implied contract, limits, permits, or prohibitions of web scraping that can be found in the terms and conditions and/or the robots.txt file.
  3. Don’t Overburden the Website: Querying a website excessively interferes with its normal operation and slows down its performance. Make sure your queries aren’t excessive.
  4. Use an API: If a site has the ability to download data via an API, obtain data that way, as opposed to scraping (even if there is a fee involved).
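
Points 2 and 3 above can be sketched in a few lines using only the Python standard library. This is a minimal illustration, not part of the original scraper: the robots.txt content, the function names (is_allowed, polite_fetch_all), and the one-second default delay are all assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # avoid hammering the server
        results.append(fetch(url))
    return results
```

In a real scraper you would load the site's actual robots.txt (RobotFileParser can do this with set_url and read) and pass your own download function as fetch.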

About BeautifulSoup -

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. It is available for Python 2.6+ and Python 3.

Let’s have a look at the basic structure of any website -

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
.
.
.
</body>
</html>

Every page starts and ends with an html tag and has two components: head and body. The head section contains the title of the page. The body section is where the content of the page lies.

Basic usage of BeautifulSoup -

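import urllib3
from bs4 import BeautifulSoup

url = "enter_url_here"
# Download the raw HTML of the page
ourUrl = urllib3.PoolManager().request('GET', url).data
# Parse it (the "lxml" parser requires the lxml package to be installed)
soup = BeautifulSoup(ourUrl, "lxml")
print(soup.find('title').text)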

The above code prints the title of a web page.
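
If you want to try this without hitting a real website, here is a small sketch that parses the sample page structure shown earlier from a string. It uses Python's built-in "html.parser" instead of "lxml" so that no extra parser needs to be installed; the html_doc string and the page_title variable are just illustrative names.

```python
from bs4 import BeautifulSoup

# The sample page from above, stored as a string instead of being downloaded
html_doc = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="upper">Hello</div>
</body>
</html>"""

# "html.parser" ships with Python, so no extra install is required
soup = BeautifulSoup(html_doc, "html.parser")
page_title = soup.find("title").text
print(page_title)  # prints: Page Title
```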

Example code to access a particular HTML tag with specific attributes in BeautifulSoup, say <div class="upper"> … </div>, using the find function -

htmlEle = soup.find('div', attrs={'class': 'upper'})

Explore the findChildren and findAll functions in the official BeautifulSoup documentation.
Bonus: the findAll function returns a list of all HTML elements that match the given tag and attributes.
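
To make the difference between the two concrete, here is a short sketch on a made-up HTML snippet. It uses find_all, which is the current spelling of findAll in bs4; the snippet and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html_doc = """
<div class="upper"><p>first</p><p>second</p></div>
<div class="lower"><p>third</p></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first matching element, or None if nothing matches
upper = soup.find("div", attrs={"class": "upper"})
missing = soup.find("div", attrs={"class": "absent"})

# find_all returns a list of every matching element
paragraphs_in_upper = upper.find_all("p")
all_paragraphs = soup.find_all("p")

texts = [p.text for p in paragraphs_in_upper]
```

Because find can return None, it is worth checking the result before chaining further calls on it.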

I will use the following URL template to access the page listing the top 50 movies for a particular year.

url = "http://www.imdb.com/search/title?release_date=" + year + "," + year + "&title_type=feature"
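
Note that year has to be a string for the concatenation to work. As a small sketch, the template can be wrapped in a helper that accepts either an int or a string; the function name build_year_url and the example year range are my own choices for illustration.

```python
def build_year_url(year):
    """Build the IMDb search URL for feature films released in the given year."""
    year = str(year)  # accept both 2018 and "2018"
    return ("http://www.imdb.com/search/title?release_date="
            + year + "," + year + "&title_type=feature")


# One URL per year, generated in a loop instead of being hard-coded
urls = [build_year_url(y) for y in range(2016, 2019)]
```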

Let’s have a look at the basic HTML structure of the list of movies for the above URL-
(can be viewed on the browser by opening the developer tools; ctrl+shift+i)

<div class = "lister-list">
<div class = "lister-item mode-advanced">
.
.
<div class = "lister-item-content">
<h3 class = "lister-item-header">
...
<a href = "/title/..."></a> --> Movie title
...
</h3>
</div>
.
.
</div>
<div class = "lister-item mode-advanced">
.
.
.
</div>
<div class = "lister-item mode-advanced">
.
.
.
</div>
.
.
.
</div>

Code used to scrape the title of each movie for the above HTML structure -

from tqdm import tqdm  # progress bar for the loop

movieList = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})
for i, div_item in enumerate(tqdm(movieList), start=1):
    div = div_item.find('div', attrs={'class': 'lister-item-content'})
    header = div.findChildren('h3', attrs={'class': 'lister-item-header'})
    # The first <a> inside the header holds the title; non-ASCII characters are dropped
    title = header[0].findChildren('a')[0].contents[0].encode('utf-8').decode('ascii', 'ignore')
    print(str(i) + '. Movie: ' + str(title))
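
Here is the same idea as a self-contained sketch that runs on a small inline sample instead of a live IMDb page. The sample HTML (including its href values and movie names) is made up to mirror the structure above, and the function name extract_titles is hypothetical. Since IMDb can change its markup at any time, treat the class names as assumptions that may need updating.

```python
from bs4 import BeautifulSoup

# Made-up sample mirroring the structure shown above
SAMPLE_HTML = """
<div class="lister-list">
  <div class="lister-item mode-advanced">
    <div class="lister-item-content">
      <h3 class="lister-item-header">
        <span>1.</span>
        <a href="/title/tt0000001/">First Sample Movie</a>
      </h3>
    </div>
  </div>
  <div class="lister-item mode-advanced">
    <div class="lister-item-content">
      <h3 class="lister-item-header">
        <span>2.</span>
        <a href="/title/tt0000002/">Second Sample Movie</a>
      </h3>
    </div>
  </div>
</div>
"""


def extract_titles(html):
    """Return the movie titles found in lister-item blocks, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="lister-item"):
        header = item.find("h3", class_="lister-item-header")
        link = header.find("a") if header else None
        if link is not None:  # skip items that lack the expected structure
            titles.append(link.get_text(strip=True))
    return titles


for number, title in enumerate(extract_titles(SAMPLE_HTML), start=1):
    print(str(number) + ". Movie: " + title)
```

Returning a list instead of printing inside the loop makes it easier to reuse the titles later, for example to write them to a file.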

Example output as of 6th May, 2018 for the top 50 movies of 2018 -

Most Popular Feature Films Released 2018-01-01 to 2018-12-31 : 
1. Movie: Avengers: Infinity War
2. Movie: Venom
3. Movie: A Quiet Place
4. Movie: Black Panther
5. Movie: I Feel Pretty
6. Movie: Deadpool 2
7. Movie: Ready Player One
8. Movie: Super Troopers 2
9. Movie: Rampage
10. Movie: Den of Thieves
11. Movie: Truth or Dare
12. Movie: Red Sparrow
13. Movie: Pacific Rim: Uprising
14. Movie: Halloween
15. Movie: Jurassic World: Fallen Kingdom
16. Movie: Blockers
17. Movie: Traffik
18. Movie: Isle of Dogs
19. Movie: 12 Strong
20. Movie: Solo: A Star Wars Story
21. Movie: Peter Rabbit
22. Movie: Dude
23. Movie: Maze Runner: The Death Cure
24. Movie: Crazy Rich Asians
25. Movie: Deep Blue Sea 2
26. Movie: The Commuter
27. Movie: The Week Of
28. Movie: Tully
29. Movie: Bohemian Rhapsody
30. Movie: Annihilation
31. Movie: The Equalizer 2
32. Movie: Winchester
33. Movie: Love, Simon
34. Movie: The Predator
35. Movie: Overboard
36. Movie: Aquaman
37. Movie: Bharat Ane Nenu
38. Movie: The Meg
39. Movie: Don't Worry, He Won't Get Far on Foot
40. Movie: Ant-Man and the Wasp
41. Movie: Mile 22
42. Movie: The Guernsey Literary and Potato Peel Pie Society
43. Movie: Hereditary
44. Movie: I Can Only Imagine
45. Movie: The Titan
46. Movie: Incredibles 2
47. Movie: Forever My Girl
48. Movie: Wildling
49. Movie: Life of the Party
50. Movie: Ocean's 8

To run this program yourself -
1. Clone this repo - https://github.com/nishantsahoo/IMDB_Top50_Scrape

2. Navigate to the project folder and run the following in the command line

python IMDB_Top50_Scraper.py

I hope you found this article useful. :3
Give me a clap, or two, or forty (👏) if you want to read more such stuff from me.
