Web Scraping Using Python | IMDb Top Box Office Movies

Bhattbhoomi111
3 min read · Jul 20, 2021


Web scraping is the process of extracting information from a website and transforming it into structured data for further analysis. It is also known as web harvesting or web data extraction.

Libraries for web scraping

Python has different libraries for different purposes. We will be using the following:

  1. BeautifulSoup : pulls data out of HTML and XML files. It builds a parse tree from the page source, which lets you extract data hierarchically and in a more readable way.
  2. Requests : lets you send HTTP/1.1 requests from Python. You can add headers, form data, multipart files, and parameters using simple Python data structures such as dictionaries, and read the response data from Python just as easily.
  3. Pandas : a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
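To make the roles of BeautifulSoup and pandas concrete, here is a minimal sketch that runs without any network access. The HTML snippet, the "chart" id, and the "movie" class are made up for illustration and are not real IMDb markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML standing in for a downloaded page (not real IMDb markup)
html = """
<div id="chart">
  <p class="movie">Movie A</p>
  <p class="movie">Movie B</p>
</div>
"""

# BeautifulSoup turns the raw HTML into a tree that can be searched
soup = BeautifulSoup(html, "html.parser")
chart = soup.find("div", {"id": "chart"})

# Collect the text of every <p class="movie"> nested inside the div
titles = [p.text for p in chart.find_all("p", {"class": "movie"})]

# The tree is hierarchical: each tag knows its parent
parent_tag = chart.find("p").parent.name

# pandas arranges the extracted values as a table
df = pd.DataFrame({"Movie Title": titles})
print(df)
```

Requests is left out here on purpose, since it only handles downloading the page; the parsing and tabulating steps work the same on any HTML string.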

To extract data using web scraping with Python, follow these basic steps:

1. Find the URL

In this example, we are going to scrape the IMDb website to extract the movie title, weekend gross, overall gross, and number of weeks for the top box office movies (US). The URL for this page is https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht

2. Inspect the page

Right-click the element you are interested in and click “Inspect”.

3. Find the data you want to extract

In this example I’m going to extract the movie title, weekend gross, overall gross, and number of weeks, which sit in “td” and “span” tags inside each table row.
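Once you have spotted the tag and class names in the Inspect panel, they translate directly into BeautifulSoup calls. The sketch below uses a made-up table row: the class names are the ones used in the code later in this article, while the row layout, title, and amounts are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up row; only the class names come from this article's code
row_html = """
<table><tr>
  <td class="titleColumn"><a href="/title/tt0000000/">Movie A</a></td>
  <td class="ratingColumn">$10.0M</td>
  <td class="ratingColumn"><span class="secondaryInfo">$25.0M</span></td>
  <td class="weeksColumn">2</td>
</tr></table>
"""

soup = BeautifulSoup(row_html, "html.parser")

# find() returns only the first element that matches the tag and class
title = soup.find("td", {"class": "titleColumn"}).text.strip()

# find_all() returns every matching element as a list
amounts = [td.text.strip() for td in soup.find_all("td", {"class": "ratingColumn"})]

print(title, amounts)
```

When two cells share a class, as in this made-up row, find() silently picks the first one, so it is worth checking in the Inspect panel which cell that is.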

4. Write the code

To do this, you can use Google Colab or Jupyter Notebook. I am using Google Colab for this.

Import libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Create empty lists; we will use them later to store the data for each column.

TitleName=[]
Gross=[]
Weekend=[]
Week=[]

Open the URL and download the page content from the website:

url = "https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht"
r = requests.get(url).content
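A request can fail or hang, so it may help to wrap the download in a small function. The following is only a sketch: fetch_html is a hypothetical helper name, and the User-Agent header and 10-second timeout are assumptions, not something the IMDb page is known to require.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as bytes (hypothetical helper)."""
    # Assumption: a browser-like User-Agent; some sites reject the default one
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.content
```

With this helper, the download step would read r = fetch_html(url), and a failed request stops the script with a clear error rather than producing an empty table.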

Using the find and find_all methods of BeautifulSoup, we extract the data from each table row and store it in variables.

soup = BeautifulSoup(r, "html.parser")
# Each <tr> in the table body holds the data for one movie
rows = soup.find("tbody", {"class":""}).find_all("tr")
for row in rows:
    title = row.find("td", {"class":"titleColumn"})
    gross = row.find("span", {"class":"secondaryInfo"})
    weekend = row.find("td", {"class":"ratingColumn"})
    week = row.find("td", {"class":"weeksColumn"})
    # Still inside the loop: append each value to the lists created before,
    # stripping the whitespace around the text
    TitleName.append(title.text.strip())
    Gross.append(gross.text.strip())
    Weekend.append(weekend.text.strip())
    Week.append(week.text.strip())
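The loop above assumes that every row contains every cell. If a cell is missing, find returns None and reading .text fails with an AttributeError. Below is a minimal sketch of one way to guard against that; cell_text is a hypothetical helper, and the two rows are made up (the second one deliberately lacks a weeks cell).

```python
from bs4 import BeautifulSoup

# Made-up rows reusing the class names from this article's code
html = """
<table><tbody>
<tr><td class="titleColumn">Movie A</td><td class="ratingColumn">$10.0M</td><td class="weeksColumn">1</td></tr>
<tr><td class="titleColumn">Movie B</td><td class="ratingColumn">$5.0M</td></tr>
</tbody></table>
"""


def cell_text(row, tag, css_class):
    """Return the stripped text of the first match, or "" if nothing matches."""
    cell = row.find(tag, {"class": css_class})
    return cell.get_text(strip=True) if cell is not None else ""


soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

titles = [cell_text(row, "td", "titleColumn") for row in rows]
weeks = [cell_text(row, "td", "weeksColumn") for row in rows]

print(titles, weeks)
```

Returning an empty string keeps all lists the same length, which matters later because pandas refuses to build a DataFrame from lists of different lengths.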

5. Store the data in a sheet

We store the data in comma-separated values (CSV) format.

df=pd.DataFrame({'Movie Title':TitleName, 'Weekend':Weekend, 'Gross':Gross, 'Week':Week})
df.to_csv('DS-PR1-18IT012.csv', index=False, encoding='utf-8')
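To check that the file was written as expected, you can read it back with pandas. This sketch uses made-up values in place of the scraped lists and writes to a temporary folder; the file name box_office.csv is an assumption chosen for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up values standing in for the scraped lists
df = pd.DataFrame({
    "Movie Title": ["Movie A", "Movie B"],
    "Weekend": ["$10.0M", "$5.0M"],
    "Gross": ["$25.0M", "$5.0M"],
    "Week": ["2", "1"],
})

# Write to a temporary folder so nothing in the working directory is overwritten
path = os.path.join(tempfile.mkdtemp(), "box_office.csv")
df.to_csv(path, index=False, encoding="utf-8")

# Read the file back and inspect its columns and size
check = pd.read_csv(path)
print(list(check.columns))
print(check.shape)
```

Because index=False is passed, the row numbers are not written to the file, so the CSV contains only the four named columns.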

6. Now run the whole code.

All the data are stored as DS-PR1-18IT012.csv in the folder the code runs from.

The whole code is available on GitHub.

Thank You!!
