Web Scraping Made Easy
Create a web scraping project by extracting data from the IMDb website.
Data collection plays an important role in data science. Most commonly we access data in CSV format or via an API, but sometimes the required data is available only as part of a web page. In such cases we can use a technique called web scraping.
Web scraping is an automated method for extracting large amounts of data from websites. It fetches the unstructured data found on web pages and stores it in a structured, organised form.
To make the extracted data easy to retrieve, analyse, and manipulate, it is usually saved in a tabular format such as CSV or XLS.
Getting Started
The web scraping process can be divided into four major parts:
- Reading: Requesting the HTML page and loading its contents
- Parsing: Turning the raw HTML into a structure that is easy to navigate
- Extraction: Pulling the required data out of the web page
- Transformation: Converting the information into the required format
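As a rough sketch of how these four stages fit together, the example below runs them on a tiny hard-coded page instead of a real download. The sample HTML and the function names parse, extract and transform are made up for illustration and are not part of the IMDb project.

```python
import csv
import io

from bs4 import BeautifulSoup

# Reading: a small hard-coded page stands in for the downloaded HTML.
SAMPLE_HTML = """
<ul class="films">
  <li><a href="/film/1">First Film</a> <span>1990</span></li>
  <li><a href="/film/2">Second Film</a> <span>2005</span></li>
</ul>
"""


def parse(html):
    """Parsing: build a navigable tree from the raw HTML string."""
    return BeautifulSoup(html, "html.parser")


def extract(soup):
    """Extraction: collect one dictionary per list item."""
    rows = []
    for item in soup.find_all("li"):
        rows.append({
            "title": item.a.text,
            "year": item.span.text,
            "link": item.a.get("href"),
        })
    return rows


def transform(rows):
    """Transformation: convert the extracted rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year", "link"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract(parse(SAMPLE_HTML))
csv_text = transform(rows)
print(csv_text)
```

The rest of the article follows the same order, with requests doing the reading and pandas doing the transformation.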
IMDb is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. We will scrape a specific web page to list the top 250 rated movies and fetch related information about them.
The URL of the web page is https://www.imdb.com/chart/top/.
You can also find the code on my GitHub to follow along.
Inspect the webpage
A web page that we see on the internet is written in HTML. To know which elements to target in our Python code, we first need to inspect the web page. This can be done as follows:
Open the web page -> right-click -> Inspect.
Python Libraries
We will be using the following libraries:
- Requests: A Python library used to fetch the web page data from the URL of the corresponding page.
- BeautifulSoup: A Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract the data.
We will start by importing these libraries.
# import libraries
import requests
from bs4 import BeautifulSoup
Then specify the URL of the website to be scraped and access the site using the requests library.
# specify url
url = 'https://www.imdb.com/chart/top/'

# package the request, send the request and catch the response
page = requests.get(url)
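Some sites answer with an error status when a request has no browser-like User-Agent header, and a slow server can leave the script hanging. Here is a minimal sketch of a more defensive fetch, assuming a made-up header value and a hypothetical helper named fetch_html.

```python
import requests

# Hypothetical header value; any descriptive User-Agent string could be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; imdb-tutorial-scraper)"}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.imdb.com/chart/top/")
```

Failing early on a bad status code avoids parsing an error page as if it were the movie chart.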
Then parse the HTML of the response using BeautifulSoup, storing the object in the variable 'soup'.
# parse the html text of the response using beautiful soup and store it in the variable 'soup'
soup = BeautifulSoup(page.text, 'html.parser')
You can print the soup variable at this stage, which should show the full parsed HTML of the web page we requested.
print(soup)
Search for HTML elements
After taking a look at the IMDb web page, we’ll extract the following information:
- Movie title
- Release year
- Audience rating
- Ranking
- Movie link
All of the results are contained within a single element with the class lister-list, which we can locate using the find method.
# Locating class
content = soup.find(class_="lister-list")
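Websites change their markup over time, so the class name may not be present when you run the code. In that case find returns None and the later calls fail with an unhelpful error. The sketch below, run on a made-up HTML fragment with a hypothetical helper named find_required, shows one way to check for this.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described in this article.
SAMPLE_HTML = """
<table>
  <tbody class="lister-list">
    <tr><td class="titleColumn">1. <a href="/title/example/">Example Film</a> (1999)</td></tr>
  </tbody>
</table>
"""


def find_required(soup, class_name):
    """Return the first element with the given class or raise a clear error."""
    element = soup.find(class_=class_name)
    if element is None:
        raise ValueError(f"No element with class {class_name!r} found")
    return element


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first match, find_all returns every match as a list.
content = find_required(soup, "lister-list")
cells = content.find_all("td", class_="titleColumn")
print(len(cells))

# A class that does not exist in the page yields None from find.
print(soup.find(class_="no-such-class"))
```

If the check fails on the live page, inspect the page again and update the class names to match what the browser shows.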
We will be creating lists to store our information.
title = [] #List to store the movie title.
link = [] #List to store the movie link.
rank = [] #List to store the movie ranking.
year = [] #List to store the movie year.
rating = [] #List to store the movie rating.
Extracting results
Looking more closely at the content, the title, link, rank and year can all be extracted from the cells with the titleColumn class.
Title and Link:
The title is fetched by extracting the text from the <a> tag. The link comes from the same <a> tag, using the get('href') method.
for tab in content.find_all('td', class_='titleColumn'):
    title.append(tab.find('a').text)
    link.append('https://www.imdb.com' + tab.find('a').get('href'))
Rank and Year:
To extract the rank and year, the text of each titleColumn cell is split on whitespace and stored in a list. The first item gives us the rank and the last item gives us the year. These lines belong inside the same for loop:
    lst = tab.text.split()  # split the cell text on whitespace
    rank.append(lst[0].strip('.'))  # first item is the rank, e.g. '1.'
    year.append(lst[-1].strip('()'))  # last item is the year, e.g. '(1994)'
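To see what the split produces, here is a small worked example on a sample string shaped like the text of one titleColumn cell. The function name split_rank_and_year is only for illustration.

```python
def split_rank_and_year(cell_text):
    """Return (rank, year) from text such as '1. Some Title (1994)'."""
    parts = cell_text.split()        # whitespace and newlines are dropped
    rank = parts[0].strip(".")       # '1.' becomes '1'
    year = parts[-1].strip("()")     # '(1994)' becomes '1994'
    return rank, year


# Sample text with the line breaks and padding typical of table cells.
sample = "\n      1.\n      The Shawshank Redemption\n(1994)\n"

print(sample.split())
# ['1.', 'The', 'Shawshank', 'Redemption', '(1994)']
print(split_rank_and_year(sample))
# ('1', '1994')
```

Both values are still strings at this point, so they need converting before any numeric work.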
Rating:
Lastly, the movie rating is fetched from the cells with the ratingColumn imdbRating class. The rating is enclosed inside a <strong> tag, and we can extract it using the text attribute.
for rate in content.find_all('td', class_='ratingColumn imdbRating'):
    rating.append(rate.strong.text)
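Filling the lists in two separate loops works as long as every movie has a rating. An alternative, sketched below on a made-up two-row table, is to walk the table row by row so that each movie's fields stay together even when a value is missing. The titles, links and ratings in the sample are placeholders, not IMDb data.

```python
from bs4 import BeautifulSoup

# Made-up table using the class names described in this article.
SAMPLE_HTML = """
<table><tbody class="lister-list">
<tr>
  <td class="titleColumn">1. <a href="/title/example-1/">Example Film One</a> <span>(1994)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.0</strong></td>
</tr>
<tr>
  <td class="titleColumn">2. <a href="/title/example-2/">Example Film Two</a> <span>(1972)</span></td>
  <td class="ratingColumn imdbRating"></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for row in soup.find(class_="lister-list").find_all("tr"):
    title_cell = row.find("td", class_="titleColumn")
    rating_cell = row.find("td", class_="imdbRating")
    parts = title_cell.text.split()
    records.append({
        "rank": parts[0].strip("."),
        "title": title_cell.a.text,
        "year": parts[-1].strip("()"),
        # rating_cell.strong is None when the cell has no <strong> tag
        "rating": rating_cell.strong.text if rating_cell.strong else None,
        "link": "https://www.imdb.com" + title_cell.a.get("href"),
    })

for record in records:
    print(record)
```

A list of dictionaries like this can be passed straight to pd.DataFrame, which uses the keys as column names.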
Storing Data
The fetched data can now be stored as a DataFrame using pandas for further manipulation.
# import pandas library
import pandas as pd

# column names
column = ('rank', 'title', 'year', 'rating', 'link')

# creating dataframe
df = pd.DataFrame(list(zip(rank, title, year, rating, link)), columns=column)

# print the first ten rows of the dataframe
df.head(10)
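Everything scraped from the page arrives as text, so the rank, year and rating columns hold strings. Here is a minimal sketch of converting them to numbers, using two placeholder rows in place of the scraped data.

```python
import pandas as pd

# Placeholder rows with the same string values the scraper would produce.
df = pd.DataFrame({
    "rank": ["1", "2"],
    "title": ["Example Film One", "Example Film Two"],
    "year": ["1994", "1972"],
    "rating": ["9.0", "8.5"],
})

# Convert the text columns to numeric types in one call.
df = df.astype({"rank": int, "year": int, "rating": float})

print(df.dtypes)
print(df["rating"].mean())  # 8.75
print(df.sort_values("year")["title"].tolist())
```

With numeric columns in place, sorting, filtering by year, and averaging ratings behave as expected.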
This DataFrame can also be exported to a CSV file using the to_csv method.
# creating csv file
df.to_csv('imdb.csv', index=False, header=True)
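A quick way to confirm the export worked is to read the file back with read_csv. The sketch below does this with placeholder rows and a temporary directory, so it does not touch imdb.csv.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows standing in for the scraped DataFrame.
df = pd.DataFrame({
    "rank": [1, 2],
    "title": ["Example Film One", "Example Film Two"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "imdb_sample.csv")
    # index=False keeps the row index out of the file
    df.to_csv(path, index=False, header=True)
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored.shape)
```

Because index=False was used when writing, the restored DataFrame has exactly the original columns and no extra index column.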
Thanks for reading and I hope you liked this article 😃.
Follow Data Science Community SRM to get regular updates on insightful content.