Web Scraping Made Easy

Create a web scraping project by extracting data from the IMDb website.

Kartik Mohan
DataX Journal
5 min read · May 18, 2020

Data collection plays an important role in data science. Most commonly we access data in CSV format or via an API, but sometimes the required data is available only as part of a web page. In such cases, we can use a technique called web scraping.

Web scraping is an automated method for extracting large amounts of data from websites. It collects unstructured data from web pages and stores it in a structured form. In simple terms, web scraping not only fetches information from websites automatically but also stores it in an organised manner.

To make the scraped data easy to retrieve, analyse, and manipulate, it is usually saved in a format such as CSV or XLS.

Getting Started

The web scraping process can be divided into four major parts:

  • Reading: Fetching the HTML of the page from its URL
  • Parsing: Converting the raw HTML into a structured form that is easy to navigate
  • Extraction: Pulling the required data out of the parsed page
  • Transformation: Converting the extracted information into the required format

IMDb is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. We will scrape a specific web page to list the 250 top-rated movies and fetch related information about them.

The URL of the web page is https://www.imdb.com/chart/top/.

You can also find the code on my GitHub to follow along.

Inspect the webpage

A web page that we see on the internet is written in HTML. To know which elements to target in our Python code, we first need to inspect the web page. This can be done as follows:

Open the web page -> right-click -> Inspect.

Inspecting the webpage

Python Libraries

We will be using the following libraries:

  • Requests: A Python library used to fetch the web page data from the URL of the corresponding page.
  • BeautifulSoup: A Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract the data.

We will start by importing these libraries.

Then specify the URL of the website to be scraped and access the site using the requests library.
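A minimal sketch of these first two steps might look like the following. The URL is the one given above; the fetch_page helper, the User-Agent header, and the timeout value are assumptions added for illustration and are not taken from the original code.

```python
import requests
from bs4 import BeautifulSoup  # used in the parsing step that follows

URL = "https://www.imdb.com/chart/top/"

# Assumption: a browser-like User-Agent, since some sites reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_page(url: str = URL, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling fetch_page() returns the page's HTML as a string, or raises an exception if the request fails.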

Next, parse the HTML of the response using BeautifulSoup, storing the resulting object in the variable ‘soup’.

You can print the soup variable at this stage, which should show the full parsed HTML of the web page we requested.
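As a sketch of the parsing step, the snippet below feeds a tiny hand-written HTML string to BeautifulSoup instead of a live response, so it runs without network access. The sample HTML and the choice of Python's built-in html.parser are assumptions; with a real response you would pass response.text instead.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for response.text (the real page is far larger).
sample_html = (
    "<html><head><title>Sample Chart</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the HTML into a navigable tree; "html.parser" ships with Python.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the parsed HTML as an indented string.
print(soup.prettify())

# Individual elements are now reachable as attributes or via find().
page_title = soup.title.get_text()
```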

Search for HTML elements

After taking a look at the IMDb webpage, we’ll extract the following information:

  • Movie title
  • Release year
  • Audience rating
  • Ranking
  • Movie link

All of the results are contained within a single element identified by its class, which we can locate using the find method.

We will be creating lists to store our information.
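A sketch of this step, run against a made-up two-row sample rather than the live page. The container's tag and class name (a tbody with class lister-list) are assumptions chosen for illustration, since the article does not name them, and the movie rows are invented. Check the actual names in your browser's inspector, as the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up sample shaped like the structure described in this article.
SAMPLE_HTML = """
<table>
  <tbody class="lister-list">
    <tr>
      <td class="titleColumn">1. <a href="/title/sample-one/">Sample Movie One</a> <span>(1999)</span></td>
      <td class="ratingColumn imdbRating"><strong>9.0</strong></td>
    </tr>
    <tr>
      <td class="titleColumn">2. <a href="/title/sample-two/">Sample Movie Two</a> <span>(2005)</span></td>
      <td class="ratingColumn imdbRating"><strong>8.8</strong></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching element, or None if nothing matches.
results = soup.find("tbody", class_="lister-list")  # assumed container
rows = results.find_all("tr")  # one <tr> per movie

# One list per field, to be filled while looping over the rows.
titles, links, ranks, years, ratings = [], [], [], [], []
```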

Extracting results

A closer look at the content shows that the title, link, rank, and year can all be extracted from the element with the titleColumn class.

Inspecting titleColumn

Title and Link:

The title is fetched by extracting the text from the <a> tag. The link is obtained by finding the <a> tag and calling its get('href') method.
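For illustration, here is how that could look on a single made-up titleColumn cell. The cell's contents are invented, and joining the relative href with the site's base URL is an extra step assumed here, not something the article describes.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up cell shaped like the titleColumn element described above.
cell_html = (
    '<td class="titleColumn">1. '
    '<a href="/title/sample-one/">Sample Movie One</a> '
    "<span>(1999)</span></td>"
)
soup = BeautifulSoup(cell_html, "html.parser")
cell = soup.find("td", class_="titleColumn")

anchor = cell.find("a")
title = anchor.get_text(strip=True)  # text inside the <a> tag
link = anchor.get("href")            # value of the href attribute

# Assumed extra step: turn the relative link into an absolute URL.
full_link = urljoin("https://www.imdb.com", link)
```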

Rank and Year:

To extract the rank and year, the text of the titleColumn element is split and stored in a list. The first item gives us the rank and the last item gives us the year.
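The split can be sketched as follows on the same kind of made-up cell. Stripping the trailing dot and the parentheses, and converting to integers, are assumptions about how the raw pieces would be cleaned up.

```python
from bs4 import BeautifulSoup

# Made-up cell shaped like the titleColumn element described above.
cell_html = (
    '<td class="titleColumn">1. '
    '<a href="/title/sample-one/">Sample Movie One</a> '
    "<span>(1999)</span></td>"
)
cell = BeautifulSoup(cell_html, "html.parser").find("td", class_="titleColumn")

# Splitting on whitespace gives pieces like "1.", the title words, "(1999)".
parts = cell.get_text().split()

rank = int(parts[0].rstrip("."))   # first item: rank, minus the trailing dot
year = int(parts[-1].strip("()"))  # last item: year, minus the parentheses
```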

Rating:

Lastly, the movie rating is fetched from the element with the ratingColumn imdbRating classes.

Inspecting ratingColumn imdbRating

The movie rating is enclosed inside the <strong> tag. We can extract it using the text attribute.
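A small sketch of the rating lookup, again on a made-up cell. Matching on the single class imdbRating and converting the text to a float are choices made here for illustration.

```python
from bs4 import BeautifulSoup

# Made-up cell shaped like the rating element described above.
cell_html = '<td class="ratingColumn imdbRating"><strong>9.0</strong></td>'
soup = BeautifulSoup(cell_html, "html.parser")

# class_ matches an element that carries this class among its classes.
cell = soup.find("td", class_="imdbRating")

# .text gives the string inside the <strong> tag.
rating = float(cell.find("strong").text)
```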

Storing Data

The fetched data can now be stored as a DataFrame using pandas for further manipulation.

DataFrame
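One way to build such a DataFrame is from a dictionary of the lists filled during extraction. The values and the column names below are made up for the sketch.

```python
import pandas as pd

# Made-up values standing in for the lists filled during extraction.
ranks = [1, 2]
titles = ["Sample Movie One", "Sample Movie Two"]
years = [1999, 2005]
ratings = [9.0, 8.8]
links = ["/title/sample-one/", "/title/sample-two/"]

# Each dictionary key becomes a column; each list supplies that column's rows.
df = pd.DataFrame(
    {
        "rank": ranks,
        "title": titles,
        "year": years,
        "rating": ratings,
        "link": links,
    }
)

print(df.head())
```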

This DataFrame can also be exported to a CSV file using the to_csv method.
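A short sketch of the export, using a small made-up DataFrame. The file name and the temporary directory are hypothetical, and reading the file back is only there to confirm the round trip.

```python
import os
import tempfile

import pandas as pd

# Small made-up DataFrame standing in for the scraped results.
df = pd.DataFrame(
    {
        "rank": [1, 2],
        "title": ["Sample Movie One", "Sample Movie Two"],
        "rating": [9.0, 8.8],
    }
)

# Hypothetical output location; a temp directory keeps the sketch tidy.
csv_path = os.path.join(tempfile.mkdtemp(), "top_movies.csv")

# index=False leaves the DataFrame's row index out of the file.
df.to_csv(csv_path, index=False)

# Read the file back to confirm the data survived the export.
restored = pd.read_csv(csv_path)
```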

Thanks for reading and I hope you liked this article 😃.

Follow Data Science Community SRM to get regular updates on insightful content.
