Data Science 101: Web Scraping

Darshil Patel · Published in Analytics Vidhya · Jul 29, 2021 · 5 min read

Data collection by web scraping with Python

Introduction

Hey there! Data science is only possible with data, and in the real world, it’s not easily available. You have to go after it. That’s why web scraping is very important for data science.

This article will cover how you can leverage Python libraries like Beautiful Soup and pandas to pull relevant information off the web. I’ve added the GitHub repository link at the end of the article for those who want the complete code.

What is Web Scraping?

Unlike the long and tedious process of collecting data manually, web scraping uses automation to gather thousands or even millions of data points in a fraction of the time. It gives you a way to access structured web data programmatically.

The process for web scraping can be broadly categorized into three steps:

  1. Understand and inspect the web page to find the HTML markers associated with the information we want.
  2. Use Beautiful Soup, Selenium, and/or other Python libraries to scrape the HTML page.
  3. Manipulate the scraped data to get it in the form we need.

Libraries Used For Web Scraping

To implement web scraping in Python, these are the main libraries we’ll need —

  1. Beautiful Soup: It is used to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
  2. Pandas: Pandas allows importing data from various file formats such as CSV, JSON, SQL, and Microsoft Excel. It supports data-manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling.
  3. Requests: It allows you to send HTTP/1.1 requests with ease, without having to manually add query strings to your URLs or form-encode your POST data. A minimal sketch of all three working together follows below.
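If you don’t have these installed yet, pip install requests beautifulsoup4 pandas will get all three. Here’s that sketch, using example.com purely as a stand-in page:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch a page, parse its HTML, and store one piece of it in a DataFrame
response = requests.get("https://example.com")
soup = BeautifulSoup(response.content, "html.parser")
df = pd.DataFrame({"page_title": [soup.title.text]})
print(df)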

Let’s start with our example.

Get ready to perform web scraping. Follow the steps to get your data.

Step 1: Select the URL that you want to scrape data from

Here, I am going to scrape the IMDb website’s Most Popular Movies chart and extract each movie’s title, release year, and IMDb rating. The URL for the same is https://www.imdb.com/chart/moviemeter.

Step 2: Inspect the Page

This step involves investigating the web page’s HTML. To see which tags contain our information, right-click on the element in your browser and select “Inspect”. The data is usually nested in tags, so we inspect the page to figure out under which tag the required data sits.

Step 3: Find the data to be extracted

In this example, I am going to extract the title, year of release, and IMDb ratings for the Most Popular Movies. In this case, the relevant information in the table is associated with the tag <td>. Using the tags, our main scraping libraries can target and parse information effectively throughout the scraping process.
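If you’d rather sanity-check the markup from Python than from the browser’s inspector, you can print the first table row. This is just an exploration sketch: treat the class names as a snapshot of the page at the time of writing, since IMDb may change its markup, and some sites reject requests that don’t send a browser-like User-Agent header.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # some sites reject requests without one
content = requests.get("https://www.imdb.com/chart/moviemeter", headers=headers).content
soup = BeautifulSoup(content, "html.parser")

# Print the markup of the chart's first row to see where each field sits
first_row = soup.find("tbody", {"class": "lister-list"}).find("tr")
print(first_row.prettify())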

Step 4: Write the Python code

Let’s start by creating a Python file. Here, I have used Google Colab. You can use any Python IDE.

Import the required Python libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Create empty lists to store the scraped data.

Title = []   # list to store the titles of the most popular movies
Year = []    # list to store the release year of each movie
Rating = []  # list to store the rating of each movie

Now, open the URL and extract the data from the website. Using the Requests library, make a request to the web page and get its HTML. The code below stores the HTML content of our web page in a BeautifulSoup object. Using Beautiful Soup’s find and find_all functions, we can easily store the required information in variables. Lastly, append the data to the previously created empty lists. And you’re done!

url = "https://www.imdb.com/chart/moviemeter"#Make a request to the web page and gets it's HTML
content = requests.get(url).content
#Store the HTML page in 'soup', a BeautifulSoup object
soup = BeautifulSoup(content, "html.parser")
for i in soup.find("tbody", {"class":"lister-list"}).find_all("tr"):
h = i.find("td",{"class":"titleColumn"})
title = h.find("a", href=True)
year = i.find("span",{"class":"secondaryInfo"})
rating = i.find("td",{"class":"ratingColumn imdbRating"})

Title.append(title.text)
Year.append(year.text)
Rating.append(rating.text.strip("\n"))
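The loop above assumes the request succeeded and that every row contains all three elements. As a hedged variant (my own addition, reusing the imports and lists defined above, not part of the original code), you can check the HTTP status and skip rows that don’t match the expected structure:

response = requests.get(url)
response.raise_for_status()  # fail loudly on a 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.content, "html.parser")

for i in soup.find("tbody", {"class": "lister-list"}).find_all("tr"):
    h = i.find("td", {"class": "titleColumn"})
    year = i.find("span", {"class": "secondaryInfo"})
    rating = i.find("td", {"class": "ratingColumn imdbRating"})
    if not (h and year and rating):
        continue  # skip rows that don't match the expected structure
    Title.append(h.find("a", href=True).text)
    Year.append(year.text)
    Rating.append(rating.text.strip("\n"))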

Step 5: Store the scraped data in CSV (comma-separated values) format

Now, using the imported pandas library, create a DataFrame in which the data is stored in a structured way, then export it to the desired file format. Here, I have exported the data in .csv format.

df = pd.DataFrame({'Most Popular Movies': Title, 'Year': Year, 'Rating': Rating})
df.to_csv('IMDb.csv', index=False, encoding='utf-8')

# Read back the data stored in the IMDb.csv file
data = pd.read_csv('IMDb.csv')

Step 6: Run your code

You can view all the scraped data in the IMDb.csv file as shown below.

Scraped data stored in IMDb.csv file
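If you’re working in a notebook like Colab, a quick sanity check is to look at the first few rows and the row count of the DataFrame you just read back (this assumes the data variable from Step 5):

print(data.head())  # first five scraped movies
print(len(data))    # total number of rows scraped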

Conclusion:

We can fetch data from a webpage by using a web-scraping library like Beautiful Soup, Scrapy, etc. Once the scraped data is loaded into a pandas DataFrame, we can apply any of pandas’ data-manipulation functions to it, as in the sketch below.
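For example, the scraped Year and Rating columns come back as strings (the year even keeps its parentheses), so a typical clean-up might look like this. This is my own illustrative addition, assuming the column names from the DataFrame built in Step 5:

# Clean up the scraped columns and sort by rating
data["Year"] = data["Year"].str.strip("()").astype(int)
data["Rating"] = pd.to_numeric(data["Rating"], errors="coerce")  # unrated movies become NaN
print(data.sort_values("Rating", ascending=False).head(10))      # ten highest-rated popular movies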

More about pandas functions here.

More about Beautiful Soup here.

That’s it. Hope you find this blog helpful. Check out the entire code on my GitHub profile.
