Introduction to Web Scraping!

Akash Modak
Developer Student Club, HIT
5 min read · Jul 7, 2019

Hello guys! This is Akash Modak. I’m pursuing my Master’s degree in Computer Application at Heritage Institute of Technology, Kolkata, and I’m also the co-founder of Green Grass Studio (https://www.ggrassstudio.com). Currently, I’m working on Natural Language Processing. Here I’m going to talk about what web scraping is and the basic knowledge needed to implement it using Python and BeautifulSoup.

What is Web Scraping?

Web scraping is the process of constructing an agent which can extract, analyse and organise useful information from the web automatically. So in a nutshell it is the process of extracting data from websites. All of this extraction is carried out by a piece of code called a scraper.

The data that can be extracted using a scraper can be of the following types:

  • Product items
  • Videos
  • Images
  • Text
  • Contact Information (i.e., emails, phone numbers, etc.)

Different components of a Web Scraper and how it works

The different components of a Web Scraper are as follows:

  • Web Crawler module
  • Extractor
  • Data transformation and cleaning module
  • Storage module

The Web Scraper works as follows:

The Web Scraper will first download the contents of the requested web pages in an unstructured format (HTML). Then the scraper will parse the downloaded contents and extract structured data from them. Finally, the scraper will save the extracted data in a format such as plain text, CSV, JSON or a database.
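The three steps above can be sketched end to end with only the Python standard library. This is a minimal illustration and not the scraper built later in this article: the download step is replaced by a hard-coded HTML string so that it runs offline, and the names SAMPLE_HTML, ParagraphExtractor, clean, to_json and to_csv are made up for this example.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Stand-in for the downloaded page (assumption: a real scraper would fetch this).
SAMPLE_HTML = """
<html><body>
  <div id="about">
    <p>  Hello,   I am a   scraper demo. </p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


class ParagraphExtractor(HTMLParser):
    """Extractor: collects the text found inside every <p> tag."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def clean(text):
    """Transformation and cleaning: collapse runs of whitespace."""
    return " ".join(text.split())


def to_json(rows):
    """Storage: serialise the rows as a JSON string."""
    return json.dumps(rows, ensure_ascii=False)


def to_csv(rows):
    """Storage: serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["index", "text"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


extractor = ParagraphExtractor()
extractor.feed(SAMPLE_HTML)
rows = [{"index": i, "text": clean(p)} for i, p in enumerate(extractor.paragraphs)]

print(to_json(rows))
print(to_csv(rows))
```

Each function maps onto one of the components listed earlier, so any single step can be swapped out (for example, writing to a database instead of CSV) without touching the others.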

Understanding the structure of a web page!

Feel free to skip this section if you are already familiar with the structure of a web page.

Guys, it’s very important to know the structure of a web page and the different tags used in HTML before you can scrape it successfully.

The basic structure of an HTML page is:

The basic structure of HTML

The tags in the above structure are explained below:

  • <!DOCTYPE html>: HTML documents must start with a document type declaration.
  • The HTML document itself is contained between <html> and </html>.
  • The visible part of an HTML document is between <body> and </body>.
  • The <div> tag defines a division or a section in an HTML document. This is the tag that most of the time will help you to identify the data which you want to extract from the webpage.
  • HTML headings are defined by <h1> to <h6> tags.
  • HTML paragraphs are defined by the <p> tag.

Some other HTML tags are:

  • <table>, <td>, <tr> tags are used to make tables in a webpage.
  • <ul>, <ol>: ul stands for unordered list and ol stands for an ordered list. Both of these tags are used to create a list on a webpage.
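To see how these tags nest inside one another, the sketch below prints an indented outline of a small page. The HTML string is a made-up example, and the TagOutline class is a hypothetical helper that uses Python’s built-in html.parser module rather than BeautifulSoup, so that it runs without installing anything.

```python
from html.parser import HTMLParser

# A made-up page that uses the tags described above.
PAGE = """<!DOCTYPE html>
<html>
<head><title>Demo</title></head>
<body>
<div id="about">
<h1>About me</h1>
<p>First paragraph.</p>
<ul><li>One</li><li>Two</li></ul>
</div>
</body>
</html>"""


class TagOutline(HTMLParser):
    """Records each opening tag together with its nesting depth.

    Assumes every tag is explicitly closed, as in PAGE.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Show the id attribute when present, since scrapers often search by id.
        tag_id = dict(attrs).get("id")
        label = f"<{tag} id={tag_id!r}>" if tag_id else f"<{tag}>"
        self.lines.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


outline = TagOutline()
outline.feed(PAGE)
print("\n".join(outline.lines))
```

The outline shows the <div> with id "about" sitting inside <body>, with the heading, paragraph and list nested one level deeper. This is the same kind of tree you see in the browser’s Inspect panel.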

Now open the web page you want to scrape in your browser. Here I will scrape my own website (https://akashmodak97.github.io/). To understand the structure of the web page, press CTRL+SHIFT+I, or right-click on the area you want to extract and select Inspect.

Right-click on the web page and select Inspect
This is the structure of the web page

Libraries required for Web Scraping

As Python is an open-source programming language, you may find many libraries that perform the same function. I prefer BeautifulSoup since it is easy and intuitive to work with. In this article, I’ll use two Python modules for scraping data.

  • urllib2: It is a Python module that can be used for fetching URLs. For more details refer to the documentation page.

Note: urllib2 is the name of the module included in Python 2. If you’re using Python 3, use the urllib.request module instead. It is part of the standard library, so it is already installed with Python.

  • BeautifulSoup: It is a tool used for pulling out information from a webpage. You can use it to extract tables, lists, paragraphs, images, etc. from a website.

Note: If you’re using Jupyter Notebook or Spyder IDE through the Anaconda distribution, this library is usually already installed; otherwise follow the installation steps on its documentation page.
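A quick way to check whether the library is available before running the scraper is to ask Python whether the module can be found. The helper name is_installed is an assumption made for this sketch; bs4 is the module name used in the import in the coding section, and the package is published on PyPI as beautifulsoup4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    print("BeautifulSoup is available")
else:
    print("BeautifulSoup is missing; try: pip install beautifulsoup4")
```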

Now let us dive into the coding section!

Capturing data from the web starts by sending a request to the website you want to capture the data from, with the help of the urllib.request library.

Coding section:

from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

url_name = "https://akashmodak97.github.io/"  # store the URL in a variable

try:
    page = urlopen(url_name)  # connects to the web page and downloads its HTML
except URLError:
    print("Error in opening page")
    raise SystemExit(1)  # stop here: there is nothing to parse

soup = BeautifulSoup(page, "html.parser")  # parses the unstructured data (HTML format)

# soup.find finds the portion which you want to scrape.
# The variable "content" will only contain the division whose id is "about".
content = soup.find("div", {"id": "about"})

article = ""

# collect the text of every paragraph inside that division and store it in a variable
for i in content.find_all("p"):
    article = article + " " + i.text
print(article)

# If you want to store the text in a file, execute the next two lines.
# encoding="utf-8" keeps text in languages other than English intact.
with open("filename.txt", "w", encoding="utf-8") as file:
    file.write(article)

A snapshot of the program

PS: Please make sure that your device is connected to the internet while you extract the data from a webpage.
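If the connection is missing or the address is wrong, urlopen raises an exception instead of returning a page. A small helper can turn that into a clear message. This is a sketch under assumptions: the function name fetch_html, the 10-second timeout and the UTF-8 decoding are choices made for this example, not part of the program above.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the page cannot be opened."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as error:
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError is raised for a malformed URL such as one without a scheme.
        print(f"Error in opening page: {error}")
        return None
```

Calling fetch_html(url_name) and passing the returned text to BeautifulSoup is one way to use it. When the result is None, the scraper can stop early instead of failing later while parsing.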

Hope you enjoyed reading and found it useful! Feel free to connect with me on LinkedIn.

That’s all for Web Scraping, thanks for reading. If you enjoyed it then don’t forget to hit that clap button 👏🏻 to help others find it.


Software Engineer @Meta || Co-founder of Green Grass Studio || Mentor @Uplift Project || Former SDE Intern @ IT Dept Govt of West Bengal