Web Scraping in Python

Shriya Gupta
Published in Analytics Vidhya · 4 min read · Sep 23, 2019

Introduction

The world is currently moving towards Data Science and Machine Learning. The fuel these fields require is data, and we obtain data from two main sources:

1) Using API:

APIs are exposed by various websites to allow retrieval of their data. Example: the Facebook Graph API.

2) Web Scraping:

In this technique, a webpage is parsed to extract useful information. It is also known as web harvesting or web data extraction.

Steps Involved in Web scraping:

Step 1:

Install the required third-party libraries. This can be done with the commands below:

pip install requests
pip install html5lib
pip install bs4

Step 2:

Import the libraries into your Python project or notebook.
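A minimal sketch of the imports used throughout this tutorial:

import requests                   # for sending HTTP requests (Step 3)
from bs4 import BeautifulSoup     # for parsing the HTML response (Steps 4 and 5)
import xlwt                       # for writing the results to Excel (Step 6)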

If you are wondering why we need these libraries and where they are used, don't worry: the steps below have you covered.

Step 3:

Send an HTTP request to the URL of the webpage whose data you want to access. The server will respond with the HTML content of the page, which can then be used to extract data.

In Python, sending an HTTP request is done with the requests library.

Let's use an example website to see how this works.
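A minimal sketch of this step; the URL below is a placeholder, not the site used in the original article, so substitute the page you actually want to scrape:

# Hypothetical example URL; replace with the page you want to scrape.
URL = "https://www.example.com/results"

# Send a GET request; the response object holds the page's HTML.
r = requests.get(URL)
print(r.status_code)  # 200 means the request succeeded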

Step 4:

So far things are pretty simple. Now let's make it a bit more interesting by parsing this HTML response with BeautifulSoup. This library is built on top of various parsers such as html5lib, lxml, html.parser, etc.

Let's use it in our code.
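A short sketch, assuming the response object r from the previous step:

# Parse the raw HTML returned by the server using the html5lib parser.
soup = BeautifulSoup(r.content, "html5lib")

# Print a nicely indented view of the parse tree.
print(soup.prettify())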

Here,

r.content : Raw HTML content.

html5lib : Specifying the HTML parser.

soup.prettify(): Gives a visual representation of the parse tree created from the raw HTML content.

Step 5:

Now let's search the tree that we have in our soup variable. If you look carefully, the content we are interested in sits inside a div whose id is results.

We can extract this div with the line below.
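A sketch of that lookup, assuming the page wraps its table in a div with id equal to results, as described above:

# Locate the container div that holds the results table.
results_div = soup.find("div", attrs={"id": "results"})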

Phew! We are now very close to our data. Let's write a loop over this div to find all the td cells, from which we can retrieve the useful information.
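A sketch of that loop; the column names below are hypothetical, since the original table layout is not shown, so rename them to match the table you are scraping:

rows = []  # one dictionary per table row

# Iterate over every row inside the results div.
for tr in results_div.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 2:   # skip header rows or rows without data cells
        continue
    row = {}
    # Hypothetical column names; adjust for the actual table.
    row["name"] = cells[0].text.strip()
    row["value"] = cells[1].text.strip()
    rows.append(row)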

In the above piece of code, we create a dictionary to save all the information present in one row of the table. The nested structure can also be accessed using dot notation (for example, tr.td), and the text inside an HTML element is accessed with .text.

Step 6:

Final step! Dumping this data into Excel. This involves a few sub-steps (a full sketch follows the list):

1. Import the Excel writer (can be installed using pip install xlwt)

2. Create an Excel workbook with one sheet

3. Write the header information to the sheet

4. Write the information extracted from the webpage

5. Save the Excel file
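A minimal sketch using xlwt, assuming the rows list built in Step 5 and the same hypothetical column names:

# 1. and 2. Create a workbook with a single sheet.
workbook = xlwt.Workbook()
sheet = workbook.add_sheet("Sheet 1")

# 3. Write the header row. Column names are placeholders; match your table.
headers = ["name", "value"]
for col, header in enumerate(headers):
    sheet.write(0, col, header)

# 4. Write one Excel row per scraped dictionary.
for row_index, row in enumerate(rows, start=1):
    for col, header in enumerate(headers):
        sheet.write(row_index, col, row[header])

# 5. Save the workbook (xlwt writes the legacy .xls format).
workbook.save("scraped_data.xls")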

Now let's look at how our Excel file turns out.

Congratulations! You have successfully scraped a webpage and extracted its useful contents. You can try the same approach on other webpages.

Can I scrape any website?

You can scrape pretty much any website, but most websites implement blocks to prevent their content from being scraped. You can find out whether such restrictions exist by checking the site's robots.txt file or its terms of service. However, these rules are by no means enforced by law.
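As a side note, Python's standard library can check a robots.txt file for you; here is a small sketch (the URL is again a placeholder):

from urllib.robotparser import RobotFileParser

# Hypothetical site; point this at the robots.txt of the site you want to scrape.
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

# True if the rules allow a generic crawler to fetch the given page.
print(parser.can_fetch("*", "https://www.example.com/results"))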
