Scrape the web — It’s fun.

However, everything fun comes with its own restrictions, and so does web scraping. Hence, always confirm beforehand whether web scraping is allowed for a particular website.

Anju Rajbangshi
Analytics Vidhya
May 9, 2020 · 8 min read


Why do we need web scraping?

Well, in today’s world, we live in a pool of data. Imagine if we could understand all the hidden patterns and come up with useful insights; every business would be booming.

In my opinion, it is a give-and-take policy. Customers get what they want, and businesses make a profit out of it. Does it sound simple? Well, it would be if we had easy access to properly structured and clean data to analyze, in order to come up with creative abstractions. But this isn’t the case in real scenarios.

The first requirement in a data science project, whether it is machine learning or artificial intelligence, is the collection of data. Wouldn’t life be easier for a data scientist if we received structured data? But how do we make our life easier?

By collecting the data right! But from databases? From Excel files? From text files? From websites?

In this article, we are going to discuss one of the ways to gather that information.

Let’s understand this with an example.

Suppose we have a scenario where our machine learning model is performing well neither on the training data nor on the test data. This is termed underfitting in machine learning. And what is one of the easiest and most effective ways to address it? Isn’t it having enough data to train our model? So, what can rescue us from such unfortunate times?

Well, web scraping can come as a saviour, helping us with the data-gathering part.

Photo by Franki Chamaki on Unsplash

Now, coming to the topic: most of us know that various sites like Instagram, Twitter, Facebook, etc. have specific APIs which help developers easily access the different types of data they are looking for by performing a few easy initial steps. However, I will leave this part for my readers to explore.

So now comes the question: why do we need to scrape websites if they have APIs and if there are lots of easy ways to collect the required information? It is because not all websites provide us with such a facility, thereby raising the dire need for web scraping.

Wikipedia’s definition of web scraping:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Also, why can’t we use the data directly from the website?

A wonderful explanation of web scraping I came across answers the above question:

Web scraping is a tool for turning the unstructured data on the web into machine-readable, structured data which is ready for analysis.

In the data science domain, simple Python scripts and libraries, along with a basic idea of HTML, can help us achieve our goal.

There are various tools and frameworks like Selenium and Scrapy, and also multiple Python libraries such as urllib, for web scraping. However, as the name suggests, we have an amazing parsing library in Python called BeautifulSoup, which I will be using for my task.

We have lots of tutorials on the web to educate ourselves on any topic; however, the underlying skill lies in how we optimize that knowledge and use it for our own needs. Hence, let’s get into practice.

Let’s list out the various steps involved in web scraping:

Image Source: Quora.com

1. The first and most important step is making an HTTP request to the website we want to scrape and getting back the response.

2. Once the response is received, we feed the downloaded content to an HTML parser like BeautifulSoup and extract the information we need.

3. Now that we have the required data, we need to save it somewhere. Hence, the third and easiest step is to store these inputs in a CSV/Excel file, or we can even load them into a compatible database.

Now, since I love food, as many of us do, in this article I am going to share a simple web scraping experience I really enjoyed: gathering details like names, addresses, prices, and available offers for the top 10 restaurants in Bangalore.

But before that: is web scraping legal? How do we know if we are actually allowed to extract data from a particular website?

Well, we can verify this by checking the robots.txt file of the website. If you see something like “User-agent: *” followed by “Allow: /”, then we are good to go. More details on this are available in a blog I found on a Stack Exchange page.
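As a quick sanity check, Python’s built-in urllib.robotparser module can read a site’s robots.txt and tell us whether a given path may be fetched. A minimal sketch, assuming a placeholder site (example.com); substitute the site you actually want to scrape:

    from urllib import robotparser

    # Point the parser at the site's robots.txt (example.com is a placeholder).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    # True if the rules allow a generic crawler ("*") to fetch this page.
    print(rp.can_fetch("*", "https://www.example.com/restaurants"))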

Finally, we are going to implement the process step by step:

  • First we will import the required libraries like pandas, requests and BeautifulSoup.
  • Next we will get the URL of the website we want to scrape and set the headers.
  • More information regarding headers can be found in the links below.

https://www.whoishostingthis.com/tools/user-agent/

http://go-colly.org/articles/scraping_related_http_headers/

  • We all know that whenever we type a website address into any browser, an HTTP request is sent to access the website, which will then display the desired webpage if the request is successful. Similarly, here we will send an HTTP GET request to the URL to download the HTML content, using the Requests library of Python (see the sketch after this list).
  • Then we use the parsing library BeautifulSoup to fetch and parse the downloaded data. Here I specifically want to mention the prettify() function. We can also print the downloaded content without using it, but applying prettify() really makes the data look neat, more readable, and understandable.
  • Now we first need to inspect the page we want to scrape in any browser, with the help of developer tools, and try to understand the structure of the HTML for the section we are looking for. Then we analyze the HTML in our notebook, extract the required information, and store it in CSV/a database/JSON, etc.
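Here is a minimal sketch of those first two steps. The URL is a placeholder and the User-Agent string is just a generic browser value; the real code in the notebook linked at the end targets the actual restaurant listing page:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd  # used later when storing the data

    # Placeholder URL; replace with the listing page you are allowed to scrape.
    url = "https://www.example.com/bangalore-restaurants"

    # A browser-like User-Agent header so the site serves the normal page.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    # Send the HTTP GET request and download the HTML content.
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # raise an error if the request failed

    # Feed the downloaded HTML to BeautifulSoup for parsing.
    soup = BeautifulSoup(response.text, "html.parser")

    # prettify() re-indents the parse tree, making the HTML easier to read.
    print(soup.prettify()[:1000])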

Since I want details like restaurant names, addresses, prices, and available offers, I will look for the HTML tags which store the information I need.
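The code for this step appears as a screenshot in the original post; based on the description below, it amounts to something like:

    # Find every restaurant card on the page by its div class
    # (the class name comes from inspecting the page in developer tools).
    rest_details = soup.find_all("div", class_="restnt-main-wrap clearfix")

    # Verify that the expected number of restaurants was fetched.
    print(len(rest_details))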

The above code indicates that all the HTML div tags with class equal to “restnt-main-wrap clearfix”, which contain all the restaurant information, will be returned and saved in ‘rest_details’. I also checked the length of ‘rest_details’ to verify that the correct count of the data I need was fetched.

And to extract further information, we will access each restaurant’s content using a loop.
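Again, the original shows this step as a screenshot; a sketch of the loop could look like:

    # Access the content of each restaurant card using a loop; printing
    # the first card confirms we get only the HTML we need.
    for restaurant in rest_details:
        print(restaurant.prettify())
        break  # remove this break to walk through every card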

The above code extracts only the HTML content I need.

Hence, we will retrieve the required information, which is all stored under different HTML tags and classes.

For example:

  • Restaurant name is stored under the anchor tag <a></a>.
  • Restaurant address is stored under the HTML div tag with class equal to “restnt-loc ellipsis”.

Similarly for the remaining information. Now, after analyzing the above HTML content, we can write the code below:

# Extracting the restaurant details:
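The extraction code is a screenshot in the original post. A minimal reconstruction, assuming the tags confirmed above; the price and offer class names here are placeholders, so inspect the page for the real ones:

    for restaurant in rest_details:
        # Restaurant name sits inside the anchor tag of each card.
        name = restaurant.find("a").text.strip()

        # Address sits in a div with class "restnt-loc ellipsis".
        address = restaurant.find("div", class_="restnt-loc ellipsis").text.strip()

        # Hypothetical class names for price and offer; adjust after
        # inspecting the page's HTML in developer tools.
        price_tag = restaurant.find("div", class_="restnt-price")
        price = price_tag.text.strip() if price_tag else "N/A"

        offer_tag = restaurant.find("div", class_="restnt-offer")
        offer = offer_tag.text.strip() if offer_tag else "N/A"

        print(name, "|", address, "|", price, "|", offer)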

Output: Finally, we are successful in retrieving the details.

Now coming to the most important part — Storing the data.

First, I created a list ‘restaurant_details’ and then initialized a dictionary ‘dataframe’. Finally, I stored the dictionary in the list.
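A sketch of that storage step, reusing the extraction from above; the variable names ‘restaurant_details’ and ‘dataframe’ come from the text, while the column names are assumptions:

    import pandas as pd

    restaurant_details = []

    for restaurant in rest_details:
        name = restaurant.find("a").text.strip()
        address = restaurant.find("div", class_="restnt-loc ellipsis").text.strip()

        # One dictionary per restaurant; its keys become DataFrame columns.
        dataframe = {"Name": name, "Address": address}
        restaurant_details.append(dataframe)

    # A list of dictionaries converts directly into a pandas DataFrame.
    df = pd.DataFrame(restaurant_details)
    print(df.head())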

Sample df output

Finally, we will write the required information to a CSV file, which will be saved under the ‘Users’ folder on the C drive.
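With pandas, that final step is a one-liner; the filename below is a placeholder, only the C:\Users location comes from the text:

    # Write the DataFrame to a CSV file; the filename is a placeholder.
    df.to_csv(r"C:\Users\restaurant_details.csv", index=False)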

Ultimately, we have the data saved in a structured format in a CSV file, ready for our analysis.

The complete code for the above is available on GitHub for reference: https://github.com/anjuraj-ops/Projects-in-data-science/blob/master/web%20scrape%20restaurants%20details.ipynb

I have also done a project extracting details like “Natural plant’s name” under sale and their “prices” from another website. The code for this is available on GitHub as well: https://github.com/anjuraj-ops/Projects-in-data-science/blob/master/web%20scrape%20plants.ipynb

Last but not least, web scraping is fun. Stay tuned as I delve into more complex web scraping tasks like image scraping, scraping multiple web pages, multiple URLs, etc.

Let us all learn together. Happy Learning !!

And as always, I am open to any feedback.
