Web Scraping with Visualization and Analysis in Python

I am up-skilling myself through a Data Science course, and here is my first blog, on web scraping.

Problem Statement: Scrape a product of your choice, clean the data, and generate visualizations that help you analyse the options and make a purchase decision within a budget.

Let us first understand what web scraping is:

Web scraping is also called web crawling, screen scraping, web data extraction, or web harvesting. It is the process of retrieving, or "scraping," unstructured data from websites. It uses automation to retrieve hundreds, thousands, or even millions of data points from websites, which makes your analysis much easier.

For example: if you want to purchase a mobile within your budget, you would search shopping sites like Flipkart or Snapdeal, copy and paste the data manually into an Excel sheet, and analyse it to make a decision.

With web scraping, you can extract that data automatically in a fraction of the time, turning the unstructured format into a structured one that lets you analyse and decide quickly.

Parts of Web Scraping:

Web scraping has two parts: 1. a web crawler, which navigates the site as per your instructions to find the data, and 2. a scraper, which copies/extracts the data from the pages it visits.

How web scraping works:

It works like a bot that performs the activities on behalf of a human: it searches for products on the website, crawls all the result pages, and extracts the unstructured data into whatever format you would like, such as a DataFrame, an Excel sheet, or a CSV file, for later use.

Multiple libraries are used for data extraction, such as Scrapy and Beautiful Soup. The most frequently used is Beautiful Soup, which is what I will be using. Here is a little more detail about it.

Beautiful Soup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

I have chosen to search for mobiles on the Flipkart website, and here are the steps:

Steps used in scraping the data from the Website:

First, find the URL of the site you would like to search.


Import all the Python packages required.
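The original post showed the imports as an image; a typical set for this workflow, assuming requests + Beautiful Soup for scraping and pandas/matplotlib/seaborn for analysis, looks like this:

```python
# Libraries assumed for this walkthrough
import requests                  # fetch the HTML pages
from bs4 import BeautifulSoup    # parse the HTML
import pandas as pd              # structured storage and cleaning
import matplotlib.pyplot as plt  # basic visualization
import seaborn as sns            # statistical plots
```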

Give the product you are looking for as input, along with the budget range.
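A small sketch of this input step (the helper name and prompts are hypothetical; in a notebook the values would come from `input()` calls):

```python
def read_search_criteria(product_name, min_price, max_price):
    """Normalise the product query and validate the budget range."""
    if min_price > max_price:
        raise ValueError("minimum budget cannot exceed maximum budget")
    return product_name.strip().lower(), float(min_price), float(max_price)

# Interactively this would be, e.g.:
#   product, lo, hi = read_search_criteria(input("Product: "),
#                                          float(input("Min budget: ")),
#                                          float(input("Max budget: ")))
product, lo, hi = read_search_criteria(" Mobiles ", 25000, 60000)
```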

Now, let's create the lists for storing the product and its features, along with a DataFrame to hold the entire extract once it is converted from unstructured to structured form.
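As a sketch (the column names are assumptions based on what the post extracts later: name, price, and rating):

```python
import pandas as pd

# Empty lists, one per feature we plan to scrape
products, prices, ratings = [], [], []

# The scraped lists are later combined into a single structured DataFrame
df = pd.DataFrame({"Product": products, "Price": prices, "Rating": ratings})
```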

Create a user agent. For reference: https://pypi.org/project/fake-useragent/
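A minimal sketch using fake-useragent, with a fixed fallback string in case the package is unavailable (the fallback value is an illustrative assumption):

```python
# Randomised browser-like User-Agent header so requests resemble a browser
try:
    from fake_useragent import UserAgent
    headers = {"User-Agent": UserAgent().random}
except Exception:
    # Fallback if fake-useragent is not installed or cannot fetch its data
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
```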

To extract data from multiple pages of the product listing, we are going to use a for loop; the range specifies the number of pages to extract. The data we extract is unstructured, so we create empty lists to store it in a structured form.

Here is the code to crawl and scrape the data from the website. Once the entire data set is extracted from the site, it is stored in the DataFrame.
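The crawling code appeared as an image in the original; here is a sketch of the idea. The CSS class names below are placeholders (Flipkart changes its class names regularly, so inspect the live page and substitute the current ones), and the page-loop function is defined but not called here:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder class names -- inspect the live page and replace these
CARD_CLASS, NAME_CLASS = "card", "product-name"
PRICE_CLASS, RATING_CLASS = "product-price", "product-rating"

def scrape_listing(html):
    """Extract (name, price, rating) tuples from one results page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_=CARD_CLASS):
        name = card.find("div", class_=NAME_CLASS)
        price = card.find("div", class_=PRICE_CLASS)
        rating = card.find("div", class_=RATING_CLASS)
        if name and price:
            rows.append((name.text, price.text,
                         rating.text if rating else None))
    return rows

def crawl(query, headers, pages=10):
    """Fetch each results page and collect the rows into a DataFrame."""
    rows = []
    for page in range(1, pages + 1):  # range controls how many pages
        url = f"https://www.flipkart.com/search?q={query}&page={page}"
        resp = requests.get(url, headers=headers)
        rows.extend(scrape_listing(resp.text))
    return pd.DataFrame(rows, columns=["Product", "Price", "Rating"])
```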

Here is the kind of HTML snippet from which we need to identify the classes for the respective parameters we want to extract:
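The original snippet was a screenshot; an illustrative (hypothetical) shape of one product card, with placeholder class names, is:

```html
<div class="card">
  <div class="product-name">APPLE iPhone 13 (Midnight, 128 GB)</div>
  <div class="product-price">₹52,999</div>
  <div class="product-rating">4.7</div>
</div>
```

In your browser, right-click a product name, price, or rating and choose "Inspect" to read off the real class values.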

I have extracted the product name, its price, and its rating for simplicity. Let's display the data we have extracted.

Now, we need to clean the data for our analysis. Let's trim the currency symbol "₹" and the "," so the column holds only the numeric price. We use a regex after converting the Price column to string, since every column is of type object by default. Once it is trimmed, convert the Price and Rating columns to float, as below:
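As a sketch with illustrative data (real values come from the scraper):

```python
import pandas as pd

# Illustrative scraped values, stored as text with currency marks
df = pd.DataFrame({
    "Product": ["Phone A", "Phone B"],
    "Price": ["₹29,999", "₹52,999"],
    "Rating": ["4.5", "4.7"],
})

# Trim "₹" and "," with a regex, then convert both columns to float
df["Price"] = (df["Price"].astype(str)
               .str.replace(r"[₹,]", "", regex=True)
               .astype(float))
df["Rating"] = df["Rating"].astype(float)
```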

Now let's see the data types after conversion:

Let’s see the data as well after trimming and conversion

Now, we need to filter by the price range that was given as input.
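A self-contained sketch of the filter, using illustrative rows and the budget bounds from earlier:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["A", "B", "C"],
                   "Price": [19999.0, 29999.0, 64999.0],
                   "Rating": [4.2, 4.5, 4.7]})
min_price, max_price = 25000, 60000   # the budget entered earlier

# Keep only the rows whose price falls inside the budget (inclusive)
in_budget = df[df["Price"].between(min_price, max_price)]
```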

The data filtered by price is displayed, but it is difficult to see at a glance which product has the highest price along with its rating. So we sort in descending order on the Price column and then by Rating. This helps us understand the highest-rated mobile at each price point.

Here is the code to sort the data:
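As a sketch with illustrative rows, sorting descending by Price and breaking ties by Rating:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["A", "B", "C"],
                   "Price": [29999.0, 45999.0, 45999.0],
                   "Rating": [4.5, 4.3, 4.7]})

# Descending by Price, ties broken by Rating (also descending);
# head(25) keeps the top 25 products for display
top = df.sort_values(["Price", "Rating"], ascending=False).head(25)
```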

Let’s see the result after sorting. I am trying to display top 25 products.

Wow… it is sorted now as per our requirement.

Let us start visualizing the data pictorially, as a picture speaks more than words.

Visualize each mobile against its price.
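A minimal bar-chart sketch with illustrative data (the headless `Agg` backend is used here so the figure renders without a display; in a notebook you would simply call `plt.show()`):

```python
import matplotlib
matplotlib.use("Agg")            # headless backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Product": ["Phone A", "Phone B", "Phone C"],
                   "Price": [29999, 45999, 52999]})

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(df["Product"], df["Price"])   # one bar per mobile
ax.set_xlabel("Product")
ax.set_ylabel("Price (₹)")
ax.set_title("Price per mobile")
ax.tick_params(axis="x", rotation=90)  # long names read better vertically
fig.tight_layout()
```

The same pattern with `df["Rating"]` on the y-axis gives the rating plot in the next step.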

Similarly, let's plot the products against their ratings.

We can see that most of the mobiles are rated above 4.

Now, let us plot rating against price using the seaborn library:
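A scatter-plot sketch with illustrative data, one point per mobile:

```python
import matplotlib
matplotlib.use("Agg")   # headless backend; drop this in a notebook
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"Price": [29999, 45999, 52999],
                   "Rating": [4.4, 4.5, 4.7]})

# Price on the x-axis, rating on the y-axis
ax = sns.scatterplot(data=df, x="Price", y="Rating")
ax.set_title("Rating vs. price within budget")
```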

From the above visualization, we can see that all the mobiles within our budget are rated above 4.4.

Good… now let us see the top 3 mobiles with the highest ratings.
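As a sketch with illustrative rows, this is just a sort by Rating followed by `head(3)`:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["A", "B", "C", "D"],
                   "Price": [29999.0, 45999.0, 52999.0, 39999.0],
                   "Rating": [4.5, 4.3, 4.7, 4.6]})

# Highest-rated mobiles within the budget, top 3
top3 = df.sort_values("Rating", ascending=False).head(3)
```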

Conclusion:

Based on the above analysis, within my budget (25K to 60K) I have found the top 3 mobiles, each with a very good rating.

This helps us make a quick decision on which mobile to purchase.

The same code can be used to scrape any product on the Flipkart website and reach a decision quickly.

I hope this helps you write the code and do the detailed visualization and analysis yourself.

I will share a few more blogs from my learning soon… :)