Data Science: Web Scraping with EDA to extract product data from eCommerce sites
What is Web Scraping?
Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web.
In simple words, web scraping refers to the extraction of data from a website. This information is collected and then exported into a format that is more useful for the user.
Whether you are a data scientist, an engineer, or anyone who analyzes large datasets, the ability to scrape data from the web is a useful skill to have.
Why and where can we use web scraping?
Here are some of the fields that web scraping is transforming with its applications:
• Job Boards
• Marketing & Sales
• Data Journalism
But why would someone need to collect such large amounts of data from websites?
• Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products.
• Email address gathering: Many companies that use email as a medium for marketing use web scraping to collect email IDs and then send bulk emails.
• Social Media Scraping: Web scraping is used to collect data from Social Media websites such as Twitter to find out what’s trending.
• Research and Development: Web scraping is used to collect a large set of data (Statistics, General Information, Temperature, etc.) from websites, which are analyzed and used to carry out Surveys or for R&D.
• Job listings: Details regarding job openings and interviews are collected from different websites and then listed in one place so that they are easily accessible to the user.
How does Web Scraping work?
When you run the code for web scraping, a request is sent to the URL that you have specified. In response to the request, the server sends the data and allows you to read the HTML or XML page. The code then parses the HTML or XML page, finds the data, and extracts it.
To extract data using web scraping with Python, you need to follow these basic steps:
- Find the URL that you want to scrape
- Inspect the page
- Find the data you want to extract
- Write the code
- Run the code and extract the data
- Store the data in the required format
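The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact code used for Flipkart later in this post: the function names, the sample HTML, and the choice of the standard-library urllib for sending the request are all assumptions made for the example. The sample page is parsed offline so the sketch runs without a network connection.

```python
import csv
import io
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Send a GET request and return the response body as text."""
    # Some sites reject requests that lack a browser-like User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html: str) -> list[dict]:
    """Parse the HTML and pull out the text and target of every link."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def to_csv(rows: list[dict]) -> str:
    """Store the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page used instead of a live request, so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <a href="/watch-a">Watch A</a>
  <a href="/watch-b">Watch B</a>
  <a>No target</a>
</body></html>
"""

rows = extract_links(SAMPLE_HTML)
print(to_csv(rows))
```

For a real page you would call fetch_html(url) first and pass its result to extract_links, provided the site's terms allow scraping.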
Now let us see how to extract data from the Flipkart website using Python
Understanding e-commerce sites product data
We need to understand what we are going to extract. For demonstration purposes, we will use the Flipkart website. Note the fields we need to extract:
1) Product URL
3) Product Name
5) Average star rating
Ways to scrape: We’ll cover basic scraping techniques and frameworks in Python, with some code snippets.
Python is the most popular language for scraping, and it offers several libraries we can use for the job.
Python libraries used here for Web Scraping
• Selenium: Selenium is a web testing library. It is used to automate browser activities.
• BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that are helpful for extracting the data easily.
• Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.
• Matplotlib and Seaborn: These libraries are used for data visualization.
• Multi-page data scraping concept.
The code first opens the URL. We are going to scrape the Flipkart website to extract the price, name, and rating of smart watches.
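Here is a rough sketch of how opening the URL with Selenium might look. The URL pattern (a "q" parameter for the search text and a "page" parameter for the page number) is an assumption about Flipkart's search page and should be checked against the live site. The helper names build_search_url and fetch_page_source are made up for this example, and headless Chrome is just one possible browser choice.

```python
from urllib.parse import urlencode

# Assumption: the search page takes the query in "q" and the page number in
# "page". Verify this against the live site before relying on it.
BASE_URL = "https://www.flipkart.com/search"


def build_search_url(product: str, page: int = 1) -> str:
    """Build the search URL for a product name and a result page number."""
    query = urlencode({"q": product, "page": page})
    return f"{BASE_URL}?{query}"


def fetch_page_source(url: str) -> str:
    """Open the URL in a headless Chrome browser and return the rendered HTML."""
    # Imported here so the URL helper above works even without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


url = build_search_url("smart watch")
print(url)
# html = fetch_page_source(url)  # run this on a machine with Chrome and Selenium
```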
The data we want to extract is nested in ‘div’ tags. So, I find the div tags with the respective class names, extract the data, and store it in a variable.
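As an illustration of finding div tags by class name, here is a small BeautifulSoup sketch. The markup and the class names (product-card, product-name, product-price, product-rating) are hypothetical. Flipkart's real class names have to be read from the page itself and tend to change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and class names; real class names must be read from the
# page with the browser's "Inspect" tool.
SAMPLE_HTML = """
<div class="product-card">
  <div class="product-name">Watch A</div>
  <div class="product-price">₹1,999</div>
  <div class="product-rating">4.2</div>
</div>
<div class="product-card">
  <div class="product-name">Watch B</div>
  <div class="product-price">₹12,499</div>
</div>
"""


def text_or_none(card, class_name):
    """Return the stripped text of the first matching div, or None if absent."""
    tag = card.find("div", class_=class_name)
    return tag.get_text(strip=True) if tag else None


def parse_products(html):
    """Collect name, price and rating for every product card on the page."""
    soup = BeautifulSoup(html, "html.parser")
    products, prices, ratings = [], [], []
    for card in soup.find_all("div", class_="product-card"):
        products.append(text_or_none(card, "product-name"))
        prices.append(text_or_none(card, "product-price"))
        ratings.append(text_or_none(card, "product-rating"))
    return products, prices, ratings


products, prices, ratings = parse_products(SAMPLE_HTML)
print(products, prices, ratings)
```

Appending None when a field is missing keeps the three lists the same length, which matters when they are later combined row by row.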
To extract data from multiple pages of the product listing we’re going to use a for loop. The range will specify the number of pages to be extracted.
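A sketch of the multi-page loop could look like the following. The fetching and parsing steps are passed in as functions so the loop itself stays simple. The names scrape_pages, fake_fetch, and fake_parse are invented for this example, and the fake functions only stand in for the real Selenium and BeautifulSoup steps so the code runs offline.

```python
from typing import Callable, Iterable


def scrape_pages(
    fetch: Callable[[int], str],
    parse: Callable[[str], Iterable[dict]],
    pages: int,
) -> list[dict]:
    """Fetch and parse result pages 1..pages, returning all rows in order."""
    rows: list[dict] = []
    for page in range(1, pages + 1):  # range() sets how many pages are visited
        html = fetch(page)
        rows.extend(parse(html))
    return rows


# Hypothetical stand-ins so the sketch runs offline; in practice `fetch` would
# load the listing page in a browser and `parse` would extract the div tags.
def fake_fetch(page: int) -> str:
    return f"watch-{page}a,watch-{page}b"


def fake_parse(html: str) -> list[dict]:
    return [{"name": name} for name in html.split(",")]


all_rows = scrape_pages(fake_fetch, fake_parse, pages=3)
print(len(all_rows), "products collected")
```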
As mentioned above, data is usually nested in tags. So, we inspect the page to see under which tag the data we want to scrape is nested. To inspect the page, just right-click on the element and click on “Inspect”.
As input, we gave “smart watch” as the product name.
Here we use the zip function to combine the product, rating, and price lists into a single DataFrame.
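A minimal sketch of the zip step is shown below. The three lists contain made-up values standing in for the scraped data, and the column names are an assumption.

```python
import pandas as pd

# Hypothetical values standing in for the lists filled during scraping.
products = ["Watch A", "Watch B", "Watch C"]
ratings = ["4.2", "3.9", "4.5"]
prices = ["₹1,999", "₹12,499", "₹24,990"]

# zip() pairs the i-th product with the i-th rating and price, so each tuple
# becomes one row. It stops at the shortest list, so the lists should all
# have the same length.
df = pd.DataFrame(
    list(zip(products, ratings, prices)),
    columns=["Product", "Rating", "Price"],
)
print(df)
```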
Here we convert the Rating and Price columns to the float data type so that we can perform mathematical operations on them.
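Scraped values arrive as text, so the conversion usually needs a little cleaning first. The sketch below assumes prices look like “₹1,999” and that some ratings may be missing. Both are assumptions about the scraped data, and the sample values are made up.

```python
import pandas as pd

# Hypothetical raw data as it might look straight after scraping.
df = pd.DataFrame(
    {
        "Product": ["Watch A", "Watch B", "Watch C"],
        "Rating": ["4.2", "3.9", None],
        "Price": ["₹1,999", "₹12,499", "₹24,990"],
    }
)

# Remove everything except digits and the decimal point (currency symbol,
# thousands separators), then convert to float.
df["Price"] = df["Price"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# errors="coerce" turns missing or malformed ratings into NaN instead of failing.
df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")

print(df.dtypes)
print(df["Price"].mean())
```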
Summary statistics of the DataFrame from df.describe()
Visualizing product data with “Price”
Visualizing product data with “Rating”
Visualizing the data using Price Vs Rating.
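The three views above (price, rating, and price against rating) could be drawn along these lines. This is a sketch with made-up data and plain Matplotlib. The original charts may have used Seaborn and different chart types, so treat the plot choices here as assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned data standing in for the scraped DataFrame.
df = pd.DataFrame(
    {
        "Product": ["Watch A", "Watch B", "Watch C", "Watch D"],
        "Rating": [4.2, 3.9, 4.5, 4.0],
        "Price": [1999.0, 12499.0, 24990.0, 3499.0],
    }
)

fig, (ax_price, ax_rating, ax_both) = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of prices across the scraped products.
ax_price.hist(df["Price"], bins=10)
ax_price.set(title="Price distribution", xlabel="Price (INR)", ylabel="Products")

# Distribution of user ratings.
ax_rating.hist(df["Rating"], bins=10)
ax_rating.set(title="Rating distribution", xlabel="Rating", ylabel="Products")

# One point per product: does a higher price go with a higher rating?
ax_both.scatter(df["Price"], df["Rating"])
ax_both.set(title="Price vs Rating", xlabel="Price (INR)", ylabel="Rating")

fig.tight_layout()
fig.savefig("price_vs_rating.png")
```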
Best product based on rating users can buy within budget of INR 1000–25000
My aim here is to use web scraping with Python to find the best smart watch by considering key factors such as a budget of Rs 1,000–25,000 and the user rating. Above are the products that customers can consider.
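One way to shortlist products for this budget is to filter by price and then sort by rating. The sketch below uses made-up data. Breaking ties by the lower price is my own assumption, not something the analysis above prescribes.

```python
import pandas as pd

# Hypothetical cleaned data; the real DataFrame comes from the scraping steps.
df = pd.DataFrame(
    {
        "Product": ["Watch A", "Watch B", "Watch C", "Watch D", "Watch E"],
        "Rating": [4.2, 3.9, 4.5, 4.6, 4.4],
        "Price": [1999.0, 12499.0, 24990.0, 31000.0, 899.0],
    }
)

MIN_PRICE, MAX_PRICE = 1000, 25000  # budget in INR

# between() keeps prices inside the budget, including both end points.
in_budget = df[df["Price"].between(MIN_PRICE, MAX_PRICE)]

# Highest rating first; among equal ratings, the cheaper product first.
best = in_budget.sort_values(
    ["Rating", "Price"], ascending=[False, True]
).reset_index(drop=True)

print(best.head(10))
```

Watch D has the highest rating in this sample but is dropped because it is over budget, and Watch E is dropped for being below the lower limit.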
Thank you all. I hope this blog helps you understand web scraping with EDA. I’d like to grow my readership. Can you help me out by sharing this blog post?