I am up-skilling myself through a Data Science course, and here is my first blog, on Web Scraping.
Problem Statement: Scrape data for any product of your choice, clean it, and generate visualizations that help you analyse the options and decide on a purchase within your budget.
Let us first understand what Web Scraping is:
Web Scraping is also known as Web Crawling, Screen Scraping, Web Data Extraction, Web Harvesting, etc. It is the process of retrieving, or “scraping”, unstructured data from websites. It uses automation to retrieve hundreds, thousands, or even millions of data points from websites, which makes your analysis much easier.
For example: if you want to purchase a mobile within your budget, you search for mobiles on shopping sites like Flipkart, Snapdeal, etc., then copy and paste the data manually into an Excel sheet and analyse it for decision making.
With the help of Web Scraping, you can extract that data automatically in a fraction of the time, converting it from an unstructured format into a structured one that lets you do the analysis and take decisions on time.
Parts of Web Scraping:
Web Scraping has 2 parts: 1. A Web Crawler, which navigates the website as per your instructions to find the data. 2. A Scraper, which copies/extracts the data from the web pages.
How Web Scraping works:
It works like a bot that performs all the activities on behalf of a human: it searches for the products on the website, crawls through all the pages, and extracts the unstructured data into the format you would like to have, such as a DataFrame, Excel sheet, or CSV file, for future use.
Multiple libraries are used for data extraction, such as Scrapy, Beautiful Soup, etc. The most frequently used library is Beautiful Soup, and I will be using the same. Here is more detail about it:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
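To see what this looks like in practice, here is a minimal sketch of Beautiful Soup at work. The HTML snippet and its class names (`name`, `price`) are made up for illustration; a real page's markup will differ:

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded product page
html = """
<div class="product">
  <span class="name">Phone A</span>
  <span class="price">Rs. 9,999</span>
</div>
"""

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree by tag name and class to pull out values
name = soup.find("span", class_="name").text
price = soup.find("span", class_="price").text
print(name, price)  # Phone A Rs. 9,999
```

The same `find` calls work on a full page fetched from the web; only the tag and class names change.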
I have chosen to search for mobiles on the Flipkart website, and here are the steps for the same:
Steps used in scraping the data from the Website:
- Find the URL (site) on which you would like to search.
- Inspect/crawl the website page by page.
- Right-click on the page and click on Inspect, which lets you see the code used for displaying the mobile data on the Flipkart site. The entire page is written in HTML, and you need a basic understanding of HTML to read the code.
- Create a BeautifulSoup object by passing the HTML to it, which lets you navigate the HTML tags on that page.
- Use the findAll function on the BeautifulSoup object to locate the “div” tags in the HTML page that contain all the features of each mobile. These act as containers.
- Now locate each feature using the find function, passing the class name of its tag, and save the results in the respective variables.
- Extract all the values and store them in lists.
- Create a DataFrame to store the scraped data for future analysis.
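The steps above can be sketched end to end as below. The class names (`product-card`, `product-name`, etc.) and the search URL are placeholders, not Flipkart's actual markup; Flipkart's class names change often, so inspect the live page and substitute the real ones:

```python
from bs4 import BeautifulSoup

def scrape_page(html):
    """Parse one search-results page into a list of dicts.

    NOTE: all class names here are illustrative placeholders --
    replace them with the ones you see in your browser's Inspect view.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Each "container" div holds all the features of one mobile
    for card in soup.find_all("div", class_="product-card"):
        rows.append({
            "name": card.find("div", class_="product-name").text,
            "price": card.find("div", class_="product-price").text,
            "rating": card.find("div", class_="product-rating").text,
        })
    return rows

# Putting the pieces together (URL and query are illustrative):
# import requests
# import pandas as pd
# response = requests.get("https://www.flipkart.com/search?q=mobiles")
# df = pd.DataFrame(scrape_page(response.text))   # structured data
# df.to_csv("mobiles.csv", index=False)           # save for analysis
```

Keeping the parsing in its own function makes it easy to loop over multiple result pages and append every page's rows into one DataFrame.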