Begin and get good at web scraping using Python #selenium #PhantomJS
Sometimes in machine learning and other data collection work, we need to do web scraping. For one of my projects, I needed to scrape a chunk of data from a website, so I learnt web scraping from various internet sources and books.
If you are not afraid of code, believe me, it’s easy. You just need to know the basics of the DOM structure of a web page and the different HTML elements.
Before we proceed further, follow this link. It’s a good place for a beginner to start, with everything written step by step.
Following that tutorial, below is working code that scrapes the ‘countries in the world by population 2017’ data from a website.
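A scraper of this kind can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the population figures sit in the first HTML table on the page, and the URL in the closing comment is a hypothetical placeholder. The function names (parse_table, rows_to_csv, scrape_to_csv) are my own for illustration.

```python
import csv
import io


def parse_table(html):
    """Return the rows of the first <table> in `html` as lists of cell text."""
    # Third-party import kept local so the CSV helper below works without it
    # (pip install beautifulsoup4).
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are treated alike.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def scrape_to_csv(url, out_path):
    """Download `url`, extract its first table and save it as a CSV file."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    rows = parse_table(response.text)
    # newline="" keeps the csv module's own line endings untouched.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        handle.write(rows_to_csv(rows))
    return len(rows)


# Example call (hypothetical URL, replace with the page you want to scrape):
# scrape_to_csv("https://example.com/countries-by-population", "countries.csv")
```

Splitting parsing from downloading makes it easy to try parse_table on a saved HTML file before hitting the live site.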
By now you should be familiar with running Python, traversing the HTML DOM structure of a web page, and the BeautifulSoup, csv and Requests libraries. By playing around with the code above, you can scrape data from various other sites too. :)
In the scenario above, the desired data was accessible through a direct URL. Sometimes, to reach the desired data, we need to click and/or type at certain places on a page (e.g. type credentials and submit, or click a button to view details), and sometimes we need to solve a captcha as well.
Let’s see how we can do it using code.
To interact with a web page we need a web driver, and this is where Selenium comes into the picture. The Selenium Python bindings provide a convenient API to access Selenium WebDrivers such as Firefox, IE, Chrome and Remote. We can use the functionality of Selenium WebDriver to interact with a web page.
Follow the Python-Selenium link to install the selenium package.
Selenium WebDriver can interact with various browsers. See some examples here. Every time we use a WebDriver such as Firefox, IE or Chrome, the corresponding browser window opens.
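To make the idea concrete, here is a small sketch of starting a browser through Selenium and reading a page title. It assumes the selenium package is installed along with the browser itself and its driver binary; make_driver and page_title are illustrative helper names of mine, not part of Selenium.

```python
SUPPORTED_BROWSERS = ("firefox", "chrome")


def make_driver(browser="firefox"):
    """Start a real, visible browser window controlled through Selenium."""
    if browser not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")
    from selenium import webdriver  # third-party: pip install selenium

    driver_class = {"firefox": webdriver.Firefox, "chrome": webdriver.Chrome}[browser]
    return driver_class()  # this is the call that opens the browser window


def page_title(driver, url):
    """Load `url` in the given driver and return the page's <title> text."""
    driver.get(url)
    return driver.title


# Example usage (opens a Firefox window, then closes it):
# driver = make_driver("firefox")
# try:
#     print(page_title(driver, "https://example.com"))
# finally:
#     driver.quit()
```

Calling driver.quit() in a finally block matters, because otherwise a failed script leaves browser windows open.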
What if we could interact with our web page without opening any browser window? The answer to this is PhantomJS. Unlike other browsers, PhantomJS is a scripted headless browser. It is widely used for automating web page interaction, screen capture, and website testing. Download PhantomJS.
Now we have a web driver with a virtual browser. Let’s see an example:
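A login-and-screenshot flow along these lines might look like the sketch below. Two assumptions to flag loudly: first, the form field names ("username", "password") and the URL are hypothetical and must be replaced with those of the real page; second, the headless driver here is Firefox rather than PhantomJS, because webdriver.PhantomJS() is only available in older Selenium releases (it was dropped in Selenium 4). With an older Selenium you could pass a PhantomJS driver to the same function instead.

```python
def make_headless_driver():
    """Start Firefox without a visible window (a stand-in for PhantomJS)."""
    from selenium import webdriver  # third-party: pip install selenium

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def login_and_screenshot(driver, url, username, password, shot_path):
    """Fill in a login form, submit it and save a screenshot of the result."""
    driver.get(url)
    # "name" is the locator strategy that Selenium also exposes as By.NAME.
    # The field names are hypothetical; inspect the real form's HTML.
    user_field = driver.find_element("name", "username")
    pass_field = driver.find_element("name", "password")
    user_field.send_keys(username)
    pass_field.send_keys(password)
    pass_field.submit()  # submits the form that contains this field
    # save_screenshot writes a PNG and returns True on success.
    return driver.save_screenshot(shot_path)


# Example usage (hypothetical URL and credentials):
# driver = make_headless_driver()
# try:
#     login_and_screenshot(driver, "https://example.com/login",
#                          "my_user", "my_password", "after_login.png")
# finally:
#     driver.quit()
```

On pages that load slowly after the form is submitted, an explicit wait before the screenshot may be needed so the image shows the logged-in page rather than the form.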
This was an example where we logged in by passing a username and password and took a screenshot afterwards.
In my next article, we will scrape data from the “National Voter’s Service Portal”. You can see that it has a mandatory name field and a captcha image, so we will also try to solve that captcha by doing OCR with the pytesseract library.
If you have any queries, feel free to ask. #HappyWebScraping
Note: To my knowledge, there is no strict general policy on the legality of web scraping; it is up to each site, so check the site’s legal and privacy policies before you scrape its information.