Selenium - For Data Scraping
There still exist websites without any APIs, and scraping data from such sites can be very manual and time-consuming. I created samples for the open-event-app generator. One of the samples I created was for AllofHands Hawaii 2016, a site that didn't have any API to enable easy data scraping.
How do we find out if a website is using an API or not?
Using Google Chrome, go to View → Developer → Developer Tools. Under Network → XHR, look for an API endpoint with a bit of trial and error. (XHR stands for XMLHttpRequest.)
However, what if no API is being used on the site? How would you scrape data in that case? Would you manually click every hyperlink on the site and visit every page, copying and pasting the data by hand? Could there be someone doing that manual job for you? Or better, could there be "something" doing that job for you? Yes: it's Selenium.
SELENIUM- WEB BROWSER AUTOMATION
Selenium is a tool that automates the task of browsing the web. Although it is technically meant for web testing, there is no restriction on its utility.
Let's get started with the basics of Selenium:
- Run the following command to install the Python bindings:
pip install selenium
- Selenium requires drivers to run, and different browsers use different drivers; choose the appropriate one for your browser. Some common pairings: Chrome uses ChromeDriver, Firefox uses GeckoDriver, Edge uses Microsoft Edge Driver, and Safari ships with safaridriver.
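As a minimal sketch of the setup in Java (assuming Chrome with a downloaded ChromeDriver binary; the file path below is a placeholder, not from the original post), wiring the driver into Selenium looks like this:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DriverSetup {
    public static void main(String[] args) {
        // Tell Selenium where the downloaded ChromeDriver binary lives
        // (placeholder path -- substitute the real location on your machine).
        System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver");

        WebDriver driver = new ChromeDriver(); // launches a Chrome window
        driver.quit();                         // and closes it again
    }
}
```

Newer Selenium releases can also locate the driver automatically if it is on your PATH, in which case the `setProperty` call is unnecessary.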
BASIC SELENIUM FUNCTIONALITIES:
Visit a page (using get()):
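A minimal sketch (the URL is illustrative, not from the original post):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class VisitPage {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        // get() navigates the browser to the URL and blocks until the page loads.
        driver.get("https://www.example.com");
        System.out.println(driver.getTitle()); // the page's <title> text
        driver.quit();
    }
}
```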
Locating various elements on the current webpage:
- BY ID:
WebElement element = driver.findElement(By.id("ui_elementid"));
- BY CLASS NAME:
List<WebElement> cheeses = driver.findElements(By.className("cheese"));
- BY TAG NAME:
WebElement tag = driver.findElement(By.tagName("tag_name"));
- BY CSS:
WebElement cs = driver.findElement(By.cssSelector("#cheese"));
- BY LINK TEXT:
WebElement cheese = driver.findElement(By.linkText("blog"));
- BY XPATH:
List<WebElement> xp = driver.findElements(By.xpath("//input"));
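Putting the locators to work, a hypothetical scrape that gathers every hyperlink on a page could look like this (the URL is illustrative; note that findElements returns an empty list rather than throwing when nothing matches):

```java
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class LinkScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("https://www.example.com");

        // Collect every <a> tag on the page and print its text and target.
        List<WebElement> links = driver.findElements(By.tagName("a"));
        for (WebElement link : links) {
            System.out.println(link.getText() + " -> " + link.getAttribute("href"));
        }
        driver.quit();
    }
}
```

From here, each collected href can itself be passed to get(), which is exactly the tedious click-every-link job Selenium automates.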