Web Crawling [Java][Selenium]

Information Retrieval

Web crawling is one of the most popular mechanisms for gathering information, because today almost everything is available on the internet in many heterogeneous forms. Some web sites hold so much information that it is impractical for a person to capture and summarize it by hand. As a solution, we let machines do the hard work on our behalf.

The problem is that some web sites require a login, or use other verification mechanisms such as asking us to prove that we are human. Most of the time those web sites also require us to maintain a session in order to navigate within them. How do we automate these things?

There are two methods: an ethical one and an unethical one. The ethical approach is to register our crawler with the particular web site, which means we officially request permission to go through its content. For example, Google and Yahoo use their crawlers to go through different web sites and index them for their search engine services. The unethical approach is to put a browser in front of our application and retrieve the information while pretending to be a real user. We can use a Selenium web driver to handle the browser’s interaction with the web site that we need to crawl.

Process Flow

In this tutorial we focus on a Java application that crawls a web site on top of the Selenium library.

You will need the following tools:

In this tutorial I am using the Safari browser, but you can use any browser you prefer. You then have to choose the matching Selenium web driver by changing the dependency.

About Tutorial

For this tutorial we are going to retrieve the information from a table on the following web site. The table lists the different versions of the Selenium Safari web driver.

https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-safari-driver

Selenium Safari Driver Version Table

Setting up the project

Including Dependencies

Create a Maven project and include the following dependencies.

<dependencies>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>3.5.3</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-safari-driver</artifactId>
        <version>3.5.3</version>
    </dependency>
</dependencies>

Choose a Selenium version ≥ 2.0.
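If you follow the earlier note about using a different browser, the driver dependency is the part to swap. As a sketch, assuming Chrome and the same Selenium version as above, the second dependency would become:

```xml
<!-- Assumption: Chrome instead of Safari; replaces the selenium-safari-driver entry -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.5.3</version>
</dependency>
```

In the Java code you would then create a ChromeDriver instead of a SafariDriver. Chrome also needs the separate chromedriver executable to be installed, and the Safari-specific browser settings described next do not apply to it.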

Configure the Browser

To build the connection between the Safari browser and the Selenium web driver on the application side, we have to allow remote access to the browser.

Safari > Preferences > Advanced > Tick “Show develop menu in menu bar”

Now you can see that the “Develop” menu item has been added to your browser’s menu bar. Next, you have to allow remote automation.

Develop > Allow Remote Automation

Implementation

Now let’s write some simple code to check whether everything is working properly. I’m writing a small program that navigates to the relevant web page and prints the page content to the console.

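import org.openqa.selenium.WebDriver;
import org.openqa.selenium.safari.SafariDriver;

public class Crawler {
    public static void main(String[] args) {
        String url = "https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-safari-driver";

        // Start a Safari session (needs "Allow Remote Automation" enabled)
        WebDriver driver = new SafariDriver();
        try {
            // Navigate to URL
            driver.get(url);

            // Read page content
            String content = driver.getPageSource();

            // Print the page content
            System.out.println(content);
        } finally {
            // Close the driver even if one of the steps above fails
            driver.quit();
        }
    }
}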

If your configuration is correct, this prints some HTML content to your console without raising an error. :)

Now let’s move one step forward. We don’t need to print the entire page content to our console; we only need to extract the table content. Use the Web Inspector to find the tag that contains the required information.

Develop > Show Web Inspector

As you move through the tags shown in the inspector window, it highlights the corresponding location on the web page, so you can easily find the tag that contains the information you are looking for.

Web Inspector Window

Right-click on the selected element and copy its XPath. It will look like this:

//*[@id="maincontent"]/div[4]/table

String url = "https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-safari-driver";

WebDriver driver = new SafariDriver();

// Maximize browser window
driver.manage().window().maximize();

// Navigate to URL
driver.get(url);

// Find the table element using xpath
WebElement table = driver.findElement(By.xpath("//*[@id=\"maincontent\"]/div[4]/table"));
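To get a feel for what that XPath expression selects, it can be tried without a browser. The sketch below uses the JDK’s javax.xml.xpath on a small hand-written snippet. The markup in SAMPLE and the names XPathSketch and selectedNodeName are made up for illustration; they are not the real page and not part of Selenium.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XPathSketch {
    // Hand-written, well-formed stand-in for the page (assumption, not the real markup).
    static final String SAMPLE =
        "<html><body><div id=\"maincontent\">"
        + "<div>one</div><div>two</div><div>three</div>"
        + "<div><table><tbody><tr><td>3.5.3</td></tr></tbody></table></div>"
        + "</div></body></html>";

    // Returns the name of the first node the expression selects, or null if nothing matches.
    static String selectedNodeName(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Node node = (Node) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : node.getNodeName();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The fourth <div> child of the element with id "maincontent" holds the table.
        System.out.println(selectedNodeName(SAMPLE, "//*[@id=\"maincontent\"]/div[4]/table")); // prints "table"
    }
}
```

Asking for div[3] instead of div[4] on the same sample selects nothing. That is a reminder that a positional XPath copied from the inspector may stop matching if the page layout changes.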

We have found our table element. Now we have to go through each row of the table and extract information.

Structure of the table

There are two kinds of sections in the table: the table head (<thead>) and the table body (<tbody>). The data sits inside the table body tags, so we have to skip the <thead> tag. Within a <tbody> tag there can be many rows, each holding a different version that belongs to one major version. The structure of a <tr> tag is given below.

Structure of a row

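The implementation for iterating through the rows looks like this.

// Go through each major version (one <tbody> per major version)
List<WebElement> mainVersions = table.findElements(By.tagName("tbody"));

for (WebElement mver : mainVersions) {
    // Go through each version row of this major version
    for (WebElement ver : mver.findElements(By.tagName("tr"))) {
        // The cells of this row are extracted here
    }
}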
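The same nested walk can be tried offline to see why starting from the <tbody> tags leaves the header row out. This is only a sketch using the JDK’s DOM parser on a hand-written table; the sample markup, its values and the names TableWalkSketch and firstCellOfEachBodyRow are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class TableWalkSketch {
    // Invented sample: one header row, then two bodies standing in for two major versions.
    static final String SAMPLE_TABLE =
        "<table>"
        + "<thead><tr><th>Version</th><th>Date</th></tr></thead>"
        + "<tbody><tr><td>3.5.3</td><td>date-a</td></tr><tr><td>3.5.2</td><td>date-b</td></tr></tbody>"
        + "<tbody><tr><td>3.4.0</td><td>date-c</td></tr></tbody>"
        + "</table>";

    // Collects the text of the first <td> of every <tr> that sits inside a <tbody>.
    static List<String> firstCellOfEachBodyRow(String tableXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(tableXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Sample is not well-formed XML", e);
        }
        List<String> versions = new ArrayList<>();
        NodeList bodies = doc.getElementsByTagName("tbody");
        for (int i = 0; i < bodies.getLength(); i++) {
            NodeList rows = ((Element) bodies.item(i)).getElementsByTagName("tr");
            for (int j = 0; j < rows.getLength(); j++) {
                NodeList cells = ((Element) rows.item(j)).getElementsByTagName("td");
                versions.add(cells.item(0).getTextContent());
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        // The header row lives in <thead>, so it never shows up in the result.
        System.out.println(firstCellOfEachBodyRow(SAMPLE_TABLE)); // prints [3.5.3, 3.5.2, 3.4.0]
    }
}
```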

Now we need to extract the Version, Repository, Usages and Date for a particular version. As you can see, each piece of information is wrapped inside a <td> tag. First we go through the <tr> tags to get the relevant row of the table, and then over the <td> tags within it.

You can easily find the tags by their class name and then retrieve the underlying data. Based on the structure, there are some special cases.

  1. The version is stored in an <a> tag that has one of two possible class names, “vbtn release” or “vbtn beta”.
  2. The usage count is stored within an <a> (anchor) tag that does not have a class name.
  3. The date is stored directly within a <td> tag.

We have to take extra precautions to tackle those special cases.

For the first problem we can search for the class name “vbtn release” and, if no such tag exists, look for “vbtn beta”. Alternatively, we can search for elements by tag name and select the relevant index. Going by index is the easier option.

List<WebElement> attributes = ver.findElements(By.tagName("a"));
WebElement version = attributes.get(0);
WebElement repository = attributes.get(1);
WebElement usages = attributes.get(2);
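The class-name route mentioned earlier needs a little care, because “vbtn release” is really two class names separated by a space. As far as I know, Selenium’s By.className rejects such compound names, so a CSS selector such as a.vbtn.release is the usual substitute. Another option is to read the class attribute of a link and inspect its tokens, which the sketch below does in plain Java. The names VersionLinkClassifier, Kind and classify are hypothetical helpers, not Selenium API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class VersionLinkClassifier {
    enum Kind { RELEASE, BETA, OTHER }

    // Classifies a link by the tokens of its class attribute, e.g. "vbtn release".
    static Kind classify(String classAttribute) {
        if (classAttribute == null) {
            return Kind.OTHER; // links without a class, such as the usages link
        }
        Set<String> tokens = new HashSet<>(Arrays.asList(classAttribute.trim().split("\\s+")));
        if (!tokens.contains("vbtn")) {
            return Kind.OTHER;
        }
        if (tokens.contains("release")) {
            return Kind.RELEASE;
        }
        if (tokens.contains("beta")) {
            return Kind.BETA;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("vbtn release")); // prints RELEASE
        System.out.println(classify("vbtn beta"));    // prints BETA
        System.out.println(classify(null));           // prints OTHER
    }
}
```

With Selenium, the string passed to classify would come from calling getAttribute("class") on each anchor element.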

The date comes under the last <td> tag. We can use Guava’s Iterables.getLast to retrieve the last element of the list of <td> tags.

WebElement date = Iterables.getLast(ver.findElements(By.tagName("td")));
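Going by index is short, but it throws an IndexOutOfBoundsException on any row that has fewer links than expected. One way to keep that in check is to copy the texts of a row into a small value class that validates the counts first. This is a sketch under the assumption that the link texts and cell texts have already been read (for example with getText()); VersionRow and its from method are hypothetical names, and the sample values in main are invented.

```java
import java.util.Arrays;
import java.util.List;

final class VersionRow {
    final String version;
    final String repository;
    final String usages;
    final String date;

    private VersionRow(String version, String repository, String usages, String date) {
        this.version = version;
        this.repository = repository;
        this.usages = usages;
        this.date = date;
    }

    // linkTexts: texts of the <a> tags in the row; cellTexts: texts of its <td> tags.
    static VersionRow from(List<String> linkTexts, List<String> cellTexts) {
        if (linkTexts.size() < 3 || cellTexts.isEmpty()) {
            throw new IllegalArgumentException(
                "Expected at least 3 links and 1 cell, got "
                    + linkTexts.size() + " and " + cellTexts.size());
        }
        // Same positions as in the tutorial: links 0..2, then the last cell for the date.
        String date = cellTexts.get(cellTexts.size() - 1);
        return new VersionRow(linkTexts.get(0), linkTexts.get(1), linkTexts.get(2), date);
    }

    @Override
    public String toString() {
        return version + " | " + repository + " | " + usages + " | " + date;
    }

    public static void main(String[] args) {
        // Invented sample values, not real data from the page.
        VersionRow row = from(
            Arrays.asList("3.5.3", "Central", "12"),
            Arrays.asList("3.5.3", "Central", "12", "Sep, 2017"));
        System.out.println(row); // prints 3.5.3 | Central | 12 | Sep, 2017
    }
}
```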

We have completed our code :). Now let’s look at the complete implementation.

The output looks like this:

Output

Why do we need to use a web driver?

  1. Because an HTML parser library like Jsoup cannot load content that changes dynamically, such as content generated by JavaScript.
  2. Sometimes we need to maintain a session in order to crawl for a long period of time.
  3. Some web sites need a user account for access. In that case, we can automate the login through the web driver and then access the web page.

References

  1. https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-safari-driver