Building a Web Scraping API with Java + Spring Boot + Jsoup

Sushain Dilishan
5 min read · Oct 19, 2021


Overview

We will build an API that extracts data from two vehicle-selling websites and scrapes out the ads matching the vehicle model we pass to the API. This kind of API can be consumed from a UI to display the ads from multiple websites in a single place.

To learn more about web scraping, click here.

Tools used to build this API

  • IntelliJ as our IDE of choice
  • Maven 3.0+ as the build tool
  • JDK 1.8+

Getting Started

First, we need to initialize a project with Spring Initializr.

This can be done by visiting http://start.spring.io/

Make sure to select the dependencies below as well:

  • Lombok : a Java library that makes the code cleaner and gets rid of boilerplate code.
  • Spring Web : used to build web applications, including RESTful services, using Spring MVC.

After initializing the project, we will use two third-party libraries: Jsoup and Apache Commons Lang. These dependencies can be added to our pom.xml file:

<dependencies>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.11</version>
    </dependency>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

</dependencies>

Analyzing HTML to Scrape Data

Before starting the implementation of our API, we have to visit https://ikman.lk/ and https://riyasewana.com/ to locate the data we want to scrape from these websites.

We can do this by opening the above-mentioned websites in a browser and inspecting the HTML using the developer tools.

On Chrome, right-click the page and select Inspect.

The result should look like this.

Ikman.lk
Riyasewana.com

After opening the websites, we have to traverse the HTML to identify the DOM elements where the ad list is located. These identified elements will then be used in our Spring Boot project to get the relevant data.

Traversing the ikman.lk HTML, we can see that the list of ads is placed under the class name list--3NxGO.

Then we have to do the same for riyasewana.com, where the ad data is placed under the div with the id content.
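To make this element selection concrete, here is a tiny standalone sketch that parses a hand-written HTML fragment with Jsoup. The id (content) mirrors the real page, but the markup and ad values themselves are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) {
        // Made-up fragment imitating riyasewana.com's "content" div
        String html = "<div id=\"content\">"
                + "<a href=\"/buy/toyota-axio-1\" title=\"Toyota Axio 2015\">Toyota Axio 2015</a>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        // Same calls our service layer will use: select by id, then collect <a> tags
        Element content = doc.getElementById("content");
        for (Element ad : content.getElementsByTag("a")) {
            System.out.println(ad.attr("title") + " -> " + ad.attr("href"));
        }
    }
}
```

The real service will do exactly this, except the Document will come from Jsoup.connect(url).get() instead of a literal string.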

Having identified all this information, let us now build our API to scrape this data!

Implementation

Let's first define our website URLs in the application.yml (or application.properties) file:

website:
  urls: https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query=,https://riyasewana.com/search/
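If you prefer application.properties over YAML, the equivalent entry would be the same key on a single line:

```properties
website.urls=https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query=,https://riyasewana.com/search/
```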

Then let's create a simple model class to map the data we extract from the HTML:

package com.scraper.api.model;

import lombok.Data;

@Data
public class ResponseDTO {
    String title;
    String url;
}

In the above code, we use the @Data annotation to generate getters, setters, equals(), and hashCode() for our attributes.
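The generated equals() and hashCode() matter here: our service layer will collect results into a HashSet, and that is what de-duplicates ads. The sketch below hand-writes the behaviour @Data gives us for free (the field values are made up):

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hand-written equivalent of what Lombok's @Data generates for ResponseDTO
class ResponseDTO {
    String title;
    String url;

    void setTitle(String title) { this.title = title; }
    void setUrl(String url) { this.url = url; }
    String getTitle() { return title; }
    String getUrl() { return url; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ResponseDTO)) return false;
        ResponseDTO other = (ResponseDTO) o;
        return Objects.equals(title, other.title) && Objects.equals(url, other.url);
    }

    @Override
    public int hashCode() {
        return Objects.hash(title, url);
    }
}

public class DedupDemo {
    public static void main(String[] args) {
        Set<ResponseDTO> ads = new HashSet<>();
        ResponseDTO a = new ResponseDTO();
        a.setTitle("Toyota Axio 2015");
        a.setUrl("https://ikman.lk/en/ad/1");
        ResponseDTO b = new ResponseDTO();
        b.setTitle("Toyota Axio 2015");
        b.setUrl("https://ikman.lk/en/ad/1");
        ads.add(a);
        ads.add(b); // same ad scraped twice -> ignored by the Set
        System.out.println(ads.size()); // prints 1
    }
}
```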

Then let's create our service layer and extract the data from the websites. (The class below implements a ScraperService interface, not shown here, that declares the single getVehicleByModel method.)

package com.scraper.api.service;

import com.scraper.api.model.ResponseDTO;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

@Service
public class ScraperServiceImpl implements ScraperService {

    // Reading the comma-separated URLs from the property file into a list
    @Value("#{'${website.urls}'.split(',')}")
    List<String> urls;

    @Override
    public Set<ResponseDTO> getVehicleByModel(String vehicleModel) {
        // Using a Set here to store only unique elements
        Set<ResponseDTO> responseDTOS = new HashSet<>();
        // Traversing through the URLs
        for (String url : urls) {
            if (url.contains("ikman")) {
                // Method to extract data from ikman.lk
                extractDataFromIkman(responseDTOS, url + vehicleModel);
            } else if (url.contains("riyasewana")) {
                // Method to extract data from riyasewana.com
                extractDataFromRiyasewana(responseDTOS, url + vehicleModel);
            }
        }
        return responseDTOS;
    }

    private void extractDataFromRiyasewana(Set<ResponseDTO> responseDTOS, String url) {
        try {
            // Loading the HTML into a Document object
            Document document = Jsoup.connect(url).get();
            // Selecting the element which contains the ad list
            Element element = document.getElementById("content");
            // Getting all the <a> tag elements inside the content div
            Elements elements = element.getElementsByTag("a");
            // Traversing through the elements
            for (Element ads : elements) {
                ResponseDTO responseDTO = new ResponseDTO();
                if (!StringUtils.isEmpty(ads.attr("title"))) {
                    // Mapping data to the model class
                    responseDTO.setTitle(ads.attr("title"));
                    responseDTO.setUrl(ads.attr("href"));
                }
                if (responseDTO.getUrl() != null) responseDTOS.add(responseDTO);
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private void extractDataFromIkman(Set<ResponseDTO> responseDTOS, String url) {
        try {
            // Loading the HTML into a Document object
            Document document = Jsoup.connect(url).get();
            // Selecting the element which contains the ad list
            Element element = document.getElementsByClass("list--3NxGO").first();
            // Getting all the <a> tag elements inside the list--3NxGO class
            Elements elements = element.getElementsByTag("a");
            for (Element ads : elements) {
                ResponseDTO responseDTO = new ResponseDTO();
                if (StringUtils.isNotEmpty(ads.attr("href"))) {
                    // Mapping data to our model class
                    responseDTO.setTitle(ads.attr("title"));
                    responseDTO.setUrl("https://ikman.lk" + ads.attr("href"));
                }
                if (responseDTO.getUrl() != null) responseDTOS.add(responseDTO);
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
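One caveat worth noting: the service appends vehicleModel directly to each URL, so a model containing spaces (say, "land cruiser") would produce an invalid request. A small sketch of how the value could be encoded first, using the standard java.net.URLEncoder (this helper is my addition, not part of the article's code; note that URLEncoder form-encodes spaces as +, which suits query parameters like ikman.lk's query=, while path segments such as riyasewana.com's may prefer %20):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    // Hypothetical helper: encode the model before appending it to the site URL
    static String buildSearchUrl(String baseUrl, String vehicleModel)
            throws UnsupportedEncodingException {
        return baseUrl + URLEncoder.encode(vehicleModel, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildSearchUrl("https://riyasewana.com/search/", "land cruiser"));
        // prints https://riyasewana.com/search/land+cruiser
    }
}
```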

After writing our scraping logic in the service layer, we can now implement a RestController to expose the data from both websites.

package com.scraper.api.controller;

import com.scraper.api.model.ResponseDTO;
import com.scraper.api.service.ScraperService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import java.util.Set;

@RestController
@RequestMapping(path = "/")
public class ScraperController {

    @Autowired
    ScraperService scraperService;

    @GetMapping(path = "/{vehicleModel}")
    public Set<ResponseDTO> getVehicleByModel(@PathVariable String vehicleModel) {
        return scraperService.getVehicleByModel(vehicleModel);
    }
}

And with that, everything is done. Now all we have to do is run the project (for example with `mvn spring-boot:run`, or straight from the IDE) and test the API!

Go to a REST client and call your API, providing a vehicle model.

e.g. : http://localhost:8080/axio
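Since the controller returns a Set of ResponseDTO objects, Spring serializes the response as a JSON array of title/url pairs. An illustrative response (the ad values here are made up):

```json
[
  {
    "title": "Toyota Axio 2015",
    "url": "https://ikman.lk/en/ad/toyota-axio-2015-for-sale-colombo"
  },
  {
    "title": "Toyota Axio 2014",
    "url": "https://riyasewana.com/buy/toyota-axio-2014-colombo"
  }
]
```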

As you can see above, you get all the ad URLs and titles related to the provided vehicle model from both websites.

Endnotes

In this article, you learned how to parse an HTML document with Jsoup and Spring Boot and scrape data from two websites. My next steps would be:

  • Improving the API to support pagination on both websites.
  • Implementing a UI to consume this API.

Source Code

The source code is available on my GitHub.

Thanks for reading this article! Feel free to leave a comment. You can find me on LinkedIn.
