An introduction to Generic/Framework Web Data Extraction

Bhagwati Malav
Published in Hash#Include
Mar 11, 2018

Data extraction plays a vital role in many applications, and web crawling helps you get data from various sources. Several libraries and frameworks are available for this, such as jsoup, Xsoup, Scrapy and HtmlUnit; which one you pick depends on the programming language you want to build the crawler in. jsoup is a Java library that lets you extract, clean and manipulate data from a given HTML document. You can select elements by tag, attribute, attribute value, style, text and so on, and it comes with many traversal methods for moving between HTML elements. You can even use regular expressions to match an element. There are many real-world applications where you need to build a clean collection of a specific entity in order to present proper information about it. For example, your company sells cars, mobile phones and other products of various brands, and the catalogue depends on region and many other factors.
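To make the selection options above concrete, here is a minimal sketch of jsoup selecting by tag and class, by attribute value, and by a regular expression on text. The HTML snippet, its values and the class name JsoupSelectorDemo are made up for illustration; only the jsoup calls themselves (Jsoup.parse, select, first, nextElementSibling, text) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class JsoupSelectorDemo {
    // Hypothetical specification table; the values are made up for illustration.
    static final String HTML =
        "<table id=\"specs\">"
      + "<tr><td class=\"key\">Mileage</td><td class=\"value\">22 kmpl</td></tr>"
      + "<tr><td class=\"key\">Fuel Type</td><td class=\"value\">Petrol</td></tr>"
      + "</table>";

    // Select by tag and class, then count the matches.
    static int countKeys(String html) {
        Document doc = Jsoup.parse(html); // parse(String) treats the argument as HTML
        return doc.select("td.key").size();
    }

    // Select by attribute value: is there a table whose id is "specs"?
    static boolean hasSpecsTable(String html) {
        return !Jsoup.parse(html).select("table[id=specs]").isEmpty();
    }

    // Select by a regular expression on the cell's own text, then move to the next cell.
    static String valueFor(String html, String keyRegex) {
        Document doc = Jsoup.parse(html);
        Element key = doc.select("td:matchesOwn((?i)" + keyRegex + ")").first();
        if (key == null || key.nextElementSibling() == null) {
            return null; // key not found, or it has no following cell
        }
        return key.nextElementSibling().text();
    }

    public static void main(String[] args) {
        System.out.println("key cells   : " + countKeys(HTML));
        System.out.println("specs table : " + hasSpecsTable(HTML));
        System.out.println("fuel        : " + valueFor(HTML, "fuel"));
    }
}
```

The regular expression is matched against the cell's own text, and the (?i) flag makes the match case-insensitive, so "fuel" finds the "Fuel Type" cell.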

Let's say you get a new task: build such a collection. How should you approach the problem? Say you live in India, you come across https://www.cardekho.com/, and you want to build a clean collection of cars based on brand and other required factors. You learnt jsoup in your college days for some exercise, so you are excited to take up the challenge and start writing a crawler for the site. It goes well, and customers are happy with the kind of information they get on your site.

After a few days you start getting feedback that your site is missing a few attributes/specifications, or a few models. You analyse the problem and somehow end up browsing https://www.zigwheels.com/, which shows the missing attributes and a few new car models. You are happy that you have found a solution, but you have to write crawling code again, because the HTML structure of this site is different. Once the zigwheels crawler is done, you are back on track. But what if you get the same feedback again in future, because this site too lacks a few models or attributes? Now you are in a dilemma. You spend a lot of time analysing various sites and find that the HTML structure is different for each one. You discuss this with your colleagues to find the right solution. The crawler needs to be generic, but how?

You go back to reading about extraction libraries, XPath and CSS selectors. The whole point is that you can make extraction generic only if you can find any given attribute uniquely. By uniquely I mean that there should be no dependency on the order of HTML tags such as td, tr or span. You first give XPath a try, but a positional XPath like the one below might not work for you, because it depends on the order of HTML tags. For example:

For the mileage attribute on the product URL https://www.cardekho.com/maruti/swift-2018, you get the XPath below:

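Xsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;
public static String URL = "https://www.cardekho.com/maruti/swift-2018";
// Positional XPath: every index depends on the order of the tags around it
String xpathMileage = "//html/body/main/section[1]/table/tbody/tr/td[1]/table/tbody/tr[2]/td/table/tbody/tr[3]/td[1]";
// Jsoup.parse(String) treats its argument as HTML, so fetch the page with connect(...).get()
Document document = Jsoup.connect(URL).get();

String result = Xsoup.compile(xpathMileage).evaluate(document).get();
System.out.println("Mileage : " + result);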

If you externalise this configuration, life gets easier: when you find a new source tomorrow, you don't need to write any extra code. You just push the configuration to an external store, e.g. a database. It works fine, and you start thinking it will always work. But after a few days you find models that don't have some of the attributes, so the order changes and the same XPath returns the value of a different attribute. You need to avoid this and find another solution. Someone tells you about CSS selectors, and you again do a lot of research to make the crawler generic.
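The ordering problem described above can be reproduced on a small scale. The sketch below is only an illustration: it uses the JDK's built-in javax.xml.xpath on two tiny, well-formed snippets instead of Xsoup on a live page, and the snippets, their values and the class name PositionalXPathDemo are made up. The same positional expression is evaluated against a model that has every attribute and a model where one row is missing.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class PositionalXPathDemo {
    // Hypothetical model page where the Engine row is present.
    static final String FULL_MODEL =
        "<table>"
      + "<tr><td>Engine</td><td>1197 cc</td></tr>"
      + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
      + "</table>";

    // Hypothetical model page without the Engine row, so every later row moves up.
    static final String MODEL_WITHOUT_ENGINE =
        "<table>"
      + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
      + "<tr><td>Fuel Type</td><td>Petrol</td></tr>"
      + "</table>";

    // "Second row, second cell": written while looking at FULL_MODEL.
    static final String POSITIONAL_MILEAGE = "/table/tr[2]/td[2]";

    // Evaluates an XPath expression against an XML string and returns the text result.
    static String evaluate(String expression, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The same configuration entry is applied to both pages.
        System.out.println("full model     : " + evaluate(POSITIONAL_MILEAGE, FULL_MODEL));
        System.out.println("missing engine : " + evaluate(POSITIONAL_MILEAGE, MODEL_WITHOUT_ENGINE));
    }
}
```

On the second snippet the expression still returns a value, just not the mileage, which is why this kind of failure is easy to miss until someone reports wrong data.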

We discussed that extraction can be made generic if you can identify any attribute uniquely, and this is where jsoup works for you. You come up with an idea that uses text-based search, e.g. for the text "mileage". You check various sites again and find a common pattern: the data is laid out as key-value pairs. So if you can find the key uniquely, irrespective of its position in the HTML, extraction stays generic even when the order changes. Here is the solution:

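Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public static String URL = "https://www.cardekho.com/maruti/swift-2018";
// :matchesOwn(regex) selects td cells whose own text matches; (?i) ignores case
public static String cssSelector = "td:matchesOwn((?i)mileage)";
public static String mileage() throws java.io.IOException {
    Document document = Jsoup.connect(URL).get(); // fetch and parse the page
    Element type = document.select(cssSelector).first(); // the key cell
    return "Mileage : " + type.siblingElements().first().text(); // value is in the sibling cell
}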

So the whole idea is to find the key's HTML element uniquely and read the text of its sibling. You might encounter a few sites where the key sits inside nested HTML tags, but that too can be moved into the external configuration. That's how you can build a generic data extraction framework that gets data from various sites.
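Putting the key-and-sibling idea together with external configuration, one possible shape for the generic part is sketched below. It is not a finished framework: the class name GenericExtractor, the two HTML snippets, their values and the rule maps are all hypothetical, and in a real setup the rules would be loaded from an external store such as a database instead of being declared in code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class GenericExtractor {
    // Hypothetical site A: key-value pairs laid out as table cells.
    static final String SITE_A =
        "<table>"
      + "<tr><td>Fuel Type</td><td>Petrol</td></tr>"
      + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
      + "</table>";

    // Hypothetical site B: same data, different tags and a different order.
    static final String SITE_B =
        "<div class=\"spec\"><span>Mileage (ARAI)</span><span>22 kmpl</span></div>"
      + "<div class=\"spec\"><span>Fuel</span><span>Petrol</span></div>";

    // Per-site rules: attribute name -> selector that finds the key element by its text.
    // Assumption: in practice these come from an external source, not from code.
    static final Map<String, String> RULES_A = Map.of(
        "mileage", "td:matchesOwn((?i)mileage)",
        "fuelType", "td:matchesOwn((?i)fuel)");
    static final Map<String, String> RULES_B = Map.of(
        "mileage", "span:matchesOwn((?i)mileage)",
        "fuelType", "span:matchesOwn((?i)fuel)");

    // Finds each key element and reads the text of the element that follows it.
    static Map<String, String> extract(String html, Map<String, String> keySelectors) {
        Document doc = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : keySelectors.entrySet()) {
            Element key = doc.select(rule.getValue()).first();
            if (key == null) {
                continue; // this page does not have the attribute; skip it
            }
            Element value = key.nextElementSibling();
            if (value != null) {
                result.put(rule.getKey(), value.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("site A: " + extract(SITE_A, RULES_A));
        System.out.println("site B: " + extract(SITE_B, RULES_B));
    }
}
```

The extraction method never changes between sites; only the rule map does. A missing attribute simply drops out of the result instead of shifting another attribute's value into its place.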
