A web crawler is an internet bot used to discover web resources (web pages) on the World Wide Web (WWW). It is mainly used by web search engines such as Google and Bing to build and update search indices. A crawler usually starts with a set of known seed URLs and then uses the fetched pages to find URLs for subsequent crawling.
In this article we are going to look at the following:
- features that crawlers must provide
- features that crawlers should provide
- how to run a sample web crawler using Crawler4j
- evaluating the functionality of Crawler4j with respect to the expected features
- focused crawling with Crawler4j
MUST-HAVE FEATURES
There are a few features any crawler must provide in order to crawl web resources effectively:
- Implicit Politeness: A crawler must be polite; it should avoid hitting any site too often. Excessive usage of site resources by crawlers may result in denial of service for legitimate users.
- Explicit Politeness: Site administrators usually use a robots.txt file to indicate which parts of a site should not be accessed by web crawlers, and a crawler must always adhere to it. Ill-behaved crawlers can be banned by site administrators.
- Robustness: A crawler should be immune to malicious behaviors of web servers such as spider traps and spam pages. Not all traps are malicious; some are due to faulty website development.
SHOULD-HAVE FEATURES
There are a few features any crawler should possess in order to crawl web resources effectively:
- Be capable of distributed operation: the crawler should be designed to run on several machines in a distributed manner.
- Be scalable: the crawler should be designed so that the crawl rate can be increased by adding more machines.
- Be efficient: the crawler should permit full use of available resources such as processor time, storage and network bandwidth.
- Continuous operation: the crawler should crawl pages continuously so that data freshness is preserved.
- Extensibility: the crawler should be extensible in order to handle new data formats, fetch protocols, etc.
RUNNING A SAMPLE WEB CRAWLER USING CRAWLER4J
Crawler4j is an open-source web crawler for Java, distributed under the Apache 2.0 license. IntelliJ IDEA, Maven and Java are required to follow the steps below.
1. A new Java project can be created in IntelliJ IDEA using Maven as shown in the images below.
2. The following dependency should be added to pom.xml.
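For reference, the crawler4j coordinates on Maven Central look like this (version 4.4.0 is used here as an example; check Maven Central for the latest release):

```xml
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version>
</dependency>
```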
3. A custom crawler class should be added by extending the WebCrawler class provided by the Crawler4j framework. Two methods, "shouldVisit" and "visit", should be overridden to provide the expected behavior for the crawler, as shown in the code snippet below.
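A minimal sketch of such a class is shown below, assuming the crawler4j 4.x API. The class name, extension filter and seed domain are placeholders chosen for illustration:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {

    // Skip URLs pointing at media/binary resources (placeholder filter)
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Only follow links that pass the filter and stay on the seed domain (placeholder)
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called for each page that was fetched and passed shouldVisit
        String url = page.getWebURL().getURL();
        System.out.println("Visited: " + url);
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            System.out.println("Text length: " + htmlParseData.getText().length());
            System.out.println("Outgoing links: " + htmlParseData.getOutgoingUrls().size());
        }
    }
}
```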
4. A main class is created as given below to configure and start the crawler.
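A sketch of such a controller class, again assuming the crawler4j 4.x API and a custom crawler class named MyCrawler (the storage path and seed URL are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // intermediate crawl data (placeholder path)
        config.setMaxDepthOfCrawling(1);            // only one level of links from the seed

        // Page fetcher and robots.txt handler shared by all crawler threads
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.example.com/"); // placeholder seed URL
        int numberOfCrawlers = 2;                       // number of crawler threads
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```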
5. Once you run the program, the crawler will start crawling the web from the seed URL. The crawler output will be as follows.
EVALUATING CRAWLER4J WITH RESPECT TO THE EXPECTED FEATURES
Crawler4j achieves politeness through a parameter called "politenessDelay", whose default value is 200 milliseconds. Users can tune this according to their requirements: Crawler4j will wait at least the amount specified in "politenessDelay" between requests.
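As a small configuration sketch (assuming the CrawlConfig API of crawler4j 4.x), the delay can be tuned like this:

```java
CrawlConfig config = new CrawlConfig();
// Wait at least 500 ms between consecutive requests (default is 200 ms)
config.setPolitenessDelay(500);
```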
Crawler4j evaluates each fetched URL against the robots.txt file of the corresponding host.
Crawler4j achieves robustness by crawling sites according to the given robots.txt, which minimizes the chance of getting trapped. The user can also specify a limit for the depth of crawling from the seed page; doing so reduces the chance of getting trapped to a great extent.
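For example (again assuming crawler4j 4.x's CrawlConfig), the crawl can be bounded as follows; the specific values are illustrative:

```java
CrawlConfig config = new CrawlConfig();
config.setMaxDepthOfCrawling(2); // follow links at most 2 hops from the seed page
config.setMaxPagesToFetch(1000); // stop after fetching 1000 pages in total
```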
Crawler4j does not natively support distributed operation, but there are extended projects based on Crawler4j that do.
Even though Crawler4j does not support distributed operation, it can be scaled up by running multiple crawler threads.
The performance and efficiency of Crawler4j can be increased by raising the number of crawler threads and reducing the politeness delay.
FOCUSED CRAWLING WITH CRAWLER4J
Usually the seed page and other web pages have many outgoing links, so if the crawler is allowed to crawl freely, it will fetch a large number of unnecessary pages. To focus the crawler on only the required pages, the following techniques can be used:
- Add an appropriate filter to the shouldVisit method: in the example above, the crawler will not download pages with unnecessary extensions.
- Set an appropriate crawl depth: in the example above, the crawl depth is set to one, so the crawler will only consider pages one level of depth from the seed page.
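The extension-filter idea behind shouldVisit can be tried out in isolation as plain Java, without running a crawl at all. The pattern and seed domain below are illustrative placeholders, not values taken from Crawler4j itself:

```java
import java.util.regex.Pattern;

public class FilterDemo {
    // Reject URLs ending in media/binary extensions (illustrative filter)
    static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    // Mirrors the kind of check a shouldVisit override would perform
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("https://www.example.com/docs/page.html")); // true
        System.out.println(shouldVisit("https://www.example.com/logo.png"));       // false
        System.out.println(shouldVisit("https://other.example.org/page.html"));    // false
    }
}
```

Tightening the filter this way keeps the frontier small, which matters far more than crawl speed when the goal is focused crawling.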