Paper: “IRLbot: Scaling to 6 Billion Pages and Beyond”
Two of the hardest problems in building a crawler are URL uniqueness and host politeness. A crawler has to keep visiting new pages, and to know whether a URL represents a new page, you have to do a lookup over…
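To make the two concerns concrete, here is a minimal in-memory sketch in Python. It is not the paper's method, only a naive illustration: the `Frontier` class, its method names, and the `min_delay_s` parameter are all hypothetical, and it assumes every URL seen so far fits in RAM, which is the assumption that stops holding at the scale of billions of pages.

```python
from urllib.parse import urlsplit


class Frontier:
    """Naive in-memory crawl bookkeeping (illustrative only)."""

    def __init__(self, min_delay_s=1.0):
        self.seen = set()          # every URL accepted so far
        self.next_allowed = {}     # host -> earliest time of the next fetch
        self.min_delay_s = min_delay_s

    def add(self, url):
        """URL uniqueness: return True only the first time a URL is offered."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

    def may_fetch(self, url, now):
        """Host politeness: allow a fetch only if the host's delay has elapsed.

        On success the host is reserved until now + min_delay_s.
        """
        host = urlsplit(url).hostname
        if now < self.next_allowed.get(host, 0.0):
            return False
        self.next_allowed[host] = now + self.min_delay_s
        return True
```

A caller would pass each discovered link to `add` and, before each download, ask `may_fetch` with the current time (for example `time.monotonic()`). The set lookup is what becomes expensive once the URL set no longer fits in memory.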