Automation of resolving CAPTCHAs for web crawling

Abhay Bhagat
Jun 24, 2016

The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages today. With this abundance of data and the variety of user perspectives, information retrieval has become a challenge: a single search can return hundreds of thousands of results. For the user's convenience, search engines therefore have a bigger job to do. They sort results in order of likely interest to the user, so that links with the highest probability of providing useful information are ranked at the top of the results page, each accompanied by a quick summary of the information on that page.

Web crawling is the process behind this scenario: it is how search engines collect data from the internet and store it in a database. The crawling phase runs into several problems, and CAPTCHA solving is one of them. It creates performance bottlenecks for search engines, such as slow data collection and stale copies of web sites in the database. The following section gives a brief description of the problem.

Overview

A CAPTCHA is essentially an image of characters that a user is supposed to recognize and type in for verification. There are many types of CAPTCHAs, such as audio CAPTCHAs, still-image CAPTCHAs, moving-display CAPTCHAs, and computational CAPTCHAs. CAPTCHAs also differ in the number of characters, digits, and special symbols they contain, in the font, size, and color of the characters, and in the background. Solving such CAPTCHAs automatically therefore requires image processing with some complex operations. Finally, the solved CAPTCHA, i.e. the recognized characters, is fed to the crawling process with minimal delay so that data collection continues without interruption.
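To make the image-processing step concrete, here is a minimal sketch, assuming OpenCV 4.x and a simple still-image CAPTCHA with dark characters on a light background; the file name is a placeholder:

```python
import cv2

# Load the CAPTCHA image (placeholder path) and convert to grayscale.
img = cv2.imread("captcha.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu thresholding separates the characters from the background
# without a hand-tuned threshold value.
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Each external contour ideally corresponds to one character.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

# Sort the character boxes left to right so the decoded string is in order.
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])
chars = [binary[y:y + h, x:x + w] for (x, y, w, h) in boxes]
# Each crop in `chars` can now be passed to a trained character classifier.
```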

Web crawlers are programs that browse the web in a methodical, automated way. A crawler makes a copy of every page it visits for later processing by a search engine. The process is iterative, continuing as long as the results stay in close proximity to the user's interests. Search engines then apply algorithms that sort and rank the results in order of authority, that is, closeness to the user's query. Many algorithms are in use: breadth-first search, best-first search, the PageRank algorithm, genetic algorithms, and Naïve Bayes classification, to mention a few. Certain characteristics of the web, such as its sheer volume and dynamic page generation, make crawling very difficult. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added, or that existing pages have already been updated or even deleted.

The performance of a web crawler is based on freshness and age. A page is considered "fresh" when the same copy exists locally and at the remote source. Freshness asks whether or not the local copy is the current copy of the resource; age asks how long ago the local copy was updated. Freshness drops to zero the moment the real-world element changes, and age increases linearly from that point on. When the local element is synchronized with the real-world element, its freshness recovers to one and its age drops back to zero.
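The freshness and age bookkeeping described above can be modeled in a few lines. This is an illustrative sketch, not the crawler's actual code; the function names are made up and timestamps are assumed to be seconds since the epoch:

```python
def freshness(last_remote_change: float, last_local_sync: float) -> int:
    """1 while the local copy still matches the remote page, else 0."""
    return 1 if last_local_sync >= last_remote_change else 0


def age(last_remote_change: float, last_local_sync: float, now: float) -> float:
    """0 while the copy is fresh; grows linearly after the remote page changes."""
    if last_local_sync >= last_remote_change:
        return 0.0
    return now - last_remote_change
```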

Internet browsers and Big Data companies use web-crawling software to keep their databases up to date, the ultimate goal being to provide the best service to internet users. This requires frequent crawling to reduce page age and improve freshness. On the other side, web servers can become overloaded, since they must handle requests from the site's human visitors as well as from crawlers. Crawlers can retrieve data much faster and in greater depth than human searchers, so they can have a crippling impact on a site's performance: a single crawler issuing multiple requests per second and/or downloading large files is already hard on a server, let alone several crawlers at once. Crawlers therefore follow a politeness policy so that a site's performance is not heavily affected while a portion of it is being downloaded.

To slow down rapid requests from the same machine, and to check that the machine is not a malicious client trying to crash the system by flooding it with requests (a denial-of-service, or DoS, attack), servers ask the client to solve a CAPTCHA. A regular crawler has no built-in CAPTCHA-resolving program, so passing this authentication phase requires human intervention. If a human is involved in the crawling process to resolve CAPTCHAs, the resulting delay must be taken into account: given the tremendous size of the web, it is not feasible for a person to solve and enter a CAPTCHA for each web server over and over. This would significantly reduce the efficiency of the crawling software and, given human nature, would also be more prone to error.
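As a rough illustration of the politeness policy mentioned above, the sketch below throttles requests and respects robots.txt using only the Python standard library. The URLs, the user-agent string, and the 2-second delay are illustrative assumptions, and a comment marks where a CAPTCHA challenge would be handed to an automated solver:

```python
import time
import urllib.request
import urllib.robotparser

# Check the site's robots.txt before fetching anything.
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not rp.can_fetch("my-crawler", url):
        continue  # the site disallows this path for crawlers
    with urllib.request.urlopen(url) as resp:
        html = resp.read()
    # If the response were a CAPTCHA challenge instead of the page,
    # this is where it would be passed to the automated solver.
    time.sleep(2)  # politeness delay so the server is not overwhelmed
```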

Scope

The Automation of Resolving CAPTCHAs application is being developed to speed up the web crawling process used by web servers and database companies. Built with OpenCV, the application is expected to have the following features (a sketch follows the list):

· The application should differentiate between CAPTCHAs and other images on web pages.

· It should recognize characters in the range a-z, A-Z, 0-9.

· The CAPTCHA should be in PNG, JPG, JPEG, or BMP format.
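A minimal sketch of these scope checks is given below. It assumes pytesseract as the recognition back end purely for illustration (the project itself is built on OpenCV), and the function name is hypothetical:

```python
import os

import cv2
import pytesseract

# Image formats accepted per the scope above.
ALLOWED_FORMATS = {".png", ".jpg", ".jpeg", ".bmp"}

WHITELIST = ("abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def solve_captcha(path: str) -> str:
    """Reject unsupported formats, then OCR with an a-z, A-Z, 0-9 whitelist."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {ext}")
    img = cv2.imread(path)
    config = ("--psm 8 "  # treat the image as a single word
              f"-c tessedit_char_whitelist={WHITELIST}")
    return pytesseract.image_to_string(img, config=config).strip()
```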

Test Cases

What we have done here is build a solution for the most commonly used text-based CAPTCHAs. Some of those types, with the steps for solving them, are given below.

The algorithms used for this are explained in the following blog:

These text-resolving techniques can be efficiently utilized in resolving CAPTCHAs for web crawling, ultimately reducing the overall crawling time.

The automation of resolving CAPTCHAs for web crawling software achieves a significantly high success rate. Still, there are areas for improvement within the implemented scope of this project. For example, the system could be trained on characters beyond those mentioned in the scope, such as regional-language characters and extra symbols (hash, slash, dollar, brackets, etc.). The system could also be extended to read text aloud for blind users, to recognize license plates automatically for road-traffic monitoring and suspicious-driver detection, and to serve other Big Data management applications.

The complete paper is available at the link below:
