Creeping Crawlers

William L. Weaver · Published in TL;DR Innovation · Feb 22, 2018 · 4 min read

Autonomous Adaptive Data Mining Agents

Measurement scientists deal constantly with the problem of too much data. Faced with a large pile of ore, a mining operation may reasonably ask how much of the desired material it contains. A reasoned sampling strategy coupled with the appropriate statistics can yield an average value that describes the entire collection without requiring analysis of every bit of it. After the ore is processed, the actual value can be compared with the average prediction. This procedure works well when the entire collection will be processed, but a more difficult question arises when it comes time to decide where in the ground to obtain the next load of ore. In this larger picture, each pile of processed ore represents a small sample of the entire region. Ore rich in the desired material is frequently not distributed randomly but clustered in neighborhoods of quality ore. Rather than merely commenting on the average content of the region, a successful mining operation can excavate along veins of superior ore without having to process entire mountains.

Photo by michael podger on Unsplash

This same problem arises in the realm of information. Data mining shares a strong analogy with material mining, as the valuable objects are distributed throughout a large collection of undesirable material and must be hunted, located, and retrieved. Much like the large pile of ore that is processed entirely, search engines such as Google attempt to analyze and catalog every page of the World Wide Web and permit users to search their archived index for the information they are seeking. However, unlike traditional mining, the Web is a dynamic place whose content and size change rapidly. An index-based search engine must continually expand its storage capacity and revisit pages often to catalog newly added and updated information.

To assist index-based search engines, several researchers are developing “topical-crawlers” — algorithms that surf the actual Web (not an index) looking for links to pages containing information relevant not only to a short query, but also similar in context to topics and user profiles. Different strategies are being pursued, and each has its advantages. The first such crawler, developed in the mid-1990s, was known simply as WebCrawler. It uses a Breadth-First algorithm that keeps a list of the hyperlinks found on each Web page and follows them sequentially in search of query matches. Soon after, the Best-First algorithm incorporated an evaluation of the hyperlink list: links judged the most promising are promoted to the top of the list, and irrelevant links are removed to make room for new candidates.
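To make the contrast concrete, here is a minimal sketch of the two frontier strategies. This is an illustration, not the historical WebCrawler code; the helpers fetch_links, score_link, and matches_query are hypothetical stand-ins for a page fetcher, a link-relevance estimate, and a query test.

```python
import heapq
from collections import deque

def breadth_first_crawl(seeds, fetch_links, matches_query, max_pages=100):
    """Breadth-First: follow hyperlinks strictly in the order discovered."""
    frontier, visited, hits = deque(seeds), set(), []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        if matches_query(url):
            hits.append(url)
        frontier.extend(fetch_links(url))  # new links join the back of the queue
    return hits

def best_first_crawl(seeds, fetch_links, score_link, matches_query,
                     max_pages=100, frontier_cap=500):
    """Best-First: always expand the most promising link (min-heap on -score)."""
    frontier = [(-score_link(u), u) for u in seeds]
    heapq.heapify(frontier)
    visited, hits = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if matches_query(url):
            hits.append(url)
        for link in fetch_links(url):
            heapq.heappush(frontier, (-score_link(link), link))
        if len(frontier) > frontier_cap:  # prune irrelevant links to make room
            frontier = heapq.nsmallest(frontier_cap, frontier)
    return hits
```

The only difference is the frontier data structure: a FIFO queue explores links in discovery order, while a priority queue lets the relevance estimate steer the crawl.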

The PageRank algorithm is used by Google to rate the importance of pages returned by a query of its index. The number of Web pages that link to a particular page increases its rank and promotes it toward the top of the query results. Topical-crawlers have successfully employed PageRank as a condition for the Best-First evaluation algorithm, but the rank of each page must be updated dynamically, requiring a large amount of recursive calculation and the creation of a cataloged history of visited pages. The Fish-Search algorithm is a Best-First-type crawler that assumes relevant information is found in similar neighborhoods of the Web, much like following a vein of rich ore in a material mine. Its crawlers search more exhaustively in areas of the Web where pertinent links are found and abandon searches that do not produce relevant pages.
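The recursive flavor of that calculation shows up even in a toy version. The sketch below is a bare power-iteration rendering of the PageRank idea over a hand-built link graph, not Google's production computation; the damping factor of 0.85 is the value commonly quoted in the literature.

```python
def pagerank(links, damping=0.85, iters=50):
    """Iteratively rank pages in a dict mapping each page to its outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    if target in new_rank:  # ignore links that leave the graph
                        new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_web))  # C is linked by both A and B, so it ranks highest
```

Every page's rank depends on the ranks of the pages that link to it, which is exactly why a crawler using PageRank must keep a history of visited pages and recompute as the link graph grows.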

With the goal of infusing computational intelligence into the Best-First search, Professor Filippo Menczer of Indiana University and his colleagues have developed the InfoSpiders algorithm. An adaptive population of crawlers is initialized with pages retrieved from a traditional search engine. The seed pages are retrieved and analyzed for the number of topic keywords appearing in the vicinity of each hyperlink contained on the page. The frequency scores are used as input to a neural network inside each crawler whose single output is the one link predicted to be the most valuable. Each InfoSpider performs a jump to its chosen link, and this new page is scored according to the number of topic keywords contained on it. A high score increases the crawler’s “energy level” while a low score decreases it. In this way, the InfoSpider receives direct feedback on its performance and uses a back-propagation learning strategy to update the weights of its neural net. Agents whose energy falls below a set threshold are declared dead and removed from the population, while agents with high energy reproduce according to a specific inheritance and mutation strategy intended to spawn offspring with enhanced abilities to find high-energy pages.
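A single agent's life cycle might be caricatured as follows. This is an illustrative sketch under heavy simplification, not Professor Menczer's implementation: the neural net is collapsed to a single layer (so the back-propagation update reduces to the delta rule), and every class name, method name, and constant here is invented for the example.

```python
import random

class InfoSpiderSketch:
    """One toy agent: scores links by keyword frequencies, learns from payoff."""

    def __init__(self, keywords, weights=None, energy=1.0, lr=0.1):
        self.keywords = keywords
        self.weights = weights or {k: 1.0 for k in keywords}  # one weight per keyword
        self.energy, self.lr = energy, lr

    def score(self, freqs):
        # weighted sum of topic-keyword frequencies found near a hyperlink
        return sum(self.weights[k] * freqs.get(k, 0.0) for k in self.keywords)

    def choose_link(self, candidates):
        # candidates: {url: keyword-frequency dict near that link}
        return max(candidates, key=lambda url: self.score(candidates[url]))

    def learn(self, freqs, payoff, cost=0.1):
        # payoff is the keyword score of the fetched page; it feeds both the
        # energy account and the learning signal
        self.energy += payoff - cost
        # delta rule: the single-layer special case of back-propagation
        error = payoff - self.score(freqs)
        for k in self.keywords:
            self.weights[k] += self.lr * error * freqs.get(k, 0.0)

    def offspring(self, mutation=0.05):
        # reproduction: the child inherits slightly mutated weights and
        # half of the parent's energy
        child_w = {k: w + random.gauss(0.0, mutation)
                   for k, w in self.weights.items()}
        self.energy /= 2
        return InfoSpiderSketch(self.keywords, child_w,
                                energy=self.energy, lr=self.lr)
```

A population loop would then cull any agent whose energy drops below a threshold and call offspring() on the high-energy survivors, so selection pressure gradually concentrates the crawl in productive neighborhoods of the Web.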

Professor Menczer’s research group has coded the InfoSpiders algorithm into an experimental prototype: a multi-threaded Java applet called MySpiders, available for use at myspiders.informatics.indiana.edu [now at http://carl.cs.indiana.edu/fil/CAS/PPT/Menczer/]. With advancing research into autonomous adaptive data mining agents, perhaps the material mining industry can benefit from the technology, and the announcement of a robotic exploration system called MyGophers may be on the horizon.

This material originally appeared as a Contributed Editorial in Scientific Computing and Instrumentation 22:4 March 2005, pg. 12.

William L. Weaver is an Associate Professor in the Department of Integrated Science, Business, and Technology at La Salle University in Philadelphia, PA USA. He holds a B.S. Degree with Double Majors in Chemistry and Physics and earned his Ph.D. in Analytical Chemistry with expertise in Ultrafast LASER Spectroscopy. He teaches, writes, and speaks on the application of Systems Thinking to the development of New Products and Innovation.
