Using Web Crawlers to Data-Mine for a Highly Specific Database Build: A Recent Case Study
Key Words: Web Spider, Data mining, natural language processing, database design, Bot Building
This article describes a programme of work to locate, classify and build a unique database underlying a new digital media business. We were asked to locate and structure a database, as comprehensive and complete as possible, of businesses offering a particular kind of service. A new digital publishing enterprise was to be built upon this data platform, which did not exist at the time and which our customer thought would be difficult or even impossible to create. To create the new database we employed a variety of technologies and approaches, which we describe below.
Central to any data-mining project is access to a sufficient amount of data that can be processed to provide meaningful and statistically relevant information. But acquiring the data is only the first phase, and in this case, where the project required comprehensive coverage, the problem was particularly hard, not least because the data must first be collected in a relatively unstructured form. Only then can it be transformed into a structured format suitable for processing and presentation to users.
In the case of novel applications such as the digital product we were supporting, it is usually necessary to crawl the web to collect information. As if the data mining process were not hard enough (and it was), crawling the web to discover and collect content brings many challenges of its own. These become increasingly difficult at large scale, amongst large data volumes with no rules as to content, content organisation or location.
Initially we used a structured search model to reduce the scale and variability of our data. Although this generated a fair degree of redundancy in the results our searches returned, it enabled us to go on to create a highly structured output for our clients.
Our first task was to understand the nature of the business activity we were seeking to capture, and in particular the nature of the industry: the way it operates and the key players involved. We quickly realised that the variety of practices, and the huge differences in the scale of the businesses and activities we were seeking to capture and classify, could only be addressed using a sophisticated multi-method approach. This initial desk research was backed up by a small number of interviews. This ‘soft’ research produced information which in the end was essential to the successful outcome of the project.
In the first stage we characterised the industries we were interested in using standard industrial classification titles and a finite, well-defined set of geographies. At the same time we created a corpus of key words associated with the business services we were interested in. These three sources of information were then combined and used by a search tool to find and create a superset of URLs to feed into stage two of the process.
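As a rough illustration of how three such lists might be combined into search queries, the sketch below takes the Cartesian product of industries, geographies and key words. The function name `build_queries`, the query format and the sample values are all assumptions made for illustration; the article does not say which search tool was used or how its queries were phrased.

```python
from itertools import product


def build_queries(industries, geographies, keywords):
    """Return one search string per (industry, geography, key word) combination."""
    return [
        f'"{keyword}" "{industry}" {geography}'
        for industry, geography, keyword in product(industries, geographies, keywords)
    ]


# Hypothetical sample inputs, not taken from the project.
industries = ["Example industry A", "Example industry B"]
geographies = ["Leeds", "Bristol"]
keywords = ["example service"]

queries = build_queries(industries, geographies, keywords)
for query in queries:
    print(query)
```

Because every combination is generated, the number of queries grows multiplicatively with the size of each list, which is one reason a search of this kind tends to return a good deal of redundant results.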
In the second stage a web crawler examined the web sites generated in stage one, and this time criteria were applied which allowed us to distinguish between different kinds of web site. If the keywords we were looking for were located in either the header information or a tab on the main page, we added the web site to a database of sites that could be visited again by a web crawler designed to retrieve the information from that class of site. If the key items of information we were searching for were located only in the body text of the site, the site was classified differently and another tactic for retrieving information was followed.
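The classification rule described above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch under stated assumptions: "header information" is taken to mean the `<title>` and the description/keywords `<meta>` tags, and a "tab" is taken to mean text inside a `<nav>` element. The class name, category labels and sample pages are hypothetical.

```python
from html.parser import HTMLParser


class KeywordLocator(HTMLParser):
    """Separate text found in the title, meta tags or nav from all other text."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside <title> or <nav>
        self.prominent = []   # text from title, meta tags and nav
        self.body = []        # everything else

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("title", "nav"):
            self.depth += 1
        elif tag == "meta" and (attributes.get("name") or "").lower() in ("description", "keywords"):
            self.prominent.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag in ("title", "nav") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        (self.prominent if self.depth else self.body).append(data)


def classify(html, keywords):
    """Label a page by where the key words occur (labels are assumed names)."""
    locator = KeywordLocator()
    locator.feed(html)
    locator.close()
    prominent = " ".join(locator.prominent).lower()
    body = " ".join(locator.body).lower()
    if any(keyword.lower() in prominent for keyword in keywords):
        return "structured"
    if any(keyword.lower() in body for keyword in keywords):
        return "body_text"
    return "no_match"


# Hypothetical sample pages.
page_a = '<html><head><title>Acme</title></head><body><nav><a href="/s">Widget hire</a></nav><p>Hello</p></body></html>'
page_b = "<html><body><p>We also offer widget hire on request.</p></body></html>"
print(classify(page_a, ["widget hire"]), classify(page_b, ["widget hire"]))
```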
This stage concentrated on analysing the text of web sites that discussed the activity we were concerned with but were not identified as suitable for inclusion in the final collection of web sites offering the service we were tasked with finding. Instead, we used language processing techniques to find new ‘language’ cues that we could use to create a richer and more focussed set of search terms. The searches were then carried out again with these terms, and the results, albeit containing a lot of redundancy, significantly added to the efficiency of our search.
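The article does not name the language processing techniques that were used. Purely as an illustration of the general idea, the sketch below substitutes a very simple method: it counts two-word phrases in a set of texts, skips common stop words and the existing search terms, and proposes phrases that recur as candidate new search terms. The stop-word list, function name and sample texts are assumptions.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "we", "our", "on", "is", "with"}


def candidate_terms(texts, seed_terms, top_n=5):
    """Return recurring two-word phrases that are not already seed terms."""
    seeds = {seed.lower() for seed in seed_terms}
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for first, second in zip(words, words[1:]):
            if first in STOPWORDS or second in STOPWORDS:
                continue
            phrase = f"{first} {second}"
            if phrase not in seeds:
                counts[phrase] += 1
    # Keep only phrases seen more than once, most frequent first.
    return [phrase for phrase, count in counts.most_common(top_n) if count > 1]


# Hypothetical sample texts.
texts = [
    "We offer widget hire and gadget rental.",
    "Gadget rental for events.",
    "Widget hire in Leeds, plus gadget rental.",
]
new_terms = candidate_terms(texts, seed_terms=["widget hire"])
print(new_terms)
```

Phrases proposed this way would still need a human check before being fed back into the search, since raw frequency says nothing about relevance.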
A completely separate method was followed in order to increase the size of the target sample of URLs used for the data harvesting phase. In this stage we used the public access API of the UK Twitter feed and constrained this by setting time limits. The resulting database of tweets was then searched using a variety of approaches which reflected the many ways in which the activity we were concerned with might be mentioned by a host of different actors and agencies. The output from these searches was a set of URLs. Once we had checked for multiple entries, the URLs from the different search strategies were added to a final database that was used for data harvesting.
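The step of pulling URLs out of free text and checking for multiple entries might look something like the sketch below. It is only an assumption about how duplicates could be detected: two URLs are treated as the same entry if they match after ignoring the scheme, a leading `www.`, trailing punctuation and a trailing slash. Shortened links are not resolved here, and the function names and sample data are hypothetical.

```python
import re
from urllib.parse import urlsplit

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)"


def normalise(url):
    """Reduce a URL to host plus path so near-identical entries compare equal."""
    parts = urlsplit(url.rstrip(TRAILING_PUNCTUATION))
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return f"{host}{parts.path.rstrip('/')}"


def merge_urls(*sources):
    """Merge URLs found in several collections of text, keeping the first of each."""
    merged = {}
    for source in sources:
        for text in source:
            for url in URL_PATTERN.findall(text):
                key = normalise(url)
                if key and key not in merged:
                    merged[key] = url.rstrip(TRAILING_PUNCTUATION)
    return merged


# Hypothetical sample inputs: tweet texts and URLs from the web searches.
tweets = [
    "Great day with http://www.example.com/services/ today!",
    "see https://example.com/services.",
]
search_hits = ["https://example.org/"]

merged = merge_urls(tweets, search_hits)
print(merged)
```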
We developed several custom spiders for crawling the pages. Depending on the category our filtering had assigned a URL to, we did one of three things: sent a general purpose spider to extract the information we wanted about the page; wrote more specific features for handling the unique aspects of the sources of data we were looking at; or, finally, picked out a generic overview of what it was possible to know from otherwise data-sparse pages. Any data we extracted were added to the database as they were collected, with the domain as the unique key. NoSQL was used for data storage, as some domains turned out to be the host for several target entities.
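To make the dispatch and storage pattern concrete, here is a minimal sketch under several assumptions. The three spider functions are stubs that perform no network access, the category labels are invented, and a plain dictionary stands in for the document store, with one document per domain holding a list of entities. None of these names come from the project itself.

```python
from urllib.parse import urlsplit


# Stub spiders: real ones would fetch and parse the page.
def general_spider(url):
    return {"source_url": url, "extractor": "general"}


def specialised_spider(url):
    return {"source_url": url, "extractor": "specialised"}


def overview_spider(url):
    return {"source_url": url, "extractor": "overview"}


# Hypothetical category labels mapped to the spider that handles them.
SPIDERS = {
    "structured": general_spider,
    "special_source": specialised_spider,
    "sparse": overview_spider,
}


def harvest(url, category, store):
    """Run the spider for this category and file the result under the domain."""
    spider = SPIDERS.get(category, overview_spider)
    record = spider(url)
    domain = urlsplit(url).hostname
    # One document per domain; several entities may share the same host.
    document = store.setdefault(domain, {"_id": domain, "entities": []})
    document["entities"].append(record)
    return document


store = {}
harvest("https://example.com/business-a", "structured", store)
harvest("https://example.com/business-b", "sparse", store)
print(store)
```

Keeping a list of entities inside each domain document is one way a schema-free store can accommodate a host that turns out to carry several businesses, without redefining a table layout.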
Cleaning and filtering were then performed, along with checks to see whether empty fields could be populated by other means.
The first and most important insight, and one that has already informed our subsequent work, is that time spent on thoroughly understanding the ecology of the target population of company and business web sites, and their goals, was repaid in many ways. Not least, the work we put in to classify and structure our understanding of the potential range of web data sources made our effort to achieve both a complete and well-structured dataset for our customers all the more successful. Second, the comprehensive volume of data was further enhanced by a purposive mining of a Twitter sample. Finally, we have gone on to refine further the approach in which we feed back into our search system the results of an interim classification of free text materials garnered from web sites. These three lessons, all of which have emerged in previous projects, came to life in this particular context.
We have produced for our client the largest and most up-to-date dataset of companies offering the services they have targeted that has ever existed. We are fortunate in that we could judge our work against previous attempts to harvest such material from the Internet. Our methods increased the volume of target businesses by around 40%, and we were able to identify and structure more information from the source websites as well as finding more of them. Finally, we have built a system that can operate semi-automatically to cover the enormous task of validating and discovering new links. These last services have laid the basis for a new live service where no such facility existed before.
Our customers have not yet launched their services, but when they do we may be able to share some more details of the project, in particular the use of natural language processing and the development of our bot control strategies, which allow for the managed integration of a number of different spiders operating under a control hierarchy. Even more interesting will be a prospective research project to test the concept of building a conversational bot to provide access to the database we have created. Stay tuned.