Web scraping and indexing with StormCrawler and Elasticsearch

Naween Fonseka
Nov 7, 2019

Recently I started working on a requirement to add a search component to every website that customers publish through our application. The challenging part of adding search to a website is scraping the site's data and indexing it appropriately. I did some research on existing web crawlers and search indexes and, based on the results, decided to go with StormCrawler, an open-source collection of resources that runs on Apache Storm, together with Elasticsearch to index the crawled data. In this article I'll share how I achieved the required results.

In this tutorial I have used the following libraries and resources to complete the task.

  1. StormCrawler (v1.15)
  2. Elasticsearch (v7.1.0)
  3. Apache Storm (v1.2.3)
  4. Apache Maven (v3.6.2)

P.S.: StormCrawler v1.15 supports the latest version of Elasticsearch (v7.4.0). However, I decided to use Elasticsearch v7.1.0 for this tutorial because, at the time of writing, AWS only supports Elasticsearch v7.1.0. (I used AWS SQS to feed the website domains into StormCrawler; I share that experience in a separate article.)

Things that will be covered in this article

  • Set up a StormCrawler project with Elasticsearch configurations
  • Running StormCrawler with Apache Flux in local mode (injector and crawler separately)

Prerequisites

  • Apache Storm added to the PATH variable
  • Elasticsearch running on a local server
  • Apache Maven installed on your machine

Set up StormCrawler

StormCrawler has a quick setup process and can be bootstrapped using Apache Maven. The following Maven command will create a Maven project with the basic StormCrawler configuration.
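
If you use the StormCrawler Maven archetype for v1.15 (the coordinates below follow the StormCrawler documentation; double-check them against the version you are using), the command looks like this:

    mvn archetype:generate \
      -DarchetypeGroupId=com.digitalpebble.stormcrawler \
      -DarchetypeArtifactId=storm-crawler-archetype \
      -DarchetypeVersion=1.15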

The above command will prompt you to enter a groupId and an artifactId; you may skip the rest for now. I have given com.cnf271 and stormcrawlertest as the groupId and artifactId respectively.

The command will create a project folder as shown in the figure below.

Initial Project Files

First Things First,

In order to work with Elasticsearch, we need to delete some of the generated files and replace them with Elasticsearch-compatible StormCrawler files. (Note that the configuration above is good enough for basic web crawling; we are deviating from plain StormCrawler because our end goal is to feed the crawled data into Elasticsearch.)

Go ahead and clone the StormCrawler GitHub repository into a separate folder. All we need are a few configuration files from the repo's /external/elasticsearch folder. Copy the following files from /external/elasticsearch into the previously created stormcrawlertest folder.

  • ES_IndexInit.sh
  • es-conf.yaml
  • es-crawler.flux
  • es-injector.flux
  • Kibana (Folder)

Then delete the following file from the stormcrawlertest folder.

  • crawler.flux

Now create a text file called seeds.txt inside the stormcrawlertest directory.

After these alterations, the stormcrawlertest folder should look like the image below.

StormCrawler with Elasticsearch configuration files

Briefly, here is each file's purpose.

  • ES_IndexInit - Bash script that defines the indexes we are going to create in Elasticsearch.
  • es-conf.yaml - Contains the configuration for the Elasticsearch resources. (For now, we will stick with the defaults.)
  • es-injector.flux - Defines how URLs are injected into the Elasticsearch status index. It declares one spout (FileSpout, which reads URLs from a text file) and one bolt (StatusUpdaterBolt, which updates the status index after each injection).
  • es-crawler.flux - Does the actual crawling. It declares one spout (AggregationSpout, which retrieves URLs to crawl from the Elasticsearch server) and several bolts (to extract URLs, fetch pages, index the content, etc.).
  • seeds.txt - Text file containing the URLs of the websites we are about to crawl. (In this tutorial we tell StormCrawler to use the URLs given in seeds.txt.) Go ahead and add a URL to the file; to start with, I have added http://www.mit.edu/ (see the one-line example after this list).
  • crawler-conf.yaml - Contains the default StormCrawler configuration.
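
To start with, the seeds.txt used in this tutorial is a single line:

    http://www.mit.edu/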

Change the default configuration of the ES_IndexInit bash script

By default, the ES_IndexInit bash script disables content storage when indexing data in Elasticsearch, and _source is disabled as well. You may want to change the configuration as follows so that StormCrawler stores the content and returns it in query results.

Content Storing Enabled bash Script
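
As a rough sketch of what the change looks like (the field names below mirror the defaults in ES_IndexInit.sh, but your generated script is the source of truth; only flip the _source and content settings and keep everything else as generated):

    # Excerpt-style sketch of the content-index creation. The two changes from
    # the default script are: "_source.enabled" set to true, and the "content"
    # field indexed, so crawled text is stored and returned in query results.
    curl -s -XPUT 'http://localhost:9200/content' -H 'Content-Type: application/json' -d '
    {
      "mappings": {
        "_source": { "enabled": true },
        "properties": {
          "content": { "type": "text", "index": true },
          "host":    { "type": "keyword" },
          "title":   { "type": "text" },
          "url":     { "type": "keyword" }
        }
      }
    }'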

Adding StormCrawler Elasticsearch dependency

Go ahead and add the following dependency to pom.xml file.
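
The dependency is the storm-crawler-elasticsearch module, with the version matching the StormCrawler version used above:

    <dependency>
      <groupId>com.digitalpebble.stormcrawler</groupId>
      <artifactId>storm-crawler-elasticsearch</artifactId>
      <version>1.15</version>
    </dependency>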

Creating Relevant Indexes in Elasticsearch

Before building the Maven project, we need to create several indexes in Elasticsearch. That can be done by executing the ES_IndexInit bash script, which deletes any existing indexes and creates them afresh.

What is the ES_IndexInit bash script and what is its purpose?

The ES_IndexInit bash script contains several curl commands which first delete any existing search indexes and then recreate them. You may change them as needed.

The bash script will create the following indexes in ES:

  • Content - contains the actual content crawled by StormCrawler, along with its host, title, URL, etc.
  • Status - tracks the status of each URL that has been injected/crawled, e.g. DISCOVERED, FETCHED, REDIRECTION, FETCH_ERROR and ERROR. More on the status index here.
  • Metrics - contains metrics provided by Apache Storm and StormCrawler itself, such as the worker ID that was used to crawl a URL, the worker host, etc.

Execute the bash script by simply running the following command inside the project directory.
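
Assuming a Unix-like shell and that the script sits in the project root:

    chmod +x ES_IndexInit.sh
    ./ES_IndexInit.sh        # or: sh ES_IndexInit.sh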

Once the script has been executed, the following browser URLs can be used to check the status of the above indexes.
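
Assuming Elasticsearch is running on localhost:9200 and the default index names from ES_IndexInit.sh, these URLs show the (initially empty) indexes:

    http://localhost:9200/status/_search?pretty
    http://localhost:9200/content/_search?pretty
    http://localhost:9200/metrics/_search?pretty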

Building the Maven Project

Once the project has been set up with the required dependencies and the search indexes have been created on the ES server, we can create the uberjar by executing the following command in the project directory.
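
From the project root:

    mvn clean package

The uberjar will be written to the target/ folder.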

Now that the project artifact has been created, what remains is to inject and crawl the websites we have added to seeds.txt.

Injecting URLs to Elasticsearch

What is actually meant by injecting URLs into Elasticsearch?

When URLs are injected into Elasticsearch, a spout retrieves them from a particular source and sends them to the ES server for discovery.

In our case, we will be using FileSpout, which reads the URLs from a text file and sends them to the ES server. Several other spouts can be used depending on the requirement; for example, MemorySpout can be used when the URLs are held in memory.
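
For reference, the FileSpout declaration in es-injector.flux looks roughly like this (shape taken from the StormCrawler repository; check your copy for the exact values). The constructor arguments are the directory to look in, the seed file name, and a flag that marks the emitted URLs as discovered:

    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.spout.FileSpout"
        parallelism: 1
        constructorArgs:
          - "."
          - "seeds.txt"
          - true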

We will be using Apache Flux to inject our URLs into the Elasticsearch server. Apache Flux is a framework for defining and deploying Apache Storm topologies.

Go ahead and execute the following command in the project directory.
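
Assuming the build produced target/stormcrawlertest-1.0-SNAPSHOT.jar (adjust the jar name to whatever mvn clean package actually generated):

    storm jar target/stormcrawlertest-1.0-SNAPSHOT.jar \
      org.apache.storm.flux.Flux --local es-injector.flux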

As we are running the injector in local mode, Apache Flux uses a default time-to-live (TTL) of 60 seconds. During this period the injector will inject as many URLs from seeds.txt into the ES server as it can; in our case, 60 seconds is more than enough. In this initial injection, the injector topology only discovers the root URLs we added to the text file.

Once the injector has run for 60 seconds, we can check the status index for the discovery status of the URLs.
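
For example, a quick way to list the discovered URLs (assuming the default status index and field names):

    curl "http://localhost:9200/status/_search?q=status:DISCOVERED&pretty"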

Crawling the injected URLs

Once the URLs have been injected into the ES server, the crawler topology can crawl the pages depending on their crawl status and nextFetchDate. The AggregationSpout in the crawler flux retrieves URLs from the status index and starts crawling them. Like the injector, the crawler topology only runs for 60 seconds in local mode, so to scrape an entire website you may need to run the crawler flux multiple times.

The following command will start the crawler in local mode.
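
As with the injector, adjust the jar name to match your build output:

    storm jar target/stormcrawlertest-1.0-SNAPSHOT.jar \
      org.apache.storm.flux.Flux --local es-crawler.flux

In local mode, Flux also accepts a --sleep <milliseconds> option if you want a single run to last longer than the default 60 seconds.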

After a successful run, you can check the status and content indexes for the actual results. You may notice that the statuses of the URLs injected earlier have changed to FETCHED, which means the content at those URLs has been successfully crawled and indexed to the ES server by StormCrawler.

Snippet of the content index is given below.
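
You can retrieve a similar view yourself by querying the content index; the hits.total.value field in the response is the document count discussed below:

    curl "http://localhost:9200/content/_search?pretty"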

As you can see, during the initial crawl StormCrawler was able to index 7 documents from the http://www.mit.edu/ website. If you run the crawler again, you may notice the total hits count increase.


Conclusion

In this article, I have discussed how to set up a StormCrawler project with Elasticsearch configurations. I have also explained how the ES_IndexInit bash script helps us create the relevant indexes in Elasticsearch and what changes need to be made to the script in order to store the content. I have added the sample project to my GitHub repository for your reference.

In my next article, I discuss how I managed to run the injector and crawler continuously in local mode without a Storm cluster, and how AWS SQS can be used to feed URLs to the injector.

References

  1. http://stormcrawler.net/
  2. https://www.elastic.co/
