Scalable Web Crawling using StormCrawler and Apache Solr

Ali Aminzadeh
Sep 12, 2019


In this post, I am going to build a web crawler that scrapes a few websites and stores their content in Apache Solr. But before I get into the code, here's a brief intro to StormCrawler itself.

What is StormCrawler?

From stormcrawler.net:

StormCrawler is an open source SDK for building distributed web crawlers based on Apache Storm. The project is under Apache license v2 and consists of a collection of reusable resources and components, written mostly in Java.

The aim of StormCrawler is to help build web crawlers that are:

scalable

resilient

low latency

easy to extend

polite yet efficient

Installing Requirements

Before getting to StormCrawler itself, we need to install the dependencies it builds on. If you have already installed ZooKeeper 3.4.14 and Storm 1.2.2, you can skip the following steps and go straight to creating a StormCrawler project.

Note: I’m installing these dependencies on Ubuntu 18.04.3 LTS.

Step 1 — Verify Java:

Use the following command to check whether you have Java already installed on your system.

$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

If Java is already there, you will see its version number; otherwise, download and install a JDK first.
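The output above shows OpenJDK 8, which is what Storm 1.2.2 is typically run on. On Ubuntu 18.04 you can install it from the default repositories, for example:

$ sudo apt update
$ sudo apt install openjdk-8-jdk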

Step 2 — Install Apache Solr:

Now download the required Solr version from its official site or mirrors, or simply use the following command to download Apache Solr 8.6.2.

$ sudo su
# cd /opt
# wget https://downloads.apache.org/lucene/solr/8.6.2/solr-8.6.2.tgz

Now extract the Apache Solr service installer shell script from the downloaded archive and run the installer using the following commands:

# tar xzf solr-8.6.2.tgz solr-8.6.2/bin/install_solr_service.sh --strip-components=2
# bash ./install_solr_service.sh solr-8.6.2.tgz
$ sudo service solr start

To make sure that Apache Solr is installed and started, open http://localhost:8983 in your browser; you should see the Solr Web Admin Panel.

Solr Web Admin Panel
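If you prefer the command line, you can also check that Solr is up by hitting its system info endpoint with curl (assuming the default port 8983):

$ curl "http://localhost:8983/solr/admin/info/system?wt=json"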

Step 3 — Install Apache ZooKeeper:

To install the ZooKeeper framework on your machine, download ZooKeeper from http://zookeeper.apache.org/releases.html. In this project we are using ZooKeeper 3.4.14 (zookeeper-3.4.14.tar.gz).

Download and extract the tar file using the following commands:

$ cd /opt
$ sudo su
# wget http://mirror.23media.de/apache/zookeeper/zookeeper-3.4.14/zookeeper-3.4.14.tar.gz
# tar -zxf zookeeper-3.4.14.tar.gz
# cd zookeeper-3.4.14
# mkdir data

Open the configuration file conf/zoo.cfg (for example with nano conf/zoo.cfg) and set the following parameters as a starting point:

# nano conf/zoo.cfg
tickTime=2000
dataDir=/opt/zookeeper-3.4.14/data
clientPort=2181
initLimit=5
syncLimit=2

Once the configuration file has been saved, you can start the ZooKeeper server with the following command:

# bin/zkServer.sh start

After executing this command, you will get a response as follows:

ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.14/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
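To confirm that ZooKeeper is really running, you can also ask for its status; in a single-node setup it should report that it is running in standalone mode:

# bin/zkServer.sh status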

Step 4 — Install Apache Storm:

To install the Storm framework on your machine, download Storm from http://storm.apache.org/downloads.html. In this project we are using Storm 1.2.2 (apache-storm-1.2.2.tar.gz).

Download and extract the tar file using the following commands:

$ wget https://archive.apache.org/dist/storm/apache-storm-1.2.2/apache-storm-1.2.2.tar.gz
$ tar -zxf apache-storm-1.2.2.tar.gz
$ cd apache-storm-1.2.2
$ mkdir data

The current release of Storm contains a file at conf/storm.yaml that configures Storm daemons. Add the following information to that file.

$ nano conf/storm.yaml

storm.zookeeper.servers:
  - "localhost"
storm.local.dir: "/path/to/apache-storm-1.2.2/data"
nimbus.host: "localhost"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703

Finally, add Storm's bin directory to your PATH:

$ STORM_HOME="/path/to/apache-storm-1.2.2"
$ export PATH=$PATH:$STORM_HOME/bin
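You can check that the storm command is now on your PATH:

$ storm version

Note that this tutorial runs the topology in local mode, so there is no need to start the Nimbus and Supervisor daemons; on a real cluster you would start them with storm nimbus, storm supervisor and storm ui, but that is optional here.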

Creating a StormCrawler Project

We are going to run a StormCrawler project on a single node. To do so, first clone the project from its GitHub repository into any directory you want:

$ git clone https://github.com/DigitalPebble/storm-crawler.git

NOTE: These instructions assume that you have Apache Maven installed. If not, install it and check the version:

$ sudo apt install maven
$ mvn -v

We can now generate a brand new StormCrawler-based project using:

$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.14

After downloading all the needed packages, Maven will ask you for a groupId; you can name it whatever you want, but here we use com.dipe. For the artifactId we use solrDemo, and for 'version' (1.0-SNAPSHOT) and 'package' (com.dipe) we just hit enter to accept the defaults.
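If you prefer to skip the interactive prompts, the same values can be passed directly on the command line in Maven's batch mode, for example:

$ mvn archetype:generate -B -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.14 -DgroupId=com.dipe -DartifactId=solrDemo -Dversion=1.0-SNAPSHOT -Dpackage=com.dipe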

After creating the project, this is what is inside the solrDemo directory:

crawler-conf.yaml crawler.flux pom.xml README.md src

Configure Your Crawler

First of all, we should configure Solr as our backend for storing data. To do so, add the following lines to the <dependencies> section of pom.xml:

<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-solr</artifactId>
  <version>${stormcrawler.version}</version>
</dependency>

Then copy solr-conf.yaml from the repository you cloned earlier into the project directory. This is a command you can use to do that:

$ cp storm-crawler/external/solr/solr-conf.yaml solrDemo/

Now we restart Solr using the StormCrawler Solr configuration (the cores shipped with the storm-crawler repository):

$ sudo service solr stop
$ sudo /opt/solr-8.6.2/bin/solr start -force -s your/path/to/storm-crawler/external/solr/cores

If Solr started successfully with those configs, you will see the following cores in the Solr Web Admin Panel:

docs
metrics
status
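You can also list the loaded cores from the command line with Solr's CoreAdmin API:

$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"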

Then add the following lines to the includes section of crawler.flux in the solrDemo directory:

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "solr-conf.yaml"
    override: true

And the bolts section should look like this:

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "feed"
    className: "com.digitalpebble.stormcrawler.bolt.FeedParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "index"
    className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
    parallelism: 1

You can also list the websites you want to crawl in the spouts section of the same file:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - ["http://www.lequipe.fr/", "http://www.lemonde.fr/", "http://www.bbc.co.uk/", "http://storm.apache.org/", "http://digitalpebble.com/"]

In crawler-conf.yaml set the following config, which removes the limit on the size of the fetched content so that pages are not truncated:

http.content.limit: -1

If you want to crawl the URLs of images, videos, or other media, comment out or remove the following line in the /solrDemo/src/main/resources/default-regex-filters.txt file; otherwise, skip this step.

-(?i)\.(apk|deb|cab|iso|gif|jpg|png|svg|ico|css|sit|eps|wmf|rar|tar|jar|zip|gz|bz2|rpm|tgz|mov|exe|jpeg|jpe|bmp|js|mpg|mp3|mp4|m4a|ogv|kml|wmv|swf|flv|mkv|m4v|webm|ra|wma|wav|avi|xspf|m3u)(\?|&|$)

Then add the following entry to the /solrDemo/src/main/resources/parsefilters.json file, which tells the parser to also follow the src attribute of IMG tags:

{
  "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
  "name": "LinkParseFilter",
  "params": {
    "pattern": "//IMG/@src"
  }
}
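For context, this entry goes inside the ParseFilters array that already exists in the parsefilters.json generated by the archetype; with the other filters that ship with the archetype omitted, the file would look roughly like this:

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
        "pattern": "//IMG/@src"
      }
    }
  ]
}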

You can set the crawl depth in the /solrDemo/src/main/resources/urlfilters.json file. I set maxDepth to 2, but you can change it to any value you want. Make sure this file looks like this:

{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
      "name": "BasicURLFilter",
      "params": {
        "maxPathRepetition": 8,
        "maxLength": 8192
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
      "name": "MaxDepthFilter",
      "params": {
        "maxDepth": 2
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
      "name": "BasicURLNormalizer",
      "params": {
        "removeAnchorPart": true,
        "unmangleQueryString": true,
        "checkValidURI": true,
        "removeHashes": false
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": false,
        "ignoreOutsideDomain": false
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer",
      "name": "RegexURLNormalizer",
      "params": {
        "regexNormalizerFile": "default-regex-normalizers.xml"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter",
      "name": "RegexURLFilter",
      "params": {
        "regexFilterFile": "default-regex-filters.txt"
      }
    }
  ]
}

Now everything is in place and we can build the project using the following command:

$ mvn clean package

When the build is done, you can run the crawler with the following command. The --sleep option keeps the local topology running for the given number of milliseconds (86400000 ms is 24 hours):

$ storm jar target/solrDemo-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 86400000

If you see log output like the following, the crawl is working:

21629 [FetcherThread #0] INFO  c.d.s.b.FetcherBolt - [Fetcher #4] Fetched http://storm.apache.org/ with status 200 in msec 128
21641 [Thread-28-parse-executor[6 6]] INFO c.d.s.b.JSoupParserBolt - Parsing : starting http://storm.apache.org/
21912 [FetcherThread #1] INFO c.d.s.b.FetcherBolt - [Fetcher #4] Fetched http://www.bbc.co.uk/ with status 301 in msec 138
22308 [Thread-28-parse-executor[6 6]] INFO c.d.s.b.JSoupParserBolt - Parsed http://storm.apache.org/ in 491 msec
22513 [FetcherThread #49] INFO c.d.s.b.FetcherBolt - [Fetcher #4] Fetched http://digitalpebble.com/ with status 200 in msec 237
23411 [Thread-28-parse-executor[6 6]] INFO c.d.s.b.JSoupParserBolt - Parsing : starting http://digitalpebble.com/
23518 [Thread-28-parse-executor[6 6]] INFO c.d.s.b.JSoupParserBolt - Parsed http://digitalpebble.com/ in 74 msec
23796 [FetcherThread #4] INFO c.d.s.b.FetcherBolt - [Fetcher #4] Fetched http://www.lemonde.fr/ with status 503 in msec 263
25112 [FetcherThread #8] INFO c.d.s.b.FetcherBolt - [Fetcher #4] Fetched http://www.lequipe.fr/ with status 301 in msec 776

Now you can see the status of the crawl in the status core of the Solr Web Admin Panel.
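If you prefer the command line, you can also query the status core directly, for example:

$ curl "http://localhost:8983/solr/status/select?q=*:*&rows=10"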

As you can see, the URLs on those websites have been discovered. Now it's time to fetch their content. To do so, we have to change crawler.flux as shown below.

We remove this:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - ["http://www.lequipe.fr/", "http://www.lemonde.fr/", "http://www.bbc.co.uk/", "http://storm.apache.org/", "http://digitalpebble.com/"]

And replace it with this:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
    parallelism: 1

Now we run the previous command again:

$ storm jar target/solrDemo-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 86400000

After the crawl finishes successfully, we can see the content of each URL in the docs core of the Solr Web Admin Panel.
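Again, the same can be checked from the command line by querying the docs core; the exact fields returned depend on the schema in the cores directory we pointed Solr to:

$ curl "http://localhost:8983/solr/docs/select?q=*:*&rows=5"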

I hope you enjoyed this tutorial and found it helpful. If you have any questions or thoughts on it, feel free to reach out in the comments below.

Resources

GitHub repository for this project
