Build your own Search Engine

A guide to installing Nutch integrated with Solr on Linux


Hi everyone! A few months ago I started working on a Q&A-based search engine project. The first task was to gather as much data as possible, which was a real challenge for me. I started looking into crawlers and came across "nutch". So what is Nutch? The name sounded strange, and I had never heard of it.

Apache Nutch provides a solid web crawling solution that can easily snap in to any backend data environment. That makes it a really nice fit for seeding content to and from the Infinite platform, which I will introduce in a later post. In this post I will cover how to get Nutch running on your local system.

# This process consists of the following steps:
Installing Solr
Installing Nutch
Configuring Solr
Configuring Nutch
Crawling your site
Indexing our crawl DB with Solr
Search the crawled content in Solr
# Pre-requisite: Ubuntu 10.04 server or later should be installed before starting the installation process. Please log in as root by typing the following command. (A similar setup can be followed for Fedora as well.)
$ sudo su -
# Installing Solr
Luckily, Solr 1.4.1 is available through APT! Install it with the following command (on Fedora you have to install it manually):
$ apt-get install solr-common solr-tomcat tomcat6
# Now follow these steps to set up the Tomcat manager, which will be very useful later:
$ sudo apt-get install tomcat6-admin

# Now edit tomcat-users.xml and change this:

$ vi /var/lib/tomcat6/conf/tomcat-users.xml
<tomcat-users>
  <!--
  <role rolename="tomcat"/>
  <role rolename="role1"/>
  <user username="tomcat" password="tomcat" roles="tomcat"/>
  <user username="both" password="tomcat" roles="tomcat,role1"/>
  <user username="role1" password="tomcat" roles="role1"/>
  -->
</tomcat-users>
# To this:
<tomcat-users>
  <role rolename="tomcat"/>
  <role rolename="role1"/>
  <role rolename="manager"/>
  <user username="tomcat" password="tomcat" roles="tomcat,manager"/>
  <user username="both" password="tomcat" roles="tomcat,role1"/>
  <user username="role1" password="tomcat" roles="role1"/>
</tomcat-users>
# Now, restart Tomcat:
$ sudo service tomcat6 restart
# You can access the Tomcat manager at http://localhost:8080/manager/html (username: tomcat, password: tomcat).
# If everything went well, you will see the Tomcat manager page.
# Installing Nutch
# Go to a suitable working directory, then download and unpack Nutch. I will use the /tmp folder.
$ cd /tmp
# Get the binary distribution of Nutch (not the source distribution). Here I will download Nutch 1.7:
$ wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
$ cd /usr/share
$ tar zxf /tmp/apache-nutch-1.7-bin.tar.gz
$ ln -s apache-nutch-1.7 nutch
# (the tarball extracts to a directory named apache-nutch-1.7; adjust the symlink if your directory name differs)
# Configuring Solr
# For the sake of simplicity we are going to use the example configuration of Solr as a base.
# Back up the original file:
$ mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig
# And replace the Solr schema with the one provided by Nutch
$ cp /usr/share/nutch/conf/schema.xml /etc/solr/conf/schema.xml
# Now, we need to configure Solr to create snippets for search results
$ vi /etc/solr/conf/schema.xml
# Change the following line:
<field name="content" type="text" stored="false" indexed="true"/>
To this:
<field name="content" type="text" stored="true" indexed="true"/>
# Create a new dismax request handler to enable relevancy tweaks.
# Back up the original file:
$ cp /etc/solr/conf/solrconfig.xml /etc/solr/conf/solrconfig.xml.orig
# Add the following fragment to _/etc/solr/conf/solrconfig.xml_:
$ vi /etc/solr/conf/solrconfig.xml
<requestHandler name="/nutch" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <str name="tie">0.01</str>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <str name="ps">100</str>
    <str name="hl">true</str>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
# Now, restart Tomcat:
$ sudo service tomcat6 restart
# Configuring Nutch
# Go into the nutch directory and do all the work from there:
$ cd /usr/share/nutch
# Edit conf/nutch-site.xml and add the following between the <configuration> tags:
$ vi conf/nutch-site.xml
<property>
  <name>http.robots.agents</name>
  <value>nutch-solr-integration-test,*</value>
  <description></description>
</property>
<property>
  <name>http.agent.name</name>
  <value>nutch-solr-integration-test</value>
  <description>Viterbi Bot</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Viterbi Web Crawler using Nutch 1.7</value>
  <description></description>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://viterbi.usc.edu/</value>
  <description></description>
</property>
<property>
  <name>http.agent.email</name>
  <value>YOUR EMAIL ADDRESS HERE</value>
  <description></description>
</property>
<property>
  <name>http.agent.version</name>
  <value></value>
  <description></description>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
# You need to ensure that the crawler does not leave your domain; otherwise you would end up crawling the entire Internet. So add your domain to conf/regex-urlfilter.txt:
$ vi conf/regex-urlfilter.txt
# allow urls in viterbi.usc.edu domain
+^http://([a-z0-9\-A-Z]*\.)*viterbi\.usc\.edu/([a-z0-9\-A-Z]*\/)*

# deny anything else
-.
**Important:** Make sure that you also edit this:
# accept anything else
+.
To this:
# accept anything else
#+.
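You can sanity-check your allow rule from the shell before crawling. The snippet below is my own rough grep -E approximation of the Nutch Java regex above (dots escaped, character class rewritten in POSIX style), not something Nutch itself provides:

```shell
# URLs that match this pattern would pass the filter; others are dropped.
pattern='^http://([a-zA-Z0-9-]*\.)*viterbi\.usc\.edu/'
echo "http://viterbi.usc.edu/about/" | grep -E "$pattern"     # prints the URL (accepted)
echo "http://www.example.com/" | grep -E "$pattern" || echo "filtered out"   # prints "filtered out"
```

This is only an approximation; the authoritative behavior is whatever Nutch's regex-urlfilter plugin does with the rules in regex-urlfilter.txt.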
# Now, we need to instruct the crawler where to start crawling, so create a seed list:
$ mkdir urls
$ echo "http://viterbi.usc.edu/" > urls/seed.txt
**Important:**
You can add multiple seed URLs here, one per line. Make sure you make the corresponding changes in regex-urlfilter.txt, as discussed above.
# Crawling your site
# Let's start crawling!
# Start by injecting the seed url(s) to the nutch crawldb:
$ bin/nutch inject crawl/crawldb urls
# Next, generate fetch list:
$ bin/nutch generate crawl/crawldb crawl/segments
The above command generates a new segment directory under /usr/share/nutch/crawl/segments that contains the URLs to be fetched. All following commands require the latest segment directory as their main parameter, so we'll store it in an environment variable:
$ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
# Launch the crawler!
$ bin/nutch fetch $SEGMENT -noParsing
# And parse the fetched content:
$ bin/nutch parse $SEGMENT
# Now we need to update the crawl database to ensure that for all future crawls, Nutch only checks the already crawled pages, and only fetches new and changed pages.
$ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
# Create a link database:
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
**Important:**
The more rounds of the crawling steps above you run, the deeper your crawl will go!
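The generate/fetch/parse/updatedb round above can be wrapped in a small POSIX-sh helper so you can run several rounds in one go. This is a sketch: the `crawl_rounds` name and its arguments are my own convention, and it assumes you run it from /usr/share/nutch after the inject step:

```shell
# crawl_rounds NUTCH DEPTH: repeat the generate/fetch/parse/updatedb round
# DEPTH times, then rebuild the link database. NUTCH is the path to the
# nutch launcher (normally bin/nutch).
crawl_rounds() {
  nutch=$1
  depth=$2
  i=1
  while [ "$i" -le "$depth" ]; do
    "$nutch" generate crawl/crawldb crawl/segments
    # pick the newest segment, exactly as in the manual steps above
    seg=crawl/segments/$(ls -tr crawl/segments | tail -1)
    "$nutch" fetch "$seg" -noParsing
    "$nutch" parse "$seg"
    "$nutch" updatedb crawl/crawldb "$seg" -filter -normalize
    i=$((i + 1))
  done
  "$nutch" invertlinks crawl/linkdb -dir crawl/segments
}
```

For example, `crawl_rounds bin/nutch 3` runs three rounds for a deeper crawl.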
# Indexing our crawl DB with Solr
$ bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
# Search the crawled content in Solr
# Now the indexed content is available through Solr. You can try executing searches from the Solr admin UI at
http://127.0.0.1:8080/solr/admin
or directly with a URL like:
http://127.0.0.1:8080/solr/select/?q=usc&version=2.2&start=0&rows=10&indent=on&wt=json
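If you want to script ad-hoc searches, you can build the select URL in shell and fetch it with curl. A minimal sketch (the `solr_query_url` helper name is mine; q, start, rows, and wt are standard Solr request parameters):

```shell
# solr_query_url BASE QUERY ROWS: build a Solr select URL for use with curl.
solr_query_url() {
  echo "$1/select/?q=$2&start=0&rows=$3&wt=json&indent=on"
}

# then fetch results with, e.g.:
#   curl "$(solr_query_url http://127.0.0.1:8080/solr usc 10)"
```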
# Finally, if it is not working, check whether the Solr server and Nutch are using the same solrj jar:
$ cd /usr/share/nutch/lib
$ ls
# Check for a jar named apache-solr-solrj-1.4.1.jar or solr-solrj-1.4.1.jar.
# The same jar should be present in the Solr libraries:
$ cd /usr/share/solr/WEB-INF/lib
$ ls
# If not, copy the jar from the Solr lib to the Nutch lib and remove the other version.
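The check-and-copy above can be scripted. This is a sketch under my own conventions: the `sync_solrj` helper name and its argument order are hypothetical, and the exact jar file names vary with your Solr version:

```shell
# sync_solrj SOLR_LIB NUTCH_LIB: make Nutch use the same solrj jar as the
# Solr server by copying Solr's solrj jar into Nutch's lib directory and
# removing any differently named solrj versions there.
sync_solrj() {
  solr_lib=$1
  nutch_lib=$2
  jar=$(ls "$solr_lib" | grep -E '(apache-)?solr-solrj-.*\.jar' | head -1)
  [ -n "$jar" ] || { echo "no solrj jar found in $solr_lib" >&2; return 1; }
  for f in "$nutch_lib"/*solr-solrj-*.jar; do
    [ -e "$f" ] || continue                       # glob matched nothing
    [ "$(basename "$f")" = "$jar" ] || rm "$f"    # drop other versions
  done
  cp "$solr_lib/$jar" "$nutch_lib/"
}
```

For example, `sync_solrj /usr/share/solr/WEB-INF/lib /usr/share/nutch/lib` would align the two installs from this guide.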

After that, it should work.

Cheers !
