
Your first steps to building a web crawler: Integrating Nutch with Solr.


Special thanks to Ridwan Naibi Suleiman for exposing me to Nutch and Solr, and for helping me debug my first setup of them. He also encouraged me to start writing on Medium, and he is the editor of this blog.

(Cover image from pixels.com)

Despite what other tutorials for both simple and complex setups suggest, setting up Nutch on a Linux machine is not straightforward, even when you follow the official tutorial. Getting Solr and Nutch working together on my machine was a nightmare, but I am going to simplify it for you here.

Note: You need to have Java installed on your machine, as both Nutch and Solr are dependent on it.
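
If you are not sure whether Java is available, a quick check from the terminal will tell you (any reasonably recent JDK should do):

java -version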

Firstly, let us get Solr installed

Head over to the Solr website and download the Solr binary release:

(Screenshot: the Solr download page)

At the time of writing this tutorial, Solr is at version 8.6.0; however, my current version of Solr is 8.5.2. This tutorial should work for both versions.

Once you have that downloaded, extract it into any folder of your choice. I like to do this in my home folder.
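
Assuming the archive you downloaded is sitting in your Downloads folder as solr-8.5.2.tgz (adjust the path and version to match your download), the extraction looks like this:

cd ~
tar -xzf ~/Downloads/solr-8.5.2.tgz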

So, you should end up with this:

/home/username/solr-8.5.2

To start Solr, navigate to the Solr home directory and run bin/solr start:

cd ~/solr-8.5.2
bin/solr start

Open your browser and access the Solr web app at localhost:8983/solr.
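
If you prefer the terminal, you can also confirm that Solr is up by running its built-in status command from the Solr home directory:

bin/solr status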

Next, we want to get Nutch installed.

The first step in installing Nutch follows the same approach as with Solr. Head over to the Apache Nutch home page and grab yourself the Nutch package.

You will notice that Apache maintains two versions of Nutch: the 1.x version and the 2.x version. The 2.x version seems to be the future of Nutch going forward and offers more flexibility when it comes to which database to use. However, the 1.x version is the more mature and established one, which is partly why I am using it. And by the way, here is an excerpt from the official website:

It is assumed that you have a working knowledge of configuring Nutch 1.X, as currently configuration in 2.X is more complex. It is important to take this into consideration before progressing any further.

So, just grab the Nutch v1.17 source package. Extract the package in your home directory, so you have:

/home/username/apache-nutch-1.17
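
In case you are wondering how to get there: assuming the source archive landed in your Downloads folder as apache-nutch-1.17-src.tar.gz (the exact file name may differ slightly depending on the mirror), the extraction would be:

cd ~
tar -xzf ~/Downloads/apache-nutch-1.17-src.tar.gz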

Because this is the source package, let us compile it:

cd ~/apache-nutch-1.17
ant

Note: you need to have ant installed. If you don’t, run sudo apt-get install ant to install it.

The compilation process will take several minutes to finish because all dependencies need to be downloaded and compiled accordingly.
When everything is done, you’ll find a runtime/local folder created for you in the Nutch directory. This will be your new Nutch home folder.

For simplicity,

$NUTCH_HOME will now refer to apache-nutch-1.17/runtime/local,
and $SOLR_HOME will refer to the Solr folder we extracted earlier.
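
If you want to use these names as actual shell variables, so the commands later in this tutorial can be pasted as-is, you can export them in your current shell (adjust the paths to wherever you extracted things):

export NUTCH_HOME=~/apache-nutch-1.17/runtime/local
export SOLR_HOME=~/solr-8.5.2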

To test if Nutch was installed correctly, navigate to $NUTCH_HOME and run bin/nutch
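
That is:

cd $NUTCH_HOME
bin/nutch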

You should get a list of commands that Nutch supports.

Since we have both Solr and Nutch installed, it’s now time to get them working together.

Let’s start with Solr. Firstly, we need to create resources that will be used by our Solr core.

Okay, what is a core? A Solr core is a standalone instance of a Lucene index. What this means is that, to use Solr, you need to create something called a core, which will run in order to perform your operations. This core needs its own resources: configuration, schema, and the rest. You can create as many cores as you need on a single machine. This makes sense because different data sources might require different configurations. In our case, Nutch structures its data in a way that might differ from another data source, so specific cores should have resources unique to them.

All resources for Solr cores are to be placed in the $SOLR_HOME/server/solr/configsets directory.
So we are going to create a folder in this directory to place our resources for our new core:

mkdir -p $SOLR_HOME/server/solr/configsets/nutch/

So we now have an empty nutch folder waiting for configuration files. Solr comes with default configuration files that work for almost all cores you will be creating. These files are located in the $SOLR_HOME/server/solr/configsets/_default directory. So we will copy these files to our new nutch folder:

cp -r $SOLR_HOME/server/solr/configsets/_default/* $SOLR_HOME/server/solr/configsets/nutch/

This is the time to define Nutch specific configurations.
Firstly, we need to tell Solr how our data is to be structured. We also define what types our fields are going to be and how Solr should handle our query values. This is all done in a schema.xml file. Check $NUTCH_HOME/conf for this file or download the most recent one here.

Copy this schema.xml file to the $SOLR_HOME/server/solr/configsets/nutch/conf directory.
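
In shell terms, assuming you are using the schema.xml that ships in $NUTCH_HOME/conf, that copy is simply:

cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/server/solr/configsets/nutch/conf/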

So is Solr aware of how to handle our Nutch data?
Not quite. By default, Solr will ignore your schema configuration in favor of a managed-schema file if one is found. See it as the favorite pet we need to get rid of in order to get our own pet considered. So let’s do that:

rm $SOLR_HOME/server/solr/configsets/nutch/conf/managed-schema

Now that we have it out of the way, we can start our Solr server.

Navigate to $SOLR_HOME and run:

bin/solr start

If you already have it running, you’ll need to shut it down and start it again; bin/solr stop does that.

We now have resources set up for a core that does not exist. Let’s fix that by creating our nutch core:

$SOLR_HOME/bin/solr create -c nutch -d $SOLR_HOME/server/solr/configsets/nutch/conf/

The command above creates a core called nutch and tells Solr where to find the configurations for the nutch core.

You can check things out in your browser at localhost:8983/solr. You will see your newly created core sitting happily in the core section.

The next question is, does Nutch know about Solr at all?
Well, that has been handled for us in the index-writers.xml file in the $NUTCH_HOME/conf directory.

An index writer is a component used to send Nutch’s crawled data to an external server. It is important to note that Solr and Nutch run independently, and the Solr server is not the only server you could push Nutch data to. index-writers.xml handles all of that for us.

You will notice that there are other index writers aside from the Solr one. These are included by default and will simply be ignored, since we don’t have such servers running.
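
As far as I can tell, the Solr writer in that file points at http://localhost:8983/solr/nutch by default, which conveniently matches the nutch core we created earlier. You can eyeball it with a quick grep (just a sanity check; the layout of the file may differ between Nutch versions):

grep -n "localhost:8983" $NUTCH_HOME/conf/index-writers.xml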

So What’s next? Crawl some websites.
But before we do that, we need to give our bot/crawler a name. Let’s call it Nutch Crawler. We can do this in the $NUTCH_HOME/conf/nutch-site.xml file. Open that file and add these lines:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Crawler</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>datalake.ng at gmail d</value>
  </property>
</configuration>

With this in place, any website our bot crawls will know what to call it. You have to set your crawler's name, and a contact email is a nice-to-have in case your crawler is abnormally bogging websites down.

Next, we need to make a list of URLs we want our bot to crawl. This list should be placed in the $NUTCH_HOME/urls/seed.txt file, with each URL on its own line.
So we’ll create the urls folder in our $NUTCH_HOME folder:

mkdir -p $NUTCH_HOME/urls

We’ll also create that `seed.txt` file in the folder we just created:

touch $NUTCH_HOME/urls/seed.txt

Once done, you can insert as many URLs as you want to crawl into the file. For starters, let’s crawl the Nutch official website, http://nutch.apache.org. So our file is going to contain just that one URL.
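
One way to put it there from the terminal:

echo "http://nutch.apache.org" > $NUTCH_HOME/urls/seed.txt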

One catch though: if we crawl this URL, we don’t just end up with content from this URL alone. We also get data from all the external links that the crawler finds on the Nutch website. For instance, if there is a https://someurl.com written on any page of the Nutch official website, we are also going to get the contents of https://someurl.com returned as well. So we don’t only crawl our initial seed list, but also its outbound links.

You can control which links get crawled in $NUTCH_HOME/conf/regex-urlfilter.txt. In this file, you give a regular expression for the kind of URLs that should be crawled; all others are ignored.
So replace:


# accept anything else
+.

with this:

+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/

What we are saying here is that any URL whose host is not nutch.apache.org (or one of its subdomains) should be totally ignored. That way, https://someurl.com will not be entertained even if it appears on the Nutch website.

Before we go on to crawl, let’s understand how the Nutch crawling process works. This way, you get to make sense of every command you type.

The first step is to inject your URLs into the crawldb. The crawldb is the database that holds all known links. It is the storage for all our links crawled or not.

You might ask, don’t we know all our links beforehand? Like, didn’t we specify them in seed.txt?
Well, not quite. The list of URLs in seed.txt is just our initial list to begin with. When we crawl these seed.txt URLs, if we encounter other outbound links, we still add them to our list of known links [read that again]. And if these new links have other outbound links, we add those to our list as well, and the cycle continues. So the crawldb stores all our known links, and they keep adding up with every cycle. The crawldb doesn’t only contain the links but also their metadata, like the fetch status of each link.
With that, a fetch list is a list (of course) of links that are to be crawled next. This list depends on the content of our crawldb, so to get an accurate fetch list, we must update the crawldb every time we perform a fetch job.
But if the crawldb contains only links and their metadata, where are all the other contents, like the HTML and images, stored? Nutch stores these as a segment in the segments directory, using the current time as the name.

  • So the crawldb contains all known links, crawled or not.
  • A segment contains all returned content, including the links found in it.
  • A fetch list contains the links to be crawled next.

Now let us see that in action.
I’ll assume you are in your $NUTCH_HOME directory to run these commands.

Run the following command to inject the URLs (in our case, http://nutch.apache.org) into the crawldb:

bin/nutch inject crawl/crawldb urls

Next, we need to generate a fetch list from the crawldb and save it in the crawl/segments directory:

bin/nutch generate crawl/crawldb crawl/segments

The generate command uses the current timestamp as the name of the segment.

Now if we want to interact with our segment, which we will, we would have to go into the segments folder, figure out which of the timestamped folders is our segment, and copy its name. To make life easy for us, let’s save the path of the segment we just created in a shell variable called s1:

s1=`ls -d crawl/segments/2* | tail -1`

The command lists all directories in crawl/segments, and we select the last one (the most recent) and assign it to s1, because that will be our current segment.

You can print s1 to see what it contains: echo $s1

Now it is time to do the real crawling, and it might take a while to finish:

bin/nutch fetch $s1

Then parse the returned content to strip the unnecessary stuff out of it:

bin/nutch parse $s1

Remember, every fetch job returns a set of new links. So we need to update our crawldb with the new links:

bin/nutch updatedb crawl/crawldb $s1
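
If you are curious about what the crawldb looks like at this point, Nutch can print some statistics about it (a quick sanity check; the exact fields in the output may vary between versions):

bin/nutch readdb crawl/crawldb -stats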

By now, our crawldb has a new set of links and also updated old links. It would be wise to do some crawling again. But this time, since we have a bunch of links to handle, we can specify the number of important links we want to crawl instead of crawling all of them. Somehow, Nutch can tell which links are more important, because it has a scoring system for our links.

So the following commands repeat the crawling process:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

The commands above are almost the same as our first crawl, except for the -topN 1000 part. This ensures that only the top 1000 scoring links are crawled. The s2 variable holds our new segment.

An important aspect that also affects how many links we crawl is the depth of our crawl job. The depth ensures that the crawl cycle is run a number of times rather than the default single pass. So -depth 3 will ensure we run the crawl cycle three times in a single run.
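
As an aside, Nutch also ships a helper script, bin/crawl, that chains the generate/fetch/parse/updatedb steps (and, optionally, indexing) for a given number of rounds, so you don’t have to type each command yourself. If I remember its usage correctly, a three-round crawl that also indexes into Solr looks roughly like the line below, but run the script without arguments first to see the exact usage message for your version:

bin/crawl -i -s urls crawl 3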

Now our crawldb has more new links, right? We can perform yet another crawl job to get more data:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3

This time around we have a new segment whose directory information is stored in the s3 variable.
And as you would guess, this can go on and on and on, like a recursion. You get to decide when you have enough data. If we were to continue, we would create a new segment and save its directory information in, say, s4. But we will stop here.

Inverting links

Now that we have the crawl/segments directory filled with the data we crawled earlier, we need to invert the links in these segments. The reason is that these segments contain a great many links, and we need a way to tell which links matter for a given page: the inbound links pointing to it, not the outbound ones, right?
The command below does the inverting and pushes these links to the linkdb. So the linkdb holds our inverted links:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
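
If you want to peek at what ended up in the linkdb, Nutch can dump it to plain text (the output directory name here is just an example; pick any path you like):

bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump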

Now everything is said and done. The last step is to index our data into the Solr server in order to perform our complex queries:

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s3 -filter -normalize

This passes the crawldb directory, the linkdb directory, and the segment directory, along with options to filter out links that were rejected by the URL filters and to normalize URLs before indexing. There is also another option that deletes documents that are gone (no longer found) as well as duplicates. With that option included, the command will look like this:

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s3 -filter -normalize -deleteGone 

Once we run this command, our running Solr server will get our data ready for querying.
Now go to your browser and visit localhost:8983/solr to access the Solr web app, then select the nutch core in the Core Selector:

(Screenshot: the nutch core selected in the Solr web app)
  • After selecting the nutch core, click on query.
  • Then enter your search term in the q field and click on Execute Query.

Oops!
You will be greeted with an error. The thing is, we are trying to make a query, but Solr doesn’t know which field to search in. We have to specify which field we want Solr to look at. This is done by setting the `fl` field.

(Screenshot: setting the Solr fl field)

We can also set the default df field, so that when no field is specified, Solr falls back to the default one.
But what if we want to make a really general search (i.e. we want Solr to look at all fields)? A general search is supposed to work out of the box; however, something is preventing this behavior.

There is a solrconfig.xml file that we can use to define defaults for our queries, like the default field or what format we want our response in. This file also contains these lines:

<initParams path="/update/**,/query,/select,/spell">
  <lst name="defaults">
    <str name="df">_text_</str>
  </lst>
</initParams>

Replace _text_ with text and that should fix the general search issue:

<initParams path="/update/**,/query,/select,/spell">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</initParams>
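
After editing solrconfig.xml, reload the nutch core (or simply restart Solr) so the change takes effect. You can then sanity-check the index straight from the terminal with a plain HTTP query against the core; the field and query term here are just examples:

curl "http://localhost:8983/solr/nutch/select?q=title:nutch&fl=url,title"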

Written by Stephen Kastona

I learn, unlearn and relearn. I write code. I’m a Christian. I love family. I love to grow.
