Galago — the secret documentation

Galago is an open source text retrieval engine (aka search engine). It is designed for you to experiment with different components and for you to design your own retrieval model out of model-components.

You can download it from here: lemurproject.org/galago.php

If you find bugs, please report them here: http://sourceforge.net/p/lemur/bugs/

To use Galago as a dependency in your code, add the following Maven bindings to your pom.xml file:

<dependency>
  <groupId>org.lemurproject.galago</groupId>
  <artifactId>core</artifactId>
  <version>3.7</version>
</dependency>
<repositories>
  <repository>
    <id>edu.umass.ciir.releases</id>
    <name>CIIR Nexus Releases</name>
    <url>http://scm-ciir.cs.umass.edu:8080/nexus/content/repositories/releases/</url>
  </repository>
</repositories>

You may want to check for new versions.

Software installation

You cannot use a Windows operating system. Please make sure you have an Ubuntu or Mac system, even if that means running Ubuntu in a virtual machine on Windows.

Oracle Java 7 JDK

http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

Build Tool and Dependency Management: Maven

You can download it from here: http://maven.apache.org/download.cgi

Some OS X versions come with Maven pre-installed in /usr/share/maven

Or install it on Ubuntu/Debian with

sudo apt-get install maven

On Mac you can optionally upgrade Maven with

sudo port install maven3

Version Control System: Git and Mercurial (optional)

Download Git here: http://git-scm.com/downloads

For Mac, download http://git-scm.com/download/mac

Or install it on Ubuntu/Debian with

sudo apt-get install git

On Mac with

sudo port install git-core +svn +doc +bash_completion +gitweb

Download Mercurial here: http://mercurial.selenic.com/wiki/Download

For Mac, download http://mercurial.berkwood.com/

Or install it on Ubuntu/Debian with

sudo apt-get install mercurial

On Mac with

sudo port install mercurial

Programming Environment: IntelliJ Idea

IntelliJ Idea Community Edition (recommended) http://www.jetbrains.com/idea/download/

*needs Oracle JDK; fiddle with PATH and JAVA_HOME as necessary

Task 1: Indexing a Document Collection

Getting a Collection

Collection note: if you have some data you’re using for your research, go ahead and create an index of that.

You can find some small sample collections here: http://www.search-engines-book.com/collections/ (do not use the corpus files, the format has changed!)

Indexing your Collection

The magic incantation to build an index from the contents of a directory looks like this, in general:

galago build --inputPath=${INPUT_DIR} --indexPath=${OUTPUT_DIR}

Format Gotcha

However, since the data file we copied has the unhelpful “.dat” extension, when it’s really trectext, we need to give galago a hint as to the filetype of our data, or else we’ll end up indexing zero documents.

galago build --inputPath=${INPUT_DIR}/ --indexPath=hackathon-index --filetype=trectext

Just waiting for 64,139 documents to be indexed…

As Galago churns over those documents (there are a lot of short documents!) it will print updates occasionally to the console:

INFO: Processing split: #{file}

Mar 17, 2014 9:22:54 AM org.lemurproject.galago.core.parse.UniversalParser process

WARNING: Read 10000 from split: #{file}

Did it work?

The last two lines of output should be these:

Done Indexing.

Documents Indexed: #{some number}.

When something goes wrong, you tend to get a Java exception stack trace instead.

Using the search frontend on your new index

Prepare to be amazed. Galago contains a pretty useful sanity-check tool, in the form of a web search frontend. Kick it off like this:

galago search --port=8080 --index=hackathon-index

On success, Galago will print out a link to your server; you can jump there using this link instead:

http://localhost:8080/

I found that there were no results for “hackathon”, but there were definitely results for “news”.

Try out various operators, like “#sdm(news stories)”. Click on “[debug]” in the web interface to see your query in various expanded forms. The three separate queries here will reappear in Task 2, as you write code to execute queries through Java.

Congratulations, you’ve completed Task 1!

Task 2: Retrieve from your Index

Compare the retrieval effectiveness of query likelihood to another retrieval model of your choice.
a) Display the top 10 document identifiers for the two retrieval models (for one query)
b) Compare the two retrieval models using galago eval (tip: run “galago eval” on the command line for usage information)
c) Perform a paired t-test at the 5% significance level

Possible choices / Implement as many as possible:
- sequential dependence model and Dirichlet smoothing
- RM3 (using galago operator)
- RM3 (using your own relevance model implementation)
- using max passage retrieval
- using field retrieval
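
For step c), here is a minimal sketch of the paired t statistic, computed from per-query scores of the two models (for example, average precision per query, however you collect it). The class and method names are made up for illustration and are not part of Galago. It only computes t, so you still need to compare |t| against the critical value of the t distribution with n - 1 degrees of freedom for your 5% level.

```java
final class PairedTTest {
    /**
     * Paired t statistic for per-query scores a[i] (model A) and b[i] (model B).
     * Degrees of freedom are a.length - 1.
     * If all differences are identical the standard deviation is zero and the
     * result is infinite or NaN.
     */
    static double tStatistic(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two equally long arrays with at least 2 queries");
        }
        int n = a.length;

        // Mean of the per-query differences.
        double mean = 0.0;
        for (int i = 0; i < n; i++) {
            mean += a[i] - b[i];
        }
        mean /= n;

        // Sample variance of the differences (n - 1 in the denominator).
        double sumSquares = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (a[i] - b[i]) - mean;
            sumSquares += d * d;
        }
        double stdDev = Math.sqrt(sumSquares / (n - 1));

        // t = mean difference divided by its standard error.
        return mean / (stdDev / Math.sqrt(n));
    }
}
```

Queries must be in the same order in both arrays, since the test pairs the scores by position.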

What operators are available?

Check out org.lemurproject.galago.core.retrieval.FeatureFactory

In particular, it lists the primitive operators first and the macro (traversal) operators later.

How to use a different Smoothing than Dirichlet Smoothing?

If you want to use JM smoothing instead of Dirichlet smoothing, define the following parameters in the parameter file or set them at runtime:

"scorer" : "jm"

"lambda" : 0.9

What’s the default? Find the code: JelinekMercerScoringIterator
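
To see what switching the scorer changes, here is a minimal sketch of the two textbook smoothing formulas as plain Java. The class and method names are hypothetical, and the convention that lambda weights the document model (rather than the collection model) is an assumption; check JelinekMercerScoringIterator for what Galago actually does.

```java
final class SmoothingSketch {
    /**
     * Dirichlet-smoothed log probability of a term:
     * log((count + mu * background) / (docLength + mu)).
     */
    static double dirichlet(long count, long docLength, double background, double mu) {
        return Math.log((count + mu * background) / (docLength + mu));
    }

    /**
     * Jelinek-Mercer-smoothed log probability of a term:
     * log(lambda * count / docLength + (1 - lambda) * background).
     * ASSUMPTION: lambda is the weight of the document model.
     */
    static double jelinekMercer(long count, long docLength, double background, double lambda) {
        // Guard against empty documents.
        double foreground = docLength == 0 ? 0.0 : (double) count / docLength;
        return Math.log(lambda * foreground + (1.0 - lambda) * background);
    }
}
```

Here background is the collection probability of the term (collection frequency divided by collection length).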

How to use passage retrieval:

Where you pass in the query string, also pass a Parameters instance as the second argument, and set the following entries in it:

Parameters p = new Parameters();
int defaultPassageSize = 50;
int defaultPassageShift = 25;
List<String> workingSetDocuments = …. // from a first pass document retrieval
p.set("passageQuery", true);
p.set("passageSize", defaultPassageSize);
p.set("passageShift", defaultPassageShift);
p.set("working", workingSetDocuments); // !! from a first pass!

Before you issue the passage query, it is highly recommended that you perform a document retrieval to build up a working set. The working set should contain the document names from the resulting ScoredDocuments.

Or, use TwoPassDocumentPassageModel instead of RankedPassageModel.

p.set("processingModel", TwoPassDocumentPassageModel.class.getName());

See ProcessingModel.instance() static factory method for more details.

Note: I’ve actually heard rumors from Mostafa that TwoPassDocumentPassageModel has a few bugs. If you see anything weird, he’s the expert.

When you issue a passage query like above, how do you actually get the passages?

You issue the query and get back a list of ScoredDocument objects.

You need to cast each ScoredDocument to a ScoredPassage.

Then you can access begin and end, which are indices into the term vector of the document, e.g.

List<String> passageTerms = document.terms.subList(scoredPsg.begin, scoredPsg.end);
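
To get a feel for what passageSize and passageShift mean, here is a small sketch that enumerates overlapping [begin, end) windows over a term vector. It is an illustration only: the class is hypothetical, and how Galago treats the last, shorter window is an assumption, so do not expect the offsets to match Galago exactly.

```java
import java.util.ArrayList;
import java.util.List;

final class PassageWindows {
    /**
     * Returns {begin, end} offsets (end exclusive) of windows of `size` terms
     * that start every `shift` terms. The last window is cut at the document end.
     */
    static List<int[]> windows(int docLength, int size, int shift) {
        if (size <= 0 || shift <= 0) {
            throw new IllegalArgumentException("size and shift must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (int begin = 0; begin < docLength; begin += shift) {
            int end = Math.min(begin + size, docLength);
            out.add(new int[] {begin, end});
            if (end == docLength) {
                break; // this window already reaches the end of the document
            }
        }
        return out;
    }
}
```

With size 50 and shift 25, every term (except near the document edges) is covered by two passages, and each pair of offsets can be fed to subList as in the line above.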

Passage-Extent Retrieval

Instead of retrieving a passage of N terms, you can also retrieve N extents (or tags) from the document, e.g. 3 sentences. This requires that you indexed your documents with tags (e.g. every sentence wrapped in a <sentence> .. </sentence> tag) and that the tag was configured as a field during indexing. Then you can use extentName = “sentence”, numberOfExtentsPerPsg = 3, shiftPsgByNumberOfExtents = 1.

p.set("extentQuery", true);
p.set("extent", extentName);
p.set("extentCount", numberOfExtentsPerPsg);
p.set("extentShift", shiftPsgByNumberOfExtents);
p.set("working", workingSetDocuments); // !! from a first pass!

Note that you definitely want to use a working set, or things will get very slow.

How to use Galago’s RM

You use #rm(… ) as part of the query string. You can control how many documents and terms are used for expansion, and the weight on the original query versus the expansion terms, with these parameters:

Parameters p = new Parameters();
p.set("fbOrigWt", 0.8);
p.set("fbDocs", 20);
p.set("fbTerm", 10);

How to set SDM parameters

SDM has a unigram, a bigram, and a windowed skip-bigram part. Default weights are 0.8, 0.15, 0.05, but you can change them via the parameter settings

Parameters p = new Parameters();
p.set("uniw", 0.8);
p.set("odw", 0.15);
p.set("uww", 0.05);

How to build a weighted combine model.

.. or any other nested combination of different retrieval models

This is the pseudo-code two-liner.

val weightsStr = for ((weight, idx) <- weights.zipWithIndex) yield {
  idx + "=" + weight
}
"#combine" + weightsStr.mkString(":", ":", "") + "(" + subqueries.mkString(" ") + ")"

Example: “#combine:0=0.1:1=0.2:2=0.3( first second third )”

will assign weight 0.1 to query term “first”, 0.2 to query term “second”, and 0.3 to query term “third”.
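
If you are working in Java rather than Scala, the same string building can be sketched as follows. The class and method names are made up for illustration; the only thing taken from Galago is the query syntax, with the weights attached to #combine as colon-separated index=weight pairs.

```java
import java.util.List;

final class WeightedCombine {
    /**
     * Builds "#combine:0=w0:1=w1:...( q0 q1 ... )".
     * The i-th weight belongs to the i-th subquery.
     */
    static String build(List<String> subqueries, List<Double> weights) {
        if (subqueries.size() != weights.size()) {
            throw new IllegalArgumentException("need exactly one weight per subquery");
        }
        StringBuilder sb = new StringBuilder("#combine");
        for (int i = 0; i < weights.size(); i++) {
            sb.append(':').append(i).append('=').append(weights.get(i));
        }
        sb.append("( ").append(String.join(" ", subqueries)).append(" )");
        return sb.toString();
    }
}
```

Since the subqueries are plain strings, they can themselves be full operators such as another #combine or an #sdm, which is how nested models are put together.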

How to build a weighted combine model with nodes instead of strings:

public static Node toWeightedCombine(List<WeightedTerm> terms) {
    Node combine = new Node("combine");
    for (int i = 0; i < terms.size(); i++) {
        WeightedTerm wt = terms.get(i);
        // The parameter key is the child index; the value is the weight of that child.
        combine.getNodeParameters().set(Integer.toString(i), wt.score);
        combine.addChild(Node.Text(wt.term));
    }
    return combine;
}

You can also “hand-roll” your SDM model like this:

#combine:0=0.1:1=0.2:2=0.3( #combine( first second third) #ordered:2( first second third) #combine( #unordered:8(first second) #unordered:8(second third) ) )
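
A hand-rolled SDM string can also be generated from a term list. The sketch below is one way to do it and makes a few assumptions: adjacent term pairs are used as bigrams, each bigram appears once as #ordered:1 and once as #unordered:8, and the class and parameter names are invented for illustration. It is not the expansion Galago produces for #sdm internally.

```java
import java.util.ArrayList;
import java.util.List;

final class HandRolledSdm {
    /** Builds a three-part weighted combine: unigrams, ordered bigrams, unordered bigrams. */
    static String build(List<String> terms, double uniw, double odw, double uww) {
        String unigrams = "#combine( " + String.join(" ", terms) + " )";
        if (terms.size() < 2) {
            return unigrams; // no bigrams possible, fall back to the unigram part
        }
        List<String> ordered = new ArrayList<>();
        List<String> unordered = new ArrayList<>();
        for (int i = 0; i + 1 < terms.size(); i++) {
            String pair = terms.get(i) + " " + terms.get(i + 1);
            ordered.add("#ordered:1( " + pair + " )");
            unordered.add("#unordered:8( " + pair + " )");
        }
        return "#combine:0=" + uniw + ":1=" + odw + ":2=" + uww + "( "
                + unigrams
                + " #combine( " + String.join(" ", ordered) + " )"
                + " #combine( " + String.join(" ", unordered) + " ) )";
    }
}
```

The benefit over the pre-baked operator is that each of the three parts is a string you control, so you can swap in field-restricted or escaped terms.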

How to do field-retrieval?

Your document may have different fields, e.g. “title” versus “body”. You can restrict the retrieval model to only one of them. For instance if you want to run an SDM query “first second third” on the title field only you use

#sdm( first.title second.title third.title)

my pseudocode oneliner:

"#sdm(" + terms.map(_ + "."+singleField).mkString(" ") + ")"

You can use the weighted combine method to give different weights to different fields, e.g. 0.8 to the title and 0.2 to the body.

#combine:0=0.8:1=0.2( #sdm( first.title second.title third.title) #sdm( first.body second.body third.body) )
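
In Java, the field-restricted and field-weighted queries can be assembled along these lines. This is a sketch with made-up class and method names; it only builds the query string and assumes the fields exist in your index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class FieldQueries {
    /** "#sdm( t1.field t2.field ... )": the SDM query restricted to a single field. */
    static String sdmOnField(List<String> terms, String field) {
        return terms.stream()
                .map(t -> t + "." + field)
                .collect(Collectors.joining(" ", "#sdm( ", " )"));
    }

    /**
     * Weighted combine of one field-restricted SDM query per field.
     * A LinkedHashMap keeps the fields in insertion order, so weight i
     * lines up with the i-th subquery.
     */
    static String weightedFields(List<String> terms, LinkedHashMap<String, Double> fieldWeights) {
        StringBuilder head = new StringBuilder("#combine");
        StringBuilder body = new StringBuilder();
        int i = 0;
        for (Map.Entry<String, Double> e : fieldWeights.entrySet()) {
            head.append(':').append(i++).append('=').append(e.getValue());
            body.append(' ').append(sdmOnField(terms, e.getKey()));
        }
        return head + "(" + body + " )";
    }
}
```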

Prevent tokenization

How do you keep Galago from tokenizing a field that you want to query as a whole?

Put the attribute tokenizeTagContent="false" in your trectext field definition before indexing:

<fieldname tokenizeTagContent="false">Water resources management in modern Egypt</fieldname>

(P.S. I am not sure whether this is only a feature of one of Jeff’s customized versions of Galago, or a standard feature in Galago)

Proper field length normalization in field retrieval

Consider an example index with several fields (“entrez_gene_id”, “go”, “desc”), built using TagTokenizer and no stemming.

To perform field retrieval, use

#combine(term1.field term2.field term3.field)

You can also do SDM as

#sdm(solute.desc binding.desc protein.desc)

Be aware that all field names are lower-cased (no matter what your input looked like).

Make sure all query terms are lower-cased and tokenized the same way TagTokenizer would do it (e.g. a symbol like “_” splits a term into two tokens).
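
As a rough illustration of that advice, the sketch below lower-cases a raw query string and splits it on anything that is not a letter or digit. This is only an approximation under my own assumptions and not TagTokenizer itself, whose actual rules may differ in the details; if exact agreement matters, run your query text through the same tokenizer that built the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class QueryTermNormalizer {
    /**
     * Lower-cases the input and splits it on runs of characters that are
     * neither letters nor digits (so "_" and "-" separate tokens).
     */
    static List<String> normalize(String raw) {
        List<String> out = new ArrayList<>();
        for (String token : raw.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                out.add(token);
            }
        }
        return out;
    }
}
```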

The equation for Query likelihood and SDM makes use of some “document length” variable. By default, this is the total length of the document! (Not the field length!)

If you want the field length instead (say, you have translations into different languages in separate fields, or you abuse Galago as a database), then this default is not what you want.

You can tell Galago which length to use with the “lengths=go” parameter of the #dirichlet operator, and by passing “#lengths:go:part=lengths()” as its first argument.

Example:

Proper field length normalization, here for field “go” and query “0051471”

#combine:w=1.0( #dirichlet:lengths=go( #lengths:go:part=lengths() #counts:@/0051471/:part=field.go() ) )

If you want SDM, you cannot use the pre-baked #sdm operator. But you can “hand roll” your own through a weighted combine of unigrams and bigrams, where each bigram occurs once as #ordered:1 and once as #unordered:8, with appropriate weights spread out.

Here is an example of what goes inside these terms:

Proper field length normalization in bigrams (unordered:8)

#combine:w=1.0( #dirichlet:lengths=desc( #lengths:desc:part=lengths() #unordered:8( #extents:solute:part=field.desc() #extents:binding:part=field.desc() )) )

If your query terms contain reserved words in the galago language, you can escape them with “@/” and “/”, e.g. “@/queryterm/”. This also works in the above style:

… same with escaping…

#combine:w=1.0( #dirichlet:lengths=desc( #lengths:desc:part=lengths() #unordered:8( #extents:@/solute/:part=field.desc() #extents:@/binding/:part=field.desc() )) )
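
Typing these nested operators by hand gets error-prone quickly, so here is a small sketch that generates the unigram and unordered-bigram pieces shown above for a given field. The class and method names are hypothetical; the operator strings are copied from the examples in this section, and terms are always escaped with @/ and /.

```java
final class FieldLengthQuery {
    /** Wraps a term as @/term/. Terms that contain "/" themselves are not handled. */
    static String escape(String term) {
        return "@/" + term + "/";
    }

    /** Dirichlet-smoothed unigram that uses the field length instead of the document length. */
    static String unigram(String term, String field) {
        return "#dirichlet:lengths=" + field + "( #lengths:" + field + ":part=lengths() "
                + "#counts:" + escape(term) + ":part=field." + field + "() )";
    }

    /** Dirichlet-smoothed unordered window of width 8 over two terms in the same field. */
    static String unorderedBigram(String first, String second, String field) {
        return "#dirichlet:lengths=" + field + "( #lengths:" + field + ":part=lengths() "
                + "#unordered:8( #extents:" + escape(first) + ":part=field." + field + "() "
                + "#extents:" + escape(second) + ":part=field." + field + "() ) )";
    }
}
```

Wrap the returned pieces in a #combine (weighted or not) to get a complete query.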

How to do Boolean set retrieval in Galago

#combine( #counts:@/0051471/:part=field.go() )

#combine ( #unordered:8( #extents:solute:part=field.desc() #extents:binding:part=field.desc() ) )

#combine ( #band ( op1 op2 ) )

Nested boolean queries, such as AND ( AND() OR())

In order to implement boolean queries, you need to know about a set of operators that work together

#bool (boolean-expression): produces a 0/1 score. This will start a boolean query when you are not feeding it to #reject or #require.

#band: “boolean and” of expressions

#inside: checks for matches inside fields, e.g. #inside(#od:1 (term1 term2) #field:name())

Use #bool only once to wrap the expression

Example:

#bool(#band(#inside(#od:1(term1 term2) #field:name()) ...))
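
Because the parentheses are easy to get wrong in nested Boolean queries, it can help to compose them from small helper functions. The sketch below uses invented helper names and only the operators described above (#bool, #band, #inside, #od:1, #field).

```java
import java.util.List;

final class BooleanQuery {
    /** Exact phrase inside a field: #inside(#od:1(t1 t2 ...) #field:name()). */
    static String phraseInField(List<String> terms, String field) {
        return "#inside(#od:1(" + String.join(" ", terms) + ") #field:" + field + "())";
    }

    /** Boolean AND over already-built expressions. */
    static String and(List<String> expressions) {
        return "#band(" + String.join(" ", expressions) + ")";
    }

    /** Outermost wrapper; use it only once per query. */
    static String bool(String expression) {
        return "#bool(" + expression + ")";
    }
}
```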