Creating a Custom Snowball Filter for Lucene (Solr): A Step-by-Step Guide
I’m working on a project where we have decided to use Solr as our high-speed search storage.
Apache Solr is a fast open-source Java search server. Solr enables you to easily create search engines that search websites, databases and files.
Our language is Czech, which has 7 cases and countless case endings, making it far more difficult than English when it comes to finding the base form of a word: for example, the forms hrad, hradu and hradem are all cases of one word (“castle”) and should be indexed under a single stem. So we have decided to cut all case endings and store our keywords as pure stems.
Solr (or rather Lucene itself) provides a default Czech language stemmer based on Snowball.
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.
Writing a good stemmer is probably a never-ending story (especially for a language like Czech with its pile of exceptions), but the default Snowball implementation included in Solr is dumb and returns a lot of nonsense that can be easily fixed.
So I decided to implement my own stemmer for Solr. Uff… I didn’t know what I was getting into.
Long story short
There is not much documentation about it, and Lucene uses an outdated version of Snowball (which is not stated anywhere). On top of that, I had to deal with Java, which I’m not a big fan of (but that’s my problem, I know).
Ok, let’s get our hands dirty
So let’s start with some work. First we need to download the Snowball source code:
$ git clone git@github.com:snowballstem/snowball.git
Then go to the downloaded folder and build it:
$ make
Of course you need a Snowball script itself. I started with the default implementation for the Czech language, but use whatever language you need. I created a file cz.sbl in the root of the Snowball folder and saved the content from the link above into it.
Thereafter you need to save the driver.c file from this url to the root Snowball folder as well.
Let’s create a new folder called q and copy the files runtime/api.c, runtime/api.h, runtime/header.h and runtime/utilities.c into it.
Finally build Snowball java file:
$ ./snowball cz.sbl -j -n czech-stemmer -o temp/czech-stemmer -utf8
Where cz.sbl is the file created above, the -j parameter tells Snowball to create a Java file, -n czech-stemmer tells Snowball to name the Java class CzechStemmer, -o temp/czech-stemmer tells Snowball where to save it, and -utf8 is pretty clear.
It should create a czech-stemmer.java file in the temp subfolder of the Snowball root folder. We will need this file later when compiling a jar file for Lucene (I will refer to this file as the Snowball java file).
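For orientation, the generated file looks roughly like the skeleton below. This is just a sketch; the exact generated contents depend on your Snowball version, but the package and base class are the ones Lucene expects (they match the Among.java and SnowballProgram.java files we will copy later):
package org.tartarus.snowball.ext;

import org.tartarus.snowball.Among;
import org.tartarus.snowball.SnowballProgram;

public class CzechStemmer extends SnowballProgram {
    private static final long serialVersionUID = 1L;

    /* generated suffix tables */
    private static final Among[] a_0 = { /* ... */ };

    /* the generated stemming logic, built on find_among_b() and friends */
    @Override
    public boolean stem() {
        /* ... */
        return true;
    }
}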
Lucene
To be able to use this Snowball java file we need to create a jar file, for which we need two Lucene jar files to be referenced. So it’s time to download Lucene and build it. Download it, unpack it, go to the folder and run:
$ ant dist
After a couple of minutes (depending on how fast your machine is) you should get two files: build/core/lucene-core-7.1.0-SNAPSHOT.jar and build/analysis/common/lucene-analyzers-common-7.1.0-SNAPSHOT.jar.
Preparing Java files for build
Now we come to the point where we could simply build the jar file for Solr and be done. But unfortunately not. As I have written before, Lucene uses an older, outdated version of Snowball, so we have to patch our Snowball java file (generated above).
There are three steps needed:
1. Add a methodObject field that will be passed to each Among class constructor:
/* find this line: */
private static final long serialVersionUID = 1L;

/* and add the following under it: */
private static final java.lang.invoke.MethodHandles.Lookup methodObject = java.lang.invoke.MethodHandles.lookup();
2. Use methodObject in each Among constructor. We will add two more parameters: an empty string and the methodObject lookup created in the step above:
/* find each new Among(...): */
new Among("ce", -1, 1),
new Among("ze", -1, 2),
new Among("\u00BEe", -1, 2),
...

/* and change it to: */
new Among("ce", -1, 1, "", methodObject),
new Among("ze", -1, 2, "", methodObject),
new Among("\u00BEe", -1, 2, "", methodObject),
...
3. Find each call of the find_among_b method that takes an array, and add a second parameter: the length of the array passed as the first parameter:
/* find method calls like this: */
among_var = find_among_b(a_0);

/* and change them to this: */
among_var = find_among_b(a_0, a_0.length);

/* do it for all a_* you have: a_1 => a_1, a_1.length etc. */
4. To build the jar file there are a few more things to do. Create a new folder and in it a build.xml file for Ant with the following content:
<project>
<target name="clean">
<delete dir="build"/>
</target>
<target name="compile">
<mkdir dir="build/classes"/>
<javac srcdir="src" destdir="build/classes">
<classpath>
<fileset dir="libs">
<include name="**/*.jar" />
</fileset>
</classpath>
</javac>
</target>
<target name="jar">
<mkdir dir="build/jar"/>
<jar destfile="build/jar/czech-stemmer.jar" basedir="build/classes"/>
</target>
</project>
Create a subfolder src and copy into it the Snowball java file we patched a few moments ago. Because in Java the name of a file must match its public class, rename czech-stemmer.java to CzechStemmer.java. Then go to the folder with the Lucene source code and copy the files analysis/common/src/java/org/tartarus/snowball/Among.java and analysis/common/src/java/org/tartarus/snowball/SnowballProgram.java to the same src folder where your patched CzechStemmer.java file is.
5. Now it is time to write some unique Java code. You need two classes: one extending the TokenFilter class and another extending the TokenFilterFactory class. Lucene will call these two classes to use your Snowball code. I have named them CzechSnowballFilter and CzechSnowballFilterFactory:
package com.solr.custom.filter.test;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.CzechStemmer;

/**
 * @author SumeetS
 *
 */
public class CzechSnowballFilter extends TokenFilter {

    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

    /**
     * @param input
     */
    public CzechSnowballFilter(TokenStream input) {
        super(input);
    }

    /* (non-Javadoc)
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     */
    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // read the current token first so it stays intact if stemming fails
            String parsed = termAttr.toString();
            try {
                CzechStemmer stemmer = new CzechStemmer();
                stemmer.setCurrent(parsed);
                if (stemmer.stem()) {
                    parsed = stemmer.getCurrent();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            char[] parsedArray = parsed.toCharArray();
            termAttr.setEmpty();
            termAttr.copyBuffer(parsedArray, 0, parsedArray.length);
            return true;
        } else {
            return false;
        }
    }
}
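One design note: the filter above creates a new CzechStemmer for every token. Since a TokenStream instance is only ever consumed by a single thread, you could instead hoist the stemmer into a field and reuse it. A minimal sketch of the changed part (same class and attribute as above):
// reuse one stemmer instance for all tokens of this stream
private final CzechStemmer stemmer = new CzechStemmer();

@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    stemmer.setCurrent(termAttr.toString());
    if (stemmer.stem()) {
        // replace the token text with the stem
        String stem = stemmer.getCurrent();
        termAttr.setEmpty();
        termAttr.append(stem);
    }
    return true;
}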
And the factory:
package com.solr.custom.filter.test;

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

/**
 * @author SumeetS
 *
 */
public class CzechSnowballFilterFactory extends TokenFilterFactory {

    /**
     * @param args
     */
    public CzechSnowballFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    /* (non-Javadoc)
     * @see org.apache.lucene.analysis.util.TokenFilterFactory#create(org.apache.lucene.analysis.TokenStream)
     */
    @Override
    public TokenStream create(TokenStream input) {
        return new CzechSnowballFilter(input);
    }
}
In short: the CzechSnowballFilterFactory class creates an instance of the CzechSnowballFilter class, which creates an instance of the CzechStemmer class that stems the token text and assigns the result to the parsed variable if the stemming was ok.
6. Copy the lucene-core-7.1.0-SNAPSHOT.jar and lucene-analyzers-common-7.1.0-SNAPSHOT.jar files to a new libs folder that you create on the same level as src and build.xml.
It should look like this now:
-src
|_CzechStemmer.java
|_Among.java
|_SnowballProgram.java
-libs
|_lucene-core-7.1.0-SNAPSHOT.jar
|_lucene-analyzers-common-7.1.0-SNAPSHOT.jar
-build.xml
Finally, go to this folder in the command line and build the jar file:
$ ant compile
$ ant jar
Voila! You should have a build/jar subfolder with the file czech-stemmer.jar in it.
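Before deploying, you can convince yourself the stemmer works by exercising the generated class directly against the new jar. A minimal sketch (the class name StemmerDemo and the sample word hradem, an inflected form of the Czech word for castle, are mine; the exact stem depends on your .sbl script):
import org.tartarus.snowball.ext.CzechStemmer;

public class StemmerDemo {
    public static void main(String[] args) {
        CzechStemmer stemmer = new CzechStemmer();
        stemmer.setCurrent("hradem"); // an inflected Czech word
        if (stemmer.stem()) {
            System.out.println(stemmer.getCurrent()); // prints the stem
        }
    }
}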
Deploying to Solr
Stop Solr:
$ sudo service solr stop
It depends on how your Solr is configured, but my core data is in the folder /var/solr/data. Under this folder you should have a folder for each core you are running in Solr, and each core has its own configuration in two files: /var/solr/data/[NAME-OF-YOUR-CORE]/conf/solrconfig.xml, which holds the core configuration, and /var/solr/data/[NAME-OF-YOUR-CORE]/conf/schema.xml, where the structure of the stored data is defined.
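So the layout looks roughly like this (the paths assume a default service installation like mine):
-/var/solr/data
 |_[NAME-OF-YOUR-CORE]
   |_conf
     |_solrconfig.xml
     |_schema.xml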
Go to the first one and add the path to the directory where you want to store czech-stemmer.jar (or you can use some already defined folder). I have added this:
<lib dir="${solr.install.dir:../../../..}/contrib/lib/" regex=".*\.jar" />
So I uploaded the czech-stemmer.jar file to /opt/solr/contrib/lib (again, it depends on how your Solr is installed, but basically you should specify some subfolder of the Solr installation folder).
With that all done, we must now put it to use in the schema.xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="[NAME-OF-YOUR-CORE]" version="0.5">
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="czech_word" indexed="true" stored="true" multiValued="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="czech_word"
class="solr.TextField"
sortMissingLast="true"
omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="lang/synonyms_cz.txt"
ignoreCase="true"
expand="false"/>
<filter class="com.solr.custom.filter.test.CzechSnowballFilterFactory"/
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_cz.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/
</analyzer>
</fieldType>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
</types>
</schema>
Finally you can start Solr and reload the core:
$ sudo service solr start
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=[NAME-OF-YOUR-CORE]'
Trying it
Now you can go to http://localhost:8983/solr/#/[NAME-OF-YOUR-CORE]/analysis, fill in some word, select the czech_word field type and test it.
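You can also run the whole analysis chain from plain Java outside Solr, which is handy for quick debugging. A minimal sketch, assuming the Lucene jars and czech-stemmer.jar are on the classpath (the class name FilterSmokeTest and the sample words are mine):
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.solr.custom.filter.test.CzechSnowballFilter;

public class FilterSmokeTest {
    public static void main(String[] args) throws Exception {
        // tokenize a sample phrase and run it through the custom filter
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("hrady hradem hradu"));
        TokenStream stream = new CzechSnowballFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // each stemmed token
        }
        stream.end();
        stream.close();
    }
}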
Conclusion
Well, that’s it! If you have any questions, write them in the comments, and you can make me happy by clappin’!