Creating a Custom Snowball Filter for Lucene (Solr): A Step-by-Step Guide
I’m working on a project where we have decided to use Solr as our high-speed search storage.
Apache Solr is a fast open-source Java search server. Solr enables you to easily create search engines that search websites, databases and files.
Our language is Czech, which has 7 cases and countless case endings, making it far more difficult than English when it comes to finding the base form of a word: for example, the forms hrad, hradu and hradem are all cases of one word (“castle”) and should be indexed under a single stem. So we have decided to cut all case endings and store our keywords as pure stems.
Solr (or rather Lucene itself) provides a default Czech language stemmer based on Snowball.
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.
Writing a good stemmer is probably a never-ending story (especially for a language like Czech with its pile of exceptions), but the default Snowball implementation included in Solr is dumb and returns a lot of nonsense that can be easily fixed.
So I decided to implement my own stemmer for Solr. Uff… I didn’t know what I was getting into.
Long story short
There is not much documentation about it, and Lucene uses an outdated version of Snowball (which is not stated anywhere). On top of that, I had to deal with Java, which I’m not a big fan of (but that’s my problem, I know).
Ok, let’s get our hands dirty
So let’s start with some work. First we need to download the Snowball source code:
$ git clone git@github.com:snowballstem/snowball.git
Then go to the downloaded folder and build it:
$ make
Of course you need a Snowball script itself. I started with the default implementation for the Czech language, but use whatever language you need. I created a file cz.sbl in the root of the Snowball folder and saved the content from the link above into it.
Thereafter you need to save the driver.c file from this url to the root Snowball folder as well.
Let’s create a new folder called q and copy the files runtime/api.c, runtime/api.h, runtime/header.h and runtime/utilities.c into it.
Finally build Snowball java file:
$ ./snowball cz.sbl -j -n czech-stemmer -o temp/czech-stemmer -utf8
Where cz.sbl is the file created above, the -j parameter tells Snowball to create a Java file, -n czech-stemmer tells Snowball to name the Java class CzechStemmer, -o temp/czech-stemmer tells Snowball where to save it, and -utf8 is pretty clear.
It should create a czech-stemmer.java file in the temp subfolder of the Snowball root folder. We will need this file later when compiling a jar file for Lucene (I will refer to this file as the Snowball java file).
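For orientation, the generated file looks roughly like the skeleton below. This is just a sketch; the exact generated contents depend on your Snowball version, but the package and base class are the ones Lucene expects (they match the Among.java and SnowballProgram.java files we will copy later):
package org.tartarus.snowball.ext;

import org.tartarus.snowball.Among;
import org.tartarus.snowball.SnowballProgram;

public class CzechStemmer extends SnowballProgram {
    private static final long serialVersionUID = 1L;

    /* generated suffix tables */
    private static final Among[] a_0 = { /* ... */ };

    /* the generated stemming logic, built on find_among_b() and friends */
    @Override
    public boolean stem() {
        /* ... */
        return true;
    }
}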
Lucene
To be able to use this Snowball java file we need to create a jar file, for which we need two Lucene jar files to be referenced. So it’s time to download Lucene and build it. Download it, unpack it, go to the folder and run:
$ ant dist
After a couple of minutes (depending on how fast your machine is) you should get two files: build/core/lucene-core-7.1.0-SNAPSHOT.jar and build/analysis/common/lucene-analyzers-common-7.1.0-SNAPSHOT.jar.
Preparing Java files for build
Now we come to the point where we could simply build the jar file for Solr and be done. But unfortunately not. As I have written before, Lucene uses an older, outdated version of Snowball, so we have to patch our Snowball java file (generated above).
There are three steps needed:
1. Add a methodObject field that will be passed to each Among class constructor:
/* find this line: */
private static final long serialVersionUID = 1L;

/* and add the following under it: */
private static final java.lang.invoke.MethodHandles.Lookup methodObject = java.lang.invoke.MethodHandles.lookup();
2. Use methodObject in each Among constructor. We will add two more parameters: an empty string and the methodObject lookup created in the step above:
/* find each new Among(...): */
new Among("ce", -1, 1),
new Among("ze", -1, 2),
new Among("\u00BEe", -1, 2),
...

/* and change it to: */
new Among("ce", -1, 1, "", methodObject),
new Among("ze", -1, 2, "", methodObject),
new Among("\u00BEe", -1, 2, "", methodObject),
...
3. Find each call of the find_among_b method that takes an array, and add a second parameter: the length of the array passed as the first parameter:
/* find method calls like this: */
among_var = find_among_b(a_0);

/* and change them to this: */
among_var = find_among_b(a_0, a_0.length);

/* do it for all a_* you have: a_1 => a_1, a_1.length etc. */
4. To build the jar file there are a few more things to do. Create a new folder and in it a build.xml file for Ant with the following content:
<project>
<target name="clean">
<delete dir="build"/>
</target>
<target name="compile">
<mkdir dir="build/classes"/>
<javac srcdir="src" destdir="build/classes">
<classpath>
<fileset dir="libs">
<include name="**/*.jar" />
</fileset>
</classpath>
</javac>
</target>
<target name="jar">
<mkdir dir="build/jar"/>
<jar destfile="build/jar/czech-stemmer.jar" basedir="build/classes"/>
</target>
</project>
Create a subfolder src and copy into it the Snowball java file we patched a few moments ago. Because in Java the name of a file must match its public class, rename czech-stemmer.java to CzechStemmer.java. Then go to the folder with the Lucene source code and copy the files analysis/common/src/java/org/tartarus/snowball/Among.java and analysis/common/src/java/org/tartarus/snowball/SnowballProgram.java to the same src folder where your patched CzechStemmer.java file is.
5. Now it is time to write some unique Java code. You need two classes: one extending the TokenFilter class and another extending the TokenFilterFactory class. Lucene will call these two classes to use your Snowball code. I have named them CzechSnowballFilter and CzechSnowballFilterFactory:
package com.solr.custom.filter.test;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.CzechStemmer;

/**
 * @author SumeetS
 *
 */
public class CzechSnowballFilter extends TokenFilter {

    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

    /**
     * @param input
     */
    public CzechSnowballFilter(TokenStream input) {
        super(input);
    }

    /* (non-Javadoc)
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     */
    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // read the current token first so it stays intact if stemming fails
            String parsed = termAttr.toString();
            try {
                CzechStemmer stemmer = new CzechStemmer();
                stemmer.setCurrent(parsed);
                if (stemmer.stem()) {
                    parsed = stemmer.getCurrent();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            char[] parsedArray = parsed.toCharArray();
            termAttr.setEmpty();
            termAttr.copyBuffer(parsedArray, 0, parsedArray.length);
            return true;
        } else {
            return false;
        }
    }
}
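One design note: the filter above creates a new CzechStemmer for every token. Since a TokenStream instance is only ever consumed by a single thread, you could instead hoist the stemmer into a field and reuse it. A minimal sketch of the changed part (same class and attribute as above):
// reuse one stemmer instance for all tokens of this stream
private final CzechStemmer stemmer = new CzechStemmer();

@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    stemmer.setCurrent(termAttr.toString());
    if (stemmer.stem()) {
        // replace the token text with the stem
        String stem = stemmer.getCurrent();
        termAttr.setEmpty();
        termAttr.append(stem);
    }
    return true;
}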
And the factory:
package com.solr.custom.filter.test;

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

/**
 * @author SumeetS
 *
 */
public class CzechSnowballFilterFactory extends TokenFilterFactory {

    /**
     * @param args
     */
    public CzechSnowballFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    /* (non-Javadoc)
     * @see org.apache.lucene.analysis.util.TokenFilterFactory#create(org.apache.lucene.analysis.TokenStream)
     */
    @Override
    public TokenStream create(TokenStream input) {
        return new CzechSnowballFilter(input);
    }
}
In short: the CzechSnowballFilterFactory class creates an instance of the CzechSnowballFilter class, which creates an instance of the CzechStemmer class that stems the token text and assigns the result to the parsed variable if the stemming was ok.
6. Copy the lucene-core-7.1.0-SNAPSHOT.jar and lucene-analyzers-common-7.1.0-SNAPSHOT.jar files to a new libs folder that you create on the same level as src and build.xml.
It should look like this now:
-src
|_CzechStemmer.java
|_Among.java
|_SnowballProgram.java
-libs
|_lucene-core-7.1.0-SNAPSHOT.jar
|_lucene-analyzers-common-7.1.0-SNAPSHOT.jar
-build.xml
Finally, go to this folder in the command line and build the jar file:
$ ant compile
$ ant jar
Voila! You should have a build/jar subfolder with the file czech-stemmer.jar in it.
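Before deploying, you can convince yourself the stemmer works by exercising the generated class directly against the new jar. A minimal sketch (the class name StemmerDemo and the sample word hradem, an inflected form of the Czech word for castle, are mine; the exact stem depends on your .sbl script):
import org.tartarus.snowball.ext.CzechStemmer;

public class StemmerDemo {
    public static void main(String[] args) {
        CzechStemmer stemmer = new CzechStemmer();
        stemmer.setCurrent("hradem"); // an inflected Czech word
        if (stemmer.stem()) {
            System.out.println(stemmer.getCurrent()); // prints the stem
        }
    }
}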
Deploying to Solr
Stop Solr:
$ sudo service solr stop
It depends on how your Solr is configured, but my core data is in the folder /var/solr/data. Under this folder you should have a folder for each core you are running in Solr, and each core has its own configuration in two files: /var/solr/data/[NAME-OF-YOUR-CORE]/conf/solrconfig.xml, which holds the core configuration, and /var/solr/data/[NAME-OF-YOUR-CORE]/conf/schema.xml, where the structure of the stored data is defined.
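So the layout looks roughly like this (the paths assume a default service installation like mine):
-/var/solr/data
 |_[NAME-OF-YOUR-CORE]
   |_conf
     |_solrconfig.xml
     |_schema.xml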
Go to the first one and add the path to the directory where you want to store czech-stemmer.jar (or you can use some already defined folder). I have added this:
<lib dir="${solr.install.dir:../../../..}/contrib/lib/" regex=".*\.jar" />
So I uploaded the czech-stemmer.jar file to /opt/solr/contrib/lib (again, it depends on how your Solr is installed, but basically you should specify some subfolder of the Solr installation folder).
With that all done, we must now put it to use in the schema.xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="[NAME-OF-YOUR-CORE]" version="0.5">
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="czech_word" indexed="true" stored="true" multiValued="false"/>
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="czech_word"
class="solr.TextField"
sortMissingLast="true"
omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="lang/synonyms_cz.txt"
ignoreCase="true"
expand="false"/>
<filter class="com.solr.custom.filter.test.CzechSnowballFilterFactory"/
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_cz.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/
</analyzer>
</fieldType>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
</types>
</schema>
Finally you can start Solr and reload the core:
$ sudo service solr start
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=[NAME-OF-YOUR-CORE]'
Trying it
Now you can go to http://localhost:8983/solr/#/[NAME-OF-YOUR-CORE]/analysis, fill in some word, select the czech_word field type and test it.
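You can also run the whole analysis chain from plain Java outside Solr, which is handy for quick debugging. A minimal sketch, assuming the Lucene jars and czech-stemmer.jar are on the classpath (the class name FilterSmokeTest and the sample words are mine):
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.solr.custom.filter.test.CzechSnowballFilter;

public class FilterSmokeTest {
    public static void main(String[] args) throws Exception {
        // tokenize a sample phrase and run it through the custom filter
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("hrady hradem hradu"));
        TokenStream stream = new CzechSnowballFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // each stemmed token
        }
        stream.end();
        stream.close();
    }
}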
Conclusion
Well, that’s it! If you have any questions, write them in the comments, and you can make me happy by clappin’!