Lucene + Jython = profit

Jorge Herrera
5 min read · Dec 1, 2017

The power of Lucene … without your eyes bleeding

For an internal research project here at Shazam we needed to search a big text corpus. I’m not an NLP person, but we have some experts in the area at Shazam’s RAD team (companies usually call it R&D, but we aren’t just any company). After I posed the problem to them, they immediately suggested using an inverted index. They even recommended Apache Lucene for the task.

While I’ve done Java programming in the past, I really don’t enjoy it (it’s been said that reading Java code can make your eyes bleed!). “Maybe there’s a Python version,” I thought. In fact, there is an official one, but the installation instructions discouraged me (not too complicated, really, but I prefer the simple pip install <package>). After pondering other alternatives, I remembered Jython.

There’s a good chance that you haven’t heard of Jython. It can be really handy when you want or need to write Python code, but there’s a library or package written in Java that you could use. It has been about 10 years since a good friend of mine mentioned it (he was working on adding Django support to Jython for a Google SoC). Since then, I never really had a need for Jython, but I was well aware of it. Finally, the time came.

What the hell is Jython, you ask? In short, Jython is a Python interpreter that allows you to import and use straight Java code. Why would that be useful, you inquire? Well, the Jython wiki has a few reasons, but the one that jumped out at me was:

Rapid application development — Python programs are typically 2–10X shorter than the equivalent Java program. This translates directly to increased programmer productivity. The seamless interaction between Python and Java allows developers to freely mix the two languages both during development and in shipping products.

So I decided to give it a shot.

A really simple example

Since I was working in a domain outside my comfort zone, with tools that I hadn’t used before, I decided to implement a simple demo as a proof of concept. It is based on this Lucene in 5 minutes tutorial, reimplemented in a pythonic way.

The Jython side of things

Before starting, I had to install Jython. On OS X this is as simple as brew install jython (I’m sure it is just as easy on other platforms). With that in place you can run a Python script the same way you would with standard Python, e.g. jython demo.py. Of course, the real power comes from the ability to use Java packages as if they were Python ones. The process is fairly simple:

  1. Add the .jar files to Python’s search path (I should say Jython’s search path);
  2. Import what you need as if it was a standard Python module;
  3. Profit!

For example:

import sys
sys.path.append("lucene-7.1.0/core/lucene-core-7.1.0.jar")
# Now that jars are in the path, we can import java code as if it
# was regular Python!
from org.apache.lucene.analysis.standard import StandardAnalyzer
analyzer = StandardAnalyzer()

This simple script won’t do anything useful, but if you can run it without getting exceptions (run it by typing jython dumb_script.py in a terminal), you are good to go! That’s basically all you need for a Jython script to use Java code. The rest of this post will briefly explain how to use Lucene for an inverted index search.

The Lucene side of things

The functionality required for this demo lives in two of Lucene’s jar files, so you will need to add both to Jython’s path.

import sys
jars = [
    "lucene-7.1.0/core/lucene-core-7.1.0.jar",
    "lucene-7.1.0/queryparser/lucene-queryparser-7.1.0.jar",
]
for jar in jars:
    sys.path.append(jar)
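
One thing to watch for: sys.path.append accepts any string, so a mistyped jar path tends to surface only later, when the import fails. A minimal sketch of a guard, assuming you want to fail early instead (add_jars is a hypothetical helper of my own naming, not part of Jython or Lucene):

```python
import os
import sys


def add_jars(jar_paths):
    """Append each jar to sys.path, failing early if a file is missing.

    Hypothetical helper for illustration; not part of Jython or Lucene.
    """
    missing = [p for p in jar_paths if not os.path.isfile(p)]
    if missing:
        raise IOError("jar file(s) not found: " + ", ".join(missing))
    for path in jar_paths:
        # Avoid adding the same jar twice if the helper is called again.
        if path not in sys.path:
            sys.path.append(path)
```

Calling add_jars(jars) with the list above would then replace the for loop, and a wrong path would be reported by name before any Lucene import is attempted.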

The demo creates a basic search index and then shows how this index can be used to search given a query string.

Index creation

Let’s look at the index creation first (I’ll skip the imports for now, but the full code can be found at the end):

def make_index(analyzer):
    """ Create an inverted index to power the search. """
    def add_doc(w, title, isbn):
        """ Utility to add "documents" to the index. """
        doc = Document()
        doc.add(TextField("title", title, Field.Store.YES))
        # use a string field for isbn because we don't
        # want it tokenized
        doc.add(StringField("isbn", isbn, Field.Store.YES))
        w.addDocument(doc)
    # create the index
    index = RAMDirectory()
    config = IndexWriterConfig(analyzer)
    with closing(IndexWriter(index, config)) as w:
        add_doc(w, "Lucene in Action", "193398817")
        add_doc(w, "Lucene for Dummies", "55320055Z")
        add_doc(w, "Managing Gigabytes", "55063554A")
        add_doc(w, "The Art of Computer Science", "9900333X")
    return index

This function creates an index that is stored in memory (RAMDirectory) and populates it with four documents. The index uses an analyzer to parse/tokenize the documents to make them searchable; the same analyzer will be used at search time. Lucene provides an IndexWriter used to add “documents” to the index. A helper function, add_doc, takes care of adding a single document. In this example, a “document” consists of a title and its corresponding ISBN.
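
If “inverted index” is a new term for you too, here is a toy, pure-Python sketch of the idea, with no Lucene involved. The names tokenize and build_index are my own, and the tokenizer is a deliberately crude stand-in for an analyzer (real analyzers do considerably more). Only the title is tokenized; the ISBN is kept as a single stored value, which mirrors the TextField/StringField split above.

```python
from collections import defaultdict


def tokenize(text):
    """Toy stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each title token to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in tokenize(doc["title"]):
            index[token].add(doc_id)
    return index


docs = [
    {"title": "Lucene in Action", "isbn": "193398817"},
    {"title": "Lucene for Dummies", "isbn": "55320055Z"},
    {"title": "Managing Gigabytes", "isbn": "55063554A"},
    {"title": "The Art of Computer Science", "isbn": "9900333X"},
]
index = build_index(docs)
print(sorted(index["lucene"]))  # [0, 1]
print(sorted(index["art"]))     # [3]
```

The lookup goes from a word to the documents that contain it (hence “inverted”), so answering a query does not require scanning every title.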

One little thing I added to make the code more pythonic was a closing function, defined with Python’s standard contextmanager decorator, which allows us to omit the required .close() call on IndexWriter and DirectoryReader objects (see the search section below).

The definition of the closing function is very simple:

from contextlib import contextmanager

@contextmanager
def closing(thing):
    """
    Simple wrapper to make Lucene's classes appear more pythonic.
    """
    try:
        yield thing
    finally:
        thing.close()
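
The point of the try/finally is that close() runs even when the body of the with block raises. The sketch below shows this with a fake resource (FakeResource is a hypothetical class, made up for the demo). It also uses contextlib.closing, which ships with Python's standard library and behaves like the helper defined above; I'm assuming your Jython version includes it, as it has been in CPython since 2.5.

```python
from contextlib import closing  # standard-library equivalent of the helper above


class FakeResource(object):
    """Hypothetical stand-in for any object that has a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


resource = FakeResource()
try:
    with closing(resource) as r:
        raise ValueError("something went wrong mid-block")
except ValueError:
    pass

print(resource.closed)  # True: close() ran despite the exception
```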

That’s all there is to creating the index.

Searching (querying) the index

The search part is also pretty straightforward:

def query(querystr, index, analyzer):
    """ Search for the `querystr` in the index. """
    # the "title" arg specifies the default field to use
    # when no field is explicitly specified in the query.
    q = QueryParser("title", analyzer).parse(querystr)
    # search
    hitsPerPage = 10
    with closing(DirectoryReader.open(index)) as reader:
        searcher = IndexSearcher(reader)
        docs = searcher.search(q, hitsPerPage)
        hits = docs.scoreDocs
        # display results (needs reader to be open)
        print("Found {:d} hits.".format(len(hits)))
        for i, hit in enumerate(hits):
            docId = hit.doc
            d = searcher.doc(docId)
            print("{:d}. {}\t{}".format(i + 1, d.get("isbn"), d.get("title")))

The query function receives the query string (querystr), an index (index) and the analyzer (analyzer), the same analyzer we used to create the index. Using the analyzer, it parses the query string and then simply uses Lucene’s IndexSearcher object to look for the query in the index.
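
Why insist on the same analyzer? A toy, pure-Python sketch (again no Lucene; analyze, index and search are hypothetical names, and the analyzer is just lowercase-and-split) shows what can go wrong when the query is processed differently from the indexed text:

```python
def analyze(text):
    """Toy analyzer (an assumption for illustration): lowercase, split on whitespace."""
    return text.lower().split()


titles = ["Lucene in Action", "Lucene for Dummies", "Managing Gigabytes"]

# Index time: each token points at the positions of the titles containing it.
index = {}
for doc_id, title in enumerate(titles):
    for token in analyze(title):
        index.setdefault(token, set()).add(doc_id)


def search(querystr, analyzer):
    """Return the titles matching any token that `analyzer` produces."""
    hits = set()
    for token in analyzer(querystr):
        hits |= index.get(token, set())
    return [titles[i] for i in sorted(hits)]


print(search("Lucene", analyze))    # ['Lucene in Action', 'Lucene for Dummies']
print(search("Lucene", str.split))  # []
```

With the matching analyzer the query becomes "lucene" and hits both titles. With a plain split it stays "Lucene", a token that was never written to the index.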

The end result

Putting it all together, including the omitted import statements, yields the final Jython script:

The full Jython script, importing Lucene’s Java classes and using them as if they were native Python classes

This simple script can be run from a command line as follows:

> jython demoLuceneJython.py lucene
Found 2 hits.
1. 193398817 Lucene in Action
2. 55320055Z Lucene for Dummies
> jython demoLuceneJython.py art
Found 1 hits.
1. 9900333X The Art of Computer Science

Voilà!
