Lucene + Jython = profit
The power of Lucene … without your eyes bleeding
For an internal research project here at Shazam we needed to search a big text corpus. I’m not an NLP person, but we have some experts in the area at Shazam’s RAD team (companies usually call it R&D, but we aren’t just any company). After posing the problem to them, they immediately suggested using an inverted index. They even recommended Apache Lucene for the task.
While I’ve done Java programming in the past, I really don’t enjoy it (it’s been said that reading Java code can make your eyes bleed!). “Maybe there’s a Python version,” I thought. In fact, there is an official one, but the installation instructions discouraged me (not too complicated, really, but I prefer the simple pip install <package>). After pondering other alternatives, I remembered Jython.
There’s a good chance that you haven’t heard about Jython. It can be really handy in situations where you want/need to write Python code, but there’s a library/package written in Java that you could use. It has been about 10 years since a good friend of mine mentioned it (he was working on adding Django support to Jython for a Google SoC). Since then, I never really had a need for Jython, but I was well aware of it. Finally, the time came.
What the hell is Jython, you ask? In short, Jython is a Python interpreter that allows you to import and use straight Java code. Why would that be useful, you inquire? Well, the Jython wiki has a few reasons, but the one that jumped out at me was:
Rapid application development — Python programs are typically 2–10X shorter than the equivalent Java program. This translates directly to increased programmer productivity. The seamless interaction between Python and Java allows developers to freely mix the two languages both during development and in shipping products.
So I decided to give it a shot.
A really simple example
Since I was working in a domain outside my comfort zone, with tools that I hadn’t used before, I decided to implement a simple demo as a proof of concept. It is based on this Lucene in 5 minutes tutorial, reimplemented in a pythonic way.
The Jython side of things
Before starting, I had to install Jython. On OS X this is as simple as brew install jython (I’m sure it is just as easy on other platforms). With that in place you can run a Python script the same way you’d do with standard Python, e.g. jython demo.py. This command will execute most pure-Python scripts (Jython implements Python 2.7 and cannot load C extension modules). Of course, the real power comes from the ability to use Java packages as if they were Python ones. The process is fairly simple:
- Add the .jar files to Python’s search path (I should say Jython’s search path);
- Import what you need as if it was a standard Python module;
- Profit!
For example:
import sys
sys.path.append("lucene-7.1.0/core/lucene-core-7.1.0.jar")

# Now that jars are in the path, we can import java code as if it
# was regular Python!
from org.apache.lucene.analysis.standard import StandardAnalyzer

analyzer = StandardAnalyzer()
This simple script won’t do anything useful, but if you can run it without getting exceptions (run by typing jython dumb_script.py in a terminal), you are good to go! That’s basically all you need for a Jython script to use Java code. The rest of this post will briefly explain how to use Lucene for an inverted index search.
The Lucene side of things
The functionality required for this demo lives in two jar files from Lucene, so you will need to add both to Jython’s path.
import sys

jars = [
    "lucene-7.1.0/core/lucene-core-7.1.0.jar",
    "lucene-7.1.0/queryparser/lucene-queryparser-7.1.0.jar",
]
for jar in jars:
    sys.path.append(jar)
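One thing worth knowing: appending a path that does not exist does not fail right away; the problem only shows up later as an ImportError on the first Java import. A small helper that checks the files up front can make that easier to debug. This is only a sketch: add_jars is a name I made up for illustration, not something Jython or Lucene provides.

```python
import os
import sys


def add_jars(jar_paths):
    """Append each jar to sys.path, failing early if a file is missing.

    `add_jars` is a hypothetical helper, not part of Jython or Lucene.
    """
    # Collect every missing file first so the error lists them all at once.
    missing = [p for p in jar_paths if not os.path.isfile(p)]
    if missing:
        raise IOError("jar file(s) not found: " + ", ".join(missing))
    for p in jar_paths:
        # Avoid duplicate entries if the helper is called more than once.
        if p not in sys.path:
            sys.path.append(p)
    return jar_paths
```

With this in place, the loop above could be replaced by a single add_jars(jars) call.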
The demo creates a basic search index and then shows how this index can be used to search given a query string.
Index creation
Let’s look at the index creation first (I’ll skip the imports for now, but the full code can be found at the end):
def make_index(analyzer):
    """ Create an inverted index to power the search. """

    def add_doc(w, title, isbn):
        """ Utility to add "documents" to the index. """
        doc = Document()
        doc.add(TextField("title", title, Field.Store.YES))
        # use a string field for isbn because we don't
        # want it tokenized
        doc.add(StringField("isbn", isbn, Field.Store.YES))
        w.addDocument(doc)

    # create the index
    index = RAMDirectory()

    config = IndexWriterConfig(analyzer)

    with closing(IndexWriter(index, config)) as w:
        add_doc(w, "Lucene in Action", "193398817")
        add_doc(w, "Lucene for Dummies", "55320055Z")
        add_doc(w, "Managing Gigabytes", "55063554A")
        add_doc(w, "The Art of Computer Science", "9900333X")

    return index
This function creates an index that will be stored in memory (RAMDirectory) and fills it with four documents. The index uses an analyzer to parse/tokenize the documents to make them searchable; the same analyzer will be used at search time. Lucene provides an IndexWriter used to add “documents” to the index. A helper function, add_doc, takes care of adding a single document. In this example, a “document” consists of a title and its corresponding ISBN.
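If the idea of an inverted index is new to you (it was to me), a toy version in plain Python may help. To be clear, this is not how Lucene works internally and it uses no Lucene code: the analyze, build_inverted_index and search functions are names I invented for this sketch, the "analyzer" just lowercases and splits on non-alphanumeric characters, and there is no scoring or query syntax. It only shows the core idea of mapping each token to the documents that contain it, and of running the same analysis on the documents and on the query.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_inverted_index(titles):
    """Map each token to the set of document ids whose title contains it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for token in analyze(title):
            index[token].add(doc_id)
    return index


def search(index, querystr):
    """Return the ids of documents containing every token of the query."""
    # The query goes through the same analyzer as the documents did.
    tokens = analyze(querystr)
    if not tokens:
        return []
    postings = [index.get(tok, set()) for tok in tokens]
    return sorted(set.intersection(*postings))


titles = [
    "Lucene in Action",
    "Lucene for Dummies",
    "Managing Gigabytes",
    "The Art of Computer Science",
]
toy_index = build_inverted_index(titles)
print(search(toy_index, "LUCENE"))  # [0, 1]
print(search(toy_index, "art"))     # [3]
```

Because the query is lowercased by the same analyze function, "LUCENE" still matches. That is the reason the analyzer has to be shared between indexing and searching.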
One little thing I added to make the code more pythonic was to define a closing function using Python’s standard contextmanager decorator, which allows us to omit the required .close() call on IndexWriter and DirectoryReader (see the search section below) objects.
The definition of the closing function is very simple:
from contextlib import contextmanager

@contextmanager
def closing(thing):
    """
    Simple wrapper to make Lucene's classes appear more pythonic.
    """
    try:
        yield thing
    finally:
        thing.close()
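As a side note, Python’s standard library already ships a helper with the same behavior, contextlib.closing, so the hand-written version could probably be swapped for an import. I’m assuming here that it works the same under Jython 2.7 as under CPython; I have not checked it against Lucene objects. The sketch below uses a made-up FakeWriter class as a stand-in for anything exposing close(), and shows that close() runs even when the body of the with block raises.

```python
from contextlib import closing


class FakeWriter(object):
    """Stand-in for an object exposing close(); purely illustrative."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


writer = FakeWriter()
try:
    with closing(writer) as w:
        # Simulate something going wrong while the resource is open.
        raise ValueError("simulated failure while indexing")
except ValueError:
    pass

print(writer.closed)  # True: close() ran even though the block raised
```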
That’s all there is to creating the index.
Searching (querying) the index
The search part is also pretty straightforward:
def query(querystr, index, analyzer):
    """ Search for the `querystr` in the index. """

    # the "title" arg specifies the default field to use
    # when no field is explicitly specified in the query.
    q = QueryParser("title", analyzer).parse(querystr)

    # search
    hitsPerPage = 10
    with closing(DirectoryReader.open(index)) as reader:
        searcher = IndexSearcher(reader)
        docs = searcher.search(q, hitsPerPage)
        hits = docs.scoreDocs

        # display results (needs reader to be open)
        print("Found {:d} hits.".format(len(hits)))
        for i, hit in enumerate(hits):
            docId = hit.doc
            d = searcher.doc(docId)
            print("{:d}. {}\t{}".format(i + 1, d.get("isbn"), d.get("title")))
The search function receives the query string (querystr), an index (index) and the analyzer (analyzer), the same analyzer we used to create the index. Using the analyzer, it parses the query string and then simply uses Lucene’s IndexSearcher object to look for the query in the index.
The end result
Putting it all together, including the omitted import statements, yields the final Jython script:
This simple script can be run from a command line as follows:
> jython demoLuceneJython.py lucene
Found 2 hits.
1. 193398817 Lucene in Action
2. 55320055Z Lucene for Dummies

> jython demoLuceneJython.py art
Found 1 hits.
1. 9900333X The Art of Computer Science
Voilà!