Futures, meet Lucene

A while back I ran into a strange little problem: I needed to create a Lucene index with not only the contents of files but their hashes as well. It seemed like something that should be simple, and at first it was; I read through each file once with an InputStream to generate the hash, then wrapped a new InputStream in a Tika ParsingReader and passed it to Lucene so the file’s contents could be indexed.
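For concreteness, the hashing pass of that first, two-read version might have looked roughly like this (a minimal sketch; SHA-256 and the method names here are my assumptions, not the original code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TwoPassHasher {
    // First pass: read the whole file just to compute its hash.
    // A second pass then re-opens the file for Tika/Lucene.
    static String hashFile(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                digest.update(buf, 0, n);
            }
        }
        // Hex-encode the digest bytes.
        StringBuilder sb = new StringBuilder();
        for (byte b : digest.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```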

It worked and everything was lovely until I ran the benchmarks; then I was sad for a while, because reading every file through twice meant paying for every disk read twice. So I got to thinking, “We’re already reading through the file, and doing it exactly in sequence, with no seeking or fancy offset stuff; it seems like we should be able to squeeze a little more utility out of those expensive disk reads.” As it turns out, we can.

What we do is wrap the input stream that we’re passing into the Tika ParsingReader in a filter class that does the hashing; this way, when Lucene reads the file in to do the indexing, the hash gets generated as a side effect. Great! We’re all done, right? Well, not quite; we still have to include that hash as a Field in the indexed Document. The problem here is that the file is only read (and therefore hashed) when the Document is written to the index, at which point Fields can no longer be added except with some very expensive calls that effectively delete the document and then re-write it back into the index. For obvious reasons I felt like this was too much work to do something relatively simple, so I went searching.

I looked through the Lucene source and wrote things on whiteboards that confused my colleagues and upset small children, but at last I found what I was looking for: it turns out that when Lucene indexes a Document, it reads the Fields single-threaded, in the order in which they were added to the Document. The Lucene source necessary to demonstrate this is a little involved, so it will not be included here, but if you’re curious some good places to start would be here, here, and here.
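A hashing filter along those lines is easy to sketch with the JDK alone (in fact, java.security.DigestInputStream already does most of this). A hand-rolled version, with class and method names that are my stand-ins rather than the original’s, might look like:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;

// Updates a MessageDigest with every byte that flows through the stream,
// so the hash materializes as a side effect of Lucene's own read.
public class HashStream extends FilterInputStream {
    private final MessageDigest digest;

    public HashStream(InputStream in, MessageDigest digest) {
        super(in);
        this.digest = digest;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) {
            digest.update((byte) b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            digest.update(buf, off, n);
        }
        return n;
    }

    // Hex-encode the finished digest once the stream has been fully read.
    public String hashHex() {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```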

This means that we can use the idea of a future to add a Field to Lucene even though its value doesn’t exist yet, because we know for a fact that in all cases it will exist by the time Lucene gets around to reading it. The way we do this is by subclassing Field and adding a little magic.
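Concretely, the subclass might look something like the following. This is my reconstruction against a Lucene 4.x-era API, not the original source; the trick is to hand the constructor a Future<String> and resolve it lazily in stringValue(), which Lucene calls only at indexing time:

```java
// Sketch only: assumes org.apache.lucene.document.Field and FieldType
// from a Lucene 4.x-era API; class and member names here are my own.
import java.util.concurrent.Future;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;

public class FutureField extends Field {
    private static final FieldType TYPE = new FieldType();
    static {
        TYPE.setStored(true);
        TYPE.setTokenized(true);
        TYPE.setIndexed(true);  // 4.x-era; newer versions use setIndexOptions
        TYPE.freeze();
    }

    private final Future<String> futureValue;

    public FutureField(String name, Future<String> futureValue) {
        super(name, TYPE);
        this.futureValue = futureValue;
    }

    @Override
    public String stringValue() {
        try {
            // By the time Lucene asks for this value, the content stream has
            // already been drained, so get() returns immediately.
            return futureValue.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```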

You may be asking why we didn’t simply subclass TextField, since we would desperately like to add this as a TextField (some of the more advanced Lucene programmers may instead be asking why we didn’t want to make this look more like a StringField. Hint: fingerprints). The problem is that TextField is written under the assumption that you are adding a Field that contains text: not an unreasonable assumption, but definitely an inconvenient one for our purposes. Having subclassed Field directly instead, we have only to implement a Future<String> for our HashStream to return, and we should be off to the races.
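A Future<String> in that spirit might look like the following sketch (the class name is my stand-in). It leans entirely on the ordering guarantee above: set() always runs before get():

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// A deliberately synchronization-free Future. It never blocks or spins:
// the indexing pipeline guarantees set() happens before get() is called,
// because Lucene drains the content stream before reading the hash field.
public class HashFuture implements Future<String> {
    private volatile String value;  // volatile for cross-thread visibility only

    public void set(String value) {
        this.value = value;
    }

    @Override
    public String get() {
        if (value == null) {
            // If this ever fires, the field-ordering assumption was violated.
            throw new IllegalStateException("get() called before set()");
        }
        return value;
    }

    @Override
    public String get(long timeout, TimeUnit unit) {
        return get();
    }

    @Override
    public boolean isDone() {
        return value != null;
    }

    @Override
    public boolean cancel(boolean mayInterruptIfRunning) {
        return false;  // the value cannot be cancelled once the read begins
    }

    @Override
    public boolean isCancelled() {
        return false;
    }
}
```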

Observant readers will note that there is no sleeping, waiting, or locking of any kind in this Future. That makes it a technically incomplete implementation of Future, but it is also free of all synchronization overhead, and it will always work correctly for this application, since the value is guaranteed to be set before get() is ever called.

All we have left to do now is write a (significantly simpler) main method.
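Something along these lines, perhaps. This is a sketch, not the original: it assumes lucene-core and tika-parsers on the classpath, uses HashStream and FutureField as hypothetical names for the hashing filter and future-backed Field described above, and assumes a futureHash() accessor that exposes the eventual hex digest as a Future<String>:

```java
// Sketch only: requires lucene-core and tika-parsers; HashStream,
// FutureField, and futureHash() are hypothetical names for the
// pieces described earlier in this post.
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.parser.ParsingReader;

public class Indexer {
    static void index(IndexWriter writer, Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        HashStream in = new HashStream(Files.newInputStream(file), digest);

        Document doc = new Document();
        // Order matters: the contents field comes first, so Lucene drains the
        // stream (completing the hash) before it ever reads the hash field.
        doc.add(new TextField("contents", new ParsingReader(in)));
        doc.add(new FutureField("hash", in.futureHash()));
        writer.addDocument(doc);
    }
}
```

Note that the single read through the file now feeds both the index and the hash, which is the entire point of the exercise.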

Now we run the benchmarks again — then jump for joy and go for a drink.

Josh Hight used to be a Creative Electron Organizer at ThinAir Labs. He now works at Rubrik. You can find his latest open source project at github.com/joshbooks/JoshDB. He will happily respond to virtually any question posted in the comments. The source code for this piece can be found at https://github.com/ThinAir/LuceneFutures