In-memory Search and Autocomplete with Lucene 8.5

Ekaterina Mihaylova
3 min read · Apr 27, 2020


Recently I had to implement in-memory search and autocomplete. I pulled the latest Lucene (8.5) and started writing what I remembered from before. When that didn’t work, I searched for useful tutorials. That didn’t work either, so I wrote something myself. This is what I learned along the way.

It turns out that things have changed quite a bit in Lucene 8. RAMDirectory, previously the go-to class for in-memory indexing, is deprecated; apparently it was overused. MMapDirectory is recommended instead. If you read a bit more, you will find the abstract class FSDirectory and its three subclasses: SimpleFSDirectory, NIOFSDirectory and MMapDirectory. The documentation, however, says:

Unfortunately, because of system peculiarities, there is no single overall best implementation. Therefore, we’ve added the open(java.nio.file.Path) method, to allow Lucene to choose the best FSDirectory implementation given your environment, and the known limitations of each implementation. For users who have no reason to prefer a specific implementation, it's best to simply use open(java.nio.file.Path)
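If a disk-backed index is acceptable, that factory method really is all you need. A minimal sketch, where the temp-directory path is my own placeholder rather than anything from the original code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.FSDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws IOException {
        // Placeholder location; a real application would use its own index path.
        Path indexPath = Files.createTempDirectory("lucene-index");

        // open() lets Lucene pick the best FSDirectory implementation for the
        // platform (usually MMapDirectory on 64-bit JVMs).
        try (FSDirectory directory = FSDirectory.open(indexPath)) {
            System.out.println(directory.getClass().getSimpleName());
        }
    }
}
```

The concrete class you get back depends on the OS and JVM, which is exactly why the Javadoc suggests not choosing one yourself.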

All of these implementations also need a Path to a directory on disk to store the index, and I really wanted a purely in-memory index. Enter ByteBuffersDirectory. Its documentation carries two warnings:

Note that MMapDirectory is nearly always a better choice as it uses OS caches more effectively (through memory-mapped buffers). A heap-based directory like this one can have the advantage in case of ephemeral, small, short-lived indexes when disk syncs provide an additional overhead.

WARNING: This API is experimental and might change in incompatible ways in the next release.

Still, it was good enough for me. Here is the indexing part:

// Index that uses stopwords and stemming (via BulgarianAnalyzer) for better search results
ByteBuffersDirectory directory = new ByteBuffersDirectory();

try (IndexWriter directoryWriter = new IndexWriter(directory,
        new IndexWriterConfig(new BulgarianAnalyzer()))) {

    // Reading the input data from the csv
    .....
    for (CSVRecord line : icdParser) {
        Document doc = new Document();
        doc.add(new StoredField(ID, line.get(0)));
        doc.add(new TextField(DESCRIPTION, line.get(1), Field.Store.YES));
        directoryWriter.addDocument(doc);
    }
}

DirectoryReader indexReader = DirectoryReader.open(directory);
dirSearcher = new IndexSearcher(indexReader);

And the search part:

QueryParser parser = new QueryParser(DESCRIPTION, new BulgarianAnalyzer());
Query query = parser.parse(keyword);
TopDocs topDocs = dirSearcher.search(query, 10);
List<ICDEntity> icdEntities = new ArrayList<>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document document = dirSearcher.doc(scoreDoc.doc);
    ICDEntity icdEntity = new ICDEntity();
    icdEntity.setId(document.get(ID));
    icdEntity.setDescription(document.get(DESCRIPTION));
    icdEntities.add(icdEntity);
}

return icdEntities;

Now on to the autocomplete part. The tutorials I found tell you to collect all the terms in your index, build a second index with an additional filter that adds every token substring, and then search that. With the API changes, getting all the terms out of an index is no longer a trivial task. Fortunately, Lucene ships a suggester, so you don’t need to worry about too many implementation details. Here is the code for the index:

public void initialize() throws IOException {
    // The autocomplete analyzer needs to be simple so the suggestions appear
    // exactly as they do in the text.
    Analyzer autocompleteAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenFilter filter = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, filter);
        }
    };
    ByteBuffersDirectory autocompleteDirectory = new ByteBuffersDirectory();

    try (IndexWriter autocompleteDirectoryWriter = new IndexWriter(autocompleteDirectory,
            new IndexWriterConfig(autocompleteAnalyzer))) {

        // Reading the input data from the csv
        ...
        for (CSVRecord line : icdParser) {
            Document autocompleteDoc = new Document();
            autocompleteDoc.add(new TextField(DESCRIPTION, line.get(1), Field.Store.YES));
            autocompleteDirectoryWriter.addDocument(autocompleteDoc);
        }
    }

    // Using Lucene's suggester for the autocomplete functionality
    buildAnalyzingSuggester(autocompleteDirectory, autocompleteAnalyzer);
}

public void buildAnalyzingSuggester(Directory autocompleteDirectory, Analyzer autocompleteAnalyzer)
        throws IOException {
    DirectoryReader sourceReader = DirectoryReader.open(autocompleteDirectory);
    LuceneDictionary dict = new LuceneDictionary(sourceReader, DESCRIPTION);
    analyzingSuggester = new AnalyzingSuggester(autocompleteDirectory, "autocomplete_temp",
            autocompleteAnalyzer);
    analyzingSuggester.build(dict);
}

I used AnalyzingSuggester, but there are other suggesters worth a look in the org.apache.lucene.search.suggest.analyzing package.
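For example, FuzzySuggester from that package can be wired up exactly like buildAnalyzingSuggester above, and it also matches prefixes containing a small typo. A sketch under the assumption that you pass in the same Directory, Analyzer and field name as in the code above (the class and method names here are my own):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
import org.apache.lucene.store.Directory;

public class FuzzySuggesterExample {
    // Same wiring as buildAnalyzingSuggester, but FuzzySuggester tolerates
    // one edit by default for typed prefixes of length >= 3.
    public static FuzzySuggester buildFuzzySuggester(Directory dir, Analyzer analyzer,
            String field) throws IOException {
        FuzzySuggester suggester = new FuzzySuggester(dir, "autocomplete_fuzzy_temp", analyzer);
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            suggester.build(new LuceneDictionary(reader, field));
        }
        return suggester;
    }
}
```

Lookups work the same way as with AnalyzingSuggester, so the suggestTermsFor method below would not need to change.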

Here is the search part of the autocompleter:

public List<String> suggestTermsFor(String term) throws IOException {
    List<Lookup.LookupResult> lookup = analyzingSuggester.lookup(term, false, 5);
    List<String> suggestions = lookup.stream()
            .map(result -> result.key.toString())
            .collect(Collectors.toList());

    return suggestions;
}

The libraries used:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-suggest</artifactId>
    <version>8.5.0</version>
</dependency>

And you can find the code here.
