Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform applications.
Lucene is a library that lets you index textual data (Word and PDF documents, emails, web pages, tweets, etc.) and add search capabilities to your application. Lucene performs two main steps:
- Create an index of documents you want to search.
- Parse query, search index, return results.
Lucene uses an inverted index (a mapping from a term to its metadata). This metadata includes information such as which documents contain the term and the number of occurrences. The fundamental units of indexing in Lucene are the Document and Field classes:
- A Document is a container that contains one or more Fields.
- A Field stores the terms we want to index and search on. It stores a mapping of a key (name of the field) and a value (value of the field that we find in the content).
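To make the Document and Field relationship concrete, here is a minimal sketch in plain Java. It does not use Lucene's own classes: `SimpleDocument` and `SimpleField` are hypothetical stand-ins that only model the "container of name/value pairs" idea described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a Lucene Field: a name paired with a value.
final class SimpleField {
    final String name;
    final String value;

    SimpleField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical stand-in for a Lucene Document: a container of fields.
final class SimpleDocument {
    private final List<SimpleField> fields = new ArrayList<>();

    // Adds a field and returns this document so calls can be chained.
    SimpleDocument add(String name, String value) {
        fields.add(new SimpleField(name, value));
        return this;
    }

    // Returns the value of the first field with the given name, or null.
    String get(String name) {
        for (SimpleField f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    // Read-only view of all fields in insertion order.
    List<SimpleField> fields() {
        return Collections.unmodifiableList(fields);
    }
}
```

For example, `new SimpleDocument().add("Name", "Chuck Norris")` builds a document with one field, and `get("Name")` returns its value.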
Here is a diagram describing the steps Lucene takes when indexing content (Source: Lucene in Action, Figure 2.1).
In order to index documents of various types, Lucene needs the text of each document extracted into a format that it can parse. Apache Tika is one framework that parses documents and extracts their text content.
Analysis filters and cleans up the text data. The text goes through several steps (for example: extracting words, removing common (stop) words, and lowercasing words) that convert it into tokens that can be added to the index. The picture to the right shows the indexing process, which results in an inverted index being stored on the underlying filesystem. See below for an example of an inverted index.
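The three steps named above (extracting words, lowercasing, removing stop words) can be sketched in plain Java. This is a simplified illustration, not Lucene's analyzer implementation. The class name `AnalysisSketch`, the splitting rule, and the tiny stop-word list are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {
    // A tiny illustrative stop-word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Splits text into words, lowercases them, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are not letters or digits.
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // a leading separator yields an empty first element
            }
            String token = word.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, jumps, over, lazy, dog]
        System.out.println(analyze("The quick brown fox jumps over the lazy dog."));
    }
}
```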
Lucene uses an inverted index data structure for storing the Fields we want to search on. An inverted index uses the tokens as the lookup key to find the documents that contain that token. It maps the content to its location. The index can be physically stored as part of a Directory (either in a file system or in memory).
Below is an example of an inverted index. Logically, this represents the result of the indexing process.
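As a rough illustration of the structure, the following plain-Java sketch builds a token-to-document-IDs map from a few short strings. It is not how Lucene stores its index on disk. The class name `InvertedIndexSketch`, the sample sentences, and the simplified tokenization are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // Maps each token to the sorted set of IDs of documents that contain it.
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Simplified analysis: lowercase, then split on non-word characters.
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String token : text.split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene is a search library",   // document 0
                "Solr is built on Lucene",      // document 1
                "Search the index");            // document 2
        // Print one "token -> [document IDs]" line per token, alphabetically.
        build(docs).forEach((token, ids) -> System.out.println(token + " -> " + ids));
    }
}
```

Running it prints one line per token, including `lucene -> [0, 1]` and `search -> [0, 2]`: the token is the key, and the value lists the documents in which it occurs.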
Once our documents are indexed, we need to add search functionality. All queries of the index go through the IndexSearcher. Given a search expression, we create a QueryParser, parse the expression into a Query, and search the index for results. The results are returned as a TopDocs object containing ScoreDocs, each of which holds the document ID and relevance score of a result that matches the query. The fundamental classes for searching are:
- IndexSearcher: Provides “read-only” access to the index. Exposes several search methods that take a Query object and return the top n “best” hits as a TopDocs result. This class is the counterpart to the IndexWriter class used for creating and updating indexes.
- Term: Basic unit for searching, and the counterpart to the Field object used in indexing. We create a Field when indexing (for example, “Name” : “Chuck Norris”) and use Terms in a TermQuery when searching. A Term contains the same mapping from the name of the field to the value.
- Query: Lucene provides several types of Queries, including TermQuery, BooleanQuery, PrefixQuery, WildcardQuery, PhraseQuery, and FuzzyQuery. Each type of query provides a unique way of searching the index.
- QueryParser: Parses a human-readable query (for example, “opower AND arlington”) into a Query object that can be used for searching.
- TopDocs: Container for pointers to the top N search results. Each ScoreDoc it holds contains a document ID and a relevance score.
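The search flow above can be sketched in plain Java as a lookup against an inverted index followed by ranking. This is an illustration only, not Lucene's API or scoring model. `SearchSketch` and `Hit` are hypothetical names, the AND semantics mimic a query such as “opower AND arlington”, and the score is a deliberately naive sum of term frequencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SearchSketch {
    // Hypothetical analogue of a ScoreDoc: a document ID plus its score.
    static final class Hit {
        final int docId;
        final int score;
        Hit(int docId, int score) { this.docId = docId; this.score = score; }
        @Override public String toString() { return "doc=" + docId + " score=" + score; }
    }

    // term -> (docId -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Indexes one document using simplified lowercase/split analysis.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new HashMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
    }

    // Returns up to n documents that contain ALL terms, best score first.
    List<Hit> search(List<String> terms, int n) {
        Map<Integer, Integer> scores = null;
        for (String term : terms) {
            Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
            if (docs == null) {
                return new ArrayList<>(); // a missing term means nothing matches
            }
            if (scores == null) {
                scores = new HashMap<>(docs);
            } else {
                Map<Integer, Integer> merged = new HashMap<>();
                for (Map.Entry<Integer, Integer> e : scores.entrySet()) {
                    Integer tf = docs.get(e.getKey());
                    if (tf != null) {
                        merged.put(e.getKey(), e.getValue() + tf); // doc has both
                    }
                }
                scores = merged;
            }
        }
        List<Hit> hits = new ArrayList<>();
        if (scores != null) {
            scores.forEach((docId, score) -> hits.add(new Hit(docId, score)));
        }
        // Sort by score descending, breaking ties by ascending document ID.
        hits.sort((a, b) -> a.score != b.score
                ? Integer.compare(b.score, a.score)
                : Integer.compare(a.docId, b.docId));
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        SearchSketch index = new SearchSketch();
        index.add(0, "opower office in arlington");
        index.add(1, "arlington opower arlington");
        index.add(2, "lucene search library");
        // prints [doc=1 score=3, doc=0 score=2]
        System.out.println(index.search(Arrays.asList("opower", "arlington"), 10));
    }
}
```

In Lucene itself, the QueryParser and Query classes take the place of the hand-built term list here, and the IndexSearcher computes a far more refined relevance score.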
Here is an overview of the process described above:
This concludes a high-level introduction to Apache Lucene. In future posts, I will explore Solr and give an example of using Lucene in a real application. The inspiration for this series is derived from a meetup of the Washington D.C. Hadoop Users Group in which Douglas Cutting spoke about Lucene.
McCandless, Michael; Hatcher, Erik; Gospodnetić, Otis (2010). Lucene in Action, Second Edition. Manning.