Lucene in a Nutshell

Alperen Kürsat Özkan
adessoTurkey
Jan 3, 2023

What is Lucene.NET?

Apache Lucene is an open-source information retrieval software library that provides indexing and search technologies for documents, images, and other types of content. Lucene.NET is a version of Lucene written in the C# programming language and specifically designed for use with the .NET framework.

Lucene.NET is a powerful tool for indexing and searching large collections of text-based documents. It provides a range of features, including full-text search, faceted search, and geospatial search. It is also highly customizable and can be integrated into a wide variety of applications, including search engines, content management systems, and e-commerce platforms.

Lucene.NET is a widely used and highly respected information retrieval library and a popular choice for developers who need to add search functionality to their .NET-based applications. It is actively maintained by a large and dedicated community of contributors, and new features and improvements are regularly released.

What’s the search algorithm used in Lucene?

The exact search algorithm used in Lucene depends on the type of query being executed. However, most search algorithms in Lucene use some variant of the vector space model, which is a mathematical model for representing the relevance of a document to a given search query.

In the vector space model, each document and query is represented as a vector in a multidimensional space, where each dimension corresponds to a word (or “term”) that appears in the document or query. The value of each dimension is determined by the term’s importance (or “weight”) in the document or query. The similarity between a document and a query is then calculated from the angle between the document and query vectors in this space, typically as the cosine of that angle.

For example, suppose we have a document with the text “The quick brown fox” and a query with the text “quick brown”. Using the term order (“the”, “quick”, “brown”, “fox”) for the dimensions, the document and query vectors would be as follows:

Document vector: [1, 1, 1, 1]

Query vector: [0, 1, 1, 0]

The similarity between the document and query vectors can then be calculated using the cosine similarity formula:

similarity = (1 * 0 + 1 * 1 + 1 * 1 + 1 * 0) / (sqrt(1 * 1 + 1 * 1 + 1 * 1 + 1 * 1) * sqrt(0 * 0 + 1 * 1 + 1 * 1 + 0 * 0))

= (0 + 1 + 1 + 0) / (2 * sqrt(2))

= 1 / sqrt(2) ≈ 0.707

The larger the value of similarity, the more similar the document and query are considered to be. Here the similarity is 1/sqrt(2) ≈ 0.707 on a scale from 0 to 1, which matches intuition: the query shares two of the document’s four terms, so it is a reasonable, though not perfect, match.
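The same calculation can be sketched in a few lines of Python. This is just the textbook cosine formula, not Lucene’s actual scoring code, which folds in additional weighting and normalization factors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Dimensions: ["the", "quick", "brown", "fox"]
doc_vector = [1, 1, 1, 1]    # "The quick brown fox"
query_vector = [0, 1, 1, 0]  # "quick brown"

print(round(cosine_similarity(doc_vector, query_vector), 3))  # → 0.707
```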

There are many variations on the vector space model, and different search algorithms in Lucene may use different variations. For example, some algorithms may use more sophisticated methods for calculating term weights, such as the tf-idf (term frequency-inverse document frequency) method, which takes into account the relative frequency of a term both in the document and in the whole corpus of documents being searched. Additionally, some algorithms may use more complex methods for combining the scores for different query terms, such as Boolean operators or query expansion.
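As an illustration of term weighting, here is a minimal Python sketch of the textbook tf-idf formula; Lucene’s real scoring formula differs in its details and has changed across versions:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: term frequency times inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "quick", "brown", "cat"],
    ["the", "quick", "red", "fox"],
]

# "the" appears in every document, so its weight collapses to zero,
# while the rare term "cat" gets a positive weight.
print(tf_idf("the", corpus[1], corpus))  # → 0.0
print(tf_idf("cat", corpus[1], corpus))  # positive weight for the rarer term
```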

Overall, the search algorithm used in Lucene is designed to provide fast and accurate results for a wide range of search queries.

How the inverted index works

An inverted index (sometimes called a reversed index) is a type of index that maps words or other values to the documents that contain them, rather than mapping documents to the words they contain. This makes searching fast and efficient, because you can quickly look up which documents contain a given word or value.

For example, suppose you have a collection of documents that contain the following text:

Document 1: “The quick brown fox”

Document 2: “The quick brown cat”

Document 3: “The quick red fox”

To create a reversed index for these documents, you would first need to tokenize the text, which means splitting it into individual words or “tokens”. For example, the text “The quick brown fox” might be split into the tokens: “The”, “quick”, “brown”, and “fox”.

Next, you would need to create an entry in the index for each unique token, along with a list of the documents that contain the token. For example, the index might contain the following entries:

“The”: [1, 2, 3]

“quick”: [1, 2, 3]

“brown”: [1, 2]

“fox”: [1, 3]

“cat”: [2]

“red”: [3]

Once the index has been created, you can use it to quickly look up which documents contain a given word or phrase. For example, to find which documents contain the word “quick”, you would simply look up the “quick” entry in the index and return the list of documents that contain it: [1, 2, 3].
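The index-building and lookup steps above can be sketched in a few lines of Python. A real analyzer would also lowercase tokens and remove stop words, which this sketch skips:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "The quick brown cat",
    3: "The quick red fox",
}

# Build the inverted index: token -> list of ids of documents containing it
index = defaultdict(list)
for doc_id, text in docs.items():
    for token in text.split():  # tokenize by whitespace
        if doc_id not in index[token]:
            index[token].append(doc_id)

# Lookup is now a single dictionary access
print(index["quick"])  # → [1, 2, 3]
print(index["fox"])    # → [1, 3]
```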

Inverted indexes are commonly used in search engines and other systems that need to quickly look up which documents contain a given word or value. They are an important part of the underlying technology that allows search engines to quickly and efficiently return relevant results for search queries.

Some built-in methods Lucene.NET provides

● Lucene.Net.Store.Directory — The Directory is a base class that provides an abstract view of a directory. Two implementations are packaged with Lucene.NET: FSDirectory works with a file system directory to store the index, and RAMDirectory is an in-memory directory that you can use to store the index. You can inherit from the Directory class to implement your own custom directory object for storing the index.

● Lucene.Net.Analysis.Analyzer — The analyzer is a base class that is responsible for breaking the text down into single words or terms, and removing any noise words, or what Lucene.NET calls stop words; stop words include “and”, “a”, “the”, etc. For now, we will just use the StandardAnalyzer class, as it’s a very good first choice. You can pass in a list of your own stop words to the constructor of the StandardAnalyzer as a string array. Using the default constructor will use the default list of stop words. You can inherit from the analyzer to implement a custom way to handle the documents that are to be indexed.

● Lucene.Net.Index.IndexWriter — The IndexWriter coordinates the analyzer and writes the results to the directory for storage. During the creation of the index, the writer will create some files in the directory. When we add documents to the index writer, it uses the analyzer to break down each of the fields and finds a place to store the indexed document in the directory. After a session of indexing documents, it is encouraged that you optimize the index, which compacts it into a less resource-intensive form. Note that it is not recommended to call Optimize for every document you add to the index, just once after an indexing session, if you can. In the IndexWriter’s constructor, we specify true to create a new index; to add more documents to an existing index, you would specify false, to avoid overwriting it.

● Lucene.Net.Documents.Document — The Document class is what gets indexed by the IndexWriter. You can think of a Document as an entity that you want to retrieve; a Document could represent an email, or a web page, or a recipe, or even a CodeProject article.

● Lucene.Net.Documents.Field — The document contains a list of fields that are used to describe the document. Every field has a name and a value. Each of the field’s values contains the text that you want to make searchable. The other parts of the field’s constructor contain instructions for how to handle an individual field. The Field.Store instructions tell the IndexWriter that you want to store the field’s value inside the index, so later the value can be retrieved and acted upon, like showing the data to the user in the search results or storing an identifier value like the primary key of the object that this field’s document represents.

Other instructions are the Field.Index values, which tell the IndexWriter how to index the field (if at all). Possible values include Field.Index.TOKENIZED, meaning that the string is broken down by the analyzer supplied to the IndexWriter and made searchable. Another option is Field.Index.UN_TOKENIZED, which still indexes the field, but as a whole value that is not broken down by the analyzer. The difference between storing and indexing a value is that storing lets you retrieve the original value back from the index, while indexing is what makes the value searchable.

How to Use

To use Lucene.NET’s search functionality, we first need to create documents and store them in an index in a given directory.

There are many field types that we can use in Lucene.NET; more information can be found in the Lucene.NET documentation:

https://lucenenet.apache.org/docs/4.8.0-beta00005/api/Lucene.Net/Lucene.Net.Documents.FieldType.html

Here is a simple example of how to use Lucene.Net to index and search a set of documents:

1. First, you will need to install the Lucene.Net NuGet package. You can do this by opening the NuGet Package Manager in Visual Studio and searching for “Lucene.Net”.

2. After installing the Lucene.Net and Lucene.Net.Analysis packages, create the documents you want to index. In the snippet below, CreateDocument is assumed to be a small helper (not shown) that builds a Document with “id”, “title”, and “text” fields.

// Create a list of documents to index
List<Document> documents = new List<Document>();
documents.Add(CreateDocument("1", "Lucene in Action", "Lucene in Action is a great book about Lucene."));
documents.Add(CreateDocument("2", "Lucene for Dummies", "Lucene for Dummies is a great book for beginners."));
documents.Add(CreateDocument("3", "Lucene in Action, Second Edition", "The second edition of Lucene in Action is even better than the first."));

3. Create a new instance of the Lucene.Net.Store.Directory class, which represents the location where the index will be stored. There are several different types of directory that you can use, such as RAMDirectory for in-memory storage or FSDirectory for storing the index on the file system.

// Create a directory to store the index
Directory indexDirectory = FSDirectory.Open(new DirectoryInfo("index-directory"));

4. Create a new instance of the Lucene.Net.Analysis.Standard.StandardAnalyzer class, which will be used to tokenize the text in your documents.

// Create an analyzer to process the text
StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

5. Create a new instance of the Lucene.Net.Index.IndexWriter class, passing in the directory and analyzer as parameters. This will allow you to add documents to the index.

// Create an index writer
var indexWriter = new IndexWriter(indexDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

6. Afterwards, you can add documents to the index using the IndexWriter.

// Add the documents to the index
foreach (Document doc in documents)
{
    indexWriter.AddDocument(doc);
}

// Close the index writer
indexWriter.Close();

7. To search the index, open an IndexReader on the directory and wrap it in a Lucene.Net.Search.IndexSearcher. Then build a query — below, a simple TermQuery; for free-text input you could instead use the Lucene.Net.QueryParsers.QueryParser class, passing in the field that you want to search and the analyzer. Finally, call the IndexSearcher.Search method, passing in the query as a parameter. This will return a set of results.

// Create an index reader
IndexReader indexReader = IndexReader.Open(indexDirectory, true);

// Create an index searcher
var indexSearcher = new IndexSearcher(indexReader);

// Search for the term "lucene"
var query = new TermQuery(new Term("text", "lucene"));
TopDocs topDocs = indexSearcher.Search(query, 10);

// Iterate over the results
foreach (ScoreDoc scoreDoc in topDocs.ScoreDocs)
{
    Document doc = indexSearcher.Doc(scoreDoc.Doc);
    Console.WriteLine("{0}: {1}", doc.Get("id"), doc.Get("title"));
}

// Close the index reader and directory
indexReader.Close();
indexDirectory.Close();

The Search method returns a Lucene.Net.Search.TopDocs object, which contains the search results.

You can check out this repository to see a more advanced implementation:

https://github.com/adessoTurkey-dotNET/LucInANutshell

It uses reflection to store objects of given types as documents and to map retrieved documents back to those types.

References

https://lucenenet.apache.org/contributing/index.html

https://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
