Implementing full-text search in Apache Pinot

Siddharth Teotia
Apache Pinot Developer Blog
Jun 9, 2020 · 15 min read

Apache Pinot is a real-time distributed OLAP datastore, built to deliver scalable real time analytics with low latency.

Pinot supports super-fast query processing through its indexes on non-BLOB-like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index. However, arbitrary text search queries cannot leverage indexes and require a full table scan.

In this post, we will discuss newly added support for text indexes in Pinot and how they can be used for efficient full-text search queries.

Let’s take a few examples to understand this better.

Exact match with scan

SELECT COUNT(*) FROM MyTable WHERE firstName = 'John'

In the above query, we are doing an exact match on the firstName column, which doesn't have an index. The execution engine finds the matching docIds (aka rowIds) by first looking up the dictionaryId for 'John' in the dictionary and then scanning the forward index, comparing the dictionaryId stored for each row against it.
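
To make this concrete, here is a minimal sketch of the scan path in Java, using simplified, hypothetical structures rather than Pinot's actual classes: the dictionary is consulted once, then the forward index is scanned row by row.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ScanExactMatchExample {
  // Illustrative sketch: evaluate firstName = 'John' with a full scan.
  // 'dictionary' maps raw values to dictionaryIds; 'forwardIndex' holds the
  // dictionaryId of every row (the docId is the array index).
  static List<Integer> scanExactMatch(Map<String, Integer> dictionary,
                                      int[] forwardIndex, String value) {
    List<Integer> matchingDocIds = new ArrayList<>();
    Integer dictId = dictionary.get(value);
    if (dictId == null) {
      return matchingDocIds; // value absent from this segment
    }
    // Full scan: one fixed-width integer comparison per row
    for (int docId = 0; docId < forwardIndex.length; docId++) {
      if (forwardIndex[docId] == dictId) {
        matchingDocIds.add(docId);
      }
    }
    return matchingDocIds;
  }
}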

Exact match with inverted index

If there is an inverted index on the firstName column, the dictionaryId will be used to look up the inverted index instead of scanning the forward index.

Exact match with sorted index

If the table is sorted on the firstName column, we use the dictionaryId to look up the sorted index and get the start and end docIds of all rows that have the value 'John'.
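
To see why the sorted path is fast, here is a small illustrative sketch: because all rows with the same value are contiguous in a sorted column, two binary searches are enough to find the [start, end] docId range.

class SortedIndexLookupExample {
  // Illustrative sketch: find the contiguous docId range for a dictionaryId
  // in a sorted column (no scan, just two binary searches).
  static int[] sortedRange(int[] sortedForwardIndex, int dictId) {
    int start = lowerBound(sortedForwardIndex, dictId);
    if (start == sortedForwardIndex.length || sortedForwardIndex[start] != dictId) {
      return new int[] {-1, -1}; // no matching rows
    }
    int end = lowerBound(sortedForwardIndex, dictId + 1) - 1;
    return new int[] {start, end};
  }

  // Index of the first element >= key
  private static int lowerBound(int[] a, int key) {
    int lo = 0, hi = a.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (a[mid] < key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}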

The following graph shows latencies for exact match queries with and without index on a dataset of 500 million rows and selectivity (number of rows that passed the filter) of 150 million.

Text search with scan

What if the user is interested in doing arbitrary text search? Pinot currently supports this through the in-built function REGEXP_LIKE.

SELECT COUNT(*) FROM MyTable WHERE REGEXP_LIKE(firstName, 'John*')

The predicate is a regular expression (a prefix query). Unlike exact matches, indexes can't be used to evaluate the regex filter, so we resort to a full table scan. For every raw value, pattern matching is done with the regex 'John*' to find the matching docIds.

Text search with index

For arbitrary text data, which falls into BLOB/CLOB territory, we need more than exact matches. Users are interested in regex, phrase and fuzzy queries on BLOB-like data. As we just saw, REGEXP_LIKE is very inefficient since it uses a full-table scan. It also doesn't support fuzzy searches.

In Pinot 0.3.0, we added support for text indexes to efficiently do arbitrary text search on STRING columns where each column value is a BLOB of heterogeneous text. Standard filter operations (equality, range, between, in) don't fit the bill for textual data.

Text search can be done in Pinot using the new in-built function TEXT_MATCH.

SELECT COUNT(*) FROM Foo 
WHERE TEXT_MATCH(<column_name>, <search_expression>)

With support for text indexes, let’s compare the performance of text search query with and without index on a dataset of 500 million rows and filter selectivity of 150 million.

Text Indexing Problem

Like other database indexes, the goal of a text index is efficient filter processing to improve query performance.

To support regular expression, phrase and fuzzy searches efficiently, the text index data structure needs a few fundamental building blocks to store key pieces of information.

Dictionary

The dictionary maps each indexed term (word) to a dictionaryId, allowing efficient comparisons over fixed-width integer codes instead of raw values.

Inverted Index

The inverted index maps the dictionaryId of each indexed term to the set of docIds containing that term. Exact match term queries (single or multiple terms) can be answered efficiently through the dictionary and inverted index.

Position Information

Phrase queries (e.g., find documents matching the phrase "machine learning") are an extension of exact term queries where the terms must appear in the same order in matching documents. These queries need position information along with the dictionary and inverted index.
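
In Lucene terms, this is what a position-dependent phrase query looks like (a minimal example; the field name "text" is an assumption, not necessarily what Pinot uses):

import org.apache.lucene.search.PhraseQuery;

class PhraseQueryExample {
  // Matches documents where "machine" is immediately followed by "learning".
  // Evaluating this requires term position information in the index.
  static final PhraseQuery MACHINE_LEARNING =
      new PhraseQuery("text", "machine", "learning");
}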

Automata for regex and fuzzy queries

Regex queries (including prefix and wildcard) and fuzzy queries require comparing the pattern against each and every term in the dictionary unless the prefix is fixed. There has to be a way to prune these comparisons.

If we can represent the input regular expression as a deterministic finite state machine that accepts a set of terms, then we can use the state machine in conjunction with the dictionary to get the dictionaryIds of all the matching terms it accepts.

Fuzzy edit distance search can also be done efficiently by representing the query as a state machine based on Levenshtein automata and intersecting the automata with the dictionary.
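
Lucene exposes exactly this machinery: RegexpQuery and FuzzyQuery compile the pattern into an automaton that is intersected with the term dictionary, so only candidate terms are examined. A minimal example (field name and terms are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.RegexpQuery;

class AutomatonQueryExample {
  // Compiled into a deterministic automaton and intersected with the
  // term dictionary, pruning terms that can never match.
  static final RegexpQuery PREFIX =
      new RegexpQuery(new Term("text", "john.*"));

  // Levenshtein automaton accepting terms within edit distance 2 of "machne"
  static final FuzzyQuery FUZZY =
      new FuzzyQuery(new Term("text", "machne"), 2);
}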

As discussed earlier, Pinot's dictionary and inverted index can answer exact match term queries efficiently. However, phrase, regex, wildcard, prefix and fuzzy queries require position information and finite state automata, which are currently not maintained in Pinot.

We learned that Apache Lucene has the necessary missing pieces and decided to use it for supporting full-text search in Pinot until we enhance our index structures.

Creating Text Indexes in Pinot

Let’s discuss creation of Lucene text indexes in Pinot as a series of key design decisions and challenges.

Text index per column

Pinot’s table storage format is columnar. Index structures (forward, inverted, sorted, dictionary) for the table are also created on a per column, per segment (shard) basis. For text index, we decided to stick with this fundamental design for the following reasons:

  • Evolution and maintenance are easier. A user has the freedom to enable or disable text indexing on each column of the table.
  • Our performance experiments revealed that creating a single global index in Lucene across all text-index-enabled columns of the table hurts performance. A global index is larger than a per-column index, which increases the search time.

Text Index Format

Like other indexes, a text index is created as part of Pinot segment creation. For each row in the table, we take the value of the column that has text indexing enabled and encapsulate it in a document.

The document comprises two fields:

  • Text field — contains the actual column value (docValue) representing the body of text that should be indexed.
  • Stored field — contains a monotonically increasing docId counter to reverse map each document indexed in Lucene back to its docId (rowId) in Pinot. This field is not tokenized or indexed; it is simply stored inside Lucene (see the sketch below).
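
Here is a sketch of how such a document might be assembled with Lucene's field types (the field names "text" and "docId" are illustrative, not necessarily the ones Pinot uses internally):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

class TextDocumentExample {
  static Document buildDocument(String columnValue, int pinotDocId) {
    Document doc = new Document();
    // Text field: tokenized and indexed, but not stored
    doc.add(new TextField("text", columnValue, Field.Store.NO));
    // Stored field: not tokenized or indexed, only stored, so each Lucene
    // document can be mapped back to its Pinot docId
    doc.add(new StoredField("docId", pinotDocId));
    return doc;
  }
}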

Storing Pinot DocId in Lucene Document

The stored field is critical. For each document added to the text index, Lucene assigns a monotonically increasing docId to the document. Later the search operation on the index returns a list of matching docIds.

A Lucene index is composed of multiple independent sub-indexes called segments (not to be confused with Pinot segments). Each Lucene sub-index is an independent, self-contained index. Depending on the size of the indexed data and how often in-memory documents are flushed to the on-disk representation, a single text index can consist of multiple sub-indexes.

The key thing to note here is that Lucene's internal docIds are relative to each sub-index. This can lead to situations where a document added to the text index for a given row in the Pinot table does not have the same Lucene docId as its Pinot docId.

For a query that has a text search filter, this would lead to incorrect results, since our execution engine (filter processing, index lookups, etc.) is built around docIds. So we need to uniquely associate each document added to the Lucene index with its corresponding Pinot docId. This is why a StoredField is used as the second field in the document.

Text Analyzer

Plain text is used as input for index generation. An analyzer performs pre-processing steps on the provided input text:

  • Lower-casing the text
  • Breaking the text into indexable and searchable tokens/terms
  • Pruning stop words (a, an, the, or, etc.)

We currently use StandardAnalyzer, which is good enough for standard English alphanumeric text and uses the Unicode text segmentation algorithm to break text into tokens. The analyzer is also used during query execution when searching the text index.

Text Index Creation for both Offline and Real-time

Pinot supports ingesting and querying data in real-time. Text indexes are supported for offline, real-time and hybrid Pinot tables.

IndexWriter is used to create text indexes. It buffers the documents in memory and periodically flushes them to the on-disk Lucene index directory. However, the data is not visible to an IndexReader (used on the search query path) until the writer commits and closes the index, which fsyncs the index directory and makes the index data available to readers.
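
A sketch of this writer lifecycle for an offline segment (directory handling and document building are illustrative):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class TextIndexBuildExample {
  static void buildIndex(Iterable<Document> documents, String indexDir)
      throws IOException {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (Directory directory = FSDirectory.open(Paths.get(indexDir));
         IndexWriter writer = new IndexWriter(directory, config)) {
      for (Document doc : documents) {
        writer.addDocument(doc); // buffered in memory, flushed periodically
      }
      writer.commit(); // fsyncs the index directory; data now visible to readers
    } // close() seals the index for the offline segment
  }
}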

An IndexReader always looks at a point-in-time snapshot of the index (of committed data) as of the time the reader was opened from the index directory. This works well for offline tables, since offline Pinot segments don't serve queries until fully created and are immutable once created. The text index is created during Pinot segment generation and is ready to serve queries after the segment is fully built and loaded (memory mapped) on Pinot servers. Thus the text index reader on the query path always sees the entire data of a segment for offline tables.

However, the above approach does not work for real-time or hybrid Pinot tables, since these tables can be queried while data is still being consumed. This requires the ability to search the text index on the query path while the IndexWriter is in progress with uncommitted changes. The following sections discuss query execution for both offline and real-time in detail.

Querying Text Indexes in Pinot

We enhanced our query parser and execution engine with a new in-built function, text_match(), to be used in the WHERE clause of queries. The syntax is:

TEXT_MATCH(<columnName>, <searchExpression>)

  • columnName: Name of the column to do text search on.
  • searchExpression: search query in accordance with Lucene query syntax.

Let's take the example of a query log file and a resume file:

  • Store the query log and resume text in two STRING columns in a Pinot table.
  • Create text indexes on both columns.

We can now do different kinds of text analysis on the query log and resume data:

Count the number of GROUP BY queries that have a BETWEEN filter on timecol

SELECT count(*) FROM MyTable 
WHERE text_match(logCol, '\"timecol between\" AND \"group by\"')

Count the number of candidates that have “machine learning” and “gpu processing”

SELECT count(*) FROM MyTable 
WHERE text_match(resume, '\"machine learning\" AND \"gpu processing\"')

Please see the user docs for an extensive guide on different kinds of text search queries and how to write search expressions.

Creating Text Index Reader for Offline Pinot Segments

The text index is created in a directory by IndexWriter as part of Pinot segment generation. When Pinot servers load (memory map) the offline segment, we create an IndexReader that memory-maps the text index directory. An instance of IndexReader and IndexSearcher is created once per table segment, per column with a text index.

We chose to go with MMapDirectory instead of RAMDirectory since the former uses efficient memory mapped I/O and generates less garbage. RAMDirectory can be very efficient for small memory-resident indexes but increases the heap overhead significantly.
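
A sketch of the load path (the directory layout is illustrative):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

class TextIndexLoadExample {
  static IndexSearcher openSearcher(String indexDir) throws IOException {
    // Memory-mapped I/O: the OS page cache backs the index, so very little
    // garbage is generated on the heap
    IndexReader reader =
        DirectoryReader.open(new MMapDirectory(Paths.get(indexDir)));
    return new IndexSearcher(reader);
  }
}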

Text Filter Execution

The following diagram depicts segment-level execution for this text search query:

SELECT count(*) from Table 
WHERE text_match(textCol1, expression1)
AND text_match(textCol2, expression2)

Creating Text Index Reader for Realtime Pinot Segments

Text indexes in real-time Pinot segments can be queried while the data is being consumed. Lucene supports near-real-time (NRT) search by allowing a reader to be opened from a live writer, letting the reader see the writer's uncommitted index data. However, just like any other index reader in Lucene, the NRT reader is also a snapshot reader. So the NRT reader has to be reopened periodically to see the incremental changes made by the live index writer.

  • Our real-time text index reader also acts as a writer since it is both adding documents to the index as part of real-time segment consumption and being used by the Pinot query threads.
  • During Pinot server startup, we create a single background thread. The thread maintains a global circular queue of real-time segments across all tables.
  • The thread wakes up after a configurable threshold, polls the queue to get a real-time segment, and refreshes the index searcher of the real-time reader for each column that has a text index (see the sketch below).
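
Lucene's SearcherManager packages this reopen-from-a-live-writer pattern. The sketch below illustrates the refresh loop; it is not Pinot's exact implementation.

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

class NrtRefreshExample {
  // NRT: a reader opened from the live writer sees uncommitted changes
  // after each refresh
  static SearcherManager create(IndexWriter liveWriter) throws IOException {
    return new SearcherManager(liveWriter, null);
  }

  static void refreshLoop(SearcherManager searcherManager, long refreshMillis)
      throws IOException, InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      // Reopen the NRT reader if the live writer has new data
      searcherManager.maybeRefresh();
      Thread.sleep(refreshMillis); // the configurable threshold
    }
  }
}

Query threads then borrow a searcher with searcherManager.acquire() and return it with searcherManager.release() once the search completes.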

How often should the refresh happen?

The configurable threshold between successive refreshes by the background thread should be tuned based on requirements.

  • If the threshold is low, we refresh often and queries with text_match filter(s) on consuming segments will get to see the new rows quickly. The downside is lots of small I/Os since refreshing the text index reader requires a flush from the live writer.
  • If the threshold is high, we flush less often, which increases the lag between the time a row is added to the consuming segment's text index and the time it appears in the search results of a query with a text_match filter.
  • It is a trade-off between consistency and performance.

Key Optimizations

So far, we discussed how a text index is created and queried in Pinot, along with a few design decisions and challenges. Now, let's discuss the optimizations we implemented to get the desired functionality and performance.

Using Collector

For a search query, Lucene's default behavior is to do scoring and ranking. The result of the call to indexSearcher.search() is TopDocs, which represents the top N hits of the query sorted by descending score. In Pinot, we currently don't need any of the scoring and ranking features; we are simply interested in retrieving all the matched docIds for a given text search query.

Our initial experiments revealed that the default search code path in Lucene incurs significant heap overhead since it uses a PriorityQueue in TopScoreDocCollector. Secondly, the heap overhead grows with the number of matching documents.

We implemented the Collector interface to provide a simple callback to the indexSearcher.search(query, collector) operation. For every matching Lucene docId, Lucene calls our collector callback, which stores the docId in a bitmap.
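
Here is a sketch of such a collector against Lucene 8.x's SimpleCollector. It uses java.util.BitSet for simplicity; a production implementation would more likely collect into a compressed bitmap such as RoaringBitmap.

import java.util.BitSet;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Collects every matching Lucene docId into a bitmap, bypassing scoring,
// ranking and the TopScoreDocCollector priority queue.
final class DocIdCollector extends SimpleCollector {
  private final BitSet matchingDocIds = new BitSet();
  private int docBase; // docId offset of the current sub-index (leaf)

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) {
    matchingDocIds.set(docBase + doc); // leaf-relative id -> index-wide id
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES; // tells Lucene we never need scores
  }

  BitSet getMatchingDocIds() {
    return matchingDocIds;
  }
}

The search call then becomes indexSearcher.search(query, new DocIdCollector()), with the result bitmap read from the collector afterwards.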

Pruning Stop Words

Text documents are very likely to contain common English words like a, an, the, or, etc. These are known as stop words. Stop words are typically of little use in text search, but due to their high occurrence frequency, they can blow up the index size, which consequently hurts query performance. We can customize the Analyzer to apply custom token filters to the input text; the filtering process in the analyzer then prunes all the stop words while building the index.
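
For example, StandardAnalyzer accepts a custom stop word set; the set below is a small illustration, not the list Pinot ships with.

import java.util.Arrays;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class StopWordsExample {
  // Illustrative stop word set; a real deployment would use a fuller list
  static final CharArraySet STOP_WORDS =
      new CharArraySet(Arrays.asList("a", "an", "the", "or"), true);

  // The analyzer drops these tokens at both index and query time
  static final StandardAnalyzer ANALYZER = new StandardAnalyzer(STOP_WORDS);
}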

Using a pre-built mapping of Lucene docId to Pinot docId

As discussed above, there is a strong need to store Pinot docId in every document added to the Lucene index. This results in a two-pass query execution:

  • The search operation returns a bitmap of matching Lucene docIds.
  • Iterate over each docId to get the corresponding document, and retrieve the Pinot docId from the document.

Retrieving the entire document from Lucene was CPU-intensive and became a major bottleneck in throughput testing. To avoid it, we iterate over the text index once to fetch all <Lucene docId, Pinot docId> mappings and write them to a memory-mapped file.

Since the text index for offline segments is immutable, this works well: we pay the cost of retrieving the entire document just once, when the server loads the text index. The mapping file is later used during query execution by the collector callback to short-circuit the search path and directly construct a result bitmap of Pinot docIds.
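
A sketch of that one-time pass (assuming the stored field name "docId" from the earlier document sketch):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

class DocIdMappingExample {
  // One-time pass at segment load: materialize luceneDocId -> pinotDocId
  static int[] buildMapping(IndexSearcher searcher) throws IOException {
    IndexReader reader = searcher.getIndexReader();
    int[] luceneToPinot = new int[reader.maxDoc()];
    for (int luceneDocId = 0; luceneDocId < reader.maxDoc(); luceneDocId++) {
      // The only place we pay the cost of loading the stored document
      luceneToPinot[luceneDocId] =
          searcher.doc(luceneDocId).getField("docId").numericValue().intValue();
    }
    return luceneToPinot;
  }
}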

This optimization, along with pruning the stop words, gave us a 40-50x improvement in query performance, allowing latency to scale well with increasing QPS. The following graph compares the latency before and after this optimization.

Disable Lucene Query Result Cache

Lucene has a cache to boost performance for queries with repeatable text-search expressions. While the performance improvement is noticeable, the cache increases heap overhead. We decided to disable it by default and let users enable it (if need be) on a per-text-index basis.

Use compound file format

Lucene's on-disk index structures are stored in multiple files. Consider the case of 2000 table segments on a Pinot server, with each Pinot table segment having a text index on 3 columns and 10 files per text index. We are looking at 60,000 open file handles, and the system is very likely to run into the "too many open files" problem.

So the IndexWriter is configured to use Lucene's compound file format. Secondly, when the text index is fully built for a column, we force-merge the multiple Lucene sub-indexes (also referred to as segments in Lucene terminology) into a single index.
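
Both are plain Lucene configuration. A minimal sketch:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

class CompoundFormatExample {
  // The compound format is chosen up front on the writer config: the many
  // per-segment files get packed into a single .cfs file
  static IndexWriterConfig compoundConfig() {
    return new IndexWriterConfig(new StandardAnalyzer())
        .setUseCompoundFile(true);
  }

  // Once the column's text index is fully built, collapse all Lucene
  // sub-indexes into a single one before sealing the Pinot segment
  static void seal(IndexWriter writer) throws IOException {
    writer.forceMerge(1);
    writer.close();
  }
}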

Configure in-memory buffer threshold

As documents are added to the text index during Pinot segment generation, they are buffered in memory and periodically flushed to the on-disk structures in the index directory. The default Lucene behavior is to flush after memory usage reaches 16MB. We experimented with this value and made some observations:

  • A flush results in a Lucene segment. As more of these are created, Lucene can decide to merge some or all of them in the background. Having multiple such segments increases the number of files.
  • A default threshold of 16MB doesn't strictly mean the index writer will consume 16MB of heap before flushing. The actual consumption is much higher (around 100MB), presumably because in Java there is no good way to programmatically track the amount of heap memory used.
  • Smaller thresholds result in a large number of small I/Os as opposed to fewer big I/Os. We decided to keep this value configurable and chose 256MB as the default to keep a good balance between memory overhead and the number of I/Os (see the snippet below).
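
The corresponding knob is a single setting on the writer config; a minimal snippet with our 256MB default:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

class BufferConfigExample {
  // Lucene's own default is 16MB; we override it to 256MB
  static final IndexWriterConfig CONFIG =
      new IndexWriterConfig(new StandardAnalyzer())
          .setRAMBufferSizeMB(256);
}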

Additional Performance Numbers

We also ran micro-benchmarks to compare the execution time of text_match and regexp_like on a Pinot table with a single segment containing 1 million rows. Two different kinds of test data were used:

  • Log data: a STRING column in the Pinot table where each value is a log line from an Apache access log.
  • Non-log data: a STRING column in the Pinot table where each value is resume text.

The following graph shows that search queries using text index are significantly faster compared to scan based pattern matching.

Another evaluation was done against Pinot's native inverted index to understand when using a text index may not be the right solution.

  • White-space separated text can be stored as a multi-value STRING column in Pinot.
  • Pinot will create a dictionary and inverted index on this column.
  • If only exact term matches (using =, IN operators) are required, then text index is not the right solution. Pinot’s inverted index can do the exact term matches 5x faster than Lucene.
  • However, if a phrase, regex (including prefix and wildcard) or fuzzy search is needed, then the text index is the right choice, both functionally and performance-wise.

Upcoming Work

The pre-built mapping of Lucene docId to Pinot docId works for offline segments since the text index is immutable. For real-time consuming segments, this optimization is not applicable since the index changes while it serves queries. Optimizing the Lucene docId to Pinot docId translation for the real-time case is work in progress.

We are also fine-tuning the background refresh thread to work on a per-table or per-index basis. The current implementation has a single background thread managing all real-time segments and their text indexes.

Conclusion

In this blog post, we discussed how we leveraged Lucene to engineer the text search solution in Pinot to meet our functional and performance (QPS and latency) requirements. Please visit the user documentation of text search to learn more about using the feature.

If you’re interested in learning more about Apache Pinot, these resources are great places to get started.

Docs: http://docs.pinot.apache.org
Getting Started: https://docs.pinot.apache.org/getting-started

Special thanks

I would like to thank our Pinot OSS team for their relentless efforts to make Pinot better: Mayank Shrivastava, Jackie Jiang, Jialiang Li, Kishore Gopalakrishna, Neha Pawar, Seunghyun Lee, Subbu Subramaniam, Sajjad Moradi, Dino Occhialini, Anurag Shendge, Walter Huf, John Gutmann, our engineering manager Shraddha Sahay and SRE manager Prasanna Ravi. We would also like to thank the LinkedIn leadership Eric Baldeschwieler, Kapil Surlaker, and Igor Perisic for their guidance and continued support as well as Tim Santos for technical review of this article.
