Elasticsearch — Text fields analyzers

Eleonora Fontana · Betacom · Jan 11, 2021

Introduction

In Elasticsearch, when a new document is indexed, all textual values are analyzed so that they can be stored in the most efficient data structure for search. This process is carried out by analyzers.

In this article we will study Elasticsearch analyzers and learn how to customize them. By the end, you will also be able to create your own analyzer and use it for document indexing and query execution.

If you missed our previous articles about the Elasticsearch engine, take a look at the Betacom page.

Definitions

An analyzer is composed of three lower-level building blocks: character filters, tokenizers, and token filters.

A character filter transforms the original text by adding, deleting or changing characters before it reaches the tokenizer. An analyzer may have zero or more character filters (the default analyzers use none), which are applied in order. For instance, a character filter could be used to convert Hindu-Arabic numerals into their Arabic-Latin equivalents or to strip HTML elements.

A tokenizer receives the character filter output as a stream of characters, breaks it up into individual tokens and returns a stream of tokens. For example, the whitespace tokenizer would convert the text “I love dogs!” into the tokens [“I”, “love”, “dogs!”]. The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents. An analyzer must have exactly one tokenizer.

A token filter is used to add, remove or change tokens in the input token stream. A common example is the lowercase token filter, which converts all tokens to lowercase and is part of the default standard analyzer. Token filters are not allowed to change the position or character offsets of each token. An analyzer may have zero or more token filters, which are applied in order.

By using the Analyze API we can run the analysis process on a piece of text and inspect the resulting tokens. Such a test can be performed by specifying either an analyzer name or a combination of character filters, tokenizer and token filters:

GET _analyze
{
  "text": "...",
  "analyzer": "standard"
}

or

GET _analyze
{
  "text": "...",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}
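
As a quick sketch combining the three building blocks (the HTML snippet below is just an example input), we can strip the markup, split on whitespace and lowercase the result in a single call:

GET _analyze
{
  "text": "<p>I love DOGS!</p>",
  "char_filter": ["html_strip"],
  "tokenizer": "whitespace",
  "filter": ["lowercase"]
}

Here the html_strip character filter removes the <p> tags, the whitespace tokenizer splits the remaining text on spaces and the lowercase token filter produces [ “i”, “love”, “dogs!” ].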

The standard analyzer used in both the previous examples is the default one. It analyzes the sentence “I love dogs!” into the tokens [ “i”, “love”, “dogs” ]: terms are lowercased and punctuation is removed.
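
For instance, calling the Analyze API with the standard analyzer on “I love dogs!” returns one entry per token, together with its position and character offsets (response abbreviated):

{
  "tokens": [
    { "token": "i", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 },
    { "token": "love", "start_offset": 2, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "dogs", "start_offset": 7, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 }
  ]
}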

Inverted index

Now that we have covered the basics of text analysis in Elasticsearch, let’s take a look at what actually happens to the result, i.e. the tokens.

The data structure in which field values are saved depends on the data type. One of these data structures is the inverted index: it lists every unique word that appears in any document and identifies all of the documents each word occurs in. This clearly is a data structure that makes full-text search very efficient. An inverted index is created for each textual field of the documents.

Consider for instance the sentences “I love dogs” and “Micheal has two dogs” and suppose they were stored in two different documents. The inverted index for that text field will then be:

Term       Document #1   Document #2
i          x
love       x
dogs       x             x
micheal                  x
has                      x
two                      x

Imagine we would like to perform a search for the term “love”. Figuring out which documents contain that term corresponds to performing a simple lookup in the inverted index. Doing that, we can easily see that document #1 contains the term. That makes the process of searching for a term very easy and efficient.

The index is called “inverted” because the more intuitive mapping would go from documents to the terms they contain, i.e. the other way around. That direction doesn’t provide the fast lookups we need, which is why the relationship is inverted.
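
As a quick sketch (the index and field names here are just illustrative), indexing those two sentences and then searching for “love” looks like this:

PUT sentences/_doc/1
{
  "description": "I love dogs"
}

PUT sentences/_doc/2
{
  "description": "Micheal has two dogs"
}

GET sentences/_search
{
  "query": {
    "match": { "description": "love" }
  }
}

Only document #1 is returned, because the inverted index entry for the term “love” only lists that document.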

Built-in and custom analyzers

Elasticsearch ships with several built-in analyzers. Let’s take a look at them, using the example sentence “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”; each of them can be tried out with the Analyze API, as sketched right after the list.

  • The standard analyzer splits text into terms on word boundaries, removes most punctuation, lowercases terms and supports removing stop words. The example sentence will then be transformed into the following list of tokens: [ “the”, “2”, “quick”, “brown”, “foxes”, “jumped”, “over”, “the”, “lazy”, “dog’s”, “bone” ].
    A related analyzer is the stop analyzer, which behaves like the simple analyzer described below but also removes stop words. The example sentence will become [ “quick”, “brown”, “foxes”, “jumped”, “over”, “lazy”, “dog”, “s”, “bone” ].
  • The simple analyzer splits the text every time it encounters a non-letter character and lowercases all terms. The example sentence will be pre-processed as [ “the”, “quick”, “brown”, “foxes”, “jumped”, “over”, “the”, “lazy”, “dog”, “s”, “bone” ].
  • The whitespace analyzer divides text into terms whenever it encounters any whitespace character and does not lowercase terms. The example sentence will be analyzed as [ “The”, “2”, “QUICK”, “Brown-Foxes”, “jumped”, “over”, “the”, “lazy”, “dog’s”, “bone.” ].
  • The keyword analyzer is used for the keyword fields and accepts whatever text it is given and outputs the exact same text as a single term. It is also called the “noop” analyzer since no operation is performed. The example sentence will then be [ “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.” ]
  • The pattern analyzer uses a regular expression to split the text into terms. The default regex is \W+, which matches sequences of non-word characters. It also supports lower-casing and stop words. The example sentence will be transformed into [ “the”, “2”, “quick”, “brown”, “foxes”, “jumped”, “over”, “the”, “lazy”, “dog”, “s”, “bone” ].
  • Elasticsearch also provides many language-specific analyzers such as english and french. Please check the official documentation for more info.
  • The fingerprint analyzer is a specialist analyzer which creates a fingerprint that can be used for duplicate detection. The input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stop-words list is configured, stop words will also be removed. The example sentence will be represented as [ “2 bone brown dog’s foxes jumped lazy over quick the” ].
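
Each of these analyzers can be tried out directly with the Analyze API, for example with the whitespace analyzer:

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}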

It is also possible both to define new analyzers and to reconfigure existing ones. The latter actually corresponds to creating a new analyzer that keeps the characteristics of the original one, extended with the new settings:

PUT indexName
{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzerName": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
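
We can then verify the reconfigured analyzer by running the Analyze API against the index (reusing the placeholder names above):

GET indexName/_analyze
{
  "analyzer": "analyzerName",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}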

To create a new analyzer instead:

PUT indexName
{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzerName": {
          "type": "custom",
          ...
        }
      }
    }
  }
}

The custom analyzer accepts the following parameters (a combined example is shown after the list):

  • tokenizer, a built-in or customized tokenizer;
  • char_filter, an optional array of built-in or customized character filters;
  • filter, an optional array of built-in or customized token filters;
  • position_increment_gap, a numerical value to prevent most phrase queries from matching across the values (see position_increment_gap for more details).
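
Putting these parameters together, a possible custom analyzer could look like the following sketch (the index, analyzer and field names are just examples):

PUT indexName
{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzerName": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "analyzerName"
      }
    }
  }
}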

Query execution and analyzers

When writing a query, we can use the analyzer parameter to specify a search analyzer. If provided, this overrides any other search analyzers. Indeed, at search time, Elasticsearch determines which analyzer to use by checking the following parameters in order:

  1. the analyzer parameter in the search query,
  2. the search_analyzer mapping parameter for the field,
  3. the analysis.analyzer.default_search index setting,
  4. the analyzer mapping parameter for the field.

If none of these parameters are specified, the standard analyzer is used.
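
For instance, the search_analyzer mapping parameter (point 2 above) can be set per field when the index is created (index and field names are illustrative):

PUT indexName
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}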

In the following example we will use the keyword analyzer to perform a search query:

GET /analyzer_test/_search
{
  "query": {
    "match": {
      "description": {
        "query": "that",
        "analyzer": "keyword"
      }
    }
  }
}

Please note that usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index.

We should also pay attention to how the documents are indexed. If we are not careful, we could, for example, lose stop words. Another issue can arise if we change an analyzer while the index already contains documents: even though those documents were indexed in different ways, queries will only use the latest version of the analyzer, so we could get strange and misleading results. Such an issue can be solved by re-indexing the existing documents with the Update By Query API, so that they are analyzed again with the updated analyzer.
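
In practice, after updating the analyzer settings, the existing documents can be re-processed with a call like the following (reusing the analyzer_test index from the earlier example; conflicts=proceed prevents the operation from aborting on version conflicts):

POST analyzer_test/_update_by_query?conflicts=proceed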

Conclusion

Analyzers are the components Elasticsearch uses to process text fields. You should now be able to create, modify and reference them at index, field and query level.

In the next article we will go through queries and learn how to write them. Don’t miss it and remember to follow our publication page!
