Full-Text Search with Elasticsearch: Techniques for Effective Search

André Coelho
4 min read · Jun 6, 2024


Understanding Full-Text Search in Elasticsearch

Full-text search is one of the most powerful features of Elasticsearch. Unlike traditional searches that require exact term matches, full-text search finds documents containing terms that are similar or contextually related to the query. This is crucial for applications where natural language and term variations are common, such as e-commerce sites, content platforms, and document management systems.

In Elasticsearch, full-text search is primarily performed on fields of type text. The content of these fields is analyzed at index time: the text is broken down into tokens (or terms), and those tokens are indexed to make them searchable. This process, known as analysis, involves tokenization followed by the application of token filters.

Analysis and Tokenization

When a document is indexed, Elasticsearch analyzes its text with the configured analyzer. The default standard analyzer only tokenizes and lowercases the text; stop word removal and stemming are added through extra token filters or by using a language analyzer such as english. A typical English analysis pipeline performs the following steps:

  1. Tokenization: Breaks the text into individual terms (tokens).
  2. Lowercasing: Converts all terms to lowercase for uniformity.
  3. Stop Word Removal: Removes common words that do not significantly contribute to relevance, such as “and,” “the,” “of.”
  4. Stemming: Reduces terms to their root forms to group variations of a word.

For example, the sentence "The quick brown fox jumps over the lazy dog" might be reduced to the tokens ["quick", "brown", "fox", "jump", "over", "lazy", "dog"]: "the" is dropped as a stop word and "jumps" is stemmed to "jump".
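
If you want to see exactly which tokens an analyzer produces, the _analyze API is a convenient check. The sketch below uses the built-in english analyzer (which combines lowercasing, English stop words, and stemming) purely as an illustration; the response lists each token with its position and offsets.

POST /_analyze
{
  "analyzer": "english",
  "text": "The quick brown fox jumps over the lazy dog"
}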

Techniques for Improving Search Relevance

1. Boosting

Boosting is a technique for increasing the relevance of certain fields or terms in a query. In Elasticsearch, you can apply a boost to a specific field to give it more weight in the search. For example, if you want to give more relevance to the document title compared to the content, you can configure the query as follows:

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "quick fox",
      "fields": ["title^2", "content"]
    }
  }
}

In this example, the title field is given a boost of 2, so matches in title contribute twice as much to the relevance score as matches in content.
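
If you need finer control than a field-level boost, the same idea can be expressed with a bool query that boosts individual clauses. This is only a sketch: note that should clauses add their scores together, whereas multi_match with its default best_fields type takes the score of the best-matching field.

GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":   { "query": "quick fox", "boost": 2 } } },
        { "match": { "content": { "query": "quick fox" } } }
      ]
    }
  }
}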

2. Fuzziness

Fuzziness allows for approximate matches, which is useful for handling typos or spelling variations. When using fuzzy matching, you can specify the level of approximation allowed. In Elasticsearch, this is configured as follows:

GET /my_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "foks",
        "fuzziness": "AUTO"
      }
    }
  }
}

Here, "fuzziness": "AUTO" lets Elasticsearch choose the maximum edit distance based on the length of the search term: very short terms must match exactly, while longer terms tolerate one or two edits.
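
When AUTO is too permissive or too strict for your data, the edit distance can be pinned explicitly. The values below are illustrative rather than a recommendation: fuzziness: 1 allows a single edit, prefix_length: 1 requires the first character to match exactly (which also reduces the number of candidate terms), and max_expansions caps how many term variants are considered.

GET /my_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "foks",
        "fuzziness": 1,
        "prefix_length": 1,
        "max_expansions": 20
      }
    }
  }
}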

3. Phrases and Slop

Phrase search allows you to search for exact sequences of terms. However, to allow some flexibility in the order of words or inclusion of intermediate words, you can use the slop parameter. For example:

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "brown fox",
        "slop": 1
      }
    }
  }
}

In this example, slop: 1 allows for one intermediate word between "brown" and "fox."
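
The slop value counts how many position moves are tolerated. As a rough rule, swapping two adjacent terms costs two moves, so with a slop of 2 the sketch below should still match documents containing "brown fox" even though the terms are reversed in the query:

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "fox brown",
        "slop": 2
      }
    }
  }
}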

Handling Synonyms, Stop Words, and Stemming

Synonyms

Synonyms are important for broadening a search, allowing different terms with similar meanings to be treated as equivalent. In Elasticsearch, this is configured with a synonym token filter. You can point the filter at a synonyms file or list the synonyms inline in the index settings:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick, fast",
            "lazy, slow"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym_filter"]
        }
      }
    }
  }
}

Then, apply the analyzer to the relevant field:

PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "synonym_analyzer"
    }
  }
}
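
With this mapping in place, the same analyzer runs at both index and search time, so a query for one synonym should also match documents that only contain the other. A minimal sketch, assuming documents have already been indexed into my_index (you can also run the _analyze API with synonym_analyzer to confirm that "quick" expands to both tokens):

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "fast"
    }
  }
}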

Stop Words

Stop words are common words that are dropped during analysis to improve efficiency and relevance. They are configured on the analyzer, as shown below:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

You can also define a custom list of stop words:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["and", "the", "of"]
        }
      }
    }
  }
}
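
A quick way to confirm the stop word list is active is to run a sample sentence through the analyzer; with the custom list above, only the non-stop-word tokens should come back:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the quick and the lazy"
}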

Stemming

Stemming reduces terms to their roots, helping to group variations of the same word. Elasticsearch uses a stemmer filter, configured like this:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        }
      }
    }
  }
}

Applying this analyzer to the content field:

PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
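
To verify the stemmer is doing its job, you can pass a few inflected forms through the analyzer; with the light_english stemmer the expectation is that they collapse to a common root such as "jump":

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "jumps jumping jumped"
}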

Conclusion

Full-text search in Elasticsearch is a powerful and flexible tool, enabling efficient and relevant searches across large volumes of textual data. With techniques such as boosting, fuzziness, and phrase search, along with the use of synonyms, stop words, and stemming, you can optimize your queries to better meet user needs. Proper configuration of these elements will ensure a more precise and satisfying search experience, significantly improving the effectiveness of your application.


André Coelho

Developer of web and mobile systems. Enthusiast of automation and electronics, and a hobbyist musician.