Handling search results in Elasticsearch

Eleonora Fontana · Betacom · Mar 8, 2021 · 8 min read

Introduction

Welcome to the last article of the Elasticsearch series!

In this article, we will discuss how to improve the results of an Elasticsearch query. We will start by explaining how to handle the results and then move to more advanced techniques to refine the queries we learnt in the previous articles.

Controlling query results

In this section we will explain how to adjust the query results. As you will learn, we can handle them in different ways.

First of all, it is possible to change the result format by adding the format=yaml parameter to our request in order to get the results in a more readable format:

GET /recipe/_search?format=yaml
{
  "query": {
    ...
  }
}

Another way to get pretty, legible output is to add the ?pretty parameter to the request.

Secondly, you may need to look only at some fields of your documents. You can do so by specifying them in the _source field of the request:

GET indexName/_search
{
  "query": {
    ...
  },
  "_source": ...
}

The _source field accepts the following values:

  • false means that no fields will be returned,
  • a string containing the field we want,
  • an array of fields,
  • an object of the form {“includes”: array or string containing the fields we need, “excludes”: array or string containing the fields we don’t need}.
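For example, assuming the recipe documents have title and servings fields (the field names here are just illustrative), we could keep only those in the response:

GET /recipe/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["title", "servings"]
}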

We can control how many documents a query returns in the hits array in two ways:

  • by adding "size": N in the query body,
  • by adding the ?size=N parameter to the request.

The default size is set to 10.

We can look at later pages of the results using an offset, i.e. by specifying how many hits to skip before returning results. It can be done in two ways:

  • by adding "from": N in the query body,
  • by adding the ?from=N parameter to the request.

Combining size and from together, we can handle results pagination. The following formulas are applied to compute the total number of pages and the value of the from parameter at application time:

total_pages = ceil(total_hits / page_size)
from = page_size * (page_number - 1)

Be careful: there is a maximum limit of 10,000 results (the sum of from and size, controlled by the index.max_result_window setting), which is necessary to preserve cluster stability.

Dealing with pagination while documents are being added, modified or deleted by others can be tricky. Unlike a database cursor, Elasticsearch does not freeze the result set at query time, so documents may shift between pages from one request to the next.
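As a sketch, here is a request for the third page with ten results per page, where from = 10 * (3 - 1) = 20:

GET /recipe/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "from": 20
}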

You can sort results by adding "sort": ["fieldName"] in the query body. The default order is ascending, except when sorting on _score, where it is descending. Each hit in the response will then contain a sort key with the values used for sorting. It is possible to specify multiple sorting criteria:

"sort": [
{"fieldName1": "desc"},
{"fieldName2": "asc"}
]

If the field is a date, Elasticsearch will use the milliseconds since the epoch for sorting.

You can also sort by multi-value fields, such as arrays, but you need to specify how:

"sort": [
{
"fieldName": {
"order": "...", # desc or asc
"mode": "..." # avg, min, max, median or sum
}
}
]

An aggregation is performed on the array and then the documents are sorted.
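For example, assuming each recipe has a numeric ratings array (a hypothetical field), we could sort by the average rating:

GET /recipe/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "ratings": {
        "order": "desc",
        "mode": "avg"
      }
    }
  ]
}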

You can add filters but, as you should already know, they are not considered in the relevance score evaluation.

Proximity searches

The match_phrase query looks for the strings we pass it in the order we pass them. Let’s create a new index “sauce” using the commands available here. For example, if we search this new index for the phrase “spicy sauce”, we will not get the “Spicy Tomato Sauce” match, because “tomato” appears between the two query terms. This leads to a need to relax the search.

Remember that strings are analyzed and the resulting tokens are stored in the inverted index. The same thing happens with query phrases. Each token’s position is stored as well.
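We can verify this with the Analyze API. For example, running the standard analyzer on one of the titles shows the tokens together with their positions:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Spicy Tomato Sauce"
}

The response contains the tokens “spicy”, “tomato” and “sauce” with positions 0, 1 and 2, which is why the phrase “spicy sauce” does not match this document.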

Something we can do is specify the number of words that may appear between the words of our phrase. This is done via the slop parameter. Such a query is called a proximity search.

GET /sauce/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "spicy sauce",
        "slop": 1
      }
    }
  }
}

Internally, permutations are performed on the query words. Each move of a term by one position costs 1, and the total cost cannot be greater than the specified slop. The next table shows what happens with the “Tomato Sauce (spicy)” document, whose tokens are “tomato”, “sauce” and “spicy” at positions 1, 2 and 3:

Step                  spicy   sauce   Cumulative cost
Initial positions     1       2       0
Move “spicy” to 2     2       2       1
Move “spicy” to 3     3       2       2

After two moves the query terms align with the document (“sauce” at 2, “spicy” at 3), so a slop of at least 2 is needed for this document to match.

Proximity influences the score, so documents where the terms are closer together rank higher. This means you can safely use a slop greater than the one you actually need.

Another way to relax the query is to specify that it is not necessary that all words appear in the document, since the score handles such a situation. We can do so by combining the match_phrase and match queries into a bool one, as shown in the sketch below:

  • must → match
  • should → match_phrase + slop
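A minimal sketch of such a query, reusing the “spicy sauce” example from above:

GET /sauce/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "spicy sauce" } }
      ],
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "spicy sauce",
              "slop": 1
            }
          }
        }
      ]
    }
  }
}

The must clause only requires the words to appear somewhere in the title, while the should clause boosts documents where they appear close together as a phrase.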

Fuzzy match query

In Elasticsearch it is possible to handle typos in different ways. The most common is the fuzziness parameter from the match query:

GET /sauce/_search
{
  "query": {
    "match": {
      "title": {
        "query": "delici0us",
        "fuzziness": "auto"
      }
    }
  }
}

Internally the Levenshtein distance is evaluated. The distance between the words w₁ and w₂ is given by the minimum number of single-character edits (insertion, deletion or substitution) needed to turn w₁ into w₂. For example d(“delicious”, “delici0us”) = 1.

The values accepted into the fuzziness parameter are the following:

  • “auto”, which lets Elasticsearch determine the threshold automatically,
  • an integer representing the maximum Levenshtein distance allowed for the query.

The “auto” fuzziness is computed based on the following rule:

  • terms of 1 or 2 characters must match exactly (fuzziness 0),
  • terms of 3 to 5 characters allow 1 edit,
  • terms longer than 5 characters allow 2 edits.

Please note that the fuzziness cannot be greater than 2, for two reasons:

  • studies show that around 80% of typos can be corrected with a single edit,
  • the bigger the fuzziness, the worse the performance.

There is also a more complex version of the Levenshtein distance: the Damerau-Levenshtein distance also takes into account transpositions of adjacent characters. It is possible to disable transpositions by adding "fuzzy_transpositions": false to the query.
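For example, “suace” is within distance 1 of “sauce” thanks to the transposition of “u” and “a”; with transpositions disabled it would require two substitutions and, with an auto fuzziness of 1 for a five-letter term, would no longer match. A sketch:

GET /sauce/_search
{
  "query": {
    "match": {
      "title": {
        "query": "suace",
        "fuzziness": "auto",
        "fuzzy_transpositions": false
      }
    }
  }
}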

Another way to handle typos is the fuzzy query:

GET /sauce/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "Delici0us",
        "fuzziness": "auto"
      }
    }
  }
}

Since the fuzzy query is a term-level query, its input is not analyzed. Thus the best choice to handle typos is the match query with its fuzziness parameter. Indeed, if we look for “DELICIOUS” via the fuzzy query, we will get zero results: the query is not analyzed, and each change from an upper case character to a lower case one costs one in terms of Damerau-Levenshtein distance.

Synonyms

As we already explained in a previous article, we can create custom analyzers and link them to an index at creation time. In particular, let’s see how to create a new index with an analyzer whose filter chain contains the synonyms we could encounter in the documents:

PUT /synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_test": {
          "type": "synonym",
          "synonyms": [
            "awful => terrible",
            "awesome => great, super",
            "elasticsearch, logstash, kibana => elk",
            "weird, strange"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_test"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

This means that for example the word “awful” will always be replaced with “terrible” by the analyzer.
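We can check this behavior by running the analyzer directly:

POST /synonyms/_analyze
{
  "analyzer": "my_analyzer",
  "text": "awesome"
}

The response contains the tokens “great” and “super” (at the same position) instead of “awesome”.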

Please note that filters are case sensitive, thus you should place a lowercase filter before the synonyms one. In general, the synonyms filter should be placed before stemming filters and as early as possible in the chain, right after the lowercase filter.

You can specify the synonyms in a txt file and load them from it:

PUT /synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_test": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_test"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

The file should have the same structure as the inline synonyms, e.g. word1 => word2, word3 with one rule per line. You can add comments using the # character.
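A minimal synonyms.txt could look like this:

# comments start with the # character
awful => terrible
awesome => great, super
weird, strange

Note that synonyms_path is resolved relative to the Elasticsearch config directory, unless an absolute path is given.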

Always be careful when working with synonyms. If you index documents containing the word “awful” and only afterwards add the synonym awful => terrible, the already indexed documents will not pick it up: they were analyzed when the synonym was not defined yet, so the token stored in the inverted index is still “awful”. The right thing to do is to re-index the documents, for example with the Update By Query API.
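A sketch of such a call, which re-indexes every document of the synonyms index in place so that the new analysis chain is applied:

POST /synonyms/_update_by_query
{
  "query": {
    "match_all": {}
  }
}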

Highlight matches

We can highlight the words that match a query by adding the highlight parameter among the top-level ones of the request:

# Adding a test document
PUT /highlighting/_doc/1
{
  "description": "Let me tell you a story about Elasticsearch. It's a full-text search engine that is built on Apache Lucene. It's really easy to use, but also packs lots of advanced features that you can use to tweak its searching capabilities. Lots of well-known and established companies use Elasticsearch, and so should you!"
}

# Highlighting matches within the "description" field
GET /highlighting/_search
{
  "_source": false,
  "query": {
    "match": { "description": "Elasticsearch story" }
  },
  "highlight": {
    "fields": {
      "description": {}
    }
  }
}

By default, words are highlighted using the “unified” highlighter.

The result will have a field highlight containing an array of fragments of the document that match the query. Fragments are used since the field values could be very long:

"highlight" : {
"description" : [
"Let me tell you a <em>story</em> about <em>Elasticsearch</em>.",
"Lots of well-known and established companies use <em>Elasticsearch</em>, and so should you!"
]
}

The highlighted parts are placed between “<em>” and “</em>”, which stand for emphasis and make it easy to show highlights at the application level. We can change these tags with the pre_tags and post_tags parameters inside the highlight object.
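For example, to wrap matches in <strong> tags instead:

GET /highlighting/_search
{
  "_source": false,
  "query": {
    "match": { "description": "Elasticsearch story" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "description": {}
    }
  }
}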

Stemmed words and synonyms are highlighted as well.

Conclusion

There are different ways to refine Elasticsearch queries and to handle their results. As always, we recommend trying them out and practicing to improve your skills.

Remember to subscribe to the Betacom publication 👆 and give us some claps 👏 if you enjoyed the article!
