Favouring full matches in Elasticsearch

How to get the most relevant matches on top while keeping a long tail of less — but still — relevant results

The problem

To illustrate the problem, assume we have two Elasticsearch indices: one contains a list of some painters and the other of composers.

To try the examples yourself, you can run the following bulk index request on your Elasticsearch instance:

POST /_bulk
{"index": {"_index": "painters"}}
{"name": "Vincent van Gogh"}
{"index": {"_index": "painters"}}
{"name": "Rembrandt van Rijn"}
{"index": {"_index": "painters"}}
{"name": "Frans Hals"}
{"index": {"_index": "painters"}}
{"name": "Johann Adam Ackermann"}
{"index": {"_index": "painters"}}
{"name": "Piet Mondriaan"}
{"index": {"_index": "painters"}}
{"name": "Claude Monet"}
{"index": {"_index": "painters"}}
{"name": "Jackson Pollock"}
{"index": {"_index": "painters"}}
{"name": "Andy Warhol"}
{"index": {"_index": "painters"}}
{"name": "Frida Kahlo"}
{"index": {"_index": "painters"}}
{"name": "Johannes Vermeer"}
{"index": {"_index": "painters"}}
{"name": "Leonardo da Vinci"}
{"index": {"_index": "painters"}}
{"name": "Pieter Breugel"}
{"index": {"_index": "composers"}}
{"name": "Johann Sebastian Bach"}
{"index": {"_index": "composers"}}
{"name": "Johann Christoph Bach"}
{"index": {"_index": "composers"}}
{"name": "Johann Ambrosius Bach"}
{"index": {"_index": "composers"}}
{"name": "Clara Schumann"}

Now let’s search both indices for persons with the name ‘Johann Sebastian Bach’:

GET /painters,composers/_search
{
  "query": {
    "match": {
      "name": "johann sebastian bach"
    }
  }
}

Which returns:

{
  ...
  "hits" : {
    ...
    "hits" : [
      {
        "_index" : "painters",
        "_type" : "_doc",
        "_id" : "xIxFCWsB7VA-_8kTLQH0",
        "_score" : 1.9334917,
        "_source" : {
          "name" : "Johann Adam Ackermann"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "zYxFCWsB7VA-_8kTLQH1",
        "_score" : 1.8485742,
        "_source" : {
          "name" : "Johann Sebastian Bach"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "zoxFCWsB7VA-_8kTLQH1",
        "_score" : 0.6877716,
        "_source" : {
          "name" : "Johann Christoph Bach"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "z4xFCWsB7VA-_8kTLQH1",
        "_score" : 0.6877716,
        "_source" : {
          "name" : "Johann Ambrosius Bach"
        }
      }
    ]
  }
}

Whoa, what’s happening here? Contrary to our expectations, Bach is not the topmost search result. Although we have a literal match for ‘Johann Sebastian Bach’ (score ~1.85) in our composers index, painter ‘Johann Adam Ackermann’, who only matches a fraction of our search query, scores higher (~1.93)!

Please explain yourself

As always, we can ask Elasticsearch for an explanation:

GET /painters,composers/_search?explain=true
{
  "query": {
    "match": {
      "name": "johann sebastian bach"
    }
  }
}

This gives us a hint as to what’s going on: unlike us humans, Elasticsearch has no knowledge of the name ‘Johann Sebastian Bach’ as a coherent unit, so it searches for each term separately.

  1. First, a tokenizer breaks the query up into three terms: johann OR sebastian OR bach.
  2. Then, Elasticsearch searches for each term separately.
  3. Finally, it calculates the total scores by combining the scores for each term.

So Elasticsearch runs an OR query by default. That is, at least one of the search terms must match, but they don’t all have to. This explains why Johann Adam Ackermann is included in the results even though only one term (‘Johann’) matches our query.
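
To make the OR behaviour tangible, here is a toy model in Python. It is only a sketch: the `analyze` function is an assumed stand-in for the analyzer (lowercase, split on whitespace), `or_match` is a made-up helper, and the list holds just a handful of the names indexed above.

```python
# Toy model of a default (OR) match query; not how Lucene actually works.
NAMES = [
    "Johann Adam Ackermann",
    "Johannes Vermeer",
    "Johann Sebastian Bach",
    "Johann Christoph Bach",
    "Johann Ambrosius Bach",
    "Clara Schumann",
]


def analyze(text):
    """Assumed stand-in for the analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def or_match(query, names):
    """Return every name that shares at least one term with the query."""
    terms = set(analyze(query))
    return [name for name in names if terms & set(analyze(name))]


hits = or_match("johann sebastian bach", NAMES)
print(hits)
```

Ackermann ends up in `hits` because of the single shared term ‘johann’, while Johannes Vermeer does not: ‘johannes’ is a different term.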

Inverse document frequency bites us

But this doesn’t yet answer the question why Ackermann ranks higher than Bach. What makes Ackermann more relevant? This has to do with how Elasticsearch calculates relevance: it relies on a scoring formula from the TF/IDF family (BM25 by default in recent versions). It’s the IDF (Inverse Document Frequency) part that bites us: for a given search term, the more documents it appears in, the less relevant it is considered to be. So terms that occur in many documents have a lower weight. In general, this makes sense: if you search for ‘the well-tempered clavier’, you’re not interested in all documents that contain common terms such as ‘the’ but only in the few that mention claviers, preferably of the well-tempered kind.

If you look at the data above, you’ll see that both indices combined contain four Johanns and three Bachs. So that still makes Bach the more unique and therefore relevant term, doesn’t it? Unfortunately not, because:

Each field has its own inverted index and thus, for TF/IDF purposes, the value of the field is the value of the document.

That is to say, we need to differentiate on the field level, not (only) on the index level. (Even though the field in both indices is called name, the fact that they belong to two different indices makes them two fields.) In this case, we have only one Johann (Ackermann) in the painters.name field and three in composers.name. That makes ‘johann’ a rare, heavily weighted term among the painters and a common, lightly weighted one among the composers, which indeed raises the relevance of the painter above that of the composer (Johann Sebastian Bach).
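
We can check this reasoning with a back-of-the-envelope calculation. The sketch below scores each index separately with a BM25-style formula. It rests on several assumptions that are not stated in the responses above: the common BM25 defaults (k1 = 1.2, b = 0.75), the variant of the formula that keeps the (k1 + 1) factor, one shard per index, and a simplified analyzer that lowercases and splits on whitespace. The function names are made up for this example.

```python
import math

PAINTERS = [
    "Vincent van Gogh", "Rembrandt van Rijn", "Frans Hals",
    "Johann Adam Ackermann", "Piet Mondriaan", "Claude Monet",
    "Jackson Pollock", "Andy Warhol", "Frida Kahlo",
    "Johannes Vermeer", "Leonardo da Vinci", "Pieter Breugel",
]
COMPOSERS = [
    "Johann Sebastian Bach", "Johann Christoph Bach",
    "Johann Ambrosius Bach", "Clara Schumann",
]

K1, B = 1.2, 0.75  # assumed BM25 parameters


def analyze(text):
    """Assumed stand-in for the analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, names):
    """Score the names of ONE index; statistics are never shared across indices."""
    docs = [analyze(name) for name in names]
    doc_count = len(docs)
    avg_len = sum(len(doc) for doc in docs) / doc_count
    scores = {}
    for name, doc in zip(names, docs):
        score = 0.0
        for term in analyze(query):
            freq = doc.count(term)
            if freq == 0:
                continue
            # Number of documents in this index that contain the term.
            doc_freq = sum(1 for other in docs if term in other)
            idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
            tf = freq * (K1 + 1) / (freq + K1 * (1 - B + B * len(doc) / avg_len))
            score += idf * tf
        if score > 0:
            scores[name] = score
    return scores


painter_scores = bm25_scores("johann sebastian bach", PAINTERS)
composer_scores = bm25_scores("johann sebastian bach", COMPOSERS)
for name, score in {**painter_scores, **composer_scores}.items():
    print(f"{score:.4f}  {name}")
```

Under these assumptions, Ackermann comes out at roughly 1.93 and Johann Sebastian Bach at roughly 1.85, in line with the scores in the response above. The single ‘johann’ among twelve painters simply outweighs three terms that are all fairly common among four composers.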

Solution 1: match only full results

As we saw above, Elasticsearch combines the terms using OR by default. An obvious solution, then, is to tell Elasticsearch to match all search terms. You can do so by changing the operator to AND:

GET /painters,composers/_search
{
  "query": {
    "match": {
      "name": {
        "query": "johann sebastian bach",
        "operator": "and"
      }
    }
  }
}

Voilà, we get only one result and it’s JSB. Done?

Well, what if the user confuses Bach with that other famous composer, and searches for johann van bach instead? That query now returns zero results (because van is nowhere to be found in Bach’s name), which is a bit too harsh on our users.

Solution 2: favour full results

We can solve this by replacing the custom operator with minimum_should_match:

GET /painters,composers/_search
{
  "query": {
    "match": {
      "name": {
        "query": "johann sebastian bach",
        "minimum_should_match": "2<75%"
      }
    }
  }
}

2<75% means that:

  • if you supply two search terms or fewer (e.g., johann bach), they must all match (johann AND bach);
  • but if you supply more than two terms (e.g., johann van bach), only 75% of them (rounded down) must match. With three terms that is two, which comes down to (johann AND van) OR (van AND bach) OR (johann AND bach).
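
The arithmetic behind this rule can be sketched in a few lines of Python. The helper `required_matches` is hypothetical and only understands the single `N<P%` form with a positive percentage used here; Elasticsearch itself accepts more variants.

```python
def required_matches(spec, term_count):
    """Return how many terms must match for a spec such as '2<75%'.

    Only the simple 'N<P%' form with a positive percentage is handled.
    """
    threshold, percentage = spec.split("<")
    if term_count <= int(threshold):
        # At or below the threshold, every term is required.
        return term_count
    # Above the threshold, the percentage applies, rounded down.
    return term_count * int(percentage.rstrip("%")) // 100


for count in (1, 2, 3, 4):
    print(count, "terms:", required_matches("2<75%", count), "must match")
```

For three terms this yields two required matches, which is why the two other Bachs stay in the results while Ackermann, who matches only ‘johann’, drops out.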

Now our search results contain only the three Bachs, with Johann Sebastian ranked highest:

{
  ...
  "hits" : {
    ...
    "hits" : [
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "zYxFCWsB7VA-_8kTLQH1",
        "_score" : 1.8485742,
        "_source" : {
          "name" : "Johann Sebastian Bach"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "zoxFCWsB7VA-_8kTLQH1",
        "_score" : 0.6877716,
        "_source" : {
          "name" : "Johann Christoph Bach"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "z4xFCWsB7VA-_8kTLQH1",
        "_score" : 0.6877716,
        "_source" : {
          "name" : "Johann Ambrosius Bach"
        }
      }
    ]
  }
}

However, this removes painter Johann Ackermann, whom we might want to return as a partial match for ‘johann’, from our search results.

By Ian McFegan

Solution 3: favour full results the right way

So both changing the operator to AND (solution 1) and supplying a minimum_should_match (solution 2) are too strict. A better and more flexible solution is to favour full results but still include a long tail of partial matches.

The way to go about this is to make use of how the should clause works in compound bool queries. The should keyword resembles a regular OR but differs in one important respect: the more clauses match, the more relevant the document is considered to be.

This makes it very natural to specify our favoured results as a set of should clauses. We start by wrapping our original query in a bool query:

GET /painters,composers/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "johann sebastian bach"
          }
        }
      ]
    }
  }
}

While this doesn’t yet change our search results, it opens up the way to add more should clauses. So, building on this, we can consider a document to be more relevant if it contains the combination of the terms in the same order as the query. In Elasticsearch parlance, this is a phrase match. Simply add a should clause for that:

GET /painters,composers/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "johann sebastian bach"
          }
        },
        {
          "match_phrase": {
            "name": "johann sebastian bach"
          }
        }
      ]
    }
  }
}
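
What counts as a phrase match can be illustrated with a small Python sketch. It models the default behaviour, where the query terms must appear next to each other and in the same order. The `analyze` and `phrase_match` helpers are made up for this example, and the analyzer is again simplified to lowercasing and splitting on whitespace.

```python
def analyze(text):
    """Assumed stand-in for the analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def phrase_match(query, name):
    """True if the query terms occur in the name, adjacent and in the same order."""
    wanted, terms = analyze(query), analyze(name)
    return any(
        terms[start:start + len(wanted)] == wanted
        for start in range(len(terms) - len(wanted) + 1)
    )


for name in ("Johann Sebastian Bach", "Johann Christoph Bach", "Johann Adam Ackermann"):
    print(name, "->", phrase_match("johann sebastian bach", name))
```

Only Johann Sebastian Bach satisfies the phrase clause, so only he collects the extra score it contributes.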

Finally, we can add the minimum_should_match variant back as a third should clause, so that documents matching most of the search terms get an extra boost:

GET /painters,composers/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "johann sebastian bach"
          }
        },
        {
          "match_phrase": {
            "name": "johann sebastian bach"
          }
        },
        {
          "match": {
            "name": {
              "query": "johann sebastian bach",
              "minimum_should_match": "2<75%"
            }
          }
        }
      ]
    }
  }
}

Which gives us what we’re looking for:

{
  ...
  "hits" : {
    ...
    "hits" : [
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "j2VWR2sBKYAfkVJkD-Xu",
        "_score" : 5.5457225,
        "_source" : {
          "name" : "Johann Sebastian Bach"
        }
      },
      {
        "_index" : "painters",
        "_type" : "_doc",
        "_id" : "hmVWR2sBKYAfkVJkD-Xu",
        "_score" : 1.9334917,
        "_source" : {
          "name" : "Johann Adam Ackermann"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "kGVWR2sBKYAfkVJkD-Xu",
        "_score" : 1.3755432,
        "_source" : {
          "name" : "Johann Christoph Bach"
        }
      },
      {
        "_index" : "composers",
        "_type" : "_doc",
        "_id" : "kWVWR2sBKYAfkVJkD-Xu",
        "_score" : 1.3755432,
        "_source" : {
          "name" : "Johann Ambrosius Bach"
        }
      }
    ]
  }
}
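
These totals can be reconstructed by adding up the score of every should clause a document matches. The Python sketch below does exactly that with the per-clause scores from the earlier responses. One value is an assumption: the score of the match_phrase clause for Johann Sebastian Bach never appears on its own in this article, so it is taken to equal his plain match score, which is consistent with the total shown above. The clause labels are made up for readability.

```python
# Per-clause scores taken from the earlier responses in this article.
# ASSUMPTION: the match_phrase score for J.S. Bach is inferred, not reported.
CLAUSE_SCORES = {
    "Johann Sebastian Bach": {
        "match": 1.8485742,
        "match_phrase": 1.8485742,
        "match_min_should": 1.8485742,
    },
    "Johann Adam Ackermann": {"match": 1.9334917},
    "Johann Christoph Bach": {"match": 0.6877716, "match_min_should": 0.6877716},
    "Johann Ambrosius Bach": {"match": 0.6877716, "match_min_should": 0.6877716},
}

# A bool query adds up the scores of all should clauses that match.
totals = {name: sum(scores.values()) for name, scores in CLAUSE_SCORES.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for name in ranking:
    print(f"{totals[name]:.7f}  {name}")
```

Ackermann matches only the first clause and keeps his original score, the two other Bachs match two clauses and double theirs, and Johann Sebastian Bach matches all three, which puts him comfortably on top.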

Conclusion

All search is a trade-off between precision and recall. To strike a good balance between the two, you can describe your desired results in increasing specificity using should clauses. This gives you the best results first while keeping a longer tail of less — but still — relevant results.