Making Search Relevant Using Elasticsearch

Ayush Pateria
Unacademy Engineering
Aug 3, 2017 · 10 min read

Search is one of the primary ways users find content on a website, and for a site like Unacademy, which has thousands of courses and lessons, it's important to return the best and most relevant results. Unacademy's search is backed by Elasticsearch. I will not explain everything Elasticsearch does (you can look that up in the docs); instead, the focus of this article is on how we used it to make our search results more relevant.

Elasticsearch is a great product, and most of what you need to build a good search solution is provided out of the box. For a year, Unacademy's search used the basic text search capabilities of Elasticsearch, which is fine if you want to rank results purely on textual relevance, that is, on how well the search terms match those in the documents. But as the platform grew and multiple courses/lessons covered the same topics, it became critical to rank the good courses above the less popular ones. Otherwise, users could miss out on the great courses we have to offer.

For every course on the site, we collect a number of metrics: the number of reviews, the date of release of the course, ratings, and so on. We collect similar metrics for every lesson and educator. The challenge put to me was to combine these metrics with the textual relevancy of search results and deliver just the right ranking. It's not as straightforward as sorting the results by the number of reviews: a course that was recently released and is doing well has to rank ahead of an older course with a slightly higher number of reviews. A ranking algorithm had to be devised that carefully factors in all of these parameters, and Elasticsearch was the way to go. This article discusses the various approaches we used to develop the search and improve its relevancy.

Handling the Textual Relevancy

In Elasticsearch, the data to be searched is stored as documents. Each document contains many fields; you can think of it as a JSON document with a schema (defined in the mapping). We have indices for courses, lessons and users. For courses, the document contains the title, description, the lessons in the course, total ratings, average rating, educator, and a few other things.
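
As an illustration, a course document looks roughly like this (the field names match the mapping described above; the values are made up for the example):

course_doc = {
    "title": "Understanding the Indian Economy",
    "description": "A course covering the basics of the Indian economy.",
    "lessons": ["Introduction", "Five Year Plans", "Banking and Inflation"],
    "educator": {"name": "..."},
    "avg_rating": 4.6,
    "total_ratings": 1250,
    "published_at": "2017-06-01"
}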

Elasticsearch by default uses TF/IDF and field-length normalisation to compute textual relevancy. TF, or term frequency, favours documents that contain the search query's terms more often. IDF, or inverse document frequency, reduces the weight of a term that appears in a lot of documents; essentially, this ensures that words like 'and', 'the', etc. are not given much importance. Field-length normalisation gives a higher score to query terms found in shorter fields, so a search term found in the title rather than in the description will score higher.

Although these algorithms are great and work well in most use cases, in our particular case we had to do away with all three, for good reasons. Consider the courses: if someone searches for 'economy' and there's a course with that word three times in the title, it will be given a much higher score, and that's unfair to the course which mentions economy only once but has higher ratings. Field-length norm was creating issues because some people keep the title really long and some keep it short. And disabling IDF gave us a better understanding of the scores, since we weren't indexing English stop words anyway. All we cared about was whether the search query terms were present in the document, and how well they matched it in terms of percentage. To give more weight to the title or other fields, we could simply boost those fields. The search query at this point looked similar to this:

{
  "query": {
    "multi_match": {
      "query": searchText,
      "fuzziness": "AUTO",
      "fields": [
        "title^2",
        "educator.name^2",
        "description",
        "lessons"
      ],
      "tie_breaker": 0.3,
      "minimum_should_match": "30%"
    }
  }
}

One of the goals for the search was to make it typo tolerant, and that is very easily done using "fuzziness": "AUTO". But this created problems when the search results were augmented with non-textual factors like the total ratings. Though Elasticsearch gives a higher score to a non-fuzzy match, the difference isn't large enough. So if you search for 'Economy' and there's a super popular course titled 'Ecology' (edit distance = 2), the net score turns out to be in favour of 'Ecology'.

Enter constant_score

Elasticsearch has something called constant_score. If you wrap your query in constant_score and provide a fixed score (boost), every match is given that same fixed score. This means TF/IDF and field norms don't apply, and further, we can assign whatever score we want to a match. How does this solve our earlier problem with fuzzy search? Through a clever usage of constant_score: there are two constant_score queries, one without fuzziness and a score of, say, 15, and the other with fuzziness but a score of, say, 5.

{
  "constant_score": {
    "boost": 15,
    "query": {
      "match": {
        "title": {
          "query": searchText,
          "minimum_should_match": "30%"
        }
      }
    }
  }
},
{
  "constant_score": {
    "boost": 5,
    "query": {
      "match": {
        "title": {
          "query": searchText,
          "minimum_should_match": "30%",
          "fuzziness": "AUTO"
        }
      }
    }
  }
}

This made the score difference between an exact and a fuzzy match much larger: an exact match satisfies both clauses (15 + 5 = 20) while a fuzzy-only match scores 5, a gap of 15 (20 - 5). And if there are no exact matches, the fuzzy ones are still there with a score of 5. Perfect.

This solved the problem with fuzziness, but there's one more problem! The score is the same even for documents that contain more of the search terms. So if I searched for 'Medieval History', the top results would be for modern history: every document matching the query is returned with the same score, even the ones that have 'history' but not 'medieval'! That means, after going through the popularity algorithm, the results for modern history, which has more popular courses, end up higher on the list. How do we fix this? Again, by using multiple constant_score queries with varying minimum_should_match. The query with "minimum_should_match": "100%" gets a score of 20, and the query with "minimum_should_match": "50%" gets a score of 10. We can set multiple minimum_should_match levels this way, and the textual score is the cumulative sum of the matching clauses. Thus, by increasing the textual relevancy scores, we can overshadow the effect of popularity. The good thing about this approach is that we gain very fine-grained control over the scoring algorithm and can assign scores as we like.
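
To illustrate, here is a sketch in Python of how such tiered clauses can be generated. This is a simplified reconstruction, not our exact production code, and the tiers and boosts are example values:

def tiered_text_clauses(search_text, field="title"):
    # Each (minimum_should_match, boost) tier contributes one
    # constant_score clause. Inside a bool/should, a document's
    # textual score is the sum of the tiers it satisfies, so
    # fuller matches always score higher.
    tiers = [("100%", 20), ("50%", 10), ("30%", 5)]
    clauses = []
    for msm, boost in tiers:
        clauses.append({
            "constant_score": {
                "boost": boost,
                "query": {
                    "match": {
                        field: {
                            "query": search_text,
                            "minimum_should_match": msm
                        }
                    }
                }
            }
        })
    return clauses

query = {"query": {"bool": {"should": tiered_text_clauses("medieval history")}}}

A document matching both 'medieval' and 'history' scores 20 + 10 + 5 = 35, while one matching only 'history' scores 5, a gap large enough to survive the popularity scoring that comes next.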

Factoring in Popularity and Recency

Coming to the main goal: the good courses should rank higher in the search results. What makes one course better than another depends on multiple factors, like its ratings, total lessons, published date, etc. To combine these with the textual relevancy scores, we use something called function_score. Simply put, you wrap the above query in function_score, and the documents returned by the text search query become the input to various functions, whose outputs are added or multiplied together (depending on "score_mode"). These functions look at the non-textual fields of the document, like ratings or published date, and compute a score. The final score can be the product or the sum of the two scores, textual and function-based. In our case, we use the product, because text-based scores and function-based scores can operate at different ranges; taking the product is safer, as we don't want super popular courses to dominate every search query. Remember, the priority is always the most accurate textual match; only when the textual scores are nearly the same do we fall back on popularity and the rest. The whole query looks somewhat like this:

"query": {
  "function_score": {
    "query": {
      "bool": {
        "should": [
          …various constant_score queries
        ]
      }
    },
    "functions": [
      {
        "script_score": {
          "script": {
            "inline": "Math.log(1 + doc['avg_rating'].value * doc['total_ratings'].value)"
          }
        },
        "weight": 1.5
      },
      {
        "gauss": {
          "published_at": {
            "origin": pubTime,
            "scale": "180d",
            "offset": "30d",
            "decay": 0.2
          }
        },
        "weight": 2
      }
    ],
    "score_mode": "sum",
    "boost_mode": "multiply"
  }
}

There are mainly two types of functions that we use here: script_score and gauss. The script computes a score, which is then multiplied by the given weight; using the weights, you can control the relative importance of each function. You may notice that we take the log of the ratings. This is because of the nice nature of the logarithmic function: its effect is more pronounced when the ratings grow from 100 to 1,000 than when they grow from 1,000 to 2,000.
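
A quick worked example of this dampening in plain Python, mirroring the script above (the rating values are made up):

import math

def rating_score(avg_rating, total_ratings, weight=1.5):
    # Same formula as the script_score function, times its weight.
    return weight * math.log(1 + avg_rating * total_ratings)

print(rating_score(4.5, 100))   # ~9.17
print(rating_score(4.5, 1000))  # ~12.62 (+3.45)
print(rating_score(4.5, 2000))  # ~13.66 (+1.04)

Going from 100 to 1,000 ratings adds about 3.4 points, while going from 1,000 to 2,000 adds only about 1.0, so runaway popularity sees diminishing returns.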

The gaussian function is quite interesting and is illustrated below.

(Figure: Gaussian decay curve)

All new courses get the full recency score of 2 by default (the gauss function outputs 1 for a fresh course, which is multiplied by the weight of 2). The function works by penalising courses against this default. If the published date is within the offset of 30 days, there is no penalty; beyond that, the score decreases steeply along the curve, at a rate controlled by the decay value. Decay fixes the function's output for documents at the distance of scale: 180 days past the offset, the gauss output has dropped to 0.2.
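
For the curious, the decay Elasticsearch computes can be reproduced in a few lines (this is the gauss variant from the function_score documentation; dates are simplified to day counts here):

import math

def gauss_decay(days_since_publish, offset=30, scale=180, decay=0.2):
    # Elasticsearch picks sigma so that the output equals `decay`
    # exactly `scale` units beyond the `offset`.
    sigma_sq = -scale ** 2 / (2 * math.log(decay))
    distance = max(0, days_since_publish - offset)
    return math.exp(-distance ** 2 / (2 * sigma_sq))

print(gauss_decay(10))   # 1.0    (within the 30-day offset, no penalty)
print(gauss_decay(210))  # 0.2    (offset + scale = 210 days out)
print(gauss_decay(400))  # ~0.001 (old courses get almost no recency score)

Multiply the output by the function's weight of 2 to get the recency component of the final score.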

Adding Synonyms to Search

Elasticsearch is a powerful beast, but it's not smart enough to understand that 'economy' and 'economics', or 'mathematics' and 'math', mean the same thing! Up to this point, we had a pretty good search that displayed the good courses at the top. But someone searching for 'economics' would not get the courses in which the educator used only 'economy' in the title and description. And this was a big problem, because the most popular economics course on Unacademy mentions 'economy' but not 'economics'. Thankfully, Elasticsearch has built-in support for synonyms. You give Elasticsearch a list of synonyms, and it expands (or contracts) a token to all of its synonyms, either at index time or at query time. Both approaches have their pros and cons, and I recommend going through this guide for a better understanding of how they work.

One important difference is that with an index-time synonym filter, you cannot add new synonyms without reindexing all the documents. So we decided to go with the query-time strategy, only to move to index time later. You cannot update the synonyms list in the mapping without first closing the index; after updating, you can open it again. But it turns out that the Amazon Elasticsearch Service doesn't support closing an index! We verified this by contacting their support. That left us with no option but to reindex all the documents whenever there was an addition to the list of synonyms.

One more caveat with AWS is that it doesn't support synonym files. The only way to add synonyms is via the index mapping, which can make your indices big. So instead of hardcoding the list in the JSON mapping file, we decided to leave it empty:

"synonym_filter": {
  "type": "synonym",
  "synonyms": []
}

And just before sending the mapping file to Elasticsearch, we populate this field dynamically from the synonyms stored in our database. So to update the synonyms, you just update the list in the database. Sweet!
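
A minimal sketch of this step, assuming the elasticsearch-py client; fetch_synonyms_from_db and the course_mapping.json template are hypothetical stand-ins for our actual helpers:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch()

def fetch_synonyms_from_db():
    # Stand-in for the real database query; returns Solr-format rules.
    return ["economy, economics", "math, maths, mathematics"]

def create_index_with_synonyms(index_name):
    # Load the mapping/settings template containing the empty filter.
    with open("course_mapping.json") as f:
        body = json.load(f)
    # Populate the synonym list dynamically before creating the index.
    body["settings"]["analysis"]["filter"]["synonym_filter"]["synonyms"] = \
        fetch_synonyms_from_db()
    es.indices.create(index=index_name, body=body)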

However, this means that re-indexing needs to be done after every synonym update in the database. And reindexing doesn't come without its own set of challenges!

Reindexing Documents with Zero Downtime

While your documents are being reindexed, the search will not return complete results. For us, reindexing courses, users and lessons was taking a good 5-10 minutes, and that was bound to only increase. To deal with this, Elasticsearch has index aliases. You can assign an alias to one or more indices, and whenever you query the alias, the request is routed to the indices it points to. For example, the course index can be named 'course_v1', and we assign it to the alias 'course'. All query operations in our app then go to 'course' rather than 'course_v1', while updates and inserts go directly to the index (the reason follows). When you need to reindex, you create a new index 'course_v2', and once reindexing is done, you quickly switch the 'course' alias from 'course_v1' to 'course_v2'. This is implemented somewhat like the following code:

Reindexing pseudo code:
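
Here is a minimal Python sketch of that procedure, assuming the elasticsearch-py client; build_mapping() and get_all_courses() are hypothetical placeholders for our mapping template and database query:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def reindex_courses(alias="course"):
    # Find the current versioned index behind the alias, e.g. course_v1.
    current = list(es.indices.get_alias(name=alias))[0]
    version = int(current.rsplit("_v", 1)[1])
    new_index = "%s_v%d" % (alias, version + 1)

    # Create the new index first; live inserts/updates target the
    # versioned index name (not the alias), so they land here while
    # we copy the old documents over.
    es.indices.create(index=new_index, body=build_mapping())

    # Bulk-index every course document into the new index.
    helpers.bulk(es, (
        {"_index": new_index, "_type": "course", "_id": doc["id"], "_source": doc}
        for doc in get_all_courses()
    ))

    # Atomically switch the alias, then drop the old index.
    es.indices.update_aliases(body={"actions": [
        {"remove": {"index": current, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}}
    ]})
    es.indices.delete(index=current)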

One thing to look out for here: what happens to index updates from other places while the reindex procedure is underway? Whenever an educator publishes a course, or a new lesson is added to a course, we send insert requests to Elasticsearch. It's important that all these updates go to the new index being created, not to the alias. And this is also why we increment the version number of the current index prior to starting the reindex procedure.

Concluding Remarks

In this article I have gone through the various challenges and pitfalls I faced while integrating Elasticsearch. It is far from a complete guide; I have not gone into the implementation details of most things, which you can read in this excellent guide. I referred to it extensively while developing the search.

Making a good search engine for your website is definitely not a trivial task. It took me about a month to have a decent search engine up and running. A good part of that involved making sure the data in our database stays consistent with what is in the indices. If an educator changes his name, we need to change it not only in the educator index but also in every document of the course index that contains it (that's the catch with NoSQL). Such updates can be huge in number; for us, there were around 3,000 of them every 10 minutes. Doing them on the API server would hurt its performance, so instead we push these update requests to a queue, from which worker nodes fetch them and apply the updates to Elasticsearch. Very frequent updates, like a new follow or views on a lesson, are batched and synced only once in a while.
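
As a sketch of this pattern (assuming a Celery-style queue; the task body and field names are illustrative, not our exact code):

from celery import Celery
from elasticsearch import Elasticsearch

app = Celery("search_updates", broker="redis://localhost:6379/0")
es = Elasticsearch()

@app.task
def update_educator_name(educator_id, new_name):
    # Update the educator's own document...
    es.update(index="educator", doc_type="educator", id=educator_id,
              body={"doc": {"name": new_name}})
    # ...then patch the denormalised copy inside every course document.
    es.update_by_query(index="course", body={
        "script": {
            "inline": "ctx._source.educator.name = params.name",
            "params": {"name": new_name}
        },
        "query": {"term": {"educator.id": educator_id}}
    })

The API server only enqueues update_educator_name.delay(...); the workers absorb the write load against Elasticsearch.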

The search is live on unacademy.com, and I welcome you to try it out. The best part of it was when our users started showing a greater interest in the search results, and all the efforts finally paid off :)

(Graph: the search click rate increased from 25% to 40% as we rolled out the new search.)
