High Quality Autocomplete Search Part 1.
Providing relevant results to millions of customers
Have you searched for hotels in Traveloka?
I bet you’ve tried the autocomplete search. Indeed, Autocomplete search is the main gateway to explore myriads of Traveloka Hotels. By using hotel autocomplete search, users are given the most relevant hotel, region and landmark as-you-type. Autocomplete helps users saving time while typing and minimizes chances for spelling mistakes which result to a better customer experience overall, therefore it is essential for Traveloka to have a high quality autocomplete search.
However, there are several challenges that has to be tackled by hotel autocomplete.
1st, hotel autocomplete is expected to give the most relevant results using the least amount of keystroke.
2nd, since Traveloka is dealing with international users, autocomplete search must be able to handle various languages specific quirks such as in Thai language words are written without space separator e.g เกาะสมุย (ko samui).
3rd, autocomplete should also be swift. No matter how good an autocomplete is, the user experience won’t be enjoyable if it takes too long to give results.
Tackling the Challenges
1. Relevance Score formulation
Users only want to see the most relevant results. Hence, the way relevance score algorithm works is one of the keys of autocomplete search.
1.1 Classical Way of text relevance scoring
The classical way of calculating relevance score is based on term frequency (tf), inverse document frequency (idf) and field-length norm.
Term frequency means the more frequent a term appears, the more significance it is, e.g text “Bora bora island” is more relevant than “boracay island” for query “bora”.
Inverse document frequency implies that a word’s relative weight is related to the inverse of its occurences in all documents, making more common words less significant than the unusual one. For example in text “Eiffel tower”, word “Eiffel” carries more weight than “tower” because “tower” is much more common and general than “Eiffel”.
Field-length norm describes that the longer the text, the less significant it is compared to the shorter ones. e.g text named “Jakarta” is more significant than “West Jakarta” because “Jakarta” is shorter than “West Jakarta” for query “Jakarta”.
Therefore the classical way of relevance score formula can be summarized by
Relevance Score = Tf * Idf * norm
1.2 Improvement by BM25
The classical way of relevance calculation is quite a huge step forward. However, for a short field — typical of traveloka entities — the relative significance of each factor (tf, idf and norm) shouldn’t be treated equal otherwise it will lead to erroneous results. In short texts it is much better to dampen the effect of Tf and field norm compared to Idf.
To tackle this problem we use BM25 as a basic way of calculating relevance score. BM25 itself is quite an extensive method of its own (wikipedia), but by using BM25 the relative significance of each factor (Tf, Idf and field-length norm) can be adjusted.
1.3 Varying boost for every fields.
A document entity consists of several fields such as name, alias, additional infos. Each field should carry different weights, e.g name field should be more important than alias and additional info. Therefore matching with name fields gives better score than matching with additional infos or alias field.
1.4 Adding popularity and geoLocation based boosting
To further improve the search result, we tune the calculated score with popularity data and geolocation.
Boosting by popularity means the more popular documents will have higher relevance scores compared to the less popular documents, other things being equal. The data can be inferred by various aspects, notably the amount of user search. Furthermore, popularity data should also be made per locale since popularity is location specific.
We also boost the relevance score by user’s location. Results that are located in the same region as the user’s will be given an additional boost.
1.5 Adding Exact boost.
Although the above factors seems sufficient, we could instantly find a serious flaws in the algorithm. Since it emphasizes on popularity data, many unpopular entity will always ends in the bottom of the search result, no matter how exact the query is, making unpopular entity hardly searchable.
This problem is exacerbated in hotels since they usually have eccentric names. They might have names such as “hotel j”, “hotel i”, “hotel m”, etc. Based popularity alone, query “hotel j” will usually give “hotel jxxx” (the most popular one) on top of the lists, leaving “hotel j” close to the bottom because of its low popularity.
To alleviate this problem, we add boost to the entities that exactly match the query itself. For example when the user queries “hotel j”, hotel with exact terms “hotel j” is given additional boost.
2. Dealing with human languages
Autocomplete search is related to the way human interacts with languages. Hence, a lot of things need to be implemented and taken care of compared to a non-text search.
2.1 Analyzing Process
The first step for a text search is analyzing the text. In this process, the text is transformed into simple searchable tokens. Analyzing consists of several steps such as tokenizing, stemming, synonym, and shingle processing.
Tokenizing means separating the texts into smaller searchable tokens. The text separator should not only be whitespaces, but also numbers, symbols and other non-numeral characters since users usually treats those as a separator.
The following example is quite common in user search :
In stemming process, words are reduced into its basic form e.g “eating” into “eat”, “buying” into “buy”.
In synonym process, the words are expanded into its synonym, making tokens also searchable by their synonyms.
Sometimes user also search the words without space in between. For example “Hong Kong” document should also be searchable by keyword “Hongkong” (without space). Therefore the problem is tackled by shingle process which combine adjacent tokens.
2.2 Specific Language features
As explained before, several languages have their own peculiar characteristics. In Thai language, words are written without space separator such as เกาะสมุย (ko samui). This behavior will fail whitespace-based analyzer because of the lack of apparent token separator. The solution? use specific language analyzer to analyze text from various languages. By using Thai analyzer, text เกาะสมุย will be tokenized into เกาะ (koh) and สมุย (samui).
2.3 Typo Handling
It is quite common for users not to know the exact text they’re looking for, making them prone to typo. Therefore, autocomplete system should also be able to suggest them not only the documents that contain the exact tokens but also the ones that quite close to the query.
To tackle this problem we use fuzzy query which allows text that is within a certain Levenshtein edit distance (delete, insert and swap) to be considered match.
For the last touch, we don’t want to give documents that were matched via typo to have a high relevance score, hence documents that are matched by fuzzy query were given lower boost.
3. Creating a high performance autocomplete system
A sophisticated autocomplete system would be meaningless if it takes too much time to produce results. Furthermore, a slow autocomplete system will also create an unpleasant user experience — something to avoid at all cost.
There are a lot of ways to improve latency performance of autocomplete, which itself is an interesting topic to cover in a separate blog. One of them is simply moving the computation intensive part from query time to indexing time. It is essential for the query time to be fast and swift, whereas indexing time doesn’t have an impact to user.
For the rest, stay tuned for the second part of this blog where I will discuss about the system architecture and solution for high-performance autocomplete service.
I am lucky to face this fascinating, open-ended search algorithm challenge with my team at Traveloka, one of the largest online travel companies in Southeast Asia. If you’re a software engineer interested in developing state of the art search system, helping millions users to find their next adventures, have a look at the opportunities on Traveloka’s careers page!
Special thanks to Jogie for the discussion on the previous autocomplete implementation, Yosef Sidharta, for challenging the algorithm with tons of user cases, and Evan Yonathan for the invaluable guidance.