This is the first article in a series in which we’ll explore the use of synonyms.
Synonyms are one of the most widely used tools in search, as they allow users to manually fix boolean relevancy problems by expanding terms with their equivalents. Using synonyms allows users to find documents through terms that may not have been used in the original document.
In Solr, there are two types of synonyms: equivalent and replacement synonyms. In this first post we’ll focus on equivalent synonyms.
Understanding Equivalent Synonyms
Synonyms are used to extend terms in order to fix specific relevance problems, particularly those caused by searchable content and users not speaking the same “language”. Content is written with a specific, limited vocabulary, but users apply different terms to look for the same products. Some of those terms may not even be part of the index, so without synonyms, popular terms won’t always return the products users expect.
Synonyms can be applied at:
- Query time: Query time synonyms expand terms used in a query and are therefore able to match more documents in the index.
- Index time: Index time synonyms expand the terms used at the point of indexing, so that the index contains more terms and, therefore, documents can be more easily matched.
As an aside, both approaches help to fix the relevancy problem, but they work differently and therefore have different drawbacks. We’ll analyze this in more depth in a future article.
An equivalent synonym in Solr is what we understand as a synonym in natural language. By definition, a synonym is “a word or phrase that means exactly or nearly the same as another word or phrase in the same language” (Definition from Oxford Dictionaries). Following this definition, an equivalent synonym should only be used to extend terms with other terms that mean exactly or nearly the same thing.
The Importance of Choosing Good Synonyms
Good use of equivalent synonyms allows users to find televisions by searching for “Television”, but also by searching for “TV”, even if only the full word “Television” appears in the original content. This is the perfect example: based on the definition above, “TV” and “Television” are two different terms that mean exactly the same thing.
With this synonym in place, a user finds all the products containing either “TV” or “Television” by using either term in the query.
However, not all synonyms are as easy to define as the TV/Television example. Defining a synonym requires understanding both the context in which it is applied and the document set the index contains.
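The TV/Television behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's implementation: the product titles, the `SYNONYM_GROUPS` list, and the helper functions are all hypothetical, and the expansion is done at query time over a toy inverted index.

```python
from collections import defaultdict

# Hypothetical equivalence groups: every term in a group expands to all the others.
SYNONYM_GROUPS = [{"tv", "television"}]

# Hypothetical product titles standing in for indexed documents.
DOCUMENTS = {
    1: "Samsung 55 inch Television",
    2: "LG OLED TV",
    3: "Sony soundbar",
}


def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def expand(term):
    """Return the term plus every equivalent synonym defined for it."""
    expanded = {term}
    for group in SYNONYM_GROUPS:
        if term in group:
            expanded |= group
    return expanded


def search(index, query):
    """Single-term search with query-time synonym expansion (OR semantics)."""
    hits = set()
    for term in expand(query.lower()):
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCUMENTS)
print(search(index, "TV"))          # [1, 2]
print(search(index, "Television"))  # [1, 2]
print(search(index, "soundbar"))    # [3]
```

Because the group is symmetric, either term retrieves both products, which is exactly what makes a synonym "equivalent".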
The Complexity of Multi-Keyword Synonyms
Users expect synonyms to be applied per keyword, so a multi-keyword synonym such as “special price” = “promotion” is only applied when “special price” appears as a contiguous phrase in the indexed terms. It is not applied, for instance, when “special” and “price” are in the same field but have other terms between them. That’s the correct expectation, and that’s how synonyms behave.
However, when the synonym is applied at index time, all three terms end up in the index as individual keywords, and hence the document is findable by any of them. This means that searching for “special” or “price” would return a document that originally contained only the term “promotion”. When the synonym is applied at query time instead, the search “special” would return documents containing the term “special”, such as those containing “special price”, but it wouldn’t retrieve promotions; those would be returned only by searching for either “promotion” or “special price”. In this example, the index-time behavior looks reasonable, but it doesn’t work so well in other cases.
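The difference between the two application points can be made concrete with a small Python sketch. It is a deliberately simplified model, not Solr's analysis chain: the documents and function names are hypothetical, and index-time expansion is modeled as a flat bag of terms per document, as described above, ignoring token positions.

```python
# Hypothetical equivalence: the phrase "special price" and the term "promotion".
SYNONYMS = [("special", "price"), ("promotion",)]

# Hypothetical documents.
DOCUMENTS = {
    1: "summer promotion on laptops",
    2: "special price on tablets",
    3: "special edition headphones",
}


def tokenize(text):
    return text.lower().split()


def find_phrase(tokens, phrase):
    """True if the phrase tokens appear contiguously, in order, in tokens."""
    n = len(phrase)
    return any(tuple(tokens[i:i + n]) == phrase for i in range(len(tokens) - n + 1))


def index_terms(text, expand):
    """Terms stored for a document; with expand=True, synonym terms are added."""
    tokens = tokenize(text)
    terms = set(tokens)
    if expand and any(find_phrase(tokens, p) for p in SYNONYMS):
        for phrase in SYNONYMS:
            terms.update(phrase)  # each keyword becomes individually searchable
    return terms


def search_index_time(query):
    """Index-time expansion: match the raw query term against expanded documents."""
    return sorted(d for d, text in DOCUMENTS.items()
                  if query in index_terms(text, expand=True))


def search_query_time(query):
    """Query-time expansion: expand only when the whole query equals one variant."""
    q = tuple(tokenize(query))
    variants = SYNONYMS if q in SYNONYMS else [q]
    return sorted(d for d, text in DOCUMENTS.items()
                  if any(find_phrase(tokenize(text), v) for v in variants))


print(search_index_time("special"))        # [1, 2, 3] -> the promotion is returned too
print(search_query_time("special"))        # [2, 3]    -> only documents with "special"
print(search_query_time("promotion"))      # [1, 2]
print(search_query_time("special price"))  # [1, 2]
```

In this model, document 1 never contained “special”, yet the index-time variant returns it for that query, while the query-time variant only expands when the full phrase or its synonym is searched.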
For instance, a common Spanish example is the synonym “traje de baño” (swimsuit) = “bikini”. If the synonym is applied at index time, exact matches of “traje de baño” are extended with the term “bikini”, and vice versa. Once this synonym is indexed, a user searching for “traje” (suit) may find that some results are bikinis, because for documents originally indexed as “bikini” the index now also contains “traje de baño”.
Users searching for “traje” (suit) are looking for suits, so adding bikinis to the result set decreases precision and contaminates the top results, causing a bad user experience that, in this instance, can also affect sales.
So, in our example:
- The synonym could work by using it at query time, whereas it doesn’t work at index time
- The synonym could work if the document set does not contain suits
That’s why understanding context is really important in order to choose good synonyms.
Bad Use of Equivalent Synonyms
Extending terms affects the number of documents that a query matches, which means that precision and recall are affected too. So, as with every relevancy-related tool, good use of synonyms should improve the user experience on the search platform, whereas bad use can cause big problems by retrieving documents that have nothing to do with the query.
Synonyms are a great tool. However, engineers have to understand that this tool has its own scope and shouldn’t be used to fix problems that the search platform should take care of by itself, such as lemmatization or spellchecking. Doing so would hide platform problems behind a handful of successful queries.
Turning to equivalent synonyms in particular, a common mistake is using them to solve non-equivalence problems, such as the relationship between a subset and its parent group. A subset is not the same as its parent group. When an equivalent synonym is used to retrieve the subset when searching for the parent term, the parent term search is fixed but the subset term search is contaminated.
Let’s have a look at an example. If an equivalent synonym “iPhone” = “smartphone” is used, the expected result would be to find iPhones for the search term “smartphone”, and indeed, this would happen. But it would also work the other way around: all smartphones would be retrieved for the search “iPhone”, decreasing precision and worsening the user experience by showing unexpected results.
The problem here is that “all iPhones are smartphones, but not all the smartphones are iPhones”. If this phrase can be applied to an equivalent synonym, it means that it’s actually not an equivalent synonym, and therefore, applying it in such a way would mean that the search results are going to be contaminated in some way.
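A rough way to see the contamination is to measure precision for the query “iPhone” with and without the synonym. The catalog, the relevance labels, and the helper names below are hypothetical and exist only to illustrate the subset/parent problem.

```python
# Hypothetical catalog: product id -> (title, whether the product is an iPhone).
CATALOG = {
    1: ("Apple iPhone 15", True),
    2: ("Apple iPhone SE", True),
    3: ("Samsung Galaxy smartphone", False),
    4: ("Google Pixel smartphone", False),
}

# The (incorrect) equivalent synonym "iPhone" = "smartphone", applied in both directions.
EQUIVALENT = {
    "iphone": {"iphone", "smartphone"},
    "smartphone": {"iphone", "smartphone"},
}


def search(query, use_synonym):
    """Return ids of products whose title contains the query or, optionally, its synonyms."""
    terms = EQUIVALENT[query] if use_synonym and query in EQUIVALENT else {query}
    return sorted(pid for pid, (title, _) in CATALOG.items()
                  if terms & set(title.lower().split()))


def precision(results, relevant):
    """Fraction of returned results that are relevant."""
    return len(set(results) & relevant) / len(results) if results else 0.0


relevant_for_iphone = {pid for pid, (_, is_iphone) in CATALOG.items() if is_iphone}

without_synonym = search("iphone", use_synonym=False)
with_synonym = search("iphone", use_synonym=True)

print(without_synonym, precision(without_synonym, relevant_for_iphone))  # [1, 2] 1.0
print(with_synonym, precision(with_synonym, relevant_for_iphone))        # [1, 2, 3, 4] 0.5
print(search("smartphone", use_synonym=True))                            # [1, 2, 3, 4]
```

The “smartphone” search gains the iPhones, as intended, but in this toy catalog the “iPhone” search drops from full precision to half because every other smartphone is returned as well.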
Synonyms are a very powerful tool that allows search admins to bring searchable content closer to the language users speak.
Good use of synonyms will improve the user experience while searching on the platform but, as with any other relevancy-related tool, synonyms should be applied with caution. Bad use can contaminate search results, making the user experience worse and moving users further away from, rather than closer to, their final goal.