When I talk to software engineers and product managers about improving their search engines, the conversation often leads to query expansion, and specifically synonyms. A lot of folks who work on search believe that their biggest problem is not having enough synonyms in their dictionary.
Synonyms are useful, but they aren’t a cure-all for search problems.
A good synonym dictionary is worth having. When someone searches for a couch, the search engine should also return sofas. Achieving good recall usually starts with assembling a robust collection of synonym pairs.
But it doesn’t end there. And naively relying on synonyms to address recall problems is likely to create even worse precision problems. For example, glasses and eyeglasses seem reasonable as a synonym pair, but a search engine shouldn’t return results for eyeglasses when someone searches for wine glasses. We’ll get back to that example in a moment.
Because it’s easy to add synonyms to a dictionary, search engine developers, as well as business users with limited power to administer the search engine, often use synonyms to address recall issues — especially embarrassing recall issues. But as with most ad hoc debugging efforts, this approach often creates unanticipated — and undesirable — consequences. It leads to a bloated synonym dictionary that is filled not only with words whose meaning depends on context, but also with misspellings and pluralizations that would be better addressed by spelling correction and lemmatization.
In natural language, there’s no such thing as context independence.
At this point, you may be thinking that the main problem with synonym dictionaries comes from inadvertently including pairs where the meanings of both words depend on context. Perhaps the solution is to exclude any pairs of words or phrases that don’t mean the same thing 100% of the time.
You can do this, but you’ll end up with an empty synonym dictionary. Or you’ll be left with a meager collection of extremely conservative pairs like color and colour. You’ll have thrown the baby out with the bathwater — not that those two words should be synonyms.
In natural language, there’s no such thing as context independence. In the best case, two words will mean essentially the same thing, most of the time. But there are always exceptions, and it is impossible — or at least impractical — to try to anticipate and account for every exception. You need to accept, a priori, that what words mean — and hence whether two words or phrases are synonyms — always depends on context.
But don’t despair! The context is right in front of you!
Fortunately, search usually offers you a great source of context about what a word means — namely, the rest of the search query. Our search engines may not have achieved “superintelligence”, but they can at least take advantage of the search query as context to narrow down the meaning of a word.
For example, a search engine should be able to figure out that a search for wine glasses is targeting kitchen supplies, and therefore shouldn’t return wine-colored eyeglasses (yes, they exist). Indeed, there are probably no results for wine eyeglasses in kitchen supplies.
If the search engine can figure out the general category for a query, it can mostly eliminate the risk of synonyms taking the meaning of words out of context. Automatically classifying the query into a general category isn’t an easy problem, but the return on investment makes it a top priority if search is important to you. And you can do even better if you can recognize query segments and entities.
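To make the idea concrete, here is a minimal sketch of category-gated synonym expansion. Everything in it is an assumption for illustration: the `SYNONYMS` and `CATEGORY_HINTS` tables, the category names, and the keyword-lookup `classify` function, which stands in for a real query classifier.

```python
from typing import Optional

# Hypothetical synonym dictionary. Each term maps to (synonym, categories);
# categories=None means the pair applies regardless of query category.
SYNONYMS = {
    "couch": [("sofa", None)],
    "glasses": [("eyeglasses", {"eyewear"})],
}

# Hypothetical keyword hints standing in for a trained query classifier.
CATEGORY_HINTS = {
    "wine": "kitchen",
    "reading": "eyewear",
    "leather": "furniture",
}


def classify(query: str) -> Optional[str]:
    """Return the category of the first hinted token, or None if unknown."""
    for token in query.lower().split():
        if token in CATEGORY_HINTS:
            return CATEGORY_HINTS[token]
    return None


def expand(query: str) -> list[list[str]]:
    """Return, for each query token, the token plus any allowed synonyms."""
    category = classify(query)
    expanded = []
    for token in query.lower().split():
        alternatives = [token]
        for synonym, categories in SYNONYMS.get(token, []):
            # Apply a category-restricted pair only when the query's
            # category is known and matches.
            if categories is None or category in categories:
                alternatives.append(synonym)
        expanded.append(alternatives)
    return expanded


print(expand("wine glasses"))     # glasses is not expanded in kitchen
print(expand("reading glasses"))  # glasses expands to eyeglasses in eyewear
```

In this sketch, an unclassified query skips category-restricted pairs. That is a conservative choice; you could just as reasonably expand in that case and rely on ranking to sort it out.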
Use synonyms to increase recall and query context to ensure precision.
Using query context to ensure precision means that you can afford to err on the side of recall with synonyms. Don’t throw all caution to the wind — it’s still possible to create mischief with bad synonym pairs. But, rather than trying to come up with a perfect collection of synonyms, you’re better off delegating responsibilities, using synonyms to ensure recall while relying on query categorization and other query understanding techniques to ensure precision.
And please, don’t use synonyms to implement lemmatization and spelling correction! You’re much better off separating these different query rewriting strategies and solving each of them with the appropriate tool.
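One way to picture that separation is as a pipeline with one stage per strategy. The sketch below is illustrative only: the vocabulary, lemma table, and synonym table are made up, the lemma lookup stands in for a real lemmatizer, and spelling correction is approximated with the standard library's `difflib.get_close_matches`.

```python
import difflib

# Hypothetical data for illustration; a real system would derive these
# from its index, a lemmatizer, and a curated synonym dictionary.
VOCABULARY = ["couch", "sofa", "glasses", "wine", "leather"]
LEMMAS = {"couches": "couch", "sofas": "sofa"}
SYNONYMS = {"couch": ["sofa"], "sofa": ["couch"]}

KNOWN_WORDS = VOCABULARY + list(LEMMAS)


def correct_spelling(token: str) -> str:
    """Map an unknown token to its closest known word, if one is close enough."""
    if token in KNOWN_WORDS:
        return token
    matches = difflib.get_close_matches(token, KNOWN_WORDS, n=1, cutoff=0.8)
    return matches[0] if matches else token


def lemmatize(token: str) -> str:
    """Reduce an inflected form to its lemma using the lookup table."""
    return LEMMAS.get(token, token)


def expand_synonyms(token: str) -> list[str]:
    """Return the token followed by its synonyms, if any."""
    return [token] + SYNONYMS.get(token, [])


def rewrite(query: str) -> list[list[str]]:
    """Run each token through spelling, then lemmatization, then synonyms."""
    return [
        expand_synonyms(lemmatize(correct_spelling(token)))
        for token in query.lower().split()
    ]


print(rewrite("lether couches"))
```

Because each stage has a single job, the synonym table holds only true synonym pairs, and misspellings and plurals never need to be added to it.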
That’s all, folks! I hope you’ve enjoyed this real talk about synonyms and search.