How to use machine learning to find synonyms

Nikhil Dandekar
May 10, 2016

Most of the synonym generation techniques I have seen in user-facing websites and apps treat it as a two-step problem:

1. Candidate generation

At this step, given a word, you generate all possible candidates that might be synonyms for the word.

Note that what you mean by “synonyms” usually changes a lot based on domain. If you are writing a word processor, you probably want something close to the dictionary definition of a synonym. For a search engine, your definition of a synonym can be much broader and include any alternate words or phrases that help you get better search results for a query. For example, acronym expansions (“CS” -> “computer science”), synonyms for proper nouns (“the big apple” -> “new york city”), or even entire phrase substitutions are good “synonyms” for a search engine.

Depending on your exact application, here are some of the sources you can use for synonym generation:

  1. Word embeddings: You can train word vectors on your corpus and then find synonym candidates for the current word by taking its nearest neighbors under some notion of “similarity” (cosine similarity is a common choice). For more on this, see: “What are the kinds of related words that Word2Vec outputs?”
  2. Historical user data: You can also look at historical user behavior and generate synonym candidates from that. A simple way to do this for search engines is to mine word and phrase substitutions from the “query, query, click” pattern. If you see users searching for [buy handbags], not clicking on anything, then searching for [buy purses] and clicking on a result, you can consider “purses” to be a synonym candidate for “handbags”. The simple version of this ignores the context (in this case, the previous word being “buy”), but to get more accurate synonyms, you do want to use the previous and next n words as context. For word processors, you can similarly look at the word and phrase substitutions that your users make.
  3. Lexical synonyms: These are dictionary-style synonyms, defined by the language itself rather than by your data. WordNet is a popular source for these, among others.
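To make the word-embedding source concrete, here is a minimal sketch of nearest-neighbor candidate generation using cosine similarity. The toy three-dimensional vectors and the `nearest_neighbors` helper are made up for illustration; in practice the vectors would come from a model trained on your own corpus.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real vectors would come from a model trained on your corpus.
VECTORS = {
    "handbag": [0.90, 0.10, 0.05],
    "purse":   [0.85, 0.15, 0.10],
    "tote":    [0.80, 0.25, 0.05],
    "laptop":  [0.05, 0.90, 0.30],
    "banana":  [0.10, 0.05, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=2):
    """Return the k (word, similarity) pairs most similar to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

candidates = nearest_neighbors("handbag", VECTORS, k=2)
print(candidates)
```

With these toy vectors, “purse” and “tote” come out as the top two candidates for “handbag”. Keep in mind that embedding neighbors are often merely related words rather than true synonyms, so the output should be treated as candidates only.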

If there are other synonym sources outside of these that are better suited for your application, you should use them too.
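The “query, query, click” pattern from the historical user data source can be sketched roughly as follows. The session log format (a chronological list of `(query, clicked)` pairs) and the function names are assumptions made for this example, and this is the simple version that ignores the surrounding context words.

```python
from collections import Counter

# Hypothetical log format: each session is a chronological list of
# (query, clicked_a_result) pairs. All data here is invented.
SESSIONS = [
    [("buy handbags", False), ("buy purses", True)],
    [("cheap handbags", False), ("cheap purses", True)],
    [("buy handbags", True)],
    [("buy laptop", False), ("rent apartment", True)],
]

def single_word_substitution(q1, q2):
    """Return (old, new) if the queries differ in exactly one position, else None."""
    t1, t2 = q1.split(), q2.split()
    if len(t1) != len(t2):
        return None
    diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
    return diffs[0] if len(diffs) == 1 else None

def mine_substitutions(sessions):
    """Count word substitutions where a no-click query is followed by a clicked one."""
    counts = Counter()
    for session in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
            if not clicked1 and clicked2:
                pair = single_word_substitution(q1, q2)
                if pair:
                    counts[pair] += 1
    return counts

substitution_counts = mine_substitutions(SESSIONS)
print(substitution_counts.most_common())
```

Here (“handbags”, “purses”) is counted twice, while the unrelated reformulation from [buy laptop] to [rent apartment] is dropped because more than one word changed. A more accurate version would key the counts on the neighboring words as well.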

You can use any one of these sources in isolation to solve your problem. For example, a synonym generation algorithm using word2vec vectors alone might be sufficient for you. But by using just one source, you will miss out on the strengths that the other sources offer.

2. Synonym detection

Now that you have a set of synonym candidates for a given word, you need to find out which of them are actually synonyms. This can be solved as a classical supervised learning problem: for each (word, candidate) pair, predict whether the candidate is a synonym or not.

Given your candidate set, you can generate ground-truth training data using either human judges or past user engagement. As I mentioned above, the definition of a synonym differs between applications, so if you are using human judges, you will need to give them clear guidelines on what makes a good synonym.

Once you have a labeled training set, you can generate various lexical and statistical features for your data and train a supervised ML model of your choice on it. In practice, I have seen that any features associated with past user behavior, such as word substitution frequency, perform the best for synonym detection in specific domains such as search engines.
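As a rough illustration of this step, the sketch below trains a tiny logistic regression classifier on labeled (word, candidate) pairs. The three features (normalized substitution frequency, embedding similarity, and whether a thesaurus lists the pair), the labels, and all the numbers are invented for the example. The model is hand-rolled only to keep the snippet dependency-free; any supervised model from an ML library would fit here just as well.

```python
import math

# Hypothetical features per (word, candidate) pair:
#   [normalized substitution frequency, embedding cosine similarity,
#    listed in a thesaurus (0 or 1)]
# Label: 1 = judged a synonym, 0 = not. All values are invented.
TRAIN = [
    ([0.90, 0.95, 1.0], 1),
    ([0.70, 0.80, 0.0], 1),
    ([0.60, 0.90, 1.0], 1),
    ([0.05, 0.85, 0.0], 0),  # close in embedding space, but users never substitute it
    ([0.00, 0.30, 0.0], 0),
    ([0.10, 0.20, 0.0], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability that the candidate is a synonym."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train_logistic_regression(data, learning_rate=0.5, epochs=2000):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

weights, bias = train_logistic_regression(TRAIN)
p_good = predict_proba(weights, bias, [0.80, 0.90, 1.0])
p_bad = predict_proba(weights, bias, [0.02, 0.28, 0.0])
print(round(p_good, 3), round(p_bad, 3))
```

In this toy data the substitution-frequency feature is what separates the classes, which mirrors the observation above that behavioral features tend to be the strongest signal in domains like search.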

Antonym generation

You can find antonyms using a technique similar to the one for synonyms. You might have to use different sources (e.g. you might not have historical user data, or you might need a different notion of “relation” between word vectors for antonyms), but the basic structure of candidate generation followed by classification remains the same.

Originally published at www.quora.com

Nikhil Dandekar

Engineering Manager doing Machine Learning @ Google. Previously worked on ML and search at Quora, Foursquare and Bing.