Relevance: Query & Title Similarity Features

karanjude
MLFeatureEngineering
1 min read · Jan 7, 2018

Detailed Problem Statement: Say you have a set of product titles and a collection of user queries. What kinds of similarity features would you use to measure query & title relevance?

Similarity Features:

  • Overlap Count: The number of words that overlap between the query and the title.
  • Cosine Similarity between TF-IDF Vectors: Assuming the query and the title have each been converted into a TF-IDF vector, the cosine similarity between the two vectors.
  • Cosine Similarity between Average Word2vec Vectors: Assuming a word2vec model has been trained on the set of queries and titles, or an existing pretrained word2vec model is used, generate a dense vector (a.k.a. embedding) for each word in the query and the title. Sum the embeddings of all the words in the title and divide the sum by the number of words in the title. Do the same for the query. Then compute the cosine similarity between the two averaged vectors.
  • Levenshtein Distance: The edit distance between the words of the title and the words of the query.
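
The four features above can be sketched in plain Python. This is only an illustration under some assumptions, not a reference implementation: the function names are made up for this example, the IDF uses a smoothed formula, the embeddings are assumed to arrive as a plain dict of word to list of floats, words missing from that dict are skipped when averaging, and the Levenshtein function works on either characters or word tokens depending on what you pass in.

```python
import math
from collections import Counter

def overlap_count(query_tokens, title_tokens):
    """Number of distinct words shared by the query and the title."""
    return len(set(query_tokens) & set(title_tokens))

def cosine(u, v):
    """Cosine similarity between two vectors stored as dicts (key -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

def fit_idf(corpus):
    """Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1 over tokenized docs."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf_vector(tokens, idf):
    """Raw term count times IDF; tokens unseen when fitting are dropped."""
    tf = Counter(tokens)
    return {tok: count * idf[tok] for tok, count in tf.items() if tok in idf}

def average_embedding(tokens, embeddings):
    """Mean of the word vectors, returned as a dict (dimension -> value).

    Assumption: words without an embedding are skipped, so the divisor is
    the number of words that were actually found.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return {}
    dim = len(vectors[0])
    return {i: sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)}

def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Toy data, made up for illustration only.
query = ["red", "shoes"]
title = ["red", "running", "shoes"]
toy_embeddings = {"red": [1.0, 0.0], "running": [0.5, 0.5], "shoes": [0.0, 1.0]}

idf = fit_idf([query, title])
features = {
    "overlap_count": overlap_count(query, title),
    "tfidf_cosine": cosine(tfidf_vector(query, idf), tfidf_vector(title, idf)),
    "avg_embedding_cosine": cosine(average_embedding(query, toy_embeddings),
                                   average_embedding(title, toy_embeddings)),
    "word_edit_distance": levenshtein(query, title),
}
```

In practice the IDF weights would be fit on the full set of queries and titles, and the embeddings would come from a trained or pretrained word2vec model rather than a hand-written dict.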

Note: For effective use of the above features, the words in the title and query should be normalized (converted to lowercase, stop words removed, stemming applied).
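
A minimal sketch of that normalization step might look like the following. The stop word list is a tiny hypothetical sample, and `crude_stem` is a deliberately naive placeholder rather than a real stemming algorithm; a proper stemmer (for example a Porter stemmer from an NLP library) would normally take its place.

```python
import re

# Hypothetical, deliberately short stop word list for illustration.
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"})

def crude_stem(word):
    """Strip a couple of common suffixes. A placeholder, NOT a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def normalize(text):
    """Lowercase, split into alphanumeric tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

normalized_title = normalize("The Red Running Shoes for Men")
```

Applying the same `normalize` function to both queries and titles keeps the two sides comparable before any similarity feature is computed.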
