A simple Query Expansion

Shyam Swaroop
2 min read · Jan 8, 2020

This tutorial is relevant if you want query expansion without relevance feedback mechanisms. Another limitation is that it only suits generic use cases, because WordNet, a general-purpose English lexical database, is used as the corpus to expand a query. For domain-specific query expansion, a technique such as dynamically generating a document-term matrix from a domain-specific corpus can be used instead.

What should the expanded query include? That depends on the particular use case. In my experience, including synonyms and hypernyms gives the best results.
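As a minimal sketch of what "synonyms and hypernyms" means in practice, the helper below collects the lemma names of each synset (its synonyms) and of its direct hypernyms. It assumes the interface of NLTK's Synset objects (lemma_names() and hypernyms()); the names related_terms and demo are hypothetical and not taken from the original code.

```python
def related_terms(synsets):
    """Collect synonym and hypernym lemma names from WordNet synsets.

    `synsets` is any iterable of objects exposing `lemma_names()` and
    `hypernyms()`, which is the interface of NLTK's `Synset`.
    """
    terms = []
    seen = set()
    for synset in synsets:
        # Synonyms are the lemmas of the synset itself; hypernyms are the
        # more general synsets one level up.
        for source in [synset, *synset.hypernyms()]:
            for name in source.lemma_names():
                # WordNet joins multiword lemmas with underscores.
                term = name.replace("_", " ")
                if term not in seen:
                    seen.add(term)
                    terms.append(term)
    return terms


def demo():
    # Assumption: NLTK is installed and the WordNet data has been
    # fetched with nltk.download("wordnet").
    from nltk.corpus import wordnet as wn
    return related_terms(wn.synsets("dog", pos=wn.NOUN))
```

The result keeps first-seen order and drops duplicates, so the most common senses contribute their terms first. Whether to go more than one hypernym level up is a design choice that depends on how broad the expanded query may become.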

Code for Query Expansion

The pre-processing stage has few steps: just a tokenizer, with no lemmatization or stemming step in the pipeline. The reason is that a WordNet lookup takes a word and returns its synsets irrespective of the word form, i.e. looking up “dog” or “dogs” gives the same synsets, because NLTK reduces the word to its base form before the lookup. Once we have these synsets, we can go ahead and get the synonyms and hypernyms, which are already in their base lemma form.

The second step in the pipeline is part-of-speech (POS) tagging. This step is needed because of how synsets are organized in WordNet. The POS tag of a word depends on its context, and the context changes the meaning of the word; hence, the same word has different synsets depending on the POS tag it has in the sentence. For example, here are all the synsets for the word “run”, which cover both its noun and its verb senses.

In[1]: from nltk.corpus import wordnet as wn

In[2]: wn.synsets('run')
Out[2]: [Synset('run.n.01'),
Synset('test.n.05'),
Synset('footrace.n.01'),
Synset('streak.n.01'),
Synset('run.n.05'),
Synset('run.n.06'),
Synset('run.n.07'),
Synset('run.n.08'),
Synset('run.n.09'),
Synset('run.n.10'),
Synset('rivulet.n.01'),
Synset('political_campaign.n.01'),
Synset('run.n.13'),
Synset('discharge.n.06'),
Synset('run.n.15'),
Synset('run.n.16'),
Synset('run.v.01'),
Synset('scat.v.01'),
Synset('run.v.03'),
Synset('operate.v.01'),
Synset('run.v.05'),
Synset('run.v.06'),
Synset('function.v.01'),
Synset('range.v.01'),
Synset('campaign.v.01'),
Synset('play.v.18'),
Synset('run.v.11'),
Synset('tend.v.01'),
Synset('run.v.13'),
Synset('run.v.14'),
Synset('run.v.15'),
Synset('run.v.16'),
Synset('prevail.v.03'),
Synset('run.v.18'),
Synset('run.v.19'),
Synset('carry.v.15'),
Synset('run.v.21'),
Synset('guide.v.05'),
Synset('run.v.23'),
Synset('run.v.24'),
Synset('run.v.25'),
Synset('run.v.26'),
Synset('run.v.27'),
Synset('run.v.28'),
Synset('run.v.29'),
Synset('run.v.30'),
Synset('run.v.31'),
Synset('run.v.32'),
Synset('run.v.33'),
Synset('run.v.34'),
Synset('ply.v.03'),
Synset('hunt.v.01'),
Synset('race.v.02'),
Synset('move.v.13'),
Synset('melt.v.01'),
Synset('ladder.v.01'),
Synset('run.v.41')]
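To see how such a list splits by part of speech, the synsets can be grouped by their POS code. This is a small sketch that assumes NLTK's Synset methods pos() and name(); group_by_pos and demo are hypothetical names used only for illustration.

```python
from collections import defaultdict


def group_by_pos(synsets):
    """Group synset names by their WordNet POS code ('n', 'v', 'a', 's', 'r')."""
    groups = defaultdict(list)
    for synset in synsets:
        groups[synset.pos()].append(synset.name())
    return dict(groups)


def demo():
    # Assumption: NLTK is installed and the WordNet data has been
    # fetched with nltk.download("wordnet").
    from nltk.corpus import wordnet as wn
    by_pos = group_by_pos(wn.synsets("run"))
    # Passing pos= restricts the lookup directly, which is what the
    # POS tagging step makes possible.
    verbs_only = wn.synsets("run", pos=wn.VERB)
    return by_pos, verbs_only
```

For the listing above, this grouping would put 16 senses under 'n' and 41 under 'v', so knowing that “run” is used as a verb already rules out the noun senses.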

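Another important point to note is that part-of-speech tags are encoded differently in the NLTK library and in WordNet: the NLTK tagger returns Penn Treebank tags such as NN or VBD, while WordNet uses single-letter codes (n, v, a, r). Hence the code needs a mapper between the two.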
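A mapper of this kind could look like the sketch below. It assumes the default Penn Treebank tagset returned by nltk.pos_tag; the names penn_to_wordnet and synsets_in_context are hypothetical, and this is an illustration rather than the original code.

```python
# WordNet's POS codes; these are the values of wn.NOUN, wn.VERB, wn.ADJ
# and wn.ADV in nltk.corpus.wordnet.
WN_NOUN, WN_VERB, WN_ADJ, WN_ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS.

    Returns None for tags WordNet does not cover (determiners, prepositions, ...).
    """
    if tag.startswith("NN"):
        return WN_NOUN
    if tag.startswith("VB"):
        return WN_VERB
    if tag.startswith("JJ"):
        return WN_ADJ
    if tag.startswith("RB"):
        return WN_ADV
    return None


def synsets_in_context(query):
    """Tokenize and POS-tag `query`, then look up synsets restricted to each tag."""
    # Imported here so the mapper above stays usable on its own.
    # Assumption: the tokenizer, tagger and WordNet data have been fetched
    # with nltk.download (resource names vary across NLTK versions).
    import nltk
    from nltk.corpus import wordnet as wn

    result = {}
    for token, tag in nltk.pos_tag(nltk.word_tokenize(query)):
        pos = penn_to_wordnet(tag)
        if pos is not None:
            result[token] = wn.synsets(token, pos=pos)
    return result
```

Tokens whose tag has no WordNet counterpart are skipped here rather than looked up without a POS restriction; either choice is defensible, and skipping keeps function words out of the expanded query.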

Shyam Swaroop

Founder — Atri Labs, Build apps faster and better