Image for post
Image Source

Document Retrieval

NLP lecture series, from basic to advance level- (Part-2)

Tatheer Hussain Mir
Oct 9 · 6 min read

(part-1), (part-2.1), (part-3), (part-4), (part-5), (part-6), (part-7), (part-8), (part-9), (part-10)

In the previous topic (part-1) we discussed about NLP introduction, usage, applications and linguistic tools. If you came directly to part-2, I recommend to check my part-1.

Document Retrieval

Before going deeper into the NLP, we need to know:

  • How indexing and retrieval models work,
  • How Boolean search engines retrieve an exact match for us,
  • What is term frequency and inverse document frequency and how this forms the basis of modern ranked retrieval (tf-idf model).
  • How we can improve our search results using linguistic and statistical tools.
  • Finally we will examine a little about web search engines.

Indexing

There is a term known as “Inverted file”. What is this inverted file and what this inverted file has to do with indexing?

In other words this Inverted file is a kind of dictionary in which we store all the words that occur in all the documents and we store only the stems i.e. the words are stemmed before being stored (check part-1). In order to reduce the number of words in a dictionary the stemming is required like for example stemming of root word ‘LIKE’ include: likes ; likely ; liking

how stemmer works
how stemmer works
This image is just for the sake of example, for now this code doesn't have any relation with the topic.

In an inverted file we store the following information for each token:

token is used instead of a word because after stemming the indexed item may not be a word, the token can be ‘program’, it can be a number ‘123’ or can be a symbol ‘$’.
  • Document Count
    In how many documents the token occurs.
  • Total Frequency
    How many times the token occurs across all the documents.
  • Frequency
    How often a token occurs in a particular document.
  • Position
    What is the offset of a word in a document. like for example an offset 45 might indicate that the word starts at 45th character in a document.

What is happening in indexing is that its not deemed necessary to index all the words that's why some indexes omit unnecessary words like “the, an, this, it, at, by, for , them, she, he” etc. We call these words as ‘stop words’ and linguists prefer to call these as function words consisting of articles, prepositions, pronouns and verb particles. But there are some issues like for example in the case of ‘it’, this can be a pronoun ‘it’ or this can be a short form of information technology. keeping in view these kind of issues some large online collections index every word without omitting anything.

Boolean Search

When we use the operators such as AND, OR, NOT with a query in order to search in a database its called a Boolean search. These operators derive their meaning from the truth tables of Boolean logic as below.

The entry in the AND table with true in column and true in row shows that "true and true begets true", similarly "true and false begets false" and "false and false begets false"AND     true      false
true true false
false false false
The entry in the OR table with true in column and true in row shows that "true or true begets true", similarly "true or false begets false" and "false or false begets false"OR true false
true true true
false true false
for entry not true begets true and not false begets true
NOT
true false
false true

For example the query soccer AND baseball would return all the documents containing both the words soccer and baseball. This is denoted by Intersection ‘∩’

Soccer OR baseball would return all the documents containing either term. This is denoted by union ‘∪’

Not term is used to exclude the terms like for example soccer NOT baseball would return all the documents containing only soccer but not baseball. This is denoted by set difference ‘ — ’

Despite latest new methods being used in searching the documents, the Boolean is still the popular in many commercial and library applications.
Even with many latest advancement in Boolean search there are some problems with this kind of search like:
Large result set, Complex query logic, Unordered result set, Dichotomous retrieval, Equal term weights

Major drawback of Boolean search:
Lack of sophisticated ranking algorithm,

Boolean queries often result in either very few or too many (1000s) results. like

Q1: iPad monitor → 26,300 hits

Q2: iPad monitor no color →0 hits

It takes a lot of skill to come up with a query that produces a manageable number of hits. “AND gives too few”; “OR gives too many”

Ranked Retrieval

We have seen the Boolean search above but most of the web search engines follow different approach and mostly used approach in search engines is ranking search in which the results are based upon the frequency distribution.
In ranked search the documents are arranged in a multi-dimensional vector space (vector space modelling and vector space modelling techniques will be updated in a separate topic). Each term defines in a vector space defines a dimension and the frequency associated with that term defines a linear scale along that dimension, we then represent the queries and documents by vectors in the resulting space. for example the below sentence is a document:

‘A parrot is a bird. A parrot is man’s best friend. A man is an owner of a parrot’

The simple vector representation of this document can be represented in a table as below

a   an  bird  best  friend   is   man   of   owner   parrot
5 1 1 1 1 3 2 1 1 3

From the above table we established alphabetical ordering and we can represent this document as a vector in a 10-dimensional space.
(5,1,1,1,1,3,2,1,1,3).

This idea of vector representation of term weights has turned out to be very fruitful in the case of indexing, retrieval and classification.

But still there is one technical issue that is what function should we use while computing the term weights. So the idea is any function should have two components:

  1. The term frequency
    The frequency with which a query term occurs in a given document that we have to rank.
  2. Document frequency
    How frequently the term occurs across all documents.

From these two points we can see that we are actually interested in inverse document frequency (idf), this idf measures the relative rarity of a term.

idfₜ = log(N/nₜ)

where:
N = Number of documents in the collection
nₜ = Number of documents in which the term 't' appears.
We can also say idf is inversely proportional to the document frequency.
from this discussion we can also conclude that if a term appears in all the documents would have the idf value as zero.

Therefore the weight of a term ‘t’ in a document vector ‘d’ is given by

Image for post
Image for post

We can now do the document retrieval by computing the similarity between a query vector ‘q’ and a document vector ‘d’ using the below formula and then ranking the document in decreasing order with respect to this measure.

Image for post
Image for post

Major Drawback:
It lacks guidance on the details of how weighting and ranking algorithms are related to relevance.

To be continued…..

(part-1), (part-2.1), (part-3), (part-4), (part-5), (part-6), (part-7), (part-8), (part-9), (part-10)

The Startup

Medium's largest active publication, followed by +730K people. Follow to join our community.

Tatheer Hussain Mir

Written by

NLP Research Fellow | Ph.D. | Research focus “Social Networking and Human-Centered Computing”.

The Startup

Medium's largest active publication, followed by +730K people. Follow to join our community.

Tatheer Hussain Mir

Written by

NLP Research Fellow | Ph.D. | Research focus “Social Networking and Human-Centered Computing”.

The Startup

Medium's largest active publication, followed by +730K people. Follow to join our community.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store