NLP lecture series, from basic to advanced level (Part-2)
You may have visited an old library where document retrieval was based on a catalog system and watched the librarians search for a book on our behalf. The advent of the World Wide Web has changed all of that: these days everyone is a kind of document retriever, and everyone knows about document search and its limitations.
Before going deeper into NLP, we need to know:
- how indexing and retrieval models work,
- how Boolean search engines retrieve an exact match for us,
- what term frequency and inverse document frequency are, and how they form the basis of modern ranked retrieval (the tf-idf model),
- how we can improve our search results using linguistic and statistical tools.
Finally, we will take a brief look at web search engines.
There is a term known as an “inverted file”. What is an inverted file, and what does it have to do with indexing?
An inverted file is a kind of dictionary in which we store all the words that occur across all the documents. Only the stems are stored, i.e. the words are stemmed before being stored (see Part-1). Stemming reduces the number of entries in the dictionary: for example, ‘likes’, ‘likely’ and ‘liking’ are all reduced to the stem ‘like’.
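The idea can be sketched with a toy suffix-stripping stemmer. This is only an illustration, not a real stemming algorithm such as Porter’s; the suffix rules and the minimum-stem-length check here are assumptions chosen so that the ‘like’ example works.

```python
def crude_stem(word):
    """Toy suffix stripper: maps a few inflected forms onto one stem.

    The (suffix, replacement) rules below are illustrative only; a real
    stemmer (e.g. the Porter algorithm) uses a much richer rule set.
    """
    for suffix, repl in (("ing", "e"), ("ly", ""), ("es", "e"), ("s", "")):
        # Require at least 3 characters of stem so tiny words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + repl
    return word

for w in ("likes", "likely", "liking"):
    print(w, "->", crude_stem(w))  # all three map to "like"
```

All three inflected forms end up as a single dictionary entry, which is exactly why stemming shrinks the inverted file.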
In an inverted file we store the following information for each token. (‘Token’ is used instead of ‘word’ because, after stemming, the indexed item may not be a word: a token can be ‘program’, a number such as ‘123’, or a symbol such as ‘$’.)
- Document count
In how many documents the token occurs.
- Total frequency
How many times the token occurs across all the documents.
- Term frequency
How often the token occurs in a particular document.
- Offset
Where the token occurs within a document. For example, an offset of 45 might indicate that the word starts at the 45th character of the document.
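The four fields above can all be derived from one mapping of token to per-document character offsets. Below is a minimal sketch, assuming whitespace tokenization and no stemming or stop-word removal (both would be applied first in a real indexer):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map token -> {doc_id: [character offsets]}.

    Document count, total frequency and term frequency all fall out
    of this structure, as shown after the function.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        pos = 0
        lowered = text.lower()
        for token in lowered.split():
            start = lowered.index(token, pos)  # character offset of this occurrence
            index[token][doc_id].append(start)
            pos = start + len(token)
    return index

docs = ["the parrot likes the man", "a man owns a parrot"]
idx = build_inverted_index(docs)

postings = idx["parrot"]
doc_count = len(postings)                                  # in how many documents
total_freq = sum(len(offs) for offs in postings.values())  # occurrences overall
term_freq_doc0 = len(postings[0])                          # frequency in document 0
offsets_doc0 = postings[0]                                 # character offsets in doc 0
```

Here ‘parrot’ occurs in both documents (document count 2), once in each, at character offsets 4 and 13 respectively.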
Not every word is deemed worth indexing, so some indexes omit very common words such as “the, an, this, it, at, by, for, them, she, he”. We call these ‘stop words’; linguists prefer the term ‘function words’, covering articles, prepositions, pronouns and verb particles. Omitting them has its pitfalls, though: ‘it’ can be the pronoun, or it can be short for ‘information technology’. With such ambiguities in mind, some large online collections index every word without omitting anything.
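Stop-word removal is a simple filter before indexing. A minimal sketch, using only the illustrative stop words listed above (real systems use much longer curated lists):

```python
# Illustrative stop-word list taken from the examples above;
# production systems use curated lists of hundreds of entries.
STOP_WORDS = {"the", "an", "a", "this", "it", "at", "by", "for", "them", "she", "he"}

def index_terms(text):
    """Return the tokens of `text` that would actually be indexed."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(index_terms("The parrot sat by the window"))  # ['parrot', 'sat', 'window']
```

Note that this filter would also silently drop ‘it’ when it means ‘information technology’, which is precisely the ambiguity that leads some collections to index everything.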
When we combine query terms with operators such as AND, OR and NOT to search a database, it is called a Boolean search. These operators derive their meaning from the truth tables of Boolean logic, shown below.
The entry in the AND table with ‘true’ in the column and ‘true’ in the row shows that true AND true begets true; similarly, true AND false begets false, and false AND false begets false.

AND   | true  | false
true  | true  | false
false | false | false

The OR table reads the same way: true OR true begets true, true OR false begets true, and false OR false begets false.

OR    | true  | false
true  | true  | true
false | true  | false

For NOT: NOT true begets false, and NOT false begets true.
For example, the query soccer AND baseball would return all the documents containing both the words ‘soccer’ and ‘baseball’. This corresponds to set intersection ‘∩’.
soccer OR baseball would return all the documents containing either term. This corresponds to set union ‘∪’.
NOT is used to exclude terms: for example, soccer NOT baseball would return all the documents containing ‘soccer’ but not ‘baseball’. This corresponds to set difference ‘−’.
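Because each query term maps to the set of documents containing it, the three operators are exactly Python’s set operations. A sketch with made-up document ID sets:

```python
# Hypothetical posting sets: the document IDs containing each term.
soccer = {1, 2, 3, 5}
baseball = {2, 4, 5}

and_result = soccer & baseball  # AND -> intersection: docs with both terms
or_result = soccer | baseball   # OR  -> union: docs with either term
not_result = soccer - baseball  # NOT -> set difference: soccer but not baseball

print(and_result)  # {2, 5}
print(or_result)   # {1, 2, 3, 4, 5}
print(not_result)  # {1, 3}
```

This is also why Boolean retrieval is dichotomous: a document is either in the result set or it is not, with no notion of being a better or worse match.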
Despite the newer methods now used for searching documents, Boolean search remains popular in many commercial and library applications.
Even with the many advancements in Boolean search, this kind of search still has several problems:
- large result sets,
- complex query logic,
- unordered result sets,
- dichotomous (all-or-nothing) retrieval,
- equal term weights.
The major drawbacks of Boolean search are the lack of a sophisticated ranking algorithm, and the fact that Boolean queries often return either very few results or far too many (thousands). For example:
Q1: iPad monitor → 26,300 hits
Q2: iPad monitor no color → 0 hits
It takes considerable skill to come up with a query that produces a manageable number of hits: “AND gives too few; OR gives too many”.
We have seen Boolean search above, but most web search engines follow a different approach: ranked search, in which the results are ordered based on frequency distributions.
In ranked search the documents are arranged in a multi-dimensional vector space (vector space modelling techniques will be covered in a separate topic). Each term defines a dimension of the space, and the frequency associated with that term defines a linear scale along that dimension; we then represent queries and documents as vectors in the resulting space. For example, consider the sentence below as a document:
‘A parrot is a bird. A parrot is man’s best friend. A man is an owner of a parrot’
The simple vector representation of this document can be written as a table, with the terms in alphabetical order:

term:  a  an  best  bird  friend  is  man  of  owner  parrot
count: 5  1   1     1     1       3   2    1   1      3
From the table above we establish an alphabetical ordering of the terms, and we can represent this document as a vector in a 10-dimensional space.
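The table can be reproduced with a few lines of Python. The normalization here (lowercasing, dropping periods, folding the possessive “man’s” onto “man”) is an assumption made to match the counts in the table:

```python
from collections import Counter

doc = "A parrot is a bird. A parrot is man's best friend. A man is an owner of a parrot"

# Normalise: lowercase, strip trailing periods, fold "man's" onto "man".
tokens = [t.lower().strip(".").replace("'s", "") for t in doc.split()]
counts = Counter(tokens)

terms = sorted(counts)               # alphabetical axis ordering
vector = [counts[t] for t in terms]  # the document as a 10-dimensional vector
print(terms)
print(vector)
```

The resulting vector is [5, 1, 1, 1, 1, 3, 2, 1, 1, 3] over the alphabetically ordered terms, matching the table.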
This idea of representing term weights as vectors has turned out to be very fruitful for indexing, retrieval and classification.
One technical issue remains: what function should we use to compute the term weights? The idea is that any such function should have two components:
- Term frequency
The frequency with which a query term occurs in the given document that we have to rank.
- Document frequency
How frequently the term occurs across all documents.
From these two points we can see that we are actually interested in the inverse document frequency (idf), which measures the relative rarity of a term:
idfₜ = log(N/nₜ)
N = Number of documents in the collection
nₜ = Number of documents in which the term ‘t’ appears.
We can also say that idf is inversely proportional to the document frequency. It follows that a term that appears in all the documents has an idf of zero, since log(N/N) = log(1) = 0.
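The idf formula translates directly into code. A minimal sketch, where each document is represented as a list of tokens:

```python
import math

def idf(term, docs):
    """idf_t = log(N / n_t), where docs is a list of token lists."""
    n_t = sum(1 for d in docs if term in d)  # documents containing the term
    return math.log(len(docs) / n_t) if n_t else 0.0

# Toy collection: 'parrot' appears in every document, 'bird' in only one.
docs = [["parrot", "bird"], ["parrot", "man"], ["parrot", "soccer"]]

print(idf("parrot", docs))  # log(3/3) = 0.0 -> no discriminating power
print(idf("bird", docs))    # log(3/1) ≈ 1.099 -> rare, hence informative
```

This also demonstrates the conclusion above: ‘parrot’ occurs in all three documents, so its idf is exactly zero.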
Therefore the weight of a term ‘t’ in a document vector ‘d’ is given by
w(t, d) = tf(t, d) × idfₜ
where tf(t, d) is the frequency of ‘t’ in document ‘d’.
We can now perform document retrieval by computing the similarity between a query vector ‘q’ and a document vector ‘d’, the standard measure being the cosine similarity
sim(q, d) = (q · d) / (|q| |d|)
and then ranking the documents in decreasing order with respect to this measure.
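Putting the pieces together, ranked retrieval is: build tf-idf vectors, score each document against the query by cosine similarity, and sort. A self-contained sketch on a toy collection (the documents and query are made up for illustration):

```python
import math

def tfidf_vector(tokens, vocab, idf):
    """w(t, d) = tf(t, d) * idf_t along each vocabulary dimension."""
    return [tokens.count(t) * idf[t] for t in vocab]

def cosine(u, v):
    """sim(q, d) = (q . d) / (|q| |d|); 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["parrot", "bird", "parrot"], ["soccer", "baseball"], ["parrot", "man"]]
vocab = sorted({t for d in docs for t in d})
idf = {t: math.log(len(docs) / sum(t in d for d in docs)) for t in vocab}

query = ["parrot"]
q_vec = tfidf_vector(query, vocab, idf)
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(q_vec, tfidf_vector(docs[i], vocab, idf)),
                 reverse=True)
print(ranking)  # [0, 2, 1]: doc 0 mentions 'parrot' twice, doc 1 not at all
```

Unlike Boolean retrieval, every document gets a graded score, so the result set is ordered rather than all-or-nothing.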
A limitation of the vector space model is that it offers little guidance on how the weighting and ranking algorithms relate to relevance.
To be continued…..