Boolean Retrieval Model Using Inverted Index and Positional Index

Muhammad Taha Khan
Analytics Vidhya
Published in
7 min readApr 2, 2020

Information retrieval (IR) is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.[1]

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. It is used by many IR systems to this day.The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user’s query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms.[2]

In this article I will share the difficulties I faced and also some concepts that I implemented of Boolean Information Retrieval Model.

The article contain the following steps in order to make documents index, process query and display result of the queries.

Python Implementation of the Boolean Model!

1. Collecting data set:

The model was developed on data set of 56 Major Speeches by Donald Trump (June 2015 — November 2016).

2. Importing Libraries:

import numpy as np
import pandas as pd
from flask import Flask, render_template,request
import time
import re
from collections import defaultdict
from nltk.stem import PorterStemmer
stop_words = []
with open ("Stopword-List.txt",'r') as file:
s=file.read().replace('\n',' ')
stop_words = s.split()
ps = PorterStemmer()

As above I only used Porter Stemmer module from NLTK library while other pre-processing and tokenization are manually done. Flask here is used to deploy our model on the server.

3. Making Inverted Index:

def inverted_index(stop_words):
dictionary = {}
documents = {}

for i in range(0,56):
doc_no = i
with open (“Trump Speechs\Trump Speechs\speech_” + str(doc_no) + “.txt”,’r’) as file:
next(file)
s=file.read().replace(‘\n’,’ ‘)

#cleaning documents
s = re.sub(‘ ‘, ‘ ‘, s)
s = re.sub(r”won’t”, “will not”, s)
s = re.sub(r”can\’t”, “can not”, s)
s = re.sub(r”n\’t”, “ not”, s)
s = re.sub(r”\’re”, “ are”, s)
s = re.sub(r”\’s”, “ is”, s)
s = re.sub(r”\’d”, “ would”, s)
s = re.sub(r”\’ll”, “ will”, s)
s = re.sub(r”\’t”, “ not”, s)
s = re.sub(r”\’ve”, “ have”, s)
s = re.sub(r”\’m”, “ am”, s)
s = re.sub(r’[0–9]+’, ‘’, s)
s=re.sub(r’[^\w\s]’,’ ‘, s)
key = ‘speech_’ + str(doc_no)

documents.setdefault(key,[])
documents[key].append(s)

#removing stopwords and lowercase
s = s.lower()
s = [words if words not in stop_words else ‘’ for words in s.split(‘ ‘)]
doc = []
doc = list(filter(None, s))
stemmed = []

#stemming
for i in doc:
stemmed.append(ps.stem(i))

#creating posting list
for x in stemmed:
key = x
dictionary.setdefault(key, [])
dictionary[key].append(doc_no)

#removing duplicates
dictionary = {a:list(set(b)) for a, b in dictionary.items()}

return dictionary,documents

In the above function each speech file is explicitly read, cleaned, tokenized and stemmed. For making inverted index dictionary each word is taken as key and the value of the key is the document id. There are many duplicates in our dictionary values as one word can occur multiple times in same document. Inverted Index can only be used for Boolean queries and conjunctive Boolean queries. Hence for phrase queries and proximity queries we use positional index.

4. Making Positional Index:

def positional_index(stop_words):

dictionary = {}
documents = {}
for i in range(0,56):
doc_no = i
with open (“Trump Speechs\Trump Speechs\speech_” + str(doc_no) + “.txt”,’r’) as file:
s=file.read().replace(‘\n’,’ ‘)[1:]

#cleaning documents
s = re.sub(‘ ‘, ‘ ‘, s)
s = re.sub(r”won’t”, “will not”, s)
s = re.sub(r”can\’t”, “can not”, s)
s = re.sub(r”n\’t”, “ not”, s)
s = re.sub(r”\’re”, “ are”, s)
s = re.sub(r”\’s”, “ is”, s)
s = re.sub(r”\’d”, “ would”, s)
s = re.sub(r”\’ll”, “ will”, s)
s = re.sub(r”\’t”, “ not”, s)
s = re.sub(r”\’ve”, “ have”, s)
s = re.sub(r”\’m”, “ am”, s)
s=re.sub(r’[^\w\s]’,’ ‘, s)

key = ‘speech_’ + str(doc_no)
documents.setdefault(key,[])
documents[key].append(s)

s = s.lower()
s = s.split(‘ ‘)
doc = []
doc = list(filter(None, s))
temp_dict = {}
stemmed = []

#stemming
for i in doc:
stemmed.append(ps.stem(i))

#creating positional index posting lists
a = 0
for x in stemmed:
key = x
temp_dict.setdefault(key, [])
temp_dict[key].append(a)
a += 1
for x in temp_dict:
if dictionary.get(x):
dictionary[x][doc_no] = temp_dict.get(x)
else:
key = x
dictionary.setdefault(key, [])
dictionary[key] = {}
dictionary[x][doc_no] = temp_dict.get(x)

return dictionary,documents

The function is almost similar to the inverted index. In this function I have made dict of dict to store the positions of the word corresponding to the document id’s. In this we don’t have to remove duplicates as it will be considered as position of the word in the document.

5. Query Processing:

  • Boolean Query processing:

Boolean queries are of type that contains Boolean Operators (AND, OR and NOT). Simple Boolean queries examples:

X AND Y: represents doc that contains both X and Y

X OR Y: represents doc that contains either X or Y

NOT X: represents the doc that do not contain X

Conjunctive Boolean queries examples:

(X AND Y) OR (Y AND Z) : represent doc that contain either both X and Y or Y and Z

(X AND Y) OR NOT Z : represent doc that contain either both X and Y or doc that do not contain Z.

Firstly the user query is converted to postfix expression in order to entertain both simple and conjunctive queries.

def postfix(infix_tokens):

#precendence initialization
precedence = {}
precedence[‘NOT’] = 3
precedence[‘AND’] = 2
precedence[‘OR’] = 1
precedence[‘(‘] = 0
precedence[‘)’] = 0
output = []
operator_stack = []

#creating postfix expression
for token in infix_tokens:
if (token == ‘(‘):
operator_stack.append(token)
elif (token == ‘)’):
operator = operator_stack.pop()
while operator != ‘(‘:
output.append(operator)
operator = operator_stack.pop()

elif (token in precedence):
if (operator_stack):
current_operator = operator_stack[-1]
while (operator_stack and precedence[current_operator] > precedence[token]):
output.append(operator_stack.pop())
if (operator_stack):
current_operator = operator_stack[-1]
operator_stack.append(token)else:
output.append(token.lower())

#while staack is not empty appending
while (operator_stack):
output.append(operator_stack.pop())
return output

The output is the sent to the process_query function to evaluate postfix expression.

def process_query(q,dictionary_inverted):q = q.replace(‘(‘, ‘( ‘)
q = q.replace(‘)’, ‘ )’)
q = q.split(‘ ‘)
query = []
for i in q:
query.append(ps.stem(i))
for i in range(0,len(query)):
if ( query[i]== ‘and’ or query[i]== ‘or’ or query[i]== ‘not’):
query[i] = query[i].upper()
results_stack = []
postfix_queue = postfix(query)
#evaluating postfix query expression
for i in postfix_queue:
if ( i!= ‘AND’ and i!= ‘OR’ and i!= ‘NOT’):
i = i.replace(‘(‘, ‘ ‘)
i = i.replace(‘)’, ‘ ‘)
i = i.lower()
i = dictionary_inverted.get(i)
results_stack.append(i)
elif (i==’AND’):
a = results_stack.pop()
b = results_stack.pop()
results_stack.append(AND_op(a,b))
elif (i==’OR’):
a = results_stack.pop()
b = results_stack.pop()
results_stack.append(OR_op(a,b))
elif (i == ‘NOT’):
a = results_stack.pop()
print(a)
results_stack.append(NOT_op(a))

return results_stack.pop()

AND, OR and NOT operation functions are as follows:

#AND two posting lists
def AND_op(word1,word2):
if ((word1) and (word2)):
return set(word1).intersection(word2)
else:
return set()

#OR two posting lists
def OR_op(word1,word2):
return set(word1).union(word2)

#NOT two posting lists
def NOT_op(a):
tot_docs = list(range(0,56))
return set(tot_docs).symmetric_difference(a)

Now we have implemented boolean query processing we move toward proximity processing.

  • Proximity Query processing:

Queries of the type X AND Y /3 or X Y /2 are called proximity queries. The expression means retrieve documents that contain both X and Y and 3 words or 2 words apart respectively.

def proximity_query(q,dictionary_positional):

q = re.sub(r”AND”, “”, q)
q = re.sub(r” “, “ “, q)
q = q.split(‘ ‘)
query = []

for i in q:
query.append(ps.stem(i))

word1 = dictionary_positional.get(query[0])
word2 = dictionary_positional.get(query[1])
anding = set(word1).intersection(word2)

query[2] = re.sub(r”/”, “”, query[2])
answer = []
skip = int(query[2]) + 1
for i in anding:
pp1 = dictionary_positional.get(query[0])[i]
pp2 = dictionary_positional.get(query[1])[i]
plen1 = len(pp1)
plen2 = len(pp2)
ii = jj = 0
while ii != plen1:
while jj != plen2:
if (abs(pp1[ii] — pp2[jj]) == skip):
answer.append(i)
elif pp2[jj] > pp1[ii]:
break
jj+=1
ii+=1
answer = list(dict.fromkeys(answer))
return answer

In the above function firstly the words and the skip number is extracted and the posting lists of the two words are obtained. Then the two lists are ANDed and the documents which contain both words are obtained. Then the intersection of the two lists are iterated to to check whether the two words are apart that is provided in user query.

6. Results:

The following results are obtained when Boolean Model is deployed on the local server.

Landing Page:

Homepage

When query entered: NOT running

Query result: NOT running

When query entered : After years /1

Query result: After years /1

When query entered : outdated OR (personnel AND policies)

Query result: outdated OR (personnel AND policies)

References:

[1]:https://en.wikipedia.org/wiki/Information_retrieval

[2]:https://en.wikipedia.org/wiki/Boolean_model_of_information_retrieval

Other references:

The full code can be found on my github repository.

If this article helped you, please like and share with others. Feel free to write suggestions as well in the comments below!

Thank you!!!

--

--