Information Retrieval using Boolean Query in Python

ASHNA JAIN
Published in Voice Tech Podcast
7 min read · Apr 6, 2019

Information retrieval is the process of extracting useful information that satisfies an information need from a large collection of unstructured data. It remains one of the most challenging tasks in NLP, due to the vast amount of unstructured data that must be processed. Millions of people retrieve information in one form or another every day. For example, when we search for the phrase “countries in asia”, two predominant words, countries and asia, come into the picture, and we need to ensure that the machine takes the semantics of the phrase into account while retrieving information from the documents.

Information retrieval works at different scales. It can take the form of a web search, where relevant information is selected from millions of documents, or of personal information retrieval, as in simple email filtering. Given a collection of documents, information retrieval helps filter out the most important documents based on the keywords specified in the user’s query.

In this article, we will use boolean queries to retrieve the most important documents from the data set. The extracted documents fulfill the user’s request by retrieving information on the basis of the semantics of the query.

Still confused??

Don’t worry, let’s walk through an example.

Consider that we have this set of documents: india.txt, narendra_modi.txt, rahul_gandhi.txt, apple.txt, australia.txt, cricket.txt, football.txt, volleyball.txt. Each document contains information about the topic its name suggests. Now, when the user inputs “bjp and india or congress and india” as a query, we retrieve narendra_modi.txt and rahul_gandhi.txt as the output documents. In other words, we get the set of documents from the data set that satisfies the input query.

Excited about learning the approach?? Let’s see how it works.

Overview of the approach used

Flow diagram

A linked list is used in this approach because it occupies less space: each word links to and processes only the files that contain it, rather than all the files in the dataset, as the toy sketch below illustrates.
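To make the idea concrete, here is a toy illustration (hypothetical data, not part of the article’s code) of what such an inverted index looks like conceptually:

# Hypothetical inverted index: each word maps only to the
# docIds of the files that actually contain it, instead of
# keeping an entry for every file in the dataset.
inverted_index = {
    "india":   [1, 2, 3],
    "cricket": [6],
    "apple":   [1, 4],
}

# Documents containing "india" are found without touching the other files.
print(inverted_index["india"])  # [1, 2, 3]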

If the flow diagram is not very clear, don’t worry; the technical terms will be covered in the coding section.

Let’s start coding!!!

Python code for implementing Information Retrieval using Boolean Query

Step-1 Importing the necessary libraries

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import glob
import re
import os
import numpy as np
import sys

Stopwords = set(stopwords.words('english'))

As mentioned in my previous article, NLTK is the most important library for NLP in Python. It contains packages for lemmatizing and tokenizing words, which are crucial pre-processing steps when dealing with text data.
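One practical note not shown in the snippet above: the NLTK resources used in this article need to be downloaded once per environment.

import nltk
nltk.download('stopwords')  # stop word lists used by stopwords.words('english')
nltk.download('punkt')      # tokenizer models used by sent_tokenize/word_tokenize
nltk.download('wordnet')    # lemmatizer data, needed only if WordNetLemmatizer is used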

Step-2 Finding the set of unique words from all documents of the data set

all_words = []
dict_global = {}
file_folder = 'data/*'
idx = 1
files_with_index = {}
for file in glob.glob(file_folder):
    print(file)
    fname = file
    file = open(file, "r")
    text = file.read()
    file.close()
    text = remove_special_characters(text)
    text = re.sub(re.compile(r'\d'), '', text)  # drop digits
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    words = [word for word in words if len(word) > 1]  # drop single-character tokens
    words = [word.lower() for word in words]
    words = [word for word in words if word not in Stopwords]
    dict_global.update(finding_all_unique_words_and_freq(words))
    files_with_index[idx] = os.path.basename(fname)
    idx = idx + 1

unique_words_all = set(dict_global.keys())

This code builds the vocabulary across all documents in the data set. The variable file_folder is the path to the dataset, which contains files with information about different topics. Here we access the files one by one and pre-process the text in each file using the steps defined in my previous article. My previous article can be accessed from here. After pre-processing, we update the dict_global variable each time by adding all the unique words found in the document. The variable files_with_index stores the index of each file; it basically stores (index, filename) as a (key, value) pair. Finally, set(dict_global.keys()) gives the set of all unique words, which we store in unique_words_all. Some functions used in the above code are explained in the next section.

The data files are stored in the data folder.
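If you want to try the code end to end, you can create a small data folder yourself; the file names and contents below are just hypothetical samples, not the actual dataset used in this article.

import os

# Hypothetical sample documents, just for testing the pipeline.
samples = {
    "india.txt": "India is a country in Asia.",
    "narendra_modi.txt": "Narendra Modi leads the BJP in India.",
    "rahul_gandhi.txt": "Rahul Gandhi leads the Congress in India.",
}

os.makedirs("data", exist_ok=True)
for name, content in samples.items():
    with open(os.path.join("data", name), "w") as f:
        f.write(content)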

Step-3 Implementing helper functions

def finding_all_unique_words_and_freq(words):
    words_unique = []
    word_freq = {}
    for word in words:
        if word not in words_unique:
            words_unique.append(word)
    for word in words_unique:
        word_freq[word] = words.count(word)
    return word_freq

def finding_freq_of_word_in_doc(word, words):
    freq = words.count(word)
    return freq

def remove_special_characters(text):
    regex = re.compile(r'[^a-zA-Z0-9\s]')
    text_returned = re.sub(regex, '', text)
    return text_returned

The function finding_all_unique_words_and_freq finds all the unique words along with their frequencies. To remove all the special characters, we use the remove_special_characters function. For a detailed understanding of the two functions, please refer here.
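For instance, on a small made-up input, the two helpers behave like this (expected output shown as comments):

text = remove_special_characters("Apple, and apple again!")
print(text)  # Apple and apple again

words = ['apple', 'apple', 'fruit']
print(finding_all_unique_words_and_freq(words))  # {'apple': 2, 'fruit': 1}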


Step-4 Defining the linked list

class Node:
    def __init__(self, docId, freq=None):
        self.freq = freq
        self.doc = docId
        self.nextval = None

class SlinkedList:
    def __init__(self, head=None):
        self.head = head

The Node class acts as a node for each word, storing the docId and the frequency of the word in that docId. The next document containing the word is linked to the current Node through the nextval variable.

The SlinkedList class holds the head pointer of each unique word’s linked list.

Confused?? Don’t worry, in the next section we will see how to use them.

Step-5 Making a linked list for each word and storing all the nodes (containing the file index and the frequency of the respective word) in the linked list

linked_list_data = {}
for word in unique_words_all:
    linked_list_data[word] = SlinkedList()
    # Dummy head node with docId 1; it is skipped during processing.
    linked_list_data[word].head = Node(1)
word_freq_in_doc = {}
idx = 1
for file in glob.glob(file_folder):
    file = open(file, "r")
    text = file.read()
    file.close()
    text = remove_special_characters(text)
    text = re.sub(re.compile(r'\d'), '', text)
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    words = [word for word in words if len(word) > 1]
    words = [word.lower() for word in words]
    words = [word for word in words if word not in Stopwords]
    word_freq_in_doc = finding_all_unique_words_and_freq(words)
    for word in word_freq_in_doc.keys():
        # Walk to the end of the word's linked list and append a new node.
        linked_list = linked_list_data[word].head
        while linked_list.nextval is not None:
            linked_list = linked_list.nextval
        linked_list.nextval = Node(idx, word_freq_in_doc[word])
    idx = idx + 1

We start by initializing a new linked list for every unique word. The first node of each linked list contains 1 as a default docId, which can be ignored in later processing. After the linked lists are initialized, each file in the data set is read, and all the unique words in the file are stored in word_freq_in_doc. Then the words are accessed one at a time from the word_freq_in_doc dictionary, and the linked list of the respective word appends a new node (containing the file index and the frequency of that word in the file). Let us look at the example below.

Linked list of the word apple when apple is present in file numbers 1, 2, 5 and 7:

Apple->1->2->5->7
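To see that chain in code, here is a minimal sketch that walks the postings list of a word, assuming the linked_list_data structure built above:

def print_postings(word):
    # Skip the dummy head node, then collect the docId of each node.
    node = linked_list_data[word].head.nextval
    postings = []
    while node is not None:
        postings.append(node.doc)
        node = node.nextval
    print(word, '->', '->'.join(str(d) for d in postings))

# Hypothetical output if 'apple' occurs in files 1, 2, 5 and 7:
# apple -> 1->2->5->7
print_postings('apple')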

Now the fun part begins!!!

Step-6 Query processing and output generation

query = input('Enter your query:')
query = word_tokenize(query)
connecting_words = []
cnt = 1
different_words = []
for word in query:
    if word.lower() != "and" and word.lower() != "or" and word.lower() != "not":
        different_words.append(word.lower())
    else:
        connecting_words.append(word.lower())
print(connecting_words)
total_files = len(files_with_index)
zeroes_and_ones = []
zeroes_and_ones_of_all_words = []
for word in different_words:
    if word.lower() in unique_words_all:
        zeroes_and_ones = [0] * total_files  # one slot per file
        linkedlist = linked_list_data[word].head
        print(word)
        while linkedlist.nextval is not None:
            # Mark the file that contains this word.
            zeroes_and_ones[linkedlist.nextval.doc - 1] = 1
            linkedlist = linkedlist.nextval
        zeroes_and_ones_of_all_words.append(zeroes_and_ones)
    else:
        print(word, " not found")
        sys.exit()
print(zeroes_and_ones_of_all_words)
for word in connecting_words:
    # Operators are applied left to right on the first two bitmaps.
    word_list1 = zeroes_and_ones_of_all_words[0]
    word_list2 = zeroes_and_ones_of_all_words[1]
    if word == "and":
        bitwise_op = [w1 & w2 for (w1, w2) in zip(word_list1, word_list2)]
        zeroes_and_ones_of_all_words.remove(word_list1)
        zeroes_and_ones_of_all_words.remove(word_list2)
        zeroes_and_ones_of_all_words.insert(0, bitwise_op)
    elif word == "or":
        bitwise_op = [w1 | w2 for (w1, w2) in zip(word_list1, word_list2)]
        zeroes_and_ones_of_all_words.remove(word_list1)
        zeroes_and_ones_of_all_words.remove(word_list2)
        zeroes_and_ones_of_all_words.insert(0, bitwise_op)
    elif word == "not":
        # 'a not b' is evaluated as 'a AND (NOT b)'.
        bitwise_op = [int(not w) for w in word_list2]
        zeroes_and_ones_of_all_words.remove(word_list2)
        zeroes_and_ones_of_all_words.remove(word_list1)
        bitwise_op = [w1 & w2 for (w1, w2) in zip(word_list1, bitwise_op)]
        zeroes_and_ones_of_all_words.insert(0, bitwise_op)

files = []
print(zeroes_and_ones_of_all_words)
lis = zeroes_and_ones_of_all_words[0]
cnt = 1
for index in lis:
    if index == 1:
        files.append(files_with_index[cnt])
    cnt = cnt + 1

print(files)

We start by asking the user to input a query. The query should be a boolean query of the form: word1 connecting_word word2 connecting_word word3 … and so on. Note that connecting_word here refers to and, or and not.

For example: apple and fruit and india or mango.

The output of this query will fetch all the documents containing all three words (apple, fruit and india), or containing the word mango.

This code separates the connecting_words from the other words. Here, (apple, fruit, india and mango) will be stored in the different_words variable and (and, and, or) will be stored in the connecting_words variable. For the boolean operations, we build a bitmap for each non-connecting word in the query. The bitmap stores a 1 at the index of a file if the file contains the word, and 0 otherwise. Once the bitmaps are built, the bitwise operations are performed one by one, strictly left to right with no operator precedence, combining the first two bitmaps according to the connecting_word between them.
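As a worked example, here is how the query “bjp and india or congress and india” evaluates over the eight sample files, assuming (hypothetically) that bjp appears only in file 2, congress only in file 3, and india in files 1–3:

# Hypothetical bitmaps over the 8 sample files; index i corresponds to file i+1.
bjp      = [0, 1, 0, 0, 0, 0, 0, 0]   # only narendra_modi.txt
india    = [1, 1, 1, 0, 0, 0, 0, 0]
congress = [0, 0, 1, 0, 0, 0, 0, 0]   # only rahul_gandhi.txt

# 'bjp and india'   -> AND the first two bitmaps
step1 = [a & b for a, b in zip(bjp, india)]        # [0,1,0,0,0,0,0,0]
# '... or congress' -> OR with the next bitmap
step2 = [a | b for a, b in zip(step1, congress)]   # [0,1,1,0,0,0,0,0]
# '... and india'   -> AND with the last bitmap
result = [a & b for a, b in zip(step2, india)]     # [0,1,1,0,0,0,0,0]
print(result)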

After the processing is done, finally we output those files where the index on the bitmap shows 1.

For example (sample documents): india.txt, narendra_modi.txt, rahul_gandhi.txt, apple.txt, australia.txt, cricket.txt, football.txt, volleyball.txt. When we input “bjp and india or congress and india” as a query, we get the bitmap [0,1,1,0,0,0,0,0]. The 1s appear at the indices of the matching documents, so we get the following output documents: narendra_modi.txt and rahul_gandhi.txt.

The full code can be assembled by merging the individual snippets in this article in the following sequence: Step 1 -> Step 3 -> Step 4 -> Step 2 -> Step 5 -> Step 6.

If this article helped you, please like and share with others. Feel free to write suggestions as well in the comments below!

Thank you!!!
