How to implement the Boolean Retrieval Model?

Syed Muhammad Faheem
7 min read · Mar 15, 2023


We all use Google in our everyday lives, whether for education, work, or entertainment. Have you ever wondered how Google works? What is the basic idea behind this famous search engine? That is where the fields of Data Mining and Information Retrieval come in.

In the 21st century, we are surrounded by huge amounts of data, but in its raw form this data is of little use until it is processed. The conversion of raw data into meaningful, relevant information according to the user’s needs is called Data Mining.

Information Retrieval (IR) is a subfield of Data Mining that deals with finding documents of an unstructured nature from a corpus (a set of documents) that satisfy an information need. There are many kinds of IR models for different requirements. The most basic is the Boolean Retrieval Model, which gives flat results for the queried term: “flat” means it only reports whether a particular term exists in a document. The Boolean Model can handle only simple queries and loses precision when a complex query is encountered.

Implementation of a Boolean Retrieval Model

The two basic elements that need to be created are the dictionary and the posting list. The dictionary contains all the terms extracted from the corpus, sorted in lexicographical (alphabetical) order, and the posting list holds, for each term, the IDs of the documents in which it appears.

In my implementation, a positional index is also created for each word, recording the positions at which the word occurs in each document. This makes it possible to handle proximity queries, which take the form word1 word2 \k. Here, ‘k’ is the number of words that may separate the two terms (word2 appears k words apart from word1).
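To make the data structure concrete, here is a minimal sketch of one possible positional-index layout. The names `positional_index`, `postings`, and `positions` are illustrative, not taken from the article’s code; the article’s class stores a similar mapping internally.

```python
# Each term maps to a dict of document IDs, and each document ID maps to the
# sorted word positions at which the term occurs in that document.
positional_index = {
    "cricket": {1: [0, 14], 3: [7]},  # word 0 and 14 in doc 1, word 7 in doc 3
    "review":  {1: [1], 2: [5]},
}

def postings(term):
    """Return the sorted document IDs containing `term` (its posting list)."""
    return sorted(positional_index.get(term, {}))

def positions(term, doc_id):
    """Return the positions of `term` inside document `doc_id`."""
    return positional_index.get(term, {}).get(doc_id, [])

print(postings("cricket"))      # [1, 3]
print(positions("cricket", 1))  # [0, 14]
```

Keeping the posting lists sorted by document ID is what makes the linear merge in the AND query below possible.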

In our case, the corpus we are dealing with is “Cricket Reviews”: reviews of famous cricketing situations from the past. The set consists of 30 documents, and the first step is to read each document and store them all in a list.

def readDataset(self):  # read the data set file by file and store each document as a string
    for i in range(30):
        filename = 'CricketReviews/' + str(i+1) + '.txt'
        with open(filename) as file:
            lines = file.readlines()
        tempDoc = ''
        for line in lines:
            if line == '\n':  # skip blank lines
                continue
            if '\n' in line:
                tempDoc += line.replace('\n', '. ')  # treat a line break as a sentence end
            else:
                tempDoc += line
        self.__files.append(tempDoc)

The next step is to read each document and then tokenize and normalize it. This is one of the most important steps in implementing this model. The steps to tokenize and normalize are:

  1. Remove White Spaces: This refers to removing leading, trailing, and extra in-between white spaces in a document. Since the documents are unstructured, stray white space is very likely.
  2. Case Folding: This refers to converting the whole document into a single uniform case, i.e. either lowercase or uppercase. A uniform case makes looking up a word in the dictionary easier and simplifies query formulation.
  3. Tokenize Sentences: This step is optional, but splitting a document into sentences lets you process small chunks instead of one huge piece of text, which uses memory more efficiently.
  4. Remove Punctuation: The dictionary is a collection of valid English words, so punctuation adds nothing. This step removes punctuation marks like ‘.,!?’ and many more, bringing us closer to a refined, dictionary-acceptable set of words.
  5. Tokenize Words: These smaller chunks of documents, a.k.a. sentences, are further split into words on the basis of blank spaces, since punctuation and extra white space have already been removed.
  6. Expand Contractions: Some tokens are two words joined by an apostrophe, which hurts precision at query time. To handle this, contracted words are expanded into their two separate words and stored back in the list.
  7. Remove Stopwords: Stopwords are words that appear very frequently but carry little meaning on their own, such as “and, the, a, is, be” and many more. They are not needed in the dictionary, since the Boolean query format does not allow phrase (natural-text) queries, so a user query is unlikely to be about them.
  8. Lemmatization: Lemmatization is a rigorous process that uses a dictionary and context to determine the lemma, and is considered a slow approach. It helps increase the precision of the system, i.e. how relevant the returned results are to the user’s query.
  9. Stemming: Stemming is a heuristic, rule-based approach that works on a single term at a time. It is generally fast but can generate unreadable tokens. It helps increase the recall of the system, i.e. how many of the relevant documents in the corpus are actually returned.
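The steps above can be sketched with plain Python. This is a minimal, dictionary-free version of steps 1–7 (lemmatization and stemming need a library such as NLTK, so they are omitted here), and the stopword and contraction tables are tiny illustrative samples, not the full lists a real system would use.

```python
import string

STOPWORDS = {"a", "an", "and", "the", "is", "be", "of", "to"}      # sample only
CONTRACTIONS = {"don't": "do not", "it's": "it is", "he's": "he is"}  # sample only

def normalize(text):
    text = " ".join(text.split())                  # 1. collapse extra whitespace
    text = text.lower()                            # 2. fold to a single case
    words = []
    for token in text.split():                     # 5. split on blank spaces
        token = CONTRACTIONS.get(token, token)     # 6. expand contractions
        for word in token.split():
            word = word.strip(string.punctuation)  # 4. strip punctuation
            if word and word not in STOPWORDS:     # 7. drop stopwords
                words.append(word)
    return words

print(normalize("It's  a GREAT day, and he don't mind!"))
# ['it', 'great', 'day', 'he', 'do', 'not', 'mind']
```

Sentence tokenization (step 3) is skipped here because, as noted above, it is an optional memory optimization rather than a correctness requirement.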

After performing all the above-mentioned steps, you are now ready to append the normalized set of words to your dictionary and their respective posting list. Both the dictionary and posting list are then written to separate files for faster retrieval during the query execution.
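The article does not show its on-disk format, so here is one possible sketch using plain JSON. One subtlety worth noting: JSON object keys are always strings, so integer document IDs must be converted back on load.

```python
import json, os, tempfile

# A toy index: term -> {docID: [positions]} (illustrative layout).
index = {"cricket": {1: [0, 14], 3: [7]}, "review": {1: [1]}}

path = os.path.join(tempfile.mkdtemp(), "postings.json")
with open(path, "w") as f:
    json.dump(index, f)            # write the index to disk once, after building

with open(path) as f:              # at query time, load it back
    loaded = {term: {int(doc): pos for doc, pos in docs.items()}
              for term, docs in json.load(f).items()}

print(loaded == index)  # True
```

Any serialization format works here; the point is simply that the index is built once and then loaded from disk for fast query execution.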

Types of Queries

The queries in this Boolean model consist of at most three terms combined with the three basic Boolean operators AND, OR, and NOT. The model can also handle proximity queries.
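Before looking at the class methods below, here is a tiny standalone evaluator for these query shapes over plain sorted posting lists. The toy `index`, `N_DOCS`, and the function names are illustrative stand-ins for the article’s class and its positional index.

```python
N_DOCS = 5  # documents are numbered 1..N_DOCS
index = {"cricket": [1, 2, 4], "review": [2, 3]}

def docs(term):
    """Posting list of `term` as a set (empty if the term is unknown)."""
    return set(index.get(term, []))

def evaluate(query):
    tokens = query.split()
    if len(tokens) == 2 and tokens[0] == "NOT":   # complement: "NOT term"
        return sorted(set(range(1, N_DOCS + 1)) - docs(tokens[1]))
    left, op, right = tokens                      # binary form: "term1 OP term2"
    if op == "AND":
        return sorted(docs(left) & docs(right))   # intersection
    if op == "OR":
        return sorted(docs(left) | docs(right))   # union
    raise ValueError("unsupported query: " + query)

print(evaluate("cricket AND review"))  # [2]
print(evaluate("cricket OR review"))   # [1, 2, 3, 4]
print(evaluate("NOT review"))          # [1, 4, 5]
```

The class methods below implement the same three operations, but over the positional index and with an explicit sorted-list merge for the intersection.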

Intersection Query (AND)

def andQuery(self, term1, term2):  # merge-intersect the two sorted posting lists
    if term1 not in self.__words or term2 not in self.__words:
        return []
    termKeyOne = list(self.__words[term1][0].keys())
    termKeyTwo = list(self.__words[term2][0].keys())
    result = []
    i, j = 0, 0
    while i < len(termKeyOne) and j < len(termKeyTwo):
        if termKeyOne[i] == termKeyTwo[j]:   # document contains both terms
            result.append(termKeyOne[i])
            i += 1
            j += 1
        elif termKeyOne[i] < termKeyTwo[j]:  # advance the pointer at the smaller ID
            i += 1
        else:
            j += 1
    return result

Union Query (OR)

def orQuery(self, term1, term2):  # union of the two posting lists
    result = []
    if term1 in self.__words:
        result += list(self.__words[term1][0].keys())
    if term2 in self.__words:
        result += list(self.__words[term2][0].keys())
    return sorted(set(result))

Complement Query (NOT)

def notQuery(self, term):  # all document IDs (1..30) that do not contain the term
    presentDocs = list(self.__words[term][0].keys()) if term in self.__words else []
    result = []
    for i in range(1, 31):
        if i not in presentDocs:
            result.append(i)
    return result

Proximity Query

def proximityQuery(self, term1, term2, k):  # docs where the terms occur about k words apart
    result = []
    candidates = self.andQuery(term1, term2)  # only docs containing both terms qualify
    for doc in candidates:
        index1 = self.__words[term1][0][doc]  # positions of term1 in this document
        index2 = self.__words[term2][0][doc]  # positions of term2 in this document
        for i in range(len(index1)):
            for j in range(len(index2)):
                # accept a match in either direction, k or k+1 positions apart
                if index1[i] + k + 1 == index2[j] or index1[i] + k == index2[j] \
                        or index2[j] + k + 1 == index1[i] or index2[j] + k == index1[i]:
                    if doc not in result:
                        result.append(doc)
    return result

User Interface

In this project, the user is also provided with a graphical user interface for smooth query execution. The design is minimalistic and user-friendly.

The interface contains an entry box to type the query and a search button to execute it and display the results.

Example Queries and Results

(Screenshots: results of a proximity query, a union query, and an intersection query.)

Conclusion

The advantage of the Boolean retrieval model is that it is easy to implement and queries are simple to formulate. On the other hand, the user cannot execute phrase or natural-text queries. The Boolean model is also unable to rank documents by how similar they are to the user’s information need; thankfully, that problem can be handled by the vector space model, which is a topic for another article. The Boolean model seeds the idea for more advanced Information Retrieval models and serves perfectly for a small set of documents where the only requirement is to know which documents a word appears in.

GitHub Repository Link

https://github.com/SyedMuhammadFaheem/InformationRetrieval


If you liked this article, follow me here for more tech-related stuff.

My LinkedIn: https://www.linkedin.com/in/muhammadfaheemnu/
