The very first (and naive) information retrieval system

Mohammad Derakhshan
3 min read · Mar 26, 2022


We are about to implement our very first IR system in this article. Previously, we talked about tokenization; if you are not familiar with that topic, you can check it out here:

Before we jump into the implementation, we need to understand some concepts. The first one is the Bag of Words. A Bag of Words (BOW) is a dictionary that maps each word in a document to its frequency. Let's clarify this concept with an example. Suppose we have a tokenized version of our sample text.
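As a minimal sketch, assume two short, hypothetical recipe-style documents (standing in for the article's sample text), tokenized like this:

# Hypothetical tokenized documents, standing in for the article's sample text
tokenized_docs = {
    "doc1": ["add", "the", "sugar", "to", "the", "flour"],
    "doc2": ["mix", "the", "eggs", "with", "sugar"],
}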

Now let's look at the corresponding BOW.
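For the toy documents above, each document maps to its own word-frequency dictionary (continuing the hypothetical sketch):

# Hypothetical BOW for the toy documents: one frequency dictionary per document
bow = {
    "doc1": {"add": 1, "the": 2, "sugar": 1, "to": 1, "flour": 1},
    "doc2": {"mix": 1, "the": 1, "eggs": 1, "with": 1, "sugar": 1},
}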

The Bag of Words is practical when you want to know which words you have in each document. But sometimes, you need to know which documents a specific term appears in. If you use the BOW, you need to explore all documents and check whether the count of that word in each document is greater than zero. There is another supporting data structure that makes this task easier: the Inverted Index. An inverted index is a dictionary containing words as keys and, as values, the list of all documents containing that word.

If I want to know which words are in each document, I use the BOW; if I want to know which documents a word appears in, I use the Inverted Index.

Implementing these two data structures is not complex. The BOW is a dictionary of dictionaries (because we have one BOW per document), and the Inverted Index is just a dictionary.
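A minimal sketch of both structures, reusing the hypothetical tokenized_docs from above:

from collections import Counter, defaultdict

# BOW: one word-frequency dictionary per document (a dictionary of dictionaries)
bow = {doc_id: dict(Counter(tokens)) for doc_id, tokens in tokenized_docs.items()}

# Inverted Index: each word maps to the list of documents that contain it
inverted_index = defaultdict(list)
for doc_id, tokens in tokenized_docs.items():
    for token in set(tokens):
        inverted_index[token].append(doc_id)

print(inverted_index["sugar"])  # ['doc1', 'doc2']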

Now that we are familiar with the basic concepts, let's create our IR system. Here is what we want to do: we have a dataset containing recipes, and we have a query; we want to find the documents most relevant to that query. First, we create the document-term matrix. The rows of this matrix are documents, and the columns are the unique words across all documents.
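Since we will need this matrix in the cosine-similarity snippet below (where it is called T), here is a minimal sketch of this step, still assuming the toy tokenized_docs:

import numpy as np

# One row per document, one column per unique word in the whole collection
vocabulary = sorted({token for tokens in tokenized_docs.values() for token in tokens})
word_index = {word: i for i, word in enumerate(vocabulary)}
doc_ids = list(tokenized_docs)

T = np.zeros((len(doc_ids), len(vocabulary)))
for row, doc_id in enumerate(doc_ids):
    for token in tokenized_docs[doc_id]:
        T[row, word_index[token]] += 1

print(vocabulary)
print(T)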

Running the sketch above on the toy documents, we generate the following output:
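['add', 'eggs', 'flour', 'mix', 'sugar', 'the', 'to', 'with']
[[1. 0. 1. 0. 1. 2. 1. 0.]
 [0. 1. 0. 1. 1. 1. 0. 1.]]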

The second step is to generate an array whose length equals the number of unique words across all documents, filled with the frequency of each word in the query. For example, suppose our dataset has 1,500 unique words and the query is: "here is the first query for the test." In this case, we need to generate an array of length 1,500 filled with the frequency of each query word. All the array elements are zero except for seven positions, one per unique query word (the position for "the," which appears twice, holds a 2).
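A sketch of this step, reusing vocabulary and word_index from above (qv is the name the snippet below expects; query words outside the collection's vocabulary are simply skipped):

# Query vector: same length as the vocabulary, holding each query word's frequency
query = "here is the first query for the test"
qv = np.zeros(len(vocabulary))
for token in query.split():
    if token in word_index:  # skip words that never occur in the collection
        qv[word_index[token]] += 1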

We generate the same kind of array for each document; these are exactly the rows of the document-term matrix. Now we need to calculate the similarity between the query array and each document array. To do that, we use Cosine Similarity:

from sklearn.metrics.pairwise import cosine_similarity

# Similarity between the query vector and every row (document) of T
Sigma = cosine_similarity(qv.reshape(1, -1), T)

# Sort document indices by similarity, highest first
answers = sorted(enumerate(Sigma[0]), key=lambda x: -x[1])
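For example, to inspect the best matches (using the doc_ids list from the sketch above):

# Show the top-ranked documents with their similarity scores
for doc_index, score in answers[:3]:
    print(doc_ids[doc_index], round(score, 3))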

The value of the cosine similarity indicates the degree of relevance: the higher the cosine similarity, the more relevant the document. Finally, answers holds the document ids and their cosine similarities, sorted by similarity value.

This approach works, but it has lots of problems. For example, it is not context-sensitive, and if the query contains many meaningless words, like "and," "of," "a," "the," etc., the result will be biased. The indexing mechanism is not satisfactory either: what if we want to use this system for web queries? In that case, we would need to grab all the web content and generate a BOW just to build the document-term matrix! Another significant problem is that if we compute the most frequent words in each document, we will see that most of them are meaningless words like "a," "of," and "the," which means a query containing plenty of "the" will match such documents with high probability.

This article is inspired by the topics taught by Professor Alfio Ferrara at the University of Milan.


Mohammad Derakhshan

Hi! I'm Mohammad, a master's student at the University of Milan. I'm an Android expert who loves NLP! You can search for me on LinkedIn to make a connection!