Mastering Information Retrieval: Building Intelligent Search Systems (Chapter 1)

Prateek Gaurav
11 min read · Apr 1, 2023


Chapter 1: Keyword Search: The Foundation of Information Retrieval

Reference Chapter: “Evaluating Information Retrieval Models: A Comprehensive Guide to Performance Metrics”
Chapter 1: “Keyword Search: The Foundation of Information Retrieval”
Chapter 2: “TF-IDF and Vector Space Models: Beyond Keyword Search”
Chapter 3: “Latent Semantic Indexing: Uncovering Hidden Relationships”
Chapter 4: “Word2Vec and Doc2Vec: Capturing Semantic Relationships with Distributed Representations”
Chapter 5: “Transformers in Information Retrieval: Leveraging Pre-trained Models for Enhanced Search”
Chapter 6: “Incorporating Natural Language Processing: Enhancing Search with Language Understanding”

Google Colab File: Coming Soon…
Dataset: Cranfield Collection Data

1. Introduction

1.1. Brief overview of the “Mastering Information Retrieval” series

Information retrieval is a crucial aspect of modern computing systems, as it enables users to find relevant and accurate information from massive collections of data. This article series, “Mastering Information Retrieval: Building Intelligent Search Systems,” aims to provide a comprehensive understanding of the various techniques, models, and methodologies used in the field of information retrieval. We will explore different models, ranging from the foundational keyword search to advanced machine learning-based techniques, and discuss how these models can be built, implemented, and evaluated. By the end of this series, we will have a solid understanding of information retrieval techniques and be able to make informed decisions about which model is most suitable for our specific use case.

1.2. Models to be covered in the series

Throughout this series, we will cover the following information retrieval models:

  • Keyword Search
  • TF-IDF and Vector Space Models
  • Latent Semantic Indexing
  • Search Personalization and Collaborative Filtering
  • Incorporating Natural Language Processing
  • Learning to Rank
  • Evaluating Search Systems

1.3. Importance of evaluating and comparing different information retrieval models

Evaluating and comparing different information retrieval models is essential to understanding their strengths and weaknesses. This will enable us to make informed decisions about which model is most suitable for our specific needs. In this series, we will discuss various evaluation metrics and methodologies for measuring the performance of information retrieval systems, helping us to choose the best model for our application.

2. Keyword Search Basics

2.1. Definition and concept of keyword search

Keyword search is the foundation of information retrieval and one of the most straightforward techniques for finding relevant documents in a large dataset. The idea behind keyword search is to match the user’s query, composed of one or more keywords, with documents that contain these keywords. In essence, the model identifies documents that are relevant to the user’s query based on the presence of the specified keywords within the document text.

2.2. Importance of keyword search in information retrieval

Despite its simplicity, keyword search remains an essential tool in information retrieval due to its ease of implementation and effectiveness in a wide range of applications. It is often the starting point for more advanced search techniques and serves as a baseline against which other models can be compared. Keyword search is widely used in search engines, databases, and various software applications to help users find the information they are looking for quickly and efficiently.

3. Types of Keyword Search

3.1. Boolean search

3.1.1. AND, OR, and NOT operators

Boolean search is a type of keyword search that allows users to combine keywords using logical operators such as AND, OR, and NOT. This enables users to narrow or broaden their search results, making it easier to find relevant documents. The AND operator retrieves documents containing all the specified keywords, the OR operator retrieves documents containing at least one of the keywords, and the NOT operator excludes documents containing a specific keyword.

3.1.2. Use cases and examples

Boolean search is particularly useful when users have specific requirements or want to filter out irrelevant results. For example:

  • “machine learning” AND “neural networks” will retrieve documents containing both terms.
  • “machine learning” OR “deep learning” will retrieve documents containing either term.
  • “machine learning” NOT “supervised” will retrieve documents containing “machine learning” but not “supervised.”
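
To make the three operators concrete, here is a minimal sketch of Boolean matching over a toy corpus using Python sets. The documents and queries are invented for illustration, and the sketch includes NOT, which the fuller implementation in Section 4.2 does not cover:

# A minimal sketch of Boolean matching over a toy corpus (illustrative only)
docs = {
    1: "machine learning with neural networks",
    2: "supervised machine learning basics",
    3: "deep learning for vision",
}

def terms(text):
    return set(text.split())

# AND: documents containing all query terms
and_hits = [d for d, t in docs.items() if {"machine", "learning"} <= terms(t)]

# OR: documents containing at least one query term
or_hits = [d for d, t in docs.items() if {"machine", "deep"} & terms(t)]

# NOT: documents containing "learning" but not "supervised"
not_hits = [d for d, t in docs.items()
            if "learning" in terms(t) and "supervised" not in terms(t)]

print(and_hits)  # [1, 2]
print(or_hits)   # [1, 2, 3]
print(not_hits)  # [1, 3]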

3.2. Phrase search

3.2.1. Concept and use cases

Phrase search is a type of keyword search that allows users to find documents containing an exact phrase rather than individual keywords. By enclosing a phrase in quotes, the search engine will retrieve documents containing the exact sequence of words within the quotes. Phrase search is useful when users are looking for specific information or when common words are used in a unique context.

3.2.2. Examples

For example:

  • Searching for “natural language processing” will retrieve documents containing the exact phrase rather than documents containing only “natural,” “language,” or “processing.”
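
To see the contrast with plain keyword matching, here is a two-line sketch on an invented document: substring matching requires the exact word sequence, while keyword matching does not:

doc = "processing of natural language requires language processing tools"

# Phrase match: the exact word sequence must appear
print("natural language processing" in doc)  # False

# Keyword match: each term may appear anywhere in the document
print(all(w in doc.split() for w in ["natural", "language", "processing"]))  # True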

3.3. Wildcard search

3.3.1. Single-character and multiple-character wildcards

Wildcard search allows users to find documents containing variations of a keyword by using special characters to replace one or more characters within the keyword. The single-character wildcard (usually represented by a question mark) replaces exactly one character, while the multiple-character wildcard (usually represented by an asterisk) can replace any number of characters.

3.3.2. Use cases and examples

Wildcard search is useful when users are unsure about the exact spelling of a keyword or want to find documents containing variations of a word. For example:

  • Searching for “comput?” will retrieve documents containing “compute,” “computer,” and “computed.”
  • Searching for “optim*” will retrieve documents containing “optimization,” “optimize,” “optimal,” and other variations.
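
One way to see this wildcard-to-regex translation in action is Python’s standard fnmatch module; a quick sketch (the implementation in Section 4.2 performs a similar translation by hand):

import fnmatch
import re

# fnmatch.translate turns shell-style wildcards into a regular expression:
# '?' becomes '.', '*' becomes '.*', and the pattern is anchored at the end
print(fnmatch.translate("optim*"))   # e.g. '(?s:optim.*)\\Z'
print(fnmatch.translate("comput?"))  # e.g. '(?s:comput.)\\Z'

regex = re.compile(fnmatch.translate("optim*"))
words = ["optimization", "optimize", "optimal", "options"]
print([w for w in words if regex.match(w)])
# ['optimization', 'optimize', 'optimal']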

4. Building a Keyword Search Model

4.1. Data preparation and preprocessing

Before building a keyword search model, it is essential to preprocess the dataset, including tokenizing the text, removing stop words, and stemming or lemmatizing words. This ensures that the search results are accurate and relevant.

import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK resources used below; download them once with:
# nltk.download('stopwords'); nltk.download('wordnet')

def parse_cran_documents(file_path):
    # Parse cran.all.1400 into a {doc_id: text} dictionary
    with open(file_path, 'r') as f:
        content = f.read()
    docs_raw = content.split(".I ")[1:]
    docs = {}
    for doc in docs_raw:
        lines = doc.strip().split("\n")
        doc_id = int(lines[0])
        text = " ".join(lines[2:])
        docs[doc_id] = text
    return docs

def parse_cran_queries(file_path):
    # Parse cran.qry into a {query_id: text} dictionary
    with open(file_path, 'r') as f:
        content = f.read()
    queries_raw = content.split(".I ")[1:]
    queries = {}
    for query in queries_raw:
        lines = query.strip().split("\n")
        query_id = int(lines[0])
        text = " ".join(lines[1:]).replace(".W", "").strip()
        queries[query_id] = text
    return queries

def parse_cran_relevance(file_path):
    # Parse cranqrel into {query_id: {'relevant_docs': [doc_id, ...]}}
    with open(file_path, 'r') as f:
        content = f.read()
    lines = content.strip().split("\n")
    relevance_info = {}
    for line in lines:
        line_parts = line.split()
        query_id, doc_id = int(line_parts[0]), int(line_parts[1])
        if query_id not in relevance_info:
            relevance_info[query_id] = {"relevant_docs": []}
        relevance_info[query_id]["relevant_docs"].append(doc_id)
    return relevance_info

documents = parse_cran_documents("gdrive/My Drive/datasets/Mastering Information Retrieval/Cranfield Collection/cran.all.1400")
queries = parse_cran_queries("gdrive/My Drive/datasets/Mastering Information Retrieval/Cranfield Collection/cran.qry")
relevance_info = parse_cran_relevance("gdrive/My Drive/datasets/Mastering Information Retrieval/Cranfield Collection/cranqrel")

# Create documents dataframe, indexed by doc_id so that row labels match
# the document IDs used in the relevance judgments
documents_df = pd.DataFrame(list(documents.items()), columns=['doc_id', 'text']).set_index('doc_id')

# Create queries dataframe, indexed by query_id for the same reason
queries_df = pd.DataFrame(list(queries.items()), columns=['query_id', 'text']).set_index('query_id')

# Create relevance information dataframe
relevance_list = []
for query_id, query_info in relevance_info.items():
    relevant_docs = query_info['relevant_docs']
    for doc_id in relevant_docs:
        relevance_list.append([query_id, doc_id])

relevance_df = pd.DataFrame(relevance_list, columns=['query_id', 'doc_id'])

def preprocess(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and digits
    text = re.sub(r'\W+|\d+', ' ', text)

    # Remove stopwords and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words])

    return text

# Preprocess documents and queries
documents_df['preprocessed_text'] = documents_df['text'].apply(preprocess)
queries_df['preprocessed_text'] = queries_df['text'].apply(preprocess)
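
As a quick sanity check, here is roughly what the preprocessing step produces on an invented sentence (the exact output depends on NLTK’s stop word list and lemmatizer):

print(preprocess("The wings of the aircraft were tested in the wind tunnel."))
# Roughly: 'wing aircraft tested wind tunnel'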

4.2. Implementing keyword search

In this section, we will discuss the implementation of the four information retrieval models: simple keyword search, Boolean search, phrase search, and wildcard search. The following Python code demonstrates how each of these models can be implemented:

# Simple Keyword Search: a document matches if it shares at least
# one keyword with the query
def keyword_search(query, documents):
    keywords = set(query.split())
    matched_docs = []

    for idx, doc in documents.iterrows():
        if keywords.intersection(set(doc['preprocessed_text'].split())):
            matched_docs.append(idx)

    return matched_docs

# Boolean Search: 'OR' matches documents containing any of the
# keywords, 'AND' matches documents containing all of them
def boolean_search(query, documents, operator='OR'):
    keywords = set(query.split())
    matched_docs = []

    for idx, doc in documents.iterrows():
        doc_keywords = set(doc['preprocessed_text'].split())
        if operator == 'OR':
            if keywords.intersection(doc_keywords):
                matched_docs.append(idx)
        elif operator == 'AND':
            if keywords.issubset(doc_keywords):
                matched_docs.append(idx)

    return matched_docs

# Phrase Search: a document matches if it contains the query as an
# exact substring
def phrase_search(query, documents):
    phrase = query.strip()
    matched_docs = []

    for idx, doc in documents.iterrows():
        if phrase in doc['preprocessed_text']:
            matched_docs.append(idx)

    return matched_docs

# Wildcard Search: translate '*' and '?' into their regex
# equivalents and match against each document
def wildcard_search(query, documents):
    query = query.replace('*', '.*').replace('?', '.')
    query_regex = re.compile(query)
    matched_docs = []

    for idx, doc in documents.iterrows():
        if query_regex.search(doc['preprocessed_text']):
            matched_docs.append(idx)

    return matched_docs
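
As a rough usage sketch, each function takes a preprocessed query and the documents dataframe built in Section 4.1 (the wildcard query string here is invented for illustration):

query_text = queries_df.loc[1, 'preprocessed_text']

kw_hits = keyword_search(query_text, documents_df)
or_hits = boolean_search(query_text, documents_df, operator='OR')
and_hits = boolean_search(query_text, documents_df, operator='AND')
phrase_hits = phrase_search(query_text, documents_df)
wild_hits = wildcard_search('aeroelastic*', documents_df)

print(len(kw_hits), len(or_hits), len(and_hits), len(phrase_hits), len(wild_hits))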

4.3. Evaluation of the keyword search model

In this section, we will discuss the evaluation process for the keyword search models we implemented in the previous section. To assess the performance of these models, we will use three evaluation metrics: precision, recall, and F1-score. The following Python code demonstrates the evaluation function used to calculate these metrics for each search model:

def evaluate_search_model(search_function, search_args, queries_df, documents_df, relevance_info):
    precision_scores = []
    recall_scores = []
    f1_scores = []

    for query_id, query_info in relevance_info.items():
        if query_id not in queries_df.index:
            print(f"Query ID {query_id} not found in the queries DataFrame. Skipping evaluation for this query.")
            continue

        relevant_docs = set(query_info['relevant_docs'])
        retrieved_docs = set(search_function(queries_df.loc[query_id, 'preprocessed_text'], documents_df, **search_args))

        true_positives = len(retrieved_docs.intersection(relevant_docs))
        false_positives = len(retrieved_docs - relevant_docs)
        false_negatives = len(relevant_docs - retrieved_docs)

        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        precision_scores.append(precision)
        recall_scores.append(recall)
        f1_scores.append(f1)

    mean_precision = np.mean(precision_scores)
    mean_recall = np.mean(recall_scores)
    mean_f1 = np.mean(f1_scores)

    return mean_precision, mean_recall, mean_f1
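
Here is a sketch of how this function can be driven for all four models; the search_args dictionary carries the extra operator argument that only Boolean search needs:

models = {
    'Simple Keyword Search': (keyword_search, {}),
    'Boolean Search (OR)': (boolean_search, {'operator': 'OR'}),
    'Phrase Search': (phrase_search, {}),
    'Wildcard Search': (wildcard_search, {}),
}

for name, (fn, args) in models.items():
    p, r, f1 = evaluate_search_model(fn, args, queries_df, documents_df, relevance_info)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")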

Precision, Recall, and F1-score were chosen as evaluation metrics for this problem scenario because they provide a comprehensive understanding of the search model’s performance in information retrieval tasks. These metrics balance different aspects of a model’s performance and are widely used in the field of information retrieval. Here’s a brief explanation of each metric and why they were chosen:

Precision: Precision measures the proportion of true positive (relevant) documents among the retrieved documents. It is important in scenarios where the quality of the retrieved results is crucial, and the users do not want to see irrelevant results. High precision means that the model retrieves mostly relevant documents, reducing the user’s effort to filter out irrelevant information.

Recall: Recall measures the proportion of true positive (relevant) documents retrieved out of all the relevant documents in the dataset. It is crucial in scenarios where users want to ensure they do not miss any relevant documents. High recall means that the model retrieves most of the relevant documents available, maximizing the chances of users finding the information they seek.

F1-score: F1-score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall, providing a single metric that takes both aspects into account. It is especially useful when there’s no clear preference between precision and recall, and we want a model that performs well on both fronts. A high F1-score indicates that the model is retrieving relevant documents while minimizing the number of irrelevant documents.
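
To make the arithmetic concrete, here is a tiny worked example on invented sets of retrieved and relevant document IDs:

retrieved = {1, 2, 3, 4}  # documents returned by the model (invented)
relevant = {2, 4, 5}      # ground-truth relevant documents (invented)

tp = len(retrieved & relevant)  # 2 (docs 2 and 4)
fp = len(retrieved - relevant)  # 2 (docs 1 and 3)
fn = len(relevant - retrieved)  # 1 (doc 5)

precision = tp / (tp + fp)  # 2/4 = 0.50
recall = tp / (tp + fn)     # 2/3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.57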

Other metrics, such as accuracy, are less suitable for information retrieval tasks, as they may not accurately represent the model’s performance due to imbalanced datasets. In many real-world scenarios, the number of relevant documents is much smaller than the number of irrelevant ones. In such cases, accuracy could be misleading, as a model that always predicts “irrelevant” would still achieve high accuracy. Precision, recall, and F1-score are more robust in handling such imbalanced situations and are better suited for evaluating information retrieval models.

The evaluation function takes the search function, search arguments, query dataframe, document dataframe, and relevance information as inputs. It iterates through the queries, retrieves relevant and retrieved documents for each query, and calculates true positives, false positives, and false negatives. Then, it computes precision, recall, and F1-score for each query and appends the scores to their respective lists. Finally, the function returns the mean precision, mean recall, and mean F1-score.

The evaluation results for each search model are as follows:

  1. Simple Keyword Search:
    Mean Precision: 0.01
    Mean Recall: 0.70
    Mean F1-score: 0.01
  2. Boolean Search (OR):
    Mean Precision: 0.01
    Mean Recall: 0.70
    Mean F1-score: 0.0
  3. Phrase Search:
    Mean Precision: 0.00
    Mean Recall: 0.00
    Mean F1-score: 0.00
  4. Wildcard Search:
    Mean Precision: 0.00
    Mean Recall: 0.00
    Mean F1-score: 0.00

From these results, we can see that the simple keyword search and Boolean search with the OR operator have higher recall values but lower precision, indicating that they retrieve a large number of relevant documents but also include many irrelevant ones. The phrase search and wildcard search models, on the other hand, have very low precision and recall, which suggests that they are not effective at retrieving relevant documents in this particular dataset.

These evaluation results can help us understand the performance of each search model and make informed decisions when choosing the most suitable model for our information retrieval tasks. It is important to note that these results are specific to this dataset, and the performance of each model may vary depending on the dataset being used. Therefore, it is crucial to evaluate each model on the specific dataset we are working with to ensure the best possible performance.

In conclusion, the evaluation of keyword search models helps us identify the strengths and weaknesses of each model. The simple keyword search and Boolean search with the OR operator may be useful when we want to prioritize recall and ensure that we do not miss any relevant documents. However, if precision is more important, these models might not be the most appropriate choice due to the high number of irrelevant documents they may retrieve.

In contrast, the phrase search and wildcard search models show very low precision and recall values in this evaluation, indicating that they might not be suitable for retrieving relevant documents in this dataset. However, these models could still be useful in specific scenarios where exact phrase matching or pattern matching is required.

5. Comparison with Other Information Retrieval Models

5.1. Brief overview of the models covered in the series

In this article, we implemented and evaluated four information retrieval models: simple keyword search, Boolean search, phrase search, and wildcard search. A simple keyword search retrieves documents that contain at least one of the query keywords. Boolean search retrieves documents based on logical operations (AND/OR) applied to query keywords. Phrase search retrieves documents containing the exact phrase specified in the query. Wildcard search retrieves documents that match a query containing wildcard characters, such as ‘*’ for multiple characters or ‘?’ for a single character.

5.2. Advantages and disadvantages of keyword search compared to other models

Keyword search and Boolean search with the ‘OR’ operator show similar performance, as they both retrieve documents containing at least one of the query keywords. Phrase search and wildcard search, however, perform poorly on this dataset, with precision, recall, and F1-score values of 0.00. This may be because the dataset is not well-suited to these search methods, or because of issues in the implementation; in particular, the evaluation passes the entire preprocessed query to these functions, and a full query string is unlikely to appear verbatim in any document.

Advantages of keyword search:

  • Simple and easy to implement
  • Provides reasonable recall values

Disadvantages of keyword search:

  • Lower precision compared to more advanced search methods
  • Ignores the context and order of query keywords

5.3 When to choose keyword search over other models

Keyword search can be a suitable choice in situations where simplicity and ease of implementation are prioritized, and when users need to find documents containing specific keywords without considering the context or order of the keywords. However, for more advanced or precise search requirements, other search methods, such as phrase search, wildcard search, or more sophisticated information retrieval models like vector space models or language models, might be more appropriate.

6. Conclusion

6.1. Recap of the main points discussed in the article

In this article, we discussed the basics of information retrieval and focused on building and evaluating a keyword search model. We also implemented and compared Boolean search, phrase search, and wildcard search models. The evaluation results showed similar performance between keyword search and Boolean search with the ‘OR’ operator, while phrase and wildcard search performed poorly on the given dataset.

6.2. Importance of understanding the foundation of information retrieval

Understanding the foundation of information retrieval is essential, as it enables us to develop more advanced search and retrieval models. By understanding the basic principles and techniques, we can better choose the most suitable model for a particular application and adapt or extend existing models to address specific needs.

6.3. What to expect in the upcoming articles of the series

In the upcoming articles of this series, we will explore more advanced information retrieval models and techniques, such as vector space models, probabilistic models, and language models. We will also discuss how to implement and evaluate these models, compare their performance, and provide insights into when to choose one model over another based on specific use cases and requirements.
