Mastering Information Retrieval: Building Intelligent Search Systems (Chapter 2)

Prateek Gaurav
8 min read · Apr 1, 2023

--

Chapter 2: TF-IDF and Vector Space Models: Beyond Keyword Search

Reference Chapter: “Evaluating Information Retrieval Models: A Comprehensive Guide to Performance Metrics”
Chapter 1: “Keyword Search: The Foundation of Information Retrieval”
Chapter 2: “TF-IDF and Vector Space Models: Beyond Keyword Search”
Chapter 3: “Latent Semantic Indexing: Uncovering Hidden Relationships”
Chapter 4: “Word2Vec and Doc2Vec: Capturing Semantic Relationships with Distributed Representations”
Chapter 5: “Transformers in Information Retrieval: Leveraging Pre-trained Models for Enhanced Search”
Chapter 6: “Incorporating Natural Language Processing: Enhancing Search with Language Understanding”

Google Colab File: Coming Soon…
Dataset: Cranfield Collection Data

1. Introduction

1.1 Recap and Objectives for this Article

In the first part of the “Mastering Information Retrieval” series, we explored the foundation of information retrieval through keyword search. We discussed different types of keyword search, their implementation, and evaluation. In this article, we will delve into more advanced information retrieval techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF) and Vector Space Models. These models improve upon the limitations of keyword search and enable more accurate and relevant search results.

This article will cover the following topics:

  1. The concept and importance of TF-IDF in information retrieval
  2. The Vector Space Model and its role in representing documents and queries
  3. Building and evaluating a search system based on TF-IDF and Vector Space Models
  4. Comparing the performance of TF-IDF and Vector Space Models with keyword search

Our goal is to provide a deeper understanding of these advanced models and how they enhance information retrieval systems.

1.2 Importance of Moving Beyond Keyword Search in Information Retrieval

While keyword search serves as a fundamental approach to information retrieval, it has its limitations. For instance, keyword search can return irrelevant results when the same term appears in different contexts or when documents use synonyms of a query term. In addition, keyword search does not consider the importance of terms within a document or across the document collection. To address these limitations and improve search accuracy, it is essential to explore more advanced information retrieval techniques such as TF-IDF and Vector Space Models.

2. TF-IDF: Term Frequency-Inverse Document Frequency

2.1 Definition and Concept of TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used technique in information retrieval that helps to quantify the importance of terms within a document and across a collection of documents. The main idea behind TF-IDF is that terms that appear frequently in a document but rarely across the entire document collection are more significant and informative.

TF-IDF consists of two components:

  1. Term Frequency (TF): The number of times a term appears in a document. This measures the importance of a term within a specific document.
  2. Inverse Document Frequency (IDF): A measure of how common or rare a term is across the entire document collection. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term.

2.2 Importance of TF-IDF in Information Retrieval

TF-IDF provides several advantages over simple keyword search:

  1. It gives more weight to terms that are frequent in a document but rare across the entire collection, making search results more relevant.
  2. It normalizes the term frequency, accounting for the varying lengths of documents.
  3. By considering both term frequency and inverse document frequency, it balances the importance of a term’s local and global significance.

These properties make TF-IDF an essential tool for improving the quality of search results in information retrieval systems.

2.3 Formula and Calculation of TF-IDF

To calculate the TF-IDF score for a term in a document, we need to multiply the term frequency (TF) by the inverse document frequency (IDF):

TF-IDF(t, d) = TF(t, d) * IDF(t)

  1. Term Frequency (TF) is calculated as the number of times a term (t) appears in document (d) divided by the total number of terms in that document:
    TF(t, d) = count(t, d) / total_terms(d)
  2. Inverse Document Frequency (IDF) measures the importance of a term across the entire document collection. It is calculated as the logarithm of the total number of documents (N) divided by the number of documents containing the term (n_t):
    IDF(t) = log(N / n_t)

By multiplying the term frequency with the inverse document frequency, the TF-IDF score gives more weight to terms that are frequent in a document but rare across the entire collection. This helps to filter out common terms and highlights the terms that are more informative and relevant to the query.
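To make the calculation concrete, here is a minimal sketch that computes TF-IDF by hand for a toy three-document corpus. The corpus and the plain log(N / n_t) formula are illustrative assumptions; library implementations such as scikit-learn's add smoothing and normalization on top of this:

import math

# Toy corpus of three tiny "documents"
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)   # count(t, d) / total_terms(d)

def idf(term, corpus):
    n_t = sum(1 for d in corpus if term in d.split())  # documents containing the term
    return math.log(len(corpus) / n_t)                 # log(N / n_t)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "cat" appears in only one document, so it scores much higher.
print(tfidf("the", docs[0], docs))  # 0.0
print(tfidf("cat", docs[0], docs))  # ~0.18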

3. Vector Space Model

3.1. Definition and concept of Vector Space Model

The Vector Space Model (VSM) is an algebraic model used in information retrieval to represent documents and queries as vectors in a high-dimensional space. Each dimension in this space corresponds to a unique term from the document collection, and the value of a vector in that dimension represents the importance of the term in the corresponding document or query.

3.2. Representing documents and queries as vectors

To represent documents and queries as vectors, we first create a term-document matrix where each row represents a term and each column represents a document. The value in the matrix at the intersection of a term and a document represents the importance of the term in that document. In the case of the TF-IDF model, this value is calculated using the TF-IDF formula discussed earlier.

Similarly, a query is represented as a vector by transforming it into the same high-dimensional space as the documents. The value of a query vector in a dimension corresponding to a term is calculated using the same TF-IDF method.
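As a small illustration, the sketch below builds document and query vectors with scikit-learn's TfidfVectorizer (used again in Section 4.2). Note that scikit-learn lays the matrix out as documents × terms, the transpose of the term-by-document convention described above; the toy sentences are assumptions for demonstration only:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)         # shape: (2 documents, n unique terms)
query_vector = vectorizer.transform(["cat on mat"])  # projected into the same term space

print(vectorizer.get_feature_names_out())    # one dimension per unique term
print(doc_vectors.shape, query_vector.shape)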

3.3. Cosine similarity as a measure of document-query similarity

To measure the similarity between a document and a query in the vector space, we can use cosine similarity. Cosine similarity is a measure of the angle between two vectors and is calculated as the dot product of the two vectors divided by the product of their magnitudes. A cosine similarity of 1 indicates that the vectors are identical, while a cosine similarity of 0 indicates that the vectors are orthogonal (i.e., not related).
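The sketch below implements this formula directly with NumPy; the three-term weight vectors are hypothetical TF-IDF values chosen just to illustrate the two extremes:

import numpy as np

def cosine_sim(a, b):
    # dot(a, b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc = np.array([1.0, 2.0, 0.0])    # hypothetical TF-IDF weights over three terms
query = np.array([1.0, 1.0, 0.0])

print(cosine_sim(doc, query))                      # ~0.95: similar direction
print(cosine_sim(doc, np.array([0.0, 0.0, 1.0])))  # 0.0: orthogonal, no shared terms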

4. Building a TF-IDF and Vector Space Model

4.1. Data preparation and preprocessing

We will be using the same preprocessed data from the last article where we added a new column to our DataFrame to be used by our model. Here is the code:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and digits
    text = re.sub(r'\W+|\d+', ' ', text)

    # Remove stopwords and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words])

    return text

# Preprocess documents and queries
documents_df['preprocessed_text'] = documents_df['text'].apply(preprocess)
queries_df['preprocessed_text'] = queries_df['text'].apply(preprocess)

4.2. Implementing TF-IDF and Vector Space Model

  1. Create the TF-IDF matrix: We first create the term-document matrix using the TF-IDF weighting scheme. To do this, we can use the TfidfVectorizer from the sklearn library. This object is initialized and then fitted and transformed using the preprocessed text from the ‘documents_df’ DataFrame:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents_df['preprocessed_text'])

  2. Implement the Vector Space Model: To implement the Vector Space Model, we define a search function that takes a query, the TF-IDF matrix, and the vectorizer as inputs and returns the ranked indices of the documents based on their cosine similarity to the query. The query is first transformed into the same vector space as the documents using the vectorizer, and then the cosine similarity between the query vector and document vectors is calculated. Finally, the document indices are ranked based on their similarity scores (a short usage sketch follows the code below):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_search(query, tfidf_matrix, vectorizer):
    # Project the query into the same TF-IDF vector space as the documents
    query_vec = vectorizer.transform([query])
    # Cosine similarity between the query vector and every document vector
    similarity_scores = cosine_similarity(query_vec, tfidf_matrix)
    # Sort document indices from most to least similar
    ranked_indices = np.argsort(-similarity_scores).flatten()
    return ranked_indices
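Here is a short usage sketch, assuming the DataFrames from Section 4.1; it prints the matrix shape as a sanity check and then the five most similar documents for the first query:

# Sanity check: rows are documents, columns are unique terms
print(tfidf_matrix.shape)

# Retrieve the top 5 documents for the first preprocessed query
query = queries_df['preprocessed_text'].iloc[0]
top_indices = tfidf_search(query, tfidf_matrix, vectorizer)[:5]
print(documents_df.iloc[top_indices]['text'])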

4.3. Evaluation of the TF-IDF and Vector Space Model

To evaluate the performance of the TF-IDF and Vector Space Model, we define an ‘evaluate_tfidf_model’ function that takes the queries DataFrame, documents DataFrame, vectorizer, TF-IDF matrix, and relevance information as inputs. For each query, the function calculates the precision, recall, and F1 score based on the number of true positives, false positives, and false negatives. The mean precision, mean recall, and mean F1-score are then computed and returned as the evaluation results.

The TF-IDF model is responsible for assigning weights to terms based on their importance in documents.

The Vector Space Model is responsible for representing documents and queries as vectors and calculating the similarity between them.

# Evaluating the TF-IDF and Vector Space Model
def evaluate_tfidf_model(queries_df, documents_df, vectorizer, tfidf_matrix, relevance_info):
    precision_scores = []
    recall_scores = []
    f1_scores = []

    for query_id, query_info in relevance_info.items():
        if query_id not in queries_df.index:
            print(f"Query ID {query_id} not found in the queries DataFrame. Skipping evaluation for this query.")
            continue

        relevant_docs = set(query_info['relevant_docs'])
        ranked_indices = tfidf_search(queries_df.loc[query_id, 'preprocessed_text'], tfidf_matrix, vectorizer)
        # Retrieve as many documents as there are relevant ones for this query
        retrieved_docs = set(ranked_indices[:len(relevant_docs)])

        true_positives = len(retrieved_docs.intersection(relevant_docs))
        false_positives = len(retrieved_docs - relevant_docs)
        false_negatives = len(relevant_docs - retrieved_docs)

        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        precision_scores.append(precision)
        recall_scores.append(recall)
        f1_scores.append(f1)

    mean_precision = np.mean(precision_scores)
    mean_recall = np.mean(recall_scores)
    mean_f1 = np.mean(f1_scores)

    return mean_precision, mean_recall, mean_f1
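Calling the function might look like the sketch below. The structure of relevance_info is an assumption here: a dictionary mapping each query ID to a dictionary with a ‘relevant_docs’ list, matching how the function indexes it above:

# Assumed structure: {query_id: {'relevant_docs': [doc_index, ...]}, ...}
mean_precision, mean_recall, mean_f1 = evaluate_tfidf_model(
    queries_df, documents_df, vectorizer, tfidf_matrix, relevance_info
)
print(f"Mean Precision: {mean_precision:.2f}")
print(f"Mean Recall: {mean_recall:.2f}")
print(f"Mean F1-score: {mean_f1:.2f}")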
TF-IDF and Vector Space Model Evaluation Result

In our case, the evaluation results show a mean precision, mean recall, and mean F1-score of 0.07. The three values coincide because, for each query, we retrieve exactly as many documents as there are relevant ones, which forces precision and recall (and therefore F1) to be equal. This represents an improvement in search accuracy compared to the keyword search models from the previous article.

5. Comparison with Keyword Search

5.1. Advantages and disadvantages of TF-IDF and Vector Space Model compared to keyword search

Advantages:

  • TF-IDF and Vector Space Model capture the semantic importance of terms in documents, which can lead to more relevant search results.
  • By considering term weights, the Vector Space Model provides a more nuanced representation of document content compared to simple keyword matching.
  • Cosine similarity used in the Vector Space Model allows for partial matches, which can be more forgiving and helpful in retrieving relevant documents even if the query terms are not an exact match.

Disadvantages:

  • TF-IDF and Vector Space Model can be more computationally expensive compared to simple keyword search methods, as they require calculating term weights and similarities between vectors.
  • The performance of TF-IDF and Vector Space Model may be limited in cases where the document collection has a very specific or narrow focus, making term weights less informative.

5.2. When to choose TF-IDF and Vector Space Model over keyword search

  • When the quality and relevance of search results are more important than computational efficiency.
  • When the document collection is diverse, and capturing the semantic importance of terms can improve search results.
  • When partial matches or similarity-based retrieval is desired over strict keyword matching.

6. Conclusion

6.1. Recap of the main points discussed in the article

In this article, we covered the basics of the TF-IDF model and the Vector Space Model, their advantages over simple keyword search methods, and their implementation using the Cranfield dataset. We also discussed the evaluation of the combined model and its performance in terms of precision, recall, and F1 score.

6.2. Importance of understanding more advanced information retrieval models

Understanding more advanced information retrieval models, like the TF-IDF and Vector Space Model, is crucial for building effective search systems that can deliver relevant and meaningful results. These models offer a more nuanced approach to document representation and similarity, which can significantly improve the search experience for users.

6.3. What to expect in the upcoming articles of the series

In the upcoming articles, we will delve deeper into other advanced information retrieval techniques: Latent Semantic Indexing, Word2Vec and Doc2Vec, Transformers in information retrieval, and incorporating Natural Language Processing. We will explore how these techniques can further enhance search systems and provide even more relevant results to users.
