Mastering Information Retrieval: Building Intelligent Search Systems (Chapter 3)

Prateek Gaurav
8 min read · Apr 1, 2023


Chapter 3: Latent Semantic Indexing: Uncovering Hidden Relationships

Reference Chapter: “Evaluating Information Retrieval Models: A Comprehensive Guide to Performance Metrics”
Chapter 1: “Keyword Search: The Foundation of Information Retrieval”
Chapter 2: “TF-IDF and Vector Space Models: Beyond Keyword Search”
Chapter 3: “Latent Semantic Indexing: Uncovering Hidden Relationships”
Chapter 4: “Word2Vec and Doc2Vec: Capturing Semantic Relationships with Distributed Representations”
Chapter 5: “Transformers in Information Retrieval: Leveraging Pre-trained Models for Enhanced Search”
Chapter 6: “Incorporating Natural Language Processing: Enhancing Search with Language Understanding”

Google Colab File: Coming Soon…
Dataset: Cranfield Collection Data

1. Introduction

In previous chapters, we discussed keyword search and TF-IDF, which are fundamental techniques for information retrieval. While these methods are effective, they often struggle with capturing the underlying semantic relationships between words and documents. Latent Semantic Indexing (LSI) addresses this issue by uncovering hidden relationships between terms and documents in a dataset. It enables search engines to better understand the underlying concepts in the documents and queries, making the search process more accurate and robust.

LSI uses a mathematical technique called Singular Value Decomposition (SVD) to reduce the dimensionality of the term-document matrix, which is typically large and sparse. By reducing the dimensionality, LSI captures the most important patterns and relationships between terms and documents, effectively creating a more compact representation of the dataset. This low-dimensional space is often referred to as the “latent semantic space,” as it reveals the underlying semantic structure of the data.

2. Singular Value Decomposition

Singular Value Decomposition (SVD) is a linear algebra technique that factors a matrix into the product of three matrices, allowing us to capture the most important features and relationships within the data. Mathematically, given a matrix A (m x n), SVD decomposes it as A = UΣVᵀ, where U is (m x r), Σ is (r x r), and Vᵀ is (r x n), with r being the rank of the matrix A.

U and V are orthogonal matrices, meaning their columns are orthonormal: each column has unit length and is perpendicular to the others (their pairwise dot products are zero). Σ is a diagonal matrix containing the singular values, which represent the importance of each corresponding latent factor. The columns of U are called left singular vectors, while the rows of Vᵀ (equivalently, the columns of V) are called right singular vectors.
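
To make the decomposition concrete, here is a minimal NumPy sketch using a small made-up term-document matrix (not the Cranfield data) that shows the shapes of the three factors and verifies their orthogonality:

import numpy as np

# A toy term-document matrix: 4 terms (rows) x 3 documents (columns)
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

# Compact SVD: A = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (4, 3) (3,) (3, 3)

# Columns of U and rows of Vt are orthonormal
print(np.allclose(U.T @ U, np.eye(3)))       # True
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # True

# Multiplying the three factors back together recovers A
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True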

In the context of information retrieval, matrix A is the term-document matrix, where each row represents a term, and each column represents a document. The goal of SVD is to reduce the dimensionality of this matrix, retaining only the most important patterns and relationships between terms and documents. This is achieved by truncating the matrices U, Σ, and Vᵀ to a smaller dimension k (k < r), which is the number of latent factors to be retained.

The truncated matrices can be used to represent both terms and documents in the latent semantic space: terms correspond to the rows of U, and documents to the columns of Vᵀ (the rows of V), each scaled by the singular values in Σ. This low-dimensional representation allows for more accurate similarity comparisons between terms and documents, as it captures underlying semantic relationships that are not apparent in the original term-document matrix.
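
Continuing the toy example above (still an illustrative sketch, not the Cranfield pipeline), truncating to k = 2 gives a compact vector per document and lets us project a query into the same space for similarity comparisons:

# Keep only the k largest singular values and the corresponding singular vectors
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document becomes a k-dimensional vector (rows of V_k scaled by the singular values)
docs_lsi = (np.diag(s_k) @ Vt_k).T           # shape (3 documents, k)

# A query is a vector over the same 4 terms; project it with U_k
query = np.array([1., 0., 1., 0.])           # query mentioning terms 1 and 3
query_lsi = query @ U_k                      # shape (k,)

# Cosine similarity between the query and each document in the reduced space
sims = docs_lsi @ query_lsi / (np.linalg.norm(docs_lsi, axis=1) * np.linalg.norm(query_lsi))
print(np.argsort(-sims))                     # documents ranked by similarity to the query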

3. Building an LSI Model

3.1 Data Preparation and Preprocessing

Before implementing LSI, it’s crucial to preprocess the text data. This involves tokenization, stopword removal, stemming or lemmatization, and other relevant techniques. In addition to preprocessing, you will need to represent your documents as a term-document matrix. One common approach is to build this matrix with the TF-IDF weighting scheme, which takes into account both term frequency and inverse document frequency. We will reuse the preprocessed data from the previous chapters and build the TF-IDF matrix from it (note that scikit-learn’s TfidfVectorizer produces the transpose of the matrix described above, with documents as rows and terms as columns; the same reasoning applies):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents_df['preprocessed_text'])

3.2 Implementing Latent Semantic Indexing using SVD

Once the term-document matrix is constructed, we can apply Singular Value Decomposition (SVD) to reduce its dimensionality. The matrix is decomposed into three smaller matrices: U, Σ, and Vᵀ. Then, by selecting a value for k (the number of latent factors), you can truncate these matrices to obtain a reduced representation of the dataset in the latent semantic space. This involves keeping only the k largest singular values in Σ and the corresponding columns in U and Vᵀ.

The choice of k is a critical parameter in LSI and can significantly impact the model’s performance. A higher value of k may capture more information, but it can also introduce noise and increase computational complexity. A lower value of k produces a more compact representation, but it risks losing important information. Experimenting with different values of k can help you find the optimal balance between information retention and noise reduction. (I tried k = 100, 300, and 500; 300 performed better than 100 and matched 500, so I settled on 300.)

import numpy as np
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create the TF-IDF matrix using the preprocessed data
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents_df['preprocessed_text'])

# Perform SVD on the TF-IDF matrix
U, s, Vt = svds(tfidf_matrix, k=300)  # k is the number of topics, which can be adjusted as needed
S = np.diag(s)

# Project the documents into the reduced latent semantic space
lsi_matrix = tfidf_matrix.dot(Vt.T)

def lsi_search(query, lsi_matrix, vectorizer, Vt):
    # Project the query into the same latent semantic space and rank documents by cosine similarity
    query_vec = vectorizer.transform([query])
    query_lsi = query_vec.dot(Vt.T)
    similarity_scores = cosine_similarity(query_lsi, lsi_matrix)
    ranked_indices = np.argsort(-similarity_scores).flatten()
    return ranked_indices

Here’s a detailed line-by-line explanation of the code:

  1. vectorizer = TfidfVectorizer(): Create an instance of the TfidfVectorizer class, which will be used to convert the preprocessed text data into a term-document matrix with TF-IDF weights.
  2. tfidf_matrix = vectorizer.fit_transform(documents_df['preprocessed_text']): Fit the vectorizer to the preprocessed text data in the 'documents_df' DataFrame and transform the text into a term-document matrix with TF-IDF weights.
  3. U, s, Vt = svds(tfidf_matrix, k=300): Perform SVD on the TF-IDF matrix, decomposing it into three matrices: U, s, and Vt. The parameter 'k' specifies the number of topics (latent dimensions) to keep, which can be adjusted based on your needs.
  4. S = np.diag(s): Convert the singular values in the 's' vector into a diagonal matrix 'S'. This matrix represents the strength of each topic in the latent semantic space.
  5. lsi_matrix = tfidf_matrix.dot(Vt.T): Project the documents into the reduced latent semantic space by multiplying the TF-IDF matrix with the transpose of the Vt matrix.
  6. def lsi_search(query, lsi_matrix, vectorizer, Vt):: Define a function 'lsi_search' that takes a query, the LSI matrix, the TF-IDF vectorizer, and the Vt matrix as input arguments. This function will be used to search for relevant documents based on the query.
  7. query_vec = vectorizer.transform([query]): Transform the input query into a TF-IDF vector using the same vectorizer that was fit to the document text data.
  8. query_lsi = query_vec.dot(Vt.T): Project the query vector into the reduced latent semantic space by multiplying it with the transpose of the Vt matrix.
  9. similarity_scores = cosine_similarity(query_lsi, lsi_matrix): Compute the cosine similarity between the query LSI vector and the LSI matrix, which contains the LSI vectors of all documents.
  10. ranked_indices = np.argsort(-similarity_scores).flatten(): Sort the document indices in descending order based on their cosine similarity scores with the query LSI vector.
  11. return ranked_indices: Return the ranked document indices as the output of the 'lsi_search' function.
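
As a quick sanity check, lsi_search can be called with any free-text query; the query string below is made up for illustration, and the returned indices map back to rows of documents_df:

# Hypothetical example query (the Cranfield collection is about aerodynamics)
query = "effect of aspect ratio on wing lift"
ranked_indices = lsi_search(query, lsi_matrix, vectorizer, Vt)

# Show the five highest-ranked documents
for idx in ranked_indices[:5]:
    print(idx, documents_df.iloc[idx]['preprocessed_text'][:80])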

3.3 Evaluating the LSI Model

To evaluate the LSI model, we will again use the same metrics that we used for our last two models: precision, recall, and F1-score.

def evaluate_model(queries_df, documents_df, vectorizer, matrix, relevance_info, search_function, Vt=None):
    precision_scores = []
    recall_scores = []
    f1_scores = []

    for query_id, query_info in relevance_info.items():
        if query_id not in queries_df.index:
            print(f"Query ID {query_id} not found in the queries DataFrame. Skipping evaluation for this query.")
            continue

        relevant_docs = set(query_info['relevant_docs'])
        if Vt is not None:
            ranked_indices = search_function(queries_df.loc[query_id, 'preprocessed_text'], matrix, vectorizer, Vt)
        else:
            ranked_indices = search_function(queries_df.loc[query_id, 'preprocessed_text'], matrix, vectorizer)
        retrieved_docs = set(ranked_indices[:len(relevant_docs)])

        true_positives = len(retrieved_docs.intersection(relevant_docs))
        false_positives = len(retrieved_docs - relevant_docs)
        false_negatives = len(relevant_docs - retrieved_docs)

        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        precision_scores.append(precision)
        recall_scores.append(recall)
        f1_scores.append(f1)

    mean_precision = np.mean(precision_scores)
    mean_recall = np.mean(recall_scores)
    mean_f1 = np.mean(f1_scores)

    return mean_precision, mean_recall, mean_f1

mean_precision, mean_recall, mean_f1 = evaluate_model(queries_df, documents_df, vectorizer, lsi_matrix, relevance_info, search_function=lsi_search, Vt=Vt)
print(f"Mean Precision: {mean_precision:.2f}")
print(f"Mean Recall: {mean_recall:.2f}")
print(f"Mean F1-score: {mean_f1:.2f}")
LSI Model Evaluation Result

4. Comparison with TF-IDF and Vector Space Model

Based on the evaluation results, we can observe that the LSI model with k = 300 resulted in a mean precision, recall, and F1-score of 0.08. Although this may not seem like a significant improvement over the TF-IDF and Vector Space Model, it’s important to note that the LSI model has the potential to capture more nuanced semantic relationships between terms and documents that are not apparent in the original term-document matrix. Experimenting with different values of k and adjusting other model parameters may lead to further improvements in the LSI model’s performance.
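
As a rough sketch of that experimentation (not part of the original notebook, and assuming the TF-IDF matrix has more than 500 rows and columns), the k = 100/300/500 comparison mentioned earlier can be reproduced by reusing evaluate_model and lsi_search:

# Sweep over candidate numbers of latent factors and compare mean F1-scores
for k in [100, 300, 500]:
    U_k, s_k, Vt_k = svds(tfidf_matrix, k=k)
    lsi_k = tfidf_matrix.dot(Vt_k.T)
    _, _, f1_k = evaluate_model(queries_df, documents_df, vectorizer, lsi_k,
                                relevance_info, search_function=lsi_search, Vt=Vt_k)
    print(f"k={k}: mean F1-score = {f1_k:.3f}")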

5. LSI Model: Advantages, Disadvantages, and When to Choose

5.1 Advantages and Disadvantages of the LSI Model compared to TF-IDF, Vector Space Model, and Keyword Search

Advantages:

  1. Semantic understanding: LSI can capture hidden relationships between terms and documents by uncovering the underlying semantic structure of the data, which is not possible with a keyword search or the vector space model based on TF-IDF.
  2. Noise reduction: LSI can help reduce the impact of noise, such as synonymy and polysemy, by projecting the data into a lower-dimensional space. This allows the model to better capture the true relationships between terms and documents (a short sketch after this list illustrates the synonymy case).
  3. Robustness: The LSI model is generally more robust to the presence of irrelevant or redundant features in the term-document matrix, as it captures the most significant patterns in the data.
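
To illustrate the synonymy point in the second advantage, here is a small self-contained sketch (a made-up three-document corpus, using scikit-learn’s TruncatedSVD rather than the svds-based code from Section 3): a query for “automobile” gets a zero score against a document that only says “car” under plain TF-IDF, but a positive score once both are projected into the latent space, because the two terms share context words.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# "car" and "automobile" never appear together, but share the context words "engine" and "repair"
docs = ["car engine repair",
        "automobile engine repair",
        "fresh fruit salad recipe"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

q = vec.transform(["automobile"])
print(cosine_similarity(q, X)[0])            # first document scores 0.0: no shared term

svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)
q_lsi = svd.transform(q)
print(cosine_similarity(q_lsi, X_lsi)[0])    # first document now receives a positive score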

Disadvantages:

  1. Complexity: The LSI model can be computationally expensive due to the need to perform Singular Value Decomposition on large matrices, especially when compared to simpler models like keyword search or TF-IDF.
  2. Parameter tuning: Choosing the optimal value of k (the number of latent factors) and other parameters can be challenging, and requires experimentation to achieve the best performance.
  3. Interpretability: The reduced-dimensionality representation produced by LSI may be difficult to interpret, as it does not directly correspond to the original terms and documents.

5.2 When to Choose the LSI Model over Keyword Search, TF-IDF, and Vector Space Model

The LSI model is best suited for situations where capturing semantic relationships between terms and documents is crucial for the task at hand. Some scenarios where LSI may be preferable include:

  1. When dealing with large and diverse datasets: LSI’s ability to uncover hidden semantic structures can be beneficial for large and diverse datasets, where keyword search or TF-IDF may struggle to find relevant documents.
  2. When the text contains many synonyms, polysemes, or ambiguous terms: LSI’s noise reduction capabilities can help address issues caused by synonyms, polysemes, and ambiguous terms, improving the relevance of search results.
  3. When integrating with other models: The LSI model can be combined with other information retrieval techniques, such as search personalization, collaborative filtering, natural language processing, and machine learning-based ranking, to create more advanced and accurate search systems.

In summary, the LSI model should be considered when there is a need to uncover deeper semantic relationships within the dataset or when other methods like keyword search, TF-IDF, or vector space model fail to deliver satisfactory search results.

6. Conclusion

Latent Semantic Indexing is a powerful technique for uncovering hidden relationships between terms and documents in information retrieval systems. By reducing the dimensionality of the term-document matrix using Singular Value Decomposition, LSI can capture the underlying semantic structure of the data, potentially leading to more accurate search results. While the performance improvement might not always be substantial, LSI can complement other information retrieval techniques to create a more robust and effective search system.
