Discovering Insights: Exploring Information Retrieval in a RAG System

DISH Wireless DevEx
10 min read · Mar 20, 2024


By: Anaghaa Londhe, Ankita Patil, Monisha Gnanaptakasam, Harsh Patel, Members of Scientific Staff, DISH Wireless

Motivation

The Retrieval Augmented Generation (RAG) system relies on information retrieval as its central processing unit. Just as the performance of a computer hinges on the capacity of its CPU, the effectiveness of the RAG system hinges on the performance of its information retrieval module. The design and implementation of the information retrieval system are therefore crucial elements that substantially influence both the functionality and performance of the RAG system.

The information retrieval module within the RAG system consists of a vector database containing the data essential for answering questions. It includes a search algorithm, known as similarity search, to retrieve the desired information. There is also a prompt engineering segment responsible for providing context to the Large Language Model (LLM).

This approach is very useful when a chatbot is designed to cater to needs within the purview of an organization. Even though the underlying LLMs are trained on web data that includes toxic content, the RAG system mitigates this toxicity in the responses it produces by supplying its own context. The method also allows confidential information to stay within the organization, ensuring a more secure way of using LLMs. This blog gives a brief overview of the components of the information retrieval module within a RAG system: storing embedded data (the vector database and its indices), similarity search algorithms, and context injection through prompt engineering.

Fig 1: Retrieval Augmented Generation System

Storing Embedded Data — Vector Database, Indexing

The vector database contains large volumes of text chunks embedded into high-dimensional vectors using an embedding model. These vectors serve as the indexing mechanism for storing data in the vector database. Each text chunk is assumed to carry semantic meaning, representing a contextual unit as a whole. The high-dimensional vectors can be compared mathematically to find common semantic meaning.

The contents can vary from a single paragraph to multi-paragraph pages. They cannot be stored as a whole, since multiple pages can contain similar information and a single page can contain distinct, non-overlapping content. Splitting the contents into smaller text chunks improves the chances of grouping similar content across multiple pages and also avoids overflowing the token limit of the embedding model.
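As a rough illustration, here is a minimal character-based chunking sketch; the chunk size, overlap, and splitting strategy are assumptions, and production systems often split on sentence or section boundaries instead:

    def chunk_text(text, chunk_size=500, overlap=50):
        """Split text into overlapping character-based chunks so that no
        chunk exceeds the embedding model's input budget."""
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append(text[start:end])
            if end == len(text):
                break
            start = end - overlap  # overlap preserves context across boundaries
        return chunks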

Vector Database

How is it different from a relational database?

Relational databases store structured data in tables made up of rows and columns. Vector databases store unstructured data together with its vector embeddings. Since the two are optimized for different types of data, they also differ in how they store and retrieve it.

In a relational database, we usually query rows whose values are retrieved by exact keyword matches. In contrast, vector databases are capable of semantic understanding of the search term; they do not rely on exact matches to retrieve relevant results. Some vector databases also support hybrid search, which combines lexical search (finding exact word matches in the database) with semantic search.

Indexes are used to store the embeddings of the data we want to search. Vector databases use vector indexes, which play an important role in similarity search. But how are these different from flat indexes?

We’re glad you asked…

Flat indexes are the type of index in which vectors are not altered as they are fed into the index. They offer perfect search quality but require more search time. A flat index is ideal for similarity search when the dataset we compare our query against is considerably smaller.
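For instance, FAISS exposes a brute-force flat index; a minimal sketch with a placeholder dimensionality and random stand-in embeddings:

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 128  # embedding dimensionality (placeholder)
    doc_embeddings = np.random.random((1000, d)).astype("float32")
    query = np.random.random((1, d)).astype("float32")

    index = faiss.IndexFlatL2(d)   # flat index: vectors stored unmodified
    index.add(doc_embeddings)
    distances, ids = index.search(query, 5)  # exhaustive, exact top-5 search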

Vector indexes are used to store keys that are vectors. They are specifically designed to retrieve vectors similar to a given query vector, relying on mathematical calculations to identify them, and they can be very effective when querying against larger datasets. There are different vector indexing types; we used the k-NN index[2]. Below we discuss the impact of KNN indexing.

Impact of KNN Indexing

KNN (K-Nearest Neighbors) indexing in vector databases is a method to efficiently organize high-dimensional data points for fast retrieval of nearest neighbors.

KNN indexing structures streamline search processes by arranging data points in a way that expedites the retrieval of closest neighbors. Unlike traditional brute-force methods, which falter as dataset sizes grow, KNN indexing structures offer sublinear or logarithmic search complexity. Moreover, by precomputing and structuring data points into an index, KNN indexing diminishes the search time needed to locate nearest neighbors for a given query vector. This efficiency proves indispensable in applications demanding real-time or near-real-time responses, such as recommendation systems and content-based image retrieval.

HNSW (Hierarchical Navigable Small World graph)[3] is an index method available in FAISS that takes a hierarchical proximity-graph approach to approximate k-NN search. It has better query latency and better-quality responses, although it is memory intensive. The greater the value of ef_construction, the better the result, at the cost of a longer index build time.

The FAISS library is used because it is efficient for similarity search and clustering of dense vectors, making it a great option for optimizing indexing throughput.
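A minimal HNSW sketch in FAISS; the connectivity and ef parameter values below are illustrative, not tuned:

    import numpy as np
    import faiss

    d, M = 128, 32                     # M: graph connectivity (illustrative)
    index = faiss.IndexHNSWFlat(d, M)
    index.hnsw.efConstruction = 200    # higher: better graph quality, slower build
    index.hnsw.efSearch = 64           # higher: better recall, slower queries

    doc_embeddings = np.random.random((10000, d)).astype("float32")
    index.add(doc_embeddings)
    query = np.random.random((1, d)).astype("float32")
    distances, ids = index.search(query, 5)  # approximate nearest neighbors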

Similarity Search — Cosine, Euclidean

At its core, Similarity Search is about identifying and retrieving the data points within a dataset that are most similar to a given query. This process is pivotal in fields ranging from machine learning and data mining to recommendation systems and beyond, enabling systems to find patterns, make predictions, and offer recommendations with remarkable precision. Similarity Search is not just a process but the backbone of the retrieval mechanism in RAG, ensuring the augmentation process is efficient and effective. Imagine talking with a machine that can instantly pull in the most relevant information from the entire web, or from special-interest databases, to answer your queries. This is the power of Similarity Search, which allows RAG models to deliver answers that are not just accurate but recent, relevant, and data-informed.

Quantifying the similarity between pairs of data entities, such as images, text, or multidimensional vectors, is the central topic of Similarity Search. Of these measures, Euclidean distance and Cosine similarity are the most emphasized, mainly because of their broad applicability and easily understandable interpretations.

Euclidean Distance: The Most Natural Measure of Proximity

Euclidean distance is the straight-line distance between two points in space, i.e., between the two objects in question. This makes it ideal for applications where interpretability matters: the smaller the geometric distance between data points, the more similar they are. From clustering users by buying behavior to identifying similar images in a database, all of this can be done straightforwardly with Euclidean distance.
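In a sketch (NumPy here, with toy vectors):

    import numpy as np

    def euclidean_distance(a, b):
        """d(a, b) = sqrt(sum_i (a_i - b_i)^2); smaller means more similar."""
        return float(np.linalg.norm(a - b))

    print(euclidean_distance(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # 5.0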

Cosine Similarity: When Measuring Directional Closeness in High-Dimensional Spaces

In simple terms, cosine similarity is the cosine of the angle between two vectors; it measures orientation rather than magnitude. This property is most useful in text analysis, document retrieval, and any domain where the direction of the data vectors (e.g., word frequency counts) is more interesting than their size. Natural language processing is a typical example: cosine similarity is used to gauge the similarity of two documents based on their content rather than their length.
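The corresponding sketch, again with toy vectors:

    import numpy as np

    def cosine_similarity(a, b):
        """cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means identical direction."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Vectors pointing the same way score 1.0 regardless of their lengths:
    print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0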

Euclidean distance and cosine similarity are effective tools for retrieving the most relevant documents or data points, which then feed into prompt or completion generation in a RAG setup. To illustrate this in practice within RAG, take a simplified setting in which we have a small dataset of documents, or embeddings of those documents, and we want to retrieve the most relevant document given a query.

Take the example of a RAG model injected with context to answer questions using a set of pre-computed embeddings. These could represent, say, FAQ responses, abstracts of scientific articles, or product descriptions. Here, one searches for the document embedding that is closest to the embedding of the query.
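A simplified sketch of that lookup, with placeholder dimensions and random stand-in embeddings (a real system would use a vector index rather than a full scan):

    import numpy as np

    def retrieve_top_k(query_emb, doc_embs, k=2):
        """Rank pre-computed document embeddings by cosine similarity to the query."""
        sims = doc_embs @ query_emb / (
            np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
        )
        return np.argsort(sims)[::-1][:k]  # indices of the k most similar documents

    doc_embs = np.random.random((100, 384)).astype("float32")  # e.g. FAQ embeddings
    query_emb = np.random.random(384).astype("float32")
    print(retrieve_top_k(query_emb, doc_embs))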

Context Injection using Prompt Engineering — HyDE applied

Context injection is achieved through prompt engineering. The top-k results from similarity search are encapsulated as the context to be provided to the Large Language Model (LLM). The LLM needs to be informed that the interaction is a dialogue. The format of the dialog (in Python) is of the form:
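The original post shows this structure as an image; the following is a sketch of the shape described in the text, assuming a common chat-style role/content schema:

    # A sketch of the dialog structure; the exact schema depends on the model API.
    text_input = (
        "You are a helpful assistant. Answer using only the context below.\n"
        "Context: <top-k chunks retrieved by similarity search>"
    )
    question = "<user question>"

    dialog = [
        {"role": "system", "content": text_input},  # instructions + context
        {"role": "user", "content": question},      # the user's question
    ]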

The text input contains the instructions and the context that have to be provided to the LLM to generate a response. The user question is contained in the question part of the dialog.

The vectors of the chunks in the multi-dimensional vector space may not be close to the vectorized user query, so similarity search does not always produce a high score. To improve the similarity search, the user query is enriched before it is used.

There are various methods to enrich the query. The method chosen here is Hypothetical Document Embeddings (HyDE), as it has been found to outperform retrievers such as Contriever[1]. HyDE gives the LLM zero-shot instructions to create a hypothetical document; the prompt includes the instructions, the background of the contents in the RAG system, and the steps to process the information and formulate an enriched query within stated token limits, along the lines of the following prompt:
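The original prompt appears as an image in the post; the wording and token limit below are illustrative assumptions:

    You are an assistant for an organization's internal knowledge base,
    which covers <topics stored in the RAG system>.
    The user asked: "{question}"
    1. Write a short hypothetical passage (at most 150 tokens) that would
       plausibly answer this question.
    2. Rewrite the question as an enriched search query that captures the
       key terms of that passage.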

The enriched query, along with the context retrieved from similarity search, is then passed to another hypothetical-document prompt in which instructions on answering the user query are provided. The prompt looks like the following:
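Again, the original is an image; an illustrative sketch of such an answering prompt:

    Use only the context below to answer the user's question. If the answer
    is not contained in the context, say that you do not know.
    Context: {context retrieved via similarity search}
    Question: {enriched query}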

This method of query enrichment has proven more effective than regular query enrichment techniques: the query is dynamically modified and supplied with enough information to retrieve more relevant context, which in turn improves the accuracy of the response.

Considerations

During the process of creating a RAG system, we identified significant scope for improvement:

  • Though the HyDE approach retrieves better results in most cases, there are scenarios where a general template for the background of the chain-of-thought process does not work.
  • The HyDE approach can increase latency, as the prompts in the hypothetical document can be large and the LLM takes time to process the prompt and generate the enriched query.
  • The accuracy of the responses of a RAG system is not consistent.
  • The system is susceptible to jailbreaking[4]. A simple scenario is that it can still answer from its training data while entirely ignoring the context and the instructions provided to it.
  • Similarity search using KNN and FAISS indexing with the cosine similarity algorithm is not the most efficient in terms of latency.
  • It is difficult to isolate the LLM response or the vectorization from its training data. If an LLM is fine-tuned on the actual context that is provided, the vectors produced by the embedding model will be more precise and the generated results more accurate.
  • The existing architecture can be outperformed by using nmslib[5] alongside FAISS for better queries/sec; nmslib is well suited to hybrid search, as it can be used for both approximate nearest-neighbor search and exact keyword search.

Conclusion

Through this experimentation, we have identified improvements that can be implemented as next steps:

  • To improve the quality of similarity search results, hybrid search is a viable option, where vectorized search is accompanied by keyword-based search. This would retrieve quicker and more accurate results to provide context to the LLM.
  • The toxicity of the training data is reflected in the tone of the responses the LLM generates. The vocabulary of the toxic content, however, is minimized by the use of the RAG approach.
  • Crafting prompts to extract the best possible responses from the LLM is a creative space that is still being explored. There are several other ways the prompts could be constructed to improve the quality of the response.
  • Similarity search in the RAG system can also be improved by using multiple indices for the context; these can be considerably smaller indices focused on subtopics. This reduces search time and lowers the probability of retrieving context from an unrelated subtopic.
  • Evaluating the RAG system with measures such as F1 score, accuracy, and sentiment analysis.

References

[1] HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels

[2] k-NN Index — OpenSearch

[3] HNSW — Pinecone

[4] Jailbreaking

[5] nmslib

About the Authors:

Anaghaa Londhe is a Data Science Engineer with almost two years of experience in the data field. She is passionate about cloud computing, data analytics, and data engineering. She has worked on various data products, contributing to ideating and building data pipelines and cloud infrastructure.

Ankita Patil is a Data Scientist II at DISH Wireless and a former Electrical Design Engineer. Her passion lies at the intersection of Analytics, Data Science, and Machine Learning. She has contributed significantly to building enterprise-level data products for DISH’s 5G Network Technology.

Monisha Gnanaptakasam is a Telecom Engineer turned Data Scientist with a rich background in creating minimum viable products in both the telecommunications (5G) and data science domains. Passionate about continual learning and driven by the transformative shifts in the telecommunications industry, she is an inquisitive explorer navigating the data-driven realm.

Harsh Patel serves as a Data Scientist II at Dish Wireless, where he specializes in data analysis, statistical modeling, and machine learning. With a robust background in business strategies and problem-solving, Harsh effectively utilizes a variety of data storage and manipulation tools, showcasing his technical versatility and strategic acumen.


DISH Wireless DevEx

We are a community of software developers, data scientists and connectivity enthusiasts building a one-of-a-kind developer platform.