RAG: Part 4: Indexing

Mehul Jain
9 min read · Apr 5, 2024


Indexing, in the context of RAG, is the process of organizing a vast amount of text data in a way that allows the RAG system to quickly find the most relevant pieces of information for a given query. It’s like building a super-efficient library for your LLM, where instead of flipping through pages, it can instantly pinpoint the exact sections that address the user’s needs.


In the previous blogs, we saw how to chunk large documents and convert them into embeddings. In this blog, I will discuss the underlying algorithms that various database systems use to store and retrieve large amounts of data at high speed.

Before we start, let me answer a very basic question: why do we need specialised vector DBs? Can’t we use standard relational DBs such as SQL databases?

Performance Bottlenecks:

  • Indexing: Relational databases excel at indexing structured data based on specific columns. High-dimensional vectors, with many features, pose challenges for traditional indexing methods. Specialized vector databases use indexing techniques optimized for fast similarity searches.
  • Scalability: As the number of vectors grows in your RAG system, standard databases can struggle to keep up with retrieval speed. Vector databases are designed to scale efficiently with massive datasets.

Efficiency Considerations:

  • Storage: Relational databases might store vectors as generic data blobs, which isn’t optimized for space usage. Vector databases often employ compression techniques specifically for vector data.
  • Querying: SQL isn’t ideal for complex vector similarity queries. Vector databases provide built-in functionality for these types of searches, making them faster and more efficient.

Indexing

Two factors are considered when choosing which algorithm to use for a given problem: accuracy and speed.

If we need higher accuracy, we might need to take a brute-force approach that compares every stored vector with the query vector. Alternatively, we can use approximation techniques to increase speed, but at the cost of accuracy.

Interested in speeding up your searches? Here are two key strategies:

  1. Trim vector size: This can be achieved by employing dimensionality reduction techniques or minimizing the number of bits used to represent vector values.
  2. Narrow down search scope: Opt for clustering or structuring vectors into tree formations based on specific attributes, similarity, or distance. Then, focus your search on the closest clusters or sift through the most similar branches for efficiency.

Let’s take a deep dive into various algorithms, each of which uses one of the two approaches or a combination of both.

1. k-nearest neighbours

KNN (K-Nearest Neighbors) indexing is a brute-force technique used in the retrieval stage of RAG models.

Core Concepts:

  • Similarity Measure: A function (e.g., cosine similarity, L2 distance) that determines the “closeness” between two vectors. The retrieval process aims to identify passages/documents whose vector representations are most similar to the query vector.

Note: Use L2 distance when the magnitude of features matters (ideal when raw values and their differences are important, like comparing user profiles based on age and income); otherwise use cosine similarity (ideal when comparing documents based on word topics, regardless of the overall word count). The sketch after this list illustrates the difference.

  • k Value: The number of nearest neighbours to retrieve for each query. A higher k value provides more diverse results, while a lower k value prioritizes the most relevant ones.
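
Here is a minimal numpy sketch of the note above (the toy vectors and helper names are mine, for illustration): the same pair of vectors looks far apart under L2 distance but identical under cosine similarity.

```python
import numpy as np

def l2_distance(a, b):
    # Magnitude-sensitive: raw-value differences drive the result.
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Magnitude-insensitive: compares direction (topic mix) only.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc = np.array([2.0, 1.0])         # topic weights of a short document
doc_long = np.array([20.0, 10.0])  # same topic mix, 10x the word count

print(l2_distance(doc, doc_long))        # ~20.12: magnitudes differ a lot
print(cosine_similarity(doc, doc_long))  # 1.0: identical direction
```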

Search Process:

  1. Query Vectorization: The user’s query is transformed into a vector using the same method applied to the corpus.
  2. KNN Search: It retrieves the k nearest neighbour vectors from the corpus using the given similarity metric. A brute-force scan with a size-k heap costs O(N log k); in low-dimensional spaces a KD-tree can cut query time to roughly O(log N), though its advantage fades as dimensionality grows. A brute-force sketch follows.
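
As a concrete illustration, here is a minimal brute-force sketch in numpy (the function name knn_search and the random data are mine): every corpus vector is compared with the query, and only the k winners are sorted.

```python
import numpy as np

def knn_search(query, corpus, k=5):
    """Brute-force KNN over an (N, d) corpus using cosine similarity."""
    # Normalise rows so a plain dot product equals cosine similarity.
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = corpus_n @ query_n                # N comparisons, O(N * d)
    top_k = np.argpartition(-sims, k)[:k]    # unordered top-k in O(N)
    return top_k[np.argsort(-sims[top_k])]   # sort only the k winners

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 128))  # stand-in for document embeddings
query = rng.normal(size=128)
print(knn_search(query, corpus))         # ids of the 5 most similar documents
```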

2. Inverted File Index

IVF indexing is an approximate nearest neighbour (ANN) technique used to accelerate the retrieval stage in RAG. It shares some similarities with KNN indexing but operates in a slightly different way.

Core Concepts:

  • Clustering: The vector space is partitioned into clusters using techniques like k-means clustering. Each cluster has a centroid, which represents the center of that cluster in the vector space.
  • Inverted File: An inverted file data structure (much like a Python dictionary) is created. This structure maps each centroid to the list of data point IDs (passages/documents) that belong to the corresponding cluster.

Search Process:

  1. Nearest Centroid Search: The IVF index efficiently searches for the nearest centroid (the cluster centroid vector most similar to the query vector) based on a similarity measure (often cosine similarity).
  2. Refined Search: Within the cluster identified by the nearest centroid, a smaller number of nearest neighbours (data points) are retrieved using an exact distance metric (like L2 distance). This step refines the search within the most promising cluster; in practice, several of the nearest clusters are often probed rather than just one, trading a little speed for better recall. A toy sketch follows.
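
Below is a toy version of this two-step search, assuming scikit-learn is available for the k-means step (the class name IVFIndex and the n_probe parameter are my own labels, not a real library API):

```python
import numpy as np
from sklearn.cluster import KMeans

class IVFIndex:
    """Toy inverted file: k-means centroids mapped to lists of vector IDs."""

    def __init__(self, vectors, n_clusters=50):
        self.vectors = vectors
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
        # Inverted file: centroid id -> ids of the vectors in that cluster.
        self.lists = {c: np.where(self.kmeans.labels_ == c)[0]
                      for c in range(n_clusters)}

    def search(self, query, k=5, n_probe=3):
        # 1. Nearest-centroid search: pick the n_probe closest clusters.
        d = np.linalg.norm(self.kmeans.cluster_centers_ - query, axis=1)
        nearest = np.argsort(d)[:n_probe]
        # 2. Refined search: exact L2 only inside the probed clusters.
        cand = np.concatenate([self.lists[c] for c in nearest])
        dists = np.linalg.norm(self.vectors[cand] - query, axis=1)
        return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
index = IVFIndex(rng.normal(size=(5_000, 64)))
print(index.search(rng.normal(size=64)))  # ids drawn from 3 probed clusters
```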

3. Locality Sensitive Hashing

LSH indexing serves to expedite the retrieval process. Unlike the previously discussed methods (KNN, IVF), LSH focuses on mapping similar data points to the same “buckets” with a high probability, even though it might not guarantee the closest neighbours.

Core Concepts:

  • Hash Functions: LSH utilizes a family of hash functions that map similar data points (represented as vectors) to the same “hash bucket” with a high probability. However, dissimilar data points might also collide in the same bucket.
  • Similarity Measure: A function (like cosine similarity) determines the “closeness” between two vectors. The LSH functions are designed to map similar vectors (based on the similarity measure) to the same bucket with a higher probability compared to dissimilar vectors.
  • Multiple Hash Tables: LSH often uses multiple hash tables, each employing a different LSH function. This approach increases the likelihood of finding similar items even if they collide in one table, as they might be separated in another.

Search Process Steps:

  1. LSH Hashing: The query vector is hashed using each hash function in the LSH family, resulting in a set of hash codes (bucket indices) for each table.
  2. Candidate Retrieval: Based on the generated hash codes, documents that share a bucket with the query (i.e., are potentially similar under the LSH function) in any of the tables are considered candidate matches; these candidates can then be re-ranked with the exact similarity measure. A minimal sketch follows.
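
Here is a minimal random-hyperplane LSH sketch in numpy (the class and method names are mine, and real systems tune the number of tables and bits): each table hashes a vector to a tuple of sign bits, and the candidates are the union of the query’s buckets.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Toy LSH: sign-of-dot-product hashes across several hash tables."""

    def __init__(self, dim, n_tables=8, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        # One set of random hyperplanes (one hash function) per table.
        self.planes = rng.normal(size=(n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, v):
        # Bit i is 1 iff v lies on the positive side of hyperplane i.
        bits = (self.planes @ v) > 0        # shape: (n_tables, n_bits)
        return [tuple(row) for row in bits]

    def add(self, vec_id, v):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(vec_id)

    def candidates(self, query):
        # Union of the buckets the query falls into, across all tables.
        found = set()
        for table, key in zip(self.tables, self._keys(query)):
            found.update(table[key])
        return found

rng = np.random.default_rng(1)
lsh = RandomHyperplaneLSH(dim=64)
data = rng.normal(size=(1_000, 64))
for i, v in enumerate(data):
    lsh.add(i, v)
print(sorted(lsh.candidates(data[0]))[:10])  # small set; always contains 0
```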

4. Random Projection

RP indexing projects high-dimensional data (text vectors) into a lower-dimensional space while attempting to preserve similarity relationships.

Core Concepts:

  • Dimensionality Reduction: RP aims to reduce the dimensionality of textual data represented as vectors. High-dimensional vectors can be computationally expensive to compare.
  • Random Projection Matrix: A random matrix is generated, with each element containing random values (often following a Gaussian distribution). The size of this matrix determines the target lower dimension.
  • Similarity Preservation: The goal is to project data points (vectors) onto a lower-dimensional space in a way that retains, as much as possible, the relative similarity they had in the high-dimensional space. A common variant produces a binary code: much like an SVM hyperplane, when a hyperplane’s normal vector yields a positive dot product with a vector, that bit is encoded as 1, otherwise 0. This allows for efficient search in the lower-dimensional space while still capturing relevant connections.

Search Process Steps:

  1. Random Projection: Both the query vector and the document vectors in the corpus are projected onto the lower-dimensional space using the pre-computed random projection matrix. This results in lower-dimensional representations of the data (binary vector).
  2. Search in Lower Dimension: An efficient search in the lower-dimensional space (Hamming distance between binary codes to find the closest match) retrieves documents whose projected vectors are most similar to the projected query vector. This search is faster thanks to the reduced dimensionality; see the sketch below.
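
A compact sketch of signed random projection plus Hamming search (the dimensions and helper names are illustrative choices, not fixed requirements):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 768, 64                 # e.g. 768-d embeddings -> 64-bit codes

# Random projection matrix with Gaussian entries; computed once, reused.
projection = rng.normal(size=(n_bits, dim))

def to_binary(x):
    # Positive dot product with a hyperplane normal -> bit 1, else 0.
    return (x @ projection.T) > 0

corpus = rng.normal(size=(10_000, dim))
codes = to_binary(corpus)             # (10000, 64) binary codes

def search(query, k=5):
    q = to_binary(query)
    # Hamming distance = number of differing bits between binary codes.
    hamming = (codes != q).sum(axis=1)
    return np.argsort(hamming)[:k]

print(search(corpus[0] + 0.01 * rng.normal(size=dim)))  # 0 should rank first
```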

5. Product Quantization

PQ indexing is used to accelerate the search process and reduce memory footprint.

Core Concepts:

  • Vector Decomposition: High-dimensional query and document vectors are decomposed into smaller sub-vectors representing lower-dimensional subspaces. This effectively breaks down the complex vector into simpler pieces.
  • Codebook Creation: For each subspace, a codebook is created using techniques like k-means clustering. This codebook contains a set of representative centroids, each representing a group of similar sub-vectors.
  • Encoding: Each sub-vector is then “encoded” by identifying the closest centroid in its corresponding codebook. This encoding process assigns an index to the sub-vector based on its closest centroid.

Search Process Steps:

  1. Vector Decomposition: The query vector is decomposed into sub-vectors according to the pre-defined subspaces.
  2. Subspace Encoding: Each sub-vector is encoded by finding its closest centroid in the corresponding codebook, resulting in a set of indices representing the encoded sub-vectors.
  3. Approximate Distance Calculation: Using the stored codes of the document vectors, the distance to the query is estimated from per-subspace lookup tables of query-to-centroid distances, rather than by full vector comparisons. A toy sketch follows.
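
Here is a toy product-quantization sketch, assuming scikit-learn for the per-subspace k-means (8 subspaces and 256 centroids are typical choices, not requirements, and the helper names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dim, n_sub, n_centroids = 128, 8, 256      # 8 subspaces of 16 dims each
sub_dim = dim // n_sub
corpus = rng.normal(size=(2_000, dim))

# Codebook creation: one k-means codebook per subspace.
codebooks = [KMeans(n_clusters=n_centroids, n_init=4)
             .fit(corpus[:, i * sub_dim:(i + 1) * sub_dim])
             for i in range(n_sub)]

# Encoding: each vector becomes n_sub centroid indices (1 byte each here),
# shrinking 128 floats down to 8 bytes per stored vector.
codes = np.stack([cb.labels_ for cb in codebooks], axis=1).astype(np.uint8)

def search(query, k=5):
    # Asymmetric distance: per-subspace tables of squared distances from
    # the raw query sub-vector to every centroid in that subspace.
    tables = [np.linalg.norm(cb.cluster_centers_
                             - query[i * sub_dim:(i + 1) * sub_dim],
                             axis=1) ** 2
              for i, cb in enumerate(codebooks)]
    # Estimated distance = sum of table lookups along each stored code.
    dists = sum(tables[i][codes[:, i]] for i in range(n_sub))
    return np.argsort(dists)[:k]

print(search(corpus[0]))  # index 0 should rank at or near the top
```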

6. Hierarchical Navigable Small World

HNSW excels at finding data points in a large collection that are most similar to a given query, but it might not pinpoint the absolute exact closest neighbour. This trade-off between perfect accuracy and retrieval speed makes HNSW ideal for applications like RAG indexing, where returning highly relevant information quickly is crucial.

Core Concepts:

  • Navigable Small World Graphs: HNSW builds upon the concept of navigable small world graphs. Imagine a social network where everyone is connected to a few close friends but also has a few random connections to people further away in the network. This allows for efficient navigation — you can usually reach any person within a small number of hops by strategically following close and distant connections. HNSW translates this concept into a graph structure where data points (vectors) are nodes, and connections represent their similarity.
  • Skip list: Borrowing from the skip list (a layered linked-list data structure kept in sorted order), HNSW introduces a hierarchical structure to the graph. There are multiple layers, each with a different “grain size” for connections. The top layer has far fewer connections but spans larger distances in the vector space, allowing for quick exploration. Lower layers have more connections but focus on finer-grained similarity. This hierarchy enables efficient search: the algorithm starts at the top layer for a broad initial search and then progressively refines in lower layers to find the nearest neighbours.

Search Process:

  1. Top-Layer Exploration: HNSW leverages the long-distance connections in the top layer to identify a small set of potentially promising nodes (candidate nearest neighbours).
  2. Hierarchical Descent: The algorithm iteratively explores these candidates in lower layers, using shorter-distance connections to refine the search and get closer to the true nearest neighbours.
  3. Selection: Throughout the search, a pre-defined number of nearest neighbours are selected based on a distance threshold or other criteria. A library-based sketch follows.
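
Rather than re-implementing the layered graph by hand, here is a minimal sketch using the hnswlib library (assuming it is installed via pip install hnswlib; the parameter values are illustrative defaults, not tuned recommendations):

```python
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
dim, n = 128, 10_000
data = rng.normal(size=(n, dim)).astype(np.float32)

# Build the layered graph: M bounds the connections per node, while
# ef_construction sets the build-time search width (higher = better graph).
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

# ef is the candidate-list size at query time: the speed/recall knob.
index.set_ef(64)

labels, distances = index.knn_query(data[0], k=5)
print(labels)  # approximate 5 nearest neighbours; item 0 should appear
```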

There are a few key challenges encountered while indexing in RAG systems:

  • Data Quality and Relevance: The quality of retrieved information heavily depends on the quality of the indexed data. Inaccurate, incomplete, or irrelevant data in the index can lead RAG models to generate inaccurate or misleading responses.
  • High dimensionality of Text Embeddings: As the embedding dimensionality increases, distance calculations between vectors become less meaningful, making it harder to identify truly similar data points.
  • Balancing Speed and Accuracy: Finding the perfect balance between retrieval speed and accuracy is crucial. RAG systems require fast retrieval to generate responses in real time. However, overly aggressive filtering techniques might exclude relevant information. Indexing methods like HNSW offer a good compromise, but fine-tuning these methods for optimal performance can be challenging.
  • Limited Evaluation Metrics: There is a lack of standardized metrics specifically designed to evaluate the quality of retrieved information in RAG indexing. Current metrics often focus on retrieval speed or recall (the fraction of relevant items retrieved) but might not adequately capture the true relevance of the information to the user’s query.

Addressing these challenges is crucial for building robust and effective RAG systems that can leverage the power of large language models to deliver accurate and informative responses.

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.
