How to Process Documents for RAG (Retrieval-Augmented Generation) Chatbots

Timo Selvaraj
Published in SearchBlox
2 min read · May 1, 2024

By combining the generation capabilities of large language models (LLMs) with a retrieval component, typically a vector or semantic search, RAG chatbots can provide informative, personalized responses backed by evidence from a supplied corpus of documents.

However, the performance of these models hinges on properly processing and indexing the document corpus for efficient retrieval. If retrieval fails, the chatbot's responses will be irrelevant or incoherent to users.

In this article, we’ll explore the key steps involved in preparing documents for RAG models.

Document Preprocessing: The Foundation

Before building indexes, documents must undergo preprocessing to structure the content for retrieval. This involves:

- Splitting documents into passages or chunks of readable text

- Removing boilerplate elements like headers and footers

- Cleaning text by stripping HTML tags, excessive whitespace, etc.

- Optionally adding passage titles or other metadata

Determining the optimal passage length is crucial: shorter passages enable more granular retrieval but increase the index size and computational overhead. Depending on the type of unstructured documents and the complexity of their layouts, preprocessing may need to be run and refined multiple times; misprocessed documents will undermine the entire solution.
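As a rough illustration, here is a minimal Python sketch of the cleaning and chunking steps. The regex-based cleaning, chunk size, and overlap are illustrative assumptions, not recommendations; production pipelines typically use a proper HTML parser and layout-aware splitting.

```python
import re

def clean_text(raw: str) -> str:
    """Strip HTML tags and collapse excessive whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop HTML tags (simplistic, for illustration)
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split cleaned text into overlapping word-based passages."""
    words = text.split()
    step = chunk_size - overlap  # overlap preserves context across chunk boundaries
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```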

Building High-Performance Retrieval Indexes

With preprocessed passages, you can build inverted indexes or vector databases for fast lookup during inference. Popular indexing approaches include:

- BM25 Inverted Indexes: A classic sparse retrieval technique.

- Dense Vector Indexes: Leverage semantic passage embeddings for nearest neighbor search.

- Hybrid Search Indexes: Combine dense embeddings with keyword retrieval algorithms.

These indexes map passages to their original document IDs and offsets, allowing retrieval of full document text during inference.
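To make the dense option concrete, the sketch below builds a small in-memory vector index with FAISS and sentence-transformers. The embedding model name and the sample passages are assumptions chosen for illustration; any embedding model and vector database could fill these roles.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Passages from the preprocessing step, keeping doc IDs and offsets as metadata.
passages = [
    {"doc_id": "doc-1", "offset": 0, "text": "RAG pairs retrieval with generation."},
    {"doc_id": "doc-1", "offset": 1, "text": "Passages are embedded as dense vectors."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of embedder
vectors = model.encode([p["text"] for p in passages], normalize_embeddings=True)

# Inner product over normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype=np.float32))
# Row position in the index maps back to passages[i]["doc_id"] and ["offset"].
```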

Retrieval at Inference Time

When a user queries the RAG chatbot, the following steps occur:

1. The search query is embedded using the same model as the index.

2. Top-k nearest neighbor passages are retrieved via search.

3. Retrieved passages are concatenated and passed to the language model alongside the query.

4. The language model generates a response conditioned on the retrieved context.
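Put together, the inference-time flow might look like the sketch below, reusing the model, index, and passages from the previous example. The `call_llm` function is a hypothetical placeholder for whatever LLM client you use; the prompt template is likewise only an assumption.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query with the same model as the index and return top-k passages."""
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return [passages[i]["text"] for i in ids[0] if i != -1]

def answer(query: str) -> str:
    """Concatenate retrieved passages and condition the LLM on them."""
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # call_llm is a placeholder, not a real API
```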

Some RAG chatbots perform iterative retrieval, using the model's previous output to query the index again and retrieve more relevant information, as sketched after this paragraph. Enterprise search platforms can handle both the document preprocessing and the retrieval function efficiently; most come with connectors that access documents or databases, preprocess them, and index the data into inverted, semantic, or hybrid indexes.
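Iterative retrieval can be sketched as a simple loop in which each draft answer seeds the next round of retrieval. The fixed round count and the reuse of `retrieve` and `call_llm` from the previous sketch are assumptions for illustration only; real systems typically use a smarter stopping criterion.

```python
def iterative_answer(query: str, rounds: int = 2) -> str:
    """Feed each draft answer back into retrieval to widen the context."""
    context = retrieve(query)
    draft = ""
    for _ in range(rounds):
        joined = "\n\n".join(context)
        draft = call_llm(f"Answer using only this context:\n{joined}\n\nQuestion: {query}")
        context.extend(retrieve(draft))  # re-query the index with the draft answer
    return draft
```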

By carefully processing documents and building efficient retrieval indexes, RAG chatbots can leverage external knowledge sources to provide more informative and substantiated responses, elevating the capabilities of chatbots and AI assistants.
