Data Indexing in the Retrieval Augmented Generation (RAG) Pipeline

Understanding the power of data indexing and mapping across different sources to improve the results from the pipeline.

Abhilasha Mangal
Towards Generative AI
4 min read · Aug 2, 2023


Data collection and indexing are fundamental components of the Retrieval-Augmented Generation (RAG) pipeline. In the RAG approach, relevant documents or passages are first retrieved from a large dataset; these retrieved passages in turn support the generation of high-quality responses. In this post, we will focus on the data indexing process of the RAG pipeline, shown below.

Data Indexing Process

In this blog, we will discuss how Elasticsearch, Apache Solr, and Watson Discovery can help you index data so that the pipeline can retrieve relevant responses quickly.

Parsing Engine:

In the RAG pipeline for private data sources, we collected data in different file formats, such as HTML, PDF, and web pages. This data concerns IBM and IBM products. To extract text from PDFs, we used the open-source PyPDF2 Python library, and we applied some preprocessing steps to HTML pages to obtain readable content. This content can then be stored in a database along with its metadata.

Parsing Engine
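The original post showed this step as a screenshot. As a minimal sketch of what the parsing engine might look like: the HTML cleanup below uses only the standard library, while `pdf_to_text` assumes PyPDF2 is installed; all function names and the file path are illustrative, not the author's actual code.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style tags."""

    def __init__(self):
        super().__init__()
        self._skip = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the markup.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def html_to_text(html: str) -> str:
    """Preprocess an HTML page into readable plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    return parser.text()


def pdf_to_text(path: str) -> str:
    """Extract text from a PDF, one page at a time (PyPDF2 assumed installed)."""
    from PyPDF2 import PdfReader  # pip install PyPDF2

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

The resulting plain text, together with metadata such as the source URL or filename, is what gets handed to the indexing step.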

1. Data Indexing using Apache Solr:

Apache Solr is an open-source platform that provides distributed indexing and search capabilities. We can enrich the data by configuring NLP libraries while creating an index. Here, we stored the content and its metadata. Using the code below, you can index the data in Solr.

Solr Indexing

We used the Python pysolr package to establish a connection with Solr; the connection is created through the pysolr.Solr class. We created JSON documents in order to index the data, and the add() method of pysolr allows us to index all of them. To improve the searchability of the data, we configured NLP libraries to eliminate stop words and special characters during indexing.
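The steps above can be sketched roughly as follows. The core name and Solr URL are placeholders, and `build_solr_docs` is a hypothetical helper for shaping the parsed records; pysolr is assumed to be installed.

```python
def build_solr_docs(records):
    """Shape parsed records into the flat JSON documents Solr expects."""
    return [
        {
            "id": rec["id"],
            "title": rec.get("title", ""),
            "content": rec["content"],
            "source": rec.get("source", ""),
        }
        for rec in records
    ]


def index_to_solr(docs, solr_url="http://localhost:8983/solr/ibm_docs"):
    """Connect to Solr and index a list of JSON documents (pysolr assumed installed)."""
    import pysolr  # pip install pysolr

    # always_commit=True makes each add() immediately visible to searches.
    solr = pysolr.Solr(solr_url, always_commit=True, timeout=10)
    solr.add(docs)  # pysolr serialises the list of dicts to JSON for us
```

Stop-word and character filtering would normally live in the core's schema and analyzer configuration rather than in this client code.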

2. Data Indexing using Elasticsearch:

We utilized the free and open-source elasticsearch Python client to build an index and index mapping for Elasticsearch. We created a connection and an index for Elasticsearch using the code below.

Elasticsearch Index Creation
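The screenshot is missing, so here is a minimal sketch of index creation with a mapping, assuming the elasticsearch-py 8.x client; the host, index name, and field names are placeholders.

```python
# Index mapping for the parsed documents: text fields for full-text
# search plus a keyword field for exact-match metadata filtering.
INDEX_MAPPING = {
    "properties": {
        "title": {"type": "text"},
        "content": {"type": "text"},
        "source": {"type": "keyword"},
        "ingested_at": {"type": "date"},
    }
}


def create_index(index_name="ibm_docs"):
    """Connect to Elasticsearch and create the index if it does not exist."""
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name, mappings=INDEX_MAPPING)
    return es
```

Declaring the mapping up front, rather than letting Elasticsearch infer it, keeps metadata fields like `source` as exact-match keywords instead of analyzed text.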

To index the data in Elasticsearch, we then parsed the information from the data collection and built an index mapping in JSON format. We can index the data by using the index() method of the Elasticsearch client.

Elasticsearch Indexing Method
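A sketch of the indexing call itself, again assuming the elasticsearch-py 8.x client; `to_es_doc` is a hypothetical helper, and the index name is a placeholder.

```python
def to_es_doc(rec):
    """Shape a parsed record into the document body sent to Elasticsearch."""
    return {
        "title": rec.get("title", ""),
        "content": rec["content"],
        "source": rec.get("source", ""),
    }


def index_documents(es, records, index_name="ibm_docs"):
    """Index each record with es.index(); reusing the record id as the
    document id means re-running the pipeline overwrites rather than
    duplicates documents."""
    for rec in records:
        es.index(index=index_name, id=rec["id"], document=to_es_doc(rec))
```

For large collections, the client's bulk helpers would be more efficient than one `index()` call per document, but the per-document form mirrors the approach described above.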

3. Data Indexing using Watson Discovery:

Watson Discovery is a powerful cloud-based information retrieval tool. It can parse many types of file data and store the results in its collections. It provides advanced text analytics, machine learning, content mining, and efficient search capabilities. Using the Watson Discovery UI, we uploaded the data and stored it in Watson Discovery. To learn more about how to upload data and apply all of these capabilities, read here.

Conclusion:

Data indexing plays a critical role in any RAG pipeline. All LLMs have limitations regarding the data they can answer about, since that depends on the data they were trained on. By indexing our own data, we can supply the LLM with an additional set of documents so that it can produce more relevant answers.

We can store huge datasets along with their metadata using Elasticsearch, Apache Solr, and Watson Discovery. These tools offer pre-built NLP preprocessing, text analysis, and content mining features that enable quick retrieval and data enrichment in the pipeline.

The next stage in the RAG pipeline is to implement a retriever to fetch the relevant documents. Learn more about it in our next blog.

Follow Towards Generative AI for the latest technical content related to AI.
