Let’s tune RAG pipelines with Fondant

Matthias Richter
Published in Fondant Blog
Jan 15, 2024

Retrieval augmented generation (RAG) has quickly become the go-to architecture for providing large language models (LLMs) with specific knowledge. It first retrieves relevant context from a knowledge base, which is then fed to an LLM that generates an answer. To learn more about RAG, check out this blogpost.

While the concept of RAG is straightforward, optimising a custom setup can take days of searching for the right set of parameters and system configuration. Off-the-shelf solutions make it easy to set up a quick proof of concept, but their performance is usually insufficient for production usage since they are not adapted to the complexities of a specific situation.

At the first OpenAI developer conference, OpenAI presented different techniques to maximise the performance of LLMs in RAG applications. They showed that by using several different methods in an iterative manner, we can more than double the accuracy of a RAG system.

The initial progress, from 45% to 65%, involved experimenting with various embedding models and chunking strategies across approximately 20 iterations. Expanding the approach to include additional techniques, such as changing the database or using advanced prompt engineering, extends the journey. Even after discovering an effective set of techniques and parameters, revisiting them is necessary when applying the model to a new knowledge base or a project with different data formats, tones, or languages.

Unique data requires a unique solution!

No panic — Fondant can help us!

Fondant is an open-source framework that makes data processing reusable and shareable. Key features include:

  • A hub of reusable data processing components
  • A local runner for easy local development and testing
  • Built-in caching to quickly iterate with multiple runs
  • Data lineage and explorer to inspect data evolution across pipeline steps and runs
  • Parallel processing out of the box, which especially speeds up the processing of large datasets
  • The capability to execute scalable production workloads on managed cloud services like AWS Sagemaker or Google Vertex AI.

Dive into our documentation to explore all features!

In this blog post, we leverage Fondant to set up a RAG system and automatically find the best data configuration.

To this end, we have created a sample GitHub repository containing a Fondant pipeline to index a custom knowledge base and a second Fondant pipeline to evaluate the overall performance.

The evaluation uses a set of test questions. Each question is used to retrieve relevant context, which is then evaluated using another LLM or against ground-truth answers.

To find the best overall configuration we can easily exchange specific parameters or even complete components of our Fondant pipeline and evaluate the results. The repository contains notebooks that help us to run this optimisation both iteratively and automatically.

🏗️ Building a RAG pipeline with Fondant

Theoretically, it is possible to use any type of database for retrieving relevant context information. However, vector databases and semantic search have become the go-to solution for RAG. Language models are employed to create embeddings or vector representations of the given text, with semantically similar text being clustered in the vector space. To retrieve relevant context, the question is embedded into the same vector space and nearby content is retrieved.
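To make the idea concrete, here is a minimal, framework-free sketch of semantic retrieval. The model name and the toy documents are purely illustrative; the Fondant pipeline described below does the same thing at scale:

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any sentence-embedding model could be used here
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Fondant is an open-source framework for building data processing pipelines.",
    "Weaviate is a vector database commonly used for semantic search.",
    "Paris is the capital of France.",
]
# Embed the documents; normalised vectors let us use a dot product as cosine similarity
doc_vectors = model.encode(documents, normalize_embeddings=True)

# Embed the question into the same vector space and retrieve the closest document
question = "Which database can I use for semantic search?"
question_vector = model.encode(question, normalize_embeddings=True)
scores = doc_vectors @ question_vector
print(documents[int(np.argmax(scores))])  # expected: the Weaviate sentence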

This process can be encapsulated in a Fondant pipeline:

  • Data Loading: Begin by loading text data. We leverage a dataset from Hugging Face for our minimal example, but we can use one of the other loaders available on the Fondant Hub to connect various data sources.
  • Text Chunking: Break down the text data into smaller, meaningful parts, preparing it for the embedding step.
  • Text Embedding: Utilise language models to embed the text chunks into vector representations, facilitating the search for relevant chunks.
  • Write to Vector Database: The final step involves writing the chunks and their embeddings into a vector database. In this example we use Weaviate, one of our favourite vector stores for RAG, but there are other integrations available on the Hub.

Check out pipeline_index.py to see how the entire pipeline is constructed. We initialise a pipeline and add various reusable components to it.

The snippet below gives us an initial glimpse at Fondant’s pipeline interface:

import pyarrow as pa

from fondant.pipeline import Pipeline

...

# Initialize the pipeline
indexing_pipeline = Pipeline(
    name="indexing-pipeline",
    base_path="./fondant-artifacts",
)

# Create a data loading step using the reusable `load_from_hf_hub` component
text = indexing_pipeline.read(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "wikitext@~parquet",
        "n_rows_to_load": n_rows_to_load,
    },
    produces={
        "text": pa.string(),
    },
)

# Apply the chunking transformation using the reusable `chunk_text` component
chunks = text.apply(
    "chunk_text",
    arguments={
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    },
)
...
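The embedding and indexing steps follow the same pattern. As a rough sketch of how the pipeline might continue (the component names and arguments below are assumptions based on the Fondant Hub; pipeline_index.py in the repository contains the exact configuration):

# Embed each chunk with a reusable embedding component (name and arguments are assumptions)
embeddings = chunks.apply(
    "embed_text",
    arguments={
        "model_provider": "huggingface",
        "model": "all-MiniLM-L6-v2",
    },
)

# Write the chunks and their embeddings to Weaviate (name and arguments are assumptions)
embeddings.write(
    "index_weaviate",
    arguments={
        "weaviate_url": "http://host.docker.internal:8080",
        "class_name": "index",
    },
)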

To follow along with this example, we need to ensure we have access to a running Weaviate database. For local testing, we can take advantage of the provided Docker setup in our GitHub repository.

Fondant provides a CLI that includes a command for executing our pipeline locally. In the notebooks, we have used the Python interface to start a pipeline execution.

from fondant.pipeline.runner import DockerRunner

# Run the pipeline locally, with each component executed in its own Docker container
DockerRunner().run(pipeline)

Executing this command initiates a step-by-step launch of distinct Docker containers.

The usage of Docker for local execution adds some complexity, but it enables us to easily share components between users and to have a setup close to a production environment. If our pipeline works well locally, executing the exact pipeline on a managed service becomes seamless.

For deeper insights into the pipeline run, we could leverage Docker Desktop. It offers an overview of all containers and additional details on the execution, and enables monitoring of the resources utilised by Docker.

Once the pipeline has completed successfully, the embedded data will be stored in the Weaviate database. Weaviate comes with a UI that allows us to send requests to explore the ingested documents.
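Alternatively, we can poke at the ingested documents directly from Python with the Weaviate client (shown here with the classic v3-style API; the class and property names are assumptions about what the indexing component created):

import weaviate

# Connect to the locally running Weaviate instance
client = weaviate.Client("http://localhost:8080")

# Inspect the schema to see which class the pipeline created
print(client.schema.get())

# Fetch a few ingested chunks to eyeball the data ("Index" and "text" are assumptions)
result = client.query.get("Index", ["text"]).with_limit(3).do()
print(result)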

📊 Evaluate the results

At this point, we have successfully indexed data into the database, laying the foundation for constructing a RAG system to inquire about our custom data. However, the critical questions remain:

  • How effective is the system’s performance?
  • Will it identify the right documents and provide accurate answers to our questions?

The only way to gain clarity is through evaluation, ideally conducted regularly to ensure consistent performance measurement.

We have constructed a Fondant pipeline specifically designed to handle this evaluation process.

The evaluation pipeline comprises five components:

  • CSV Data Loading: Loads the evaluation dataset (questions) from a CSV file. We have to provide a test set with questions based on our own data. While actual questions used by the users of our application will work best, there are tools to help us generate questions from our data automatically.
  • Text Embedding: Embeds each question as a vector using a language model. This language model should be the same model used to embed the chunks.
  • Vector Store Retrieval: Retrieves the most relevant chunks for each question from the vector store.
  • Evaluation: Assesses the retrieved chunks for each question using a framework like RAGAS, measuring their proximity to the ground truth defined in the CSV file.
  • Aggregate Metrics: Gathers results at a pipeline level to offer a quick overview of retrieval performance.

Once again, we can employ reusable components for creating the pipeline. We can find the pipeline implementation in the repository, located in pipeline_eval.py, and the related code to execute the pipeline in the evaluation notebook.
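As a rough orientation, the evaluation pipeline could be wired up along the following lines. The component names, arguments, and file path are assumptions based on the Fondant Hub naming; pipeline_eval.py in the repository is the authoritative version.

import pyarrow as pa

from fondant.pipeline import Pipeline

evaluation_pipeline = Pipeline(
    name="evaluation-pipeline",
    base_path="./fondant-artifacts",
)

# Load the test questions from a CSV file (component name, arguments, and path are assumptions)
questions = evaluation_pipeline.read(
    "load_from_csv",
    arguments={"dataset_uri": "./data/questions.csv"},
    produces={"question": pa.string()},
)

# Embed the questions with the same model that was used to embed the chunks
embedded_questions = questions.apply(
    "embed_text",
    arguments={"model_provider": "huggingface", "model": "all-MiniLM-L6-v2"},
)

# Retrieve the most relevant chunks for each question from the vector store
retrieved_chunks = embedded_questions.apply(
    "retrieve_from_weaviate",
    arguments={"weaviate_url": "http://host.docker.internal:8080", "class_name": "index"},
)

# Score the retrieved chunks with RAGAS and aggregate the metrics for the whole run
evaluated = retrieved_chunks.apply("evaluate_ragas")
evaluated.apply("aggregate_eval_results")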

A successful pipeline run will yield the RAGAS evaluation metrics context_precision and context_relevancy. RAGAS calculates these scores using an LLM; the relevancy metric, for instance, assesses how relevant the retrieved context chunks are to the question.

The evaluation results are aggregated into a dataframe which can be viewed directly in the notebook.

So far, so good!

These metrics provide a succinct indication of the quality of our current setup. However, as any data scientist or engineer knows, looking at the data is invaluable to improve the results.

That’s why Fondant provides a data explorer, which enables an inspection of the data at any step of our pipeline.

The notebook contains a cell which starts the data explorer:

from fondant.explore import run_explorer_app

# Launch the data explorer for the artifacts produced under BASE_PATH
run_explorer_app(base_path=BASE_PATH)

After executing the cell, we can open the Fondant explorer at http://localhost:8501.

We can start looking at our data, for instance to check how the text was chunked or which characters were cleaned.

Are the results not sufficient? No problem! Do it again!

The iterative nature of the process involves repeating it until the results align with the use case standards. This means we can begin experimenting with parameter changes, such as chunk size or embedding models. Additionally, consider swapping out entire components in our index pipeline to observe the impact each change has on the final result.

Fondant features a caching mechanism that accelerates this process: only components for which no previous results exist will be executed. For example, if we modify only the last step of the pipeline and execute the entire pipeline again, the results of the earlier components are reused from the previous run and only the modified step is executed. This significantly speeds up re-running pipelines that were only slightly changed.
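A minimal sketch of such a parameter sweep, assuming a hypothetical helper build_pipelines(chunk_size, chunk_overlap) that wires up the indexing and evaluation pipelines for a given configuration (this helper is illustrative and not part of the repository; the notebooks implement the actual loop):

from fondant.pipeline.runner import DockerRunner

runner = DockerRunner()

# Grid of chunking parameters to compare; the values are illustrative
configurations = [
    {"chunk_size": 256, "chunk_overlap": 32},
    {"chunk_size": 512, "chunk_overlap": 64},
    {"chunk_size": 1024, "chunk_overlap": 128},
]

for config in configurations:
    # build_pipelines is a hypothetical helper returning both pipelines for this config
    indexing_pipeline, evaluation_pipeline = build_pipelines(**config)
    runner.run(indexing_pipeline)    # index the knowledge base with these parameters
    runner.run(evaluation_pipeline)  # measure retrieval quality for this configuration

Thanks to the caching described above, steps that are identical across configurations are reused rather than recomputed.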

🚀 Scale it

Depending on the number of pipeline components, potential parameters for each, and our dataset size, a pipeline run may demand a significant amount of resources.

Fortunately, transitioning from a local experimental setup to a production-ready workload is possible out-of-the-box. Fondant offers different remote runner options, such as Vertex AI, AWS Sagemaker, and Kubeflow Pipelines.

If we aim to submit a pipeline run to Google Cloud Vertex AI, we can use the following command:

fondant run vertex <pipeline_ref> \
  --project-id $PROJECT_ID \
  --project-region $PROJECT_REGION \
  --service-account $SERVICE_ACCOUNT

We can leverage this feature to run multiple pipelines in parallel on the managed cloud service of our choice.

✨ Wrap it up!

Now we have the know-how to use the example pipeline and conduct a parameter search to identify the best parameters for a custom RAG setup. Perhaps this brings to mind the beginning of this post:

Unique data requires a unique solution!

The sample repository used in this blogpost should provide a solid starting point for working with our own data. The flexibility of Fondant allows us to swap out entire components and modify component arguments based on specific needs. Feel free to take a look at the Fondant Hub, where you can find pre-built components that may suit your specific case. For instance, if we need to load data from text or PDF files instead of the Hugging Face Hub, we can seamlessly swap in the corresponding reusable data loader component.

Even as a use case becomes more complex, for example when we want to modify the chunks by removing empty lines or incorporate a sophisticated machine learning model that we have previously built to classify chunks based on text quality, we have the flexibility to construct a custom component. Such a custom component integrates seamlessly with the reusable ones, giving us the versatility needed to address specific scenarios.
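As an illustration, a minimal custom component that strips empty lines from each chunk might look roughly like this, sketched against Fondant's PandasTransformComponent interface (the column name "text" is an assumption; the documentation covers the exact base class and packaging requirements):

import pandas as pd

from fondant.component import PandasTransformComponent


class RemoveEmptyLines(PandasTransformComponent):
    """Custom component that removes empty lines from every text chunk."""

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # Keep only non-empty lines inside each chunk; the "text" column name is an assumption
        dataframe["text"] = dataframe["text"].apply(
            lambda chunk: "\n".join(
                line for line in chunk.splitlines() if line.strip()
            )
        )
        return dataframe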

For more detailed instructions on creating custom components, refer to our documentation.

Thanks for reading! If you want to stay on top of the latest Fondant developments or join our community, you can join our Discord.

