Development of a Retrieval Augmented Generation Pipeline

Leif Sabellek
CONTACT Research
May 2, 2024 · 5 min read
What DALL·E thinks Retrieval Augmented Generation looks like.

Large Language Models (LLMs) can nowadays generate coherent, grammatically correct text about almost any topic and in many languages. This fluency can be deceiving: it is tempting to assume that LLMs are omniscient assistants. Large enterprises would love to use them as such: helpful assistants that answer every question truthfully, not only about general topics but also about their internal data.

In reality, however, LLMs struggle with factual knowledge. If an LLM does not know the answer to a question, it often hallucinates one that looks plausible but is simply wrong.

One very popular approach to overcome this issue is Retrieval Augmented Generation (RAG). This technique combines a search engine with a generative model, so that the produced output is always grounded in a source document. RAG avoids expensive fine-tuning of the LLM on internal data; instead, it exploits the LLM's in-context learning abilities and provides all necessary information directly in the prompt.

What is RAG?

A RAG pipeline consists of two major components: a retriever and a generator. The retriever accepts a user query and searches a database for relevant information, which is then passed to the generator (an LLM) to craft an accurate response.

The retriever is largely independent of the generator. One could use a simple keyword search to retrieve relevant documents, but experience shows that semantic (dense vector) search usually outperforms keyword search. We have already implemented a semantic search; read about it in our previous blog post!
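
To make the division of labor concrete, here is a minimal sketch of the two-stage design in Python. The keyword-overlap retriever and the prompt-building generator stub are illustrative placeholders, not our production code; in the real pipeline the retriever is a semantic search and the generator an LLM call.

```python
# Minimal RAG skeleton: a retriever feeds a generator.
# The keyword-overlap retriever is a baseline placeholder; in practice
# it is replaced by a semantic (dense vector) search.

DOCUMENTS = [
    "CONTACT Elements supports workflow templates for change management.",
    "Dense vector search embeds queries and documents into the same space.",
    "Keyword search ranks documents by term overlap with the query.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(
        DOCUMENTS,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(query: str, sources: list[str]) -> str:
    """Placeholder for the LLM call: build the prompt it would receive."""
    context = "\n".join(f"- {s}" for s in sources)
    return (
        "Answer the following question using the given sources.\n"
        f"Question: {query}\nSources:\n{context}"
    )

query = "How does dense vector search work?"
print(generate(query, retrieve(query)))
```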

The RAG Pipeline: Our Experience

Our AI research team has been developing a RAG prototype for the documentation of our software suite CONTACT Elements. We want to share some of the details and learnings with you.

On the surface, there is a user interface where users can ask questions and receive a list of relevant links as well as an answer to their question. The answer references its sources, so users can easily check the source documents.

RAG prototype, work in progress.

Behind the scenes, several stages are necessary:

  • Preprocessing: This initial step breaks the data sources down into manageable chunks for the semantic search. Decisions need to be made about the length of these chunks, the overlap between them, and whether to split data mid-sentence or keep sentences whole (see the chunking sketch after this list). Handling multimedia content like images, videos, and tables is also a key consideration.
  • Embedding Model: The chunks are then embedded using a text embedding model. The choice of the embedding model heavily influences the preprocessing step, since different embedding models are optimized for different kinds of input chunks. Another consideration is data privacy: commercial models (e.g., OpenAI, Cohere, Google) can often only be used via an external API. We have tested both commercial models and open-source models on local hardware.
  • Vector Database: The embedding vectors are stored in a vector database. This choice is less critical, since most vector databases offer very similar features and performance. We selected Apache Solr for the simple reason that Solr is already part of our existing software stack, which will make it easier to integrate the RAG search into our products (see the indexing sketch after this list).
  • Reranking: After retrieving results from the vector database, the search results might not be in the ideal order: there can be irrelevant results, and the best result might not be at the top of the list. Reranking techniques range from traditional methods like TF-IDF or BM25 to modern LLM-based approaches (see the cross-encoder sketch after this list); Cohere even offers models trained specifically for the reranking task.
  • Prompt Generation: Once the results are reranked, they are combined with the original user query to create a prompt for the generator LLM. The prompt takes a form like “Answer the following question using the given sources. Question: […] Sources: […]”, but it can be refined considerably to make the output deal better with missing information, cite sources correctly, and so on (see the prompt sketch after this list).
  • Generative Model: Several models are available, from commercial models like GPT-4 to specialized RAG models. Interestingly, generating accurate responses does not always require large models; even smaller, locally hosted models can be effective, as demonstrated by our tests with models like Llama 3 8B.
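
For the preprocessing step, the following is a hedged sketch of sentence-preserving chunking with overlap. The regex-based sentence splitter and the parameter values are simplifications for illustration, not our actual preprocessing code.

```python
import re

def chunk_text(text: str, max_chars: int = 500, overlap_sentences: int = 1) -> list[str]:
    """Split text into chunks of whole sentences, overlapping by a few
    sentences so that context is not lost at chunk boundaries."""
    # Naive sentence splitter; a real pipeline would use a proper tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and sum(len(s) for s in current) + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry overlap forward
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```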
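
For the embedding and vector database steps, the sketch below indexes chunk embeddings into Solr and runs a k-nearest-neighbor query. It assumes a Solr 9 collection whose schema defines a DenseVectorField with matching dimension; the collection name, the field name chunk_vector, and the open-source all-MiniLM-L6-v2 model are hypothetical stand-ins for whatever is actually configured.

```python
import pysolr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
solr = pysolr.Solr("http://localhost:8983/solr/rag_chunks", always_commit=True)

# Index: embed each chunk and store text + vector.
chunks = ["First documentation chunk ...", "Second documentation chunk ..."]
solr.add([
    {"id": str(i), "text": chunk, "chunk_vector": model.encode(chunk).tolist()}
    for i, chunk in enumerate(chunks)
])

# Query: embed the question and ask Solr's knn query parser for the
# topK nearest chunks by vector similarity.
query_vector = model.encode("How do I configure workflows?").tolist()
knn_query = "{!knn f=chunk_vector topK=5}" + str(query_vector)
for hit in solr.search(knn_query, fl="id,text,score"):
    print(hit["score"], hit["text"])
```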
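
For the reranking step, one common neural approach is a cross-encoder that scores each query-document pair directly. A hedged sketch using a publicly available model from the sentence-transformers library (not necessarily the reranker we settled on):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads query and candidate together and outputs a relevance score.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Re-order retrieved candidates by cross-encoder relevance score."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```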
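
For prompt generation and the generative model, the sketch below assembles a prompt of the form described above and sends it to a chat model. The OpenAI client is used as one concrete example; a local model such as Llama 3 8B could be served behind the same kind of interface. The instruction wording is illustrative, not our tuned prompt.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def answer(query: str, sources: list[str]) -> str:
    """Build the RAG prompt from reranked sources and call the generator."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    prompt = (
        "Answer the following question using only the given sources. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n"
        f"Question: {query}\nSources:\n{numbered}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```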

Evaluation

So far, we have not settled on a specific set of parameters. Instead, we have built a pipeline in which we can quickly test many different parameter combinations for preprocessing, embedding, reranking, and generation. The pipeline also includes two evaluation stages, one for the retriever and one for the generator.

Overview of the whole pipeline for evaluation

To evaluate the retriever, we use a set of test questions labeled with correct documentation links. By passing the test questions to the retriever, we can automatically compute metrics like recall, precision, and nDCG.
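
As an illustration of these metrics, the sketch below computes precision@k, recall@k, and nDCG@k for a single test question with binary relevance labels; the actual evaluation aggregates such scores over the whole test set.

```python
import math

def precision_recall_ndcg(retrieved: list[str], relevant: set[str], k: int):
    """Binary-relevance retrieval metrics for one query."""
    hits = [1 if doc in relevant else 0 for doc in retrieved[:k]]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant)
    # DCG discounts hits logarithmically by rank; IDCG is the best possible DCG.
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg else 0.0
    return precision, recall, ndcg

# Example: the relevant doc1 was retrieved, but only at rank 2.
print(precision_recall_ndcg(["doc3", "doc1", "doc7"], {"doc1", "doc2"}, k=3))
```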

Evaluating the generator is less straightforward, since the correctness or helpfulness of an answer cannot easily be measured. Here, we employ the RAGAS framework, which uses LLMs as judges to measure metrics such as context relevance (how relevant is the retrieved context to the user query?) or answer faithfulness (is the generated answer grounded in the given context?).
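
A hedged sketch of what such an evaluation looks like with ragas, assuming the v0.1 API at the time of writing (metric names and expected dataset columns have changed across versions, and the judge LLM defaults to OpenAI, so an API key is required); the sample data is made up:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy

# One evaluation sample: question, retrieved contexts, and generated answer.
samples = Dataset.from_dict({
    "question": ["How do I create a workflow template?"],
    "contexts": [["Workflow templates are created in the admin view ..."]],
    "answer": ["Open the admin view and choose 'New workflow template'."],
})

# An LLM judge scores each sample per metric; results are averaged.
result = evaluate(samples, metrics=[faithfulness, context_relevancy])
print(result)  # e.g. {'faithfulness': ..., 'context_relevancy': ...}
```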

Next steps

Our next goal is to deploy this improved RAG prototype within our company, collect feedback, and optimize the pipeline based on these insights. We believe that this continuous improvement cycle is crucial for improving the quality of the prototype.

There are still many open questions about improving the quality of the search results, applying advanced RAG techniques, and integrating the assistant into our PLM software. Stay tuned for further updates!

About CONTACT Research. CONTACT Research is a dynamic research group dedicated to collaborating with innovative minds from the fields of science and industry. Our primary mission is to develop cutting-edge solutions for the engineering and manufacturing challenges of the future. We undertake projects that encompass applied research, as well as technology and method innovation. An independent corporate unit within the CONTACT Software Group, we foster an environment where innovation thrives.


Leif Sabellek
CONTACT Research

Mathematician & Computer Scientist, researching Artificial Intelligence at CONTACT Software.