
E19 : RAFT - Retrieval Augmented Fine Tuning

Praveen Thenraj
Research Papers Summarized
6 min read · Mar 24, 2024


Paper Name : RAFT: Adapting Language Model to Domain Specific RAG

Paper URL : https://arxiv.org/abs/2403.10131

Authors : UC Berkeley - Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, Joseph E. Gonzalez

Please find the annotated paper here.

Problem Statement :

  • Using LLMs for domain-specific tasks may require fine-tuning them to align with the domain's style, or a RAG-like process to include the latest knowledge when responding to questions.
  • But fine-tuning alone means the model misses out on any new knowledge that becomes available after the fine-tuning of these models.
  • On the other hand, using only RAG does not exploit the learning ability of the model that can be tapped during the fine-tuning process.

Solution :

  • Fine-tuning an LLM on domain-specific training data, and then using the tuned model together with retrieved reference knowledge (in-context learning, i.e., RAG) to generate answers in CoT style, helps improve the performance of LLMs on domain-specific data.
  • Unlike conventional domain-specific training data, RAFT uses training data that also includes documents that do not contain the answer.

Approach :

  • Supervised fine-tuning of a selected LLM with the domain-specific data as the training data.
  • The training data used for supervised fine-tuning consists of a question (Q), a set of documents (D1, D2, …, Dn) as context, an answer (A), and a CoT-style reasoning explanation (A*) for the answer.
  • The documents containing the answer to the question are called ‘oracle documents’, and the documents not containing the answer are called ‘distractor documents’.
  • A data point might contain one or more oracle documents (D*) and the remaining distractor documents (Dn - D*).
  • The training data is constructed such that x% of it contains the oracle documents in the context, whereas the remaining (1 − x)% contains no oracle documents, only distractor documents, as context (see the sketch after this list):

    x% of data: Q + D* + D2 + ... + Dn → A*
    (1 − x)% of data: Q + D1 + D2 + ... + Dn → A*
  • This setup helps the model learn both to refer to the context when generating a response and to recognize when the answer is not present in the context.
  • The CoT-style answer (A*) is generated using an LLM that is given the question, the context (D1, …, Dn), and the answer (A). The generated A* contains the markers ##begin_quote## and ##end_quote##; the text between these markers consists of citations from the context (Dn) that lead to the answer (A).
  • These CoT-style answers give the model additional detail to learn the reasoning behind, and the path to, a particular answer.
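
The paper does not prescribe exact code for this data construction, but the recipe is easy to sketch. Below is a minimal, hypothetical Python illustration (function and field names are my own, not from the paper's codebase) of how a single RAFT training example could be assembled, including a CoT answer carrying the quote markers:

    import random

    def build_raft_example(question, oracle_docs, distractors,
                           cot_answer, p_oracle=0.8, num_distractors=4):
        # With probability p_oracle the context keeps the oracle document(s)
        # plus sampled distractors; otherwise it holds distractors only,
        # forcing the model to rely on knowledge learned during fine-tuning.
        if random.random() < p_oracle:
            context = oracle_docs + random.sample(distractors, num_distractors)
        else:
            context = random.sample(distractors, num_distractors + len(oracle_docs))
        random.shuffle(context)  # avoid the oracle always appearing first
        prompt = "\n\n".join(f"Document: {d}" for d in context)
        prompt += f"\n\nQuestion: {question}"
        return {"prompt": prompt, "completion": cot_answer}

    # Illustrative CoT-style answer (A*) with the citation markers described above:
    cot_answer = (
        "The context states ##begin_quote##RAFT trains with distractor "
        "documents##end_quote##, so the answer is: distractor documents."
    )

Prompt/completion pairs like this then feed a standard supervised fine-tuning loop.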

Experimental Setup :

  • LLM tested - LLaMA2-7B-chat model
  • Baselines used to compare against RAFT
    1. LLaMA2-7B-chat model
    2. LLaMA2-7B-chat model with RAG
    3. LLaMA2-7B (domain-specific fine-tuned) - DSF
    4. LLaMA2-7B (domain-specific fine-tuned with RAG) - DSF+RAG
  • Datasets used for fine-tuning and testing
    1. Natural Questions, TriviaQA, HotpotQA - open-domain Q&A mainly focused on general knowledge
    2. PubMedQA - biomedical research questions and answers
    3. Torch Hub, TensorFlow Hub, HuggingFace - API datasets
  • LLM for generating CoT-style answers for the training data - GPT-4-1106 (a sketch of such a call follows)
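
As a rough illustration of that last step, here is what a CoT-answer generation call could look like, assuming the OpenAI Python client; the prompt wording is my own, not the paper's exact prompt:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_cot_answer(question: str, context: str, answer: str) -> str:
        # Ask the model for a reasoning chain that cites the context verbatim
        # between the ##begin_quote## and ##end_quote## markers.
        prompt = (
            "Given the question, context, and answer below, write a step-by-step "
            "reasoning chain that quotes the supporting context verbatim between "
            "##begin_quote## and ##end_quote##.\n\n"
            f"Question: {question}\nContext: {context}\nAnswer: {answer}"
        )
        resp = client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content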

Observations :

  • Comparing the LLaMA2-7B-chat model with the LLaMA2-7B-chat model (with RAG), the results clearly show that including RAG helps answer domain-specific questions better than an instruction-tuned model alone. However, the results were not anywhere close to those of a much larger model like GPT-3.5 combined with RAG.
Figure: LLaMA2-7B vs LLaMA2-7B + RAG
  • A domain-specific fine-tuned (DSF) LLaMA2-7B model, trained on these datasets without any distractor documents during the training phase and used without RAG at test time, showed comparatively better performance than LLaMA2-7B with RAG.
Figure: LLaMA2-7B vs LLaMA2-7B + RAG vs DSF
  • DSF performed better than GPT-3.5+RAG on 3 out of 5 datasets, which is notable given the difference in size between these models. This observation signifies the value of fine-tuning a model on domain-specific data rather than relying on a RAG approach alone to identify and generate responses from the context.
  • However, combining DSF with RAG did not improve performance either, though it was expected to, since the model had been trained on domain data and at the same time had access to a knowledge base (RAG).
Figure: LLaMA2-7B vs LLaMA2-7B + RAG vs DSF vs DSF+RAG vs RAFT
  • In fact, plain DSF performed better than DSF+RAG. This shows that when a plainly fine-tuned model is combined with RAG, it lacks the ability to read the in-context content and extract useful information from it. Hence a special fine-tuning approach is required before combining these models with RAG.
  • RAFT performed better than all the other baselines, and also performed better than GPT-3.5+RAG on 4 out of 5 datasets.
Figure: LLaMA2-7B vs LLaMA2-7B + RAG vs DSF vs DSF+RAG vs RAFT
  • Compared to DSF, RAFT's accuracy increased by 13.6, 28.9, 12.94, 0.01, and 0.3 points on PubMed, HotpotQA, HuggingFace, Torch Hub, and TensorFlow Hub respectively.
  • The comparison between DSF+RAG and RAFT clearly shows that changing the model-tuning approach improves the performance of the models significantly.
  • An interesting observation was the performance of RAFT compared to a bigger and better model, GPT-3.5 with RAG: on the HuggingFace dataset the accuracy increased by 44.92 points.
  • Ablation studies were conducted to understand the importance of including the CoT-style answer in the training data. The results clearly demonstrate that adding explanations to the answer during the training phase improved the model's performance at test time, yielding more accurate responses.
  • Experiments were also conducted to determine what percentage of the training data should contain oracle documents (golden-truth documents that contain the answer).
  • Results indicate that even when only 20% of the training data contained oracle documents as part of the context, with the remaining 80% containing only distractor documents as context, the models were still able to gain significant improvements in performance.
Figure: Accuracy of models trained with only x% of the training data containing the oracle document in context
  • An experiment was also run to identify the optimal number of distractor documents to include during fine-tuning (e.g., D* + 1 distractor, D* + 2 distractors, or D* + 3 distractors). The results were better whenever the model was trained with 2-4 distractor documents, so 1 oracle document along with 4 distractor documents was used during the training phase (see the inference sketch below).
Figure: Number of distractor documents vs accuracy vs number of documents retrieved at test time
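
At test time, a RAFT-tuned model slots into an ordinary RAG pipeline: the retriever returns the top-k documents and the prompt mirrors the training format. A rough sketch, where retriever and raft_model are placeholders for whatever retrieval and inference stack is actually used:

    def answer_with_raft(question, retriever, raft_model, k=5):
        # Retrieve top-k documents; some may well be distractors, which is
        # exactly the condition the model was fine-tuned to handle.
        docs = retriever.search(question, top_k=k)
        prompt = "\n\n".join(f"Document: {d}" for d in docs)
        prompt += f"\n\nQuestion: {question}"
        # The tuned model responds in CoT style, citing evidence between
        # the ##begin_quote## and ##end_quote## markers.
        return raft_model.generate(prompt)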

Conclusion :

  • Fine-tuning a model only on domain-specific questions and answers (DSF) does not provide access to new knowledge (as RAG does). And adding RAG in that scenario (DSF+RAG) can impair the model's ability to extract relevant information from the context, and hence fails to improve performance.
  • Hence RAFT combines a novel way of domain-specific fine-tuning (including golden-truth documents in only part of the training data, and using only distractor documents as context for the rest) with RAG to improve performance on domain-specific tasks.
  • The approach clearly indicates that a proper fine-tuning strategy, together with the right domain data and RAG, can help even comparatively smaller LLMs perform better than bigger LLMs.
