Prompt context lengths keep growing, and Gemini has even published a paper showing off its performance with long prompts instead of RAG. Is RAG dead? This video and its slides may give you some hints.
Prompting, RAG and Fine-tuning
The video published by OpenAI talks about three ways to improve the performance of LLMs:
- Prompt engineering
- Retrieval-augmented generation
- Fine-tuning
First, you should always start with prompting, because:
- You can test and learn quickly compared with the other methods.
- When paired with an evaluation, it gives you a baseline and sets up further optimization, as in the sketch below.
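A minimal sketch of that starting point: a prompt-engineered call plus a tiny evaluation loop to establish a baseline number. The model name, system prompt, eval questions, and keyword-match scoring are illustrative assumptions, not anything prescribed in the video.

```python
# Minimal sketch: establish a prompting baseline and score it against a tiny eval set.
# Model name, prompt wording, and the eval questions are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

eval_set = [
    {"question": "What is retrieval-augmented generation?", "keyword": "retriev"},
    {"question": "Name one way to reduce LLM latency.", "keyword": "model"},
]

def ask(question: str) -> str:
    """One prompt-engineered call: clear instructions, no retrieval, no fine-tuning."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer concisely in one or two sentences."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Crude keyword-match "evaluation", just to get a baseline number to optimize against.
score = sum(1 for item in eval_set if item["keyword"] in ask(item["question"]).lower())
print(f"baseline: {score}/{len(eval_set)} answers contain the expected keyword")
```

Once this baseline exists, every later change (better prompts, RAG, fine-tuning) can be measured against the same eval set.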
However, prompting is not token-efficient when you need to inject new information and long instructions.
Fine-tuning is a good way to internalize instructions and minimize token usage; token space is especially precious when you use RAG and put lots of new information into the prompt.
RAG is like short-term memory, while fine-tuning is like long-term memory for a specific structure, style, or format that you need the model to replicate. During fine-tuning, do not dump all your data in at once and wait for the results. Start with 50 or 100 examples and observe the improvement; if fine-tuning is going to work, a dataset this small is enough to give a signal. A small, high-quality dataset is better than a large, average one.
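A minimal sketch of that small-dataset workflow, assuming the OpenAI fine-tuning API and its chat JSONL format; the file name, example content, and the exact fine-tunable model snapshot are assumptions you should check against the docs.

```python
# Minimal sketch: fine-tune on a small, high-quality dataset (~50-100 examples).
import json
from openai import OpenAI

client = OpenAI()

# 50-100 curated examples in the chat fine-tuning format are enough to see a signal.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Reply in our internal report format."},
            {"role": "user", "content": "Summarise: sales rose 5% in Q2."},
            {"role": "assistant", "content": "SUMMARY | Q2 | Sales +5%"},
        ]
    },
    # ... more hand-checked examples; quality over quantity
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed fine-tunable snapshot; check the docs
)
print(job.id)  # poll this job and compare eval scores before adding more data
```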
Another scenario where fine-tuning is effective is reducing cost and/or latency by replacing a more expensive model like `gpt-4o` with a fine-tuned `gpt-4o-mini` model. If you can achieve good results with `gpt-4o`, you can often reach similar quality with a fine-tuned `gpt-4o-mini` by fine-tuning on the `gpt-4o` completions, possibly with a shortened instruction prompt. (Knowledge distillation, our next topic :) )
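A minimal sketch of that teacher-student setup, assuming the same OpenAI fine-tuning flow as above; the prompt list, shortened system instruction, and model snapshot are illustrative assumptions.

```python
# Minimal sketch: distil gpt-4o completions into training data for gpt-4o-mini.
import json
from openai import OpenAI

client = OpenAI()

prompts = ["Explain RAG in one sentence.", "Explain fine-tuning in one sentence."]
short_instruction = "Answer in one sentence."  # shortened prompt for the student model

with open("distill.jsonl", "w") as f:
    for prompt in prompts:
        # Teacher: the expensive model produces the target completion.
        teacher = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        record = {
            "messages": [
                {"role": "system", "content": short_instruction},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": teacher.choices[0].message.content},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Student: fine-tune the cheaper model on the teacher's completions.
file_id = client.files.create(file=open("distill.jsonl", "rb"), purpose="fine-tune").id
job = client.fine_tuning.jobs.create(training_file=file_id, model="gpt-4o-mini-2024-07-18")
```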
RAG 2.0 by Contextual AI
RAG uses frozen off-the-shelf models for embedding, a vector database for retrieval, and a black-box language model for generation, stitched together through prompting or an orchestration framework.
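A minimal sketch of that frozen pipeline: an off-the-shelf embedding model, a vector store (an in-memory list here instead of a real database, for brevity), and a black-box LLM stitched together through the prompt. Model names and the toy documents are assumptions.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
documents = [
    "RAG 2.0 trains the retriever and generator end-to-end.",
    "Frozen RAG stitches off-the-shelf components together via prompting.",
]

def embed(texts):
    # Frozen, off-the-shelf embedding model
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(documents)  # "index" the corpus

def answer(question: str) -> str:
    q = embed([question])[0]
    # Cosine-similarity retrieval against the in-memory index
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = documents[int(np.argmax(sims))]
    # Black-box generation: the retrieved context is simply pasted into the prompt
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

print(answer("How does RAG 2.0 differ from frozen RAG?"))
```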
RAG 2.0 optimises the language model and retriever end-to-end as a single system. Contextual AI claims that RAG 2.0 performs better at long context lengths than naive RAG.
Contextual AI compares Contextual Language Models (CLMs) with frozen RAG systems across a variety of axes. To be honest, the improvement is not very significant, and it is worth setting it against the gains reported in OpenAI's prompt engineering video. Before adopting RAG 2.0, which is a big black box and is computationally expensive and time-consuming, we can first customise our RAG system with the different techniques mentioned in the OpenAI video and observe how the system improves along the way. It is like how prompting gives you a flavour of what is possible before fine-tuning.
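A minimal sketch of that incremental process: apply one RAG tweak at a time and measure it before reaching for heavier options like RAG 2.0. The variant list, eval set, and keyword scoring are assumptions; `answer` is the frozen-RAG function sketched above, and the "query rewrite" variant is only a toy stand-in for a real technique.

```python
eval_set = [
    {"question": "How does RAG 2.0 differ from frozen RAG?", "keyword": "end-to-end"},
]

def keyword_score(answer_fn) -> float:
    """Fraction of eval questions whose answer contains the expected keyword."""
    hits = sum(1 for ex in eval_set if ex["keyword"] in answer_fn(ex["question"]).lower())
    return hits / len(eval_set)

variants = {
    "baseline": answer,                                               # naive frozen RAG
    "query rewrite": lambda q: answer(f"Rephrase then answer: {q}"),  # toy stand-in
    # add "reranking", "chunking", "HyDE", ... one at a time, measuring each
}

for name, fn in variants.items():
    print(f"{name}: {keyword_score(fn):.2f}")
```

If a cheap tweak already closes most of the gap, the extra cost of an end-to-end system like RAG 2.0 may not be worth it.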