From Binary to Buddy: Introduction to RAG and Fine-Tuning

Muhammad Ihsan · Published in The Deep Hub

When we start implementing Artificial Intelligence in the form of Large Language Models (LLMs) in an application, we will encounter at least two challenges. The first is how to connect the LLM with ‘real data’: it is essential to “ground” the LLM, linking it to relevant real-world data, for it to function properly. The second is how to provide the right data: the LLM must be able to deliver accurate and relevant information according to business needs.

RAG (Retrieval-Augmented Generation) is a technique that combines retrieval capabilities with text-generation capabilities to provide more accurate and relevant answers. Meanwhile, fine-tuning is the process of adjusting an existing LLM with more relevant, domain-specific data to improve its performance and accuracy in providing answers. RAG and fine-tuning are currently two of the most widely used methods for implementing AI in the real world. Both help LLMs deliver more accurate and relevant results under various conditions. In this article, we will discuss these concepts in detail to help you understand how RAG and fine-tuning work and their benefits when applying AI in the real world.

How LLM Works

To understand the concepts of RAG and fine-tuning, we first need to understand how an LLM works. When we interact with an LLM through applications like ChatGPT, it might sometimes feel like talking to a human. We ask a question, the LLM gives an answer, and the conversation can continue. Such interaction might give the impression that the LLM is smart and capable of thinking like a human. However, sometimes the LLM’s answers can be inaccurate or wrong, even though the LLM seems very confident in its response. In the context of LLMs, this is commonly referred to as “hallucination.”

An LLM applied in a chat application is essentially an advanced auto-complete system. When we start a new chat, the LLM already has an initial message, called a “system message,” behind the scenes. This message serves as the starting point for the conversation. So, for example, when we ask a question like, “How do I make fried rice?” the LLM first reads the predefined system message, then reads our question, and then completes the conversation by providing an appropriate answer.

In reality, the LLM does not search for information about fried rice in a database before answering. Instead, the LLM uses a highly complex multi-dimensional map of human language to complete the document. Since hundreds or even thousands of fried rice recipes are in the training data, and many fried rice recipes are merely variations of the same basic ingredients, the LLM will most likely produce an accurate list of ingredients for a general fried rice recipe.

Therefore, it should be understood that an LLM cannot be considered “smart” in the true sense. The ability to answer questions behind the scenes is actually a complex mathematical calculation. So, even though the LLM seems to have human-like thinking abilities, it is merely the result of a very advanced auto-complete process.

The Importance of Context when Interacting with LLM

One important thing to know when starting to use an LLM is that providing context can significantly influence the outcome.

Whether we realize it or not, providing context in a conversation is crucial when interacting with an LLM. In everyday conversations, we always provide context without much thought. For example, when wanting to make fried rice, the question asked is not just, “What ingredients are needed to make fried rice?” We are usually more specific, like, “What ingredients are needed to make Indonesian-style fried rice?”

So, for example, when we ask ChatGPT, “What ingredients are needed to make Indonesian-style fried rice?” the term “Indonesian-style” becomes the important context. The LLM will then read all previous information, from the initial message to our request and the new context we added, so it can provide a more specific and relevant answer.

The process of crafting the right context in the hope of getting relevant answers from an LLM is known as “prompt engineering.” The more context we provide, the greater the likelihood that the LLM can deliver the desired response. This is also why, when we ask an LLM to summarize a document or video transcript, the result is usually good. By providing context, the LLM can generate responses that make us feel like it has truly read and summarized the content as requested.
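To make this concrete, here is a minimal sketch of how a system message, a user question, and extra context are often packaged into a chat-style prompt. The message layout (system and user roles) mirrors the common chat format; the build_prompt helper and its wording are illustrative assumptions, not part of any specific API.

```python
# Minimal, illustrative sketch of prompt engineering with a chat-style prompt.
# build_prompt() is a hypothetical helper, not a real library call.

def build_prompt(system_message: str, context: str, question: str) -> list[dict]:
    """Assemble a chat-style prompt: system message first, then the
    user's question enriched with additional context."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{question}\n\nAdditional context: {context}"},
    ]

messages = build_prompt(
    system_message="You are a helpful cooking assistant.",
    context="The user wants an Indonesian-style recipe.",
    question="What ingredients are needed to make fried rice?",
)

for message in messages:
    print(f"[{message['role']}] {message['content']}")
```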

Understanding RAG: Retrieval Augmented Generation

As mentioned earlier, when using LLM products like ChatGPT, we often think of them as a combination of a search engine and a highly intelligent virtual assistant. However, LLMs are not actually information retrieval systems. LLMs are representations of a complex system called a “language transformer.” LLMs don’t truly “know” anything. They are sophisticated enough to complete most sentences correctly, as long as the information pattern is available in their training data.

The problem arises when the information provided by the LLM is actually incorrect, but the given answer seems highly confident. In such cases, the LLM appears to be hallucinating or even lying.

To make LLM answers more accurate, we can use the RAG approach. This approach improves the accuracy of LLM answers by combining information retrieval (retrieval) with the LLM’s ability to complete and generate text (generation). So instead of answering directly, the LLM first searches for information from a database, adds it to the context, and then provides the answer. An early example of this concept is the Bing AI system. When we ask a question, the system first searches for relevant information on the web, then uses that information as a reference to provide an answer. Of course, the application of this approach is not limited to search engines but also applies to other LLM systems.

RAG (Retrieval Augmented Generation) Workflow

So how can large language models (LLMs) retrieve data from existing information sources? Let’s look at the RAG workflow in simple terms (hopefully 😅).

1. Incoming Request:

When a request or question comes in, the LLM first understands the meaning of the request. For example, when we ask, “What ingredients are needed to make a layered cake?” the LLM converts this question into a query that the relevant data source can understand.

2. Accessing Data Sources:

This query is then sent to relevant data sources, which can be standard databases accessed via APIs, embeddings, vector databases, or even recipe books in PDF format. We call these data sources the “ground truth.”

3. Retrieving Data:

The related data source processes this request and returns relevant data to the LLM.

4. Generating an Answer:

The LLM then combines the original request with the newly retrieved data and generates an answer. For example, if the data retrieved by the LLM is a recipe from an Indonesian recipe book, the LLM will use this information to provide an accurate list of ingredients.

5. Additional Verification:

Since LLMs are probabilistic (they generate answers based on probabilities) and non-deterministic (the result can differ each time), the answers might not always be accurate. To reduce this risk, a verification step is often added.

The generated answer, along with the retrieved data, is returned to the LLM to ensure that the answer only contains information from the retrieved data. This step can be repeated several times to improve the accuracy of the answer.
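The sketch below strings these five steps into one minimal, self-contained loop. Everything here is a toy stand-in: retrieval is faked with keyword matching over an in-memory ‘recipe book,’ and the generation and verification steps are plain placeholder functions where a real system would call an LLM and a proper data source.

```python
# Minimal RAG workflow sketch: retrieve -> augment -> generate -> verify.
# All components are toy stand-ins; a real system would use an LLM API and a
# real data source (database, vector store, documents, ...).

RECIPE_BOOK = {
    "fried rice": "Fried rice: rice, soy sauce, garlic, shallots, egg, chili.",
    "layered cake": "Layered cake: flour, eggs, butter, sugar, condensed milk.",
}

def retrieve(query: str) -> str:
    """Steps 2-3: look up the most relevant document (here: naive keyword match)."""
    for name, recipe in RECIPE_BOOK.items():
        if name in query.lower():
            return recipe
    return ""

def generate(query: str, context: str) -> str:
    """Step 4: placeholder for the LLM call that combines the query and retrieved data."""
    return f"Based on the retrieved recipe ({context}), here is the answer to: {query}"

def verify(answer: str, context: str) -> bool:
    """Step 5: placeholder check that the answer only uses the retrieved data."""
    return bool(context) and context in answer

query = "What ingredients are needed to make layered cake?"
context = retrieve(query)           # steps 1-3: turn the request into a lookup
answer = generate(query, context)   # step 4: generate a grounded answer
print(answer if verify(answer, context) else "Could not ground the answer.")
```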

Embeddings: Helping AI Understand Data


Returning to the issue of fried rice, let’s say we create a ChatBot and use an Indonesian recipe book as the data source. Then we ask, “What is the recipe for making fried rice?” Rice is a staple food in Indonesia and is naturally a common ingredient in many recipes. So how does the system know that what we want is a fried rice recipe, not chicken rice, soup rice, grilled rice, or kebuli rice?

To find the relevant information, the AI uses a technique called embeddings. Here’s how embeddings work:

1. Numerical Representation:

Embeddings transform text into numerical representations in the form of points and vectors in a multi-dimensional space.

2. Embeddings Process:

Each recipe in the recipe book is fed into the embedding model, which then converts the words in the recipe into a map of points in the multi-dimensional space. Here, similar words, like “rice” and “sticky rice,” are placed closer together, while dissimilar words, like “spinach” and “carrot,” are placed further apart.

3. Creating Vectors:

Each point is represented as a vector, and together these vectors capture the relationships between the words in the text. The mathematical representation of these points and vectors becomes the embedding for each recipe.

4. Matching Requests with Embeddings:

When you ask a question like, “How do I make chicken fried rice?” the AI system will create embeddings from this question using the same embedding model. The AI system then compares the question embeddings with the embeddings of each recipe to find the best match.

5. Generating an Answer:

By comparing embeddings, the AI system can find the recipe that best matches our request. For example, if your question matches the embeddings for the Indonesian fried rice recipe, the AI system will provide the text of the Indonesian fried rice recipe from the database.
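Here is a small, runnable sketch of that matching step. Real systems use a neural embedding model that produces dense vectors; to keep the example self-contained, the embed function below is a deliberately crude word-count stand-in, but the core idea is the same: embed the question and every recipe with the same model, then compare them with cosine similarity.

```python
# Toy embedding sketch: embed the question and every recipe with the same
# (here: fake) embedding model, then pick the recipe with the highest
# cosine similarity. Real systems would use a neural embedding model.
import math

VOCAB = ["rice", "fried", "chicken", "soup", "grilled", "cake", "layered"]

def embed(text: str) -> list[float]:
    """Hypothetical embedding: count how often each vocabulary word appears."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(word)) for word in VOCAB]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

recipes = {
    "Indonesian fried rice": "fried rice with soy sauce garlic and chili",
    "Chicken soup rice": "chicken soup served with rice",
    "Layered cake": "layered cake with flour eggs and butter",
}
recipe_embeddings = {name: embed(text) for name, text in recipes.items()}

question = "How do I make chicken fried rice?"
question_embedding = embed(question)

# The recipe whose embedding is closest to the question embedding wins.
best_match = max(
    recipe_embeddings,
    key=lambda name: cosine_similarity(question_embedding, recipe_embeddings[name]),
)
print("Best match:", best_match)  # -> Indonesian fried rice
```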

Knowledge Graphs

Imagine we are cooking and want to know what ingredients are needed. For example, we ask, “What sauce do I need to make fried rice?” If the AI system only relies on embeddings, it might return two answers: ‘soy sauce’ and ‘barbecue sauce.’ Clearly, ‘soy sauce’ is the correct answer, not ‘barbecue sauce.’ Why can the AI be wrong?

  1. The Same Word, Different Meanings: In our language, many words have more than one meaning depending on the context. For example, the ‘paste’ in ‘tomato paste’ and the ‘paste’ in ‘toothpaste’ certainly mean different things.
  2. Limitations of Embeddings: While embeddings convert words into numbers (vectors) that allow computers to understand and compare words, embeddings cannot always understand context well.

To overcome this problem, we can use ‘knowledge graphs.’ These graphs consider not only the meaning but also the relationships between words. For example, in a knowledge graph, ‘tomato paste’ would be connected with ‘food,’ and ‘toothpaste’ with ‘hygiene.’ So when we ask, “What paste is good for making spaghetti?” the knowledge graph will understand that we are talking about food and only provide ‘tomato paste’ as the relevant answer.
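A knowledge graph can be sketched as a set of entities connected by labeled relations. The tiny, purely illustrative graph below records which category each item belongs to and which dish it is used in, so a question about spaghetti only surfaces ‘tomato paste’ and a question about fried rice only surfaces ‘soy sauce.’

```python
# Tiny illustrative knowledge graph: (entity, relation) pairs mapped to values.
# A real knowledge graph would be far larger and typically live in a graph database.

knowledge_graph = {
    ("tomato paste", "is_a"): "food",
    ("toothpaste", "is_a"): "hygiene product",
    ("soy sauce", "is_a"): "food",
    ("barbecue sauce", "is_a"): "food",
    ("tomato paste", "used_in"): "spaghetti",
    ("soy sauce", "used_in"): "fried rice",
    ("barbecue sauce", "used_in"): "grilled meat",
}

def find_ingredients(dish: str, category: str) -> list[str]:
    """Return entities of the requested category that are connected to the dish."""
    return [
        entity
        for (entity, relation), value in knowledge_graph.items()
        if relation == "used_in"
        and value == dish
        and knowledge_graph.get((entity, "is_a")) == category
    ]

print(find_ingredients("spaghetti", "food"))   # ['tomato paste']
print(find_ingredients("fried rice", "food"))  # ['soy sauce']
```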

However, not all conditions require knowledge graphs. AI systems usually combine various data sources such as traditional databases, APIs, embeddings, and knowledge graphs, depending on the needs and types of data. Generally, it is recommended to start with the simplest method, such as using a database with an API, and then try more advanced options as the problem becomes more complex.

Fine-Tuning

RAG is indeed an effective method for improving the quality of LLM answers, but it is not the only one; another option is fine-tuning. This method is a way to improve the performance of an AI model so that it provides more specific answers suited to our needs. It works like this: first, we provide the AI model with many complete dialogue examples. Each example contains a system message, a user request, and the expected response. The model is then trained on these examples so it learns how to respond to similar requests in the future.
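To give a feel for what those complete dialogue examples can look like, here is a sketch of a small training file in the chat-style JSONL layout used by many fine-tuning services (OpenAI’s chat fine-tuning format, for example). The exact fields accepted vary by provider, so treat this as an illustration rather than a guaranteed schema.

```python
# Sketch of fine-tuning training data: each example is a complete dialogue with
# a system message, a user request, and the expected assistant response.
# The JSONL layout follows a common chat fine-tuning format; field names may
# vary per provider.
import json

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are an Indonesian cooking assistant."},
            {"role": "user", "content": "What ingredients do I need for fried rice?"},
            {"role": "assistant", "content": "Rice, soy sauce, garlic, shallots, egg, and chili."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are an Indonesian cooking assistant."},
            {"role": "user", "content": "How do I make layered cake?"},
            {"role": "assistant", "content": "Mix flour, eggs, butter, and sugar, then steam it layer by layer."},
        ]
    },
]

# Write one JSON object per line (JSONL), the usual input format for fine-tuning jobs.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```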

Fine-tuning has at least three main benefits. First, it provides more consistent answers because the model already knows how to respond. Second, the model can be customized to answer in a specific language style. Third, the model can be trained to perform specific actions, such as providing references to documents or directing users to a database.

Although promising, fine-tuning also has several challenges, including:

  • A lot of training data examples are needed to succeed.
  • The process of training the model needs to be done repeatedly for the model to follow the desired pattern.
  • Even after training, the model may not give the desired result, requiring a lot of trial and adjustment.

Fine-tuning is very useful in situations where consistent answers are needed. With fine-tuning, we can have direct control over the AI model’s behavior, something that would be difficult to achieve if only using the base model.

RAG with Fine-Tuning = RAFT

RAFT stands for Retrieval Augmented Fine-Tuning. This method combines two techniques to create a more efficient and accurate AI system in handling specific data, such as internal company documents. This is possible because RAG (Retrieval-Augmented Generation) allows AI to find relevant information from a wide range of data sources. Then, with Fine-Tuning, AI is also trained to provide answers that match specific examples. Here is an example workflow of RAFT:

1. Step One: Using RAG

We first set up RAG with the specific data, for example internal company documents, so the system can retrieve information from that data.

2. Step Two: Create Specific Questions

Create a list of questions that reflect the types of questions that users of the system might ask.

3. Step Three: Retrieve Relevant Documents

Use RAG to retrieve two types of documents:

  • A: Documents that contain the correct answers.
  • B: Documents that contain irrelevant information.

4. Step Four: Combine Questions and Documents

Combine the questions with Document A to create detailed answers, including step-by-step explanations.

5. Step Five: Fine-Tuning Process

Conduct fine-tuning using three types of data examples:

  1. Question + Document A + Detailed answer.
  2. Question + Document B + Detailed answer.
  3. Question + Mixed data + Detailed answer.

After training, the model learns to use relevant data and ignore irrelevant data, judging for itself which documents are relevant for each question.
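Putting steps three to five together, the sketch below shows one way such training examples could be assembled: each question is paired with the relevant document, an irrelevant one, or a mix of both, always with the same detailed answer. The data and the prompt/completion layout are illustrative assumptions, not a real RAFT pipeline.

```python
# Illustrative construction of RAFT-style fine-tuning examples (steps 3-5):
# combine a question with relevant (A), irrelevant (B), or mixed documents,
# always paired with a detailed answer grounded in the relevant document.

question = "What ingredients are needed for fried rice?"
document_a = "Fried rice recipe: rice, soy sauce, garlic, shallots, egg, chili."  # contains the answer
document_b = "Company travel policy: flights must be booked two weeks ahead."     # irrelevant
detailed_answer = (
    "To make fried rice you need rice, soy sauce, garlic, shallots, egg, and chili. "
    "First fry the garlic and shallots, then add the rice, egg, and seasonings."
)

def make_example(question: str, documents: list[str], answer: str) -> dict:
    """One training record: the question, the supplied documents, and the expected answer."""
    context = "\n".join(documents)
    return {"prompt": f"Documents:\n{context}\n\nQuestion: {question}", "completion": answer}

raft_examples = [
    make_example(question, [document_a], detailed_answer),              # question + Document A
    make_example(question, [document_b], detailed_answer),              # question + Document B
    make_example(question, [document_a, document_b], detailed_answer),  # question + mixed data
]

print(len(raft_examples), "training examples created")
```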

Conclusion

In the application of Artificial Intelligence, especially with Large Language Models (LLMs), there are various challenges that need to be addressed to provide accurate and relevant answers. Techniques such as RAG (Retrieval-Augmented Generation) and Fine-Tuning play a crucial role in overcoming these challenges. By using these techniques, we can create an AI system that is not only sophisticated but also more reliable in providing relevant and accurate answers according to needs. The implementation of RAG, fine-tuning, and RAFT in AI systems helps overcome the limitations of the base model and ensures that AI can make significant contributions in various real-world applications. Thank you for reading this article, happy learning!
