A Data-Centric Approach to Fine-Tuning LLMs

Hrishikesh Kamath
Jan 26, 2024


A lot has been said online about LLMs and how to fine-tune them. However, not much attention has been given to curating fine-tuning datasets, a topic that is both underrated and crucial. As the old adage in machine learning goes, "Garbage in, garbage out": the quality of data is paramount to any machine learning or deep learning project. I have noticed people spending a significant amount of time experimenting with larger and larger models without critically examining their data. Even before large language models (LLMs) gained popularity, a wide variety of tools were being developed around the data-centric ecosystem, and AI veterans like Andrew Ng emphasized the importance of data-centric machine learning long before the LLM revolution. In this article, I aim to give you an overview of how to approach fine-tuning your LLM in a data-centric manner. While this article is by no means exhaustive, it serves as a starting point. It assumes you are familiar with the basics of deep learning.

Data-centric fine-tuning of LLMs is rooted in using a sufficiently complex model that is capable of learning the required task. Instead of expending excessive effort in pursuit of the best possible model, you dedicate the majority of your time to acquiring the right data. Assuming your model is complex enough for the task, you can accomplish a great deal with ample high-quality data, minimizing the time spent on model selection. The diagram below illustrates the overall steps involved in data-centric fine-tuning.

I will illustrate the process of gathering a dataset for a given task, using the example of fine-tuning an LLM for question answering (QA) over food reviews.

Define your task clearly

The first step is to clearly define your task and determine the purpose of the model. Considerations such as speed, accuracy, and memory optimization should be outlined. For example, let's consider a food Question Answering (QA) system within a food reviews platform like DoorDash. The primary purpose is to enhance user experience. Should the system prioritize speed? Yes. Is memory optimization a critical factor, considering it runs in the cloud? Not necessarily. Does it need to be accurate? Yes, but primarily for a limited set of queries. Unlike systems such as ChatGPT or Grok, which must accommodate a vast range of queries, the focus here is on precision within a specific set of questions.

There are different QA variants based on the inputs and outputs:

  • Extractive QA: The model extracts the answer from a context. The context here could be a provided text, a table, or even HTML! This is usually solved with BERT-like models.
  • Open Generative QA: The model generates free text directly based on the context. You can learn more about the Text Generation task on its page.
  • Closed Generative QA: In this case, no context is provided. The answer is completely generated by a model.

Source of text: Hugging Face

Some examples of extractive and open generative QA

Extractive QA

Context:
The capital of France is Paris. The Eiffel Tower is a famous landmark located in the city.

Question:
What is the capital of France?

Extractive Answer:
The capital of France is Paris.

Open Generative QA

Context:
A cat is curled up on a sunlit windowsill, peacefully dozing off.

Question:
Describe the cat on the windowsill.

Generative Answer:
A cat, peacefully dozing, rests on a sunlit windowsill.
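
As a concrete illustration of the extractive variant, here is a minimal sketch using the Hugging Face question-answering pipeline with a SQuAD-tuned BERT-like checkpoint (the specific model name is just a commonly used default, not a recommendation):

```python
from transformers import pipeline

# Extractive QA: a BERT-like model selects a span from the provided context.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is the capital of France?",
    context="The capital of France is Paris. The Eiffel Tower is a famous landmark located in the city.",
)
print(result["answer"])  # expected to print a span such as "Paris"
```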

For our specific problem, Open Generative QA proves to be the most suitable approach. In this method, given a context and a question, we rephrase the context in a user-friendly manner. This goes beyond simple extractive QA, as our goal is to present the context in a way that resonates with the customer. Closed Generative QA is not used here because we want the system to draw on dynamic data that is updated daily; retrieving context at query time eliminates the need to retrain the model every day on the latest reviews.

Understand your Data Universe

In this step, define all the types of data required. Specify the kinds of questions you aim to answer and outline how the context should be structured. In a generative QA task, the model takes a context and a question as input and produces an answer. Therefore, your training data should consist of tuples containing the question, the context, and the corresponding answer. It's essential that your training data closely resembles the types of questions you intend to serve the user.
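
Concretely, each training example can be represented as one record. The field names and the restaurant below are my own invented illustration, not a required schema:

```python
# One training example per record: (question, context, answer).
# "Green Leaf" and the review text are made up for illustration.
example = {
    "question": "Are there vegan options available at Green Leaf?",
    "context": "Review: Loved the tofu bowl; the menu has a whole vegan section.",
    "answer": "Yes, Green Leaf offers vegan options, including a dedicated vegan section with a tofu bowl.",
}
```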

Define all possible questions and use cases. For example, let's say the scope of questions you want to serve is:

  • Does [specific restaurant] offer [particular food] on its menu?
  • Is [specific restaurant] known for serving [specific cuisine]?
  • Can I bring my dog to [specific restaurant]?
  • Does [specific restaurant] serve alcoholic beverages?
  • Are there vegan options available at [specific restaurant]?
  • What are the vegetarian options at [specific restaurant]?

Now, don’t just gather data for the intended behaviour, but also gather data for how you want your model to handle unintended behaviour.

For instance, when asked, 'Does the restaurant Good Japan have a car park?', the indexing algorithm might mistakenly retrieve a comment about good Japanese food. In that case the LLM should learn to respond by acknowledging that it lacks the required information. (A later section covers indexing, the mechanism responsible for retrieving the appropriate context for a given question.) Whenever the context does not contain an answer to the question, it is crucial for the LLM to say that it does not know, rather than fabricating or providing inaccurate details. Improving the design of your indexer is one solution, but it's equally important to prepare your LLM for such negative surprises. Negative examples are very informative: they give the model clearer signals for distinguishing a correct answer from a wrong one. From my anecdotal experience, including them noticeably improved performance when I was curating datasets.
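
The 'Good Japan' car-park case above could be encoded as a deliberate negative example, for instance (the refusal wording is up to you):

```python
# A negative example: the retrieved context does not answer the question,
# so the target answer is an explicit "I don't know" style response.
negative_example = {
    "question": "Does the restaurant 'Good Japan' have a car park?",
    "context": "Review: The sushi at Good Japan is excellent, easily the best Japanese food in town.",
    "answer": "The available reviews do not mention whether Good Japan has a car park.",
}
```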

Also add several variations of a single question, such as versions with and without punctuation.
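
One simple way to generate such surface-level variants programmatically (a basic heuristic, not a full augmentation strategy):

```python
def question_variants(question: str) -> list[str]:
    """Generate simple surface-level variants of a question."""
    no_punct = question.rstrip("?").strip()
    return [
        question,          # original, with punctuation
        no_punct,          # without the trailing question mark
        question.lower(),  # lowercased
        no_punct.lower(),  # lowercased, no punctuation
    ]

print(question_variants("Does Good Japan serve alcoholic beverages?"))
```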

Be sure to define your data universe as exhaustively as possible. Remember, your model can only handle the kinds of questions it has seen during training.

Curate a dataset

Having established our dataset universe, proceed to collect data in the specified format. Follow standard machine learning practices such as creating train/test/validation splits and applying any necessary pre-processing steps.
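
For instance, a standard split with scikit-learn (the 80/10/10 ratio and the loader below are placeholders):

```python
from sklearn.model_selection import train_test_split

# `load_examples` is a hypothetical loader returning the list of
# {"question", "context", "answer"} dicts gathered in the previous step.
examples = load_examples()

train, temp = train_test_split(examples, test_size=0.2, random_state=42)  # 80% train
val, test = train_test_split(temp, test_size=0.5, random_state=42)        # 10% val, 10% test
print(len(train), len(val), len(test))
```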

Find a relevant pretrained model

The advantage of using a pretrained model trained on large datasets is its ability to rapidly learn patterns without requiring large amounts of task-specific data. The model already understands many synonyms and relations, since it has encountered such connections in the tasks it was previously trained on. Analyzing the dataset on which your pretrained model was trained will give you insights into its existing knowledge base.

Select a pretrained model that closely matches your task.

For Open Generative QA, we have two steps in the pipeline:

  1. Indexing: A model that picks the right context for a given question. It's not feasible to use your entire corpus as context because of the limit on the input length your model can handle, and it's also more efficient to retrieve context via an indexing mechanism.
  2. QA model: The model that uses the retrieved context to answer a particular question. This is the model we are fine-tuning.

For indexing, a straightforward approach is to use an embedding model to compute the similarity between the question and the embeddings of all available contexts, using a framework like Sentence Transformers. I won't delve into the details of indexing, as my primary focus is on developing the QA model.
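
A minimal sketch of this kind of indexer with the sentence-transformers library (the checkpoint name and the review snippets are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Each review snippet is a candidate context.
contexts = [
    "Review: The sushi at Good Japan is excellent.",
    "Review: Green Leaf has a whole vegan section on its menu.",
]
context_embeddings = embedder.encode(contexts, convert_to_tensor=True)

question = "Are there vegan options available at Green Leaf?"
question_embedding = embedder.encode(question, convert_to_tensor=True)

# Pick the context with the highest cosine similarity to the question.
scores = util.cos_sim(question_embedding, context_embeddings)[0]
best_context = contexts[int(scores.argmax())]
print(best_context)
```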

We can use a model as simple as GPT-2 for this task. Let's first try the pretrained model as-is; the default checkpoint does not work out of the box for our task.
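
You can verify this yourself by prompting the off-the-shelf checkpoint (the "Context/Question/Answer" prompt format below is my own convention):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Context: Review: Green Leaf has a whole vegan section on its menu.\n"
    "Question: Are there vegan options available at Green Leaf?\n"
    "Answer:"
)
output = generator(prompt, max_new_tokens=40, do_sample=False)
print(output[0]["generated_text"])
# The base GPT-2 model typically rambles on instead of answering the question,
# which is why fine-tuning on our curated dataset is needed.
```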

Reviewing the model description on the Hugging Face Model Hub also makes it apparent that the model was not explicitly trained for the task we require. Hence, fine-tuning the model is necessary.

Choose your metrics

Given that your model produces responses, how would you quantitatively define how good a response is? Quantitative evaluation is important for any system, since it is a systematic, scalable way of assessing models. It also allows us to choose loss functions that are better aligned with our intended objective, so the LLM learns what we actually want it to learn. I won't dive deep into this topic, but it is a very important step: the whole process of developing LLMs in a data-centric manner depends on a good evaluation metric, because collecting the right data requires that what counts as correct or incorrect is being measured properly, so that your algorithm develops a real sense of what is right and wrong. A common metric for QA systems is the F1-score. For some tasks, responses may require human subjectivity to evaluate, such as judging test cases written by a code generator, gauging whether jokes generated by an LLM are funny, or checking that a response is not racist or harmful. For such cases, you can look into RLHF.
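
For reference, here is a minimal token-overlap F1 implementation in the spirit of the SQuAD metric (it skips the usual normalization of articles and punctuation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))  # 1.0
```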

Fine-tune your model

With the curated dataset and chosen metrics, fine-tune your LLM using the standard training process, just like any other deep learning model.
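
As a rough sketch (the model choice, prompt format, output path, and hyperparameters here are illustrative assumptions, not recommendations), fine-tuning GPT-2 as a causal language model on the serialized (context, question, answer) strings could look like this:

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def to_text(example):
    # Serialize each (context, question, answer) tuple into one training string.
    return {
        "text": f"Context: {example['context']}\n"
                f"Question: {example['question']}\n"
                f"Answer: {example['answer']}{tokenizer.eos_token}"
    }

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# `train` is the list of {"question", "context", "answer"} dicts from the earlier split.
train_dataset = Dataset.from_list(train).map(to_text).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qa-gpt2",  # illustrative output path
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("qa-gpt2")
tokenizer.save_pretrained("qa-gpt2")
```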

Evaluate and Gather more data

Now evaluate your model using the quantitative metrics established above. Look for examples where your model did well and where it did not. Are there patterns you notice? Did the model do better on certain kinds of questions and worse on others? Gather more data relevant to the kinds of questions your model did not handle well. Although you tried to define your data universe as exhaustively as possible, there are always cases you are likely to miss. There may also be instances that were covered, yet the algorithm failed to grasp the correct patterns.

If necessary, consider building a tool for dataset curation and inference. If you lack expertise in web development, you can utilize tools like ChatGPT or other Generative AI tools to create a straightforward interface. I have personally created a basic tool, as demonstrated below.

It’s a straightforward annotation webpage where you input a question and context to obtain an answer. For more complex projects, there are good open-source and commercial solutions available. If the answer generated by the Language Model (LLM) on the webpage is unsatisfactory, you have the option to update it with the appropriate answer and include it in your dataset. These newly labeled instances will be utilized for fine-tuning the model in the next iteration.
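
For illustration, here is a minimal sketch of such an annotation page using Gradio (the model path, file name, and prompt format are assumptions carried over from the earlier sketches, not my actual tool):

```python
import json
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="qa-gpt2")  # the fine-tuned checkpoint

def answer(question, context):
    # Generate an answer for the given question and context.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    output = generator(prompt, max_new_tokens=60, do_sample=False)[0]["generated_text"]
    return output[len(prompt):].strip()

def save(question, context, corrected_answer):
    # Append the corrected example to a JSONL file for the next fine-tuning round.
    with open("new_examples.jsonl", "a") as f:
        f.write(json.dumps({"question": question, "context": context,
                            "answer": corrected_answer}) + "\n")
    return "Saved."

with gr.Blocks() as demo:
    question = gr.Textbox(label="Question")
    context = gr.Textbox(label="Context", lines=4)
    model_answer = gr.Textbox(label="Model answer")
    corrected = gr.Textbox(label="Corrected answer")
    status = gr.Textbox(label="Status")
    gr.Button("Generate").click(answer, [question, context], model_answer)
    gr.Button("Save to dataset").click(save, [question, context, corrected], status)

demo.launch()
```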

Repeat this process several more times until you achieve satisfactory performance. In this tool, I haven't included indexing: in the application I was developing, the LLM and indexer were used together to choose the context, so I also wanted to make sure the LLM was capable of functioning well on its own, given a context directly.

Here are some articles that can give you deeper insights into performing evaluation: Andrej Karpathy's article and my article.
