6 Tips to Becoming a Master LLM Fine-tuning Chef

Darren Oberst
9 min read · Dec 18, 2023


LLM Fine-tuning Best Practices

How many of you have had the following interaction with a client, partner, developer or friend over the last six months?

Conversation #1 — “(exuberant) we are going to fine-tune our own LLM models from open source — we have just picked up {{Llama | Falcon | Mistral | RedPajamas}} and it looks pretty straightforward.”

Conversation #2 — “(discouraged) yeah, we tried that, and it didn’t work out. Training models is hard.”

We are big proponents of open source models — and believe that model fine-tuning is truly the “secret sauce” to deploy enterprise-grade open source models. Having said that, taking base foundation models and fine-tuning them for a particular domain or task is not a trivial process and does require a lot of expertise and skill to achieve the target model behavior.

Most “hello world” model training tutorials show how superficially easy it is to build a simple training loop and run a few forward and backward passes through a model. Oftentimes, these tutorials are platform-specific, designed to show off the “ML Ops” lifecycle power of a particular tool. Usually the code samples are remarkably simple and straightforward.

Surprisingly, there are very few tutorials with practical guidance on LLM fine-tuning best practices. It is not hard to put cookies in the oven and burn them — what is hard is to make the perfect batch of cookies!

We would like to fill that gap in tutorials with a few best practices that we have learned the hard way as we fine-tune LLMs. Unfortunately, there is no single shortcut or perfect universal approach — generally, producing a good fine-tuned model takes a lot of hard work, multiple training runs, iterations on the datasets, hyper-parameter adjustments, and sometimes a little bit of luck to find a winning recipe. It can be a humbling, and at times frustrating, activity to work through iterations, debug issues, and keep moving a model towards the target behavior.

Here are the areas of attention that, in our experience, are most critical:

#1 — Start with a Clear, Targeted Training Objective

It is easy to skip over this step as boilerplate, but it is actually the first and most important step toward a successful fine-tuning. Models are not “mind readers” and training is not “magic.” Before kicking off a fine-tuning initiative, it is important to define the goals — what exactly is the desired behavior that we are looking to see in the model? Are the objectives realistic, and do they map to a specific fine-tuning dataset?

What makes a good training objective?

Perhaps to state the obvious, LLMs are great language pattern learners. This is the single most important insight for good model fine-tuning. (Whether this represents real ‘intelligence’ or just the appearance of ‘intelligence’ is a topic for another day.)

Think of fine-tuning as teaching the model a specialized transformation function, with an input and a target output based on that input. This is the core insight of few-shot (“multiple shot”) learning: if you give a model a few examples of a ‘transformation pattern’, it is quite effective at emulating it on new examples. Fine-tuning is generally the process of providing one or more new specialized transformation patterns, usually with hundreds to thousands of examples of each specific pattern, which enable the model to fully adapt to that pattern and replicate it.
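To make the idea of a “transformation pattern” concrete, here is a purely illustrative sample written as Python; the field names and content are hypothetical, not a prescribed format, but the shape of an input plus a target output derived from that input is what matters:

```python
# One hypothetical fine-tuning sample (field names are illustrative, not a required schema).
sample = {
    "instruction": "What is the total principal amount of the loan?",
    "context": (
        "... The Lender agrees to provide a term loan in the aggregate principal "
        "amount of $4,500,000, maturing on December 31, 2028 ..."
    ),
    "output": "The total principal amount of the loan is $4,500,000.",
}
# Hundreds to thousands of samples following the same input -> output pattern
# teach the model to reproduce that transformation on new, unseen documents.
```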

The most common pitfall that we see is viewing fine-tuning as a way to impart knowledge to a model, rather than as a way to teach a specific transformation pattern or combination of language skills.

As an example, our work is focused on retrieval augmented generation workflows in financial services and legal use cases, and our training objectives are usually focused on:

· Specific industry domain — e.g., financial, regulatory, legal;

· Specific source material document type — e.g., complex business, financial, regulatory, contracts; and

· Specific tasks/skills — e.g., critical reading comprehension, sophisticated extraction, fact-based analysis and summarization.

Within this combination of domain, document type, and task definition, for a particular project we may target a very specific training objective, such as multiple-choice classification for contract terms, sorting a list from largest to smallest, or providing answers in a targeted number of words or bullets.

These examples are illustrative, but our main point is: be specific and clear about what you are trying to achieve, and the more focused, the better.

At this point, it is also worth asking the threshold question: do you need to fine-tune the model, or is the behavior “out of the box” sufficiently aligned to the intended use case? If you are struggling to define the specific training objective and strategy, it may be a sign that fine-tuning is not really necessary — better to skip it and focus on other facets of the LLM RAG workflow.

#2 — The Fine-tuning Dataset is the Value Creation

While it may go without saying, it is an important reminder: model fine-tuning is all about the dataset.

This is where the value will be created.

This is the heavy lifting.

This is the hard part.

And there is no substitute for rolling up your sleeves and getting your hands dirty with building, curating, cleaning, and reviewing the fine-tuning samples.

The dataset is the set of instructions that will be translated into adjustments to the model’s parameters as it learns to minimize the loss on the intended training objective. Subject the training samples to scrutiny. Do the training samples map to the training objective and cover a wide range of expected potential scenarios? This is a trial-and-error process, and it can be extremely time-consuming.

At the outset, if the fine-tuning dataset is not well-designed at both a high level and in the details — and with sufficient breadth and depth of examples — it will be impossible to compensate with other steps in the process. It is also critical that there is alignment between Step #1 and Step #2 — and usually iteration, as the training objectives should be narrowed and clarified based on the availability of applicable datasets. You can think of these two steps as the master chef getting the right ingredients together. If you don’t have good, high-quality ingredients, it is very difficult to move ahead in the process. Most of the time in the fine-tuning lifecycle should be spent on this step.
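As a minimal sketch of the kind of mechanical scrutiny this step involves (the file name and field names below are assumptions, not a fixed format), a quick pass over a JSONL training file can flag missing fields, exact duplicates, and outlier lengths before any GPU time is spent:

```python
import json
from collections import Counter

REQUIRED_KEYS = {"instruction", "context", "output"}   # hypothetical field names

samples, problems, seen = [], [], set()
with open("train.jsonl", "r", encoding="utf-8") as f:  # assumed file name
    for i, line in enumerate(f):
        s = json.loads(line)
        if not REQUIRED_KEYS.issubset(s):
            problems.append((i, "missing field"))
            continue
        key = (s["instruction"].strip(), s["context"].strip())
        if key in seen:                                  # exact duplicate input
            problems.append((i, "duplicate"))
            continue
        seen.add(key)
        samples.append(s)

lengths = [len(s["context"].split()) for s in samples]
print(f"{len(samples)} usable samples, {len(problems)} flagged")
print(f"context length (words): min={min(lengths)}, max={max(lengths)}")
print(Counter(tag for _, tag in problems))
```

None of this replaces a human review of whether the samples actually map to the training objective, but it catches the mechanical problems that quietly degrade a training run.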

#3 — Model Hyper-Parameters — Learning Rate is the Key

Once you finally have your plan defined and ingredients assembled in steps 1 and 2, it is time for training!

There are a lot of things that can go wrong in a training process, and many best-practice guides. While some may disagree, in our experience, getting the learning rate wrong is the most common way that fine-tuning runs go off track. Intuitively, the learning rate defines the size of the step taken when applying the “learning” from the back-propagated gradient to the model’s parameters. We like to think of this as analogous to the oven temperature when cooking — if the temperature is too high, you will likely burn the food. If the temperature is too low, sometimes it does not get the texture that you expect, or it looks good on the outside but is soft and undercooked on the inside.

We find that most “base foundation model” papers provide fairly unclear (and sometimes inaccurate) learning rate guidelines for subsequent downstream fine-tunings of their base models. Sometimes this is because their base training was done with huge batch sizes and parallelization, and sometimes because there simply was not a lot of experimentation on how the model would be used in fine-tuning.

A few useful “rules of thumb” for fine-tuning LR settings for GPT-based decoder models (a short configuration sketch pulling these together follows the list):

· LR — for fine-tune training, set the peak learning rate (LR) in a range around 1.0 × 1e-5, subject to a warm-up and gradual step-down decay over the training lifecycle. We rarely see a model that responds well to fine-tuning above 1.5–2.0 × 1e-5 at the high end, and similarly, the lower end of the range is ~0.5 × 1e-5. This can be a trial-and-error experiment with most models, but this is the range where we typically see the best results. (Of course, there are exceptions, usually discovered through trial-and-error, which can potentially be offset by other hyper-parameter adjustments.) This also assumes relatively small batch sizes compared to the base training, e.g., 2–16 per batch/gradient accumulation step per GPU.

· Warm-up — usually 3–10% of the training steps. Different base models have different receptivity to “foreign” fine-tuning materials, often with different prompt “wrappers,” and without a warm-up period, it is easy to blow up the model and trigger some form of catastrophic forgetting. A few hundred steps of warm-up usually does the trick, but even this can require some experimentation, with different models needing more or less warm-up for optimal training.

· Troubleshooting — watch the model closely during training while at peak LR; if you see steady increases in loss during this period, it is a good sign that the peak LR is too high and needs to be adjusted downwards.

· Decay / step-down through training — we have not found any special difference between particular decay formulas; in fact, we often use a simple linear step-down at specific training steps. The important part is that the LR declines over time. We usually decay down to 0.4–0.5 × 1e-5, but not much lower than that, to try to squeeze a little incremental optimization out of the latter part of the fine-tuning dataset.

· Gradient clipping — especially important with some models, and less so with others.
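One common way to encode these settings is via the Hugging Face transformers Trainer. The following is a sketch under assumptions, not the authors’ exact setup: the model name is a placeholder, train_dataset and data_collator are assumed to have been prepared in step #2, and every value shown is a starting point to tune per project rather than a fixed recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "your-base-model"          # placeholder: the base decoder model you picked
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="./ft-checkpoints",
    num_train_epochs=1,                 # tip #4: a single pass over the data
    per_device_train_batch_size=4,      # small batches relative to base pre-training
    gradient_accumulation_steps=4,
    learning_rate=1e-5,                 # peak LR in the ~0.5-2.0 x 1e-5 range
    warmup_ratio=0.05,                  # ~3-10% of steps as warm-up
    lr_scheduler_type="linear",         # simple linear step-down works fine in practice
    max_grad_norm=1.0,                  # gradient clipping
    logging_steps=25,                   # watch the loss closely while at peak LR
    save_strategy="steps",
    save_steps=500,                     # keep checkpoints to compare on the test set (tip #5)
)
# Note: the built-in "linear" schedule decays all the way to zero; flooring the LR
# around 0.4-0.5 x 1e-5, as described above, would require a custom scheduler.

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,        # tokenized dataset prepared in step #2 (assumed)
    data_collator=data_collator,        # e.g., a causal-LM collator (assumed)
)
trainer.train()
```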

#4 — Training Passes — Only Use 1 Epoch

Don’t use the same data sample more than once in a training cycle. We remember training CNNs in the relatively old days, when it was commonplace to run the same sample potentially dozens of times over 10–20 epochs (or more) in a particular training cycle. The power of the transformer is its ability to grasp and learn patterns incredibly quickly. As a rule of thumb, if you need 2–3 full passes of your training set to start to see the targeted behavior, it is a good sign that you are on the right path, but that the training dataset is too small. Once the training data is the right size, a single training pass should do it, with the model “seeing” each sample only once. Some people might disagree — oftentimes there are recommendations to do 2–4 passes over each sample — but in our experience, this is a good guideline to avoid over-fitting and to assess whether our training data is sufficiently large compared to our training objective. It is also a good test of whether the training objective is sufficiently well-defined and realistic. Especially in a fine-tuning process, where each sample carries a relatively significant amount of influence, quality, not quantity, is critical.
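As a quick back-of-the-envelope check on whether the dataset supports a single-pass training (all numbers below are hypothetical), it helps to compute how many optimizer steps one epoch actually provides:

```python
num_samples = 20_000            # hypothetical fine-tuning set size
per_device_batch = 4
grad_accum = 4
num_gpus = 1

effective_batch = per_device_batch * grad_accum * num_gpus   # 16 samples per optimizer step
steps_per_epoch = num_samples // effective_batch             # 1250 optimizer steps
warmup_steps = int(0.05 * steps_per_epoch)                   # ~62 warm-up steps (~5%)

print(f"{steps_per_epoch} optimizer steps in one epoch, {warmup_steps} of them warm-up")
```

If one epoch yields only a few hundred optimizer steps and the model has not yet picked up the target behavior, that points to growing the dataset rather than adding epochs.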

#5 — A Real Held-out Testing Dataset

Machine learning conventional wisdom from the beginning of time is that you should hold out a test and validation set from the total set. Our approach is to develop a formal testing set from scratch that is similar in principle to the training dataset, but was not prepared in the same process. When samples are prepared in the same process and from the same source materials, oftentimes there will be an “implicit” set of similarities that can skew the results and lead to “test” results that are substantially better than the “real” results seen in the wild.

We would recommend building a (small, but well-designed) test dataset that purposefully includes a couple of adjacent areas that should be natural extensions of the training, but that are not formally in the training dataset. As an example, we may use invoices as a testing set for a training set that uses financial tables, as we are looking to see that the model can “generalize” to a related domain with a very similar set of patterns.

As you work on a fine-tuning project, you will likely train the model multiple times, and it is extremely helpful to have a test benchmark to compare results and determine which checkpoint of the model is performing best overall. We were surprised recently when we thought a particular checkpoint was under-performing because we were focused on a very specific question, only to realize that the model was failing on that one single test, but performing much better than any of our other checkpoints across the full testing set.
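As a minimal sketch of how such a benchmark gets used across checkpoints (the scoring function, field names, and checkpoint paths here are placeholders for whatever metric and inference wrapper fit your objective), the key is to aggregate one score per checkpoint over the full held-out set rather than judging by any single question:

```python
def score_answer(predicted: str, expected: str) -> float:
    """Placeholder metric: swap in exact match, key-fact overlap, or a rubric
    that fits the training objective."""
    return float(expected.strip().lower() in predicted.strip().lower())

def evaluate_checkpoint(generate_fn, test_set) -> float:
    """Average score over the entire held-out test set for one checkpoint."""
    scores = [
        score_answer(generate_fn(s["question"], s["context"]), s["answer"])
        for s in test_set
    ]
    return sum(scores) / len(scores)

# Compare several saved checkpoints on the same benchmark and pick the best overall,
# rather than over-indexing on any single failing test question. `load_generate_fn`
# and the checkpoint paths below are hypothetical stand-ins for your own inference wrapper.
# results = {
#     ckpt: evaluate_checkpoint(load_generate_fn(ckpt), test_set)
#     for ckpt in ["./ft-checkpoints/checkpoint-500", "./ft-checkpoints/checkpoint-1000"]
# }
```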

#6 — Iterate, Iterate, Iterate

To bake the perfect batch of cookies takes a lot of training runs. Keep iterating through steps 1–5 and review every single element in the details to keep optimizing and improving the results. Most big progress comes from very small adjustments — cleaning up the formatting of the training samples, removing a few problematic samples, minor adjustments to the learning rate, etc.

We hope that you have picked up a couple of useful nuggets to aid you in your next fine-tuning project — Happy Fine-tuning!

To check out some of our fine-tuned models, please go to our repo home page on HuggingFace — LLMWare RAG Instruct Models.

For more information about llmware, please check out our main github repo at llmware-ai/llmware/.

Please also check out video tutorials at: youtube.com/@llmware.
