A Pro’s Guide to Finetuning LLMs

David Shapiro
12 min read · Sep 23, 2023


I’m basically the Dale Earnhardt of finetuning models. If you don’t believe me, check out this video where I finetuned a movie script generator with GPT-3 in May of 2022 (the before times!): https://youtu.be/cOz3QJT1zU8

Repo here: https://github.com/daveshap/MovieScriptGenerator

Large language models (LLMs) like GPT-3 and Llama have shown immense promise for natural language generation. With sufficient data and compute, these models can produce remarkably human-like text. However, off-the-shelf LLMs still have limitations. They may generate text that is bland, inconsistent, or not tailored to your specific needs.

This is where finetuning comes in. Finetuning is the process of taking a pre-trained LLM and customizing it for a specific task or dataset. With finetuning, you can steer the LLM towards producing the kind of text you want.
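
To make that concrete, here’s roughly what kicking off a finetuning job looks like with the pre-1.0 OpenAI Python SDK as of this writing. The SDK and endpoints change over time, so treat this as an illustrative sketch; train.jsonl and the API key are placeholders.

```python
import openai  # pre-1.0 SDK (circa 2023); newer versions use a different interface

openai.api_key = "sk-..."  # placeholder

# Upload a JSONL file of training examples
training_file = openai.File.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off the finetuning job against a base model
job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# Poll until the job completes, then use the resulting finetuned model name
print(openai.FineTuningJob.retrieve(job.id).status)
```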

In this article, I’ll share my experiences and best practices for finetuning LLMs as an expert practitioner. I’ve finetuned hundreds of models, starting with GPT-2 and scaling up to GPT-3. I’ve also trained open-source models as well as NVIDIA’s NeMo models. By following the steps below, you too can harness the power of LLMs for your natural language tasks.

Sneak peek: you can check out my most famous finetuning dataset repository here:

https://github.com/daveshap/GPT3_Finetunes

You should also check out my Prompt Maestro article and video before reading this. If you are not a good prompt engineer, you will not be a good finetuning engineer! Master prompt engineering first!

https://medium.com/@dave-shap/become-a-gpt-prompt-maestro-943986a93b81

I’ve been finetuning LLMs since the release of GPT-2 in 2019. Finetuning was a necessity with early LLMs like GPT-2 and GPT-3, as they lacked the alignment and intelligence of today’s models. With GPT-3, I would need to provide hundreds or thousands of examples to get the consistent behavior I wanted.

Over time, LLMs have become more capable out-of-the-box. With models like Anthropic’s Claude and Google’s PaLM, you may not need much finetuning at all. The core capabilities are there.

However, finetuning is still a valuable technique when you want to specialize a model. For instance, you may want to steer text generation in a certain direction or have the model interface with a custom dataset or API. Finetuning allows you to adapt a general-purpose LLM into a more customized tool.

In this guide, I’ll share what I’ve learned from finetuning so many models over the years. These tips will help you effectively steer LLMs towards your goals.

Finetuning Teaches Patterns, Not Knowledge

The key thing to understand about finetuning is that it primarily teaches the LLM to generate patterns, not remember knowledge. When you provide examples for a task, the model learns heuristics for mapping input to expected output. But this is surface-level pattern matching.

Finetuning does not mean the LLM acquires deeper understanding or memories. The model is not retrained from scratch; its weights are only lightly adjusted, acting as a slight steering mechanism.

I cannot emphasize enough that finetuning only teaches patterns.

As an analogy, think of finetuning as putting a custom paint job and decals on a car. This alters the appearance as desired. But you are not changing anything under the hood. The core engine remains the same.

This differentiation is important. Many practitioners overestimate what finetuning can do. You cannot cram knowledge or reasoning ability into an LLM with just a few examples. The model will simply pattern match, not truly learn.

I cannot stress enough the centrality of understanding patterns versus knowledge. LLMs only ingest general knowledge during their main training phase or checkpoint updates.

Defining Patterns of Text

So what exactly is a “pattern of text” that finetuning can teach? We can define a pattern as any consistent convention or structure in language use.

Fiction writing follows distinctive patterns to convey a story. There is a sequence of dialog between characters, passages of exposition and description, action statements to progress the plot, and interior monologues revealing inner thoughts. The balance and arrangement of these elements create a unique storytelling style.

On a more structural level, documents like outlines and notes use bullet points to concisely summarize information. The bullets chunk content into scannable sections. Tables of contents also employ compact bullet points to provide navigation.

Relatedly, dialog patterns appear everywhere from plays and screenplays to instant messaging chats. Lines alternate between speakers, with punctuation and speaker labels delineating who is talking. This back-and-forth exchange moves the conversation forward rhythmically.

Even academic writing in scientific papers has formulaic patterns. There are standard sections like the abstract, introduction, methods, results, and conclusion. Peculiar conventions like passive voice and avoiding first-person pronouns set scholarly writing apart. The presence of author names, affiliations, citations, and technical terms also form a pattern.

Stepping back, patterns can emerge at many levels beyond just genres. The length of text generation can follow templates. For example, finetuning can steer models to produce long essays or short summaries depending on the use case. Even formatting like newlines, brackets, commas, and parentheses can form patterns an LLM learns to apply appropriately.

Some characteristics of text include:

  1. Overall length of the output generation. This can be dependent on or independent of the input. For instance, for a mission statement generator, you want the output to always be a single sentence, irrespective of how long the input is (see the sample sketch after this section).
  2. Length of sentences or paragraphs. This varies dramatically if you’re working with KB articles, science, fiction, news, etc. You may not even have complete sentences if you’re doing code or structured data.
  3. Types of sentences. Dialog, exposition, poetry, citations, explanation, etc.
  4. Overarching patterns or structure of the output. Is it a dialog? A screenplay? JSON?
  5. Word choice, language, lexicon. You can train the model to follow very specific patterns of words and language.

Finally, patterns also exist in aspects like tone, style, and other copyediting conventions. Finetuning can teach a model to adopt the hallmarks of Hemingway’s punchy prose versus Jane Austen’s elegant extended sentences. It can steer towards professional business language or casual conversation.

The key is consistency. Any language usage with regularities — whether functional or stylistic — can form a pattern for an LLM to internalize and replicate. This diversity underscores the power of finetuning for directing text generation.
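
To make “pattern” concrete, here’s what training samples might look like for the mission statement generator from the list above, in the chat-style JSONL format OpenAI’s finetuning endpoint expects as of this writing (the content is invented for illustration). Notice the pattern being taught: no matter how long or rambling the input, the output is always exactly one sentence.

```jsonl
{"messages": [{"role": "user", "content": "We sell reclaimed-wood furniture, handmade by local carpenters, and we donate all scrap to school woodshops."}, {"role": "assistant", "content": "We craft sustainable furniture that strengthens the communities who build it."}]}
{"messages": [{"role": "user", "content": "Startup idea: an app that matches home cooks with neighbors who want affordable meals. We also do catering, meal kits, and cooking classes, and we plan to expand into grocery delivery."}, {"role": "assistant", "content": "We connect neighbors through affordable, home-cooked food."}]}
```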

A Series of Characters!

You know how ChatGPT has a very particular pattern and style? It tends to use lists, formal language, and so on. This is a pattern.

Another way to think of a pattern is a series of characters.

9cb3008b-6f5c-48db-8339-3a9143fae368

Above is a UUIDv4 (universally unique identifier, version 4). Let’s characterize it as a pattern of characters.

  1. First, it’s hexadecimal (meaning it only uses the characters 0 thru 9 and A thru F)
  2. Second, it follows the pattern of 8 chars, 4 chars, 4 chars, 4 chars, 12 chars, all separated by hyphens

So you can see how these two rules create a very rigid, distinctive pattern. Likewise, JSON follows a very strict pattern as well. Even fiction follows patterns, and every author has their own unique pattern.
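
Those two rules are rigid enough to express as a regular expression. Here’s a quick Python check of exactly the two rules above (a strict v4 check would also pin the version digit, but that’s beyond the character-level pattern we’re discussing):

```python
import re

# Rule 1: hexadecimal characters only. Rule 2: groups of 8-4-4-4-12, separated by hyphens.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

print(bool(UUID_PATTERN.match("9cb3008b-6f5c-48db-8339-3a9143fae368")))  # True
print(bool(UUID_PATTERN.match("not-a-uuid")))                            # False
```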

Here’s another way to think about sequences of characters in your finetuning datasets:

  1. What characters are allowed? Which aren’t?
  2. How often do carriage returns and newlines appear, and what rules dictate them?
  3. What formatting characters are used (XML, Markdown, etc) if any?

Keep in mind that LLMs generate one token at a time, as a sequence. Think of it like a typewriter, hacking away at a ribbon of text. Even newlines are part of that linear sequence; the only thing a newline does is change the way you view the text.

Mapping Inputs to Outputs

It’s not just about finetuning for certain text outputs. The model also needs to learn associations between inputs and expected outputs.

The nature of the inputs can vary greatly across tasks. Inputs could be instructions in natural language, structured data, raw text to summarize, and so on. It’s important to represent this diversity in your finetuning data.

Ideally, the training data encompasses the full breadth of possible inputs the model may encounter. This helps generalize robustly. For example, if the task is to process forms, the data should include short forms, long forms, forms with varying fields, odd formatting, and so on.

Think of each sample in your training data like an equation. Input is the left side of the equation, output is the right side. This input equals that output. They are a matched pair.

Exposing the model to edge cases during finetuning also helps handle unexpected inputs down the line. The model learns not to catastrophically fail when something is formatted oddly or instructions are unclear.

Essentially, finetuning trains the association between arbitrary inputs and target outputs. You want the model to learn how exactly you expect certain inputs to map to certain outputs.

Think of the LLM as an assembly line for text. The raw materials entering can be anything. Finetuning optimizes the manufacturing process to deliver the desired product every time. This input-output correlation is key.

With sufficiently diverse training data, the model will interpolate well for new inputs. The input-output mapping is distilled into a generalizable pattern. This enables the flexibility to handle novel data in a customized way.
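
Returning to the form-processing example, here’s what matched input/output pairs might look like (same hypothetical chat-style JSONL as before; the schema and field contents are invented for illustration). Each line is one “equation”: messy form in, clean structured record out.

```jsonl
{"messages": [{"role": "user", "content": "NAME: jane doe\nDOB - 04/02/1991\nphone:555-0142"}, {"role": "assistant", "content": "{\"name\": \"Jane Doe\", \"dob\": \"1991-04-02\", \"phone\": \"555-0142\"}"}]}
{"messages": [{"role": "user", "content": "Applicant is John Q. Smith, born July 9 1984, no phone number given"}, {"role": "assistant", "content": "{\"name\": \"John Q. Smith\", \"dob\": \"1984-07-09\", \"phone\": null}"}]}
```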

Generalizing from Examples

Finetuning is like learning any new skill. Take driving a car for example. At first, you practice driving the same route under ideal conditions. But to become a truly skilled driver, you need experience across diverse situations. Different cars, locations, weather, traffic conditions, etc.

Once you’ve racked up enough diverse driving hours, you can generalize the skill. You’re able to smoothly drive any car that comes your way under various conditions. The breadth of practice allows you to interpolate and adapt.

This same principle applies in finetuning LLMs. You don’t want narrowly specialized training data. The examples need to widely cover the scope of what the model should handle.

Let’s say you want the model to summarize long news articles. The training data shouldn’t just use articles from the same publication or on the same topic. There should be variety in:

  • Article source (magazines, newspapers, blogs)
  • Article length (short, long, extra long)
  • Writing style (conversational, formal, satirical)
  • Topic (politics, business, arts)

This diversity encourages generalization across the problem space. Then the model can reliably summarize arbitrary new articles, even with quirks like unusual formatting.

Remember, finetuning follows the GIGO principle: garbage in, garbage out. Carefully curate training data that exposes the full breadth of inputs and desired outputs. This unlocks robust mapping from any A to any B.
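
A cheap way to catch lopsided data before you train is to count samples along axes like the ones above. A minimal sketch, assuming a JSONL training file with hypothetical source, topic, and article fields:

```python
import json
from collections import Counter

sources, topics, lengths = Counter(), Counter(), Counter()

with open("train.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        sources[sample["source"]] += 1  # hypothetical metadata field
        topics[sample["topic"]] += 1    # hypothetical metadata field
        n_words = len(sample["article"].split())  # hypothetical text field
        lengths["short" if n_words < 500 else "long" if n_words < 2000 else "extra long"] += 1

# A single dominant bucket on any axis means a tight dart cluster (more on that below)
for name, counter in (("source", sources), ("topic", topics), ("length", lengths)):
    print(name, dict(counter))
```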

The Dartboard Analogy

Now we arrive at the crux of this article: the dart board analogy.

Imagine that the task in front of you is a dartboard, and you want to hit the bullseye to “dial in” performance.

Your dataset is a fistful of darts. Each sample represents one throw, and your goal is to train a model that can consistently hit the bullseye and achieve optimal performance.

If all your darts cluster together in one small section of the board, your dataset is dangerously lopsided. It’s like only practicing your dart throws from one spot very close to the board. When tournament time comes, you’ll be unable to hit the bullseye if you have to throw from farther back.

This clustering limits your model’s capabilities. The finetuned model will only generate text constrained within that narrow cluster, regardless of the input.

For example, let’s say you want to finetune a cooking assistant AI. If your training data only includes recipes for tuna sandwiches, that’s your tight dart cluster. As a result, your finetuned model will just keep spewing out tuna sandwich recipes, even if you ask it for a pasta dish!

Instead, you want your darts distributed evenly across the entire dartboard. Your training data should cover diverse examples spanning the scope of intended use. For our cooking AI, include recipes using different cuisines, ingredients, cooking methods, dish formats, etc.

This variability encourages the model to generalize broadly and hit bullseyes for any cooking prompt. When your darts cover the whole board, the model learns to hit any section on command.

So in your finetuning dataset, consciously sample for diversity like a darts player practicing throws from every distance and angle. Evaluate the data distribution to ensure wide coverage instead of lopsided clusters. Broad variability unlocks flexible text generation capabilities.

Even for focused niches, include some variety to refine performance. For a tuna sandwich model, incorporate different recipes with diverse ingredients, preparation steps, styles, etc. This allows better tuning control within the specialty.

In summary, widely distributed darts grant your model versatility and finesse. Watch out for tight clusters that overly constrain capabilities. Seek broad diversity in your training data so your finetuned model consistently hits the bullseye.

Avoid Finetuning for Knowledge

Finetuning should not be used as a way to impart significant knowledge or memories to an LLM. It simply does not work for knowledge storage and retrieval. The training process only lightly nudges the model’s weights. This serves to steer text patterns, not encode complex concepts.

Techniques like retrieval augmented generation (RAG) are better suited for knowledge functions. RAG allows models to retrieve and incorporate external knowledge on-the-fly from documents during generation. This provides more robust knowledge capabilities beyond what finetuning can achieve.
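
To give a feel for the difference, here’s a toy RAG loop. This is a deliberately minimal sketch: retrieval is naive keyword overlap (real systems use embedding search), and call_llm is a placeholder standing in for whatever model you actually query.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your actual model call (OpenAI API, local Llama, etc.)."""
    raise NotImplementedError

documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Premium plans include priority support and a dedicated account manager.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # Naive retrieval: rank documents by word overlap with the question
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer(question: str) -> str:
    # Knowledge is injected at generation time instead of baked in by finetuning
    context = "\n".join(retrieve(question, documents))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```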

Instead, it’s best to focus finetuning on narrow, specialized tasks involving text patterns. Identify the key input and output text characteristics you want to link. These may involve content, style, tone, structure, length, and more. Finetune for one well-defined task at a time rather than cramming in multiple objectives. Finetuning is meant for specialization, not building Swiss army knives.

Furthermore, use highly diverse training data spanning edge cases. Real-world data is often messy and noisy, so finetune with a wide variety of examples. Include adversarial or broken examples too. If the model hasn’t seen dirty data during training, it won’t handle it well at test time when deployed.
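
In practice, “adversarial or broken” examples are just more rows in your training file that map failure-mode inputs to the graceful behavior you want. A sketch, reusing the hypothetical form-processing format from earlier:

```jsonl
{"messages": [{"role": "user", "content": ""}, {"role": "assistant", "content": "ERROR: empty input"}]}
{"messages": [{"role": "user", "content": "<html><body>500 Internal Server Error</body></html>"}, {"role": "assistant", "content": "ERROR: input appears to be an error page, not a form"}]}
```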

Throughout the finetuning process, incrementally check outputs to ensure proper alignment. With this focused approach, finetuning can reliably map inputs to desired outputs for a particular task. But attempt to cram in knowledge at your peril — turn to other techniques like RAG for robust knowledge functions.

Best Practices for Finetuning

Finetuning requires a rigorous approach to get the desired outcomes. Here are some key best practices to follow:

First, finetune a model for one specialized task at a time, rather than trying to multitask. Finetuning produces a specialized text generation tool, not a knowledge store. Keep the scope narrow and well-defined. Think of finetuning as creating a customized chef’s knife rather than a Swiss army knife.

Second, focus closely on mapping particular inputs to desired outputs. Consider finetuning as an assembly line transforming raw text materials into a finished text product. Closely characterize the textual patterns you want to link between input and output. Conduct a systematic analysis on aspects like content, style, tone, structure, formatting, length, and more on both the input and output side. The model will then learn to reliably map any input to the desired corresponding output.

Third, use highly diverse training data spanning a wide variety of edge cases. Vary genres, content types, sources, lengths, and include adversarial cases. Having broad diversity encourages the model to generalize across the entire problem space rather than just memorize the examples. Err strongly on the side of too much variety in the training data rather than too little. Real-world inputs at test time will be noisy and messy, so training robustly prepares the model.

Fourth, remember that finetuning follows the GIGO (garbage in, garbage out) principle strongly. Carefully curate, prepare, and clean the training data to enable full generalization by the model across possibilities. Sloppy or narrow data will lead to poor performance.

Fifth, and finally, make sure you always include adversarial examples. Your finetuned model may not be customer-facing, but it will still confront potential failure conditions or exploits. Think about how incorrect formatting or broken data might gum up your system. You may need to compensate for this with out-of-band checks rather than finetuning.
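
An out-of-band check is just ordinary validation code sitting between the model and the rest of your system. For example, if your finetuned model is supposed to emit a JSON record, a minimal guard might look like this (a sketch; the required fields are hypothetical, and the retry or fallback policy is up to you):

```python
import json

def validate_output(raw: str) -> dict:
    """Reject model output that doesn't match the expected pattern."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model did not return valid JSON: {e}")
    for field in ("name", "dob", "phone"):  # hypothetical required fields
        if field not in parsed:
            raise ValueError(f"Missing required field: {field}")
    return parsed
```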

By following these rigorous best practices, finetuning can produce highly effective specialized text generation models tailored to the desired input/output mapping. Adhering to this disciplined approach is key to successful finetuning.

TL;DR — Best Practices

  1. Finetuning Teaches Patterns, Not Knowledge: Finetuning only lightly adjusts a model’s weights to steer it toward textual patterns. It does not impart deeper knowledge or reasoning abilities.
  2. Define Clear Input-Output Mappings: Characterize the link between inputs and desired outputs. Finetuning trains the model to reliably map arbitrary inputs to target outputs. Think of it like balancing an equation.
  3. Use Highly Diverse Training Data: Include a wide variety of examples spanning different genres, styles, formats, lengths, etc. This encourages generalization across the problem space. Variety is the spice of finetuned LLMs.
  4. Include Adversarial Examples: Incorporate broken, messy, or exploit data to train robustness against edge cases. Models will face dirty inputs when deployed. Remember that not all adversarial cases are hostile actors. Sometimes they are dumb users, sometimes they are error codes leaking into your data. Who knows?
  5. Avoid Narrow Data Clusters: Ensure training data is widely distributed, not lopsided clusters. Broad variability enables flexible generation capabilities. The exception here is that you may want narrow data clusters, particularly if you want a very precise tool.
  6. Focus on Specialized Tasks: Only finetune for one well-defined task at a time rather than cramming in multiple objectives. Finetuning produces specialized tools. Yes, RLHF created a general purpose chatbot in the form of ChatGPT, but already it is being specialized into multiple versions.
  7. Integrate External Knowledge: Use other methods like retrieval augmentation for knowledge functions. Don’t rely solely on finetuning for knowledge.

Ping me on LinkedIn, Upwork, or Patreon if you want any help with finetuning.
