Notes on Fine-Tuning LLMs

Eren Gölge
Published in Machine Learns
Sep 2, 2023
Stable Diffusion — “Fine-tuning large language models”

I recently spent some time watching the new course on fine-tuning large language models from DeepLearning.AI. I took notes while watching and want to share them here.

What is fine-tuning?

Fine-tuning specializes an LLM: it teaches the model a new skill rather than expanding its knowledge base. For example, GPT-4 is fine-tuned into a code assistant for Copilot, much like a doctor trains further to become a dermatologist.

What does fine-tuning do?

Fine-tuning makes the model learn the data rather than merely access it:

  • More consistent outputs
  • Reduced hallucinations
  • Specialize the model for a specific use case

Pros & Cons

From the course slides

Benefits

Performance

  • Reduced Hallucination: Reliable outputs tuned for the task.
  • Consistency: The model responds more uniformly to the same input.
  • Reduce unwanted information: Pre-trained models tend to be verbose and repetitive; fine-tuning trims this.

Privacy

  • Keep your data in-house: You should not have to ship private data to a third party in every prompt.

Cost

  • Lower cost per request: Prompting LLM services is expensive, especially when combined with vector search and contextual information.
  • Control: An enterprise that owns its model can adapt it and stay flexible when requirements change.
  • Transparency: If you know what goes in, what comes out, and how things work, you can act more efficiently and make better decisions.

Reliability

  • Uptime: It is easier to meet the uptime requirements of a model you own than relying on an external service.
  • Latency: Fine-tuned models are more efficient and you can optimize them according to your needs.
  • Moderation: You can easily feature flag or moderate inputs/outputs of your models.

Pretraining

  • Trained by self-supervised learning → next word prediction
  • Uses Unlabelled data → Wikipedia, Internet, etc.
  • After pretraining, the model has learned language & general knowledge
  • Expensive → ~$12M for GPT-3
  • Training data is generally not public → ChatGPT, LLaMA

Finetuning

  • Can be trained by self-supervised learning
  • Much less data
  • Labeled data based on the target task
  • Cheaper
  • Easier to find data → within the organization or on the Internet
  • Data curation is more important → “Your model is what your data is”

A Way to Finetune

  • Check the pre-trained LLM's performance on the task via prompt engineering (a small sketch follows this list).
  • Find cases where the LLM performs only OK or worse.
  • Collect ~1,000 samples for the task that demonstrate better outputs than the LLM produces.
  • Finetune a small LLM on this data.
  • Compare with the pre-trained LLM.
  • Repeat until you get satisfying results.
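
As a rough illustration of the first step, here is a minimal sketch that prompts a small pre-trained model through the Hugging Face pipeline API so you can eyeball where it falls short; the model name and prompt are assumptions for illustration, not from the course.

from transformers import pipeline

# Load a small base (not instruction-tuned) model for a quick baseline check
generator = pipeline("text-generation", model="EleutherAI/pythia-410m")

prompt = (
    "### Instruction: Summarize the text in one sentence.\n\n"
    "### Input: Fine-tuning adapts a pre-trained model to a specific task.\n\n"
    "### Response:"
)

# Generate deterministically and inspect how well the base model follows the instruction
out = generator(prompt, max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"])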

Instruction Finetuning

A specific way of fine-tuning that teaches models to follow instructions like a chatbot. It gave ChatGPT the ability to chat.

Data sources for instruction finetuning:

  • In-House: FAQs, Slack…
  • Synthetic: Use a different LLM or templates to convert regular text to conversations. → Alpaca Model
  • Example sample format (Alpaca):
  • Instruction: Instruct the model to perform a specific task or behave in a certain way.
  • Input: User input
  • Response: Model’s answer based on the input and the instruction.

This is an example sample that the model sees in training.

sample = """
### Instruction: {instruction}

### Input: {input}

### Response:
"""

Data Preparation

“Your model is what your data is”

Things to consider when creating a dataset:

  • Quality
  • Diversity
  • Real is better than synthetic.
  • More data is better than less.
  • A smaller amount of high-quality data beats a larger amount of low-quality data.

Preparation steps:

  • Collect → Real or synthetic
  • Format → Instruction Template
  • Tokenization → Convert text to numbers (see the sketch after this list)
  • Split → Train/Dev/Test
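
Here is a minimal sketch of the Format → Tokenization → Split steps, assuming the Hugging Face datasets and transformers libraries; the records, template string, and model name are illustrative, not from the course.

from datasets import Dataset
from transformers import AutoTokenizer

# Collect: a couple of toy (instruction, input, response) records
records = [
    {"instruction": "Translate to French.", "input": "Hello", "response": "Bonjour"},
    {"instruction": "Summarize in one sentence.", "input": "LLMs are large neural networks trained on text.", "response": "LLMs are big neural nets trained on text."},
]

# Format: fill the instruction template (the response is included, since this is training data)
template = "### Instruction: {instruction}\n\n### Input: {input}\n\n### Response: {response}"
dataset = Dataset.from_list([{"text": template.format(**r)} for r in records])

# Tokenize: convert text to token ids
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512), batched=True)

# Split: hold out a slice for evaluation (a real dataset would have many more rows)
splits = dataset.train_test_split(test_size=0.1)
train_ds, test_ds = splits["train"], splits["test"]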

Training (Fun Part)

When it comes to regular LLM training, selecting the appropriate model to fine-tune is crucial. It is recommended to begin with around 1 billion parameter models for typical tasks, but in my personal experience, smaller models can be incredibly effective for specific tasks. Therefore, it is important not to be swayed by the notion that bigger is always better in the LLM realm, and to conduct your own research.
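
For a rough idea of what a basic run can look like, here is a minimal sketch using the Hugging Face Trainer API; the model name, hyperparameters, and the train_ds/test_ds splits from the data-preparation sketch above are all illustrative assumptions, not something prescribed by the course.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-410m"  # a small model; pick a size that fits your task and budget
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # causal LM tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="finetuned-model",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,   # tokenized splits from the data-preparation sketch
    eval_dataset=test_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # labels = shifted input ids
)
trainer.train()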

In addition, there are specific techniques such as LoRA that can enhance the efficiency of your training. If you have limited computing resources, you can incorporate one of these methods into your fine-tuning process. By doing so, you can substantially reduce the amount of computing power needed without compromising performance.
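
As a non-authoritative sketch of how LoRA can be wired in with the Hugging Face peft library (the base model, rank, and target modules below are illustrative assumptions):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # which layers to adapt; depends on the model architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model's parameters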

Evaluation

Evaluation marks the start of the fine-tuning loop rather than its conclusion. The objective is to improve the model continuously by running error analysis after each iteration: we assess the current model (initially the base model), identify errors and recurring issues, and in the next iteration check whether those issues are resolved. If not, we add targeted data to address them.

Some of the common issues might be misspellings, too-long responses, repetitions, inconsistency, and hallucinations.

There are different formats of evaluation:

  • Human evaluation → most reliable
  • Test suites → Rules that check specific conditions: stop words, exact match, consistency, similarity via embeddings, etc. (a small sketch follows this list).
  • Elo comparison → A/B tests that rank multiple models against each other.
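
To make the test-suite idea concrete, here is a toy sketch of rule-based checks; the rules and strings are made-up examples, not from the course.

def exact_match(output, reference):
    # Normalize and compare against a known-good answer
    return output.strip().lower() == reference.strip().lower()

def contains_banned_words(output, banned=("as an ai language model",)):
    # Flag phrases you never want in a response
    return any(phrase in output.lower() for phrase in banned)

def is_consistent(outputs):
    # The same prompt, run several times, should give the same normalized answer
    return len({o.strip().lower() for o in outputs}) == 1

# Example usage
print(exact_match("Paris", "paris"))                 # True
print(contains_banned_words("Sure, here you go."))   # False
print(is_consistent(["42", "42 ", "42"]))            # True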

We can also use common benchmark datasets to compare our system:

  • ARC → grade-school science questions (AI2 Reasoning Challenge)
  • HellaSwag → common-sense sentence completion
  • MMLU → knowledge across many subjects (math, history, law, and more)
  • TruthfulQA → a model's tendency to reproduce common falsehoods

Thanks for reading Machine Learns! Subscribe for free to receive new posts and support my work.
