Fine-tuning vs. Pre-training

Eduardo Ordax
4 min read · Jan 15, 2024


The objective of my articles is to ensure clarity and simplicity in technical explanations. To achieve this, I will skip over certain complex concepts that, while potentially necessary, may not add significant value to the learning experience.

Image generated by DALL-E

Language models have made significant strides in natural language understanding and generation. In this blog post, we will dive into the topic of continuous pre-training and highlight its distinctions from fine-tuning strategies. We will showcase various examples of this innovative approach that holds the potential to reshape the future of language models.

Let’s begin by clarifying the difference between fine-tuning and continuous pre-training. They may sound similar, but they are two distinct processes, often mistakenly referred to as just “fine-tuning.”

LLMs have exhibited a profound understanding of natural language, improving performance on a wide array of tasks. Training on open web data has helped create general-purpose LLMs with a broad range of capabilities. General-purpose LLMs are, however, not “specialists”: while an LLM can write a good news article, it would be hard-pressed to draft specialized legal documents.

Most often, we refer to fine-tuning when we take a pre-trained model and adapt it to a specific task (such as classification or question answering) or dataset. However, if the target dataset comes from a specific domain and you have some unlabeled data that might help the model adapt to that domain, you can do something called MLM or MLM+NSP “fine-tuning” (also known as continuous or further pre-training).
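As a rough illustration of what this MLM-style continued pre-training looks like in practice, here is a minimal sketch using the Hugging Face Transformers library. The BERT checkpoint and the “domain_corpus.txt” file are assumptions chosen for illustration; any encoder model and any collection of unlabeled in-domain text would do.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Unlabeled in-domain text; "domain_corpus.txt" is a placeholder for your own data.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of the tokens: this is the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Note that no labels are needed here; the masked tokens themselves provide the training signal, which is exactly what makes this pre-training rather than task fine-tuning.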

To create domain-specific large language models (LLMs), you have two main approaches:

  1. Training from Scratch: Building specialized LLMs entirely from domain-specific data.
  2. Continual Pre-training: Enhancing existing LLMs by further pre-training them with domain-specific data.

These two methods allow you to tailor LLMs to specific domains effectively.

Indeed, there is a distinction between fine-tuning for a specific task and acquiring domain-specific knowledge. Now, let’s delve deeper into these topics to explore the nuances of pre-training from scratch, continuous pre-training, and fine-tuning.

Fine-tuning

Fine-tuning uses labeled data to adjust the model’s parameters, tailoring it to the specific nuances of a task. This specialization significantly enhances the model’s effectiveness on that particular task compared to a general-purpose pre-trained model.
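To make the contrast with pre-training concrete, below is a minimal sketch of task-specific fine-tuning with labeled data, using Hugging Face Transformers. The DistilBERT checkpoint and the IMDB sentiment dataset are stand-ins chosen only for illustration; in practice you would plug in your own task and labels.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Labeled task data; IMDB sentiment is used here only as a stand-in dataset.
dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-imdb",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # lets the Trainer pad each batch dynamically
)
trainer.train()
```

The key difference from the pre-training sketches later in this post is the supervision: every example carries a label, and the classification head is what gets specialized.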

Example of fine-tuning a LLaMA-based model (Image created by the author)

Alpaca and Vicuna are fine-tuned versions of the LLaMA model with the capability to engage in conversations and follow instructions. Consequently, their behavior is expected to resemble that of ChatGPT.

But how good are they? According to their website, the output quality of Vicuna (as judged by GPT-4) is about 90% of ChatGPT’s, making it the best language model you can run locally. That means that by fine-tuning a model, you can get a much better version of the base model for a specific task.

Relative Response Quality Assessed by GPT-4 (from Vicuna website)

Pre-training

Pre-training usually means taking the model architecture, initializing the weights randomly, and training the model from absolute scratch on some large corpus.
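In code, “from scratch” simply means the weights are never loaded from a checkpoint. The sketch below, assuming the Hugging Face Transformers API and Pythia-1B purely as an example architecture, builds a randomly initialized model from a config rather than from pre-trained weights.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Reuse an existing architecture definition (Pythia-1B here, as an example),
# but initialize every weight at random: no knowledge is carried over.
config = AutoConfig.from_pretrained("EleutherAI/pythia-1b")
model = AutoModelForCausalLM.from_config(config)  # randomly initialized weights

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")

# From here you would train on a very large general corpus; in practice this
# takes billions of tokens and substantial compute.
```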

Further/continuous pre-training means taking an already pre-trained model and applying transfer learning: you load the saved weights from the trained checkpoint and continue training on data from a new domain (e.g., financial data).
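Here is a minimal sketch of that continued pre-training step, again assuming Hugging Face Transformers and Pythia-1B as the starting checkpoint; the “financial_corpus.txt” file is a placeholder for whatever unlabeled domain text you have.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Load the saved weights (checkpoint) instead of starting from random ones.
checkpoint = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # Pythia's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Unlabeled domain text; "financial_corpus.txt" is a placeholder for your own data.
dataset = load_dataset("text", data_files={"train": "financial_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard next-token (causal LM) objective used during pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pythia-1b-financial",
                           num_train_epochs=1,
                           per_device_train_batch_size=8,
                           learning_rate=1e-5),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The only difference from pre-training from scratch is the very first line of model loading: the checkpoint’s weights are the starting point, so the model keeps its general language ability while absorbing the new domain.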

Example of further pre-training a Pythia-based model (Image created by the author)

As shown in the picture above, continuous pre-training relies on the concept of transfer learning. After a model has undergone initial pre-training, it can apply its learned language patterns to new datasets. This approach uses unlabeled data from a particular domain, enabling the LLM to improve its comprehension and performance in specific knowledge domains such as finance, law, or healthcare.

As we can see in the table below, the continually pre-trained model versions FinPythia-6.9B and FinPythia-1B exhibit superior performance on the FPB, Headline, and NER tasks, while showing comparatively lower results on the FiQA SA task, compared with their Pythia counterparts. For more details, please refer to the paper linked.

At last re:Invent, AWS announced that you can now privately and securely customize foundation models (FMs) with your own data in Amazon Bedrock to build applications that are specific to your domain, organization, and use case. Amazon Bedrock provides capabilities to easily fine-tune or continue pre-training LLMs using the Amazon Bedrock console or APIs.
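As a rough sketch of what this looks like with the AWS SDK for Python (boto3), the call below submits a model customization job. All names, ARNs, S3 URIs, the base model identifier, and the hyperparameter values are placeholders for illustration; the exact fields and allowed values depend on the base model you choose, so check the Bedrock documentation before running it.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# All names, ARNs, S3 URIs, and hyperparameter values below are placeholders.
response = bedrock.create_model_customization_job(
    jobName="finance-continued-pretraining",
    customModelName="my-finance-model",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="CONTINUED_PRE_TRAINING",  # or "FINE_TUNING" for labeled task data
    trainingDataConfig={"s3Uri": "s3://my-bucket/finance-corpus/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/finance-output/"},
    hyperParameters={"epochCount": "1", "batchSize": "8", "learningRate": "0.00001"},
)
print(response["jobArn"])
```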

In summary, large language models (LLMs) acquire knowledge through a combination of pre-training and fine-tuning. Fine-tuning is essential for tailoring models to specific tasks like summarization, Q&A, or classification. Meanwhile, continuous pre-training gives LLMs deeper domain knowledge in areas such as medicine, finance, or law. Depending on your use case, you can leverage fine-tuning, continuous pre-training, or both approaches for optimal results.


Eduardo Ordax

I work as the AI/ML Go to Market EMEA Lead at AWS, where I assist customers around the world in harnessing the full potential of Artificial Intelligence.