LLM domain adaptation using continued pre-training — Part 4/4

Aris Tsakpinis
8 min read · May 21, 2024


Exploring domain adaptation via continued pre-training for large language models (LLMs)? This four-part series answers the most common questions on why, how, and when to perform domain adaptation of LLMs via continued pre-training.
Written by:
Anastasia Tzeveleka, Aris Tsakpinis, and Gili Nachum

Part 1: Introduction
Part 2: Training data — sourcing, selection, curation and pre-processing
Part 3: Continued pre-training on AWS
Part 4: Advanced: Model choice and downstream fine-tuning — You’re here!

Advanced: Model choice and downstream fine-tuning

In part 1, we reviewed domain adaptation and the different approaches you can use to adapt an LLM to your specific domain. We also discussed how continued pre-training allows you to continue training an LLM on unstructured data.

In part 2, we focussed on the training data aspect of domain adaptation, building the foundation of our aligned model.

In part 3, we looked into the practical side of things. An easy way to get started with continued pre-training is by using AWS. AWS provides not just the infrastructure you need for your training but also services and capabilities that accelerate and simplify your domain adaptation journey.

In this final part of the blog post series, we delve into further important conceptual decisions on the way to a domain-adapted model. This includes the initial model selection, which is a trade-off across multiple dimensions. Finally, we explore other fine-tuning approaches and how to combine them with continued pre-training.

Which foundation model should I use?

When selecting a model for domain adaptation, you should consider, among others, the following factors:

Model access: Depending on the AWS service you use, you get access to different FMs (see also part 3). Note that in some cases region-specific differences in model availability apply.

Modality: Depending on the use case you might want to use a text-completion model (e.g. Amazon Titan, Llama-2 7B) or an already aligned model (e.g. Llama-2 7B-chat) as a starting point. While models with additional modalities exist (image, video, speech, …), this blog post focuses on the text modality and LLMs.

Performance: To narrow down the list of models to choose from, you can begin by reviewing publicly available model evaluation results and base-model benchmarks. Based on this you can create a shortlist of models and evaluate their performance on your specific task(s) using your own custom datasets, through automated or human evaluation (e.g. with the foundation model evaluation features embedded into Amazon SageMaker Clarify or Amazon Bedrock). Smaller models can work well depending on the task; for example, Mistral-7B has worked well when further pre-trained for a new language (see the perplexity sketch after this list for a simple automated comparison).

Budget (TCO): In general, larger models require more compute and potentially multi-GPU instances for training and serving across multiple accelerators. This directly impacts training and inference cost, the complexity of training and inference, and the resources and skills required, resulting in a higher TCO for the entire project lifecycle. The FM choice needs to be assessed against the short- and long-term budget allocated.

Training data availability: Larger models require more data. Having an estimate of how much data is available in advance is useful when selecting models.

Licensing model: Both proprietary and open-source models come with licensing constraints that vary depending on where and how they will be used.

Governance, ethics, responsible AI: Every organisation has compliance guidelines along these dimensions. You can consult model cards and papers for details on how the models were trained.
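Coming back to the performance dimension referenced above: the following is a minimal sketch that compares shortlisted base models by their perplexity on a small sample of your own domain corpus. The model IDs and the "domain_sample.txt" file are illustrative assumptions; task-specific and human evaluation should complement a simple comparison like this.

# Minimal sketch: rank shortlisted base models by perplexity on a domain sample.
# Model IDs and the sample file are placeholders, not recommendations.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CANDIDATES = ["mistralai/Mistral-7B-v0.1", "meta-llama/Llama-2-7b-hf"]  # example shortlist
texts = open("domain_sample.txt", encoding="utf-8").read().split("\n\n")[:200]

def mean_perplexity(model_id: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
    model.eval()
    losses = []
    for text in texts:
        enc = tok(text, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # causal LM loss = mean NLL per token
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

for model_id in CANDIDATES:
    print(model_id, round(mean_perplexity(model_id), 2))

A lower perplexity indicates that a model already "speaks" your domain language more naturally, which often correlates with how much continued pre-training it will need.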

Example: An organisation decides to consider open-source models such as LLaMA 3 and to rule out proprietary models like Anthropic Claude or AI21 Labs Jurassic for transparency reasons. It may also decide to only use the smallest (8B-parameter) version of the model to be able to train and serve it on single-GPU instances.

Note: Depending on your use case, you may decide to perform supervised fine-tuning or preference alignment after continued pre-training of your model. This can impact the initial model choice. Please refer to the following section, “Can/Should I further fine-tune the resulting model for a specific task?”, for more details.

Can/Should I further fine-tune the resulting model for a specific task?

Yes! In fact, other alignment approaches like supervised fine-tuning or preference alignment can significantly uplift the model’s performance on specific tasks (in general, research has shown that smaller fine-tuned models can outperform larger general-purpose models).

Example: For teaching a Mistral-7B model a new language, you can use the following approach:

  • Fine-tune using continued pre-training on a dataset in the local language plus some high-quality English data. The original tokenizer is replaced with a language-specific tokenizer (see the sketch after this list).
  • Fine-tune using RLHF for preference alignment to improve quality and diversity.
  • Evaluate both fine-tuned models and the base model against each other using manual and automated evaluation to pick the best possible model.
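For the tokenizer-replacement step in the first bullet, a minimal sketch with Hugging Face transformers could look as follows. The corpus file, vocabulary size and model ID are illustrative assumptions; in practice you might also consider extending the original vocabulary instead of fully replacing it.

# Minimal sketch: learn a language-specific tokenizer based on the base tokenizer's
# algorithm and resize the embedding matrix before continued pre-training.
# Corpus file, vocabulary size and model ID are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-v0.1"  # example base model
corpus = (line for line in open("local_language_corpus.txt", encoding="utf-8"))

base_tok = AutoTokenizer.from_pretrained(base_id)
new_tok = base_tok.train_new_from_iterator(corpus, vocab_size=32_000)  # re-learn vocab on the new language
new_tok.save_pretrained("tokenizer-local-language")

model = AutoModelForCausalLM.from_pretrained(base_id)
model.resize_token_embeddings(len(new_tok))               # new vocab entries get fresh embeddings,
model.save_pretrained("mistral-7b-local-language-init")   # learned during continued pre-training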

For advanced readers:

There is no one-size-fits-all approach for domain adaptation. Overall, it is experiment-driven, and it usually takes several iterations before you reach satisfactory results.

The diagram below illustrates the end-to-end lifecycle when fine-tuning models for domain adaptation, which includes continued pre-training, supervised fine-tuning and (human) preference alignment steps. Depending on the use case, you may start either at a text-completion model or at a task-fine-tuned model and iterate as needed. The decision as to where to start will also affect your model selection (in addition to the decision criteria outlined at the beginning of this blog), i.e. whether you want to use a base or an aligned model.

Domain adaptation end-to-end lifecycle

When deciding on the specific approach, a few things to take into consideration include:

Task to be performed: Different use cases require specific model behaviour. While for some use cases a simple text-completion model (next-token prediction) might be sufficient, most use cases require task-specific behaviour like chattiness or instruction-following. To meet this requirement, you can take a working-backwards approach from the desired task to be performed. This means you define your specific fine-tuning journey so that it ends at a model aligned to this task. With regard to the illustration, this implies that the journey must end in the blue, orange or green circle matching the desired model behaviour, while the fine-tuning journey itself is defined along the possible paths of the flow diagram.

Choose the right starting point for you (as long as it is reasonable): While you should be very clear on where your fine-tuning journey should end, you can start anywhere in the flow diagram by picking a respective base model. This, however, needs to be reasonable: with model hubs hosting millions of published models, it can make sense to check whether a fine-tuning step has already been performed by someone else who shared the resulting model, especially for popular models in combination with open-source datasets.

Fine-tuning is an iterative, potentially recursive process: It is possible to perform multiple subsequent fine-tuning jobs on the way to your desired model. However, keep catastrophic forgetting in mind, as models can’t encode an infinite amount of information in their weights. To mitigate this, you can leverage parameter-efficient fine-tuning approaches like LoRA, as shown in this paper and blog (a minimal sketch follows after this list).

Task-specific performance uplift targeted: Fine-tuning is performed to uplift a model’s performance on a specific task. If you are looking for a performance uplift in linguistic patterns (domain-specific language, acronyms, etc.) or in information implicitly contained in your training data, continued pre-training is the right choice. If you want to uplift performance on a specific task, supervised fine-tuning should be chosen. If you want to align your model’s behaviour with your actual users, human preference alignment is the right choice.

Data availability: Training data will also influence which path you choose. In general, organisations hold larger amounts of unlabelled textual data than labelled data, and acquiring labelled data can be expensive. This dimension needs to be taken into consideration when navigating the flow chart.
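As referenced above, here is a minimal sketch of a parameter-efficient setup with the peft library; the target modules and hyperparameters are illustrative, not tuned recommendations.

# Minimal sketch: wrap a base model with LoRA adapters via the peft library so that
# only a small set of adapter weights is trained, which helps limit catastrophic
# forgetting of the base model's knowledge. All values are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example base model

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-specific)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well below 1% of all weights
# `model` can now be passed to your usual training setup for any of the fine-tuning steps.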

Example: You are building a chat model for a use case and want to uplift performance in a specific vertical linguistic domain, e.g. legal language. Unlabelled data in the form of legal documents is available. So you pick LLaMA-2-7B-chat as the base model and perform continued pre-training on the corpus of legal documents. This results in a fine-tuned legal version of LLaMA-2-7B-chat.
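A minimal sketch of this continued pre-training step with plain Hugging Face transformers (rather than the AWS services covered in part 3) could look like the following; paths, model ID and hyperparameters are illustrative assumptions.

# Minimal sketch: continued pre-training (causal language modelling) of a chat model
# on an unlabelled legal corpus. Paths, model ID and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example aligned starting point
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

ds = load_dataset("text", data_files={"train": "legal_corpus/*.txt"})["train"]
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-2-7b-chat-legal", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=2e-5, bf16=True, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # plain next-token prediction
)
trainer.train()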

With this working-backwards approach and the above flow chart, you can identify the model to start with and the path to take while traversing the fine-tuning flow diagram.

To make this a bit more concrete, we provide two examples:

Example 1

Example 1: Following the example illustrated in the fine-tuning section above, suppose you want an instruct model for your specific use case, but with uplifted performance in the BioTech domain. Unlabelled data in the form of research papers is available. You choose the LLaMA-2-7B model family as the desired starting point. Since Meta has not published a LLaMA-2-7B instruct model, you start from the text-completion model LLaMA-2-7B-base. Then you perform continued pre-training on the corpus of research papers, followed by supervised fine-tuning on an open-source instruct dataset like the “dolly-15k” dataset. This results in an instruct-fine-tuned BioTech version of LLaMA-2-7B-base, which you call BioLLaMA-2-7b-instruct. In the next step, you want to align the model to your actual users’ preferences, so you collect a preference dataset, train a reward model, and use RLHF with PPO to preference-align the model.
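A hedged sketch of the supervised fine-tuning step on dolly-15k with the trl library could look like this; the checkpoint path is a hypothetical output of the continued pre-training step, and exact argument names have shifted between trl releases.

# Minimal sketch: instruct fine-tuning on the open-source databricks-dolly-15k dataset
# with trl's SFTTrainer. The checkpoint path is hypothetical and argument names may
# differ slightly between trl versions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_text(example):
    # Very simple instruct template; production setups typically use a proper chat template.
    context = f"\n{example['context']}" if example["context"] else ""
    return {"text": f"### Instruction:\n{example['instruction']}{context}\n\n### Response:\n{example['response']}"}

ds = ds.map(to_text, remove_columns=ds.column_names)

trainer = SFTTrainer(
    model="./biollama-2-7b-cpt",  # hypothetical output of the continued pre-training step
    train_dataset=ds,
    args=SFTConfig(output_dir="biollama-2-7b-instruct", dataset_text_field="text",
                   per_device_train_batch_size=1, gradient_accumulation_steps=16,
                   num_train_epochs=2, learning_rate=1e-5, bf16=True),
)
trainer.train()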

Example 2

Example 2: In this example you are aiming to use a chat model for your use case, aligned to your actual users’ preferences. You choose the LLaMA-2-7B model family as the desired starting point. After browsing open-source model hubs, you find that Meta provides an off-the-shelf chat-fine-tuned model, LLaMA-2-7B-chat, which you can use as a starting point. In the next step, you want to align the model to your actual users’ preferences. For this you collect a preference dataset from your users, train a reward model, and use RLHF with PPO to preference-align the model.
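To make the reward-model part of this preference-alignment step more tangible, here is a minimal, library-agnostic sketch of a pairwise (Bradley-Terry style) reward-model loss in plain PyTorch/transformers. The model ID is an example, and the trained reward model would subsequently drive a PPO loop, for instance with the trl library.

# Minimal sketch: a pairwise reward-model loss for RLHF. For each preference pair the
# "chosen" response should receive a higher scalar score than the "rejected" one.
# Model ID and batch contents are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example starting point for the reward model
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token
rm = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)  # scalar reward head
rm.config.pad_token_id = tok.pad_token_id
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-5)

def reward_loss(chosen_texts, rejected_texts):
    chosen = tok(chosen_texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    rejected = tok(rejected_texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    r_chosen = rm(**chosen).logits.squeeze(-1)
    r_rejected = rm(**rejected).logits.squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()  # maximise the margin chosen > rejected

# The training loop iterates over batches of (prompt + chosen, prompt + rejected) pairs
# collected from your users; the resulting reward model then scores candidate responses
# inside the PPO-based preference-alignment loop.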

Conclusion

In this four-part series, we explored the what, why, and how of using continued pre-training for LLM domain adaptation.

We covered the fundamentals of domain adaptation, the different approaches that can be used, and the key factors to consider when selecting a foundation model and curating training data. We then took a practical look at how AWS services and capabilities like Amazon Bedrock, Amazon SageMaker JumpStart, and Amazon SageMaker Training can accelerate and simplify the continued pre-training process. From managed services to custom training environments, AWS provides flexible options to meet your domain adaptation needs. Finally, we dove deeper into advanced topics like model selection criteria, data preprocessing, and combining continued pre-training with other fine-tuning techniques for optimal performance on specific tasks.

Domain adaptation is an iterative process that often requires multiple stages and careful experimentation. By leveraging AWS’s powerful AI/ML infrastructure and services, you can streamline this process and create domain-specific LLMs that drive innovation and business value across your organization.


Aris Tsakpinis

Senior AI/ML Specialist SA @ AWS - PhD candidate for ML Engineering @ University of Regensburg — All opinions are my own