LLM domain adaptation using continued pre-training — Part 1/4

Gili Nachum
3 min read · May 8, 2024


Exploring domain adaptation via continued pre-training for large language models (LLMs)? This 4-part series answers the most common questions about why, how, and when to perform domain adaptation of LLMs via continued pre-training.
Written by: Anastasia Tzeveleka, Aris Tsakpinis, and Gili Nachum

Part 1: Introduction — You’re here!
Part 2: Training data — sourcing, selection, curation and pre-processing
Part 3: Continued pre-training on AWS
Part 4: Advanced: Model choice and downstream fine-tuning

What is LLM domain adaptation and when is it used?

Domain adaptation is the process of customizing a generative AI foundation model (FM) that has been trained on massive amounts of public data to increase its knowledge and capabilities for a specific domain or use case. This may mean adapting the model to excel at tasks in verticals like law, health, or finance, enhancing abilities in particular human languages, or personalizing the model to a company’s unique concepts and terminology. Domain adaptation is a powerful technique for making generative AI models and solutions enterprise-ready.

What is continued pre-training with respect to domain adaptation?

Continued pre-training (or continuous pre-training) refers to the practice of taking a foundation model such as Amazon Titan or Mistral-7B and continuing its training on large quantities of new unstructured data. The training process is similar to the one used for the original pre-training, and the new dataset is commonly referred to as the training set. Continued pre-training is often used for domain adaptation, where the training set contains domain-specific data such as manuals, documents, wiki pages, emails, FAQs, or text in a new language. The process updates the model parameters so that the model learns the domain knowledge, style, terminology, and governing principles.
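To make this concrete, here is a minimal sketch of continued pre-training using the Hugging Face Transformers Trainer. The model name, corpus path, and hyperparameters below are illustrative assumptions, not values from this series.

```python
# Minimal sketch of continued pre-training on raw domain text (assumptions: model
# name, corpus location, and hyperparameters are placeholders for illustration).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-v0.1"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # the base tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unstructured, unlabeled domain text (manuals, wiki pages, ...) as plain-text files.
raw = load_dataset("text", data_files={"train": "domain_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> causal LM objective: labels are the input ids shifted by one token.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="continued-pretraining",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,   # typically lower than the original pre-training LR
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

The same loop is what a managed service runs under the hood; Part 3 covers how to do this at scale on AWS.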

Figure 1: It’s far cheaper and faster to create a domain-adapted LLM via continued pre-training of an already pre-trained base model than to pre-train from scratch


The most popular models for text generation tasks, and as a result for continued pre-training, are decoder-only, autoregressive models. These models are pre-trained with unidirectional causal language modeling, predicting the next token based only on the preceding tokens. In practice, this comes down to masking out the future tokens of the input sequence during the training forward pass. During pre-training, these models ingest large amounts of unstructured, unlabeled data and use self-supervised learning to create the labels required for the next-token-prediction task, with no human annotation needed. Continued pre-training also gives you flexibility: you can change the model tokenizer, the embedding layer, and so on, and it can be combined with other domain adaptation techniques.
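The sketch below illustrates that self-supervised objective: the labels passed to a decoder-only model are simply the input token ids, which the model shifts internally so each position is trained to predict the next token, while the causal attention mask hides future tokens during the forward pass. The small GPT-2 model here is an arbitrary stand-in chosen so the example runs anywhere, not a recommendation.

```python
# Sketch of how self-supervised next-token labels are derived from raw text.
# No human annotation: the "label" for each position is simply the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in; any decoder-only LM behaves the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

text = "The patient presented with acute myocardial infarction."
enc = tok(text, return_tensors="pt")

# For causal LM training, labels are the input ids themselves; the model
# shifts them internally so position t is supervised to predict token t+1,
# and the causal mask prevents attention to future tokens.
out = model(**enc, labels=enc["input_ids"])
print(out.loss)  # cross-entropy over next-token predictions
```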

What are the alternatives to continuous pre-training for domain adaptation?

You can perform domain adaptation with the following techniques (a short sketch contrasting their typical data shapes follows the list):

  1. Prompt engineering and in-context learning: System and user prompts are used to guide the FM to generate outputs that match human expectations.
  2. Retrieval Augmented Generation (RAG): Generative AI applications that use RAG first retrieve information relevant to the user’s query from a domain-specific data source, then pass the query together with this extra information to the FM.
  3. Supervised fine-tuning: The model is fine-tuned on labeled pairs of prompts and completions.
  4. Pre-training from scratch: A small number of companies and organizations may choose to train a model from scratch on a wide range of domain-specific data.
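The sketch below contrasts the typical data shapes behind these options; all names and strings are made up for illustration.

```python
# Illustrative-only examples of what the "training/input data" looks like for
# each domain adaptation approach (none of these values come from the article).

# 1. Prompt engineering / in-context learning: guidance lives entirely in the prompt.
prompt = (
    "You are a financial analyst. Answer in formal English.\n"
    "Q: What does EBITDA stand for?\nA:"
)

# 2. RAG: passages retrieved at query time are concatenated into the prompt.
retrieved = "EBITDA = Earnings Before Interest, Taxes, Depreciation and Amortization."
rag_prompt = f"Context: {retrieved}\n\nQuestion: What does EBITDA stand for?\nAnswer:"

# 3. Supervised fine-tuning: labeled prompt/completion pairs.
sft_example = {
    "prompt": "Summarize the attached loan agreement.",
    "completion": "The agreement grants a 5-year term loan of ...",
}

# 4. Continued pre-training / pre-training from scratch: raw unstructured domain text.
cpt_example = "Section 4.2 Interest. The Borrower shall pay interest quarterly ..."
```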
Figure 2: Comparison of the alternatives for domain adaptation

Next

Part 1: Introduction — You’re here!
Part 2: Training data — sourcing, selection, curation and pre-processing
Part 3: Continued pre-training on AWS
Part 4: Advanced: Model choice and downstream fine-tuning
