LLM domain adaptation using continued pre-training — Part 2/4

Aris Tsakpinis
3 min read · May 21, 2024


Exploring domain adaptation via continued pre-training for large language models (LLMs)? This four-part series answers the most common questions on why, how, and when to adapt an LLM to your domain via continued pre-training.
Written by:
Anastasia Tzeveleka, Aris Tsakpinis, and Gili Nachum

Part 1: Introduction
Part 2: Training data — sourcing, selection, curation and pre-processing
— You’re here!
Part 3: Continued pre-training on AWS
Part 4: Advanced: Model choice and downstream fine-tuning

Training data — sourcing, selection, curation and pre-processing

In part 1, we reviewed domain adaptation and the different approaches you can use to adapt an LLM to your specific domain. We also discussed how continued pre-training allows you to continue training an LLM on unstructured data.

In this part of the blog post series, we delve into training data aspects such as data sourcing, selection, curation and pre-processing. These are important conceptual decisions to take, since they form the foundation of our future domain-adapted model.

What type of data do I need?

The type of data needed for continued pre-training depends on the domain and use case. Data selection and curation are crucial to domain adaptation, and the following aspects should be taken into consideration:

Data selection: You need to carefully select the set of data to be ingested into the model. Typically, this task requires high-quality, large datasets (millions of tokens or more). However, these datasets are still considerably smaller than pre-training datasets (trillions of tokens or more).

Domain-specific data: The nature of the domain adaptation approach implies that the datasets used are curated corpora of unlabelled and (optionally) labelled data specific to an organisational, knowledge or task-specific domain. In other words, you can use any full-text document in natural language that you consider relevant in content and of sufficient quality. Example datasets include in-house data (e.g. internal user manuals, internal documentation, legal contracts, research papers, court case documents, medical journal articles, legal articles, customer support logs), web data (e.g. news, social media posts), or even synthetic data. While this data can be sourced in different ways (document repositories, human-created content, …), it is important to carefully select the data not only with respect to quality, but also with regard to topics like confidentiality, intellectual property, licensing, PII and others.

Combination of domain-specific and web data: Providing only domain-specific data can potentially degrade overall model performance for certain use cases. For example, if a model is adapted using vast amounts of text in a new language, it can forget coding-related capabilities it learned from English text during pre-training. As a result, it is often advised to augment the domain-specific dataset with non-domain-specific, high-quality data; a code sketch for composing such a mix follows below.

Data composition and curation needs for pre-training vs. fine-tuning approaches like continued pre-training
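The snippet below is a minimal sketch of how such a data mix could be composed with the Hugging Face datasets library. The dataset names, the streaming setup and the 80/20 mixing ratio are illustrative assumptions, not recommendations.

```python
from datasets import load_dataset, interleave_datasets

# Illustrative only: a proprietary domain corpus (JSON Lines with a "text" field)
# and a public web corpus, both loaded in streaming mode.
domain_data = load_dataset(
    "json", data_files="my_domain_corpus.jsonl", split="train", streaming=True
)
web_data = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Mix the two sources, e.g. 80% domain-specific and 20% general web data,
# to reduce the risk of the model forgetting general capabilities.
mixed_data = interleave_datasets(
    [domain_data, web_data],
    probabilities=[0.8, 0.2],
    seed=42,
)
```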

How to load the data?

While many source-specific open-source data loaders exist, the LangChain framework has become increasingly popular in the domain of LLM-powered applications. LangChain provides a broad selection of pre-built data loaders, amongst them the WebBaseLoader for scraping websites. It can be used as shown in the code sample below.

Data sourcing through web-scraping with LangChain WebBaseLoader dataloader
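A minimal sketch of scraping a single page with WebBaseLoader; the URL is a placeholder, and the import path assumes the langchain-community package is installed.

```python
from langchain_community.document_loaders import WebBaseLoader

# Placeholder URL - replace with a web source relevant to your domain.
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Large_language_model")

# load() returns a list of Document objects with page_content and metadata.
documents = loader.load()

print(documents[0].metadata)               # e.g. source URL and title
print(documents[0].page_content[:500])     # first 500 characters of the scraped text
```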

How to pre-process the data?

Before feeding the data into the model, you typically perform several pre-processing steps to further enhance the data quality, amongst them:

Quality-related pre-processing, e.g. formatting, de-duplication, PII filtering. The code sample below showcases stripping of redundant whitespace to enhance data quality and information density.

Stripping of spaces as example for quality-related pre-processing
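A minimal sketch of such a cleanup step, assuming the raw documents are held in a Hugging Face datasets Dataset with a "text" column (an illustrative setup, not the only option):

```python
import re
from datasets import Dataset

# Toy example corpus with redundant whitespace.
dataset = Dataset.from_dict(
    {"text": ["  This   document  contains \n\n redundant   whitespace.  "]}
)

def strip_whitespace(example):
    # Collapse consecutive whitespace characters and trim the ends.
    cleaned = re.sub(r"\s+", " ", example["text"]).strip()
    return {"text": cleaned}

dataset = dataset.map(strip_whitespace)
print(dataset[0]["text"])  # "This document contains redundant whitespace."
```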

NLP-related pre-processing based on the characteristics of the respective transformer model. This involves tokenisation, mapping tokens into the model’s vocabulary and projecting them into a numerical vector space via embeddings (since neural networks can’t process tokens as strings), chunking of the input data according to the model’s context size (i.e. what the model can process in one forward pass through the neural network), and more. The example below shows chunking into pieces according to the model’s context length, as well as tokenisation with a model-specific tokeniser, in this case the one for LLaMA2–13b-chat.

Chunking and tokenization as example for NLP-related pre-processing tasks
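A sketch of these two steps with the Hugging Face transformers and datasets libraries; the model ID points to the gated Llama-2-13b-chat repository on the Hugging Face Hub, and the 4096-token context length as well as the "text" column name are assumptions for illustration.

```python
from transformers import AutoTokenizer

# Tokeniser of the target model (gated repository - access must be requested on the Hub).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

context_length = 4096  # Llama 2 context window

def tokenize_and_chunk(examples):
    # Tokenise the raw text without truncation ...
    tokenized = tokenizer(examples["text"])
    # ... concatenate all token ids of the batch ...
    concatenated = [token_id for ids in tokenized["input_ids"] for token_id in ids]
    # ... and split them into chunks that exactly fit the model's context size.
    total_length = (len(concatenated) // context_length) * context_length
    chunks = [
        concatenated[i : i + context_length]
        for i in range(0, total_length, context_length)
    ]
    return {"input_ids": chunks}

# Assuming `dataset` is a Hugging Face Dataset with a "text" column:
# lm_dataset = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])
```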

Next:

Part 1: Introduction
Part 2: Training data — sourcing, selection, curation and pre-processing
— You’re here!
Part 3: Continued pre-training on AWS
Part 4: Advanced: Model choice and downstream fine-tuning


Aris Tsakpinis

Senior AI/ML Specialist SA @ AWS - PhD candidate for ML Engineering @ University of Regensburg — All opinions are my own