Fine-Tuning a Chat-Based LLM with Multi-Turn Conversational Data (Part I)

Bin Xue
7 min readJan 18, 2024

I was tasked with fine-tuning a chat-based LLM on multi-turn conversational data, in which the user and assistant take turns to reply.

Image borrowed from a nice article: https://itnext.io/building-a-multi-turn-chatbot-with-gpt-and-sagemaker-a-step-by-step-guide-7d75f33ccea1

As we know, LLMs are first trained to predict the next token on massive amounts of data (trillions of tokens), which gives us the foundation model. The foundation model is then further trained with supervised fine-tuning, which gives us a wide range of LLMs with different core functionalities. For example, mistralai/Mixtral-8x7B-Instruct-v0.1 has been fine-tuned on instruction-style data and therefore follows instructions better, while meta-llama/Llama-2-7b-chat-hf is optimised for chat. The chat-based model is our focus here, and it is quite different from a single-turn question-answering LLM: it needs to generate a new response while being aware of the recent conversation.

Install the prerequisite libraries before you continue. Assuming you're using a notebook:

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U datasets
!pip install -q -U flash-attn
!pip install -q -U trl

I will fine-tune the Llama-2-7b-chat-hf model using my custom dataset.

Dataset Processing/Preparation

More than once I have seen people claiming to showcase fine-tuning a chat-based model with a dataset loaded like below and passed directly to a Hugging Face trainer instance:

from datasets import load_dataset
dataset = load_dataset('timdettmers/openassistant-guanaco')

This is completely wrong! If you pass it directly to the trainer class, it will train the model to predict the next token starting from the first token of the user's input. But what you really want is for the model to learn to generate a response given an instruction or conversation history. This means we need to mask the user's inputs in each conversation round so that they do not contribute to the loss during training. The same goes for instruction fine-tuning: you are not trying to get the model to predict the next token of the instruction either; instead you should mask the instruction and train the model to predict the response.
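To make the masking concrete, here is a tiny conceptual sketch (the token ids are illustrative only; the actual masking is done automatically by the collator introduced later):

# Conceptual sketch only: given one conversation turn, mask the user/prompt part
# of the labels with -100 so that just the assistant's reply contributes to the loss
prompt_ids = [1, 518, 25580, 29962, 22172, 518, 29914, 25580, 29962]  # <s> [INST] ... [/INST] (illustrative ids)
response_ids = [306, 626, 1781, 2]                                    # assistant reply tokens ... </s> (illustrative ids)

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids  # loss is computed on the response tokens only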

I prepared my dataset in the following format because I want to use Hugging Face's chat template built for chat models.

# a demo of a SINGLE multi-turn conversation sample 
[{'role':'user', 'content': 'hello, how are you?'},
{'role':'assistant', 'content': 'I am good, thanks, and you?'},
{'role':'user',...
...]

Different LLMs use different chat templates to format their data for training. It's best to follow the template of the model you are fine-tuning so you benefit from what it learned during its own training. For the Llama 2 chat model, the above conversation sample will be turned into

<s>[INST] hello, how are you? [/INST] I am good, thanks, and you? </s><s>[INST]...

where the user's content is wrapped in [INST] and [/INST], and each turn starts with <s> and ends with </s>. I apply the chat template to my custom dataset in a pandas DataFrame (after creating the Llama 2 tokenizer):

from transformers import AutoTokenizer

checkpoint = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                          padding_side='right')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

df_final['template_formatted_conversation_turns'] = df_final['conversation_turns'].apply(lambda x: tokenizer.apply_chat_template(x, tokenize=False))
# all samples will be formatted in the form <s>[INST]...[/INST] ... </s>...

Since Llama 2 doesn't have a padding token, we need to add one to the tokenizer, and we also need to resize the model's embedding matrix later because of the extra special token.
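The resizing itself happens once the model is loaded. A minimal sketch, assuming the model is loaded plainly with AutoModelForCausalLM (the actual quantised/PEFT setup is covered in the fine-tuning part):

from transformers import AutoModelForCausalLM

# the embedding matrix must grow by one row to account for the new [PAD] token
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id  # tell the model which id is padding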

NOTE: you will often see online tutorials assign the padding token to be the same as the end-of-sequence (eos) token. You MUST NOT do that here, because it causes problems for multi-turn conversation data (eos appears in multiple places, confusing the model).

Also note that I've used right padding. There's a pitfall if you use left padding: when left padding is applied to the shorter sequences in a batch, the positional embeddings get messed up, because the positional embedding will not adjust itself according to the padding tokens.

Next we can go ahead and create a dataset from the formatted conversations:

import torch
from datasets import Dataset
from trl import DataCollatorForCompletionOnlyLM

dataset = Dataset.from_list(df_final['template_formatted_conversation_turns'].apply(lambda x: tokenizer(x, return_length=True)).to_list())
response_template = '[/INST]'
instruction_template = '[INST]'
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer)

Note that when I tokenize the dataset I pass in the parameter return_length=True to have it return the number of tokens in each sample along with the other outputs. That's because I want to group sequences of similar length into the same batch so that minimal padding is required, which makes training more efficient. If you're not sure what this means, take a look at this article about bucketizing the samples.
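As a quick illustration of that extra output (the sentence below is just a placeholder, not a real sample from my dataset):

# return_length=True adds a 'length' entry alongside input_ids / attention_mask;
# it will be used later to group samples of similar length into the same batch
enc = tokenizer('<s>[INST] hello, how are you? [/INST] I am good, thanks </s>',
                return_length=True)
print(enc.keys())     # includes 'length'
print(enc['length'])  # the token count of this sample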

Just a side note here: in the Hugging Face official tutorials as well as in the Llama paper, the samples are 'packed' together and then truncated at the maximum allowed length. I strongly recommend not doing that for fine-tuning, for several reasons:

  1. if two unrelated sequences are packed together, causal attention lets the preceding sample contaminate the later one
  2. the positional information will confuse the model
  3. a custom dataset is usually small enough that we don't need to pack samples together

Back to the code snippet: I used DataCollatorForCompletionOnlyLM with an instruction template as well as a response template. For illustration I will load the dataset and show a batch below.

# Just for illustration
dataloader = torch.utils.data.DataLoader(dataset=dataset,
                                         collate_fn=collator,
                                         batch_size=2)
for batch in dataloader:
    print(batch)
    break
--------------------------------------------------------------
{'input_ids': tensor([[ 1, 1, 518, 25580, 29962, 6804, 1258, 366, 6548, 701,
29973, 518, 29914, 25580, 29962, 306, 13631, 701, 297, 12524,
29911, 11937, 29946, 29896, 29889, 29871, 2, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000],
[ 1, 1, 518, 25580, 29962, 6804, 1258, 366, 6548, 701,
29973, 518, 29914, 25580, 29962, 306, 13631, 701, 297, 12524,
29911, 11937, 29946, 29896, 29889, 29871, 2, 1, 518, 25580,
29962, 1724, 471, 596, 7271, 763, 15678, 701, 727, 29973,
518, 29914, 25580, 29962, 739, 471, 263, 24252, 2058, 29892,
411, 4933, 29899, 28798, 16661, 322, 889, 1607, 12088, 886,
1641, 3619, 29889, 29871, 2]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
'length': tensor([[27],
[65]]),
'labels': tensor([[ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, 306, 13631, 701, 297, 12524,
29911, 11937, 29946, 29896, 29889, 29871, 2, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100],
[ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, 306, 13631, 701, 297, 12524,
29911, 11937, 29946, 29896, 29889, 29871, 2, 1, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, 739, 471, 263, 24252, 2058, 29892,
411, 4933, 29899, 28798, 16661, 322, 889, 1607, 12088, 886,
1641, 3619, 29889, 29871, 2]])}

The attention_mask shows that the first (shorter) sample is right padded with 0, and the labels have some positions filled with -100. PyTorch ignores -100 when computing the loss, so that's how the user inputs are masked and prevented from contributing to the loss.
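To see why -100 works as a mask: it is the default ignore_index of PyTorch's cross-entropy loss, so masked positions are skipped entirely when the loss is averaged (toy shapes below, not real model outputs):

import torch
import torch.nn.functional as F

logits = torch.randn(5, 32001)                      # 5 token positions, toy vocab size
labels = torch.tensor([-100, -100, 306, 13631, 2])  # first two positions are masked
loss = F.cross_entropy(logits, labels)              # only the last three positions count
print(loss)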

What the collator does is use the instruction/response templates to locate where each user input and each assistant response starts. It then fills the non-response sections of the labels with -100. Take an example conversation for illustration:

<s>[INST] Hello, how are you? [/INST] I am good </s><s>[INST] what time is it? [/INST] It's 3 o'clock </s>
Tok1, Tok2, ... Tokn1, .... Tokn2 ...... Tokn3 ...................
-100, -100, ... -100 , 21, 12, ... 3,-100, -100, ..................-100, 22,157,231,.......

The last line is the corresponding 'label', where each turn's <s>[INST]…[/INST] section is masked with -100. Only the assistant responses are left unmasked, and only those token predictions contribute to the loss during training.

And that's the dataset created along with the collator. Both will be passed to the trainer class, whose training arguments will have two parameters set (a minimal sketch follows the list):

  1. group_by_length=True, which batches together samples of similar length to improve training efficiency
  2. length_column_name='length': the default value is already 'length', but I want to show that it uses the extra length output we created with the tokenizer; the dataloader uses it to batch samples of similar length
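As a minimal sketch of that trainer setup (the output directory, batch size and the plainly loaded model are placeholders, not my actual training configuration, which comes in the next part):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='llama2-chat-finetune',  # placeholder
    per_device_train_batch_size=2,      # placeholder
    group_by_length=True,               # batch samples of similar length together
    length_column_name='length',        # the extra field returned by the tokenizer
)

trainer = Trainer(
    model=model,               # the Llama 2 chat model with resized embeddings
    args=training_args,
    train_dataset=dataset,
    data_collator=collator,
    tokenizer=tokenizer,
)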

That’s the dataset sorted! Next we can feed it to the trainer for fine-tuning.
