Building a Conversational AI with Memory on AWS Series: Prepare Your Dialogue Data for Fine-tuning Falcon and Llama-2-chat on SageMaker

Yinzhou Wang
5 min read · Nov 29, 2023


#2 in the series: format your conversation data into Llama2-chat/Falcon prompts

Source: https://dev.to/thenomadevel/falcon-180b-vs-llama-2-who-wins-the-ai-battle-2cj3

Introduction

To build an LLM-powered conversational AI with memory, the LLM needs to respond based on past conversations. So, during training, we must add the conversation history to the prompt. The way you structure your training data depends on both the model and the training technique. In the sections below, I will use Llama2-chat and Falcon as examples.

Both Llama2 and Falcon have a base version and an instruct version, e.g. Llama2-7b vs. Llama2-7b-chat. So, there are in total three training mechanisms I can think of:

  1. Finetune the base version with new data (no instruction)
  2. Finetune the base version with new data (with instructions)
  3. Finetune the instruct version with new data

It is hard to determine which way gives the best performance. Intuitively, fine-tuning an instruct model should give you better results than other methods. However, according to this post and my own experiments, this might not be the case. You probably want to try different methods and test fine-tuned models by interacting with them. I have tried methods #1 and #3, illustrated below.

Finetune Falcon with no instruction

The first mechanism is continuing the pretraining process. If we use this mechanism, the LLM won't understand instructions. Instead of returning a single response, it will keep completing the dialogue. However, during inference, we can specify a stop token so the LLM stops generating after it has produced one response.

User enters:
I am feeling a little bit down today. Can you help me?

LLM's output:
### Assistant: Yeah sure! Can you tell me a bit more of what's going on?
### User: I failed my mid-term because I didn't study enough
### Assistant: I am sorry to hear that. How are your feeling now?
### User: ....
### Assistant:.....

If we specify the stop token as ### User, then we will only get a single response:

User enters:
I am feeling a little bit down today. Can you help me?

LLM's output:
### Assistant: Yeah sure! Can you tell me a bit more of what's going on?
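
If you serve the model yourself with the transformers library, one way to implement such a stop token is a custom StoppingCriteria. The snippet below is a minimal sketch of mine (not from the original post); it assumes the fine-tuned Falcon model and its tokenizer are already loaded as model and tokenizer.

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnText(StoppingCriteria):
    """Stop generating once the stop text shows up in the newly generated tokens."""
    def __init__(self, tokenizer, stop_text, prompt_len):
        self.tokenizer = tokenizer
        self.stop_text = stop_text
        self.prompt_len = prompt_len  # number of prompt tokens to skip when checking

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(input_ids[0][self.prompt_len:], skip_special_tokens=True)
        return self.stop_text in generated

prompt = "### User: I am feeling a little bit down today. Can you help me?### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")
stop = StopOnText(tokenizer, "### User", inputs["input_ids"].shape[1])
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    stopping_criteria=StoppingCriteriaList([stop]),
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The decoded output will still end with the stop text itself, so strip the trailing ### User before showing the response to the user. (If you deploy behind a hosted inference container, e.g. Hugging Face TGI, you can usually pass a stop sequence as a request parameter instead.)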

To fine-tune your Falcon model to behave this way, you can structure your training data like this:

{"text": "### User: ...### Assistant: ...### User: ...### Assistant :...", 
"text": "### User: ...### Assistant: ...### User: ...### Assistant: ...",
...}

Each line is one JSON object, and each “text” field holds one dialogue. You can also find a similar format here.
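
If your dialogues already live in Python as lists of turns, a small helper like the one below can write them out in that JSON-lines format. This is my own sketch, not from the post; the dialogues variable and the file name are hypothetical.

import json

# Sketch: `dialogues` is assumed to be a list of dialogues, each a list of
# (speaker, utterance) tuples with speakers "User" and "Assistant"
dialogues = [
    [("User", "Hi"), ("Assistant", "Hello! How can I help?"),
     ("User", "I am feeling a little bit down today."),
     ("Assistant", "I am sorry to hear that. What's going on?")],
]

with open("falcon_train.json", "w") as f:
    for turns in dialogues:
        # join the turns into one "### Speaker: utterance" string per dialogue
        text = "".join(f"### {speaker}: {utterance}" for speaker, utterance in turns)
        f.write(json.dumps({"text": text}) + "\n")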

Finetune llama2-chat

For the instruct/chat version, the LLM knows to follow instructions and won’t generate endlessly.

Instruction:
You are a helpful assistant. Answer the user's question below.

User enters:
I am feeling a little bit down today. Can you help me?

LLM's output:
Yeah sure! Can you tell me a bit more of what's going on?

You need to follow the exact same format that was used for instruction-tuning the base model. For llama2-chat, it is a little tricky, but I find this format works for me:

[INST] <<SYS>>
You are a helpful Assistant. Respond based on the following conversation history.
<</SYS>>

###User: Hi###You: Hello!###User: How are you?###You: [/INST] I am doing great!</s>

You feed all the instructions, including the system prompt (inside <<SYS>> and <</SYS>>) and the conversation history, between [INST] and [/INST]. After [/INST] comes the assistant’s desired output. Don’t forget to add </s> at the end, which tells the model to stop generating. Also, watch out for the spacing: don’t add or drop any spaces! Notice that there is no <s> at the beginning. This is because the tokenizer adds it for you (check your tokenizer; add_bos_token is True by default).
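
If you want to double-check the BOS behavior on your side, a quick sketch like this works (assuming you have access to the gated meta-llama checkpoint; any Llama-2 tokenizer will do):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print(tokenizer.add_bos_token)  # True by default

ids = tokenizer("[INST] hi [/INST]")["input_ids"]
print(ids[0] == tokenizer.bos_token_id)  # True: <s> was prepended for you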

For each dialogue you have, break it down into multiple examples. For example, if one of your dialogues has 10 assistant responses, break it down into 10 training examples, where each example consists of a unique context-response pair (see the sketch after the Python snippet below).

In Python:

conversation_history = "###User: Hi###You: Hello!###User: How are you?"
ai_response = "I am doing great!"
system_message = "<<SYS>>\n" + "You are a helpful Assistant. Respond based on the following conversation history." + "\n<</SYS>>\n\n"
prompt = f"[INST] {system_message}{conversation_history}###You: [/INST] " + f"{ai_response}</s>"
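
To break a whole dialogue into those context-response pairs, a hypothetical helper (my sketch, not from the post) could look like this. It assumes each dialogue is a list of (speaker, utterance) tuples with speakers "User" and "You":

def dialogue_to_examples(turns, system_message):
    # One training example per assistant ("You") turn, conditioned on everything before it
    examples = []
    history = ""
    for speaker, utterance in turns:
        if speaker == "You" and history:
            prompt = f"[INST] {system_message}{history}###You: [/INST] {utterance}</s>"
            examples.append({"text": prompt})
        history += f"###{speaker}: {utterance}"
    return examples

A dialogue with 10 "You" turns then yields 10 training examples.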

Process and Upload to S3

If you haven’t set up SageMaker Studio yet, please look at this great tutorial by Anish Mahapatra.

In your SageMaker Studio, create a notebook (I used the Python 3 kernel). Some setup:

!pip install -q transformers datasets sagemaker s3fs --upgrade
import sagemaker
import boto3
import json

# This setup allows the notebook to interact with other AWS services
sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

# Define your variables
RAW_DATA_PATH = "raw_data.txt"
TRAIN_DATA_PATH = "train.json"
VAL_DATA_PATH = "val.json"

Depending on your raw data and the prompt format needed, you will need to define a function to preprocess the data. Sometimes a dialogue can be longer than max_seq_length (a parameter of the trainer), which can cause problems. You can define a helper function to truncate some of the earlier memories.

import re

def preprocess(raw_data_path):
    # read raw data
    # format into prompt (use truncate_helper if needed)
    # train_test_split
    # write data into the files defined above
    return


# This is what I did for llama2-chat data.
# Notice that I used split() instead of the tokenizer
# because it is much faster. Generally speaking,
# 1 word ≈ 1.3 tokens, so the function below truncates
# data to roughly 1.3 * 550 ≈ 715 tokens by dropping
# the earliest user/assistant exchanges first
def truncate_helper(text):
    pattern = r'###User:[^#]+###You:[^#]+'
    while len(text.split()) > 550 and re.search(pattern, text):
        text = re.sub(pattern, '', text, count=1)
    return text
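
For reference, here is a minimal sketch of what preprocess could look like for the Llama2-chat format, reusing truncate_helper above and the hypothetical dialogue_to_examples helper from earlier. It assumes (my assumption, not the post's) that each line of the raw file is one dialogue with turns written as ###User: ...###You: ...; adapt it to however your raw data is actually stored.

import json
import random

def preprocess(raw_data_path, val_ratio=0.1):
    # Assumed raw format: one dialogue per line, e.g.
    # "###User: Hi###You: Hello!###User: How are you?###You: I am doing great!"
    system_message = ("<<SYS>>\nYou are a helpful Assistant. "
                      "Respond based on the following conversation history.\n<</SYS>>\n\n")
    examples = []
    with open(raw_data_path) as f:
        for line in f:
            dialogue = truncate_helper(line.strip())
            turns = [t.split(": ", 1) for t in dialogue.split("###") if t]
            examples.extend(dialogue_to_examples(turns, system_message))

    random.shuffle(examples)
    n_val = max(1, int(len(examples) * val_ratio))
    # write JSON lines, one training example per line
    for path, rows in [(TRAIN_DATA_PATH, examples[n_val:]), (VAL_DATA_PATH, examples[:n_val])]:
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

Calling preprocess(RAW_DATA_PATH) would then write train.json and val.json as JSON lines, ready for the loading step below.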

After processing the data, load it using the datasets library:

from datasets import load_dataset
data_files = {"train": TRAIN_DATA_PATH, "val": VAL_DATA_PATH}

dataset_train = load_dataset("json", data_files=data_files, split="train")
dataset_val = load_dataset("json", data_files=data_files, split="val")

Take a look at the data to make sure the format is correct

print(dataset_val['text'][420])

Upload to S3

# upload dataset to s3 
train_input_path = f's3://{sess.default_bucket()}/processed/llama2-chat/train'
val_input_path = f's3://{sess.default_bucket()}/processed/llama2-chat/val'

dataset_train.save_to_disk(train_input_path)
dataset_val.save_to_disk(val_input_path)

Conclusion

Above, I showed data preparation for two training mechanisms. The next article will cover the actual fine-tuning of Falcon and Llama-2-chat on SageMaker.

Check out the previous article in this series:


