Enhancing LLMs: A Journey like TUI’s

Luis Dias, PhD
Published in TUI Tech Blog
15 min read · Jul 10, 2024

Authors: John Ruiz and Luis Dias

Imagine entering a bustling travel agency, ready to embark on your dream vacation. The friendly travel advisor at TUI greets you with a warm smile and asks, “Where would you like to go?” You share your preferences, desires, and even a few quirky travel anecdotes. And just like that, TUI begins tailoring your journey — customising it to fit your unique tastes.

Now, let’s shift gears from the travel industry to the world of large language models (LLMs). At TUI, we have mastered the art of listening to our customers. We understand that a one-size-fits-all approach doesn’t cut it. Instead, we fine-tune our services based on real-world interactions and valuable feedback. It’s this customer-centric mindset that drives TUI’s success.

But what if we could apply a similar approach to LLMs? What if we could make language models not just grammatically correct, but also attuned to the nuances of user interactions? Enter Reinforcement Learning with Human Feedback (RLHF).

RLHF: Tailoring LLMs for Real-World Applications

Our journey begins with TUI as our inspiration. Just as TUI refines travel experiences, we aim to enhance LLMs by leveraging customer feedback. Why settle for a generic LLM when we can adapt it to specific contexts and tones of voice? In this article, we’ll explore the overall concept of RLHF, namely:

  1. Foundational Concepts: We’ll dive into available methods for fine-tuning pre-trained LLMs.
  2. Human Feedback and Reinforcement Learning (RL): Learn how human feedback data can be utilized alongside reinforcement learning to fine-tune LLMs.
  3. Python code examples: For the tech-savvy readers, we’ll provide concrete examples in Python. Expect snippets, key concepts, and techniques you can apply right away.

Our goal with this article is to provide a high-level overview of fine-tuning LLMs and how RLHF integrates with TUI’s reality. In upcoming articles, we’ll explore:

  • Scaling Human Feedback Collection: Strategies for efficiently gathering user insights.
  • Fine-Tuning with RL: A deep dive into RL techniques customized for LLMs.
  • End-to-End Implementation in AWS: We’ll guide you through the entire fine-tuning workflow.

How can we adapt and align LLMs?

LLM adaptation and alignment can be achieved through two primary methods: prompt engineering and fine-tuning. Prompt engineering involves designing effective prompts that guide the model to generate the desired responses, acting as a form of “soft” programming. Fine-tuning, on the other hand, is a “harder” approach in which the model is further trained on a task-specific dataset, enabling it to better align with the task’s requirements and nuances. Let’s explore each of these techniques in detail.

Prompt Engineering

At its core, prompt engineering enables direct interaction with LLMs using plain language prompts. It’s a departure from the traditional approach that requires deep knowledge of datasets, statistics, and modelling techniques. Instead, prompt engineering empowers developers and practitioners to communicate with LLMs effectively, even without specialised expertise. Prompt engineering is akin to providing precise instructions to a skilled assistant. It’s about crafting prompts that guide LLMs toward desired responses.

What is prompt engineering? (source)

In this article, we will focus on the popular types of prompts. For a complete list of the available techniques, please check the following prompt engineering guide.

Direct prompting (Zero-shot)

Direct prompting examples with GPT-3.5-Turbo (source)

These prompts explicitly guide the model by providing specific instructions. For instance:

  • “Translate the following English sentence to French: ‘Hello, world.’”
  • “Summarize the given news article in one paragraph.”
from openai import OpenAI

# Set your OpenAI API key
api_key = "<YOUR_API_KEY>"

# Create the OpenAI client
client = OpenAI(
    api_key=api_key,
)

# Generate a response using zero-shot prompting
def generate_response(client, prompt):
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-3.5-turbo",  # Choose the appropriate model (e.g., gpt-3.5-turbo)
    )
    return response.choices[0].message.content

# Define your prompt
prompt = '''
Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:
'''

# Get the model's response
chat_response = generate_response(client, prompt)
print("ChatGPT's response:", chat_response)

Python output:

ChatGPT’s response: neutral

Prompting with examples (few-shot prompting)

Few-shot prompting example (source)

Few-shot prompting involves presenting the model with one or more examples, illustrating precisely the desired behaviour for the generated output.

# Define your prompt 
prompt = '''
Instruction: Please determine if the sentiment of the Customer Feedback is "Positive" or "Negative".

Example 1:
Customer Feedback: "The hotel was amazing, the staff was friendly and helpful."
Sentiment: Positive

Example 2:
Customer Feedback: "The flight was delayed, and the customer service was poor."
Sentiment: Negative

Example 3:
Customer Feedback: "The tour guide was knowledgeable and the experience was unforgettable."
Sentiment: Positive

Example 4:
Customer Feedback: "The booking process was confusing and frustrating."
Sentiment: Negative

Example 5:
Customer Feedback: "The cruise was fantastic, everything was well-organized."
Sentiment: Positive

Customer Feedback: "The food could have been better!"
Sentiment:
'''

# Get the model's response
chat_response = generate_response(client, prompt)
print("ChatGPT's response:", chat_response)

Python output:

ChatGPT’s response: Negative

Chain-of-thought prompting

Chain-of-Thought prompting compared to standard prompting (source)

Chain-of-Thought (CoT) prompts are an extension of few-shot prompts in which the examples include the reasoning steps, prompting the LLM to explain its own reasoning before answering. This approach is particularly advantageous for complex tasks that require deliberation before generating a response.


# Define your prompt
prompt = '''
Q: Yesterday Filipa had 5 cities in her travel shortlist.
However, today she added 2 more countries.
Each country has 3 cities she would like to visit.
How many cities does she have now in her travel shortlist?

A: Filipa started with 5 cities in her shortlist.
2 countries with 3 cities is 6 cities. 5 + 6 = 11. The answer is 11.

Q: Yesterday Marley had 10 cities in his travel shortlist.
Today he removed 5 cities and added 3 more.
How many cities are in the travel shortlist?
'''

# Get the model's response
chat_response = generate_response(client, prompt)
print("ChatGPT's response:", chat_response)

Python output:

ChatGPT’s response:
A: Marley started with 10 cities in his shortlist.
He removed 5 cities, so he was left with 10 - 5 = 5 cities.
Then he added 3 more cities, so he now has 5 + 3 = 8 cities in his travel shortlist.

Zero-shot chain-of-thought prompting

Building upon the concept of zero-shot prompting discussed earlier, this approach enhances a zero-shot prompt by appending an instruction: “Let’s think step by step.” Consequently, the LLM is empowered to generate a coherent chain of thought, often resulting in more precise answers. This method proves highly effective in guiding LLMs towards generating accurate responses, particularly beneficial for addressing challenges such as logic-based problems.

# Define your prompt 
prompt = '''
Yesterday John had 5 cities in his travel shortlist.
However, today he added 2 more countries.
Each country has 3 cities he would like to visit.
How many cities does he have now in his travel shortlist?
Let's think step by step.
'''

# Get the model's response
chat_response = generate_response(client, prompt)
print("ChatGPT's response:", chat_response, sep="\n")

Python output:

ChatGPT’s response:
First, let’s calculate the original number of cities John had in his shortlist: 5 cities.

Next, let’s calculate the number of cities from the 2 additional countries: 2 countries x 3 cities each = 6 cities.

Finally, let’s sum up the original number of cities and the cities from the additional countries: 5 cities + 6 cities = 11 cities.

Therefore, John now has 11 cities in his travel shortlist.

Shortcomings of prompt engineering

Prompt engineering offers a pathway for individuals to tailor language models like GPT-4 to their unique needs without any AI expertise! As demonstrated, developers craft prompt instructions to direct the model’s outputs, making this sophisticated technology accessible to all. Nonetheless, prompt engineering comes with constraints. Performance plateaus at a certain threshold, and smaller models have limited capability for few-shot learning. Moreover, long prompts with many examples consume valuable context-window space and add computational cost to every request. Additionally, prompts remain static, with no mechanism for real-time optimisation. Thus, while prompt engineering is a commendable first step, its capacity to guide an LLM is ultimately capped. For this reason, fine-tuning can propel us further by customising models to specific domains, tasks, and user preferences.

Fine-tuning

Fine-tuning is a form of supervised machine learning that enhances a model’s performance by repeatedly comparing its output for a given input (e.g., an instruction prompt with a dialogue) to the ground-truth label (e.g., a baseline summary). Full fine-tuning updates all model weights, which demands sufficient memory and computational resources to store and process the gradients, optimizer states, and other components updated during training. Consequently, memory-optimisation and parallel-computing strategies are beneficial. Here’s a quick overview of the main steps of the fine-tuning process:

LLM fine-tuning high-level process

Prepare the Training Data

The training dataset consists of prompt-completion pairs tailored to the targeted task. Each pair includes an instructional component. For example, to improve the model’s summarization ability, construct a dataset with examples initiated by the instruction “summarize” or a similar phrase. Publicly available datasets or prompt template libraries can be utilised.

SQUAD dataset example (source)
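
For illustration, a single prompt-completion pair for a summarisation task might look like the hypothetical record below (the dialogue and summary are invented for this example):

# A hypothetical prompt-completion pair for a summarisation task
example = {
    "prompt": (
        "Summarize the following dialogue:\n"
        "Customer: Hi, I'd like to move my hotel booking to next week.\n"
        "Agent: Sure, I have rebooked you for the 14th at no extra cost.\n"
        "Summary:"
    ),
    "completion": "The customer asked to move their hotel booking and the agent rebooked it for the 14th free of charge.",
}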

Transform Existing Datasets

Convert existing datasets into instruction prompt datasets. Each prompt contains both an instruction and an example from the dataset.

Instruction tuning example (source)
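
As a small sketch, converting an existing record (here, an invented question-answer pair) into an instruction prompt can be as simple as filling a template:

# A hypothetical template for turning a Q&A record into an instruction prompt
template = "Answer the following question.\nQuestion: {question}\nAnswer:"

record = {
    "question": "Which river crosses the city of Porto?",
    "answer": "The Douro river.",
}

# The resulting prompt-completion pair used for instruction tuning
instruction_example = {
    "prompt": template.format(question=record["question"]),
    "completion": record["answer"],
}
print(instruction_example["prompt"])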

Fine-Tuning an LLM

  • Divide the dataset into training, validation, and test splits to ensure robust evaluation and generalisation of the model.
  • During fine-tuning, prompts from the training dataset are selected and passed to the LLM, which generates completions; these are compared with the responses in the training data.
  • The LLM output is a probability distribution over tokens. The loss is calculated with the standard cross-entropy function, comparing the predicted distribution at each position with the corresponding training label token (a minimal illustration follows below).
  • Update the model weights using standard backpropagation, processing multiple batches of prompt-completion pairs over several epochs.
OpenAI GPT-3.5-Turbo fine-tuning example (source)
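
To make the loss step concrete, here is a minimal sketch of token-level cross-entropy with PyTorch. The logits and label ids below are random placeholders standing in for an actual LLM’s output and the tokenised training completion.

import torch
import torch.nn.functional as F

# Toy setup: a vocabulary of 5 tokens and a completion of 3 tokens
vocab_size, seq_len = 5, 3

# Placeholder logits, standing in for the LLM's predicted distribution
# at each position of the completion (batch=1, seq_len, vocab_size)
logits = torch.randn(1, seq_len, vocab_size)

# Placeholder token ids of the ground-truth completion
labels = torch.tensor([[2, 0, 4]])

# Cross-entropy compares the predicted distribution at each position with
# the training label token; this is the loss backpropagated during fine-tuning
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(f"Token-level cross-entropy loss: {loss.item():.4f}")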

Below is the code for implementing single-node fine-tuning using a model from the Hugging Face model hub.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict

# Load the samsum dataset and select 100 samples from the training split
dataset = load_dataset('samsum')
dataset['train'] = dataset['train'].select(range(100))

# Prepare the dataset by converting it into instruction prompt format
def preprocess_function(examples):
    inputs = ["summarize: " + dialogue for dialogue in examples['dialogue']]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['summary'], max_length=150, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Initialize tokenizer and model
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Apply preprocessing
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    save_steps=100,
    eval_steps=100,
    logging_steps=20,
    report_to="none"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

# Start fine-tuning
trainer.train()

# Save the final model
trainer.save_model("./finetuned_model")
tokenizer.save_pretrained("./finetuned_model")

Fine-tuning with human feedback

In the context of the travel industry, where customer feedback plays a pivotal role in shaping user experiences, RLHF emerges as a powerful tool. Rather than relying solely on explicit prompts, RLHF leverages user preferences to enhance LLMs.

Instead of instructing the model explicitly, what if we allowed users to express their preferences? RLHF operates on this premise. When prompted, the model generates one or more responses. Users then provide feedback, indicating which responses they prefer. This feedback informs subsequent fine-tuning.

Fine-tuning with RLHF
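
Concretely, the kind of feedback this produces might look like the hypothetical record below (the prompt and responses are invented for illustration):

# A hypothetical preference record collected from a user
preference_record = {
    "prompt": "Suggest a weekend activity in Porto.",
    "responses": [
        "Take a rabelo boat tour on the Douro and visit a port wine cellar.",
        "Porto has many activities; you could do something outdoors or indoors.",
    ],
    "preferred": 0,  # index of the response the user ranked highest
}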

Benefits in Practice

  • Efficiency: Collecting user preferences is faster than crafting detailed prompts.
  • Quality Enhancement: The LLM can surpass human raters’ abilities by learning from user feedback.
  • Diverse Responses: The LLM isn’t constrained to fixed patterns; it adapts to individual preferences.
  • Comparison: Multiple model outputs can be evaluated simultaneously, leading to better recommendations.

Recapping Reinforcement Learning

RL is a machine learning paradigm where an agent learns to make optimal decisions by interacting with an environment. Unlike supervised learning (which relies on labelled data) or unsupervised learning, RL learns through trial and error. The agent takes actions, observes the environment’s responses, and receives rewards or penalties based on its actions. RL is employed when:

  • Explicit instructions are challenging to provide.
  • Learning from experiences (interactions with the environment) is more effective.
  • The goal is to maximize cumulative rewards over time.

An RL problem can be formalised as a Markov Decision Process (MDP), which consists of:

  • State: Represents the environment’s current situation.
  • Action: Decisions taken by the agent.
  • Reward: Feedback received after each action.
  • Policy: Strategy for selecting actions.
  • Value Function: Estimates the expected cumulative reward.
  • Environment Dynamics: How the environment responds to actions.
Tic-tac-toe game example

The goal is to find an optimal policy that maximises the value function. Generally, the value function measures how successful the agent is in the environment we have modelled (e.g., customer satisfaction).

Q-Learning

Q-Learning is a model-free RL algorithm. It learns an action-value function (Q-values) that estimates the expected cumulative reward for each action in a given state. The Q-values are updated iteratively based on the Bellman equation:

Q(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))

Here, α is the learning rate, γ the discount factor, r the observed reward, and s' the next state. The exploration-exploitation trade-off is handled by the epsilon parameter. After training, the learned Q-values guide the recommendations provided to each user: the agent (the recommendation system) learns to suggest the best travel destination based on rewards (user satisfaction) and Q-values (action values).
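
As a quick sanity check of the update rule, here is a single Q-value update worked through with arbitrary, illustrative numbers (they are unrelated to the travel example below):

# One hand-worked Q-update with arbitrary, illustrative numbers
alpha, gamma = 0.5, 0.9   # learning rate and discount factor
q_sa = 2.0                # current estimate Q(s, a)
reward = 1.0              # reward observed after taking action a in state s
max_q_next = 4.0          # max over a' of Q(s', a')

q_sa = q_sa + alpha * (reward + gamma * max_q_next - q_sa)
print(q_sa)  # 2.0 + 0.5 * (1.0 + 3.6 - 2.0) = 3.3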

Example in the travel industry

Imagine an RL-based travel recommendation system:

  • State: Represents the user’s context (e.g. time).
  • Action: The system suggests a travel activity (e.g. visit a city).
  • Reward Function: Assigns rewards based on user satisfaction (e.g., positive for enjoyable experiences, negative for disappointing ones).
  • Agent: The recommendation system.
  • Environment: The real-world travel context (e.g., city, attractions, weather).

Below we present a code snippet showcasing a basic implementation of an RL-based agent with Q-learning (sample code adapted from here). This agent is designed to learn optimal recommendations for both the destination city and the ideal time of travel for users. In this illustrative example, we utilize arrival dates as the state representation, as these directly impact the user’s experience in the destination city. For instance, factors such as weather conditions, which are linked to the time of the year, can influence the overall experience. The action space is defined by the recommended destination city.

import numpy as np
import random

# Define travel destinations and arrival dates
destinations = ["Porto", "Hannover", "Madrid"]
arrival_dates = ["2024-03-25", "2024-03-28", "2024-04-01"]
num_destinations = len(destinations)
num_dates = len(arrival_dates)

print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", num_dates)
print("Sample observation", np.random.choice(arrival_dates))  # Get a random observation

print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", num_destinations)
print("Action Space Sample", np.random.choice(destinations))  # Take a random action

# Create and initialize the Q-table
state_space = num_dates
print("There are ", state_space, " possible states")

action_space = num_destinations
print("There are ", action_space, " possible actions")

# Create our Q-table of size (state_space, action_space) and initialize each value at 0 using np.zeros
def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_travel = initialize_q_table(state_space, action_space)

# Define the epsilon-greedy policy
def epsilon_greedy_policy(Qtable, state, epsilon, destinations, arrival_dates):
    # Randomly generate a number between 0 and 1
    random_int = random.uniform(0, 1)
    # if random_int is greater than epsilon --> exploitation
    if random_int > epsilon:
        # Take the action with the highest value given a state
        # np.argmax can be useful here
        ix_state = arrival_dates.index(state)
        action = destinations[np.argmax(Qtable[ix_state])]
    # else --> exploration
    else:
        action = np.random.choice(destinations)

    return action

# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate
eval_seed = []               # The evaluation seed of the environment

# Exploration parameters
max_epsilon = 1.0            # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.0005          # Exponential decay rate for exploration prob

# Define the reward function (considering weather and holidays)
def get_reward(destination, arrival_date):
    base_rewards = {
        "Porto": 10,
        "Hannover": 6,
        "Madrid": 9
    }

    # Adjust rewards based on arrival date (weather and holidays)
    if destination == "Porto":
        # Example: Increase reward during pleasant weather (spring/summer)
        if "2024-03" <= arrival_date <= "2024-09":
            return base_rewards[destination] + 5
    elif destination == "Madrid":
        # Example: Increase reward during San Isidro festival (May 15)
        if arrival_date == "2024-05-15":
            return base_rewards[destination] + 3

    # Default reward (no specific adjustments)
    return base_rewards[destination]

# Training the model
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, arrival_dates, destinations, max_steps, Qtable):
    for episode in range(n_training_episodes):
        # Reduce epsilon (because we need less and less exploration)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        # Reset the environment
        state = np.random.choice(arrival_dates)
        step = 0
        done = False

        # repeat
        for step in range(max_steps):
            # Choose the action At using the epsilon-greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon, destinations, arrival_dates)

            # Take action At and observe Rt+1 and St+1
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state = np.random.choice(arrival_dates)

            # Compute the reward
            reward = get_reward(action, new_state)

            # Get indexes
            ix_action = destinations.index(action)
            ix_state = arrival_dates.index(state)
            ix_new_state = arrival_dates.index(new_state)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[ix_state][ix_action] = Qtable[ix_state][ix_action] + learning_rate * (reward + gamma * np.max(Qtable[ix_new_state]) - Qtable[ix_state][ix_action])

            # Our state is the new state
            state = new_state
    return Qtable

Qtable_travel = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, arrival_dates, destinations, max_steps, Qtable_travel)

# Trained Q-Learning table
print(f"Trained Q-Learning table:\n\n{Qtable_travel}\n\n")

# After training, recommend a destination and arrival date
# Note: rows of the Q-table are states (arrival dates) and columns are actions (destinations)
best_date_idx, best_dest_idx = np.unravel_index(np.argmax(Qtable_travel), Qtable_travel.shape)
recommended_destination = destinations[best_dest_idx]
recommended_date = arrival_dates[best_date_idx]
print(f"Recommended destination: {recommended_destination} (Arrival date: {recommended_date})")

Python output:

_____OBSERVATION SPACE_____

Observation Space 3
Sample observation 2024-03-25

_____ACTION SPACE_____

Action Space Shape 3
Action Space Sample Hannover
There are 3 possible states
There are 3 possible actions
Trained Q-Learning table:

[[300. 291. 294.]
[300. 291. 294.]
[300. 291. 294.]]

Recommended destination: Porto (Arrival date: 2024-03-25)

More on RL

Remember, RL can be much more complex, but this simplified example illustrates the core concepts. For more detailed tutorials, check out “An Introduction to Q-Learning: A Tutorial For Beginners” and “Reinforcement Learning from Scratch in Python with OpenAI Gym”.

Fine-tuning LLMs with RLHF

RLHF training process (source)

Imagine a travel chatbot designed to assist users in planning their vacations. Traditionally, an LLM relies on predefined prompts to generate responses. For instance, if a user asks, “What are the best restaurants in Barcelona?”, the LLM might use a fixed prompt like: “Here are some top-rated restaurants in Barcelona: [list of restaurants].” However, with RLHF, the approach shifts. Here’s how it works:

1 — Initial Prompt and Model Response:

  • The user asks the LLM about Barcelona restaurants.
  • The LLM generates a response using its language model.
  • For example: “Here are some top-rated restaurants in Barcelona: [list of restaurants].”

2 — User Feedback Collection:

  • Instead of relying solely on the LLM output, the model presents multiple alternative responses to the user.
  • The user is asked to rank these responses based on their preferences.
  • For instance, the LLM might provide three different restaurant lists and ask the user to choose the one they find most helpful.

3 — Model Reweighting:

  • The user’s preference ranking serves as feedback.
  • Responses that align with the user’s preferences receive higher weights, while less favoured responses are down-weighted (a minimal sketch of this preference-based training signal follows the list below).

4 — Fine-Tuning and Adaptation:

  • The LLM is fine-tuned using the collected feedback.
  • The model learns to generate responses that better match user preferences.
  • For example, if the user consistently prefers lists with local, budget-friendly eateries, the LLM adapts to emphasize such recommendations.

5 — Improved User Experience:

  • Over time, the LLM becomes more attuned to user preferences.
  • It provides personalised and context-aware responses.
  • For instance, it might offer tailored restaurant suggestions based on the user’s dietary preferences, location, and budget.
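
To make steps 3 and 4 a little more concrete, below is a minimal, hypothetical sketch of how ranked feedback is commonly turned into a training signal: a reward model is trained with a pairwise loss so that responses users preferred score higher than the ones they rejected; that reward model can then drive RL fine-tuning of the LLM, which we will cover in depth in a follow-up article. The ToyRewardModel and random features are placeholders for a real text encoder, not TUI’s production setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical reward model: scores a (prompt, response) pair with a single scalar.
# In practice this would wrap a pretrained transformer; a toy encoder stands in here.
class ToyRewardModel(nn.Module):
    def __init__(self, embedding_dim=64):
        super().__init__()
        self.encoder = nn.Linear(embedding_dim, 32)
        self.score_head = nn.Linear(32, 1)

    def forward(self, features):
        return self.score_head(torch.relu(self.encoder(features))).squeeze(-1)

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder features for the responses users preferred and rejected.
# In a real pipeline these would be encodings of the (prompt, response) text.
preferred_features = torch.randn(8, 64)
rejected_features = torch.randn(8, 64)

# Pairwise (Bradley-Terry style) loss: push the preferred response's score
# above the rejected one's, so that user rankings reweight the model.
preferred_scores = reward_model(preferred_features)
rejected_scores = reward_model(rejected_features)
loss = -F.logsigmoid(preferred_scores - rejected_scores).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Pairwise preference loss: {loss.item():.4f}")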

Next destination

In this short introduction to enhancing LLMs, we’ve covered:

  • Prompt engineering as a basic technique, along with its limitations; instruction tuning allows more customisation, but can still feel rigid.
  • How fine-tuning can be used to improve an LLM’s capabilities on specific tasks.
  • Python code examples, to make it as fun as going on holidays (at least we tried 😀), so you can see these techniques in action.
  • How RL with human judgment enables LLMs to evolve into more intuitive, responsive, and human-aligned tools that adapt to nuanced feedback instead of just static datasets.

We have only begun to scratch the surface of their potential. Stay tuned as we do a deeper dive into RL for LLMs. This personalised approach is where the true magic lies in shaping powerful AI systems. More to come! ✈️
