Demystifying Chat Templates of LLMs using llama-cpp and ctransformers

Ahmet Celebi
13 min read · Feb 12, 2024

Introduction

In the expansive field of artificial intelligence, Large Language Models (LLMs) represent a significant leap towards achieving human-like understanding and generation of text. The development process of these models is meticulously structured into distinct stages, primarily pre-training and fine-tuning. Among the various strategies employed to enhance the conversational abilities of LLMs, chat templates stand out for their crucial role during the fine-tuning phase. This article aims to shed light on this process, particularly emphasizing how chat templates transform a general-purpose LLM into a specialized conversational agent.

Hi, I’m Ahmet. After three years in data science, I’ve taken a big step forward — I recently launched a startup focused on accelerating the drug discovery process for the biopharma industry using automated deep learning. We’re deep into exploring transformers and LLMs, and I’m here to simplify these complex topics for everyone. Alongside, I’ll share insights from my journey as a co-founder of an AI-driven startup. Stick around for stories from the cutting edge of AI in healthcare.

Pre-training: Laying the Foundation with Perplexity and Prediction

The journey of an LLM begins with pre-training, where the model is exposed to vast amounts of text data. The goal here is to reduce perplexity — a measure of how well the model predicts a sample. In simpler terms, perplexity gauges the model’s uncertainty in predicting the next word in a sequence. The lower the perplexity, the better the model is at forecasting upcoming words, indicating a higher level of understanding of the language.
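To make that concrete, here is a toy sketch of how perplexity is computed from the probabilities a model assigns to each token (the probability values below are invented purely for illustration):

import math

# Perplexity is the exponential of the average negative log-likelihood
# the model assigns to each token in a sequence.
token_probs = [0.25, 0.10, 0.60, 0.05]  # invented p(next token | context) values

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

print(round(perplexity, 2))  # lower is better: the model is less "surprised" by the text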

A key concept introduced at this stage is the use of Beginning Of Sequence (BOS) and End Of Sequence (EOS) tokens. These tokens play a pivotal role in structuring the model’s understanding of text sequences.

  • BOS Token: To initiate text generation, a BOS token is presented to the model. This token signals the start of a new sequence, effectively “firing” the model’s generation process. It’s akin to the starting gun in a race, indicating that the model should begin producing text.
  • EOS Token: As the model generates text, it continues until an EOS token is produced. The EOS token indicates the conclusion of the current sequence. It serves as a stop sign, telling the model to halt further text generation.

This mechanism of BOS and EOS tokens is essential not just for demarcating the boundaries of text but also for guiding the model in generating coherent and contextually bounded responses.
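As a quick illustration, you can inspect these special tokens directly on a tokenizer. A minimal sketch using the Hugging Face tokenizer for Mistral (the same model family used later in this article; the exact tokens and ids depend on the model you load):

from transformers import AutoTokenizer

# Inspect the special tokens of a Llama/Mistral-style tokenizer.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print(tokenizer.bos_token, tokenizer.bos_token_id)  # '<s>' and its id
print(tokenizer.eos_token, tokenizer.eos_token_id)  # '</s>' and its id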

Pre-training is traditionally carried out through next-word prediction tasks. However, excelling at these tasks alone does not guarantee that the model can engage in meaningful conversations or answer questions as expected. Its ability to generate coherent and contextually appropriate responses in a dialogue requires further refinement.

Fine-tuning: Specializing the Model with Conversational Data

Adapting Large Language Models (LLMs) post-pre-training involves refining their capabilities to better suit specific tasks and align with human values or preferences. This adaptation process is crucial for enhancing the model’s utility in practical applications and ensuring its behavior is ethically and socially responsible. We’ll explore two primary adaptation strategies: instruction tuning and alignment tuning, along with considerations for efficient tuning and quantization in resource-constrained settings.

During this phase, the model is trained on conversational data, which could be in the form of questions and answers or multi-turn chats. This data can originate from various sources, including human-generated content, a combination of human and AI interactions, or entirely AI-generated datasets, such as the famous [52k Self-Instruct dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html) created to fine-tune Alpaca.

Instruction Tuning

Instruction tuning represents a targeted approach to fine-tuning LLMs using instances formatted as natural language instructions. This method builds upon the concepts of supervised fine-tuning and multi-task prompted training, where the model is trained to understand and execute tasks described in natural language instructions. The process begins with the collection or construction of instruction-formatted instances. These instances are then used to fine-tune the LLM in a supervised manner, typically employing sequence-to-sequence loss to enhance the model’s ability to generalize across unseen tasks, including those in multilingual contexts.
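For intuition, here is what a single instruction-formatted instance typically looks like, in the Alpaca style mentioned above (the structure follows the Alpaca release; the content of this particular record is invented for illustration):

# One Alpaca-style instruction instance: an instruction, optional input, and target output.
example = {
    "instruction": "Classify the sentiment of the following product review.",
    "input": "The battery died after two days. Very disappointing.",
    "output": "Negative",
}

# During supervised fine-tuning, such records are rendered into a prompt/response
# string and the model is trained with a next-token loss on the response part.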

Instruction tuning has been demonstrated to significantly improve LLMs’ performance on a variety of tasks, enabling them to understand and follow instructions more effectively. This approach has been integral to the development of models like InstructGPT and GPT-4, which exhibit enhanced responsiveness to user commands and queries. The key to successful instruction tuning lies in the careful selection and design of instructional instances, ensuring they are representative of the tasks the model is expected to perform and the manner in which instructions are typically phrased.

Alignment Tuning

Alignment tuning focuses on aligning the model’s outputs with human values or preferences. This involves the collection of human feedback data and the application of techniques such as reinforcement learning from human feedback (RLHF).

Background and Criteria for Alignment

Despite their advanced capabilities, LLMs can sometimes generate outputs that are misleading, biased, or otherwise not aligned with desirable human values. This misalignment arises because the primary objective of language modeling, which is predicting the next word in a sequence, does not inherently consider these values or preferences. Alignment tuning aims to correct this by directing the model towards outputs that are considered helpful, honest, and harmless.

The process of alignment tuning starts with defining the specific criteria that represent human values, such as fairness, accuracy, and safety. Human feedback is then collected, reflecting these values and used to adjust the model’s behavior. This is typically achieved through RLHF, where the model’s predictions are fine-tuned based on feedback indicating how well they align with human expectations.
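To give a feel for what that feedback data looks like, here is a sketch of one preference record of the kind used to train a reward model for RLHF (field names and content are illustrative, not taken from any specific dataset):

# A prompt with a preferred ("chosen") and a dispreferred ("rejected") response,
# as labelled by human annotators; a reward model learns to score "chosen" higher.
preference_example = {
    "prompt": "Explain how vaccines work.",
    "chosen": "Vaccines expose the immune system to a harmless piece of a pathogen so it can learn to recognize the real thing.",
    "rejected": "Vaccines don't do anything; you should simply avoid doctors.",
}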

One of the challenges in alignment tuning is the potential for an “alignment tax,” where efforts to align the model’s outputs with specific human values may inadvertently reduce its overall performance or generalization capabilities. This trade-off highlights the complexity of balancing model effectiveness with ethical considerations.

Chat Templates: The Blueprint for Conversational Models

During fine-tuning, chat templates emerge as a critical tool. These templates enrich the prompt with special tokens beyond the standard Beginning Of Sequence (BOS) and End Of Sequence (EOS) tokens, marking where system instructions, user messages, and assistant replies begin and end.

Each fine-tuned version of an LLM is associated with its unique chat template. This template dictates the format the model expects and produces during conversations, ensuring consistency and coherence in its responses. For instance, the chat template for a model like “Mistral-7b-instruct” might look like:

<s>[INST] {prompt} [/INST]

Another model, “CapybaraHermes-2.5-Mistral-7B,” could use a different template:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

When implementing or further fine-tuning a model with an established chat template, it’s crucial to maintain the template’s integrity. Deviating from the established format can lead to performance degradation, as the model’s responses are conditioned on the specific structure it was trained on. This principle is akin to tokenization, where the best results are achieved by adhering to the tokenization approach used during the model’s training.

For those embarking on training a model from scratch or fine-tuning a base language model for chat functionalities, there is considerable flexibility in choosing a chat template. A recommended starting point is the ChatML format, known for its adaptability and effectiveness across various conversational applications.
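If you do start from a base model, one practical way to adopt ChatML is to attach a template to the tokenizer yourself. A minimal sketch (the model id and this particular Jinja string are just one possible choice, not a requirement):

from transformers import AutoTokenizer

# Attach a minimal ChatML-style Jinja template to a base-model tokenizer.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

messages = [{"role": "user", "content": "Hi there!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# <|im_start|>user
# Hi there!<|im_end|>
# <|im_start|>assistant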

Concrete examples — Inference

Let’s dive into the world of quantized models, which are quite friendly for local computers, though chat templates behave identically whether the models are quantized or not. I’ll spare you the deep dive into quantization methods here — it’s a whole jungle out there, and believe me, trying to navigate it gave me a headache. But I do plan to unravel this zoo of quantization methods in a dedicated article soon.

To touch on the highlights, there are several well-known quantization methods: AWQ, GPTQ, and GGUF. Your choice among these will depend on your specific needs and, perhaps, some benchmarks to guide you to the most suitable option.

Here’s a quick rule of thumb for the hardware-wise among us: opt for GGUF if you’re working on Apple Silicon or lack a GPU. On the other hand, GPTQ and AWQ might be more up your alley if you have a GPU to leverage. However, it’s not just about the quantization methods; the model format also plays a crucial role. So, if you’re considering using llama-cpp because your buddy raved about how cool it was, you’re pretty much locked into the GGUF format — it’s the only one it supports.

As promised, I’ll delve deeper into this topic in a future article. For now, let’s focus on chat templates.

System prompt and chat template explained using ctransformers

ctransformers offers Python bindings for Transformer models implemented in C/C++, supporting GGUF (and its predecessor, GGML). Its interface is reminiscent of the Hugging Face Transformers library, providing a familiar environment for those accustomed to it. However, it's important to note that the ctransformers community might not be as active as others, such as that of llama-cpp.

Consider the quantized version mistral-7b-instruct-v0.1.Q4_K_M.gguf (where "Q4" denotes 4-bit quantization, with "K" and "M" being other GGUF quantization hyperparameters). Its creator, TheBloke, has quantized over 2,000 models, offering a broad selection on his Hugging Face profile. You can find the different quantization versions directly at https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF.

ctransformers simplifies model usage by handling downloads during model declaration, and pairing it with a Hugging Face tokenizer gives you the apply_chat_template method, which eases the incorporation of chat templates into your workflow. Here's how you might use it:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Download and load the quantized GGUF model; hf=True exposes a
# Hugging Face-compatible interface, and gpu_layers offloads layers to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True,
)

# The Hugging Face tokenizer provides the apply_chat_template method.
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=True
)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    {"role": "assistant", "content": "I don't know!"},
    {"role": "user", "content": "Are you sure?"},
]

# Render the conversation into the model's expected chat format and tokenize it.
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
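If you want to see the exact string this produces before it reaches the model, you can decode the tensor back into text (a small sketch reusing the tokenizer declared above):

# Decode the rendered chat to inspect the prompt string the model will actually see.
print(tokenizer.decode(tokenized_chat[0]))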

Let’s explore a straightforward example featuring a system message, a user message, and an assistant message, followed by another exchange between the user and the assistant.

So, what exactly is a system prompt? It’s the initial nudge we give to a model, acting as context to steer it towards the type of output we’re looking for. Think of it as an alignment technique that sidesteps the need for fine-tuning, nestled comfortably under the umbrella of Prompt Engineering. It’s an incredibly cost-effective strategy to boost a model’s capabilities and ensure the output matches the desired format. ChatGPT-4 also has a system prompt, and it was leaked a few days ago (https://pastebin.com/vnxJ7kQk).

The gpu_layers argument specifies the number of layers to be loaded onto the GPU. In an ideal world, we'd load every single layer of our transformer models onto the GPU to harness its full power. However, given the size of current transformer models, including quantized ones, attempting to load them in their entirety often leads to the dreaded "out of memory" error. For a more in-depth explanation, check out this [link](https://medium.com/@mr.sean.ryan/offloading-the-optimal-number-of-model-layers-for-a-given-llm-and-gpu-card-fe9146d33fee).

Incorporating a system message is optional but can significantly enhance the interaction. It has been demonstrated that models fine-tuned with a system prompt preceding the Q/A sequence outperform those trained solely on Q/A, which suggests that applying the same logic during inference can be beneficial.

<s> [INST] <<SYS>>
You are a friendly chatbot who always responds in the style of a pirate
<</SYS>>

How many helicopters can a human eat in one sitting? [/INST]
Ahoy there, mate! A human can't eat a helicopter in one sitting, no matter how much they might want to. They're made of metal and have blades that spin at high speeds, not exactly something you'd want to put in your belly!</s>
<s> [INST] Are you sure?</s> [/INST]
Aye, I'm sure! Helicopters are designed for flight and are not meant to be consumed by humans. They're made of metal and have blades that spin at high speeds, which would be very dangerous to ingest. So, no human can eat a helicopter in one sitting, no matter how much they might want to.</s>

The apply_chat_template method on the tokenizer abstracts away the chat template format, which makes it easier to see how it operates. Here it shows that the system message is enclosed within <<SYS>> tags and preceded by [INST]. Consequently, the first user message does not open with its own [INST] (it shares the one that wraps the system message), whereas subsequent user messages do.

The last line represents the model’s output, with the preceding lines constituting our input messages formatted by apply_chat_template.

It’s crucial not to misconstrue the purpose of BOS <s> and EOS </s> tokens. The former signals the model to commence generation, while the latter indicates when to halt. The incorporation of special tokens during fine-tuning is designed to enhance the model's comprehension, facilitating a more human-like conversational format.

For this fine-tuned version of Mistral, the add_generation_prompt argument doesn’t matter; see the last section, “What are generation prompts?”.

llama-cpp-python

llama-cpp serves as a C++ backend designed for running inference on quantized models akin to Llama. It was initially developed for leveraging local Llama models on Apple M1 MacBooks. You might wonder why llama-cpp can run Mistral models without there being a specific mistral-cpp. The reason is straightforward: Mistral and other models compatible with Llama share a similar architecture. Over time, open contributors have tweaked the code, making it possible to run these models efficiently.

llama-cpp-python is essentially the Python bindings for llama-cpp, allowing you to utilize it within a Python environment. Using it is as straightforward as working with ctransformers, with the primary difference being the need to download the model beforehand. This means you'll need to visit TheBloke's profile on Hugging Face, select the GGUF quantized model of your choice, download it, and store it locally. For instance, if you've saved a model in the models/ directory, you can use it as follows:

from llama_cpp import Llama

# Load the locally stored GGUF model.
llm = Llama(model_path="models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", verbose=False)

# create_chat_completion applies the model's chat template for us.
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ]
)
print(output['choices'][0]['message']['content'])

# output
Ahoy there, matey! A human can't eat a helicopter in one sitting, no matter how much they might want to.
They're just too big and heavy for a human to consume. But if ye're talkin' 'bout choppers, a human could probably take down a few with their trusty cutlass or pistol. So, let's say they could take down three or four in a battle. But that's just speculation, me hearties.
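By the way, the manual download step mentioned above can also be done programmatically. Here is a minimal sketch using huggingface_hub (the repo id and filename match the example above):

from huggingface_hub import hf_hub_download

# Fetch the GGUF file into the local models/ directory.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    local_dir="models",
)
print(model_path)  # pass this path to Llama(model_path=...)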

Let’s take a moment to peel back the layers and understand the inner workings directly, bypassing the create_chat_completion method for now.

llm = Llama(model_path="models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", verbose=False)

# Build the prompt by hand using the chat template discussed earlier.
prompt_template = '''[INST] <<SYS>>
You are a friendly chatbot who always responds in the style of a pirate
<</SYS>>
{prompt}[/INST]
'''

output = llm(
    prompt_template.format(prompt="Name the planets in our solar system"),
    max_tokens=150,
    echo=False,
)
print(output['choices'][0]['text'])
# output
Ahoy there! Ye be seekin' t'know the names o' them planets in yer own star system, eh? Well, let me spin ye a tale an' name 'em one by one.
First, ye've got ol' Sol, which we pirates call th' Sun, for it be where we get our warmth an' light. Then, there be Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus an' Neptune. An' if ye count Pluto as one o' yer own, well, we pirates do too, an' call it "The Dwarf Planet of Pluto"! So there ye have it, me

However, a pivotal point to remember is regarding the Beginning Of Sequence (BOS) token, <s>. Didn't we mention that prompts should always start with this BOS token? Indeed, we did. The reason we skipped mentioning it explicitly is that it's automatically included. This is because, within the method that's called internally to tokenize the input, the add_bos parameter is set to true by default:

def tokenize(
    self, text: bytes, add_bos: bool = True, special: bool = False
) -> List[int]:

It’s important to be mindful of this implementation detail. Altering this setting or manually adding <s> at the beginning of your prompt could lead to the introduction of two BOS tokens. This redundancy might negatively impact performance, especially with shorter inputs.
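A quick way to check this on your own setup is to tokenize a string with and without the automatic BOS and compare the first ids (a sketch reusing the llm instance created above; for Llama/Mistral vocabularies the BOS id is 1):

# Compare tokenization with and without the automatically added BOS token.
with_bos = llm.tokenize(b"Hello there")                   # add_bos defaults to True
without_bos = llm.tokenize(b"Hello there", add_bos=False)

print(with_bos[:3])     # starts with the BOS id (1 for Llama/Mistral vocabularies)
print(without_bos[:3])  # no BOS id prepended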

What are “generation prompts”?

You may have noticed that the apply_chat_template method has an add_generation_prompt argument. This feature instructs the template to append tokens signaling the onset of an assistant’s reply. To put this into perspective, let’s examine a sample chat:

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

Without a generation prompt, and using a ChatML-style template (like the one used by CapybaraHermes above), the conversation would be formatted as follows, with nothing explicitly signaling that an assistant response should begin:

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""

However, introducing a generation prompt changes the setup:

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""

Notice the addition of a token indicating the beginning of the assistant’s response. This ensures that the model’s generated text forms an assistant’s reply, rather than unexpectedly continuing the user’s message. It’s crucial to remember that chat models, at their core, are language models trained to continue text. A chat is merely a specific type of text for them. Thus, guiding them with the right control tokens is essential for them to understand their role.

It’s worth noting that not all models necessitate generation prompts. Models like BlenderBot and LLaMA, for instance, don’t use special tokens before bot responses, rendering the add_generation_prompt feature redundant in such contexts. The impact of add_generation_prompt largely depends on the template employed.

All of this is explained very well in the Hugging Face chat templating documentation: https://huggingface.co/docs/transformers/chat_templating

In conclusion, delving into the complexities of Large Language Models (LLMs), especially in terms of quantization, chat templates, and generation prompts, can initially seem daunting. If you’re finding these concepts challenging to grasp, you’re not alone — I’ve been there too. My aim with this summary was to shed some light on these topics, hoping it offers some clarity.

If you’re keen on staying in the loop with the latest on LLM technology, feel free to follow me. My goal is always to break down complex concepts into simpler, more digestible explanations, the kind I wish I had encountered earlier in my journey.

You’re also welcome to connect with me on Twitter, where I’m starting to share more insights and explanations in an accessible way, inviting everyone to join the LLM conversation. If you have any questions or curiosities, don’t hesitate to reach out on X.

And stay tuned for my upcoming article, likely focusing on quantization — a topic well worth exploring further.
