Fine-Tuning Mistral-7B: A Journey Through Literature and AI

Fine-Tuning Mistral 7B on a Quotes Dataset

Gianpiero Andrenacci
Data Bistrot
34 min read · Jul 9, 2024


In this article, I’m taking you through a groundbreaking journey to fine-tune the Mistral-7B-v0.1 Large Language Model (LLM). This model stands at the cutting edge of generative text technology, boasting 7 billion parameters.

My endeavor is fueled by a dataset containing over 12,000 quotes, a treasure trove of wisdom spanning a diverse array of topics and styles. This collection serves as the perfect training ground for Mistral-7B, offering it a comprehensive view of the linguistic and thematic richness of thought and literature.

Our journey will cover the critical steps required to fine-tune Mistral-7B using the quotes dataset. From data preprocessing to setting up the training pipeline and, finally, the actual training process, we aim to refine Mistral-7B’s capabilities, enabling it to generate new quotes that echo the wisdom, humor, and profundity of their literary predecessors.

Mistral-7B architecture

Before we dive into the technicalities of fine-tuning, let’s unpack what makes the Mistral-7B LLM a new frontier of modern natural language processing (NLP) and generative AI.

Mistral 7B is a 7-billion-parameter language model designed for superior performance and efficiency. It outperforms the best open 13-billion-parameter model (Llama 2) across all evaluated benchmarks and the best released 34-billion-parameter model (Llama 1) in mathematics and code generation. Key innovations include grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to effectively manage sequences of arbitrary length while reducing inference cost.

The model (that we will use in this use-case) is also fine-tuned to follow instructions, outperforming Llama 2 in both human and automated benchmarks. It is released under the Apache 2.0 license.

Architectural details

Here’s a detailed summary of architectural details that you can find in the scientific paper on Mistral 7B:

  • Mistral 7B is built on a transformer architecture, which is a widely adopted framework in the field of natural language processing due to its effectiveness in handling sequence data.
  • Mistral 7B has 32 layers, 32 attention heads, and 8 key-value heads. The hidden size is 4096, and the intermediate size is 14336. The model’s maximum position embedding is 32768, and the vocabulary size is 32000. The model uses the silu activation function and bfloat16 data type. The sliding_window is 4096, and the rope_theta is 10000.

These values are no mystery; they are all explained in the paper, and you can verify them directly from the model configuration, as shown below.
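
A quick check (assuming access to the mistralai/Mistral-7B-v0.1 repository; you may need to pass your Hugging Face token via the token argument if the repo is gated) that loads only the configuration, not the 7B weights:

from transformers import AutoConfig

# Downloads only config.json, not the model weights
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print(config.num_hidden_layers)        # 32 layers
print(config.num_attention_heads)      # 32 attention heads
print(config.num_key_value_heads)      # 8 key-value heads (GQA)
print(config.hidden_size)              # 4096
print(config.intermediate_size)        # 14336
print(config.max_position_embeddings)  # 32768
print(config.vocab_size)               # 32000
print(config.hidden_act)               # "silu"
print(config.sliding_window)           # 4096
print(config.rope_theta)               # 10000.0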

Sliding Window Attention (SWA):

SWA is introduced to manage the computational cost traditionally associated with transformer models on long sequences. By allowing each token to attend only to a fixed window of tokens from the preceding layer, SWA reduces the quadratic cost of full attention to one that grows linearly with sequence length for a fixed window size; because each layer extends the reach by one window, tokens far apart still become indirectly connected after several layers. This approach helps in processing sequences much longer than typical transformer models can handle.
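
To make the mechanism concrete, here is a minimal illustrative sketch (not Mistral's actual implementation) of a causal sliding-window attention mask, where each position may attend only to itself and the previous (window - 1) positions:

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

# With a window of 3, token 5 attends only to tokens 3, 4 and 5
print(sliding_window_causal_mask(seq_len=6, window=3).int())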

Image source: the Mistral 7B paper, https://ar5iv.labs.arxiv.org/html/2310.06825

This innovative approach allows for significant speed improvements and reduced computational costs.

Further reading on attention mechanisms

Rolling Buffer Cache

To further optimize memory usage, Mistral 7B employs a rolling buffer cache: the key-value cache has a fixed size equal to the attention window, and entries for new positions overwrite the oldest ones. The cache size therefore remains constant even as the sequence length increases, which the paper reports as an 8x reduction in cache memory usage on long sequences without compromising model quality.
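
The idea can be pictured with a toy cache like the one below; this is a conceptual sketch rather than Mistral's actual code. Keys and values for position i always land in slot i % window, so older entries are overwritten once the window is full and memory stays constant:

import torch

class RollingKVCache:
    """Toy fixed-size key/value cache: position i is stored in slot i % window."""

    def __init__(self, window: int, n_kv_heads: int, head_dim: int):
        self.window = window
        self.keys = torch.zeros(window, n_kv_heads, head_dim)
        self.values = torch.zeros(window, n_kv_heads, head_dim)
        self.pos = 0  # absolute position of the next token

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        slot = self.pos % self.window  # overwrite the oldest entry once the window is full
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

cache = RollingKVCache(window=4, n_kv_heads=8, head_dim=128)
for _ in range(10):  # memory stays constant no matter how many tokens we append
    cache.append(torch.randn(8, 128), torch.randn(8, 128))
print(cache.keys.shape)  # torch.Size([4, 8, 128])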

Image source: the Mistral 7B paper, https://ar5iv.labs.arxiv.org/html/2310.06825

Pre-fill and Chunking:

The model utilizes a technique known as pre-fill and chunking for sequence generation, where the cache is pre-filled with known parts of the sequence (e.g., a prompt) to reduce computation. If the prompt is large, it is divided into smaller chunks, and the cache is filled chunk by chunk. This process allows for efficient memory usage and reduces the computational overhead associated with processing long sequences.
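
In Hugging Face terms, pre-fill with chunking amounts to feeding the prompt through the model piece by piece while carrying past_key_values forward. Here is a hedged sketch of the idea (the transformers generation code handles this for you in practice, and the chunk size below is arbitrary):

import torch

def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int = 512):
    """Fill the KV cache chunk by chunk instead of processing the whole prompt at once."""
    past_key_values = None
    with torch.no_grad():
        for start in range(0, input_ids.shape[1], chunk_size):
            chunk = input_ids[:, start:start + chunk_size]
            out = model(input_ids=chunk, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values  # carry the cache into the next chunk
    return past_key_values  # ready to be reused for token-by-token generation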

Image source: the Mistral 7B paper, https://ar5iv.labs.arxiv.org/html/2310.06825

Grouped-Query Attention (GQA):

Grouped-Query Attention (GQA) is an attention variant designed to make Transformer inference on long sequences more efficient. In standard multi-head attention, every query head has its own key and value heads, so the key-value cache that must be kept in memory and re-read at every decoding step grows with the number of heads and with the sequence length. For long inputs this memory traffic, on top of attention scores that grow quadratically with sequence length, becomes a major bottleneck.

Image source: the GQA paper, https://arxiv.org/pdf/2305.13245

GQA addresses this by letting groups of query heads share a single key and value head, sitting between full multi-head attention (one key-value head per query head) and multi-query attention (a single key-value head for all queries). Mistral 7B uses 32 query heads but only 8 key-value heads, which shrinks the key-value cache by a factor of 4 and reduces the memory bandwidth needed at decoding time, making inference faster, especially on long sequences, while keeping quality close to full multi-head attention.
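
Here is a minimal sketch of the grouping idea, illustrative only, with tensors in the usual (batch, heads, seq, head_dim) layout. With Mistral's 32 query heads and 8 key-value heads, each key-value head serves a group of 4 query heads:

import torch

def grouped_query_attention(q, k, v):
    """q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d) with n_q_heads % n_kv_heads == 0."""
    group_size = q.shape[1] // k.shape[1]       # 32 // 8 = 4 query heads per KV head
    k = k.repeat_interleave(group_size, dim=1)  # share each KV head across its group
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 32, 16, 128)  # 32 query heads
k = torch.randn(1, 8, 16, 128)   # only 8 key heads need to be cached
v = torch.randn(1, 8, 16, 128)   # only 8 value heads need to be cached
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 16, 128])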

These architectural innovations position Mistral 7B as a highly efficient and effective model, capable of handling long sequences with reduced computational costs and improved inference speeds.

The combination of SWA, rolling buffer cache, and pre-fill and chunking techniques contribute to the model’s superior performance, making it a notable advancement in the field of large language models.

The Mistral 7B model was further enhanced to better follow instructions, culminating in the variant “Mistral 7B — Instruct.” This process aimed to augment the model’s generalization capabilities, especially in scenarios requiring adherence to specific instructions.

Mistral 7B — Instruct Fine-tuning:

Fine-tuning an LLM — All rights reserved
  • The fine-tuning process for Mistral 7B — Instruct utilized publicly available instruction datasets from the Hugging Face repository. This approach emphasizes the adaptability of Mistral 7B, showcasing its capability to improve performance significantly through fine-tuning on specialized datasets without the need for proprietary data or complex training techniques.
  • The resulting Mistral 7B — Instruct model demonstrated superior performance compared to other models of similar size on MT-Bench, a benchmark for evaluating models on instruction-following capabilities. It also compared favorably to larger models, showcasing the effectiveness of instruction-focused fine-tuning.

Differences Between Mistral 7B and Mistral 7B — Instruct:

  • Base Model vs. Fine-tuned Model: The primary distinction between Mistral 7B and Mistral 7B — Instruct lies in the latter’s fine-tuning on instruction datasets. While Mistral 7B is a general-purpose language model with a strong baseline performance across various benchmarks, Mistral 7B — Instruct is specifically optimized to understand and execute instructions more effectively, making it more suited for tasks that require precise adherence to given directives.
  • Performance Improvements: Mistral 7B — Instruct outperforms the base model (and other similar models) in contexts where instructions must guide the generation process. This is particularly evident in the MT-Bench evaluations, where Mistral 7B — Instruct shows a marked improvement in understanding and responding to instructions.
  • Generalization Capability: The fine-tuning process enhances Mistral 7B’s ability to generalize from instructions, making Mistral 7B — Instruct more adept at tasks requiring nuanced comprehension of instructions and their context. This makes the fine-tuned model particularly valuable for applications such as chatbots, where the ability to follow complex and varied user instructions is crucial.

In conclusion, the fine-tuning process not only refines Mistral 7B’s ability to follow instructions but also significantly elevates its utility in real-world applications where nuanced comprehension and execution of instructions are necessary. This illustrates the power of targeted fine-tuning in enhancing the capabilities of foundational models like Mistral 7B.


The Significance of Fine-Tuning an LLM on Quotes

Why focus on quotes, you might wonder?

Quotes reading — All rights reserved

Quotes represent the essence of human experience, encapsulating wisdom, humor, and insight in a succinct, memorable manner. Through fine-tuning Mistral-7B on a dataset of scraped quotes, my objective is to endow the model with the ability to generate sentences that are not only impactful but also capable of inspiring, motivating, and providing reflection points for readers.

Join me on this captivating journey as we marry the insights of yesteryears with the technological advancements of today. My aim is to give an example of how to forge a unique model capable of generating insightful and original quotes, blending the legacy of the past with the potential of AI’s future. Through this article, I invite you to witness the fusion of literature’s depth with AI’s innovative prowess, a synthesis promising to usher in a new era of creativity and intellectual enrichment.

My Fine-Tuning of Mistral 7B — Instruct on the Quotes Dataset

Let’s start exploring the process of fine-tuning. This code can run on Google Colab with a free GPU.

Install libraries and import

!pip install transformers accelerate scipy ipywidgets bitsandbytes peft datasets trl -qU
!pip install "torch>=1.10"
!pip install huggingface-hub -qU
# import libraries
from datetime import datetime
from math import ceil

import pandas as pd
from google.colab import userdata

# Increase the maximum displayed column width so full quotes are visible
pd.set_option('display.max_colwidth', 1000)

Load the dataset

This code snippet reads a CSV file into a pandas DataFrame, processes it by removing unnecessary columns and rows with missing values, and resets the DataFrame’s index.


file_name = '/content/best_quotes_en.csv'
df = pd.read_csv(file_name)                    # read the quotes CSV into a DataFrame
df.drop(columns=["is_english"], inplace=True)  # remove the unnecessary column
df.dropna(inplace=True)                        # drop rows with missing values
df.reset_index(inplace=True, drop=True)        # reset the index after dropping rows
df.head(5)

Here you can download a version of the quote files:
https://www.kaggle.com/datasets/gianpieroandrenacci/best-quotes-dataset

Split train-validation

This code snippet sets the variables for splitting the DataFrame into training and validation sets, using the specified training proportion.

# split the data
train_perc = 0.8
len_df = len(df)
train_len = ceil(len_df * train_perc)
val_len = len_df - train_len

print(f"Total number of samples: {len_df}")
print(f"Training set size: {train_len}")
print(f"Validation set size: {val_len}")

In my case:

Total number of samples: 12344
Training set size: 9876
Validation set size: 2468

Converting DataFrame to Hugging Face Dataset

from datasets import Dataset

# omits the DataFrame's index in the resulting Dataset
dataset = Dataset.from_pandas(df, preserve_index=False)

# Display basic information about the dataset
print("Dataset structure and column types:")
print(dataset)

# Preview the first few entries in the dataset to understand its contents
print("\nPreview of the first few records:")
print(dataset[:2])
# Splitting the dataset into training and validation sets
dataset_train = dataset.select(range(train_len)) # Select the first portion as training data
dataset_val = dataset.select(indices=range(train_len, len(dataset))) # Select the remaining portion as validation data

# Print detailed information about the training and validation datasets
print("Training Dataset Info:")
print(dataset_train)
print("\nValidation Dataset Info:")
print(dataset_val)

The code snippet above achieves the following steps:

  1. Import Dataset Class: Initially, the Dataset class is imported from the datasets module, which is part of the Hugging Face's datasets library.
  2. Conversion to Dataset: We convert a pandas DataFrame (df) into a Hugging Face Dataset object using the Dataset.from_pandas(df, preserve_index=False) method. The preserve_index=False argument ensures that the DataFrame's index is not included as a separate column in the resulting dataset. This is useful to keep the dataset clean and focused only on the data's content.
  3. Inspecting Dataset Structure: To understand the structure of the converted dataset, including the types of data it contains, we print out its schema using print(dataset). This provides insights into the dataset's columns and their respective data types.
  4. Previewing Dataset Contents: Finally, to get a concrete idea of what the data looks like, we preview the first two entries of the dataset using print(dataset[:2]). This helps in verifying the data conversion and understanding the dataset's initial entries.

This process is critical for preparing your data for modeling, especially when working with the Hugging Face ecosystem for NLP tasks.

Function: create_prompt

def create_prompt(row, output_format='string', bos_token="<s>", eos_token="</s>"):
    """
    Generate a formatted prompt using input data, with options for string or dictionary output.

    Parameters:
    row (dict): A dictionary containing the 'quote' and 'tags' to be included in the prompt.
    output_format (str, optional): The format of the output; 'string' or 'dict'. Defaults to 'string'.
    bos_token (str, optional): The beginning-of-sequence token. Defaults to "<s>".
    eos_token (str, optional): The end-of-sequence token. Defaults to "</s>".

    Returns:
    str or dict: The generated prompt in the specified format.
    """
    system_message = "[INST]Use the provided input to create a quote about a topic category[/INST]"
    response = row.get("quote", "")
    input_tags = row.get("tags", "")

    full_prompt = f"{bos_token}{system_message} ### Input: {input_tags} ### Response:{response}{eos_token}"

    if output_format == 'dict':
        return {"prompt": full_prompt}
    return full_prompt

The create_prompt function dynamically constructs prompts for language model generation tasks, incorporating both the content and context of the input data. It's designed to support flexible output formats and customizable sequence tokens, making it suitable for a variety of NLP and Generative AI applications.

Process:

  1. System Message: A predefined instructional message is included to guide the language model’s generation based on the input.
  2. Prompt Construction: Combines the beginning-of-sequence token, system message, input tags, the quote (response), and the end-of-sequence token into a single, formatted prompt.
  3. Output: Based on the output_format parameter, the function returns the prompt as either a plain string or within a dictionary.

Usage Scenario:

This function is particularly useful in deep learning workflows where generating contextual prompts for language models is necessary, such as for tasks involving creative writing, summarization, or content generation based on specific topics or themes.

The tokenizer automatically adds the necessary special tokens according to the model’s requirements. These special tokens often include tokens for the start and end of the sequence, which are essential for the model to understand the beginning and termination of the input text.

For models like BERT, special tokens such as [CLS] (for the start of the sequence) and [SEP] (for separation or end of the sequence) are automatically inserted by the tokenizer. Similarly, for models trained with the <s> and </s> tokens as their beginning-of-sequence and end-of-sequence markers, the tokenizer will handle their insertion when add_special_tokens=True is specified. This means you don’t need to manually enclose your prompt in <s> and </s> tags; the tokenizer will do it for you.

# call example
create_prompt(dataset_train[0], output_format='string')

# Train dataset - Using a lambda function to apply create_prompt with the desired output format 'dict'
instruct_tune_dataset_train = dataset_train.map(lambda row: create_prompt(row, output_format='dict'))
# Test Dataset - Using a lambda function to apply create_prompt with the desired output format 'dict'
instruct_tune_dataset_val = dataset_val.map(lambda row: create_prompt(row, output_format='dict'))

Load the base model

This code snippet sets up a model and tokenizer for a language model from the Hugging Face Transformers library, configures quantization settings, and determines the appropriate device and data type for model execution.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Retrieve the Hugging Face API token from user data for accessing models
access_token = userdata.get('HF_TOKEN')

# Check if a CUDA-enabled GPU is available; use GPU if available, otherwise use CPU
device_type = "cuda:0" if torch.cuda.is_available() else "cpu"

# Set the tensor data type based on the device: use float16 for GPU (faster, more memory-efficient), otherwise use float32
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
  • Imports: Load necessary libraries for model and tokenizer setup.
  • Access Token: Retrieve the Hugging Face token for model access.
  • Device Type: Determine whether to use GPU or CPU.
  • Tensor Data Type: Set the appropriate data type for tensors based on the device.

Configuring and Loading a Quantized Model with BitsAndBytes

This section outlines the process of configuring and loading a large language model with enhanced efficiency through quantization, showcasing how to utilize the BitsAndBytes library for optimal performance.

# Define the configuration for BitsAndBytes with specific quantization and compute settings
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # Enable loading the model in 4-bit precision
    bnb_4bit_quant_type="nf4",          # Use the "nf4" quantization type
    bnb_4bit_use_double_quant=True,     # Enable double quantization for improved precision
    bnb_4bit_compute_dtype=torch_dtype  # Set compute data type for efficiency
)

# Specify the model ID for the Mistral-7B-Instruct-v0.1 model
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# Load the model with the defined configuration and additional settings for device allocation and caching
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_type,          # Place the model on the selected device (GPU if available)
    quantization_config=nf4_config,  # Apply the defined BitsAndBytes configuration
    use_cache=False,                 # Disable the KV cache during training
    token=access_token,              # Securely pass the Hugging Face access token
)

Configuration with BitsAndBytes:

  • Quantization to 4-bit Precision: By setting load_in_4bit=True, the model is configured to load in 4-bit precision, significantly reducing memory requirements.
  • Quantization Type (nf4): The bnb_4bit_quant_type is set to "nf4", specifying the quantization method to use, which in this case is tailored for neural network computations.
  • Double Quantization: Enabled via bnb_4bit_use_double_quant=True for higher precision in the quantized model.
  • Compute Data Type: The bnb_4bit_compute_dtype is set based on a variable torch_dtype, allowing for customized computational efficiency.

Model Loading:

  • Model ID: The variable model_id holds the identifier for the Mistral-7B-Instruct-v0.1 model, enabling its retrieval.
  • Device Allocation: With device_map=device_type, the model is placed on the selected device (the GPU if one is available, otherwise the CPU).
  • Quantization Configuration: The previously defined nf4_config is applied to the model, activating the quantization settings.
  • Cache Management: Caching is disabled (use_cache=False) because the KV cache is not needed during training and conflicts with gradient checkpointing; it is re-enabled later for inference.
  • Authentication: Access to the model is secured using an access token (token=access_token), ensuring protected access to resources.

This approach demonstrates how to leverage advanced configuration options for optimizing model loading and operation, particularly in resource-constrained environments or when handling very large models.

Setting Up the Tokenizer for Mistral-7B-v0.1

The setup process involves loading the tokenizer associated with the Mistral-7B-v0.1 model and configuring it for optimal performance in processing text data. This configuration ensures uniformity in sequence length and proper handling of longer inputs.

from transformers import AutoTokenizer

# start and end token
bos_token = "<s>"
eos_token = "</s>"

# Define maximum sequence length for tokenization
max_length = 512

# Load the tokenizer for the Mistral-7B-v0.1 model with specific settings
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # Model identifier
    model_max_length=max_length,  # Set maximum model input length
    padding_side="right",         # Ensure padding is applied on the right side
    token=access_token,           # Use secure token for authentication
    truncation=True,              # Enable truncation to max length
    padding="max_length",         # Apply padding to max length for uniformity
)

# Post-initialization adjustments to tokenizer settings
# Check and set special tokens if they are not already defined
if tokenizer.eos_token is None:
    tokenizer.eos_token = eos_token  # Set end-of-sequence token
    print("EOS token set.")

if tokenizer.bos_token is None:
    tokenizer.bos_token = bos_token  # Set beginning-of-sequence token
    print("BOS token set.")

# Use the beginning-of-sequence token as padding token
tokenizer.pad_token = tokenizer.bos_token

print(tokenizer)

These steps are pivotal for preparing text data for processing with the Mistral-7B-v0.1 model, ensuring that all inputs conform to the model’s requirements for efficient and accurate natural language understanding tasks.

Print Output:

LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<s>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), }

Key Configuration Steps:

  • Maximum Sequence Length: A limit is set (max_length = 512) to standardize the length of all input sequences. This cap helps manage memory usage and computational load during tokenization and model inference.
  • Tokenizer Loading: The AutoTokenizer class from the transformers library is used to load the tokenizer with a specific model identifier ("mistralai/Mistral-7B-v0.1"). This step ensures that the tokenizer is perfectly matched with the model's expected input format.

Padding and Truncation:

  • Right-Side Padding: By setting padding_side="right", all sequences shorter than max_length are extended with padding tokens on the right side, ensuring consistent sequence lengths.
  • Truncation: The truncation=True option automatically shortens sequences longer than max_length, preventing input overflow errors.
  • Padding to Maximum Length: The padding="max_length" setting applies padding to all sequences to reach the defined max_length, guaranteeing uniform input size (see the quick check after this list).
  • Secure Authentication: The tokenizer is authenticated using a secure token (token=access_token), which is essential for accessing private models or premium features on the Hugging Face platform.
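
As a quick, illustrative sanity check of these settings (using the tokenizer and max_length defined above), you can tokenize a short and a deliberately long text and confirm that both come out at exactly max_length tokens, with the <s> BOS token (id 1) prepended automatically:

batch = tokenizer(
    ["A short quote.", "A very long quote " * 300],
    padding="max_length",
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)

print(batch["input_ids"].shape)          # torch.Size([2, 512]): both padded/truncated to 512
print(batch["input_ids"][0, 0].item())   # 1 -> the <s> BOS token added by the tokenizer
print(batch["attention_mask"][0].sum())  # number of real (non-padding) tokens in the short quote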

The following code snippet defines and uses a function to plot the distribution of the lengths of tokenized input sequences from the training and validation datasets (the tokenized datasets themselves are created in the next step).

import matplotlib.pyplot as plt

def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
    # Calculate the lengths of input_ids for each sample in the training dataset
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    # Append the lengths of input_ids for each sample in the validation dataset
    lengths += [len(x['input_ids']) for x in tokenized_val_dataset]

    # Print the total number of samples
    print(len(lengths))

    # Plotting the histogram
    plt.figure(figsize=(10, 6))                          # Create a new figure with specified size
    plt.hist(lengths, bins=20, alpha=0.7, color='blue')  # Plot histogram with 20 bins
    plt.xlabel('Length of input_ids')                    # Set x-axis label
    plt.ylabel('Frequency')                              # Set y-axis label
    plt.title('Distribution of Lengths of input_ids')    # Set plot title
    plt.show()                                           # Display the plot

# Call the function with tokenized training and validation datasets
plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

This function helps visualize the distribution of the lengths of tokenized input sequences, which is useful for understanding the dataset’s characteristics and ensuring appropriate model input handling.

Function: generate_and_tokenize_prompt

The generate_and_tokenize_prompt function streamlines the preparation of dataset entries for the model by generating prompts from input data and then tokenizing these prompts. This process is crucial for models that require tokenized input for training or inference.

def generate_and_tokenize_prompt(row):
    """
    Generate a prompt from a row and tokenize the generated prompt.

    Parameters:
    row (dict): A dictionary containing the data needed to generate the prompt.

    Returns:
    The tokenized prompt.
    """
    prompt = create_prompt(row)  # No need to specify output_format='string' as it's the default
    return tokenizer(prompt)

tokenized_train_dataset = dataset_train.map(generate_and_tokenize_prompt)
tokenized_val_dataset = dataset_val.map(generate_and_tokenize_prompt)

Process Overview:

  • Generate Prompt: For each entry in the dataset (represented as a dictionary row), a prompt is generated using the create_prompt function. The function utilizes data within row to construct a text prompt. The default output format is a string, which suits the requirements for tokenization.
  • Tokenize Prompt: The generated string prompt is then passed to a tokenizer (tokenizer(prompt)), which converts the text into a format that's understandable by machine learning models, typically a sequence of integers representing tokens in the tokenizer's vocabulary.

Batch Tokenization with .map Method:

  • Tokenized Training Dataset: dataset_train.map(generate_and_tokenize_prompt) applies the generate_and_tokenize_prompt function to each entry in the dataset_train. This results in a new dataset where each original row is replaced with its tokenized prompt version.
  • Tokenized Validation Dataset: Similarly, dataset_val.map(generate_and_tokenize_prompt) transforms the validation dataset, ensuring that both training and validation datasets are in the correct format for model consumption.

Prompt generation and tokenization in LLM NLP Workflows

By automating prompt generation and tokenization, this method significantly streamlines the preprocessing pipeline for NLP LLM tasks, enabling more efficient model training and evaluation.

This approach is particularly valuable in NLP workflows for several reasons:

  • Consistency: Ensures uniform processing and tokenization across all dataset entries.
  • Efficiency: Leveraging the .map method for batch processing accelerates the preparation of large datasets.
  • Compatibility: Prepares data in a form that’s directly compatible with models, facilitating smoother training and evaluation processes.

The generate_response function is designed to utilize a pre-trained model and tokenizer from the Transformers library to generate a text response based on a given input prompt. This function encapsulates the process of input preparation, model inference, and output processing, making it a versatile tool for applications like chatbots, automated writing assistants, or any scenario where generating human-like text from prompts is required.

def generate_response(prompt, model, tokenizer):
    """
    Generates a response to a given prompt using the specified model and tokenizer.

    Parameters:
    prompt (str): The input text prompt to generate a response for.
    model (transformers.PreTrainedModel): The pre-trained model used for generating the response.
    tokenizer (transformers.PreTrainedTokenizer): The tokenizer for preprocessing the prompt and decoding the model's output.

    Returns:
    str: The generated response text.
    """
    # Encode the prompt into model input format
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    # Move model input to the GPU
    model_inputs = encoded_input.to('cuda')

    # Generate a response using the model
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1000,                 # Limit the maximum number of new tokens generated
        do_sample=True,                      # Enable sampling to generate diverse responses
        pad_token_id=tokenizer.eos_token_id  # Use the EOS token for padding during generation
    )

    # Decode the generated response tokens to text
    decoded_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Remove the original prompt from the output to return only the generated response
    response = decoded_output[0].replace(prompt, "")
    return response

Here’s a breakdown of its workflow:

  • Input Prompt Tokenization: The function starts by tokenizing the input prompt using the provided tokenizer. This step converts the prompt from a string of text into a format that the model can understand (a tensor of token IDs), including the addition of special tokens that signify the start and end of sequences.
  • GPU Allocation: The tokenized input is then moved to the GPU to leverage accelerated computing. This step is crucial for performance, especially when working with large language models, as it significantly reduces the time taken for inference.
  • Response Generation: With the input prepared, the function calls the model’s generate method to produce a response. This method takes several parameters to control the generation process:
  • max_new_tokens limits the number of new tokens that the model can generate, preventing excessively long outputs.
  • do_sample enables probabilistic sampling of the next token, allowing for more varied and natural responses.
  • pad_token_id specifies the token used for padding, ensuring the model handles varying input lengths appropriately.
  • Decoding and Cleaning: After generation, the output token IDs are decoded back into text. The function then removes the original prompt from this text, ensuring that the returned response contains only the newly generated content.

The result is a standalone piece of text generated by the model in response to the input prompt, demonstrating the model’s ability to understand and continue the given text in a coherent and contextually appropriate manner.

tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

This line of code uses the tokenizer to convert a text prompt into a format that is suitable for input to a transformer model.

prompt: This is the text input that you want the model to respond to or analyze.

return_tensors="pt": Specifies that the output should be in the form of PyTorch tensors. The "pt" stands for PyTorch. If you were using TensorFlow, you might use "tf" instead.

add_special_tokens=True: Tells the tokenizer to automatically insert the special tokens the model expects, such as sequence start and end markers. For example, in many models a [CLS] token is inserted at the beginning of the sequence.

Example of response with the non-finetuned model:

# Example prompt

prompt = "[INST]Use the provided input to create a new original quote about \
topic category[/INST] ### Input:ambition ### Response:"
# Generating a response
generate_response(prompt, model, tokenizer)

Output: “Ambition is the ultimate fuel that propels us to achieve greatness in life.”

Train the model Mistral-7B

Trainable parameters

The function print_trainable_parameters is designed to compute and print out statistics about the parameters of a given neural network model, specifically focusing on how many of these parameters are trainable.

# Disable the KV cache during training (re-enable it for inference!)
model.config.use_cache = False

# print the trainable parameters
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Here's a detailed explanation of how it works:

  • Overview: In deep learning models, parameters (often weights) are the parts of the model that are learned from the training data. Parameters can be either trainable or non-trainable. Trainable parameters are updated during training via backpropagation, whereas non-trainable parameters remain static and are not updated during this process.

Function Process:

  • The function initializes two counters: trainable_params and all_param, to keep track of the total number of trainable parameters and the total number of parameters in the model, respectively.
  • It iterates over all parameters of the model, obtained by calling model.named_parameters(). This method returns an iterator of tuples, where each tuple contains the name of the parameter and the parameter itself.
  • For each parameter, the function uses param.numel() to get the total number of elements in the parameter tensor, which corresponds to the number of parameters in that tensor. This number is added to all_param to keep a running total of all parameters in the model.
  • It checks if the parameter is trainable by examining param.requires_grad. If requires_grad is True, the parameter is considered trainable, and its number of elements is added to trainable_params.
  • After iterating through all parameters, the function calculates the percentage of parameters that are trainable by dividing trainable_params by all_param and multiplying by 100.
  • Finally, it prints out the total number of trainable parameters, the total number of parameters, and the percentage of parameters that are trainable.

This function is particularly useful for understanding the complexity and capacity of a model, as well as the scope of learning that occurs during training. By distinguishing between trainable and non-trainable parameters, one can also gain insights into how different parts of the model contribute to the learning process.

Parameter-Efficient Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning (PEFT) enables efficient adaptation of pre-trained models to downstream applications without fine-tuning all the model’s parameters. PEFT supports the widely used Low-Rank Adaptation (LoRA) method for large language models.

from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM"
)

The code snippet provided is related to configuring a model for parameter-efficient fine-tuning (PeFT) using the peft library. Specifically, it's focused on setting up LoRA (Low-Rank Adaptation), a technique for adapting large pre-trained models with a minimal increase in the number of trainable parameters. Here's a breakdown of the key components and what they represent:

  • Import Statement: The code imports AutoPeftModelForCausalLM, LoraConfig, get_peft_model, and prepare_model_for_kbit_training from the peft library. These utilities and classes are designed to facilitate the use of PeFT techniques, such as LoRA, with causal language models.
  • LoRA Configuration (LoraConfig): The configuration for LoRA is defined through LoraConfig, specifying how the adaptation should be applied. The parameters within LoraConfig have specific roles:
  • r: This parameter specifies the rank of the adaptation matrices. A lower rank means fewer parameters and less capacity for adaptation, but it's more parameter-efficient.
  • lora_alpha: This determines the scale of the adaptation. A higher lora_alpha value increases the impact of the LoRA parameters on the adapted model's output.
  • target_modules: A list of module names within the transformer model where LoRA adaptation should be applied. Common targets include the projection layers (q_proj, k_proj, v_proj, o_proj) involved in attention mechanisms, as well as layers specific to the model's architecture like lm_head for language models.
  • bias: Specifies how biases should be treated during the adaptation process. Setting it to "none" indicates that biases are not adapted.
  • lora_dropout: The dropout rate applied to the LoRA parameters. Dropout is a regularization technique to prevent overfitting by randomly setting a proportion of the parameters to zero during training.
  • task_type: Defines the type of task the model is being adapted for, which in this case is "CAUSAL_LM" (causal language modeling). This setting helps tailor the adaptation process to the specifics of the task.

The setup using LoraConfig is part of preparing a pre-trained model for fine-tuning on a specific task, where the goal is to make the model better fit the task-related data while modifying a minimal number of parameters. This approach can be particularly useful when working with very large models, where traditional fine-tuning might be computationally expensive or when the amount of task-specific data is limited.
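
To build intuition for what this configuration sets up, here is a conceptual sketch of a LoRA-adapted linear layer; the peft library does the equivalent of this for each module listed in target_modules, and this is not its actual implementation. The pre-trained weight matrix stays frozen while a low-rank update, scaled by lora_alpha / r, is learned on top of it:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Conceptual LoRA wrapper: y = W x + (alpha / r) * B(A(x)), with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False         # the pre-trained weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # the low-rank update starts as a no-op
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(self.dropout(x)))

layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters instead of ~16.8M for the full layer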

Now we can execute the defined functions:

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
print_trainable_parameters(model)

1. model.gradient_checkpointing_enable():

This method enables gradient checkpointing for the model. Gradient checkpointing is a technique used to reduce memory usage during training by trading off computational time. Instead of storing all intermediate activations in memory for the backward pass, it stores only a subset and recomputes the others as needed. This is particularly useful for training large models or using very deep networks that would otherwise exceed available memory.

2. model = prepare_model_for_kbit_training(model):

Prepares the model for training with a reduced precision, specifically “k-bit” training. This process involves modifying the model to support operations that use fewer bits than standard 32-bit floating-point numbers, potentially reducing memory usage and accelerating computation. However, the exact nature of the preparation can depend on the implementation and might involve quantization techniques or adjustments to the model’s layers to support lower precision arithmetic.

3. model = get_peft_model(model, peft_config):

This function adapts the given model according to the Parameter-Efficient Fine-Tuning (PeFT) configuration provided (peft_config) as we have seen above.

4. print_trainable_parameters(model):

Calls a function that prints out the total number of parameters in the model, the number of trainable parameters, and the percentage of parameters that are trainable as we have seen above.

Overall, these steps are part of an optimization and fine-tuning workflow designed to make large pre-trained models more adaptable and efficient for specific tasks without the need for extensive computational resources. This approach is particularly valuable when dealing with constraints on training time, memory, or when fine-tuning very large models that have billions of parameters.

Output:

trainable params: 21260288 || all params: 3773331456 || trainable%: 0.5634354746703705

# Model Parameter Summary
# Trainable Parameters: 21,260,288
# These are the parameters that will be updated during the training process.

# Total Parameters: 3,773,331,456
# This includes both trainable and non-trainable parameters in the model.

# Trainable Percentage: 0.56%
# This indicates that only 0.56% of the total parameters in the model are trainable.


Let’s continue; we are almost at the end of this long walkthrough.

The following code is useful for optimizing the training or inference of large models by leveraging multiple GPUs to distribute the computational load.

if torch.cuda.device_count() > 1:  # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True
  • Multiple GPUs Check: The code checks if there are more than one GPU available.
  • Enable Parallelism: If multiple GPUs are available, it sets the model to be parallelizable.
  • Activate Parallelism: It then activates model parallelism to utilize multiple GPUs for model operations.

At this point, Colab sometimes fails with the error NotImplementedError: A UTF-8 locale is required. I used the following workaround when Colab raises this kind of error:

import locale
locale.getpreferredencoding = lambda: "UTF-8"

If you want to track your experiments, you can use Weights & Biases. With Weights & Biases you can train and fine-tune models, manage models from experimentation to production, and track and evaluate LLM applications.

This setup is useful for tracking and visualizing the experiments, ensuring reproducibility, and collaborating with others.

!pip install -q wandb -U

import wandb, os
wandb.login()

wandb_project = "quotes-finetune"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project
  • Install Weights & Biases: The code installs the wandb library for experiment tracking.
  • Authenticate: wandb.login() initiates the login process, allowing the user to authenticate and connect to their Weights & Biases account.
  • Set Project Name: It sets the project name for organizing experiments in Weights & Biases.
run_name = "quotes-finetune"
from huggingface_hub import notebook_login

notebook_login()

Notebook Login: The code first uses notebook_login from huggingface_hub to authenticate the user directly within the notebook environment.

Run a Test Fine-Tuning of Mistral 7B

Before proceeding with a full training session, it is advisable to conduct a preliminary test run with a few training steps. This initial test helps ensure that the fine-tuning process is set up correctly and can identify potential issues early, saving time and resources.

Benefits of a Test Run

  • Validation of Setup: Verifies that the training environment, data loading, and model configuration are functioning as expected.
  • Early Detection of Errors: Identifies any errors or misconfigurations in the training script before committing to a longer training session.
  • Performance Monitoring: Allows for monitoring of model performance and resource usage on a smaller scale, helping to adjust parameters and settings accordingly.

By conducting a short test with only a few training steps, you can fine-tune your setup and make necessary adjustments to ensure a smooth and efficient full training process.

from transformers import TrainingArguments

# Use fp16 on non-Ampere GPUs (bf16 requires an Ampere or newer GPU)
args = TrainingArguments(
    output_dir="./mistral_instruct_generation_V2",
    #num_train_epochs=1,
    max_steps=4,                    # comment out this line if you want to train in epochs
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    warmup_steps=1,
    #evaluation_strategy="epoch",
    evaluation_strategy="steps",
    eval_steps=1,                   # comment out this line if you want to evaluate at the end of each epoch
    save_steps=1,
    learning_rate=2e-4,
    fp16=True,
    lr_scheduler_type='constant',
    logging_steps=50,
    logging_dir="./logs",
    save_strategy="steps",          # Save the model checkpoint every logging step
    do_eval=True,                   # Perform evaluation at the end of training
    neftune_noise_alpha=5,
    push_to_hub=False,
    report_to="wandb",              # Comment this out if you don't want to use Weights & Biases
    run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"  # Name of the W&B run (optional)
    # optim="paged_adamw"
)

# model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

# model.config.use_cache = False # silence the warnings. Please re-enable for inference!

Hyperparameters explained

This code snippet is configuring training arguments for fine-tuning a model using the Hugging Face Transformers library. The TrainingArguments class is designed to specify various settings and hyperparameters for the training process. Here's an explanation of the key arguments provided:

  • output_dir: The directory where the outputs (like model checkpoints) will be saved. In this case, it's set to "./mistral_instruct_generation_V2".
  • max_steps: The total number of training steps. The model will stop training after reaching this number of steps. This is set to 4 as an example; in practice, you might need a much higher number depending on the dataset size and the desired model performance.
  • per_device_train_batch_size: The batch size per device during training. It's set to 4, meaning each GPU/CPU will process 4 examples per batch.
  • gradient_accumulation_steps: The number of steps to accumulate gradients before performing a backward/update pass. A higher number can be useful for effectively increasing the batch size, especially when the hardware limitations prevent using a larger batch size directly.
  • gradient_checkpointing: Enables gradient checkpointing to save memory, allowing for training with larger models, sequences, or batch sizes.
  • warmup_steps: The number of steps used for the warm-up phase of the learning rate scheduler. It's set to 1 here, but often a larger number is used to gradually increase the learning rate from 0 to the initial learning rate.
  • evaluation_strategy: Determines when to evaluate the model on the validation set. Setting this to "steps" means evaluation will occur every specified number of steps.
  • eval_steps: The number of training steps between evaluations. Set to 1 for frequent evaluations, useful for short training runs or debugging.
  • save_steps: The model checkpoint is saved every specified number of steps. Set to 1 here for frequent saving.
  • learning_rate: The initial learning rate for training. Set to 2e-4, a common choice for fine-tuning transformer models.
  • fp16: Enables mixed precision training using float16 instead of float32, which can reduce memory usage and potentially speed up training, depending on your GPU.
  • lr_scheduler_type: The type of scheduler for the learning rate. Set to 'constant' for a consistent learning rate throughout training.
  • logging_steps: The frequency of logging training information. Set to 50 to log information every 50 steps.
  • logging_dir: The directory where logs will be saved, set to "./logs".
  • save_strategy: Determines the strategy for saving model checkpoints. Here, it's set to "steps", aligning with the save_steps setting.
  • do_eval: A flag to perform evaluation at the end of training.
  • neftune_noise_alpha: NEFTune adds noise to the embedding vectors during training. This simple yet effective strategy significantly improves model fine-tuning, particularly on instruction-based tasks.
  • push_to_hub: A boolean indicating whether to push the model and training artifacts to Hugging Face's Model Hub. Set to False to avoid uploading.
  • report_to: Specifies the integration for logging metrics and training information. "wandb" indicates that Weights & Biases will be used for tracking.
  • run_name: Specifies the name of the run in Weights & Biases, useful for distinguishing between different training sessions.

This configuration sets up a detailed training regime tailored to the specific needs of the task, including hyperparameters like learning rate, batch size, and strategies for evaluation and saving model checkpoints. It also incorporates practices like gradient accumulation and mixed precision training to optimize for performance and resource utilization.

NEFTune is a technique to boost the performance of chat models and was introduced by the paper “NEFTune: Noisy Embeddings Improve Instruction Finetuning”. It consists of adding noise to the embedding vectors during training. According to the abstract of the paper:

Standard finetuning of LLaMA-2–7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.

https://huggingface.co/docs/trl/sft_trainer

https://github.com/neelsjain/NEFTune
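
For intuition, the noise NEFTune injects is uniform noise scaled by alpha over the square root of (sequence length times embedding dimension). The sketch below is only a mental model; when neftune_noise_alpha is set in TrainingArguments, the Transformers/TRL integration applies this to the embedding layer for you during training:

import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise scaled by alpha / sqrt(seq_len * dim) to the token embeddings."""
    batch, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise  # only applied during training, never at inference time

x = torch.randn(2, 512, 4096)           # a batch of embedded sequences
print(neftune_noise(x, alpha=5).shape)  # torch.Size([2, 512, 4096])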

Real Training scenario

The question of whether the parameters are "optimized" depends significantly on the specific task, the model being fine-tuned, the size and nature of the dataset, and the computational resources available. Here's a brief overview of how each parameter can impact training and considerations for optimization:

  • max_steps = 250: This determines the total number of training steps. Whether this is optimal depends on the complexity of the task and the dataset size. For many tasks, more steps will be required to achieve convergence.
  • per_device_train_batch_size = 4: The batch size per device. This needs to be balanced against the available memory of the GPU. Larger batch sizes can lead to faster convergence but require more memory.
  • gradient_accumulation_steps=4: This allows effectively larger batch sizes without the additional memory cost, by accumulating gradients over several steps before updating model weights. The optimal value again depends on the dataset and model size.
  • gradient_checkpointing=True: Helps reduce memory usage by trading compute for memory, enabling training of larger models or longer sequences than would otherwise be possible.
  • warmup_steps = 5: The number of steps to increase the learning rate from 0 to the initial learning rate. The optimal number can vary, with larger numbers often used to prevent early training instability.
  • evaluation_strategy="steps" and eval_steps=50: Determines how often to evaluate the model on the validation set. The optimal frequency of evaluation depends on the training duration and how quickly you expect the model to improve.
  • save_steps=50: This is how often to save a model checkpoint. Frequent saving can be useful for long trainings where you want to ensure progress is not lost, but it can also generate a large number of files.
  • learning_rate=2e-4: The learning rate for training. Finding the optimal learning rate is crucial and often requires experimentation or techniques like learning rate finders.
  • fp16=True: Enables mixed precision training, which can significantly speed up training and reduce memory usage on compatible GPUs.
  • lr_scheduler_type='constant': Using a constant learning rate is a simple approach, but other strategies might lead to better results, depending on the task.
  • do_eval=True: Whether to perform evaluation. Continuous evaluation helps monitor progress but can slow down training.
  • push_to_hub=True: Automatically pushing to Hugging Face's Model Hub is convenient for sharing models but might not be desired in all scenarios.
  • report_to="wandb": Integration with Weights & Biases is excellent for tracking experiments, but optimal use requires setting up and following a consistent experiment tracking strategy.

Optimization of these parameters typically requires understanding the trade-offs involved and might involve hyperparameter tuning techniques or experiments. It’s also crucial to monitor metrics beyond just loss, such as accuracy or F1 score, to ensure that the model is genuinely improving in ways that matter for the task at hand.
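
Some back-of-the-envelope arithmetic helps put the values above in context. This assumes a single GPU and ignores the fact that packing=True merges several short examples into each 512-token sequence:

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 250
train_set_size = 9876  # from the split above

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
sequences_seen = effective_batch_size * max_steps

print(effective_batch_size)                       # 16 sequences per optimizer step
print(sequences_seen)                             # 4000 sequences over the whole run
print(round(sequences_seen / train_set_size, 2))  # ~0.41 -> well under one full epoch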

from transformers import TrainingArguments

# Use fp16 on non-Ampere GPUs (bf16 requires an Ampere or newer GPU)
args = TrainingArguments(
    output_dir="./mistral_instruct_generation_V2",
    #num_train_epochs=1,
    max_steps=250,                  # comment out this line if you want to train in epochs
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    warmup_steps=5,
    #evaluation_strategy="epoch",
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=50,
    learning_rate=2e-4,
    fp16=True,
    lr_scheduler_type='constant',
    logging_steps=50,
    logging_dir="./logs",
    save_strategy="steps",          # Save the model checkpoint every logging step
    do_eval=True,                   # Perform evaluation at the end of training
    neftune_noise_alpha=5,
    push_to_hub=True,
    report_to="wandb",              # Comment this out if you don't want to use Weights & Biases
    run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"  # Name of the W&B run (optional)
    # optim="paged_adamw"
)

# model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

# model.config.use_cache = False # silence the warnings. Please re-enable for inference!

This setup encapsulates a modern approach to fine-tuning language models, leveraging both parameter efficiency and potentially advanced data preprocessing techniques to tailor the model to specific tasks or datasets more effectively.

from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    # tokenizer=tokenizer,  # pass the tokenizer if the data are not already tokenized
    packing=True,
    formatting_func=create_prompt,  # this will apply the create_prompt mapping to the training and eval datasets
    args=args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    # https://huggingface.co/docs/trl/sft_trainer
)
trainer.train()

The code snippet above instantiates an SFTTrainer object from the trl (Transformer Reinforcement Learning) library, configured to fine-tune the model using the Supervised Fine-Tuning (SFT) approach. Let's break down its components and explain their purpose and functionality:

  • Importing SFTTrainer: The SFTTrainer class is imported from the trl module, which is designed for fine-tuning language models in a more controlled and potentially more efficient way than traditional training methods.
  • max_seq_length: This variable sets the maximum sequence length for the model's inputs. Keeping a fixed sequence length is crucial for batching and ensuring consistent tensor shapes. Here, it's set to 512, which is common for many NLP tasks, balancing the amount of context the model can consider with computational efficiency.
  • Instantiation of SFTTrainer:
  • model: Specifies the pre-trained model to be fine-tuned.
  • peft_config: Passes the configuration for Parameter-Efficient Fine-Tuning (PeFT), potentially incorporating techniques like LoRA or adapters to modify fewer parameters during fine-tuning.
  • max_seq_length: Sets the maximum length of the sequences that will be passed to the model, ensuring inputs are appropriately padded or truncated.
  • packing: Indicates whether sequence packing will be used. Sequence packing can increase training efficiency by allowing variable-length sequences in a batch but requires more complex data preprocessing.
  • formatting_func: This function, create_prompt in this case, is applied to each example in the training and evaluation datasets. It formats the raw data into a structured prompt that the model can process, potentially including tokenization, although it's noted that the datasets here are already tokenized.
  • args: Contains the TrainingArguments object that configures the training process, including optimization parameters, saving intervals, evaluation strategies, etc.
  • train_dataset and eval_dataset: These are the datasets for training and evaluation, respectively. Here, they are expected to be pre-tokenized and possibly pre-processed by the formatting_func.

Pushing a model to the Hugging Face Hub

Pushing a model to the Hugging Face Hub during or after training can greatly facilitate sharing and reusing models across different projects or with the community. While the provided code snippet does not explicitly include steps for pushing the model to the Hugging Face Hub, this can be achieved through a combination of Hugging Face Transformers and Hugging Face Hub utilities. Here’s how you can do it and reuse the model later:

  • Automatic Pushing: If using TrainingArguments from the Hugging Face Transformers library, you can enable automatic pushing by setting the push_to_hub parameter to True and providing additional details like hub_model_id and hub_token. The SFTTrainer does not directly show this functionality, but it's a common feature in the Trainer class from Transformers.
  • Manual Pushing: After training, you can manually push your model to the Hub using the huggingface_hub library. Ensure you're logged in to the Hugging Face Hub (you can use huggingface-cli login), then use the Repository class to create a repository and push your model:
from huggingface_hub import Repository

# Ensure you're logged in, replace 'your_model_name' with your model's name
model_dir = "./your_model_dir"              # The directory where your model files are saved
repo_name = "your_model_name"
username = "your_huggingface_hub_username"  # Replace with your Hugging Face Hub username

repo = Repository(
    local_dir=model_dir,
    clone_from=f"{username}/{repo_name}",
    use_auth_token=True,
)
repo.push_to_hub()
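
Alternatively, because the trainer wraps the model in a peft model, you can use the built-in push_to_hub helpers. A short sketch, assuming the trainer and tokenizer from the steps above; the repository id is just a placeholder:

# The repository id below is a placeholder; replace it with your own namespace/name.
hub_repo_id = "your_huggingface_hub_username/mistral-7b-quotes-lora"

trainer.model.push_to_hub(hub_repo_id)  # uploads the LoRA adapter weights and config
tokenizer.push_to_hub(hub_repo_id)      # uploads the tokenizer files alongside them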

Infer and test the fine-tuned Mistral-7B model on quotes

# Function to generate a response
def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')

    generated_ids = model.generate(**model_inputs, max_new_tokens=5000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return decoded_output[0].replace(prompt, "")

# re-enable the KV cache for inference!
model.config.use_cache = True

# load the trained LoRA adapter on top of the base model from Hugging Face
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained("Gianpiero/afo-mistral-model-200")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
ft_model = PeftModel.from_pretrained(model, "Gianpiero/afo-mistral-model-200")

# Example prompt
prompt = "[INST]Use the provided input to create 15 new and original quotes about \
topic category[/INST] ### Input:Dreams### Response:"

# Generating a response
generate_response(prompt, ft_model)

Output:
1. "Dreams are the paintbrushes of our soul, with which we create our heart's deepest desires."
2. "The only limits to our realization of tomorrow will be our doubts of today."
3. "Follow your dreams, for if you don't take the first step nobody will take it for you."
4. "Dream, and dream big. For dreams are the only limits we have."
5. "You can realize your dreams, if you only dare to believe in yourself."
6. "The path that we dream of becoming comes from the courage to dream big."
7. "Our dreams are the seeds from which we grow and achieve in life."
8. "Dreams are the whispers of our hearts, and they can inspire us to achieve greatness."
9. "Dreams don't hurt; they just hurt when they come true."
10. "Never give up on your dreams, because life is like a dream that you can catch if you believe."
11. "Dreams are the wings that carry us towards our destiny."
12. "Dreaming is a way of being, and it's a way of expressing ourselves in the purest and truest form."
13. "Dreams are the fuel, and when we believe in ourselves, they can turn into fire."
14. "Nothing is impossible if you can dream it, so dream without limits."
15. "Dreams come from the depths of our hearts and minds, and they are the only limits we have."
16. "Dreams are the most powerful force in the world, and they are the only limits we set for ourselves."
17. "It's never too late to chase your dreams, make them yours, and never stop reaching for them."
18. "Dreams are the foundation of success, for they give us the idea to strive for something greater than ourselves."
19. "If you can dream it, you can do it, and if you dare to dream big, you can achieve anything."
20. "To dream, to believe, and to achieve, are the basic rights of every human being."
21. "Dreams are the spark that ignite our passion and drive us towards our goals and aspirations."
22. "We don't achieve our dreams by taking tiny steps, but rather by taking one giant leap."
23. "The only way to achieve our dreams is by believing in ourselves, and never giving up."
24. "Dreams are the key to unlocking the doors of possibility and opportunity."
25. "We are the creators of our own destiny, and it all starts with a dream, a belief, and the courage to act on it."
