Mistral 7b Meets Darija 🇩🇿: A Continual Pre-training Journey

kirouane Ayoub

Asalam Alaikom Brothers and Sisters👋

Our beautiful Algerian Darija is a unique blend of languages, a rich tapestry woven with Arabic, French, and even touches of English and Spanish. It's the language we use every day to express our thoughts, share our stories, and connect with each other.

But here's the thing: as much as we love our Darija, it's often overlooked in the world of technology. That's where our mission comes in: we want to make machines understand our language!

This journey started with our first project, 🇩🇿 Darija-GPT: Let's make machines understand our language!, in which we trained GPT-2 from scratch on Algerian Darija, laying the foundation for what was to come. Then we took things to the next level with Llama 2🦙 Speaks Darija 🇩🇿: From Scratch to Darija Mastery, where we focused on training Llama 2 on our unique language blend.

Now, I'm very excited to introduce the third part of this series: a deep dive into continual pre-training to further improve Darija models. This step is essential to ensure that our AI not only understands but thrives in our linguistic landscape.

Why continual pre-training?

You might wonder, why can’t we just train a model on Darija directly?

The answer lies in the complexity and richness of our language. Since Darija is a mix of several languages, a model needs a solid foundation in Arabic and French to truly grasp the nuances of our daily conversations. Continual pre-training helps us build this foundation step by step, starting with a model that already understands Arabic and French, and then gradually teaching it the intricacies of Darija.

The nature of Algerian Darija

Darija isn’t just a language or dialect; it’s a reflection of our culture, history, and daily lives. It’s the way we joke, argue, and express our deepest feelings.

However, finding a comprehensive dataset for Darija is a challenge. Thankfully, several startups are working hard to collect Darija data. It would be incredibly beneficial if parts of this data, not necessarily all of it, were made available to everyone rather than only for commercial purposes; such a collaborative effort would go a long way toward fine-tuning our models and preserving our linguistic heritage.

This journey is going to be long and detailed, so get ready. I've tried to make this content accessible for all levels of AI learners, adding explanations where necessary to ensure we're all on the same page. So grab your cup of "na3na3" and join me on this exciting journey to make machines understand Algerian Darija.

Before diving into the specifics, it’s essential to understand the broader context of our mission. We’re on a quest to make machines understand Algerian Darija, a vibrant mix of Arabic, French, and other influences. In this journey, we’ll explore the Mistral-7B model’s architecture, the process of Continued Pre-Training, and how we’re leveraging various datasets. Additionally, we’ll delve into advanced techniques like Parameter-Efficient Fine-Tuning (PEFT), the Unsloth Framework, Rotary Position Embedding (RoPE), and Flash Attention-2. Each of these components plays a crucial role in our overarching goal.

Mistral-7B:

The Mistral-7B model is a state-of-the-art large language model developed by Mistral AI.

Architecture:

The Mistral-7B model is a robust large language model built on a decoder-only transformer architecture. It consists of 7 billion parameters, finely tuned to offer exceptional performance. The model employs advanced attention mechanisms, including Sliding Window Attention and Grouped-Query Attention (GQA), enhancing both inference speed and memory efficiency.

Key parameters defining the Mistral-7B architecture include:

  • dim: 4096
  • n_layers: 32
  • head_dim: 128
  • hidden_dim: 14336
  • n_heads: 32
  • n_kv_heads: 8
  • window_size: 4096
  • context_len: 8192
  • vocab_size: 32000
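If you want to verify these numbers yourself, here is a minimal sketch of my own (not from the original notebook) that reads them straight from the Hugging Face config, assuming you have access to the mistralai/Mistral-7B-v0.1 checkpoint; the transformers field names differ slightly from the list above:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.hidden_size)           # 4096   (dim)
print(cfg.num_hidden_layers)     # 32     (n_layers)
print(cfg.num_attention_heads)   # 32     (n_heads)
print(cfg.num_key_value_heads)   # 8      (n_kv_heads)
print(cfg.intermediate_size)     # 14336  (hidden_dim)
print(cfg.sliding_window)        # 4096   (window_size)
print(cfg.vocab_size)            # 32000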

Attention Mechanisms:

It employs advanced attention mechanisms such as Sliding Window Attention and Grouped-Query Attention (GQA) which help in achieving faster inference and efficient memory usage.
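To make these two mechanisms concrete, here is a small illustrative sketch of my own (using a Mistral-7B-like configuration, not the model's actual code) of how GQA shrinks the KV cache and how the sliding window limits the attention span:

import torch

n_heads, n_kv_heads, head_dim, window = 32, 8, 128, 4096
seq_len = 6000

# Grouped-Query Attention: 32 query heads share 8 K/V heads, so the KV cache
# stores 8 heads instead of 32 (a 4x reduction); each K/V head is reused by
# n_heads // n_kv_heads = 4 query heads at attention time.
k_cache = torch.zeros(1, n_kv_heads, seq_len, head_dim)
k_for_attention = k_cache.repeat_interleave(n_heads // n_kv_heads, dim=1)
print(k_cache.shape, "->", k_for_attention.shape)   # (1, 8, 6000, 128) -> (1, 32, 6000, 128)

# Sliding Window Attention: token i only attends to positions in (i - window, i].
i = torch.arange(seq_len)[:, None]
j = torch.arange(seq_len)[None, :]
mask = (j <= i) & (j > i - window)
print(int(mask[-1].sum()))   # 4096 visible positions for the last token, not 6000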

Tokenizer:

Uses a Byte-Pair Encoding (BPE) tokenizer with byte fallback, ensuring that no characters are out-of-vocabulary.
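A quick way to see the byte-fallback behaviour on mixed Darija/French text — an illustrative example of mine, assuming you can download the Mistral-7B-v0.3 tokenizer from Hugging Face:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
sample = "صحيت خويا, merci beaucoup!"   # Darija ("thanks, brother") mixed with French
ids = tok(sample)["input_ids"]
print(tok.convert_ids_to_tokens(ids))   # characters missing from the BPE vocab fall back to byte tokens, never <unk>
print(tok.decode(ids, skip_special_tokens=True))   # decodes back to the original mixed-script text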

Performance

  • Benchmark Performance: Mistral-7B significantly outperforms the Llama 2 models, including Llama 2 13B, across a variety of benchmarks. It excels in areas like mathematics, code generation, and reasoning, approaching the performance of even larger models without sacrificing efficiency.
  • Instruction Tuning: Mistral-7B-Instruct versions, fine-tuned on instruction datasets, show superior performance on benchmarks like MT-Bench compared to other 7B models and are competitive with some 13B models.

Usage

  • Pretrained and Instruction-Tuned Models: Mistral AI has released different versions, including a base pretrained model (Mistral-7B-v0.1) and instruction-tuned versions (Mistral-7B-Instruct-v0.1, v0.2, and v0.3).
  • Flash Attention: The model’s performance can be enhanced further using Flash Attention, an optimized implementation of the attention mechanism, which significantly speeds up the model’s inference times.

Accessibility

  • Open Source: Mistral-7B is available under the Apache 2.0 license, and can be accessed and used through platforms like Hugging Face.

Continued Pre-Training:

Continued pre-training of large language models (LLMs) involves further training a pre-existing model on new data, often to adapt it to specific domains or improve its performance over time. This process can help the model retain relevance and incorporate the latest information without the need to train a new model from scratch.

Key Concepts in Continued Pre-Training

  • Stability Gap: The performance of LLMs can initially drop when continual pre-training begins, a phenomenon referred to as the stability gap. This happens because the model needs to adjust to the new data, which can cause a temporary increase in loss before it starts improving again.
  • Warm-up Phase: The length and intensity of the warm-up phase (the initial phase where the learning rate is gradually increased) can impact the efficiency of continual pre-training. However, research suggests that the length of the warm-up phase might not significantly affect the model's performance on validation tasks, although a proper warm-up can prevent initial instability.
  • Learning Rate Adjustment: Adjusting the learning rate during continual pre-training is crucial. A learning rate that is too high can lead to catastrophic forgetting of previously learned information, while one that is too low can slow down learning on new data. Optimal learning rate schedules balance these aspects to ensure smooth transitions and effective learning; a small sketch of a warm-up-plus-decay schedule follows this list.
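To make the warm-up and learning-rate points above concrete, here is a small sketch of my own (not part of the training code below) of a warm-up-plus-cosine schedule with transformers, mirroring the warmup_ratio = 0.1, learning_rate = 5e-5, and cosine scheduler used later in this post:

import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 1000
opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-5)
sched = get_cosine_schedule_with_warmup(
    opt, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps
)

lrs = []
for _ in range(total_steps):
    opt.step()
    sched.step()
    lrs.append(sched.get_last_lr()[0])

print(lrs[0], lrs[99], lrs[-1])  # ~0 at the start, peaks at 5e-5 once warm-up ends, decays back toward 0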

Applications and Benefits

  • Domain Adaptation: Continual pre-training is especially useful for adapting general LLMs to specific domains such as medical, financial, or legal text, or to languages like Arabic or French. This method is more cost-effective than training new models from scratch and allows models to acquire domain-specific knowledge effectively.
  • Enhanced Performance: Continued pre-training helps improve the performance of LLMs on specific tasks and datasets. For example, models continually pre-trained on medical data show improved performance on medical-related tasks while maintaining general language capabilities.
  • Efficiency: This approach offers a middle ground between the expensive process of training from scratch and the limited scope of fine-tuning. It allows for comprehensive knowledge acquisition and style adaptation with moderate computational resources.

Challenges and Considerations

  • Catastrophic Forgetting: One of the main challenges is avoiding the forgetting of previously learned information when the model is updated with new data. Careful management of learning rates, training schedules, and data mixing can mitigate this issue.
  • Data Quality and Diversity: The quality and diversity of the continual pre-training data are critical. Poor-quality or overly narrow data can lead to degraded model performance or overfitting to specific data characteristics.

Continued pre-training thus serves as a vital strategy for maintaining and enhancing the relevance and performance of large language models over time, adapting them to new information and specific domains effectively.

Datasets

1. Arabic_Mix Dataset:

  • Composition: 70% Arabic, 30% French.
  • Source: Wikipedia dataset.
  • Purpose: To ensure the model develops a balanced understanding of Arabic and a foundational knowledge of French.

2. French_Mix Dataset:

  • Composition: 70% French, 30% Arabic.
  • Source: Wikipedia dataset.
  • Purpose: To reinforce French language capabilities while maintaining Arabic knowledge.

3. Algerian Darija Dataset:

  • Size: Over 170,000 rows.
  • Sources: Existing Darija datasets, web scraping, YouTube comments, and transcripts.
  • Challenges: Contains imperfections due to speech-to-text limitations and requires further cleaning for quality enhancement.
  • Purpose: To fine-tune the model to understand and generate text in Algerian Darija, reflecting its mixed-language structure.
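To give an idea of how such blends can be assembled, here is a hedged sketch of my own, not the authors' actual preprocessing; the wikimedia/wikipedia snapshot names are illustrative assumptions:

from datasets import load_dataset, interleave_datasets

# Stream Arabic and French Wikipedia and sample them at a 70/30 ratio,
# roughly the composition described for the Arabic_Mix dataset above.
ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
fr = load_dataset("wikimedia/wikipedia", "20231101.fr", split="train", streaming=True)

mix = interleave_datasets([ar, fr], probabilities=[0.7, 0.3], seed=42)
for example in mix.take(3):
    print(example["title"])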

Parameter-Efficient Fine-Tuning:

Parameter-Efficient Fine-Tuning (PEFT) is a technique designed to adapt large pre-trained models to specific tasks with minimal computational and storage costs. This approach involves fine-tuning only a small subset of the model's parameters rather than the entire model, making it significantly more efficient. The key benefit is achieving performance comparable to full fine-tuning while requiring fewer resources, which is crucial as model sizes continue to grow.

PEFT Methods:

1. Low-Rank Adaptation (LoRA): This method introduces low-rank matrices alongside the model parameters, allowing the model to learn task-specific adaptations without modifying the entire parameter set. It is particularly effective for large models and can be implemented with minimal additional computational overhead (a numeric sketch follows this list).

2. Prefix Tuning and Prompt Tuning: These methods involve adding tunable prompts or prefix tokens to the input sequences, effectively guiding the model to adapt to new tasks. They are particularly useful for text generation and language understanding tasks.

3. Adapters: These are small neural network modules inserted within the layers of the pre-trained model. Adapters enable the model to learn task-specific features without changing the original model weights, thus maintaining the pre-trained knowledge while adapting to new tasks.

4. Sparse Fine-Tuning: This method selects a small subset of parameters to tune based on their relevance to the task. It further reduces the computational load and storage requirements by updating only the most important parameters.
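To make the LoRA idea from item 1 concrete, here is a tiny numeric sketch of my own showing a frozen weight plus the (alpha / r) * B @ A low-rank update, using the same d = 4096, r = 128, and alpha = 32 that appear in the training code later in this post:

import torch

d, r, alpha = 4096, 128, 32
W = torch.randn(d, d)          # frozen pre-trained weight
A = torch.randn(r, d) * 0.01   # trainable, r x d
B = torch.zeros(d, r)          # trainable, d x r (zero-init, so training starts exactly at W)

x = torch.randn(d)
y = W @ x + (alpha / r) * (B @ (A @ x))   # LoRA-adapted forward pass for one linear layer

full = d * d       # 16,777,216 parameters in W
lora = 2 * d * r   # 1,048,576 trainable parameters, about 6% of W
print(full, lora)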

PEFT methods are integrated with libraries like Hugging Face's Transformers, Diffusers, and Accelerate, which facilitate loading, training, and deploying large models for various tasks efficiently. This makes it possible to train and use large language models on consumer hardware with limited resources.

In summary, PEFT offers a practical solution for fine-tuning large models by reducing the number of trainable parameters, thereby saving computational resources while maintaining high performance. This makes it an attractive approach for deploying sophisticated models in resource-constrained environments.

Unsloth Framework:

The Unsloth framework is designed to significantly enhance the efficiency and effectiveness of fine-tuning large language models like LLaMA, Mistral, Phi, and Gemma.

Why Unsloth?

  1. Performance Improvement: Unsloth enables fine-tuning 2–5 times faster while using up to 80% less memory compared to traditional methods. This is particularly beneficial for users with limited computational resources.
  2. Memory Efficiency: By optimizing memory usage, Unsloth allows for efficient handling of large models on less powerful hardware, making it accessible to a broader range of users.
  3. Supported Models: The framework supports a wide array of models including LLaMA, Mistral, Phi, and Gemma across different sizes (e.g., 7B, 13B). This versatility ensures that users can work with various model architectures and configurations.
  4. Saving and Loading Models: After fine-tuning, models can be saved locally or uploaded to platforms like Hugging Face for easy sharing and deployment. This flexibility in saving and loading models helps streamline the workflow for further use or inference.
  5. Advanced Features: The framework supports several advanced techniques such as RoPE scaling for extending context length, FlashAttention-2 for efficient attention computation, and dataset streaming for handling large datasets more effectively.

Rotary Position Embedding (RoPE):

Rotary Position Embedding (RoPE) is an innovative approach designed to incorporate positional information into transformer models, enhancing their capability to handle sequences of various lengths and improving their generalization to new sequence lengths. RoPE was introduced to address limitations found in both fixed and learned positional embeddings, particularly their underperformance with sequences significantly different from training data.

RoPE operates by rotating the query and key vectors in the self-attention mechanism, where each position in the sequence receives a unique rotation. This rotation ensures that the dot product between queries and keys diminishes for tokens that are distant from one another, effectively encoding relative positional information.
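Here is a minimal sketch of my own of the rotation idea, for a single 2-D slice of a head dimension; it shows that the query/key dot product depends only on the relative offset between the two positions:

import math
import torch

def rotate(vec2d, pos, theta=1.0):
    # Rotate a 2-D vector by an angle proportional to its position, as RoPE does
    # for each pair of dimensions in the query/key vectors.
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    rot = torch.tensor([[c, -s], [s, c]])
    return rot @ vec2d

q = torch.tensor([1.0, 0.0])
k = torch.tensor([0.5, 0.5])

# The attention score between positions m and n depends only on the offset m - n.
print(torch.dot(rotate(q, 10), rotate(k, 7)))     # offset 3
print(torch.dot(rotate(q, 103), rotate(k, 100)))  # offset 3 -> numerically the same score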

Flash Attention-2

FlashAttention-2 is an optimized version of the original FlashAttention algorithm designed to accelerate attention mechanisms in neural networks, particularly transformers. The key improvements in FlashAttention-2 revolve around reducing the number of non-matrix-multiplication (non-matmul) floating-point operations (FLOPs), enhancing parallelism, and optimizing work partitioning.
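For reference, this is how FlashAttention-2 is typically enabled in a plain transformers workflow — a hedged sketch of mine, assuming an Ampere-or-newer GPU with enough memory and the flash-attn package installed; Unsloth, used below, wires this up automatically:

import torch
from transformers import AutoModelForCausalLM

# Load Mistral-7B in bfloat16 with the FlashAttention-2 kernel for the attention layers.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)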

Connecting the Dots:

Having explored the foundational elements of the Mistral-7B architecture, Continued Pre-Training, datasets, Parameter-Efficient Fine-Tuning (PEFT), the Unsloth Framework, Rotary Position Embedding (RoPE), and Flash Attention-2, we are ready to delve into the practical application of these concepts.

By leveraging the Unsloth framework, which supports RoPE and Flash Attention-2, we can optimize the continual pre-training process for the Mistral-7B model. Using PEFT, we will efficiently adapt the model to our datasets: Arabic_Mix, French_Mix, and Algerian Darija. This strategic approach ensures the model retains previous knowledge while seamlessly incorporating new linguistic data.

Code Time:

Load the LoRA Model and Tokenizer:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

The initial code block focuses on setting up the environment for our project by installing essential packages.

from unsloth import FastLanguageModel
import re

from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments
import torch

In this code, we import the necessary components and configure parameters for working with language models.

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistralai/Mistral-7B-v0.3",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

We then load the pre-trained Mistral-7B model and tokenizer with specific settings for sequence length, data type, and 4-bit quantization to optimize performance and memory usage.
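As a rough back-of-envelope of my own (weights only, ignoring activations, the KV cache, and LoRA overhead), this is why 4-bit loading matters for a roughly 7.25-billion-parameter model:

params = 7.25e9  # approximate parameter count of Mistral-7B
for name, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB")
# fp16: ~14.5 GB, 8-bit: ~7.2 GB, 4-bit: ~3.6 GB -- only the last comfortably fits a free Colab T4.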

In this step, we configure the LoRA (Low-Rank Adaptation) settings for the pre-trained model.

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

We specify parameters for the adaptation process, including the rank, target modules, and dropout settings, to optimize the model for continued training. We also enable advanced features like gradient checkpointing and rank-stabilized LoRA to improve efficiency and manage larger batch sizes.

Prepare the Arabic_Mix Dataset:

In this code, we define a function to format Arabic Wikipedia text by cleaning newlines and adding a special end-of-sequence token.

import re

def clean_newlines(text):
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

arabic_wikipedia_prompt = """
العنوان: {}
النص :
{}
"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def arabic_formatting_prompts_func(examples):
    titles = examples["title"]
    texts = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = arabic_wikipedia_prompt.format(title, clean_newlines(text)) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }

arabic = load_dataset("ayoubkirouane/Arabic_mix", split = "train",)
arabic = arabic.map(arabic_formatting_prompts_func, batched = True,)

We then apply this formatting to a dataset of Arabic text by mapping the formatting function over the dataset to prepare it for further use.

Start the first round:

In this step, we start the first round of continued pre-training for the model using the Arabic text dataset.

# Start the first round of Continued Pre-Training
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = arabic,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        warmup_ratio = 0.1,
        num_train_epochs = 50,

        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()

We set up the UnslothTrainer with parameters for batch size, learning rates, and optimization techniques, and then begin the training process for 50 epochs to refine the model.
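For reference, a small arithmetic sketch of my own for the settings above (assuming a single GPU): the effective batch size the optimizer sees per update step is the per-device batch size times the gradient accumulation steps:

per_device_train_batch_size = 2
gradient_accumulation_steps = 8
effective_batch = per_device_train_batch_size * gradient_accumulation_steps  # 16 sequences per optimizer step
tokens_per_step = effective_batch * 2048                                     # up to 32,768 tokens at max_seq_length = 2048
print(effective_batch, tokens_per_step)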

Prepare the French_Mix Dataset:

In this code, we define a function to format French Wikipedia text by cleaning newlines and adding a special end-of-sequence token.

french_wikipedia_prompt = """
titre : {}
text :
{}
"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def french_formatting_prompts_func(examples):
    titles = examples["title"]
    texts = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = french_wikipedia_prompt.format(title, clean_newlines(text)) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }

french = load_dataset("ayoubkirouane/French_mix", split = "train",)
french = french.map(french_formatting_prompts_func, batched = True,)

We then apply this formatting to a dataset of French text to prepare it for further processing.

Start the second round:

In this step, we initiate the second round of continued pre-training for the model.

# Start the second round of Continued Pre-Training
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = french,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 50,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()

We configure the UnslothTrainer with settings for batch size, learning rates, and optimization strategies. We then run the training process on the French text dataset and capture the training statistics for evaluation.

Start the last round:

In this step, we load the Algerian Darija dataset and initiate the final round of continued pre-training for the model.

# Load the Darija dataset
darija = load_dataset("ayoubkirouane/Algerian-Darija", split="v1")

# Start the last round of Continued Pre-Training
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = darija,
    dataset_text_field = "Text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 5,

        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()

We configure the UnslothTrainer with specific settings for batch size, learning rates, and optimization methods, and then train the model for 5 epochs to further refine its performance.

model.save_pretrained("mistral_darija") # Local saving
tokenizer.save_pretrained("mistral_darija")

In this final step, we save the trained LoRA adapters and tokenizer to a local directory named "mistral_darija" for future use or deployment.
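If you prefer a standalone checkpoint rather than adapters, one generic option (my own sketch, not from the original notebook) is to merge the LoRA weights into full-precision base weights with the peft library. This assumes the adapters saved in "mistral_darija" load as standard PEFT adapters and that the small numerical drift from merging adapters trained against 4-bit weights is acceptable:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the full-precision base model, apply the trained adapters, then fold them into the weights.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "mistral_darija").merge_and_unload()
merged.save_pretrained("mistral_darija_merged")
AutoTokenizer.from_pretrained("mistral_darija").save_pretrained("mistral_darija_merged")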

Inference Time:

We load the previously saved LoRA model and tokenizer for inference. We configure the model for faster inference and prepare input text for generation.

from unsloth import FastLanguageModel
from transformers import TextStreamer

dtype = None
load_in_4bit = True
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistral_darija",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable twice as fast inference

# "وحد نهار" means "One day..." -- a classic Darija story opener used here as the prompt.
inputs = tokenizer(["وحد نهار"], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Finally, we generate text with the model and stream the output so the results appear as they are produced.

And there you have it! We've journeyed through the fascinating world of AI and Darija, exploring the intricate details of continual pre-training, datasets, and some advanced techniques. This is just my personal effort, a humble attempt to contribute to the exciting field of AI language models for Darija. We're still in the early stages, and there might be bumps along the way, but I'm incredibly excited to see where this journey takes us.

Linkedin : https://www.linkedin.com/in/ayoub-kirouane3

Huggingface : https://huggingface.co/ayoubkirouane

Complete Code : https://github.com/Kirouane-Ayoub/Mistral-7b-Meets-Darija
