Adapting Mixtral 8x7B for Low-Resource Language Understanding

Introduction

Christopher Ibe
11 min read · Mar 23, 2024

Building on the success of our previous work with the Mistral 7B model, our latest research at Hypa AI takes a leap forward with the Mixtral 8x7B parameter model, a direct successor designed to push the boundaries of AI language understanding. This venture aims to address the critical challenge of linguistic and cultural preservation in the digital era. By focusing on English-Igbo translations, we endeavor to enhance AI’s proficiency in handling Igbo — a language rich in nuances yet significantly underrepresented on the broader internet. In an age where AI’s influence continues to expand, the risk of cultural and linguistic extinction looms large for low-resource languages like Igbo. Our project, therefore, is not just about bridging language divides; it’s about safeguarding the very essence of diverse linguistic identities against the backdrop of a rapidly evolving digital landscape. Through this work, we underscore the paramount importance of inclusivity and representation in AI, ensuring that every language and culture secures its rightful place in our shared digital future.

The Mixtral 8x7B Model: An Overview

As a quick refresher, here are some of the key features included in the previous Mistral 7B model. Please revisit our previous work for more info:

  • Sliding Window Attention (SWA): With an 8k context length and a fixed cache size, Mistral 7B achieves a theoretical attention span of 128K tokens, attributed to a 4k sliding window size across its 32-layered architecture. This expansive attention mechanism allows for nuanced understanding and generation of lengthy text sequences by effectively managing computational resources.
  • KV-Cache (KVC): Enhances memory efficiency by storing key-value pairs from previous computations, which can be quickly retrieved and reused in subsequent processing steps. This mechanism reduces redundant computations and speeds up the inference process, especially in tasks involving long sequences of text.
  • Rolling Buffer Cache (RBC): This feature, an artifact of SWA and KV-Cache, employs a fixed attention span (equivalent to the SWA window size) to limit cache size. It stores the keys and values for each time step in the cache; once the cache reaches its limit, the earliest entries are overwritten in a rotating manner, capping cache growth (a minimal sketch of this indexing follows this list).
  • Pre-fill & Chunking: Another artifact of the above three (SWA, KVC, and RBC). The cache is pre-filled with relevant context (i.e., the LLM prompt) before decoding begins, and chunking breaks input sequences into manageable pieces (usually the SWA window size). Together, these significantly enhance processing speed and model responsiveness, ensuring the model is primed to tackle extensive data inputs from the get-go.
  • Rotary Positional Encodings: Incorporates a novel method of encoding the position of tokens within the sequence, providing a more dynamic understanding of the sequence’s structure. This encoding method helps preserve the relational positions of tokens, improving the model’s ability to interpret and generate contextually coherent text.
  • RMS Normalization (Pre-Norm): Utilizes Root Mean Square normalization before each layer’s computations to stabilize the learning process. This form of pre-layer normalization helps in mitigating the vanishing or exploding gradient problem, making the model training more stable and efficient.
  • Grouped Query Attention (GQA): This innovation enables faster inference times and reduced cache sizes by grouping queries before computing attention, streamlining the process without compromising the model’s depth of understanding.
  • Byte-fallback BPE tokenizer: By ensuring that characters are never mapped to out-of-vocabulary (OOV) tokens, this tokenizer eliminates the occurrence of unknown tokens (“<unk>”), thereby preserving the integrity of the input data.
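
To make the rolling buffer idea concrete, here is a minimal sketch of the rotating-overwrite indexing. It is an illustration under our own simplifying assumptions (the class and argument names are hypothetical, not Mistral's actual implementation):

import torch

class RollingKVCache:
    # Fixed-size buffers: memory never grows beyond window_size positions.
    def __init__(self, window_size: int, n_kv_heads: int, head_dim: int):
        self.window_size = window_size
        self.k = torch.zeros(window_size, n_kv_heads, head_dim)
        self.v = torch.zeros(window_size, n_kv_heads, head_dim)

    def update(self, t: int, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        # Time step t always maps to slot t % window_size, so once the buffer
        # is full the oldest entry is overwritten in a rotating manner.
        slot = t % self.window_size
        self.k[slot] = k_t
        self.v[slot] = v_t

# Example: a 4k window, matching Mistral 7B's sliding-window size
cache = RollingKVCache(window_size=4096, n_kv_heads=8, head_dim=128)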

Building on the success of its predecessor, the Mixtral 8x7B model introduces several innovative architectural advancements designed to set new benchmarks in language processing:

  • Sparse Mixture of Experts (SMoE): Similar to switch transformers, this ensemble architecture employs a dynamic routing mechanism that directs each input token to the two most relevant ‘expert’ networks within the model. This specialization allows the model to handle a wide range of linguistic nuances and tasks more efficiently by leveraging expertise concentrated on specific linguistic features. It also reduces the effective active parameter count during inference, providing significant speed improvements (a routing sketch follows Figure 1 below).
  • Model Sharding (Pipeline Parallelism): This technique involves distributing different components of the model across multiple processing units, allowing parallel computation of the model’s layers. It enhances the model’s scalability and speed by dividing the workload, making it particularly effective for training and deploying large-scale language models like Mixtral 8x7B.
  • Block Attention (xFormers library): A technique for fully exploiting GPUs at inference time by concatenating multiple users’ prompts into a single mega-sequence and producing outputs for each corresponding prompt. A BlockDiagonalCausalMask (provided by the xFormers library) prevents one prompt from attending to another within the concatenated mega-sequence.
Figure 1: Mixture of Experts (MoE)
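
To illustrate the top-2 routing described above, here is a minimal, self-contained sketch of a sparse MoE layer. It is an illustration under simplified assumptions rather than Mixtral's actual implementation: the experts are stand-in linear layers instead of full feed-forward blocks, and all names are our own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    # A gate scores every expert per token; only the top-2 experts run,
    # and their outputs are mixed with the renormalized gate weights.
    def __init__(self, dim: int, n_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.gate(x)                          # (tokens, n_experts)
        top_vals, top_idx = scores.topk(2, dim=-1)     # pick the 2 best experts per token
        weights = F.softmax(top_vals, dim=-1)          # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top2MoELayer(dim=16, n_experts=8)
print(moe(torch.randn(4, 16)).shape)  # torch.Size([4, 16])

Since only two of the eight experts run per token, the active parameter count at inference is far smaller than the model's total parameter count, which is where the speed benefit comes from.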

Technical Workflow: A Deep Dive

The technical workflow for adapting the Mixtral 8x7B model for English-Igbo translations closely mirrors the meticulous and innovative approach we applied in our previous Mistral 7B project. However, with Mixtral 8x7B, we’ve ventured into new territories of model architecture and efficiency, aiming for both linguistic accuracy and operational excellence in processing low-resource languages.

  • Dataset Acquisition and Preparation: In fine-tuning this Mixtral 8x7B model, we utilized two distinct sets of English-Igbo pairs, both compiled by the AfroVoices team. The first 2,000 English-Igbo pairs dataset, previously employed in our Mistral 7B project, was complemented by a new, internally translated set derived from the TinyStories dataset by Microsoft Research. This addition, consisting of 1,750 English-Igbo pairs, was chosen for its linguistic simplicity and the presence of long-range dependencies typical of children’s stories. The TinyStories dataset’s unique attributes make it particularly suitable for training models to understand and replicate nuanced language patterns, adding depth to our model’s understanding of Igbo.
  • Prompt Formatting: Key to our workflow was the strategic formatting of training prompts, designed to encapsulate the essence of the translation task. Each prompt comprised a system message to guide the model, followed by an English sentence and its Igbo counterpart. This structured approach was vital in maintaining consistency and effectiveness across the training phase, ensuring each model instance was primed for English-Igbo translations. See the example below, and the formatting sketch that follows it:
<s>[INST]YOU ARE AN EXPERT IGBO TRANSLATOR. USE THE PROVIDED INPUT TO TRANSLATE THE ENGLISH INPUT TEXT INTO IGBO.

### English:
Once upon a time, there was a gifted little girl named Lucy. Every day, she would march around her house with a big smile on her face. She loved to explore and learn new things.

One day when Lucy was marching around the house, she found an ashtray under the kitchen table. She was curious about it and stopped to take a closer look. She picked it up and felt how heavy it was. She wondered what the ashtray was made for.

Lucy carried the ashtray to her mom, who explained that it was a special container used for holding cigarettes. Lucy listened carefully and nodded her head. She thanked her mom, then marched around the house with her new knowledge.

From that day on, Lucy became even more gifted. She was so curious and eager to explore her world. Every day she marched around her house, learning and discovering new things in her own special way.[/INST]

### Igbo:
otu oge, e nwere otu nwa agbọghọ nwere onyinye aha ya bụ Lucy. Kwa ụbọchị, ọ na-eji ihu ọchị na-aga gburugburu ụlọ ya. Ọ na-amasị ya inyocha na ịmụta ihe ọhụrụ.

Otu ụbọchị mgbe Lucy nọ na-agagharị n'ụlọ ahụ, ọ hụrụ otu ntụ ntụ n'okpuru tebụl kichin. ọ chọsiri ike ịmata ya wee kwụsị ilerukwu anya. O welitere ya ma nwee mmetụta ka ọ dị arọ. Ọ nọ na-eche ihe e mere ntụ ntụ maka ya.

Lucy bugara mama ya ihe ntụ ntụ ahụ, bụ́ onye kọwara na ọ bụ akpa pụrụ iche e ji ejide sịga. lucy gere nti nke oma wee nye isi ya. O kelere nne ya, wee jiri ihe ọmụma ọhụrụ ya gagharịa n'ụlọ ahụ.

Site n'ụbọchị ahụ gaa n'ihu, Lucy ghọrọ onye nwere nkà karịa. Ọ chọsiri ike ịmata ụwa ya. kwa ụbọchị, ọ na-ejegharị gburugburu ụlọ ya, na-amụta na-achọpụta ihe ọhụrụ n'ụzọ nke ya pụrụ iche.</s>
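
Programmatically, each training example can be assembled with a small helper like the one below (a sketch with a hypothetical function name; only the <s>/</s> markers, [INST] tags, and section headers follow the format shown above):

SYSTEM_MSG = (
    "YOU ARE AN EXPERT IGBO TRANSLATOR. USE THE PROVIDED INPUT TO TRANSLATE "
    "THE ENGLISH INPUT TEXT INTO IGBO."
)

def build_translation_prompt(english: str, igbo: str) -> str:
    # Wrap the system message and English source in [INST]...[/INST] and append
    # the Igbo reference as the target completion, matching the example above.
    return (
        f"<s>[INST]{SYSTEM_MSG}\n\n"
        f"### English:\n{english}[/INST]\n\n"
        f"### Igbo:\n{igbo}</s>"
    )

print(build_translation_prompt("Once upon a time...", "otu oge..."))
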
  • Model Loading with Quantization: We loaded the SFT fine-tuned and DPO optimized “Mixtral-8x7B-Instruct-v0.1” model in a 4-bit configuration, utilizing double quantization with bfloat16 as the compute dtype. Additionally, we used the flash attention implementation. These steps were crucial for optimizing the model’s performance, particularly in terms of memory usage and computational efficiency, without compromising its translation accuracy. A loading sketch is included after the training code below.
  • Initial Model Testing: Before embarking on the fine-tuning process, we evaluated the model’s baseline performance using the evaluation dataset. This preliminary phase was pivotal in establishing a performance benchmark, ensuring we had a clear understanding of the model’s capabilities in generating Igbo translations from English inputs.
  • Custom Training Loop with PEFT, LoRA, and Sharding: Our custom training loop incorporated the Parameter Efficient Fine-Tuning (PEFT) library in conjunction with a Low-Rank Adaptation (LoRA) configuration. This approach let us efficiently update only a minimal, low-rank projection of the model’s weights, specifically targeting the attention projections, the gate projection, and the feed-forward layers (see code chunk below). By setting a lora_alpha of 128 and a rank of 64, we effectively trained only 18% of the model’s total trainable parameters. Using a constant learning rate across three epochs, parallelized across two A100 80GB GPUs hosted on runpod.io, we fine-tuned the model for optimal English-Igbo translation performance.
# Imports (assumes the peft and transformers libraries are installed)
import torch
from peft import LoraConfig
from transformers import TrainingArguments

# Lora config: low-rank adapters on the attention projections, gate projection,
# feed-forward projections, and the LM head
peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.1,
    r=64,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj", "gate_proj",
        "up_proj", "down_proj", "lm_head",
    ],
    task_type="CAUSAL_LM",
)

# Model Parallelization: shard the (already loaded) model across both GPUs
if torch.cuda.device_count() > 1:
    print(torch.cuda.device_count())
    model.is_parallelizable = True
    model.model_parallel = True


# Hyper Parameters
args = TrainingArguments(
    output_dir="Mixtral7B_Igbo_translation_v1",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    warmup_ratio=0.03,             # warm up over 3% of training steps
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    learning_rate=2e-4,
    lr_scheduler_type="constant",  # constant learning rate, as described above
    bf16=True,
)
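
For completeness, the 4-bit, double-quantized loading step described earlier can be sketched roughly as follows, assuming the Hugging Face transformers + bitsandbytes integration. The NF4 quantization type and the automatic device map are our assumptions; the post only specifies 4-bit double quantization, a bfloat16 compute dtype, and flash attention.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_quant_type="nf4",              # assumption: NF4 quant type
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute dtype
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # flash attention
    device_map="auto",                        # assumption: spread layers across the two GPUs
)

From here, the LoRA adapters configured above would typically be attached with peft’s get_peft_model(model, peft_config) before training.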

Results and Evaluation: A Mixture of Experts (MoE) Approach Toward Multilingual Understanding

Our comprehensive evaluation of the Mixtral 8x7B model’s performance in translating English to Igbo showcases the significant strides made towards improving AI’s understanding of low-resource languages. Utilizing a carefully curated evaluation set of 10 data points from our previously held-out evaluation dataset, we embarked on a detailed analysis to measure the effectiveness of both the default Mixtral model and its finetuned counterpart.

BLEU Score Analysis: The original Mixtral model, prior to any fine-tuning, achieved an average BLEU score of 2.5. This baseline performance, while modest, provided a crucial starting point for our fine-tuning endeavors. Following a rigorous supervised fine-tuning process, we observed a roughly fourfold improvement in the model’s BLEU score, which rose to an average of 10.25. BLEU scores, or Bilingual Evaluation Understudy scores, serve as a critical metric in our analysis, offering quantitative insights into translation quality by comparing the model-generated outputs against reference human translations. The metric emphasizes the precision of word sequences (n-grams) and incorporates a brevity penalty to discourage overly short translations, ensuring that the evaluations reflect both accuracy and coherence.
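
The post does not specify which BLEU implementation was used; as one illustrative option, corpus-level BLEU over the model outputs and reference translations can be computed with the sacrebleu library (the strings below are placeholders):

import sacrebleu

# Hypotheses are the model's Igbo outputs for the evaluation prompts;
# references hold the human Igbo translations (a single reference corpus here).
hypotheses = ["otu oge, e nwere otu nwa agbọghọ aha ya bụ Lucy."]
references = [["otu oge, e nwere otu nwa agbọghọ nwere onyinye aha ya bụ Lucy."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")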

Table 1: English-Igbo Translation Using the Default Mixtral Model
Table 2: English-Igbo Translation Using the SFT Finetuned Mixtral Model

Human Expert Review

To complement our quantitative analysis with qualitative insights, we engaged three human translation experts to evaluate the model’s outputs. Our AfroVoices team, comprising specialists in Igbo language and culture, conducted a thorough review of the translations generated by both the default and finetuned Mixtral 8x7B models. Their observations echoed the improvements highlighted by the BLEU scores: the finetuned model exhibited markedly enhanced coherence in its Igbo translations, whereas the default model’s outputs often lacked syntactic and semantic coherence and read as incoherent Igbo.

While the default model’s translations frequently missed the mark, lacking coherent linguistic structure, the finetuned model demonstrated clear signs of learning, especially in the translations produced after the extended supervised fine-tuning process. This validation by human experts not only reaffirms the quantitative improvements but also highlights the model’s augmented capability to convey the nuances of the Igbo language, making strides towards bridging the gap in AI’s understanding of such underrepresented languages.

Tokenization Analysis:

We further explored the possible effects of tokenization on AI’s language understanding using the Mixtral 8x7B tokenizer. A crucial step in our analysis was plotting a histogram of the token length distribution across our dataset. Remarkably, we observed that the majority of tokenized sequences fell below 1,500 tokens, well within Mixtral’s training context of sequences up to 32,000 tokens. This distribution underscores the tokenizer’s efficiency in handling our dataset, ensuring most inputs remain within the model’s optimal operational range for context length.

Figure 2: Input Token Lengths Across the Training Data

However, a deeper dive into the tokenization of English and Igbo translations separately unveiled a significant discrepancy. For identical English-Igbo translation pairs, the Igbo token count was approximately 2.5 times greater than its English counterpart. This disparity likely stems from the Byte Pair Encoding (BPE) tokenizer’s training, which, despite its advanced capabilities, may not have been sufficiently exposed to Igbo and other low-resource languages. Consequently, this tokenization imbalance introduces a computational bottleneck, as the model is compelled to process an inflated number of tokens for Igbo translations, compared to English, for the same conceptual content. This not only stretches the model’s attention mechanisms but also could potentially dilute its understanding by spreading its focus too thinly across a broader token array.

Figure 3: Input Token Lengths for the English Side of Each Training Pair
Figure 4: Input Token Lengths for the Corresponding Igbo Translations
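
The per-language comparison described above can be reproduced with a few lines using the Mixtral tokenizer from Hugging Face. This is a sketch under our own assumptions about the data structure; the single example pair shown is taken from the prompt example earlier in this post:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Illustrative structure: a list of English-Igbo pairs
pairs = [
    {"english": "Once upon a time, there was a gifted little girl named Lucy.",
     "igbo": "otu oge, e nwere otu nwa agbọghọ nwere onyinye aha ya bụ Lucy."},
]

en_lengths = [len(tokenizer(p["english"]).input_ids) for p in pairs]
ig_lengths = [len(tokenizer(p["igbo"]).input_ids) for p in pairs]

ratio = sum(ig_lengths) / sum(en_lengths)
print(f"Igbo/English token ratio: {ratio:.2f}")  # roughly 2.5x across our dataset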

For languages like Igbo, where digital representation is sparse, such a tokenizer bias can significantly impede the model’s ability to grasp and reproduce the language’s nuances accurately, posing a challenge to achieving equitable AI understanding across languages. Addressing this tokenizer bias is crucial for enhancing low-resource language comprehension, underscoring the need for more inclusive training datasets that better represent the linguistic diversity of our world.

Conclusion

The results from our exploration into fine-tuning the Mixtral 8x7B model for English-Igbo translations underscore the potential for AI to make meaningful advancements in processing and understanding low-resource languages. The roughly fourfold improvement in BLEU scores post-finetuning, coupled with the affirmative evaluations by language experts, illustrates the model’s enhanced proficiency in delivering coherent and culturally resonant Igbo translations. As we continue to refine our methods and expand the scope of our work, these findings not only validate the efficacy of our approach but also pave the way for further research aimed at democratizing language AI, ensuring every language, no matter how underrepresented, finds its voice in the digital age.

About the Authors

Christopher Ibe and Okezie Okoye continue to lead Hypa AI towards new frontiers in AI translation. Their dedication to leveraging advanced AI for genuine understanding and connection across language barriers is what sets Hypa AI apart in the field of artificial intelligence.

Hypa AI remains steadfast in its mission to pioneer intelligent solutions that are not just technologically advanced but are also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.

AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to celebrate the richness of African linguistic diversity on a global stage.
