Enabling Quality Embedding and Text Generation for Amharic Language

15 min read · Feb 3, 2024

In the fast-paced world of African startups, AIQEM is making waves with its cutting-edge AI and Blockchain solutions. Their latest game-changer, Adbar, is an AI-based Telegram Ad solution tailored for Ethiopian businesses.

To ensure Adbar’s success in the Telegram ecosystem, AIQEM is embarking on a mission to enhance its advertising strategy. The focus? Developing an Amharic RAG pipeline that leverages top-tier open-source language models like Mistral AI’s Mixtral 8x7B or fine-tuned versions of Llama 2.

This blog series walks through the journey of fine-tuning these models for that purpose.

Project approach

Vision & Scope: As a team, we collectively defined a clear vision for our project, outlining specific objectives that reflected the diverse expertise within our group. We carefully considered whether our LLM would serve as a versatile tool or focus on our specific task.

Model Selection: After extensive discussions, we considered the intricacies of our project, taking into account the varied perspectives and expertise within the team.

Model’s Performance and Adjustment: Once the model was prepared, we assessed its performance together. Where it fell short, we turned to prompt engineering with retrieval-augmented generation (RAG). It was crucial that the model’s outputs aligned with human preferences, and we refined it until it met the desired standards.

Evaluation & Iteration: Regular team evaluations became a cornerstone of our project. Metrics and benchmarks were essential tools in gauging the effectiveness of our model. The iterative process of prompt engineering, fine-tuning, and evaluation was a collaborative effort, allowing us to achieve the desired outcomes over time.

Deployment: With a well-performing model, we transitioned to deployment as a team. Optimizing for computational efficiency and user experience became our collective focus.

Understanding the concepts

In the ever-evolving landscape of Large Language Models (LLMs), understanding how they are developed and adapted is crucial for harnessing their full potential. Below we delve into the key components, methodologies, and techniques shaping that landscape, focusing on pre-training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and Low-Rank Adaptation (LoRA).

What is LLM fine-tuning?

Large language model (LLM) fine-tuning is the process of taking pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain. Fine-tuning turns general-purpose models into specialized ones. It bridges the gap between generic pre-trained models and the unique requirements of specific applications, ensuring that the language model aligns closely with human expectations. Think of OpenAI’s GPT-3, a large language model designed for a broad range of natural language processing (NLP) tasks. Suppose a healthcare organization wants to use GPT-3 to help doctors generate patient reports from textual notes. While GPT-3 can understand and create general text, it might not be optimized for intricate medical terms and specific healthcare jargon. Fine-tuning it on a dataset of medical reports and patient notes would make it more familiar with that terminology.

Pre-training

Pre-training forms the bedrock of LLM development, exposing the model to vast corpora of text data to grasp linguistic patterns and construct nuanced language representations. Techniques like Masked Language Modeling (MLM), Contrastive Learning, and Causal Language Modeling (CLM) contribute to the versatility of LLMs, enabling them to excel in downstream Natural Language Processing (NLP) tasks such as text classification, question answering, and summarization.
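To make the two most common objectives concrete, here is a minimal sketch of how training examples could be built for each. The helper names causal_lm_pairs and mask_tokens and the [MASK] placeholder are illustrative assumptions, not part of our pipeline; real implementations work on token IDs and choose the masked positions at random.

```python
def causal_lm_pairs(tokens):
    """CLM: predict each token from all the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_tokens(tokens, positions, mask="[MASK]"):
    """MLM: hide the chosen positions and keep the originals as labels."""
    hidden = set(positions)
    inputs = [mask if i in hidden else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(hidden)}
    return inputs, labels


# Hypothetical whitespace-tokenized sentence, used only for illustration.
tokens = ["አዲስ", "አበባ", "ውብ", "ከተማ", "ናት"]
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)
print(mask_tokens(tokens, positions=[1, 3]))
```

Causal models such as Llama 2 are trained with the first objective, which is why the later fine-tuning steps load a causal language model.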

Supervised Fine-Tuning (SFT)

SFT emerges as a crucial technique to enhance LLMs for specific tasks or domains. By training the model on labeled data, SFT refines its understanding of task-specific patterns, resulting in improved accuracy and efficiency. SFT finds applications in diverse tasks, from instruction following to text generation and summarization, making LLMs adaptable to a spectrum of real-world scenarios.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT, a paradigm shift from traditional fine-tuning, minimizes computational and storage costs while optimizing LLMs for specific tasks. Strategies like low-rank matrices, soft prompts, and adapters allow for efficient adaptation across various modalities, making PEFT a valuable approach for multimodal applications. This method strikes a balance between computational efficiency, storage optimization, and task-specific adaptation.

Low-Rank Adaptation (LoRA)

LoRA is one of the most widely used PEFT methods. Instead of updating the full weight matrices, it freezes the pre-trained weights and trains a pair of small low-rank matrices whose product is added to each selected weight matrix. This significantly reduces the number of trainable parameters and the storage needed per task, making it feasible to adapt LLMs to specific tasks without compromising efficiency. LoRA’s versatility across modalities, including text, image, and audio, makes it an efficient and effective route to streamlined adaptation.
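As a rough illustration of why this is cheap, the sketch below counts trainable parameters for one weight matrix and applies the LoRA update W = W0 + (alpha / r) * B * A on tiny lists. The function names and the 4096 x 4096 layer size are assumptions chosen for the example, not measurements from our model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, b, a, alpha, rank):
    """Frozen weight w0 plus the scaled low-rank update (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

w0 = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.0], [0.0]]  # B starts at zero, so training begins from the base model
a = [[0.5, -0.5]]
print(lora_effective_weight(w0, b, a, alpha=16, rank=1))
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen base model, and only the two small matrices change during training.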

Data preparation

The initial phase of our data processing adventure revolves around parsing information from the provided zip file. Here’s a breakdown of the steps involved:

File I/O Operations: Fundamental yet crucial, the read_file and write_file methods facilitate seamless loading and saving of data in the widely used JSON format. This ensures structured storage and easy retrieval.
Text Parsing: The parse_text method emerges as the hero, tackling the diversity of data structures by consolidating text content. The parse_messages method then extracts vital information — including message ID, text content, and date — from a list of messages, ensuring we focus on relevant content while excluding non-text elements.
JSON Data Processing: The extract_fields method plays a pivotal role in isolating crucial information from each message. This step is fundamental for zeroing in on specific details like text content, date, and labels, streamlining the extraction process.
Zip File Processing: In scenarios where data is distributed across multiple files, the ability to process zip files containing multiple JSON files is invaluable. This approach not only aids in handling large datasets efficiently but also enhances overall organization.
Additional Functionality: The file_reader method offers a convenient way to extract text content from a file, especially when dealing with non-JSON files or when a simpler text extraction is needed.
Pandas Integration: Leveraging the power of Pandas for creating and manipulating DataFrames brings structure to the data. DataFrames prove to be an indispensable tool for analysis, offering a convenient way to organize, filter, and manipulate tabular data.
Initialization of Regular Expressions: To efficiently identify and manipulate specific patterns in text, regular expressions are pre-compiled. Examples include the emoji_pattern and symbols regex, aiding in extracting or removing emojis and symbols as needed.

Each step in this data preparation journey addresses specific challenges commonly encountered in text and data processing tasks. Whether it’s handling diverse data structures, efficiently managing files, or extracting meaningful information from messages, our approach is tailored to tackle the complexities of real-world textual data. The incorporation of regular expressions and Pandas contributes significantly to the effectiveness of our script, ensuring it stands ready to navigate the intricacies of language model fine-tuning.
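The parsing steps above can be sketched as follows. This is a minimal version, assuming the zip holds Telegram Desktop JSON exports in which each file has a top-level "messages" list and each message’s "text" is either a string or a list mixing strings and entity dictionaries. parse_text and parse_messages are named after the methods described above, but their bodies here are our reconstruction; read_zip is a hypothetical helper name.

```python
import json
import zipfile


def parse_text(text):
    """Flatten a message's 'text' field into one plain string."""
    if isinstance(text, str):
        return text
    parts = []
    for item in text:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, dict):
            parts.append(item.get("text", ""))
    return "".join(parts)


def parse_messages(messages):
    """Keep the id, text, and date of every message that has text content."""
    rows = []
    for msg in messages:
        text = parse_text(msg.get("text", ""))
        if text.strip():  # skip media-only or empty messages
            rows.append({"id": msg.get("id"), "text": text, "date": msg.get("date")})
    return rows


def read_zip(path_or_file):
    """Parse every JSON file in the archive and collect the message rows."""
    rows = []
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as handle:
                    data = json.load(handle)
                rows.extend(parse_messages(data.get("messages", [])))
    return rows
```

The resulting list of dictionaries can be passed directly to pandas.DataFrame(rows) for the DataFrame steps described above.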

Emoji Removal: A dedicated method takes center stage in our data cleansing process, meticulously removing emojis from the specified input file. This step ensures that the processed text remains devoid of any non-textual elements, promoting clarity and simplicity for subsequent analyses.

Amharic Text Refinement: Specialized functions come into play to handle text containing Amharic characters. Systematic procedures are implemented to eliminate unnecessary spaces, unwanted characters, and standardize specific Amharic letters. The goal is to create a more refined and consistent representation of the text, setting the stage for accurate language model training.

DataFrame Cleaning: A comprehensive method is deployed for the thorough cleaning of a Pandas DataFrame. This includes addressing null or empty values, removing newlines, hashtags, emojis, special symbols, hyperlinks, and extra spaces. The process also ensures the removal of English characters, streamlining the text data for subsequent analysis.

Overall Text Processing: Higher-level methods orchestrate the overall text processing. One method facilitates the extraction and saving of the cleaned ‘text’ column to a text file, while another systematically iterates through parsed CSV files, cleans each line, and saves the refined text to separate text files. These methods ensure a consistent and reliable text processing routine.
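A minimal sketch of the cleaning steps for a single string is shown below. The regular expressions and the homophone table are assumptions made for illustration: the emoji pattern covers only the main emoji blocks, and the table maps just a few base letters to one conventional spelling, whereas a full normalizer would cover every vowel order of each letter.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
LATIN_RE = re.compile(r"[A-Za-z]+")
# Main emoji and symbol blocks only; not an exhaustive emoji matcher.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]+")
SPACE_RE = re.compile(r"\s+")

# Illustrative subset of Amharic homophone letters mapped to one form.
HOMOPHONES = str.maketrans({
    "\u1210": "\u1200",  # ሐ -> ሀ
    "\u1280": "\u1200",  # ኀ -> ሀ
    "\u1220": "\u1230",  # ሠ -> ሰ
    "\u12d0": "\u12a0",  # ዐ -> አ
    "\u1340": "\u1338",  # ፀ -> ጸ
})


def clean_amharic(text):
    """Strip links, hashtags, emojis, and Latin text, then normalize letters."""
    text = URL_RE.sub(" ", text)      # remove links before Latin letters
    text = HASHTAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = LATIN_RE.sub(" ", text)
    text = text.translate(HOMOPHONES)
    return SPACE_RE.sub(" ", text).strip()


print(clean_amharic("\u1220\u120b\u121d  hello \U0001F600 https://t.me/x #ad"))
```

For a DataFrame, the same function can be applied to the text column with df["text"].map(clean_amharic) after dropping null or empty rows.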

In our approach to advanced language models, we’re intensifying supervised training by incorporating messages from diverse Telegram channels. This focused strategy enriches our models with varied linguistic styles and topics, cultivating adaptability.

Our curated dataset enhances understanding of language nuances, emphasizing the importance of context and user interactions within distinct online communities.

Selecting the Model

In the quest for an ideal open-source Large Language Model (LLM) capable of embedding Amharic texts, our journey began with a thorough model selection process. We evaluated the candidates systematically, using a spreadsheet to rank them against predefined criteria:

1. Amharic Embedding Proficiency: Evaluating the model’s intrinsic capability to effectively embed Amharic language constructs.

2. Fine-tuning Flexibility: Assessing the model’s adaptability to fine-tuning processes, specifically tailoring it to generate Amharic ad content for the Telegram platform.

3. Training Efficiency: Considering the efficiency of model training, with emphasis on optimizing resources and minimizing computational costs.

4. Hugging Face Integration: Ensuring seamless integration with Hugging Face APIs and platform, facilitating efficient model training and fine-tuning.

Scores were assigned based on performance against these criteria, and the model with the highest cumulative score emerged as the most suitable candidate for further fine-tuning. This data-driven approach ensured an objective and informed decision-making process.

Initial Factors Considered

Before the detailed evaluation, we considered four initial factors, which guided the selection of the key metrics:

Performance: Overall model performance, encompassing accuracy, efficiency, and effectiveness in generating Amharic text for ad content.
Training Dataset Size: The volume of data used to train the model, indicating the breadth and diversity of information the model has been exposed to during training.
Parameter Size: The number of parameters within the model, providing insights into its complexity and computational requirements.
Use Case: Relevance to the specific application, ensuring alignment with the requirements of generating Amharic ad content for the Telegram platform.

These initial factors laid the groundwork for the subsequent evaluation process, guiding the selection of key metrics that would determine the model’s suitability and effectiveness for the intended purpose.

Pre-training our Open-source LLM

In our journey to refine language models, we’re leveraging the Contemporary Amharic Corpus (CACO) version 1.1, encompassing around 24 million tokens. This dataset is available in both plain text and XML formats, providing raw linguistic expressions and tagged versions, respectively.

The plain text represents unprocessed language, while the XML format adds a layer of structure through tagging. This dual representation allows for a comprehensive exploration of language nuances and patterns.

To ensure accuracy, the web-based corpus has undergone proofreading, editing, and automatic spelling error corrections. Modifications to the HornMorpho morphological analyzer enhance the dataset by enabling automatic tagging.

After an initial phase of pretraining our model, the run reported the following key results:

Epoch: 0
Global Step: 20
Learning Rate: 0
Loss: 0.0811
Total FLOPs: 5,831,907,055,583,232
Train Loss: 0.4488
Train Runtime: 237.0266 seconds
Train Samples per Second: 0.338
Train Steps per Second: 0.084

These metrics come from a short run: 20 steps in about four minutes, which at 0.338 samples per second is roughly 80 training samples. The last logged loss (0.0811) sits well below the run average (0.4488), which suggests the model was learning from the data. A run of this length mainly confirms that the pretraining pipeline works end to end; longer runs and evaluation on held-out text are needed to judge quality.

Fine-tuning our Open-source LLM Model

Fine-tuning an open-source Large Language Model (LLM) marks a crucial phase in adapting a pre-trained model to a specific task or domain. This process involves utilizing a smaller dataset of labeled or unlabeled data to enhance the model’s performance and accuracy on targeted natural language processing (NLP) tasks, ranging from text generation to question answering and summarization.

1. Customizing Tokenization Process

The journey begins with the customization of our tokenization process, a pivotal step that sets the stage for fine-tuning. Tokenization, the process of breaking down text into smaller units (tokens), is tailored to our project’s unique requirements. A dedicated notebook becomes the workshop for this intricate task, ensuring precision and alignment with our objectives.
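One common motivation for customizing a tokenizer for Amharic can be shown with plain byte counts. Tokenizers with byte-level fallback (Llama 2’s SentencePiece tokenizer is one) split any character missing from their vocabulary into its UTF-8 bytes, and every Ethiopic character takes three bytes. The sketch below only counts bytes as a worst-case estimate; it is an assumption-based illustration and does not run or measure any real tokenizer.

```python
def utf8_byte_count(text):
    """Number of UTF-8 bytes, i.e. the worst-case token count under byte fallback."""
    return len(text.encode("utf-8"))


def worst_case_tokens_per_char(text):
    """Average bytes per non-space character in the text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


amharic = "\u1230\u120b\u121d"  # ሰላም
english = "hello"
print(utf8_byte_count(amharic), worst_case_tokens_per_char(amharic))
print(utf8_byte_count(english), worst_case_tokens_per_char(english))
```

Adding Amharic subwords to the vocabulary lets the same text fit in far fewer tokens, which shortens sequences and makes better use of the context window.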

2. Model Creation and Upload to Hugging Face

Having implemented our tokenization process, we moved on to creating a model tailored to our requirements. Through a series of supervised training sessions, we fine-tuned multiple models. Our findings showed that the Llama 2 7B model outperformed the others when it comes to understanding Amharic.

Steps we took to fine-tune our model

Model Loading: The initial step involves loading a pre-trained causal language model along with its tokenizer. A specific configuration is applied during the loading process to optimize memory usage through quantization.
Configuration Creation: Configurations for BitsAndBytes (BNB) and LoRA are created to control how the model is loaded and trained. BNB handles quantization, while LoRA limits training to a small set of low-rank adapter weights.
Data Preprocessing: Efficient data preparation is crucial for successful training. Text is tokenized, prompts are formatted, and the dataset is shuffled to ensure it is ready for the training pipeline.
Training: The training process combines several techniques. Gradient checkpointing is enabled and the quantized model is prepared for k-bit training. The LoRA configuration is then applied so that only the adapter weights are updated. A Trainer instance runs the training and finally saves the model’s state and metrics.
Prompt Formatting: To ensure coherence and consistency during training, the text and label fields of dataset samples are formatted into a specific prompt structure.
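The prompt formatting and shuffling steps above can be sketched in a few lines. The template below and the names PROMPT_TEMPLATE, format_prompt, and build_training_texts are hypothetical; the post does not publish the exact prompt structure, so treat this as one plausible layout for samples with text and label fields.

```python
import random

# Hypothetical template; the actual prompt structure may differ.
PROMPT_TEMPLATE = "### Text:\n{text}\n\n### Label:\n{label}"


def format_prompt(sample):
    """Combine the 'text' and 'label' fields of one sample into a prompt."""
    return PROMPT_TEMPLATE.format(
        text=sample["text"].strip(),
        label=sample["label"].strip(),
    )


def build_training_texts(samples, seed=42):
    """Format every sample, then shuffle with a fixed seed for repeatability."""
    texts = [format_prompt(sample) for sample in samples]
    random.Random(seed).shuffle(texts)
    return texts


samples = [
    {"text": " \u1261\u1293 ", "label": "ad"},        # ቡና
    {"text": "\u1230\u120b\u121d", "label": "other"},  # ሰላም
]
for prompt in build_training_texts(samples):
    print(prompt)
    print("---")
```

Each formatted string would then be tokenized before being handed to the training pipeline.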

The fine-tuning run on top of our pre-trained model reported the following key performance indicators:

Epoch: 0
Global Step: 20
Learning Rate: 0
Loss: 0.2487
Total FLOPs: 6,846,633,836,150,784
Train Loss: 0.2436
Train Runtime: 271.8601 seconds
Train Samples per Second: 0.294
Train Steps per Second: 0.074

These metrics summarize the fine-tuning run: again 20 steps, about four and a half minutes, and roughly 80 training samples. The training loss stayed low (0.2436 on average, 0.2487 at the last logged step). As with pretraining, a run of this length mainly shows that the approach works end to end. As we continue to refine and optimize, longer runs and evaluation on held-out Amharic text will guide our efforts towards higher standards in natural language processing.

The model is then seamlessly uploaded to Hugging Face. This move ensures easy access and availability of our model resources on the Hugging Face platform.

3. RAG (retrieval-augmented generation)

In our quest for enhanced language understanding, we’ve crafted a language processing pipeline. Let’s delve into the key components of our framework:

Data Loading and Preprocessing: We use a custom text loader to load and segment our data into manageable chunks. This process, known as chunking, allows us to efficiently process large documents. The preprocessing stage also involves cleaning the data and removing any irrelevant information.
Vector Database Setup: We use the Weaviate vector database to store processed text chunks. This forms the basis for our semantic search capabilities. Each chunk of text is converted into a vector representation that captures its semantic meaning. These vectors are then stored in the database for quick and efficient retrieval.
Model Integration: We incorporate cutting-edge language models, such as OpenAI’s GPT-3.5 Turbo, for various tasks including chat interactions and text generation. These models are trained on a vast corpus of text and can generate human-like text based on the input they receive.
Hugging Face Embeddings: We enhance our language understanding by integrating Hugging Face embeddings. These embeddings are generated by powerful pre-trained models like BERT, GPT-2, and RoBERTa. They capture the contextual meaning of words and phrases, allowing us to understand the nuances of language.
RAG Pipeline Construction: Our pipeline follows the retrieval-augmented generation (RAG) architecture. This involves two steps: first, a retriever selects relevant documents based on the query; then, a generator produces a response grounded in these documents. This allows us to generate contextually relevant responses.
Model Fine-Tuning: We optimize our language models by fine-tuning them using techniques like quantization and custom configurations. Quantization reduces the size of the model without significantly affecting its performance, while custom configurations allow us to adjust the model’s parameters to suit our specific needs.
Agent Initialization: Our system is designed to initialize agents for specific tasks. These agents use predefined system messages and tools to interact with the user and perform their tasks.
Dynamic Configuration: We dynamically configure our language models based on factors such as temperature, which influences the randomness of the generated responses. A higher temperature results in more random responses, while a lower temperature makes the responses more deterministic.
Efficient Model Loading: Our model loading mechanism efficiently allocates models across available resources, including GPUs. This allows us to make the most of our computational resources and ensures that our models are loaded and ready to use as quickly as possible.
Continuous Monitoring: We continuously monitor our language models to ensure they adapt and perform optimally over time. This involves tracking their performance metrics and making necessary adjustments to their configurations.
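The effect of temperature described under Dynamic Configuration can be seen in a small softmax example. This is a generic sketch of how temperature rescales next-token probabilities; the function name and the logits are made up for illustration and are not taken from our models.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make sampling more varied.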

In essence, our Retrieval Augmented Generation (RAG) system offers a tailored context generation process. This, coupled with the precise retrieval of specific sections from our extensive data repository, ensures a comprehensive and accurate language processing output.
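To show how chunking, retrieval, and prompt construction fit together, here is a self-contained toy version of the flow. It deliberately swaps in a bag-of-words vector and an in-memory list where our pipeline uses Hugging Face embeddings and the Weaviate vector database, and it stops at building the prompt rather than calling a language model. All function names and the sample sentences are assumptions for illustration.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping word windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text):
    """Toy embedding: word counts. A real pipeline would use a neural encoder."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Place the retrieved chunks ahead of the question for the generator."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


store = ["ቡና ዋጋ ቅናሽ", "ስልክ ሽያጭ", "ቡና ቤት"]  # stand-in for the vector database
question = "ቡና ዋጋ"
print(build_prompt(question, retrieve(question, store, top_k=1)))
```

In the full system, the prompt built in the last step is what the generator model receives, so the quality of retrieval directly shapes the quality of the generated ad text.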

Monitoring our training

To monitor and evaluate the ongoing progress of our model training, we set up a Weights & Biases (WandB) account. WandB is a machine learning development platform that tracks and visualizes key metrics in real time during training. Integrating it into our workflow gives us insight into the model’s performance and supports informed decisions and optimizations throughout the training phase.

Applications

In the context of machine learning, WandB is primarily used to:

Track model performance metrics such as accuracy, loss, and other evaluation metrics during the training and evaluation phases.
Visualize the model’s learning process using graphs, charts, and histograms to gain insights into how the model is performing.
Compare different models and their performance metrics to help choose the best-performing one.
Collaborate with others by sharing experiments and results.

WandB is a useful tool for machine learning engineers, data scientists, and researchers who want to optimize their machine learning models and make informed decisions during development. With WandB, users can easily keep track of multiple experiments, compare results, and identify the best-performing model for a particular task.

Results and conclusion

We have developed a language model, fine-tuned from the Llama 2 7B model, to enable quality embedding and text generation for the Amharic language. Leveraging the Contemporary Amharic Corpus (CACO) version 1.1 for pre-training, our team implemented a Retrieval-Augmented Generation (RAG) pipeline tailored for Ethiopian businesses on the Telegram platform.

The project involved a strategic approach, emphasizing clear vision definition, data-driven model selection, and techniques like Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA). Robust data processing methods, including Amharic text refinement and thorough data cleaning, contributed to the success of the fine-tuning process.

Key performance indicators from our short training runs supported our choice of the Llama 2 7B model, showing low training loss values. The RAG pipeline showcased our commitment to enhancing language understanding, generating contextually relevant responses.

Continuous monitoring using the Weights and Biases platform ensured optimal model adaptation over time. As we transition to deployment, the focus remains on computational efficiency and user experience. The project highlights our pioneering role in harnessing language models for diverse linguistic contexts, with the developed model poised to revolutionize advertising strategies for Ethiopian businesses on Telegram.

Future Work

To enhance our language model, we suggest expanding the training dataset for a more nuanced understanding of linguistic subtleties and industry-specific contexts. Experimenting with advanced pretraining methods, including diverse masked language modeling and contrastive learning, is recommended.

Prioritize efficient fine-tuning tailored to the embedding and generation models used in the Retrieval-Augmented Generation (RAG) pipeline. Refine data processing techniques and collaborate with linguistic experts and native speakers for cultural sensitivity.

Establish continuous feedback loops with end-users to adapt the model to evolving linguistic trends and user preferences. This approach ensures the model remains relevant, versatile, and effective over time.

In summary, future work should focus on expanding the training dataset, experimenting with advanced pretraining, and emphasizing efficient fine-tuning for RAG models. Collaborative efforts and user feedback will be key to achieving a culturally aware and contextually attuned language model for the Amharic-speaking community.