Training Models, Leveraging General Models, and Finetuning


This article is a product of my own research and synthesis of knowledge about various tools for AI bot development, sourced from online articles and Git repositories. It serves as a personal reference to memorize and structure my own knowledge on the topic.

Read more about RAG in my article here:

Building an AI application starts with selecting a model. You can either develop a model from scratch, utilize existing ones, or fine-tune existing models.

To better understand this, visualize a neural network graph. Imagine a small neural network with vertically aligned nodes (circles): three green nodes on the left (the input layer), three columns of four blue nodes in the middle (the hidden layers), and two yellow nodes on the right (the output layer). Each vertical column of nodes is called a layer. Modern LLMs, unlike this simple example with 5 layers, have vastly more nodes, connections, and layers. GPT-4 is rumored to have ~1.8 trillion parameters across 120 layers, and GPT-5 might have 10 times the parameters of GPT-4.

Graph of a neural network with 5 layers: one input layer on the left, three hidden layers in the middle, and one output layer on the right.
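
To make the structure concrete, here is a minimal PyTorch sketch of the small network described above; the layer sizes are just the illustrative values from the figure, not a real model:

```python
import torch
import torch.nn as nn

# The small network from the figure: 3 input nodes, three hidden
# layers of 4 nodes each, and 2 output nodes (5 layers in total).
model = nn.Sequential(
    nn.Linear(3, 4),  # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(4, 4),  # -> second hidden layer
    nn.ReLU(),
    nn.Linear(4, 4),  # -> third hidden layer
    nn.ReLU(),
    nn.Linear(4, 2),  # -> output layer
)

x = torch.randn(1, 3)  # one input sample with 3 features
print(model(x))        # 2 output values
print(sum(p.numel() for p in model.parameters()))  # total parameter count
```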

Machine learning (ML) models and LLMs operate through a series of layers and connections. Values in the input layer influence the next layer, and so on, until the final layer produces the output. The hidden layers, where much of the “magic” happens, are the most complex and least understood part of neural networks.

While we can comprehend small-scale neural networks, the billions of parameters in large models are beyond our grasp. This complexity creates a “wow” moment when, for instance, an image of a letter broken into smaller blocks is processed by the model to accurately identify the letter. We know the techniques and tools to tune and manipulate these hidden layers, testing and refining the model to improve results.

Even though we can trace the activated nodes and connections, it doesn’t make sense to a human why a given input produces a particular fingerprint of activated nodes and connections that ultimately returns, for instance, “a cat”. We can see the activated nodes, connections, and their values, but it’s unclear why that pattern equals “a cat”. Yet we know it works. Our current methods and tools allow us to achieve impressive results, even if the underlying processes remain somewhat mysterious.

You can develop smaller, specialized models for specific tasks while using general models as wrappers to handle various functions. Fine-tuning existing models is often much cheaper than training new ones from scratch. Pre-trained large models provide a much better starting point for almost any Machine Learning (ML) problem compared to training a new model from randomly initialized weights. This approach is known as transfer learning.
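
As a minimal sketch of transfer learning (assuming the torchvision library and a ResNet-18 backbone purely as an example): freeze the pre-trained weights and train only a new task-specific head.

```python
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet (example backbone).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Freeze all pre-trained weights.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for a 3-class task;
# only this layer's weights will be updated during training.
backbone.fc = nn.Linear(backbone.fc.in_features, 3)
```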

There are different ways to fine-tune, depending on whether you want to adjust the embeddings model or the chat LLM. For smaller models like embeddings, the process is inexpensive. However, fine-tuning large models with billions of parameters requires specialized hardware, which can be costly.

To reduce costs, several fine-tuning techniques have been developed. One approach involves modifying only a subset of unfrozen nodes and connections while keeping the frozen ones intact. A more cost-efficient method is Low-Rank Adaptation (LoRA), which trains a small set of additional low-rank weight matrices on top of the frozen network instead of modifying billions of parameters.
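
Conceptually, LoRA keeps the original weight matrix W frozen and learns a low-rank update B·A on top of it. Here is a minimal sketch with illustrative dimensions and rank, not any particular library’s API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights W
        # Low-rank factors: A is (rank x in), B is (out x rank).
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen base output + scaled low-rank correction x(BA)^T.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8,192 trainable values vs. 262,656 in the frozen base layer
```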

How to Finetune

To fine-tune with LoRA on Windows/Linux x86 with an NVIDIA GPU, H2O LLM Studio is an excellent choice for beginners due to its ease of use and GUI, allowing fine-tuning without coding. For macOS/ARM64, use command-line tools like mlx-lm; fortunately, it’s relatively simple. The most challenging aspects of fine-tuning are:

  1. Finding or creating a high-quality, sufficiently large dataset, in the format expected by your base model, to teach and “convince” it with
  2. Having the necessary hardware for processing.

The steps to fine-tune include:

  • Preparing the dataset (see the sketch after this list)
  • Installing the fine-tuning framework, such as H2O LLM Studio or mlx-lm, and its dependencies
  • Creating a fine-tuning configuration
  • Running the fine-tuning process to generate the trained adapter weights containing your changes
  • Optionally fusing the adapter’s changes with the base model into a single model
  • Converting the result to the widely supported GGUF format, and possibly quantizing it to reduce the model’s precision
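
To illustrate the dataset-preparation step, here is a small Python sketch that writes a train/validation split as JSONL files in a chat-style record layout. The file names and exact schema are assumptions to check against your framework’s documentation (mlx-lm, for example, reads train.jsonl and valid.jsonl from a data directory and accepts several record layouts):

```python
import json

# Hypothetical domain Q&A pairs you want to teach the model.
examples = [
    {"question": "What does LoRA stand for?",
     "answer": "Low-Rank Adaptation."},
    {"question": "Why quantize a model?",
     "answer": "To reduce its memory footprint."},
]

def to_chat(sample):
    # Chat-style record; adjust roles/fields to your framework's schema.
    return {"messages": [
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": sample["answer"]},
    ]}

split = max(1, int(len(examples) * 0.8))  # tiny toy split
for name, rows in [("train.jsonl", examples[:split]),
                   ("valid.jsonl", examples[split:])]:
    with open(name, "w") as f:
        for sample in rows:
            f.write(json.dumps(to_chat(sample)) + "\n")
```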

LoRA: Application, Challenges, and Solutions

LoRA is designed to balance quality, simplicity, speed, and cost. Training a new model from scratch, especially those with hundreds of billions of parameters, is exceedingly expensive, often costing millions solely for electricity. LoRA offers a more affordable alternative.

While full training delivers the highest-quality results, LoRA fine-tuning requires significantly less data and resources while still yielding good outcomes. However, this comes at the expense of the model’s world knowledge and ground truth. The training costs for GPT-4 were reportedly around $63 million, taking into account the computational power required and the training time.

The Challenge

Fine-tuning with LoRA can be compared to using a semi-transparent, unevenly shaped lens. As you fine-tune the model, it becomes more proficient in your specific area of expertise, but this also dims its world knowledge and amplifies the problem of “catastrophic forgetting”. LoRA fine-tuning, while useful, will impair the ground truth and knowledge of the base model, and in some cases fine-tuning might even make the model perform worse than without it.

The Solution: Dynamic routing to MoE

To address the challenges of LoRA finetuning, you can generate an adapter file containing your expert knowledge. If testing shows that the adapter improves performance, instead of merging it with the base model to produce a new fine-tuned model, you can employ techniques that dynamically route to this adapter only when the user prompt requires specific expertise. Otherwise, the system can route to the world knowledge of the base model.
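
Here is a minimal sketch of the routing idea, with a crude keyword classifier standing in for the router; the domain, keywords, and callables are all hypothetical, and real systems such as X-LoRA learn the routing rather than hard-coding it:

```python
# Sketch: route a prompt either to a domain LoRA adapter or to the
# plain base model, depending on whether expertise is needed.
DOMAIN_KEYWORDS = {"contract", "liability", "clause"}  # hypothetical legal domain

def needs_expert(prompt: str) -> bool:
    return bool(set(prompt.lower().split()) & DOMAIN_KEYWORDS)

def route(prompt: str, base_generate, expert_generate):
    # base_generate / expert_generate: callables wrapping your inference
    # stack, e.g. the base model vs. base model + LoRA adapter.
    if needs_expert(prompt):
        return expert_generate(prompt)   # in-domain: use the adapter
    return base_generate(prompt)         # otherwise: base world knowledge

# Stub usage with placeholder callables:
print(route("Summarize this contract clause",
            base_generate=lambda p: "[base] " + p,
            expert_generate=lambda p: "[legal adapter] " + p))
```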

Various methods have been developed to optimize this dynamic routing, each with different names, approaches, and levels of effectiveness, all dancing around the same core idea. They also extend the concept to route between multiple LoRA adapters with a Mixture of Experts (MoE). Here are some notable techniques and keywords:

LoRA-based experts, LoRA-MoE methods, Switch-LoRA, X-LoRA, MixLoRA, LoRA adapters with a massive Mixture of Memory Experts (MoME), Mixture of LoRA Experts, LoRAMoE, LoRA MoE plugin, LoRAs as experts, MoCLE, MoELoRA, MOLA, MoRAL, MoA, AdapterFusion, LoRAHub, m-LoRA / mLoRA framework.

For example, you can build an X-LoRA model and run it with the mistral.rs inference engine. A similar but distinct approach is merging multiple models into a single model with mergekit.

Auto Fine-tune

Some services allow you to auto-fine-tune models with provided documents, such as jina.ai/fine-tuning for embeddings models. Don’t expect magic, though: they can improve quality, but not dramatically.

MLOps, LLMOps, and GenOps practices

While you certainly can train and fine-tune LLMs manually, there is a set of automation practices, such as MLOps, LLMOps, and GenOps, that provide guidelines and blueprints for automating and setting up workflows for continuous improvement. Based on these practices, you can find various LLMOps tools, MLOps tools, and implementations. These practices focus not only on LLM model quality but also cover RAG/agents. Dagworks-inc/hamilton is not an MLOps tool per se, but a Directed Acyclic Graph (DAG) library that can be used for RAG workflows.

LLMOps, an extension of MLOps tailored specifically to LLMs, encompasses Continuous Integration (CI), Continuous Evaluation (CE), and Continuous Deployment (CD). It focuses on deploying, monitoring, and maintaining LLMs, addressing challenges such as high computational demands, privacy concerns, and the handling of extensive text data. LLMOps further improves this approach by incorporating updates through fine-tuning and reinforcement learning from human feedback (RLHF). Examples: Zenml, Pezzo, Langcorn, and Allegroai/Clearml.

Generative AI Operations (GenOps) extends LLMOps to cover the lifecycle management of generative models across all media types, including text, image, audio, and video, and supports multi-modal systems. GenOps promotes operational efficiency, reduces complexity, and ensures a standardized process for managing generative models, streamlining their deployment and maintenance. Platforms like KubeFlow and MLFlow provide a general approach for managing MLOps, emphasizing automation, scalability, and reproducibility throughout the model lifecycle but can be adapted for GenOps.

Directions for Model Development & Improvement

The development and improvement of LLMs can be classified into two main categories:

  1. Scale Up: Increasing the number of parameters, expanding storage, investing in more powerful hardware, and spending more money.
  2. Enhance Quality: Utilizing the same resources more effectively by implementing better algorithms, processing training data more intelligently, and using specialized, cost-efficient hardware for better performance.

Ultimately, these approaches will converge to optimize model development and results.

Large Models: Unification

Historically, AI models have been developed with a narrow focus, targeting specific types of data such as text, images, audio, or video. For instance, earlier text models were typically limited to processing English, and image models could often only describe images in text or generate images from text, but not both. However, there is now a significant trend towards unification, merging these capabilities to create more versatile and universal models.

Many contemporary models are multi-modal, capable of simultaneously working with multiple languages, text and multimedia, encoding and decoding. This trend enhances the versatility and utility of AI models, enabling them to handle a broader range of tasks. Examples:

Advanced translation models now go beyond text-to-text translations; they can translate text from images and audio and can output translations in text or audio formats, handling multiple languages within a single prompt. Coding models have also evolved to generate and convert between various programming languages (encoding) and to decode existing code by explaining and summarizing it. Specialized embedding models are capable of working with multiple languages. Document processing models can identify the layout of documents, separating images and text. Vision models can decode and encode data between text, images, and videos and produce Image-to-Image, Image-to-Video, and even 3D.

Large Models: Domain-Knowledge

Parallel to the unification trend, there is a growing trend in domain-knowledge models. Though this might seem contradictory, these trends coexist and complement each other.

For example, models with domain knowledge in biology, diseases, healthcare, law, finance, tax, banking, programming languages, and mathematics.

Smaller Models: Narrow Specialization

While large models tend to acquire general and domain knowledge, there’s another trend: models that solve very narrow, specific tasks effectively and efficiently while staying relatively small. Examples:

  • Reranking enhances the ordering of search results or recommendations.
  • Text-to-text question answering verifies whether the context contains an answer to a given question.
  • Language detection identifies the language of a given text.
  • Table question answering addresses complex questions based on tabular data.
  • Topic classification categorizes text into predefined topics.
  • Punctuation restoration reinstates punctuation in text.
  • Part-of-speech tagging labels words in a text with their corresponding parts of speech.
  • Sentiment analysis evaluates the sentiment of a text, for instance financial news.
  • Gibberish detection identifies nonsensical or irrelevant input.
  • Question-answer pairing matches questions with their correct answers.
  • Zero-shot text classification tags text based on its content without needing labeled training data.
  • Named entity recognition (NER) identifies and categorizes named entities within a text, and relation extraction (RE) determines the relationships between those entities.

Dual-Finetuning or Co-Finetuning

Imagine you are using more than one model in your application, for instance a re-ranker, an embeddings model, and a general model. Instead of fine-tuning each model individually, you can create interconnected datasets for each of the models, producing an alignment between them.

Function Call

A crucial feature that new models are starting to acquire is “Function Calling.” This capability forces the model to respond in a predictable and strictly formatted manner, following predefined rules. For instance, if asked to produce a valid JSON format, the model will comply. This feature enhances the integration of LLMs with existing applications, allowing them to act as substitutes for human-like decision logic to automate and call different application functions.
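
As a sketch of the pattern (the tool schema below follows the widely used OpenAI-style `tools` layout; the function name and fields are hypothetical, and your provider’s exact API may differ):

```python
import json

# An OpenAI-style function/tool definition (illustrative; check your
# provider's docs for the exact schema it expects).
tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice_status",  # hypothetical app function
        "description": "Look up an invoice's payment status.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

# Instead of free text, the model replies with a structured call, e.g.:
model_reply = '{"name": "get_invoice_status", "arguments": {"invoice_id": "INV-42"}}'

call = json.loads(model_reply)  # guaranteed-parseable JSON
if call["name"] == "get_invoice_status":
    print("App would call:", call["name"], call["arguments"])
```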

Model Formats

Here are some key formats to be familiar with:

PyTorch

  • File Extensions: Typically uses .pt or .pth extensions, and sometimes .bin with PyTorch in the model’s name.
  • Serialization: Often uses pickle-based serialization.
  • Primarily developed by Facebook’s AI Research Lab (FAIR).
  • Usage: Popular on AI model hubs like Hugging Face. Hugging Face can convert PyTorch models into the Safetensors format, providing both options. When downloading a model, ensure you obtain all files in the repository, not just the PyTorch files.

TensorFlow by Google.

  • TF2 SavedModel is the recommended format for sharing TensorFlow models. It utilizes protobuf .pb (Protocol Buffer) files.
  • The TFLite format (.tflite) is used for on-device inference.
  • The TF.js format is used for in-browser machine learning.
  • TensorFlow models can be converted to PyTorch format via an intermediary format like ONNX.

Safetensors

  • Description: Developed by Hugging Face, this format safely stores tensors, preventing the execution of malicious code hidden in files, in contrast to PyTorch’s pickle format.
  • File Extension: .safetensors
  • Usage: Popular on AI model hubs like Hugging Face. When downloading a model, ensure you obtain all files in the repository, not just the .safetensors files.
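
A quick sketch of the difference in practice, assuming the `safetensors` Python package: loading reads raw tensor data only, so no hidden code can execute, unlike pickle-based .pt files.

```python
import torch
from safetensors.torch import load_file, save_file

# Save and reload tensors; only raw tensor data is (de)serialized.
save_file({"weight": torch.zeros(2, 2)}, "demo.safetensors")
tensors = load_file("demo.safetensors")
print(tensors["weight"].shape)  # torch.Size([2, 2])
```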

GGUF (GPT-Generated Unified Format)

  • Description: A popular binary format designed for fast loading and saving of models, making it highly efficient for inference. Unlike most other formats, a GGUF file contains all the information and metadata needed to run the model.
  • Conversion: Models developed in frameworks like PyTorch, or stored as Safetensors, can be converted to GGUF using the conversion and quantize tools from llama.cpp, or via Ollama.
  • File Extension: .gguf, or .bin with “gguf” in the model’s name
  • Usage: GGUF models may consist of one or multiple files. For example, a model might be split into multiple GGUF files if it is too large (e.g., Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf and Meta-Llama-3-70B-Instruct.Q6_K-00002-of-00002.gguf). If running inference on multiple GGUF files, save them in the same folder and point your inference at the first GGUF file; see the inference sketch after this list.
  • Legacy Format: GGML is the older version of GGUF.
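
As an example, a minimal inference sketch using the llama-cpp-python bindings (the model path is illustrative; for a split model, point at the first shard as described above and the rest are picked up from the same folder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path is illustrative; use the first shard of a split GGUF model.
llm = Llama(model_path="Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf")

out = llm("What is LoRA?", max_tokens=64)
print(out["choices"][0]["text"])
```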

Apple CoreML

  • Description: A framework and model package format optimized for on-device performance on Apple Silicon (ARM64), utilizing the CPU, GPU, and notably the Apple Neural Engine (ANE), while minimizing memory footprint and power consumption. Other inference engines are currently capable of utilizing only the CPU and GPU.
  • File extension: .mlpackage, or .bin with CoreML in the file name
  • Usage: With the CoreML framework and Xcode. You can convert TensorFlow v1, TensorFlow v2, and PyTorch models to the CoreML package format with CoreML Tools.

Quantization

Quantization is the process of reducing a model’s precision to decrease its memory footprint while maintaining acceptable performance levels.

Here’s how it works and its implications:

Quantization Levels

  • Full Precision (32-bit): The highest quality, using 32 bits for each parameter.
  • Half Precision (16-bit): Uses 16 bits per parameter. With llama.cpp and Ollama, this is considered the “standard” for running a model on a personal computer.
  • Lower Precision (Q2, Q4, Q6, Q8): These formats use even fewer bits, with Q2 being the smallest option, meaning the biggest precision loss and the lowest quality.

A GGUF file can be quantized to various precision levels. For example, Meta-Llama-3-70B-Instruct.Q6_K-00002-of-00002.gguf is a model quantized to 6 bits.

Figure: naïve quality reduction, from 32-bit full precision to 16-bit half precision down to minimal 2-bit.

Choosing the Right Quantization

  • K-Means: This technique further optimizes the model by rounding non-critical values, reducing the memory footprint without significantly impacting quality. For instance, Q6_K combines 6-bit quantization with K-Means precision reduction.
  • Memory Constraints: If a full-precision (32-bit) model is too large for your computer’s memory, you can opt for a lower precision such as naïve Q6. If Q6 is still too large, Q6_K can be a more suitable alternative, providing nearly the same quality with reduced memory requirements.
Figure by Ingrid Stevens: naïve even quality reduction vs. K-Means selective quality reduction.

Read a very good and detailed explanation of quantization in this article.
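
As a rough back-of-the-envelope check (pure arithmetic, ignoring per-format metadata and overhead), you can estimate a model’s file size from its parameter count and bits per weight:

```python
# Approximate file sizes for a 70B-parameter model at different
# quantization levels: bits per weight / 8 = bytes per weight.
PARAMS = 70e9

for label, bits in [("F32 full precision", 32),
                    ("F16 half precision", 16),
                    ("Q8", 8), ("Q6", 6), ("Q4", 4), ("Q2", 2)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{label:>20}: ~{gb:.0f} GB")

# Prints roughly: F32 ~280 GB, F16 ~140 GB, Q6 ~53 GB, Q2 ~18 GB.
# K-variants (e.g. Q6_K) mix precisions, so real sizes differ slightly.
```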

Annotation

When you decide to go with fine-tuning or training a new model, it’s important that your dataset be as accurate as possible; annotation tools such as HumanSignal/label-studio can help enrich the data and filter out samples that are not optimal for your purposes.

The Loss Function

When fine-tuning a model, the first step is to evaluate its performance. The loss function, or error function, measures how well your model predicts the expected outcome. It quantifies the difference between the model’s predicted outputs and the actual results.

Loss functions are broadly categorized into two types: classification and regression.
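
A brief PyTorch illustration of the two categories, using toy values:

```python
import torch
import torch.nn as nn

# Classification: cross-entropy compares predicted class scores
# against the index of the correct class.
logits = torch.tensor([[2.0, 0.5, 0.1]])  # raw scores for 3 classes
target_class = torch.tensor([0])           # the true class
print(nn.CrossEntropyLoss()(logits, target_class))

# Regression: mean squared error compares predicted numeric values
# against the expected ones.
predicted = torch.tensor([2.5, 0.0])
expected = torch.tensor([3.0, -0.5])
print(nn.MSELoss()(predicted, expected))
```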

Enjoyed This Story?

If you like this topic and you want to support me:

  1. Clap 👏 my article 10 times; that will help me out
  2. Follow me on Medium to get my latest articles 🫶
  3. Share this article on social media ➡️🌐
  4. Give me feedback in the comments 💬 below. It’ll help me to better understand that this work was useful; even a simple “thanks” will do. Give me good, give me bad, whatever you think, as long as you tell me where to improve and how.
