LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

How to fine-tune LLMs on custom datasets at scale using Qwak and CometML

How to fine-tune a Mistral7b-Instruct model, leveraging best MLOps practices using Qwak for cloud deployments at scale and CometML for experiment management.

Razvant Alexandru
Decoding ML

--

→ the 7th out of 11 lessons of the LLM Twin free course

What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality, and voice into an LLM.

Why is this course different?

By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

Why should you care? 🫵

→ No more isolated scripts or Notebooks!
Learn production ML by building and deploying an end-to-end production-grade LLM system.

What will you learn to build by the end of this course?

You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment.

You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.

The end goal? Build and deploy your own LLM twin.

What is an LLM Twin? It is an AI character that learns to write like somebody by incorporating their style and personality into an LLM.

The architecture of the LLM twin is split into 4 Python microservices:

  1. the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
  2. the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
  3. the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML’s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet’s model registry. (deployed on Qwak)
  4. the inference pipeline: load and quantize the fine-tuned LLM from Comet’s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet’s prompt monitoring dashboard (deployed on Qwak)
LLM twin system architecture [Image by the Author]

Along with the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML (experiment tracker and model registry), Qdrant (vector DB), and Qwak (ML infrastructure and deployment).

Who is this for?

Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using sound LLMOps principles.
Level: intermediate
Prerequisites: basic knowledge of Python, ML, and the cloud

How will you learn?

The course contains 11 hands-on written lessons and the open-source code you can access on GitHub.

You can read everything at your own pace.
→ To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.

Costs?

The articles and code are completely free. They will always remain free.

But if you plan to run the code while reading it, you must know that we use several cloud tools that might generate additional costs.

The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing, and we did our best to keep costs to a minimum.

For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.

Meet your teachers!

The course is created under the Decoding ML umbrella by:

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Lesson 7: How to fine-tune an optimized Mistral7b-Instruct LLM using Qwak and CometML

This lesson will focus on engineering and deploying the fine-tuning pipeline for our LLM Twin model.

Before doing that, let’s walk through a short recap to understand how we got to this fine-tuning stage:

→ In Lesson 2, we described the data ingestion process, where we scrape articles from Medium, posts from LinkedIn, and code snippets from GitHub, and store them in our MongoDB database.

→ In Lesson 3, we showcased how to listen to the MongoDB Oplog via the CDC pattern and use RabbitMQ to stream the captured events; this is our ingestion pipeline.

→ In Lesson 6, we showcased how to use filtered data samples from our Qdrant [12] vector DB. Using knowledge distillation, we prompt GPT-3.5 Turbo to structure and generate the fine-tuning dataset, which is versioned with CometML.

In Lesson 7, we will build the fine-tuning pipeline using the versioned datasets we’ve logged on CometML, compose the workflow, and deploy the pipeline on Qwak [2] to train our model.

Further, apart from covering the model selection, PEFT and QLoRA configs, LLM special tokens, and the overall model training process, we’ll review the bits and pieces of how Qwak works and showcase the CometML experiment tracking and model versioning logic.

Completing this lesson, you’ll gain a solid understanding of the following:

  • what Qwak AI is and how it helps solve MLOps challenges
  • how to fine-tune a Mistral7b-Instruct model on our custom llm-twin dataset
  • what PEFT (parameter-efficient fine-tuning) is
  • what purpose QLoRA adapters and BitsAndBytes configs serve
  • how to fetch versioned datasets from CometML
  • how to log training metrics and the model to CometML
  • what model-specific special tokens are and why they matter
  • how the Qwak build system works, step by step

Without further ado, let’s dive into the topics and cover them individually.

LLM Twin Fine-tuning workflow. Image by author.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Table of Contents

  1. What is LLM Finetuning
    a. PEFT — parameter-efficient-fine-tuning
    b. QLoRA — Quantized Low-Rank Adaptation
    c. BitsAndBytes
  2. Qwak AI Platform
    a. How it targets MLOps
    b. Cost System
    c. Prerequisites
    d. The Build Lifecycle
  3. Mistral7b-Instruct LLM Model
    a. ModelCard
    b. Hugging Face Setup
    c. Tokenizer Special Tokens
  4. The Finetuning Pipeline
    a. System Design
    b. Implementation
    c. Deployment on Qwak
  5. Experiment Tracking with CometML
  6. Ending Notes and Conclusion

What is LLM Finetuning

Fine-tuning is the process of taking a pre-trained model and further training it on a smaller, task-specific dataset to refine its capabilities and improve its performance in a particular task or domain. Fine-tuning [5] is about taking general-purpose models and turning them into specialized ones.

Foundation models know a lot about a lot, but for production, we need models that know a lot about a little.

In our LLM-Twin use case, we aim to fine-tune our model away from its general knowledge corpus and towards a targeted context that reflects your writing persona.

PEFT — parameter-efficient-fine-tuning

A family of techniques designed to adapt large pre-trained models to new tasks with minimal computational overhead and memory usage. Instead of updating all of the pre-trained model’s parameters, PEFT freezes most of them and fine-tunes only a small subset (or a small set of added parameters) on a smaller dataset, saving computational resources and time compared to full fine-tuning.

🔗 Find more about PEFT [6].

QLoRA — Quantized Low-Rank Adaptation

A PEFT technique that combines LoRA (Low-Rank Adaptation) with quantization: the pre-trained model weights are frozen and quantized to low precision, while trainable rank-decomposition matrices are injected into each layer of the transformer architecture. These low-rank adapters capture task-specific information without altering the core model weights, greatly diminishing the number of trainable parameters for downstream tasks.
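To get a feel for the parameter savings, here is a back-of-envelope sketch of a rank-64 update on a single 4096x4096 linear layer (the shapes are illustrative, not the exact Mistral dimensions):

import torch

# LoRA keeps the pre-trained weight W frozen and trains two small matrices
# A (r x d_in) and B (d_out x r), so the effective weight becomes W + B @ A.
d_in, d_out, r = 4096, 4096, 64
W = torch.randn(d_out, d_in)                   # frozen pre-trained weight
A = torch.randn(r, d_in, requires_grad=True)   # trainable
B = torch.zeros(d_out, r, requires_grad=True)  # trainable, starts at zero

full_params = d_out * d_in        # ~16.8M parameters if we fine-tuned W directly
lora_params = r * (d_in + d_out)  # ~0.52M parameters with rank 64 (~3% of the original)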

🔗 Find more about QLoRA [7].

BitsAndBytes

A library designed to optimize the memory usage and computational efficiency of large models by employing low-precision arithmetic. Underneath, it uses custom CUDA kernel implementations that allow for lower-precision operations within Transformer-based models.

While PEFT and LoRA focus on reducing the number of trainable parameters, BitsAndBytes configs help reduce the precision of these parameters, leading to even greater resource savings.
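As a rough sense of scale (weights only, ignoring activations and optimizer state), the back-of-envelope memory math for a 7B-parameter model looks like this:

params = 7e9
fp16_gb = params * 2 / 1e9    # ~14 GB of weights at 16-bit precision
int4_gb = params * 0.5 / 1e9  # ~3.5 GB of weights at 4-bit precision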

🔗 Find more about BitsAndBytes [8].

Qwak AI Platform

An ML engineering platform that simplifies the process of building, deploying, and monitoring machine learning models, bridging the gap between data scientists and engineers. For more details, see Qwak [2].

Image from Qwak [2]

Key points within the ML Lifecycle that Qwak [2] solves:

  • Deploying and iterating on your models faster
  • Testing, serializing, and packaging your models using a flexible build mechanism
  • Deploying models as REST endpoints or streaming applications
  • Gradually deploying and A/B testing your models in production
  • Build and Deployment versioning
  • Selective GPU Instance Pooling and Scheduling

Qwak Cost System

Qwak provides both CPU and GPU-powered instances based on a QPU quota. QPU [4] stands for Qwak Processing Unit, and it helps users manage their platform quota. A QPU [4] is the equivalent of 4 CPUs with 16 GB RAM.

ℹ️ See pricing page here: Qwak Pricing

The freemium version allows for 100 QPU/month, which is enough to cover the LLM Twin course requirements for fine-tuning.

Qwak GPU instances. QPU [4]

Prerequisites

To access the platform, head over to Qwak [2] and create an account using the Start Free button. Next, you’ll need an API key to be able to work with the CLI tool.

Once logged in, on the left bar, head over to Settings, then under Personal Settings select Personal API keys, generate a new key, and copy it to the clipboard.

Image by author.

Next, install the qwak-sdk to interact with the platform.

# PIP
pip install qwak-sdk

# POETRY
poetry add qwak-sdk

Next, let’s configure the Qwak workspace. Run qwak configure and you’ll be prompted with “Please enter your API key:”; paste the key, and you’re done.

Once we have configured the qwak-sdk tool, and have created an account on Qwak, let’s go ahead and inspect how the Qwak build process works and what the Model Blueprint looks like.

The Build Lifecycle

Now, let’s understand how exactly the Qwak build system works and iterate on how to define a model schema, model interface, build steps, and deployment workflow.

Let’s start with the Python Project blueprint.

Here’s the folder structure for a new Qwak build, which will encapsulate our model and functionality when we deploy it on Qwak.

[QwakNewModelBuild]
|--- main/
| |- __init__.py
| |- requirements.txt
| |- model.py
|--- tests/
| |- __init__.py
| |- unit_tests.py
|
|--- test_local_model.py
# intended to test the model with `run_local` on your machine, to validate it before pushing to Qwak
|--- test_live_model.py
# code to test the model once it is deployed, during the Running Tests stage of the build.
# It basically involves a `qwak_inference.RealTimeClient` class that wraps your model and passes a dummy input through it.

Key points from here:

  • __init__.py : contains a single method, `load_model()`, which returns an instance of model.ClassName (see the sketch after this section).
  • requirements.txt : our dependency file, which can be replaced with either a pyproject.toml or a conda.yaml .
  • model.py : the model class implementation, where we’ll implement the QwakModel interface.
[QwakModel] class implements these methods:
|
|-- build            - invoked at build time when running `qwak models build` from the CLI.
|-- schema           - specifies the model's inputs and outputs.
|-- initialize_model - invoked when the model is loaded at serving time.
|-- predict          - invoked on each request to the deployment's endpoint.

! Important
The predict method is decorated with qwak.api(), which provides Qwak analytics
on model inference requests.

These files live under the main folder and define the required structure so that our model can pass the build.
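For reference, the __init__.py mentioned above can be as small as the sketch below (the class name is taken from the model implementation later in this lesson and is illustrative here):

# main/__init__.py (illustrative sketch)
from .model import CopywriterMistralModel


def load_model():
    return CopywriterMistralModel()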

Apart from that, we have:

  • tests : folder to group our custom unit tests and integration tests.
  • test_local_model.py : deploys our model locally and tests the model integrity and workflow.
  • test_live_model.py : once the model is remotely deployed on Qwak, we can test it using this script.

🔗 More insights on using Qwak from the team. Qwak Publication [3]

Mistral7b-Instruct LLM Model

As mentioned above, we’ll fine-tune a Mistral7b-Instruct [10] model in our LLM-Twin course use case.

Model Card
Mistral 7B is a 7 billion parameter LM that outperforms Llama 2 13B on all benchmarks and rivals Llama 1 34B in many areas. It features Grouped-query attention for faster inference and Sliding Window Attention for handling longer sequences efficiently. It’s released under the Apache 2.0 license.

Hugging Face Setup
To be able to download the model checkpoint, and further use it for fine-tuning, we need a Hugging Face Access Token. Here’s how to get it:

  1. Log in to HuggingFace [9]
  2. Head over to your profile (top-left) and click on Settings.
  3. On the left panel, go to Access Tokens and generate a new Token
  4. Save the Token

We’ll set this token as an environment variable in our fine-tuning setup.

Tokenizer Special Tokens
Before diving into the fine-tuning module and functionality, let’s get a refresher on what special tokens represent and why they differ between LLMs.

If we go to the Mistral7b-Instruct [10] model page and select Files and Versions, we’ll see the following view:

Image from HuggingFace.

For Mistral7b Instruct, the special_tokens_map.json includes the following tokens "bos_token": "<s>", "eos_token": "</s>", and "unk_token": "<unk>". These tokens define the start and end delimiters for prompts.

For the Instruct version of Mistral, two additional tokens, [INST] and [/INST], are used within the prompt scope <s>[INST]....[/INST]</s>. Since the model is instruction-tuned, these tokens delimit the instruction from the response, improving the model's ability to understand and respond to it effectively.
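Concretely, here is how a single training sample might be wrapped with these tokens (the instruction/content values below are invented for illustration; the actual wrapping is done by the generate_prompt() method shown later in this lesson):

sample = {
    "instruction": "Write a short LinkedIn post about MLOps best practices.",
    "content": "Shipping ML to production is mostly an engineering problem...",
}
prompt = f"<s>[INST]{sample['instruction']}\n[/INST] {sample['content']}</s>"
# -> <s>[INST]Write a short LinkedIn post ...
#    [/INST] Shipping ML to production ...</s>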

The Finetuning Pipeline

Now that we’ve covered the fundamentals of each topic, let’s put them all together and cover the implementation and fine-tuning process.

System Design
The fine-tuning process is based on the following system design.

LLM Twin fine-tuning workflow design.

We have our prepared dataset files versioned in CometML [11], from the previous lesson.
We implement the Model Schema and the fine-tuning logic following the Qwak Model Blueprint.
When a build is triggered, we deploy our model, fetch the data, fine-tune the model, and log parameters to CometML.

Implementation

As a starting point, here’s what our fine-tuning module’s folder structure looks like:

|--finetuning/
| |__ __init__.py
| |__ config.yaml
| |__ dataset_client.py
| |__ model.py
| |__ requirements.txt
| |__ settings.py
|
|__ .env
|__ build_config.yaml
|__ Makefile
|__ test_local.py

For the Qwak [2] remote deployment, we focus only on what’s under the finetuning folder, as the rest of the files apply only to the development environment.

Let’s start unpacking them, one by one:

  1. The config.yaml contains the training parameters for our model.
training_arguments:
  output_dir: "mistral_instruct_generation"
  max_steps: 10
  per_device_train_batch_size: 1
  logging_steps: 10
  save_strategy: "epoch"
  evaluation_strategy: "steps"
  eval_steps: 2
  learning_rate: 0.0002
  fp16: true
  remove_unused_columns: false
  lr_scheduler_type: "constant"

2. The dataset_client.py script holds the logic to interact with our project on CometML [11] and download the dataset artifacts.

Here, we’re using two main methods:

  • get_artifact — to connect to CometML and download the dataset artifacts.
  • split_data — to load the downloaded dataset, and prepare train/val splits.

Our versioned dataset looks like this:

[
  {
    "instruction": "Design and build a production-ready feature pipeline..",
    "content": "SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \\u2014 in Real-Time!Use a Python streaming engine to populate a feature store ..."
  },
  ...
  {
    "instruction": "Generate a publication that offers battle-tested content on building production-grade ML systems leveraging good SWE and MLOps practices...",
    "content": "DecodingML, The hub for continuous learning on ML system design, ML engineering, MLOps, LLMs and computer vision..."
  }
]

🔗 Check the DatasetClient implementation for more details.
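For orientation, here is a simplified sketch of what the DatasetClient might look like, assuming Comet's Artifacts API (Experiment.get_artifact / LoggedArtifact.download). The repository implementation adds more bookkeeping, and the build code later calls a download_dataset() helper that returns the train/validation file paths:

import json

from comet_ml import Experiment

from finetuning.settings import settings  # hypothetical import path


class DatasetClient:
    def __init__(self, output_dir: str = "./dataset"):
        self.output_dir = output_dir
        self.experiment = Experiment(
            api_key=settings.COMET_API_KEY,
            project_name=settings.COMET_PROJECT,
            workspace=settings.COMET_WORKSPACE,
        )

    def get_artifact(self, artifact_name: str) -> str:
        # Fetch the versioned dataset artifact from Comet and download its files locally.
        logged_artifact = self.experiment.get_artifact(artifact_name)
        logged_artifact.download(self.output_dir)
        return f"{self.output_dir}/{artifact_name}.json"  # assumed file naming

    def split_data(self, data_file: str, val_ratio: float = 0.1):
        # Load the downloaded samples and split them into train/validation sets.
        with open(data_file) as f:
            samples = json.load(f)
        split_idx = int(len(samples) * (1 - val_ratio))
        return samples[:split_idx], samples[split_idx:]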

3. In model.py we’re wrapping our Mistral7b-Instruct model as a Qwak model and implementing the required stages discussed above in the Build Lifecycle.

As a recap, here’s the QwakModel interface we’re going to implement:

class QwakModel:
    """
    Base class for all Qwak based models.
    """

    @abstractmethod
    def build(self):
        raise ValueError("Please implement build method")

    @abstractmethod
    def predict(self, df):
        raise ValueError("Please implement predict method")

    def initialize_model(self):
        pass

    def schema(self) -> ModelSchema:
        pass

And here’s the method map of our model class:

class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        is_saved: bool = False,
        model_save_dir: str = "./model",
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        comet_artifact_name: str = "cleaned_posts",
        config_file: str = "./finetuning/config.yaml",
    ): ...

    def _prep_environment(self): ...

    def _init_4bit_config(self): ...

    def _initialize_qlora(self, model: PreTrainedModel) -> PeftModel: ...

    def _init_trainig_args(self): ...

    def _remove_model_class_attributes(self): ...

    def load_dataset(self) -> DatasetDict: ...

    def preprocess_data_split(self, raw_datasets: DatasetDict): ...

    def generate_prompt(self, sample: dict) -> dict: ...

    def tokenize(self, prompt: str) -> dict: ...

    def init_model(self): ...

    def build(self): ...

    def initialize_model(self): ...

    def schema(self) -> ModelSchema: ...

    @qwak.api(output_adapter=DefaultOutputAdapter())
    def predict(self, df): ...

Diving into model.py, we start by defining the CopywriterMistralModel class and its constructor:

...
from qwak.model.base import QwakModel


class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        is_saved: bool = False,
        model_save_dir: str = "./model",
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        comet_artifact_name: str = "cleaned_posts",
        config_file: str = "./finetuning/config.yaml",
    ):
        self._prep_environment()
        self.experiment = None
        self.model_save_dir = model_save_dir
        self.model_type = model_type
        self.comet_dataset_artifact = comet_artifact_name
        self.training_args_config_file = config_file
        if is_saved:
            self.experiment = Experiment(
                api_key=settings.COMET_API_KEY,
                project_name=settings.COMET_PROJECT,
                workspace=settings.COMET_WORKSPACE,
            )

    def _prep_environment(self):
        os.environ["TOKENIZERS_PARALLELISM"] = settings.TOKENIZERS_PARALLELISM
        th.cuda.empty_cache()
        logging.info("Emptied cuda cache. Environment prepared successfully!")

We’re going to use constructor variables throughout the Qwak lifecycle methods.

Next, we have a series of methods to prepare the BitsAndBytes config, the QLoRA config, and the training arguments.
In _init_4bit_config we’re instantiating the BitsAndBytes config that’ll allow us to run operations in lower precision during training, saving compute and time.

def _init_4bit_config(self):
    self.nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=th.bfloat16,
    )
    if self.experiment:
        self.experiment.log_parameters(self.nf4_config)
    logging.info(
        "Initialized config for param representation on 4bits successfully!"
    )

In _initialize_qlora we’re adding the QLoRA adapter on top of our model to mark which layers we’re going to fine-tune.

def _initialize_qlora(self, model: PreTrainedModel) -> PeftModel:
    self.qlora_config = LoraConfig(
        lora_alpha=16, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM"
    )

    if self.experiment:
        self.experiment.log_parameters(self.qlora_config)

    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, self.qlora_config)
    logging.info("Initialized qlora config successfully!")
    return model

In _init_trainig_args() we’re loading the training config and logging it to our CometML experiment.

def _init_trainig_args(self):
    with open(self.training_args_config_file, "r") as file:
        config = yaml.safe_load(file)
    self.training_arguments = TrainingArguments(**config["training_arguments"])
    if self.experiment:
        self.experiment.log_parameters(self.training_arguments)
    logging.info("Initialized training arguments successfully!")

In _remove_model_class_attributes we’re deleting the model, trainer, and Comet experiment attributes to skip their serialization when building the Qwak artifact.
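A minimal sketch of what this cleanup might look like (the attribute names follow the class above; the exact implementation may differ):

def _remove_model_class_attributes(self):
    # Drop heavyweight and non-serializable attributes before Qwak serializes the model class.
    del self.model
    del self.trainer
    del self.experiment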

Next, we define the methods that’ll interact with the DatasetClient class and prepare our data for fine-tuning.

  1. The generate_prompt() method wraps a data sample with Mistral7b Instruct special tokens:
def generate_prompt(self, sample: dict) -> dict:
    full_prompt = f"""<s>[INST]{sample['instruction']}
[/INST] {sample['content']}</s>"""
    result = self.tokenize(full_prompt)
    return result

2. The load_dataset() handles our data preparation (download, split, and pre-process). In the end, we’ll have our fine-tuning samples as valid prompts with instruction/content fields ready for training.

def load_dataset(self) -> DatasetDict:
    dataset_handler = DatasetClient()
    train_data_file, validation_data_file = dataset_handler.download_dataset(
        self.comet_dataset_artifact
    )
    data_files = {"train": train_data_file, "validation": validation_data_file}
    raw_datasets = load_dataset("json", data_files=data_files)
    train_dataset, val_dataset = self.preprocess_data_split(raw_datasets)
    return DatasetDict({"train": train_dataset, "validation": val_dataset})

def preprocess_data_split(self, raw_datasets: DatasetDict):
    train_data = raw_datasets["train"]
    val_data = raw_datasets["validation"]
    generated_train_dataset = train_data.map(self.generate_prompt)
    generated_train_dataset = generated_train_dataset.remove_columns(
        ["instruction", "content"]
    )
    generated_val_dataset = val_data.map(self.generate_prompt)
    generated_val_dataset = generated_val_dataset.remove_columns(
        ["instruction", "content"]
    )
    return generated_train_dataset, generated_val_dataset

In tokenize() we’re passing our prompt through the tokenizer.

In init_model(self) we’re connecting to HF and downloading the Mistral7B-Instruct checkpoint, setting the model and the tokenizer as class instance attributes.
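Neither method's body is reproduced in full here, so below is a minimal sketch of both, assuming the standard Hugging Face transformers API; details like the maximum sequence length and padding strategy are assumptions and may differ from the repository code:

def tokenize(self, prompt: str) -> dict:
    # Truncate/pad to a fixed context window so batches have a uniform shape.
    return self.tokenizer(
        prompt,
        padding="max_length",
        truncation=True,
        max_length=512,
    )

def init_model(self):
    self.model = AutoModelForCausalLM.from_pretrained(
        self.model_type,
        token=settings.HUGGINGFACE_ACCESS_TOKEN,
        quantization_config=self.nf4_config,  # load the weights in 4-bit
    )
    self.tokenizer = AutoTokenizer.from_pretrained(
        self.model_type,
        token=settings.HUGGINGFACE_ACCESS_TOKEN,
    )
    self.tokenizer.pad_token = self.tokenizer.eos_token
    logging.info("Downloaded the model and tokenizer from Hugging Face!")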

Next up is the build method which encapsulates the overall fine-tuning process functionality.

def build(self):
    self._init_4bit_config()
    self.init_model()
    if self.experiment:
        self.experiment.log_parameters(self.nf4_config)
    self.model = self._initialize_qlora(self.model)
    self._init_trainig_args()
    tokenized_datasets = self.load_dataset()
    self.device = th.device("cuda" if th.cuda.is_available() else "cpu")
    self.model = self.model.to(self.device)
    self.trainer = Trainer(
        model=self.model,
        args=self.training_arguments,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        tokenizer=self.tokenizer,
    )
    logging.info("Initialized model trainer")
    self.trainer.train()
    logging.info("Finished model finetuning!")
    self.trainer.save_model(self.model_save_dir)
    logging.info(f"Finished saving model to {self.model_save_dir}")
    self.experiment.end()
    self._remove_model_class_attributes()
    logging.info("Finished removing model class attributes!")

Here, we’re doing the following:

  1. Prepare the BitsAndBytes config, log it to CometML [11], and initialize the model.
  2. Apply the QLoRAAdapter and prepare training arguments from our defined config.yaml .
  3. Instantiate the Transformers Trainer class, which wraps the model training loop functionality.
  4. Train the model using self.trainer.train().

Now that we’ve covered the implementation details, let’s see how to trigger the process and deploy this on Qwak [2].

Deployment on Qwak

Before the actual deployment, let’s make sure we have created a new project and model in Qwak and have the required environment variables in place.

Create a new Qwak model and project; we’ll use these names when configuring the build_config.yaml .

qwak models create "ModelName" --project "ProjectName"

Next, let’s populate the environment variables.

HUGGINGFACE_ACCESS_TOKEN: str = ""
COMET_API_KEY: str = ""
COMET_WORKSPACE: str = ""
COMET_PROJECT: str = ""
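These values live in the .env file and are read by finetuning/settings.py. Here is a minimal sketch of settings.py, assuming pydantic-settings is used; the actual implementation may differ:

# finetuning/settings.py (sketch, assuming pydantic-settings)
from pydantic_settings import BaseSettings, SettingsConfigDict


class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    TOKENIZERS_PARALLELISM: str = "false"
    HUGGINGFACE_ACCESS_TOKEN: str = ""
    COMET_API_KEY: str = ""
    COMET_WORKSPACE: str = ""
    COMET_PROJECT: str = ""


settings = AppSettings()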

In order to get the CometML-related variables, head over to CometML [11] and log in. The next step is to create a New Project using the button in the top-left corner. You’ll see the following view:

Image by Author. Comet New Project

Once you’ve created a project, populate the COMET_PROJECT env variable.
To get the COMET_WORKSPACE, copy the name to the right of the Comet logo; in my case it is joywalker.

To generate a new API_KEY , in your Comet dashboard, go to your profile, select API Key, click on Manage API Keys, and generate a new key.

We’re all set!

Let’s now check how the build_config.yaml streamlines our Qwak deployment with a single command.

build_env:
  docker:
    assumed_iam_role_arn: null
    base_image: public.ecr.aws/qwak-us-east-1/qwak-base:0.0.13-gpu
    cache: true
    env_vars:
      - HUGGINGFACE_ACCESS_TOKEN="your-hf-token"
      - COMET_API_KEY="your-comet-key"
      - COMET_WORKSPACE="comet-workspace"
      - COMET_PROJECT="comet-project"
    no_cache: false
    params: []
    push: true
  python_env:
    dependency_file_path: finetuning/requirements.txt
    git_credentials: null
    git_credentials_secret: null
    poetry: null
    virtualenv: null
  remote:
    is_remote: true
    resources:
      cpus: null
      gpu_amount: null
      gpu_type: null
      instance: gpu.a10.2xl
      memory: null
build_properties:
  branch: finetuning
  build_id: null
  model_id: "your-model-name"
  model_uri:
    dependency_required_folders: []
    git_branch: master
    git_credentials: null
    git_credentials_secret: null
    git_secret_ssh: null
    main_dir: finetuning
    uri: .
  tags: []
deploy: false
deployment_instance: null
post_build: null
pre_build: null
purchase_option: null
step:
  tests: true
  validate_build_artifact: true
  validate_build_artifact_timeout: 120
verbose: 0

Let’s unpack this Qwak deployment configuration file:

  • We’re starting from a Qwak base Docker image with GPU support.
  • Under the python_env tag we’re specifying how to install container requirements.
  • Under the remote:resources tag we’re specifying the instance type we want the deployment to be scheduled on.
  • Under the build_properties we’re specifying the root path of our QwakModel definition (i.e., the finetuning folder) using model_uri:main_dir.
  • We don’t run any pre-build or post-build functionality.
  • Under the step tag we’re selecting to run tests and to validate the Qwak artifacts once the Build stage is done.

The validate_build_artifact step runs once the build is complete. It wraps the deployment container and checks its health, ensuring it can be deployed correctly.

Now, to trigger the build on Qwak [2], we use the pre-defined command in our Makefile: `qwak models build -f build_config.yaml`

Below, you can find a snapshot of the Running Build stage on Qwak.

Image by Author. Snapshot of Qwak Build Stages

Experiment Tracking with CometML

Once we’ve successfully deployed the fine-tuning module, let’s inspect the Experiments we’ve tracked on CometML [11].

Image by Author. Comet Experiments Dashboard

Upon selecting an experiment, we’re taken to a detailed view with the parameters, code, metrics, and other metadata fields and artifacts we’ve logged.

Image by Author. Detailed Experiment View

Here, we can inspect:

  • The model definition summary of layers and modules, using the Graph definition
  • Logged hyperparameters and metrics
  • System metrics (GPU/CPU usage during the active experiment run)
  • Code changes
  • And many more…

The key components are the Charts and Panels that help us monitor the fine-tuning process. In this case, the training loss is logged automatically, as Comet hooks into the executed PyTorch code.

ℹ️ To enable the comet_ml package to log everything automatically by default, make sure you import comet_ml before importing torch in your script.

Comparing Experiments

Let’s see how we can compare multiple experiments to identify the key set of parameters and insights from the fine-tuning process.

Check desired experiments and select Compare.

This will overlay the experiments and provide a common view, making it easier to spot key insights from the training process.

Next, let’s add another panel and populate it with other metrics; we’ll chart the validation loss. To do that, click on Add Panel, select the Line Chart type, under the Y-Axis select eval_loss, and then click Done.

Another very useful feature that Comet offers is Code Diff, which gives you a git-like interface to compare code changes between experiments.

Here’s how it looks:

Image by Author. Comet Code Diff.

With all the features it offers, the extensibility of its UI dashboard, and the overall developer experience, CometML [11] earns a top spot in the modeling stage of the MLOps lifecycle.

Ending Notes and Conclusion

Here we’re wrapping up Lesson 7 of the LLM Twin free course.

In this lesson, we’ve covered the end-to-end fine-tuning process for a Mistral7b-Instruct model, while using MLOps recommended practices of versioning, containerization, reproducibility, and experiment tracking.

We’ve also covered in detail not one, but two powerful MLOps platforms: CometML [11], to track our experiments and monitor parameters, datasets, code changes, and metrics, and Qwak [2], to encapsulate and easily deploy our fine-tuning workflow with just a few clicks.

Completing Lesson 7, you’ve gained a good understanding of fine-tuning and data preparation for a Mistral7b-Instruct model, as well as detailed topics like special tokens, reducing model size, PEFT, BitsAndBytes, and LoRA.

Along the way, you’ve learned to use CometML to track and compare training experiments and Qwak to encapsulate and deploy training/inference for LLM workloads to the cloud with just a few lines of code and a smooth dev experience.

In Lesson 8, we’ll cover evaluation. We’ll discuss common evaluation techniques and traditional metrics, and dive into production-stage recommendations on the topic. See you there!

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Enjoyed This Article?

Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE

References

[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization

[2] Qwak, 2024, The Qwak.ai Platform landing Page

[3] Qwak Publication, The Qwak.ai Medium Publication

[4] Qwak Processing Unit (QPU), The Qwak.ai GPU Instances Cost

[5] LLM Fine-tuning, Guide on Fine-Tuning from Microsoft

[6] PEFT, Parameter Efficient Fine Tuning

[7] LoRA, What is Low-Rank Adaptation

[8] BitsAndBytes, Low Precision Operations in Transformers

[9] HuggingFace, HuggingFace Landing Page

[10] Mistral7b, The Mistral7B-Instruct Model Page

[11] CometML, The CometML Experiment Tracking Platform

[12] QDrant, The QDrant Landing Page
