Fine-Tuning CodeLlama 34B for Chat

Anchen
4 min read · Sep 6, 2023

While Meta’s Llama2 captured significant attention in the AI landscape, the 34B variant was notably absent for quite a while. For many, 34B is the ideal size for running local LLMs, since it fits on a single 4090 GPU with 4-bit quantisation, and I had been keenly waiting for Meta to release it. Fortunately, Meta recently introduced CodeLlama, a specialised model trained for coding-related tasks. According to the paper, CodeLlama 34B is initialised from the Llama2 34B weights (which were trained on 2T tokens) and trained on an additional 500B tokens of mostly code data, so conceptually it can still be fine-tuned for various downstream domains, including chat.

In this blog post, I’ll walk you through the complete process for fine-tuning the CodeLlama 34B model. After training, I’ll guide you on how to quantize the model and deploy it using Huggingface’s text generation inference framework.

QLoRA CodeLlama 34B

This article won’t delve into the specifics of using QLoRA to fine-tune Llama models. If you’re not familiar with QLoRA, or want more detail on fine-tuning Llama models with SFT and DPO, please refer to my previous post: https://medium.com/@anchen.li/fine-tune-llama-2-with-sft-and-dpo-8b57cf3ec69

Modifications to the SFTTrainer Script:
Before proceeding with the fine-tuning, there are a few changes you’ll need to make to the existing SFTTrainer script:

- We’ll be using the open-source Guanaco dataset for this exercise. This dataset is based on text chunks, so there’s no need for additional preprocessing steps.
- To fine-tune the CodeLlama 34B model on a single 4090 GPU, you’ll need to reduce the LoRA rank to 32 and set the maximum sequence length to 512 due to VRAM limitations.

You can find the updated script with the adjusted LoRA configuration here: https://gist.github.com/mzbac/a912894942fe625c50d2a4b79902e6cd
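
For reference, the core of that script looks roughly like the snippet below. This is a minimal sketch, assuming trl’s SFTTrainer, peft’s LoraConfig, bitsandbytes 4-bit loading, and the timdettmers/openassistant-guanaco dataset; the exact hyperparameters in the gist may differ.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_id = "codellama/CodeLlama-34b-hf"

# Load the base model in 4-bit (QLoRA) so it fits on a single 24GB 4090.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Guanaco is already chunked into plain "text" samples, so no extra preprocessing.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Reduced rank (32) to keep the adapter small enough for 24GB of VRAM.
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,  # shorter sequences to stay within VRAM limits
    args=TrainingArguments(
        output_dir="./results_new",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
```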

The training process takes approximately 18 hours to complete 3 epochs, and you should expect a final loss of around 1.2.

Merging the Adapters

Once you’ve completed the fine-tuning process, it’s better to merge your adapters back into the base model. Running inference directly on adapters can result in slower performance due to the extra parameters and computations involved.

You can use the following script to merge the adapter on the CPU to avoid OOM: https://gist.github.com/mzbac/16b0f4289059d18b8ed34345ae1ab168

python merge_peft_adapters.py --device cpu --base_model_name_or_path codellama/CodeLlama-34b-hf --peft_model_path ./results_new/final_checkpoint --output_dir ./merged_models/
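
Under the hood, the merge is essentially the following (a minimal sketch using peft’s PeftModel and merge_and_unload; the gist adds argument parsing and a few safeguards on top of this):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "codellama/CodeLlama-34b-hf"
adapter_path = "./results_new/final_checkpoint"
output_dir = "./merged_models/"

# Loading without a device_map keeps everything on the CPU, avoiding GPU OOM.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.float16
)

# Attach the LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()

# Save the merged checkpoint together with the tokenizer.
model.save_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.save_pretrained(output_dir)
```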

Quantization

In the local LLM community, quantization is a commonly adopted strategy for optimizing models to run in resource-constrained environments. While there are multiple approaches to choose from, such as 4-bit quantization via bitsandbytes, this post won’t delve into the technical nuances of each method. If you’re interested in a more in-depth exploration of quantization techniques, Hugging Face offers an excellent blog post that covers the topic: https://huggingface.co/blog/merve/quantization

For our purposes, we’ll be using GPTQ as it enjoys widespread support. To get started, you’ll first need to install the `auto_gptq` package, which can be done using the command `pip install auto_gptq`. Once installed, you can proceed with running the accompanying script. Note that you’ll need to update the `pretrained_model_dir` and `quantized_model_dir` fields to match your specific paths.
https://gist.github.com/mzbac/e6f1a328ab2606cec63f879def6d1458
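
If you’d rather see the shape of the quantization step without opening the gist, it boils down to something like this (a minimal sketch; the calibration example and the 4-bit/group-size settings are illustrative assumptions, not necessarily the exact values from the gist):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "./merged_models/"
quantized_model_dir = "./CodeLlama-34b-guanaco-gptq"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# A handful of calibration samples; ideally use text close to your target domain.
examples = [
    tokenizer(
        "### Human: Write a hello world program in Python.### Assistant: print('hello world')"
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize the weights to 4-bit
    group_size=128,  # a commonly used GPTQ group size
    desc_act=False,  # trade a little accuracy for faster inference
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)
```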

Inference via Text-Generation-Inference

Previously, I would manually copy and paste Python scripts to run the Llama model on my Ubuntu server. However, I’ve found that `text-generation-inference` not only simplifies this process but is also an excellent option for serving LLMs. One of its standout features is its support for multi-GPU inference: for instance, I can run a 34B GPTQ model across a 3060 and a 4070 GPU and get speeds of around 100ms per token.

To run your fine-tuned model using `text-generation-inference`, follow these quick steps:
- Install Docker
- Install the NVIDIA Container Toolkit (ask ChatGPT if you don’t know how to do it)

Running on Two GPUs:

sudo docker run --gpus all --shm-size 1g -p 8001:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --quantize gptq --num-shard 2 --model-id mzbac/CodeLlama-34b-guanaco-gptq

Running on a Single GPU:

sudo docker run --gpus all --shm-size 1g -p 8001:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --quantize gptq --model-id mzbac/CodeLlama-34b-guanaco-gptq
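
Once the container is up, you can sanity-check the endpoint with a small client script. Below is a minimal sketch using Python’s requests against TGI’s /generate route; the ### Human/### Assistant prompt format follows the Guanaco convention used during fine-tuning.

```python
import requests

# The container maps port 80 to 8001 on the host (see the docker commands above).
url = "http://127.0.0.1:8001/generate"

prompt = "### Human: Write a Python function that reverses a string.### Assistant:"

response = requests.post(
    url,
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])
```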

Conclusion

In this post, we have walked through the entire journey of fine-tuning the CodeLlama 34B model. Starting from Meta’s foundational work, we used QLoRA for fine-tuning, followed by post-training quantization with GPTQ, and then deployed the model in a production-ready state using text-generation-inference.

Thanks to the collective efforts of the open-source community, it is feasible to fine-tune LLMs on consumer-grade GPUs. Not only do we get a tailored model, but we also retain control over our data and privacy.

PS:

The fine-tuned CodeLlama chat model can be found here: https://huggingface.co/mzbac/CodeLlama-34b-guanaco-gptq
