The cheapskate’s guide to fine-tuning LLaMA-2 and running it on your laptop

Published in The Low End Disruptor · 6 min read · Sep 11, 2023

My mission

Everyone is GPU-poor these days, and some of us are poorer than others. So my mission is to fine-tune a LLaMA-2 model with only one GPU on Google Colab, and then run the trained model on my laptop using llama.cpp.

Why fine-tune an existing LLM?

A lot has been said about when to use prompt engineering, when to use RAG (Retrieval Augmented Generation), and when to fine-tune an existing LLM. I will not get into the details of those arguments, and will leave you with two in-depth analyses to explore on your own.

DeepLearning.ai course by Lamini

Blog post “Why You (Probably) Don’t Need to Fine-tune an LLM” by Jessica Yao

Assuming you still want to fine-tune your own LLM, let’s get started.

Why cheapskate

You can fine-tune OpenAI’s GPT-3.5-turbo model, which has become increasingly affordable for both fine-tuning and inference. There are a few reasons you might not want to do that: your training data is super secret, you don’t want to pay OpenAI every time you use the fine-tuned model, or you need to use your model without an Internet connection. In those cases, we will use open source LLMs.

Right now, Meta’s LLaMA-2 is the gold standard of open source LLMs, with good performance and permissive license terms. We will start with the smallest model, 7B, since it is cheaper and faster to fine-tune. Once you have gone through the whole process, you will be well on your way to the 13B and 70B models if you like.

Training dataset: Dolly 15K by Databricks

Training GPU: the easiest option is Google Colab. I believe you do need a Colab Pro account, which is $10 a month for 100 compute units. The examples below consume between 20 and 90 compute units, which translates to $2–9. I hope that is affordable, even for cheapskates.
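The arithmetic behind that estimate is simple enough to sketch in a few lines of Python. The $10 price and the 100 compute units come from the paragraph above; the function name is only for illustration, and Colab pricing may change.

```python
# Colab Pro figures quoted in the text: $10 buys 100 compute units.
PRICE_USD = 10.0
UNITS_PER_PLAN = 100


def units_to_usd(units: float) -> float:
    """Convert Colab compute units to dollars at the Colab Pro rate."""
    return units * PRICE_USD / UNITS_PER_PLAN


for units in (20, 90):
    print(f"{units} compute units is about ${units_to_usd(units):.2f}")
```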

Fine-tuning LLaMA-2 with QLoRA on single GPU

We have all heard about the tremendous cost of training a large language model, which is not something the average Jack or Jill will undertake. What we can do instead is freeze the weights of an existing LLM (e.g. 7B parameters) and fine-tune only a tiny adapter. The adapter is often less than 1% of the total parameters; even a generous 130M adapter, for example, would be under 2% of 7B.

One of these adapters is called LoRA (Low-Rank Adaptation), not to be confused with the red-haired heroine of the movie “Run Lola Run”.
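To get a feel for why the adapter is so small, here is a back-of-the-envelope sketch. LoRA leaves a frozen weight matrix of shape d_out × d_in untouched and trains two thin matrices of shapes d_out × r and r × d_in next to it, which adds r × (d_in + d_out) parameters. The layer sizes below are the published LLaMA-2 7B dimensions. The rank of 16 and the choice to adapt all seven linear projections in every layer are assumptions made for illustration, not necessarily what any particular tutorial uses.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)


# Published LLaMA-2 7B dimensions.
HIDDEN = 4096
MLP = 11008
LAYERS = 32
NOMINAL_BASE_PARAMS = 7_000_000_000  # the "7B" in the model name

# Assumptions for illustration: rank 16, adapters on every linear projection.
RANK = 16
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
total = per_layer * LAYERS
share = total / NOMINAL_BASE_PARAMS

print(f"adapter parameters: {total:,} ({share:.2%} of the base model)")
```

Under these assumptions the adapter has roughly 40 million trainable weights, well under 1% of the nominal 7 billion. A higher rank or fewer adapted layers would move that number up or down.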

In addition, QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model, instead of a 16-bit one, into the Low-Rank Adapters (LoRA). With the base weights stored in 4 bits, the entire training run fits into the memory of a single commodity GPU.
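A quick sketch of what going from 16 bits to 4 bits per weight means for memory. It only counts the base model weights and uses the nominal 7 billion parameters; the helper name is made up for this example.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal size of the 7B model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Weights alone do not tell the whole story: activations, adapter gradients, optimizer state and quantization constants come on top, so real memory usage during training is higher than these figures.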

For the trade-offs between this method and traditional full-parameter fine-tuning, see:

Blog post “Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2” by Anyscale

There are good tutorials and notebooks on fine-tuning LLaMA-2 models with LoRA, for example the OVH Cloud guide and Philipp Schmid’s script, both mentioned below.

In this article, I’m using the OVH Cloud guide with minor changes to the training parameters.

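# Training parameters
# (model, tokenizer and dataset are assumed to be prepared earlier, as in the guide)
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=1,   # one sample per forward pass
        gradient_accumulation_steps=4,   # accumulate 4 steps: effective batch size 4 on one GPU
        warmup_steps=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,                       # mixed-precision training
        logging_steps=10,
        output_dir="outputs",
        optim="paged_adamw_8bit",        # paged 8-bit AdamW (requires bitsandbytes)
    ),
    # mlm=False selects causal (next-token) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)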
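To see what these settings mean in practice, here is a rough estimate of the effective batch size and the number of optimizer steps. The dataset size of 15,000 is an assumption based on the “15K” in the dataset name, and the exact step count depends on how the trainer handles the last partial batch.

```python
def estimate_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = n_examples // effective_batch
    return effective_batch, steps_per_epoch * epochs


N_EXAMPLES = 15_000  # assumption: approximate size of Dolly 15K

effective_batch, total_steps = estimate_steps(
    N_EXAMPLES, per_device_batch=1, grad_accum=4, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"approximate optimizer steps: {total_steps:,}")
```

Halving num_train_epochs or doubling gradient_accumulation_steps changes the step count accordingly, which is the first knob to turn if you want a shorter (and cheaper) run.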

I used Google Colab Pro’s Nvidia A100 high-memory instance. The full fine-tuning run took about 7 hours and consumed 91 compute units.

Google Colab Nvidia A100 high-memory instance:

CPU RAM: 83.5GB
GPU RAM: 40GB
13 compute units per hour

Actual memory usage during training:

CPU: 6.1GB
GPU: 25.8GB (varies)

You can certainly use a single T4 high-memory instance (15GB of GPU RAM), which will take longer but cost less. I started a run on one but did not let it finish; the estimate was about 24 hours and 50 compute units. I’m quite sure an Nvidia 4090 (24GB of GPU RAM) or an equivalent consumer GPU would work for this fine-tuning task as well.

Note: Philipp Schmid’s script has more tricks to reduce training time. My test run on an A100 high-memory instance lasted about an hour and 15 minutes and cost less than 20 compute units.

Once the training is done, we save the LoRA adapter’s final checkpoint to a mounted Google Drive so we don’t lose it when the Google Colab session ends:

output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)

You can see that the file “adapter_model.bin” is tiny (152.7MB) compared to llama2-7b’s “consolidated.00.pth” (13.5GB).

Inference with llama.cpp

Both fine-tuning tutorials use GPU-based inference, but a true cheapskate wants to use their own laptop with a low-spec CPU and GPU. This is where llama.cpp comes into play. Your fine-tuned 7B model will run comfortably and quickly on an M1-based MacBook Pro with 16GB of unified memory. You can even push it to run the 13B model, if you free up some memory from resource-hungry apps.

There are a few simple steps to get your freshly fine-tuned model ready for llama.cpp. All the models reside in the directory “models”. Let’s create a new directory called “lora” under “models”, copy over all the original llama2-7B files, and then copy over the two adapter files from the previous step. The folder “lora” should then contain the original model files plus the two adapter files.
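If you would rather script the copying than drag files around, a minimal sketch could look like this. The function name and the example paths in the comment are hypothetical; adjust them to wherever your base model and adapter checkpoint actually live.

```python
import shutil
from pathlib import Path


def stage_lora_dir(base_dir: str, adapter_dir: str, target_dir: str) -> list[str]:
    """Copy the base model files, then the adapter files, into target_dir.

    Returns the sorted names of everything that ended up in target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for source in (Path(base_dir), Path(adapter_dir)):
        for item in source.iterdir():
            if item.is_file():
                shutil.copy2(item, target / item.name)
    return sorted(p.name for p in target.iterdir())


# Example call (hypothetical paths, not run here):
# stage_lora_dir("models/7B", "results/llama2/final_checkpoint", "models/lora")
```

Copying the base files first and the adapter files second means an adapter file wins if two files happen to share a name.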

Step 1: Convert the LoRA adapter to a ggml-compatible format:

python3 convert-lora-to-ggml.py models/lora

Step 2: Convert into f16/f32 models:

python3 convert.py models/lora

Step 3: Quantize to 4 bits:

./quantize ./models/lora/ggml-model-f16.gguf ./models/lora/ggml-model-q4_0.gguf q4_0
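For the curious, here is a simplified illustration of what block-wise 4-bit quantization does to the weights. It is not llama.cpp’s actual q4_0 code or file layout: the block size of 32 and the symmetric rounding scheme are assumptions chosen to keep the sketch short, and the weights are made-up numbers.

```python
def quantize_block(values):
    """Map floats to signed 4-bit codes in [-7, 7] sharing one scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]


BLOCK_SIZE = 32  # assumption for this sketch

# Made-up weights between -0.32 and 0.31, just to have something to quantize.
weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]

scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale = {scale:.4f}, worst rounding error = {worst_error:.4f}")
```

Each weight now needs only 4 bits plus a share of one scale per block, at the price of a rounding error of at most half the scale.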

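Now, finally, you have your shiny new gguf file that is baked with your special training data. It’s time to use it, or in fancy words, “run inference with it”! One thing to double-check: convert.py in Step 2 reads the base model weights, so if the answers do not reflect your fine-tuning, the adapter converted in Step 1 may need to be passed separately at run time. As far as I can tell, llama.cpp builds from this period provide a --lora option on main for that purpose.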

./main -m models/lora/ggml-model-q4_0.gguf --color -ins -n -1
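The conversion steps and the run command above can be chained in a small Python wrapper, which is handy once you start iterating on adapters. This is only a sketch: it assumes you run it from the llama.cpp checkout where those scripts and binaries live, and it defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def build_commands(model_dir: str = "models/lora") -> list[list[str]]:
    """Return the commands from the steps above as argument lists."""
    f16 = f"./{model_dir}/ggml-model-f16.gguf"
    q4 = f"./{model_dir}/ggml-model-q4_0.gguf"
    return [
        ["python3", "convert-lora-to-ggml.py", model_dir],      # Step 1
        ["python3", "convert.py", model_dir],                   # Step 2
        ["./quantize", f16, q4, "q4_0"],                        # Step 3
        ["./main", "-m", f"{model_dir}/ggml-model-q4_0.gguf",
         "--color", "-ins", "-n", "-1"],                        # interactive run
    ]


def run_all(commands, dry_run: bool = True) -> None:
    """Print each command; execute it too when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


run_all(build_commands())  # dry run: prints the four commands
```

Call run_all(build_commands(), dry_run=False) to actually execute the steps; check=True makes the wrapper stop as soon as one of them fails.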

You can see llama-2-7b-lora running blazing fast, while I have dozens of tabs open in two Chrome browsers, a Docker engine running Supabase and a web server, Visual Studio Code, and every instant messaging system imaginable, all on an average MacBook Pro M1 with 16GB of memory.

Next steps

Congratulations! You have just fine-tuned your first personal LLM and run it on your laptop. Now there are a few things you can do next:

Define a few use cases where fine-tuning an existing LLM will give you a unique advantage
Prepare your own dataset for fine-tuning: typically it will look like databricks-dolly-15k, with question-answer pairs. Your private data most likely won’t look like that, so you can use your own scripts, manual labor, and/or GPT-4 to format your data into the right training set.
Define your evaluation metrics, and compare the different approaches (prompt engineering, RAG, GPT-3.5 fine-tuning, open source LLM fine-tuning). Start with a small amount of training data, and build on your experience and success.
Show the world (and your boss) what you have just built!
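For the dataset preparation step in the list above, here is a minimal sketch of turning question-answer records into training text. The field names instruction, context and response follow the Dolly 15K records; the prompt template, the function names and the sample record are made up for illustration, so adapt them to whatever format your fine-tuning script expects.

```python
import json


def to_training_text(record: dict) -> str:
    """Render one record as a single prompt/response training string."""
    parts = [f"### Instruction:\n{record['instruction'].strip()}"]
    context = (record.get("context") or "").strip()
    if context:  # the context block is optional
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{record['response'].strip()}")
    return "\n\n".join(parts)


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one {"text": ...} object per line."""
    return "\n".join(json.dumps({"text": to_training_text(r)}) for r in records)


# Made-up sample record, only to show the output shape.
sample = {
    "instruction": "What is llama.cpp used for in this guide?",
    "context": "",
    "response": "Running the fine-tuned model on a laptop.",
}
print(to_training_text(sample))
```

Starting with a few dozen hand-checked records in this shape is usually enough to test the whole pipeline before you invest in formatting the full dataset.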