Efficiently Run Your Fine-Tuned LLM Locally Using Llama.cpp 🚀

Matan Kleyman · Published in Vendi AI · 5 min read · Oct 3, 2023

Unlock ultra-fast performance on your fine-tuned LLM (Large Language Model) using the Llama.cpp library on local hardware, like PCs and Macs. Let’s dive into a tutorial that navigates through converting, quantizing, and benchmarking an LLM on a Mac M1.

Introduction 🌐

Running LLMs on your computer’s CPU is getting a lot of attention lately, with many tools trying to make it easier and faster. This tutorial spotlights Llama.cpp, demonstrating how to run a 7B LLaMA-family model at speeds that outpace the conventional deep learning runtimes we are used to. Llama.cpp is an inference engine for a range of LLM architectures, implemented purely in C/C++, which results in very high performance.

First Step: Picking Your Model 🗄️

Llama.cpp expects the model in the GGUF format. You can find many models already converted to this format, especially from creators like TheBloke on platforms like Hugging Face, but sometimes you will need to convert your own PyTorch model weights to GGUF. We’ll guide you through this process step by step!

Installation 📲

First, let’s get llama.cpp and set it up:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                 # CPU-only build
make LLAMA_CUBLAS=1  # build with CUDA (cuBLAS) support if you have an NVIDIA GPU

Next, we download the original weights of a model from Hugging Face that is based on one of the architectures llama.cpp supports.

For this guide we will be using UniNER, a large language model fine-tuned from LLaMA-7B for entity-extraction tasks.

Download the model like this:

pip install transformers datasets sentencepiece
huggingface-cli download Universal-NER/UniNER-7B-type --local-dir models

Check the models folder to make sure everything downloaded. Then, run the llama.cpp convert script:

python convert.py ./models

The script takes the original PyTorch weights and converts them to the .gguf format. You should see a file named ggml-model-f32.gguf weighing about 27 GB.

Screenshot taken by the Author

The GGUF format is quite new; it was introduced in August 2023 and is the file format llama.cpp uses to store and load model weights. Converting to it is a mandatory step before the model can be loaded into llama.cpp.

Quantization

Quantization of deep neural networks is the process of taking full-precision weights (32-bit floating point) and converting them to smaller approximate representations such as 4-bit or 8-bit values (see the short sketch after the list below).

Why is this important?

  • Smaller models need less RAM/VRAM, which is easier on your GPU (or CPU).
  • They take up less disk space and load faster.
  • They run much faster, since less data has to move through memory.
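
To build intuition, here is a minimal, purely illustrative sketch of block-wise 4-bit quantization in the spirit of q4_1 (each block of weights gets its own scale and minimum); llama.cpp’s real kernels are hand-optimized C/C++ and differ in layout and detail:

# Illustrative only: toy block-wise 4-bit quantization (per-block scale + min).
# This is NOT llama.cpp's actual implementation, just the underlying idea.
import numpy as np

BLOCK = 32  # llama.cpp quantizes weights in small fixed-size blocks

def quantize_blocks(weights: np.ndarray):
    blocks = weights.reshape(-1, BLOCK)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0  # 4 bits -> codes 0..15
    scales = np.where(scales == 0, 1.0, scales)                 # avoid division by zero
    codes = np.round((blocks - mins) / scales).astype(np.uint8)
    return codes, scales, mins

def dequantize_blocks(codes, scales, mins):
    return (codes.astype(np.float32) * scales + mins).reshape(-1)

weights = np.random.randn(4096).astype(np.float32)
codes, scales, mins = quantize_blocks(weights)
error = np.abs(dequantize_blocks(codes, scales, mins) - weights).max()
print(f"max reconstruction error: {error:.4f}")

Each weight is stored as a 4-bit code plus a small amount of per-block metadata, which is where the size savings come from.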

Here’s a quick peek at the different ways to shrink models with llama.cpp:

Screenshot from llama.cpp repository

Choosing a lower bit width makes the model smaller and faster but may sacrifice a bit of accuracy. We’ll use q4_1, which balances speed and accuracy well.

./quantize models/ggml-model-f32.gguf models/quantized_q4_1.gguf q4_1

Each weight tensor gets roughly 6-7x smaller (q4_1 stores 4-bit values plus a per-block scale and minimum, about 5 bits per weight), so the final file should be around one-sixth of the original size, as the quick estimate below shows.
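
As a rough back-of-the-envelope check (the parameter count below is the approximate LLaMA-7B figure, and the ~5 bits per weight is an estimate for q4_1 once the per-block metadata is included):

# Rough size estimate with illustrative numbers only.
params = 6.74e9                  # approximate LLaMA-7B parameter count
fp32_gb = params * 32 / 8 / 1e9  # 32 bits per weight
q4_1_gb = params * 5 / 8 / 1e9   # ~5 bits per weight (4-bit codes + per-block scale/min)
print(f"fp32: {fp32_gb:.1f} GB  q4_1: {q4_1_gb:.1f} GB  ratio: {fp32_gb / q4_1_gb:.1f}x")

This prints roughly 27 GB for fp32 and a bit over 4 GB for q4_1, which lines up with the file sizes we’ll see below.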

Quantization screenshot

Let’s compare the sizes of the original and the quantized model:

ls -lh ./models

Original weights: 26.9 GB 📦 | Quantized: 4.2 GB 🎈 (about a 6.4x reduction)

🎉 Woohoo! Your model is now smaller and faster.

Running the model 🏃‍♂️

You have two main ways to run your GGUF model with llama.cpp:

  • CLI inference — The model loads, runs the prompt, and exits, all in one go. Good for a single run (see the quick example after this list).
  • Server inference — The model loads into RAM and an HTTP server starts. It stays loaded for as long as the server is running.
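
For the CLI path, a one-shot run is a single call to llama.cpp’s main binary; here it is wrapped in a little Python so you can script it (the -m, -p, and -n flags select the model file, the prompt, and the number of tokens to generate):

# One-shot CLI inference: load the model, generate, and exit.
import subprocess

result = subprocess.run(
    ["./main", "-m", "./models/quantized_q4_1.gguf", "-p", "Hello, my name is", "-n", "64"],
    capture_output=True,
    text=True,
)
print(result.stdout)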

To run the server (-c sets the context size; --port matches the URL we will call below):

./server -m ./models/quantized_q4_1.gguf -c 1024 --port 8000

The logs will tell you when the server is up and running!

Making a Prediction 🔮

We consume predictions through API calls to the server’s /completion endpoint:

url = f"http://localhost:8000/completion"
prompt = "<prompt_placeholder>"

req_json = {
"stream": False,
"n_predict": 400,
"temperature": 0,
"stop": [
"</s>",
],
"repeat_last_n": 256,
"repeat_penalty": 1,
"top_k": 20,
"top_p": 0.75,
"tfs_z": 1,
"typical_p": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"mirostat": 0,
"mirostat_tau": 5,
"mirostat_eta": 0.1,
"grammar": "",
"n_probs": 0,
"prompt": prompt
}

res = requests.post(url, json=req_json)
result = res.json()["content"]
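
UniNER expects a conversation-style prompt. As a purely illustrative helper (the function name is ours; the template mirrors the example used in the benchmark below), you could build it like this:

# Hypothetical helper that builds a UniNER-style entity-extraction prompt.
def build_uniner_prompt(text: str, entity_type: str) -> str:
    return (
        "A virtual assistant answers questions from a user based on the provided text. "
        f'USER: "{text}" '
        "ASSISTANT: I've read this text.</s> "
        f"USER: What describes {entity_type} in the text? "
        "ASSISTANT:"
    )

prompt = build_uniner_prompt(
    "I have no medical history worth mentioning except for child asthema",
    "medical history",
)
# Pass this string as the "prompt" field of the request above.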

How’s the Performance? 📊

To measure performance, we look at how many tokens (pieces of text) the model can process per second. Every run involves three phases:

  • Prompt evaluation — How long it takes to process the prompt tokens.
  • Generation — How long it takes to generate new tokens.
  • Sampling — How long it takes to sample the next-token candidates.

Running the quantized q4_1 model on my MacBook Pro M1 gives us:

Prompt: A virtual assistant answers questions from a user based on the provided text. USER: "I have no medical history worth mentioning except for child asthema" ASSISTANT: I've read this text.</s> USER: What describes medical history in the text? ASSISTANT:

Output: ["child asthema"]

Results 📸

The total time taken to process the request was 804 ms, less than a second!

Remember, in the world of LLMs it’s all about how many tokens can be processed per second; the higher, the better!
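
If you want a quick number of your own, here is a minimal client-side timing sketch. It only measures the wall-clock latency of a single request to the /completion endpoint; llama.cpp’s own logs break the time down into prompt evaluation, generation, and sampling:

# Rough client-side latency check against the running llama.cpp server.
import time
import requests

url = "http://localhost:8000/completion"
payload = {"prompt": "Hello, world!", "n_predict": 64, "temperature": 0, "stream": False}

start = time.perf_counter()
res = requests.post(url, json=payload)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"total request time: {elapsed_ms:.0f} ms")
print("output:", res.json()["content"])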

Go ahead: clone llama.cpp and run it against your own model with whatever prompt you like, and see how fast it is for yourself!

Stay connected with Vendi AI 💌. Connect with us on our LinkedIn page, and check out our GitHub.
