As with Part 1, we are using ROCm 5.7 installed on Jammy Jellyfish to run llama.cpp. This time we will be using Facebook's commercially licensed model: Llama-2-7b-chat.

We assume that you have already followed the tutorial in Part 1 and have llama.cpp up and running.

Llama 2 — Meta AI

Follow the instructions to download the model. I have selected the Llama-2-7b-chat model, which Meta has trained using RLHF to optimize it for chat.

After you select a model, you will receive an email from Meta containing a link to use with the download script in Meta's git repository:

GitHub — facebookresearch/llama: Inference code for LLaMA models
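If you have not run the download script before, the session looks roughly like this (the script prompts you for the signed URL from the email and for which models to fetch; exact prompts may vary with the repository version):

git clone https://github.com/facebookresearch/llama.git
cd llama
./download.sh

When prompted, paste the URL from the email and enter 7B-chat as the model to download.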

Once you have signed up for the Facebook Llama 2 commercial license and downloaded Llama-2-7b-chat by following the instructions in the email, you will receive a model directory of this form:

llama-2-7b-chat
├── checklist.chk
├── consolidated.00.pth
└── params.json
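Before converting anything, it is worth checking the download against the bundled checksums (checklist.chk is in md5sum format, so a quick sanity check looks like this):

cd llama-2-7b-chat && md5sum -c checklist.chk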

OK, so that's nice, and it weighs in at about 13 GB (the figure below is du output in 1 KB blocks):

13161068        llama-2-7b-chat/consolidated.00.pth

But we can convert this into a format usable by llama.cpp as follows:

./convert.py ../llama/llama-2-7b-chat/consolidated.00.pth --outtype f16 --outfile mymodel/mychat-ggml-model-f16.gguf

This is the convert.py script included in the llama.cpp folder. We have run it on the downloaded .pth file to create a GGUF model file that llama.cpp can read.
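One caveat: convert.py also needs the SentencePiece tokenizer. The download script saves tokenizer.model alongside the model folder, and convert.py should pick it up from there; if it complains that the tokenizer is missing, copying the file into the model directory is the usual fix (paths here assume the layout above):

cp ../llama/tokenizer.model ../llama/llama-2-7b-chat/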

However, the file it creates is too large for my GPU, which has only 12 GB of memory.
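You can see why: f16 stores each of the roughly 7 billion weights in two bytes, so the converted file comes out at around 13 GB, which leaves no headroom on a 12 GB card. A quick check:

ls -lh mymodel/mychat-ggml-model-f16.gguf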

So the answer is to reduce the size using one of the many quantization formats supported by the quantize binary that ships with llama.cpp:

./quantize mymodel/mychat-ggml-model-f16.gguf mymodel/mychat-lte.gguf q4_1

Now you can happily run this new model file with llama.cpp, and it will fit on your GPU because it has shrunk from 13 GB to 4 GB. The q4_1 format stores each weight in 4 bits (plus a small per-block scale and offset), which works out to roughly a quarter of the f16 size. Quantization does not prune weights; it reduces the precision of each weight, so the quality of the chat may drop slightly. However, the model should now easily fit in your GPU's memory.
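q4_1 is only one of several quantization formats: llama.cpp also ships q4_0, q5_0, q5_1, q8_0 and more (run ./quantize with no arguments to print the full list). If quality matters more to you than size, a slightly larger alternative looks like this (the mychat-q5.gguf name is just an example):

./quantize mymodel/mychat-ggml-model-f16.gguf mymodel/mychat-q5.gguf q5_1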

4139408 mymodel/mychat-lte.gguf

Here is a nice example:

sudo HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=1 ./main -ngl 10 -m mymodel/mychat-lte.gguf -p "Can we be friends?"
Me: Can we be friends?

mychat: Yes, of course! I'd love to chat with you and be friendly. What would you like to talk about or ask me? [end of text]

Me: What time is it where you are? What is your name?

mychat: I'm in Eastern Time, so it's currently 9:05 PM on Friday. My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. How can I help you today? [end of text]
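The -p flag runs a single prompt and then exits. For a back-and-forth conversation like the one above, llama.cpp's interactive mode works well; the chat-with-bob.txt example prompt ships in the llama.cpp prompts folder:

sudo HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=1 ./main -ngl 10 -m mymodel/mychat-lte.gguf --color -i -r "User:" -f prompts/chat-with-bob.txt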

You can now use Facebook's official Llama 2 models with llama.cpp.
