llama.cpp is a project that makes Llama 2, an open-source LLM released by Meta (formerly Facebook), usable from C++ while providing several optimizations and additional convenience features. If you are curious about Llama 2 itself, you can refer to the following post

The hallmark of llama.cpp is that, while the original Llama 2 is hard to use without a GPU, llama.cpp applies additional optimizations such as 4-bit quantization so that it can also run on the CPU. llama-cpp-python and LLamaSharp are ports of llama.cpp for use in Python and C#/.NET, respectively

As mentioned earlier, llama.cpp is a project that makes the original Llama 2 easier to use with fewer computing resources, and it is still being improved very actively. There are two main ways to use llama.cpp

1. Clone the repository with git clone and build it yourself

2. Select and download a prebuilt version from the Assets of the Releases page

The git clone approach itself will not be covered in detail here, since it is already well described in the usage section of the GitHub README. The topics covered in that usage section are as follows

Still, the items to pay attention to first are Build and BLAS Build. The basic Build is very simple: you can just build through make or cmake as you usually would. I used cmake because it was more convenient for me

mkdir build
cd build
cmake ..
cmake --build . --config Release

Looking at the Build part, you can also see that llama.cpp supports Metal and MPI builds. In other words, it supports builds that use the GPU on macOS as well as builds for cluster environments. For more information, please take a closer look at the Build part of the usage section
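For example, on macOS a Metal-enabled build can be sketched as follows; the LLAMA_METAL option name is the one used by the version I worked with and may differ in newer releases

mkdir build
cd build
# enable the Metal backend so inference can use the Apple GPU (option name may vary by version)
cmake .. -DLLAMA_METAL=ON
cmake --build . --config Release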

A BLAS build is, literally, a build that lets you take full advantage of BLAS. BLAS (Basic Linear Algebra Subprograms) is a specification for a set of low-level linear algebra routines, and there are various BLAS implementations optimized to run these operations faster depending on the hardware. The BLAS builds supported by llama.cpp are OpenBLAS, BLIS, Intel MKL, cuBLAS, hipBLAS, and CLBlast

The commonly used ones are OpenBLAS, cuBLAS, and CLBlast, which target the CPU, NVIDIA GPUs, and OpenCL-capable GPUs, respectively

When running a BLAS-enabled build, the BLAS field in the startup log is reported as 1. If it is not, the BLAS build was not done properly and you will not get the performance optimization from BLAS
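As an illustration, a cuBLAS or OpenBLAS build could look like the following; the CMake option names are the ones from the version I used and may differ in newer releases

# cuBLAS build for NVIDIA GPUs (option name may vary by version)
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

# OpenBLAS build for the CPU (option names may vary by version)
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build . --config Release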

The next parts to look at are Memory/Disk Requirements and Quantization

As the tables show, even after quantization the size of the model is still burdensome. However, compared to a model that has not been quantized, it is certainly much easier to handle
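For reference, if you built llama.cpp yourself, you can also produce quantized models directly with the bundled quantize tool. A minimal sketch with placeholder file names (the exact tool and argument names may vary by version) looks like this

# quantize an f16 GGUF model down to 4-bit (q4_0); file names are placeholders
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0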

If you look at the table in the quantization section, you can see metrics by the number of parameters and the degree of quantization, although they are not the latest figures. If you want more recent numbers, you can look at the model cards of the Llama-related GGML models posted by TheBloke on Hugging Face

The GGML models posted by TheBloke come in three sizes: 7B, 13B, and 70B, and if you look at the Files tab of each repository, models for each degree of quantization are uploaded

When downloading a model from there, you can download it directly in your browser, use Git LFS, or use the huggingface_hub package available in Python

To download a particular model via Git LFS, use the following

git lfs install
git clone git@hf.co:<MODEL ID> # example: git clone git@hf.co:bigscience/bloom

To download a particular model through the Hugging Face Hub package, use the following

from huggingface_hub import hf_hub_download

REPO_ID = "YOUR_REPO_ID" # Ex) TheBloke/Llama-2-7B-Chat-GGML
FILENAME = "YOUR_FILENAME" # Ex) llama-2-7b-chat.ggmlv3.q2_K.bin

hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

If you download the model via hf_hub_download, it is saved to the following path by default (the function also returns this local path)

"C:\Users\<user name>\.cache\huggingface\hub\models--TheBloke--Llama-2-7B-Chat-GGML\snapshots\76cd63c351ae389e1d4b91cab2cf470aab11864b\llama-2-7b-chat.ggmlv3.q4_0.bin"

This model is in the GGML format, while llama.cpp now uses the GGUF format. Therefore, to use the GGML model with llama.cpp, it must first be converted to GGUF, and llama.cpp includes a Python script that performs this conversion

In practice, you can use it as follows; the --eps 1e-5 value is used here because the GGML models posted by TheBloke target Llama 2

python convert-llama-ggml-to-gguf.py --eps 1e-5 --input <GGML model path> --output <GGUF model path>

With the above, there should be no major difficulties in using llama.cpp. Additionally, you can look at the Docs section at the bottom of the README for instructions and performance-improvement tips provided by llama.cpp

As a side note, and as a tip for performance troubleshooting, let me introduce a quicker way to get and use the main example, which is the most practically useful of the basic example programs provided by llama.cpp

If you don't need the full source code of llama.cpp but only the executables of the example programs, you can select and download a prebuilt package of the type you want from Releases

If you look at the list, only Windows builds are provided, but there are builds for different AVX levels as well as for cuBLAS, CLBlast, and OpenBLAS, so it is easy to use. Each zip file contains main.exe, server.exe, and the other final executables
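As a rough sketch, running the downloaded main example against a local GGUF model could look like the following; the model path and prompt are placeholders, and option names may differ slightly between versions

# run the main example: -m model path, -p prompt, -n number of tokens to generate
main.exe -m .\models\llama-2-7b-chat.Q4_0.gguf -p "Q: Name the planets in the solar system? A:" -n 128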

llama-cpp-python and LLamaSharp are versions of llama.cpp ported to Python and C#/.NET, respectively. Installation itself is very simple, as they are registered on PyPI and NuGet, respectively. Of course, installation may fail for various reasons depending on your local environment, and I will also cover the problems I ran into and their solutions here

First is llama-cpp-python

Installing llama-cpp-python itself is very simple. If you are in an environment where pip is available, you can use the following command

pip install llama-cpp-python

If you have already installed it but want to reinstall it with different options, you can use the following command

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

llama-cpp-python also supports builds optimized through BLAS; this is configured by passing CMake options when installing (see its README for details)
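As an illustration, with the version I used, a BLAS-accelerated build was requested by passing CMake options through environment variables at install time; the option name below follows the llama.cpp cuBLAS flag and may differ in newer releases

# reinstall with cuBLAS acceleration (Linux/macOS shell syntax; option name may vary by version)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir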

However, the problem is that installing llama-cpp-python on Windows may fail with an error. This problem has already been reported many times as issues

Another issue below discusses how to solve this problem

According to that issue, the cause is line 55 of the llama-cpp-python/llama_cpp/llama_cpp.py file, the part that loads the llama.cpp DLL, which hits a bug seen on Windows. In fact, it is easy to see that this line is the problem just by looking at the log during installation

 return ctypes.CDLL(str(_lib_path)) 

The author of that issue says this problem can be solved by modifying the code above as follows

ctypes.CDLL(str(_lib_path),winmode=0)

However, in my own experiments this alone did not solve the problem. If you look at the rest of the code around the problematic line, though, you can see that the function refers to the CUDA_PATH environment variable when it runs on the Windows platform

def _load_shared_library(lib_base_name: str):
    # Construct the paths to the possible shared library names
    _base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__)))
    # Searching for the library in the current directory under the name "libllama" (default name
    # for llamacpp) and "llama" (default name for this repo)
    _lib_paths: List[pathlib.Path] = []
    # Determine the file extension based on the platform
    if sys.platform.startswith("linux"):
        _lib_paths += [
            _base_path / f"lib{lib_base_name}.so",
        ]
    elif sys.platform == "darwin":
        _lib_paths += [
            _base_path / f"lib{lib_base_name}.so",
            _base_path / f"lib{lib_base_name}.dylib",
        ]
    elif sys.platform == "win32":
        _lib_paths += [
            _base_path / f"{lib_base_name}.dll",
        ]
    else:
        raise RuntimeError("Unsupported platform")

    if "LLAMA_CPP_LIB" in os.environ:
        lib_base_name = os.environ["LLAMA_CPP_LIB"]
        _lib = pathlib.Path(lib_base_name)
        _base_path = _lib.parent.resolve()
        _lib_paths = [_lib.resolve()]

    cdll_args = dict()  # type: ignore
    # Add the library directory to the DLL search path on Windows (if needed)
    if sys.platform == "win32" and sys.version_info >= (3, 8):
        os.add_dll_directory(str(_base_path))
        if "CUDA_PATH" in os.environ:
            os.add_dll_directory(os.path.join(os.environ["CUDA_PATH"], "bin"))
            os.add_dll_directory(os.path.join(os.environ["CUDA_PATH"], "lib"))
        cdll_args["winmode"] = ctypes.RTLD_GLOBAL

    # Try to load the shared library, handling potential errors
    for _lib_path in _lib_paths:
        if _lib_path.exists():
            try:
                return ctypes.CDLL(str(_lib_path), **cdll_args)
            except Exception as e:
                raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")

    raise FileNotFoundError(
        f"Shared library with base name '{lib_base_name}' not found"
    )

So, I installed the CUDA Toolkit so that CUDA_PATH would be properly added to the environment variables
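A quick way to check is to confirm that the environment variable actually exists after the CUDA Toolkit installation; this is Windows command prompt syntax, and the path shown is only an example of the default install location

:: print the CUDA_PATH environment variable set by the CUDA Toolkit installer
echo %CUDA_PATH%
:: e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2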

This way, llama-cpp-python was installed normally in my case. Alternatively, you can install it from one of the whl files listed in Releases

llama-cpp-python is written entirely in Python, providing both a high-level API that hides the complexity of llama.cpp and a low-level API that is a direct ctypes binding to the llama.cpp C API

# High-Level API
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
# Low-Level API
>>> import llama_cpp
>>> import ctypes
>>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
>>> ctx = llama_cpp.llama_new_context_with_model(model, params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
>>> llama_cpp.llama_free(ctx)

LLamaSharp is used by combining the LLamaSharp package itself with a Backend package. The available Backend packages are as follows

LLamaSharp.Backend.Cpu        # cpu for windows, linux and mac
LLamaSharp.Backend.Cuda11 # cuda11 for windows and linux
LLamaSharp.Backend.Cuda12 # cuda12 for windows and linux
LLamaSharp.Backend.MacMetal # metal for mac

To install LLamaSharp, you can use the NuGet Package Manager in Visual Studio as usual or run the following commands

dotnet add package LLamaSharp --version 0.6.0
dotnet add package LLamaSharp.Backend.Cpu --version 0.6.0
dotnet add package LLamaSharp.Backend.Cuda11 --version 0.6.0
dotnet add package LLamaSharp.Backend.Cuda12 --version 0.6.0
dotnet add package LLamaSharp.Backend.MacMetal --version 0.6.0

There is not much to worry about in terms of dependencies either

However, when I tested it, the CUDA backends (unlike LLamaSharp.Backend.Cpu) did not work properly in my environment. You also need to pay attention to how the various options adjustable in llama.cpp are handled; for example, the seed, which you would expect to be treated as a random number, is hardcoded internally, as the decompiled constructor below shows

[JsonConstructor]
public ModelParams(string modelPath)
{
    ContextSize = 512u;
    GpuLayerCount = 20;
    Seed = 1686349486u;
    UseFp16Memory = true;
    UseMemorymap = true;
    LoraAdapters = new AdapterCollection();
    LoraBase = string.Empty;
    BatchSize = 512u;
    RopeFrequencyBase = 10000f;
    RopeFrequencyScale = 1f;
    Encoding = System.Text.Encoding.UTF8;
    base._002Ector();
    ModelPath = modelPath;
}

LLamaSharp’s example code is as follows

using LLama.Common;
using LLama;

string modelPath = "<Your model path>"; // change it to your own model path
var prompt = "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.\r\n\r\nUser: Hello, Bob.\r\nBob: Hello. How may I help you today?\r\nUser: Please tell me the largest city in Europe.\r\nBob: Sure. The largest city in Europe is Moscow, the capital of Russia.\r\nUser:"; // use the "chat-with-bob" prompt here.

// Load a model
var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024,
    Seed = 1337,
    GpuLayerCount = 5
};
using var model = LLamaWeights.LoadFromFile(parameters);

// Initialize a chat session
using var context = model.CreateContext(parameters);
var ex = new InteractiveExecutor(context);
ChatSession session = new ChatSession(ex);

// show the prompt
Console.WriteLine();
Console.Write(prompt);

// run the inference in a loop to chat with LLM
while (prompt != "stop")
{
    foreach (var text in session.Chat(prompt, new InferenceParams() { Temperature = 0.6f, AntiPrompts = new List<string> { "User:" } }))
    {
        Console.Write(text);
    }
    prompt = Console.ReadLine();
}

// save the session
session.SaveSession("SavedSessionPath");
