Inference LLAMA-2 🦙 7B Q4 With LlamaCPP, Without a GPU

Harshitajakiya
6 min read · Nov 11, 2023


The major difference between Llama and Llama-2 is the amount of data the model was trained on: Llama-2 was trained on 40% more data than its predecessor and has a longer context length. Another advancement in Llama-2 is that it used Reinforcement Learning from Human Feedback (RLHF) during its training.

Why LLAMA-2:

Unlike other LLMs such as OpenAI's GPT-3, Google's PaLM 2, and Anthropic's Claude, Llama-2 is an open-source Large Language Model. The other LLMs I have mentioned are currently not open-sourced and can only be used through API calls.

Llama-2 is released in three sizes based on the number of parameters:

  1. Llama-2-7B
  2. Llama-2-13B
  3. Llama-2-70B

The 7, 13, and 70B represent the number of model parameters in billions (I know, right! Huge😲). When it comes to training, Llama-2 was trained on two trillion tokens, elements of raw text such as "Awe" and "some" in the word "Awesome." This represents a significant leap from Llama's training, which was based on 1.4 trillion tokens.

Getting Access to Llama-2:

There are two ways to use the model:

  1. We can use Llama-2 Chat, a fine-tuned version built specifically for chatbot-style applications.
  2. The second, most efficient, and coolest way is to download the model and run inference locally on your own system. First we download the model, then we write the inference code.

I will be using Hugging Face to access and download the model.

Hugging Face is an AI community that promotes open-source contributions. It is a hub of open-source models for natural language processing, computer vision, and other fields where AI plays a role. Even tech giants like Google, Facebook, AWS, and Microsoft use its models, datasets, and libraries.

We will be using the smallest version of Llama, known as Llama 7B. Even at this reduced size, Llama 7B offers significant language-processing capability, allowing us to achieve our desired outcomes efficiently and effectively. To run the LLM on a local CPU, we need the model in GGML format. There are several ways to get one, but the most straightforward approach is to download the .bin file directly from the Hugging Face Models repository 🤗.

If you're looking to save time and effort, don't worry, I've got you covered. Here's the direct link for you to download the model: go to the "Files and versions" section, select llama-7b.ggmlv3.q4_0.bin, and download the .bin file.
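If you would rather script the download than click through the site, here is a minimal sketch using the huggingface_hub library (you may need to pip install huggingface_hub first). The repo_id below is an assumption for illustration; point it at whichever Hugging Face repository actually hosts the GGML .bin file you chose.

from huggingface_hub import hf_hub_download

# repo_id is an example/assumption -- replace it with the repository hosting your GGML file
model_path = hf_hub_download(
    repo_id="TheBloke/LLaMa-7B-GGML",         # assumed repository with GGML builds
    filename="llama-7b.ggmlv3.q4_0.bin",      # the 4-bit quantized model used in this article
    local_dir=".",                            # save it in the current working directory
)
print("Model saved at:", model_path)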

Understanding Quantization:

The name of our model is llama-7b.ggmlv3.q4_0.bin. Let us break the name into parts:

llama: Name of our Large Language Model🤖.

7b: Number of model parameters.

ggmlv3: GGML is the format of the model file we are using. There are many ways to store an LLM and its parameters on a local machine; GGML is a tensor library for machine learning, a C/C++ library that lets you run LLMs on just the CPU or on CPU + GPU, and it defines a binary format for distributing large language models (LLMs).

q4: Q stands for quantization. As we all know, everything in our computers is represented in binary with 0s and 1s. Most machine learning models store their parameters as integers or floating-point numbers, and since those are also represented in binary, we have to decide how many bits to use to represent each number in our system.

Just as it takes more space to represent a large integer (like 10000) than a small one (like 1), it also takes more space to represent a floating-point number with high precision (like 3.187954545) than one with low precision (like 3.18).

Quantizing a large language model means reducing the precision with which its weights are stored, which reduces the resources required to run the model. In the Q4 format, each weight is represented with four bits. (GGML supports several quantized formats, such as 4-, 5-, and 8-bit.)

As you may have guessed, the more bits we keep per weight, the higher the precision of the stored weights and the more accurate the LLM's output. But with more bits per weight (and more model parameters), the size of the model also grows. That is why we are using the 4-bit 7B version of Llama, which runs on basic systems without a GPU.
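To make the trade-off concrete, here is a rough back-of-the-envelope size calculation. It is only a sketch: real GGML files are somewhat larger because the quantization blocks also store scaling factors and the file carries metadata.

# Approximate memory footprint of a 7B-parameter model at different bit widths.
# Real quantized files are a bit larger (scaling factors, metadata); these are ballpark figures.
params = 7_000_000_000

for label, bits in [("fp16", 16), ("q8", 8), ("q5", 5), ("q4", 4)]:
    size_gb = params * bits / 8 / 1024**3
    print(f"{label:>4}: ~{size_gb:.1f} GB")

# q4 works out to roughly 3.3 GB, which is why the 4-bit model fits comfortably
# in the RAM of an ordinary laptop with no GPU at all.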

Binding Python and C++ with LlamaCPP

LlamaCPP is built specifically to run inference for Llama models and their variants. It is developed in low-level languages like C/C++, and with the power of these languages we can use llama.cpp to run quantized versions of Llama on a local machine. Its foremost advantage is that it is optimized for both CPU and GPU.

We will use the Python library "llama-cpp-python" to bind our Python code to llama.cpp. First, let us make a virtual environment so that the project's dependencies stay isolated.

python -m virtualenv venv
cd venv
Scripts\activate

After activating the virtual environment (the commands above are for Windows; on Linux/macOS, run source bin/activate from inside the venv folder), we will install llama-cpp-python pinned to a specific version.

pip install llama-cpp-python==0.1.78

The reason I am pinning this specific version is that llama.cpp is currently reworking its codebase: the new GGUF file format is replacing GGML, but this older release of llama-cpp-python still supports GGML-format models.

Good to Go!

Now we just have to put the model we downloaded earlier from Hugging Face into our current working directory. Make a new Python file and paste in the code below:


from llama_cpp import Llama

# Load the 4-bit quantized GGML model from the current working directory
llm = Llama(model_path="llama-7b.ggmlv3.q4_0.bin")

# Run a completion on the prompt and print only the generated text
response = llm("Share some cool facts about The Office TV Series.")
print(response['choices'][0]['text'])

The variable response is a dictionary (or JSON in some cases, depending on the LLM) returned by the model. ['choices'] accesses the value associated with the key 'choices' in the response dictionary, [0] accesses the first element of the 'choices' list (Python list indices start at 0, so [0] gets the first item), and ['text'] accesses the value associated with the key 'text' in that first element.
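If you want more control over the generation, the same call accepts the usual sampling parameters from the llama-cpp-python API. The values below are only example settings, not recommendations:

# Assumes `llm` was created as above; the parameter values are illustrative.
response = llm(
    "Share some cool facts about The Office TV Series.",
    max_tokens=256,     # maximum length of the completion
    temperature=0.7,    # higher = more creative, lower = more deterministic
    top_p=0.9,          # nucleus sampling threshold
    echo=False,         # do not repeat the prompt in the output
)

# The result follows an OpenAI-style completion layout:
# {'id': ..., 'choices': [{'text': ..., 'finish_reason': ...}], 'usage': {...}}
print(response["choices"][0]["text"])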

Depending on your CPU, shortly after executing the above code you will see output like this:

The output shows different hyperparameters such as n_vocab (the vocabulary size of the model), n_ctx (the context length, i.e., the maximum sequence length the model can handle), n_embd (the dimensionality of the model's embeddings), and so on. Once these parameters have been printed and the model is loaded, the generated answer to our question follows.
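Some of these hyperparameters can also be set when the model is loaded. As a small sketch (the argument names come from the llama-cpp-python constructor; the values are examples to tune for your own machine):

# n_ctx and n_threads are llama-cpp-python constructor options; the values here
# are only examples -- adjust them to your prompt lengths and CPU core count.
llm = Llama(
    model_path="llama-7b.ggmlv3.q4_0.bin",
    n_ctx=2048,       # maximum context length the model will handle
    n_threads=4,      # number of CPU threads used for inference
    verbose=True,     # print the model hyperparameters while loading
)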

That's it! 😎 Now off you go: try your own questions and see how it performs. And if you have a GPU in your machine, try running it with cuBLAS (a library for high-speed matrix operations on NVIDIA GPUs).
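For reference, here is a rough sketch of what that would look like: reinstall llama-cpp-python with cuBLAS compiled in, then offload some of the model's layers to the GPU. The CMAKE_ARGS flag and the n_gpu_layers option follow the llama-cpp-python documentation of that era, but the exact command can differ by platform, so treat this as a starting point rather than a recipe.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.1.78

Then, in Python, pass n_gpu_layers to push part of the model onto the GPU:

# n_gpu_layers controls how many transformer layers are offloaded to the GPU
llm = Llama(model_path="llama-7b.ggmlv3.q4_0.bin", n_gpu_layers=32)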
