Unlocking LLM: Running LLaMa-2 70B on a GPU with Langchain

Sasika Roledene
7 min read · Aug 5, 2023


Recently, Meta released its sophisticated large language model, LLaMa 2, in three sizes: 7 billion, 13 billion, and 70 billion parameters. Detailed information and model download links are available here.

This blog post explores the deployment of the LLaMa 2 70B model on a GPU to create a Question-Answering (QA) system. We will guide you through the setup using Langchain, illustrating two different configurations.

First, we’ll outline how to set up the system on a personal Windows machine with an NVIDIA GeForce 1080i (4 GiB). This part focuses on loading the LLaMa 2 7B model.

Then, we’ll switch gears to an AWS g4dn.metal instance to demonstrate the deployment of the more resource-intensive LLaMa 2 70B model. This powerful setup offers 8 GPUs, 96 vCPUs, 384 GiB of RAM, and 128 GiB of total GPU memory, all running on an Ubuntu machine pre-configured for CUDA.

For more details on this AWS instance, please refer here. Now, let’s embark on this exciting journey of setting up a high-performance QA system.

Setting Up on Windows

Before you start, it’s important to set up the right environment for your system. Here’s the process:

Step 1: Create a Virtual Environment

Start by creating a virtual environment to avoid potential dependency conflicts. For this demonstration, we’ll use conda, a popular package manager. Make sure Python 3.10.9 is installed.

Create your virtual environment using:

conda create -n gpu python=3.10.9 -y
conda activate gpu

Step 2: Install the Required PyTorch Libraries

Install the necessary PyTorch libraries using the command below:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
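
Before moving on, it’s worth checking that this CUDA build of PyTorch can actually see your GPU. Here is a minimal sanity check (an optional step, not part of the original instructions):

import torch

# Should print True if the CUDA build of PyTorch detects the GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))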

Step 3: Configure the Python Wrapper of llama.cpp

We’ll use llama-cpp-python, the Python wrapper of llama.cpp. To compile it with CUDA (cuBLAS) GPU support, set the following environment variables before installing:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1

Now, you can install the llama-cpp-python library. If it was installed previously, force a reinstall and pass the --no-cache-dir argument to prevent pip from using a cached build.

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
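
As a quick sanity check (again optional, not in the original instructions), confirm that the wrapper imports cleanly and report the installed version:

import llama_cpp
from importlib.metadata import version

# If the import succeeds, the compiled backend was built and installed correctly.
print(version("llama-cpp-python"))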

Then, install langchain:

pip install langchain

Once the environment is set up, we can load the LLaMa 2 7B model onto the GPU and carry out a test run. Use cuda.current_device() to check which CUDA device is available for execution. Note the n_gpu_layers parameter passed to the model; it sets the number of model layers offloaded to the GPU. Below is a Python code snippet illustrating this:

from langchain.llms import LlamaCpp
from torch import cuda

# Confirm which CUDA device is currently selected.
print(cuda.current_device())

# Path to the quantized GGML model file.
model_path = r'llama-2-7b-chat-codeCherryPop.ggmlv3.q2_K.bin'

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=4,   # number of model layers to offload to the GPU
    n_ctx=512,        # context window size
    temperature=0
)

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"])

print(output)
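
Since the goal is a QA system, the same llm object can also be dropped into a simple LangChain chain. The sketch below is illustrative (the prompt template is my own, not part of the original setup):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# A minimal QA-style prompt; adjust the template to your use case.
template = """Question: {question}

Answer: """
prompt = PromptTemplate(template=template, input_variables=["question"])

# Reuse the llm object created above.
llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.run("Name the planets in the solar system."))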

Setting up on Linux

Now, let’s move on to loading the more substantial 70B model on 8 GPUs using an AWS g4dn.metal instance running on an Ubuntu machine. Establish and activate a Conda environment as before:

conda create -n gpu python=3.10.9 -y
conda activate gpu

To build llama-cpp-python with GPU support, install it using the command below. Note that CMAKE_ARGS="-DLLAMA_CUBLAS=on" and FORCE_CMAKE=1 must be set so the package is compiled against cuBLAS. As before, include the --no-cache-dir argument to avoid installing from the cache if you have previously installed this package; this ensures a clean, fresh installation.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Then, install langchain:

pip install langchain

To load the LLaMa 2 70B model, modify the preceding code to include one new parameter, n_gqa=8. The 70B variant uses grouped-query attention, and the GGML loader needs this hint to recognize the model correctly:

from langchain.llms import LlamaCpp

# Path to the quantized GGML file for the 70B chat model.
model_path = r'llama-2-70b-chat.ggmlv3.q3_K_L.bin'

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=84,  # offload as many layers as possible to the GPUs
    n_ctx=512,
    temperature=0,
    n_gqa=8           # required for the 70B (grouped-query attention) model
)

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"])

print(output)

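With a model this large, a full answer can take a while, so it can be useful to stream tokens as they are generated. Here is a sketch using LangChain's streaming stdout callback handler with the same parameters; treat it as an optional variation rather than part of the original setup:

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Print tokens to stdout as they are generated.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path=r'llama-2-70b-chat.ggmlv3.q3_K_L.bin',
    n_gpu_layers=84,
    n_ctx=512,
    temperature=0,
    n_gqa=8,
    callback_manager=callback_manager,
    verbose=True
)

llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"])
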
The log output below shows the weights being loaded onto the GPU.

llama.cpp: loading model from llama-2-70b-chat.ggmlv3.q3_K_L.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 13 (mostly Q3_K - Large)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla T4) as main device
llama_model_load_internal: mem required = 27009.75 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 1600 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/83 layers to GPU
llama_model_load_internal: total VRAM used: 10140 MB
llama_new_context_with_model: kv self size = 1280.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

The command below allows you to monitor GPU usage:

watch -d nvidia-smi

Check the output that follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:18:00.0 Off | 0 |
| N/A 40C P0 52W / 70W | 12761MiB / 15360MiB | 49% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:19:00.0 Off | 0 |
| N/A 39C P0 54W / 70W | 5771MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:35:00.0 Off | 0 |
| N/A 38C P0 47W / 70W | 5771MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:36:00.0 Off | 0 |
| N/A 38C P0 49W / 70W | 5771MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla T4 On | 00000000:E7:00.0 Off | 0 |
| N/A 39C P0 45W / 70W | 5769MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla T4 On | 00000000:E8:00.0 Off | 0 |
| N/A 38C P0 50W / 70W | 5769MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla T4 On | 00000000:F4:00.0 Off | 0 |
| N/A 39C P0 48W / 70W | 5769MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla T4 On | 00000000:F5:00.0 Off | 0 |
| N/A 38C P0 50W / 70W | 5769MiB / 15360MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2849 C /opt/tensorflow/bin/python 12738MiB |
| 1 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 2 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 3 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 4 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 5 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 6 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 7 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
+-----------------------------------------------------------------------------+
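
If you would rather poll GPU usage from the same Python process instead of a second terminal, a small sketch using nvidia-smi's standard query flags:

import subprocess

# Query per-GPU memory and utilization in CSV form.
result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total,utilization.gpu",
        "--format=csv,noheader",
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)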

Conclusion

In this guide, we’ve walked through running the LLaMa 2 7B model on a local Windows machine with GPU support. By compiling the llama-cpp-python wrapper with GPU acceleration enabled, we offloaded model layers to the GPU and significantly sped up inference, a good illustration of how much pairing these models with the right hardware can deliver.

Further, we covered the steps needed to compile llama-cpp-python on a Linux system, again targeting GPU support, and detailed what it takes to run the model smoothly on that platform, highlighting its cross-platform adaptability.

Finally, we loaded the formidable LLaMa 2 70B model onto the GPUs, putting it through a test run to confirm the setup. This showcased the model’s capability and confirmed that the procedure described here is a workable approach to deploying this large model.

In summary, this journey illuminates the future of machine learning and AI, stressing the importance of synergizing hardware advancements with evolving algorithms. Remember, the power of these models is in their application, and we encourage you to explore the limitless possibilities these techniques offer in reshaping our technological landscape.

