Unlocking LLM: Running LLaMa-2 70B on a GPU with Langchain
Recently, Meta released its sophisticated large language model, LLaMa 2, in three variants: 7 billion parameters, 13 billion parameters, and 70 billion parameters. Detailed information and model download links are available here.
This blog post explores the deployment of the LLaMa 2 70B model on a GPU to create a Question-Answering (QA) system. We will guide you through the architecture setup using Langchain, illustrating two different configuration methods.
First, we’ll outline how to set up the system on a personal machine with an NVIDIA GeForce 1080i (4 GiB of VRAM), running Windows. This part focuses on loading the smaller LLaMa 2 7B model.
Then, we’ll switch gears to AWS infrastructure, specifically a g4dn.metal instance, to demonstrate deployment of the more resource-intensive LLaMa 2 70B model. This powerful setup offers 8 GPUs, 96 vCPUs, 384 GiB of RAM, and a considerable 128 GiB of total GPU memory, all running on an Ubuntu machine pre-configured for CUDA.
For more details on this AWS instance, please refer here. Now, let’s embark on this exciting journey of setting up a high-performance QA system.
Setting Up on Windows
Before you start, it’s important to set up the right environment for your system. Here’s the process:
Step 1: Create a Virtual Environment
Start by creating a virtual environment to avoid potential dependency conflicts. For this demonstration, we’ll use conda, a popular package manager. Make sure Python 3.10.9 is installed.
Create your virtual environment using:
conda create -n gpu python=3.10.9 -y
conda activate gpu
Step 2: Install the Required PyTorch Libraries
Install the necessary PyTorch libraries using the command below:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
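Before going further, it is worth confirming that this PyTorch build can actually see the GPU. The short check below is an optional addition, using only standard torch.cuda calls:
import torch

# Should print True and the GPU name if the CUDA build of PyTorch is working
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))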
Step 3: Configure the Python Wrapper of llama.cpp
We’ll use the Python wrapper of llama.cpp, llama-cpp-python. To enable GPU (cuBLAS) support, set the following environment variables before compiling (this requires the CUDA Toolkit to be installed so that CMake can find the CUDA compiler):
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
Now, you can install the llama-cpp-python library. If it was previously installed, the --force-reinstall and --no-cache-dir arguments below prevent pip from reusing a cached build.
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Then install langchain:
pip install langchain
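As an optional sanity check (not part of the original walkthrough), confirm that both packages import cleanly before loading any model:
import llama_cpp
import langchain

# An ImportError here means the corresponding install step above failed
print("llama-cpp-python and langchain imported successfully")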
Once the environment is set up, we can load the LLaMa 2 7B model onto the GPU and carry out a test run. Use cuda.current_device() to check which CUDA device is ready for execution. Note that the n_gpu_layers parameter passed to the model specifies how many of the model’s layers should be offloaded to the GPU. Below is a Python code snippet illustrating this:
from langchain.llms import LlamaCpp
from torch import cuda

# Confirm which CUDA device is ready for execution
print(cuda.current_device())

model_path = r'llama-2-7b-chat-codeCherryPop.ggmlv3.q2_K.bin'

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=4,  # number of model layers to offload to the GPU
    n_ctx=512,       # context window size
    temperature=0
)

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"])
print(output)
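Since the goal is a QA system, it helps to see the model wired into a chain. Below is a minimal sketch using LangChain’s PromptTemplate and LLMChain; the prompt text and question are illustrative additions, not from the original walkthrough, and the chain reuses the llm object defined above.
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """Answer the question concisely.
Question: {question}
Answer: """

prompt = PromptTemplate(template=template, input_variables=["question"])
qa_chain = LLMChain(llm=llm, prompt=prompt)  # reuses the llm object created above

print(qa_chain.run("Which planet in the solar system is the largest?"))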
Setting Up on Linux
Now, let’s move on to loading the more substantial 70B model on 8 GPUs using an AWS g4dn.metal instance running on an Ubuntu machine. Establish and activate a Conda environment as before:
conda create -n gpu python=3.10.9 -y
conda activate gpu
To ensure GPU compatibility, install llama-cpp-python using the command below. Note that it’s important to set CMAKE_ARGS="-DLLAMA_CUBLAS=on" and FORCE_CMAKE=1. Additionally, include the --no-cache-dir argument to avoid installing from the cache, particularly if you have previously installed this package; this ensures a clean, fresh installation.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Then, install langchain:
pip install langchain
To load the LLaMa 2 70B model, modify the preceding code to include a new parameter, n_gqa=8 (the grouped-query attention factor the 70B model requires):
from langchain.llms import LlamaCpp

model_path = r'llama-2-70b-chat.ggmlv3.q3_K_L.bin'

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=84,  # number of model layers to offload across the GPUs
    n_ctx=512,
    temperature=0,
    n_gqa=8           # grouped-query attention factor required for the 70B model
)

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"])
print(output)
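To get a rough sense of whether GPU offloading is paying off, you can time a generation. This is a small illustrative addition that reuses the llm object created above:
import time

start = time.perf_counter()
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"])
elapsed = time.perf_counter() - start

# Compare this wall-clock time with different n_gpu_layers values to gauge the speed-up
print(f"Generated in {elapsed:.1f} s: {output}")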
The following output demonstrates that the weights are being loaded onto the GPU.
llama.cpp: loading model from llama-2-70b-chat.ggmlv3.q3_K_L.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 13 (mostly Q3_K - Large)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla T4) as main device
llama_model_load_internal: mem required = 27009.75 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 1600 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/83 layers to GPU
llama_model_load_internal: total VRAM used: 10140 MB
llama_new_context_with_model: kv self size = 1280.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
The command below allows you to monitor GPU usage:
watch -d nvidia-smi
Check the output that follows.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:18:00.0 Off | 0 |
| N/A 40C P0 52W / 70W | 12761MiB / 15360MiB | 49% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:19:00.0 Off | 0 |
| N/A 39C P0 54W / 70W | 5771MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:35:00.0 Off | 0 |
| N/A 38C P0 47W / 70W | 5771MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:36:00.0 Off | 0 |
| N/A 38C P0 49W / 70W | 5771MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla T4 On | 00000000:E7:00.0 Off | 0 |
| N/A 39C P0 45W / 70W | 5769MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla T4 On | 00000000:E8:00.0 Off | 0 |
| N/A 38C P0 50W / 70W | 5769MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla T4 On | 00000000:F4:00.0 Off | 0 |
| N/A 39C P0 48W / 70W | 5769MiB / 15360MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla T4 On | 00000000:F5:00.0 Off | 0 |
| N/A 38C P0 50W / 70W | 5769MiB / 15360MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2849 C /opt/tensorflow/bin/python 12738MiB |
| 1 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 2 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 3 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 4 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 5 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 6 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
| 7 N/A N/A 2849 C /opt/tensorflow/bin/python 5748MiB |
+-----------------------------------------------------------------------------+
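If you prefer to monitor GPU memory from inside Python rather than in a separate terminal, the small sketch below is an optional addition that parses nvidia-smi’s CSV query output (it assumes nvidia-smi is on the PATH):
import subprocess

def gpu_memory_report():
    """Return one line per GPU: index, memory used, total memory, utilization."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()

for line in gpu_memory_report():
    print(line)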
Conclusion
In this comprehensive guide, we’ve navigated the process of running the LLaMa 2 7B model on a local Windows machine with GPU support. By compiling the llama-cpp-python wrapper with GPU support enabled, we significantly accelerated our operations and boosted computational efficiency. Setting up this framework seamlessly merges machine learning algorithms with hardware capabilities, demonstrating the potential of this integration.
Further, we delved into the steps needed to compile llama-cpp-python on a Linux system, again targeting GPU support. We walked through the setup intricacies and detailed the steps required for smooth execution on the Linux platform, highlighting the approach’s cross-platform adaptability.
Finally, we loaded the formidable LLaMa 2 70B model onto the instance’s GPUs and ran a test query to confirm its successful deployment. This showcased the model’s capability and the efficacy of the setup procedure, confirming that this guide offers a robust approach to working with this potent language model.
In summary, this journey illuminates the future of machine learning and AI, stressing the importance of synergizing hardware advancements with evolving algorithms. Remember, the power of these models is in their application, and we encourage you to explore the limitless possibilities these techniques offer in reshaping our technological landscape.