Image generated using ChatGPT4

LLMs Made Accessible: A Beginner’s Unified Guide to Local Deployment via Python

Arshad Mehmood


The internet is overflowing with guides on how to run Large Language Models (LLMs) on your personal computer, yet navigating that sea of information can be daunting, especially for newcomers. Most guides focus on a single approach, which makes it hard for beginners to find a comprehensive pathway. This article brings several well-known methods under one roof, tailored specifically for beginners, and focuses on Python-based approaches, given Python’s simplicity and widespread adoption in the AI community. It aims to demystify the process of setting up LLMs on your own machine through clear, step-by-step instructions, an overview of the relevant Python libraries and tools, and practical tips. If the volume of technical detail and the range of available options has felt intimidating, this guide is meant to make the journey approachable. It is designed as a swift introduction to working with models in different frameworks, whether you are using a CPU or a GPU; it does not cover every aspect of loading LLMs in depth, so refer to the relevant documentation for more comprehensive insights.

This guide covers the following methods for running LLMs locally:

  • Hugging Face: a comprehensive library with a vast model repository and an intuitive interface.
  • Llama.cpp Python: a Python-friendly frontend for llama.cpp that lets you run LLMs with the performance of C++.
  • llama.cpp-based API drop-in replacement for GPT-3.5: running the model in a separate process, on the same machine or on a server, and performing inference through APIs akin to those used by GPT-3.5 clients.

The aforementioned methods have been successfully tested under WSL2 on a Windows 11 system powered by a 13th Gen Intel Raptor Lake processor, equipped with an NVIDIA RTX 3060 card with 12GB of memory and CUDA version 12.1.

Considerations for Running Large Language Models (LLMs)

Diving into the world of Large Language Models (LLMs) brings to light a range of considerations crucial for their efficient deployment and operation. From hardware prerequisites to model access protocols, each element plays a pivotal role in harnessing the full potential of LLMs.

For efficient operation of LLMs, a powerful GPU is crucial, particularly for the larger models that need quantization to fit in memory. Smaller models, such as GPT-2, can run on CPUs and are limited mainly by system RAM; for example, GPT-2’s ~500MB size is manageable on modern CPUs, whereas larger models like Llama 2 7B (~13GB), while technically runnable on a CPU, suffer from slow inference speeds even with adequate RAM.

The HuggingFace library facilitates automatic model downloads and provides access to a wide range of models. Please note that accessing some models like Meta’s Llama2 via Hugging Face requires an approval process, including an account setup, application review, and adherence to Meta’s usage policies.

GPU deployment is optimal for LLMs, subject to GPU memory capacity. Model weights are generally stored in float16 or float32 format; loading a full model is possible on GPUs with larger memory, such as the NVIDIA 4060 Ti with 16GB of VRAM. For GPUs with less memory, quantization reduces the memory requirement to a fraction of the original size, improving compatibility and inference speed at the cost of a minor loss in accuracy. Quantization is the process of reducing the precision of the numbers used to represent a model’s weights, which significantly decreases model size without substantially sacrificing performance; it is what makes enormous models practical on local machines and is generally recommended for GPU-based LLM operation. For GPU inference, this guide focuses on NVIDIA GPUs in conjunction with the CUDA framework.

Quantization reducing LLM size
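To make the savings concrete, here is a minimal back-of-the-envelope sketch; the parameter count and byte widths are rough approximations rather than measured values:

# Rough memory estimate for a 7B-parameter model at different precisions.
# Real footprints also include activations, the KV cache, and framework overhead.
params = 7_000_000_000

bytes_per_param = {
    "float32": 4.0,
    "float16": 2.0,
    "8-bit (load_in_8bit)": 1.0,
    "4-bit (load_in_4bit)": 0.5,
}

for precision, width in bytes_per_param.items():
    print(f"{precision:>22}: ~{params * width / 1024**3:.1f} GiB")

For a 7B-parameter model this works out to roughly 13 GiB in float16 and around 3 to 4 GiB in 4-bit form, which is why quantized models fit comfortably on consumer GPUs.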

Google Colab offers a convenient cloud-based solution for running Large Language Models (LLMs) with access to Nvidia GPUs, bypassing the need for personal hardware. It supports easy integration with major libraries, allowing for straightforward model execution. However, users of the free account face limitations, including restricted access duration and GPU availability, which may impact longer or more resource-intensive projects.

Now, let’s delve into the specifics of setting up each of these components individually.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Using HuggingFace

Hugging Face has revolutionized the artificial intelligence landscape, especially in natural language processing (NLP), with its open-source library, Transformers. This library offers a wide array of pre-trained models for tasks like text classification and language generation, simplifying the implementation of complex NLP tasks. Their work facilitates the rapid development of AI applications across numerous fields without the necessity of training models from the ground up.

Setup

Working configurations:

  • Python 3.8+ (Tested with 3.10)
  • Verified on Windows 11, Windows Subsystem for Linux (WSL) and Ubuntu 22.04.
  • Intel 12th Gen or higher
  • (Optional) GPU (e.g., NVIDIA RTX 3060 12GB) with CUDA 12.1

The following sections list installation instructions for Linux and Windows, covering both CPU-only and GPU setups. Install the packages that match your configuration.

Linux/WSL (CPU/GPU)

These steps are required for a CPU-only setup and also serve as prerequisites for configuring a GPU setup.

$ sudo apt install python3.10-venv python3.10-tk
$ pip install virtualenv
$ python3.10 -m venv venv
$ source ./venv/bin/activate
(venv)$ pip3 install tk numpy torch bertviz ipython "transformers>=4.36.1" huggingface_hub hf_transfer

+Linux/WSL (GPU)

Install additional packages for GPU support on Linux

(venv)$ pip3 install bitsandbytes accelerate autoawq optimum auto-gptq

Windows Installation (CPU/GPU)

These steps are essential for setting up on a CPU and also act as foundational requirements for configuring a GPU setup on Windows.

> pip install virtualenv 
> python3.10 -m venv venv
> venv\Scripts\activate
(venv)> pip3 install tk numpy torch bertviz ipython "transformers>=4.36.1" huggingface_hub hf_transfer

+Windows (GPU)

Install/Update additional packages for GPU on Windows.

(venv)> pip3 install accelerate autoawq optimum auto-gptq
(venv)> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
(venv)> python -m pip install bitsandbytes==0.39.1 --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui

For running Large Language Models (LLMs) on an NVIDIA GPU, several installations are essential:

  • PyTorch Setup with CUDA: To leverage NVIDIA GPU capabilities, install torch, torchvision, and torchaudio with CUDA support. While Linux users generally get CUDA support in the default PyTorch installation, Windows users must specify the index-url for the torch+cuda installation. As of the latest updates, PyTorch is compatible with the CUDA toolkit up to version 12.1. Visit the PyTorch website (PyTorch — Get Started Locally) for up-to-date information on CUDA compatibility and installation guidelines. A quick way to verify the CUDA-enabled build is shown in the snippet after this list.
  • bitsandbytes Installation Variations: The standard installation of bitsandbytes is optimized for Linux, accommodating multiple CUDA versions in one release build. Windows users, however, need to choose between building from source or using an unofficial GitHub repository for installation. Specific installation instructions for Windows can be found in the earlier section of this article. Further details and guidance are available at the bitsandbytes unofficial Windows repository (https://github.com/jllllll/bitsandbytes-windows-webui).
  • CUDA Toolkit Necessity: To facilitate GPU-based operations, install the CUDA toolkit from https://developer.nvidia.com/cuda-12-1-0-download-archive. It’s advisable to match the CUDA toolkit version with the torch+cuda version in use. Currently, the latest torch supports CUDA version 12.1.
  • Microsoft C++ Build Tools: Running GPU processes on Windows requires the installation of Microsoft C++ Build Tools. This is due to the psutil dependency of the accelerate and autoawq packages.
  • The autoawq package is needed only if AWQ-based quantized models are to be run on the GPU.
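To confirm that the CUDA-enabled PyTorch build is actually being picked up inside the virtual environment, a quick sanity-check sketch:

# Quick sanity check that the installed PyTorch build can see the GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version used by PyTorch:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))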

These steps are pivotal for effectively executing LLMs on NVIDIA GPUs, especially considering the specific requirements of Windows systems. See the Hugging Face documentation for further information on installation and model downloading.

Once you create an account on Huggingface, you will have the ability to generate a token. This token can be configured in your environment via export (e.g., HF_TOKEN) or used as a parameter with Huggingface APIs for downloading models and other portal operations. Think of this token as a pass for accessing services. Additionally, downloading certain models like Llama requires approval, which can be requested through the Huggingface portal using your account. Any approval granted is linked to your account and, consequently, to any token generated with that account.
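As an illustration, one common pattern is to export the token in your shell and pick it up from the environment in Python; this is a minimal sketch assuming you have set HF_TOKEN:

import os
from huggingface_hub import login

# Reads the token exported in your shell (e.g. export HF_TOKEN=hf_...)
# and authenticates this session with the Hugging Face Hub.
login(token=os.environ["HF_TOKEN"])

# Alternatively, recent transformers versions accept the token directly:
# AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', token=os.environ["HF_TOKEN"])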

Inference

The script uses inference_type to select the appropriate hardware (CPU or GPU) for model execution. If inference_type is set to ‘cpu’, the GPT-2 model is loaded, chosen for its smaller size (~500MB), making it ideal for CPU use, with device_map set to ‘cpu’ for targeted execution. Conversely, for GPU usage, the script opts for the larger Meta Llama 2 7B model (~13GB), with device_map set to ‘auto’ for flexible device allocation and load_in_4bit enabled to minimize memory usage, which makes it possible to run large models on GPUs with limited memory. The pipeline function creates a text-generation pipeline from the loaded model and tokenizer, setting parameters like max_new_tokens to control the length of the generated text. The streamer is also passed to the pipeline, indicating that streamed text generation will be used.

inference.py

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, pipeline, BitsAndBytesConfig

inference_type = 'cpu'
if inference_type == 'cpu':
    # CPU inference
    model_name = 'gpt2'  # ~500MB original size
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='cpu'
    )
else:
    # GPU inference; 'gpu'
    model_name = 'meta-llama/Llama-2-7b-hf'  # ~13GB original size
    quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='auto',
        quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(
    tokenizer,
    skip_prompt=False,
    skip_special_tokens=False
)
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    streamer=streamer,
)
prompt = 'I like apple pie'
text_pipeline(prompt, add_special_tokens=True)

In the example provided, we are utilizing a pretrained model which may also expect special tokens such as a beginning-of-sequence (BOS) token. To see whether such tokens are expected by your model, you can check the add_bos_token and add_eos_token settings in the tokenizer_config.json file. A practical way to handle this automatically is to set add_special_tokens=True (default False) in the inference call; this lets the framework decide whether to insert any special tokens into the prompt before it is sent to the model.
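For illustration, you can also inspect the tokenizer at runtime instead of opening tokenizer_config.json by hand; this small optional snippet reuses the tokenizer loaded in inference.py:

# Optional: inspect which special tokens the tokenizer defines and how the
# prompt is encoded with and without them.
print(tokenizer.special_tokens_map)
print(tokenizer('I like apple pie', add_special_tokens=False).input_ids)
print(tokenizer('I like apple pie', add_special_tokens=True).input_ids)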

Run script

(venv)$ python inference.py

Output:

<s> I like apple pie, as that’s what I like to make; but I like this little black tart apple that’s so filling at the same time as the vanilla, and that’s also the kind of tart that you’ll think that it makes your tongue go watery

The load_in_4bit=True option passed through BitsAndBytesConfig instructs the transformers library to reduce the model’s weights to 4 bits, roughly 25% of the original footprint (Llama 2 is stored in float16), enhancing inference speed. bnb_4bit_compute_dtype is set to torch.bfloat16 so that the actual computation is still performed on the GPU in bfloat16. An alternative is load_in_8bit=True, which is mutually exclusive with load_in_4bit. The 8-bit mode offers slightly better accuracy but only halves the model size and is slower at inference. If neither quantization option is selected, the transformers library loads the entire model into the GPU’s video RAM, subject to the available VRAM.
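If you prefer the 8-bit path, the change is confined to the quantization config; a minimal sketch, meant to be swapped into the GPU branch of inference.py above:

# 8-bit alternative to the 4-bit config shown earlier: roughly half the original
# float16 footprint instead of a quarter, with slightly better accuracy.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    device_map='auto',
    quantization_config=quantization_config)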

Loading Pre-Quantized model

Instead of quantizing the model on every load during GPU inference, we can load a pre-quantized model from disk. Two popular quantized formats are GPTQ and AWQ; Hugging Face supports both.

For AWQ models, simply replace the model-loading lines in the GPU inference path with the code below. Notice that the BitsAndBytesConfig parameter is omitted, since the model is already quantized.

model_name = 'TheBloke/zephyr-7B-alpha-AWQ'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto')

Similarly, for GPTQ models replace the model-loading lines with the following.

model_name = 'TheBloke/zephyr-7B-beta-GPTQ'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision='main',
    device_map='auto')

Notice the use of revision, which corresponds to a branch in the Hugging Face model repository. The quantization flavors of zephyr-7B-beta-GPTQ are stored on different branches; see the ‘Files and versions’ tab at https://huggingface.co/TheBloke/zephyr-7B-beta-GPTQ.
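For example, to load one of the other quantization flavors you would point revision at the corresponding branch. The branch name below is purely illustrative, so confirm the exact string in the repository’s ‘Files and versions’ tab before using it:

# Illustrative only: branch names differ per repository, so check the
# 'Files and versions' tab for the exact revision string.
model = AutoModelForCausalLM.from_pretrained(
    'TheBloke/zephyr-7B-beta-GPTQ',
    revision='gptq-4bit-32g-actorder_True',  # example branch name, verify before use
    device_map='auto')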

GGUF Quantization

GGUF is another popular quantization format, introduced by llama.cpp. ctransformers provides Python bindings for Transformer models implemented in C/C++ on top of the GGML/GGUF library, and it also integrates with LangChain for retrieval-augmented generation (RAG).

Install the ctransformers package according to your configuration.

(venv)$ pip install ctransformers

Either download a GGUF-format model using huggingface-cli, or let the from_pretrained() call auto-download it on the first use of that model. The following code sample demonstrates how to use ctransformers with models in the GGUF format. The from_pretrained method automatically retrieves the model from Hugging Face if it hasn’t been downloaded previously. Some models, such as the zephyr GGUF variants, are stored as multiple files in the same repository; here the repository name is ‘TheBloke/zephyr-7B-beta-GGUF’ and the model file is zephyr-7b-beta.Q4_K_M.gguf.

inference.py

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    'TheBloke/zephyr-7B-beta-GGUF',
    model_file='zephyr-7b-beta.Q4_K_M.gguf',
    model_type='llama',
    gpu_layers=32
)
prompt = '<s>I like apple pie'

# Do inference with streaming
stream = llm(prompt, stream=True)
for chunk in stream:
    print(chunk, end="", flush=True)

Adjust the gpu_layers setting based on the specific model you are using; in this case, we offload 32 layers to the GPU. Setting stream=True for inference enables the display of tokens in real time as they are produced by the model. See https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF for the different quantized versions of zephyr that are available.
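If you do not need token-by-token output, the call can also be made without streaming; in that case, as I understand the ctransformers API, the full completion is returned as a single string:

# Non-streaming variant: blocks until generation completes and returns the
# whole completion at once.
text = llm(prompt, max_new_tokens=100, stream=False)
print(text)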

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Llama.cpp Python

Llama.cpp is a versatile C/C++ library, gaining recognition in AI for its ability to run inferences on large language models like Meta’s LLaMA model. The library supports diverse backends (CUDA, Metal, OpenCL), enhancing its adaptability. The integration with Python through the llama-cpp-python package allows users to exploit C/C++’s performance while enjoying Python’s simplicity.

For installing llama-cpp-python, the preferred approach is to compile it from source. This is advised because llama.cpp, the underlying C/C++ library, uses compiler optimizations tailored to specific systems; opting for pre-built binaries might mean forgoing these optimizations or having to manage a multitude of binaries for various platforms. When you compile llama-cpp-python, it automatically builds the underlying llama.cpp code as a shared library, which the Python bindings then load, enabling Python scripts to access and use llama.cpp’s functionality. See the llama-cpp-python GitHub repository for compilation details.

The corresponding BLAS library needs to be installed on the system; check out the ‘BLAS Build’ section of the llama.cpp README.

Detailed build instruction for llama-cpp-python can be found at https://github.com/abetlen/llama-cpp-python

Brief instructions follow.

Install llama-cpp-python package for CPU or GPU.

# For CPU Build
(venv)$ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

# For Nvidia GPU Build
(venv)$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

llama-cpp-python requires a model in GGUF format, obtainable from Hugging Face. However, because the Llama API expects a local file path, we must manually download the model into a directory of our choosing and then point llama-cpp-python at that path. This time, diverging from our regular process, we’ll employ an instruction-tuned Llama 2 chat model, downloading it directly to a specified local folder rather than the usual cache location.

(venv)$ huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks True

Enabling local-dir-use-symlinks by setting it to True can be a time and space-efficient choice, as it attempts to create a symbolic link in the current directory that points to the model already downloaded in the cache directory. If you prefer to have a complete copy of the model in the current directory without utilizing symbolic links, it’s best to omit this option.

Given that the model has been downloaded to the current directory, the model_path should be specified with either an absolute or a relative path.

inference.py

from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf",
            verbose=True, n_gpu_layers=-1, chat_format="llama-2")
prompt = '[INST] Hi there, write me 3 random quotes [/INST]'
stream = llm(prompt, max_tokens=2048, echo=False, temperature=0, stream=True)
result = ""
for output in stream:
    result += output['choices'][0]['text']
    print(output['choices'][0]['text'], end="")

The code provided above directs the system to load all layers (n_gpu_layers=-1) of the Llama 2 7B model onto the GPU. Based on the available VRAM, you can modify the n_gpu_layers value to distribute the workload between the CPU and GPU, for example splitting it to manage large models effectively, as in the sketch below.
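For instance, on a GPU with limited VRAM you might offload only part of the model; a small sketch with an arbitrary split:

# Offload roughly half of the layers to the GPU and keep the rest on the CPU.
# The value 16 is arbitrary; tune it to the VRAM you have available.
llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf",
            verbose=True, n_gpu_layers=16, chat_format="llama-2")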

Importance of Prompt Syntax

Since the model used above is instruction-tuned for chat, the prompt format (e.g., the use of [INST]) is very important, because it significantly influences the model’s understanding and execution of tasks. Instruction-tuned models are specifically designed to follow detailed commands and generate responses based on the directives provided in the prompt. Correct syntax is essential for ensuring that the model precisely understands the instructions’ intent, resulting in outputs that are both accurate and pertinent. Moreover, different model architectures are fine-tuned to respond optimally to different prompt syntaxes; an example template is shown below.
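For reference, the Llama-2 chat models follow a prompt template along these lines, shown here with an optional system message; this is the commonly documented format, but double-check it against the model card you are using:

# Commonly documented Llama-2 chat format with an optional system prompt.
system = "You are a helpful assistant."
user = "Write me 3 random quotes."
prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"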

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

llama.cpp-based drop-in replacement for GPT-3.5.

This section explores how to utilize llama.cpp as a substitute for OpenAI’s GPT endpoints, enabling the operation of GPT-powered applications with local llama.cpp models rather than relying on OpenAI’s services. By running a local API server, you can mimic the functionality of OpenAI’s GPT API endpoints, yet process requests using local llama-based models. This means that applications designed for GPT-3.5 or GPT-4 can seamlessly transition to using llama.cpp. The end goal is to not only eliminate costs but also to ensure data remains private and secure within a local environment.

A locally running LLM accessible through the OpenAI API interface

First, install llama-cpp-python with server support and its dependencies. If the package was initially set up for CPU usage and you now wish to switch to GPU usage (or the other way around), re-run the installation command for the new target.

# For CPU Build
(venv)$ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir

# For Nvidia GPU Build
(venv)$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir

Install openai package

(venv)$ pip install openai

Download model to local folder using huggingface-cli tool.

(venv)$ huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks True

Launch the server, instructing it to structure the prompt in the Llama-2 chat format before feeding it to the model.

(venv)$ python3 -m llama_cpp.server --model ./llama-2-7b-chat.Q4_K_M.gguf --n_gpu_layers 35   --chat_format llama-2

The server uses port 8000 by default; in the client script below, which connects to the service, adjust the port number if you run the server on a custom port. Please note that the model parameter is assigned a dummy value (e.g., xxxxx), which the server ignores, and for the sake of simplicity no authentication is involved in this example. To check that the server is running properly, you can visit http://localhost:8000/docs#/ in a browser.
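Before running the full client, you can also confirm programmatically that the endpoint responds; a minimal sketch using the same openai package, assuming the server exposes the standard /v1/models route of the OpenAI API (which llama-cpp-python’s server does, to the best of my knowledge):

from openai import OpenAI

# The client requires an api_key (here a placeholder); the local server
# ignores its value since no authentication is configured.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="xxxxx")
print(client.models.list())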

client.py

from openai import OpenAI

# The api_key must be set (either here or via the OPENAI_API_KEY environment
# variable); the local server ignores its value, so a placeholder works.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="xxxxx")
stream = client.chat.completions.create(
    model="xxxxx",
    messages=[
        {"role": "system", "content": "You are a well-read scholar with a deep appreciation for literature, especially when it comes to the subject of artificial intelligence (AI)"},
        {"role": "user", "content": "Give me 3 quotes on AI."}
    ],
    temperature=0.01,
    stream=True,
)
for chunk in stream:
    if not chunk.choices or chunk.choices[0].delta.content is None:
        continue
    print(chunk.choices[0].delta.content, end="")
print("\n")

Run client script in another terminal.

(venv)$ python client.py

The received output stream will be shown in the terminal window.

Output (As received from model):

Certainly! As a well-read scholar in the field of artificial intelligence, I have come across many thought-provoking quotes that offer unique perspectives on this fascinating topic. Here are three of my favorites:

1. “The future of humanity is in your hands. Learn to code, learn to create, learn to innovate. The world needs you.” — This quote by Satya Nadella, CEO of Microsoft, highlights the importance of coding and creativity in shaping the future of AI. It emphasizes the need for individuals to develop their skills in these areas in order to drive innovation and progress in the field.

2. “The AI revolution is not just about building machines that can think like humans; it’s about building machines that can think with humans.” — This quote by Rodney Brooks, robotics pioneer and MIT professor, underscores the importance of developing AI systems that can collaborate and communicate effectively with humans. It highlights the need for AI to be designed with human-centered approaches that prioritize collaboration and mutual understanding.

3. “The goal of AI is to make machines that can learn from experience and improve their performance without being explicitly programmed.” — This quote by Geoffrey Hinton, pioneer in the field of neural networks, highlights the potential of AI to learn and improve through machine learning and other autonomous methods. It emphasizes the importance of developing AI systems that can adapt and evolve over time, leading to more advanced and sophisticated technologies.

These quotes offer just a few of the many thought-provoking perspectives on AI that have been shared by scholars and innovators in the field. As our understanding of AI continues to evolve, these quotes serve as important reminders of the potential and challenges of this rapidly advancing technology.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

References

  1. Hugging Face [https://huggingface.co/]
  2. Python Bindings for llama.cpp [https://github.com/abetlen/llama-cpp-python]
  3. Quantization [https://huggingface.co/docs/optimum/concept_guides/quantization]
  4. Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ) [https://towardsdatascience.com/which-quantization-method-is-right-for-you-gptq-vs-gguf-vs-awq-c4cd9d77d5be]
  5. Nvidia CUDA [https://developer.nvidia.com/cuda-12-1-0-download-archive]

You can connect with me on LinkedIn
