Running LLMs in their native format on consumer hardware just got a whole lot easier

Abheek Gulati
High Tech Accessible

While the pace of advancement in the LLM space continues to impress and overwhelm, deploying models as-a-service, offline and on-device remains a key challenge, especially on commonly available hardware: the workstations, desktops, laptops and even non-specialized servers most of us have access to.

This challenge is down to two major factors:

1. LLMs require a lot of memory and must therefore be compressed, via a process called quantization, to be usable even on very powerful high-end computers (see the back-of-the-envelope sketch after this list), and

2. Almost all applications aimed at running LLMs on common hardware rely on a server application called ‘llama.cpp’, which requires LLMs to be packaged in a specialized, quantized format called ‘GGUF’. Converting cutting-edge LLMs to GGUF can be error-prone, often necessitating waiting on updates to llama.cpp followed by lengthy recompilations and testing, all of which can be challenging in production environments.
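To put the memory problem in perspective, here’s a minimal back-of-the-envelope sketch. The parameter counts and data types are illustrative assumptions, not measurements of any specific model:

```python
# Rough memory needed just to hold the model weights, ignoring activations and KV-cache.
# Parameter counts and data types are illustrative, not measurements of specific models.
BYTES_PER_PARAM = {"fp16/bf16": 2, "int8": 1, "int4": 0.5}

for params_billions in (8, 70):
    for dtype, bytes_per_param in BYTES_PER_PARAM.items():
        gigabytes = params_billions * 1e9 * bytes_per_param / 1024**3
        print(f"{params_billions}B parameters @ {dtype}: ~{gigabytes:.0f} GB of weights")
```

Even an 8B-parameter model needs roughly 15 GB just for its weights at 16-bit precision, which is exactly why quantization is unavoidable on most consumer machines.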

LLMs are typically open-sourced on HuggingFace.co, in the ‘HuggingFace-Transformers’ format. From here, they undergo quite the journey to the land of GGUFs (a rough scripted sketch of the pipeline follows the list):

  1. Download the entire HF-Transformers model from the HuggingFace-Hub, typically via cloning
  2. Clone the llama.cpp repository
  3. Using the ‘convert_hf_to_gguf.py’ script bundled with llama.cpp, convert the HF-Transformers LLM into a full-precision GGUF file. This only works if llama.cpp supports the model’s tokenizer! Otherwise, hope the model is popular and wait a while for an update to llama.cpp. Also, make sure to specify the correct output data type when running the script (‘BF16’ for BFloat16 models etc.)
  4. Compile llama.cpp
  5. Run the generated llama-quantize utility to quantize the full-precision GGUF down to a smaller quant type
  6. Run the GGUF with the llama-cli or llama-server utility, and hope nothing went wrong or broke the model as it went through the above journey.
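For the curious, here is a rough sketch of that journey scripted end-to-end in Python. Every path, model name and quant type below is a placeholder, and the exact script names and flags can vary between llama.cpp releases, so treat this as an outline rather than a recipe:

```python
import subprocess

# All paths and names below are placeholders; flags can differ between llama.cpp releases.
LLAMA_CPP = "path/to/llama.cpp"       # cloned & compiled llama.cpp checkout (steps 2 & 4)
HF_MODEL_DIR = "path/to/hf-model"     # HF-Transformers model cloned from the Hub (step 1)
FP_GGUF = "model-bf16.gguf"           # full-precision intermediate GGUF (step 3)
QUANT_GGUF = "model-Q4_K_M.gguf"      # final quantized GGUF (step 5)

# Step 3: convert the HF-Transformers model to a full-precision GGUF
subprocess.run(["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", HF_MODEL_DIR,
                "--outtype", "bf16", "--outfile", FP_GGUF], check=True)

# Step 5: quantize the intermediate GGUF down to a smaller quant type
subprocess.run([f"{LLAMA_CPP}/llama-quantize", FP_GGUF, QUANT_GGUF, "Q4_K_M"], check=True)

# Step 6: serve the quantized model
subprocess.run([f"{LLAMA_CPP}/llama-server", "-m", QUANT_GGUF], check=True)
```

Any failure or format mismatch along the way means starting over, often after waiting for an upstream fix.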

Alternatively, you may simply download pre-made GGUFs uploaded to the hub by others, but you’ll have to test the model to make sure it works as intended and is unaltered. GGUF creators also may not always specify which version of llama.cpp they used to generate the GGUF, which can be problematic: updates to llama.cpp often necessitate, or strongly encourage, repeating the above process to generate a fresh GGUF, for instance when the LLM is based on a new and previously unsupported tokenizer or attention mechanism.

This challenge is compounded by llama.cpp’s rapid release cycle: often several new versions each day, bringing to mind the old phrase “too much of a good thing can be bad”!

The need to run ever-newer LLMs and the challenge of doing so with llama.cpp

This is important to address given llama.cpp’s popularity and dominance in the local-inferencing space.

Several high-quality, open-source large language models have been released in just the last few weeks, including Meta’s Llama-3.1, Google’s Gemma2 and Microsoft’s Phi3 families of models. Mistral too has enthralled the space with Nemo, Codestral and Mamba-Codestral. Make no mistake, each of these furthers the field significantly:

1. Llama-3.1 set new records for open-source LLMs in the domains of reasoning, accuracy and problem-solving, advancing on the Llama-3 family with a massive 128k context-length and the ability to use tools to better respond to user queries

· Deployment challenges encountered with llama.cpp: took significant time and several updates after model release for proper support; the primary challenge seemingly stemmed from Llama-3.1’s use of RoPE scaling for extended context-length support. Tool and function calling are not supported.

2. Gemma2 introduced high-performance models at moderate sizes, enabling the deployment of very capable LLMs on relatively modest hardware

· Deployment challenges encountered with llama.cpp: took a while after model release for proper support to be confirmed; the primary challenge stemmed from the LLM’s use of a sliding-window attention mechanism

3. Phi3 demonstrated that even small models, when trained on a modest but carefully curated corpus of data, can indeed perform well in real-world applications, such as retrieval-augmented generation (RAG) use cases

· Deployment challenges encountered with llama.cpp: took a while after model release for proper support to be confirmed; the primary challenge stemmed from the LLM’s use of a different end-of-stream token in its prompt-template: <|endoftext|> instead of <|end|>

4. Mistral’s Nemo, with its 12 billion (12B) parameters, plugs a gap in the space for those with more-than-modest hardware who were previously forced to run sub-10B models because everything else was too large!

· Deployment challenges encountered with llama.cpp: took significant time and several updates after model release for proper support; the primary challenge stemmed from the LLM’s use of a new tokenizer named ‘Tekken’. CUDA support was buggy and unreliable, with GPU inferencing failing; it’s unclear whether this has been fixed yet.

5. Codestral demonstrates state-of-the-art coding performance in a range of benchmarks

· Deployment challenges encountered with llama.cpp: official support is unclear; an issue-thread (https://github.com/ggerganov/llama.cpp/issues/7622) requesting support was auto-closed as stale and the model is not listed in the supported-models list in llama.cpp’s README, though some claim it works fine.

6. Mamba is a big deal: while nearly all LLMs today are based on the transformer architecture, Mamba-Codestral is built on an entirely new internal architecture, the ‘Mamba’ architecture, a promising newcomer in the world of LLMs! Mamba potentially boasts several key advantages over LLMs built on the traditional transformer architecture, primarily (a toy scaling comparison follows this list):

· Much better handling of long inputs: the ability to effectively process very large inputs as a result of Mamba’s state space model architecture, designed to capture long-range dependencies more effectively than the fixed attention-windows of transformer models

· Faster inferencing: linear O(n) time-complexity with respect to sequence-length, compared to the quadratic O(n²) complexity of Transformers

· Smaller memory footprint: once again, Mamba boasts linear O(n) space-complexity, compared to the quadratic O(n²) complexity of the attention matrix in Transformers
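To make the scaling difference concrete, here is a toy comparison of how the two complexity classes grow with sequence length. The numbers are purely illustrative of the asymptotic behaviour and say nothing about the constant factors of either architecture:

```python
# Purely illustrative growth of the two complexity classes; not a benchmark of either architecture.
for n in (1_000, 10_000, 100_000):
    quadratic = n * n   # transformer attention: pairwise token interactions
    linear = n          # Mamba-style state space model: one state update per token
    print(f"sequence length {n:>7,}: O(n^2) ~ {quadratic:>15,}   O(n) ~ {linear:>7,}")
```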

Mamba, for obvious reasons, is not supported by the Transformers library and thus cannot yet be run with HF-Waitress! It cannot be run with llama.cpp either, and instead requires Mistral’s own `mistral-inference` Python library. However, it still serves as a good demonstration of why relying on a single backend is not an effective long-term strategy!

On that note, I will also look into adding support for `mistral-inference` in HF-Waitress if demand and need are strong enough.

The llama.cpp challenges described above often necessitated multiple recompiles of llama.cpp followed by re-quantization of the LLMs, both significantly time-consuming operations. Confusion over available GGUF quants also persists in some cases, as some GGUF creators neither specify the version of llama.cpp they used to generate the quants nor update their repositories.

As LLMs are typically open-sourced on HuggingFace.co in the ‘HuggingFace-Transformers’ format, I couldn’t help but think “there’s got to be a better way”!

Well there wasn’t — so I built and open-sourced it myself!

Presenting `HF-Waitress`

HF-Waitress is a server application that enables running HF-Transformers & AWQ-quantized models directly off the HuggingFace-Hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto for the former. It negates the need to manually download models yourself, simply working off the model’s name instead. It requires no setup and provides concurrency and streaming responses, all from within a single, easily portable, platform-agnostic Python script.

‘Platform-agnostic’ here means it works on Windows, Linux, MacOS and in Docker containers, but it’s also hardware-agnostic, enabling you to run native or quantized models on all manner of CPUs, GPUs and Apple M-silicon.
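For a sense of the underlying mechanism, this style of on-the-fly quantization builds on the quantization configs exposed by the Transformers library. The snippet below is a minimal illustration using BitsAndBytes 4-bit loading; it is not HF-Waitress’s actual code, and the model name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example model name, swap in any HF model

# Quantize the weights to 4-bit on the fly as they're loaded (BitsAndBytes requires an Nvidia GPU)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # place layers automatically across available devices
)
```

HF-Waitress wraps this kind of loading behind a server, so you never touch the conversion or quantization steps yourself.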

At under a thousand lines of code, HF-Waitress is as lightweight as can be, especially considering its feature set:

- On-the-fly, in-place quantization:

  - `Quanto` `int8`, `int4` and `int2` quantization for all hardware,

  - `BitsAndBytes` `int8` & `int4` quantization for Nvidia GPUs,

  - `HQQ` `int8`, `int4`, `int3`, `int2`, `int1` quantization for Nvidia & AMD GPUs.

- Activation-Aware Quantization (AWQ) Support: load AWQ-quantized models from HF-Hub.

- Model Agnosticism: Compatible with any HF-Transformers format LLM.

- Configuration Management: Uses `config.json` to store settings, allowing for easy configuration and persistence across runs.

- Hardware & Platform Agnostic: Run native or quantized models on all manner of CPUs, GPUs and Apple M-silicon across Windows, Linux, MacOS & Docker containers.

- Error Handling: Detailed logging and traceback reporting via centralized error-handling functions.

- Health Endpoint: Provides valuable information about the loaded model and server health.

- Concurrency Control: Uses semaphores for selective concurrency while taking advantage of semaphore-native queueing (a generic sketch of this pattern follows the list).

- Streaming Responses: Supports both standard and streaming completions.
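Regarding the concurrency-control point above, the general idea is to let only a fixed number of generation requests through at a time and have the semaphore hold the rest. The sketch below shows that pattern generically; it is not HF-Waitress’s actual implementation, and the function names are made up:

```python
import threading

# Admit at most two generation requests at a time; additional callers block until a slot frees up.
GENERATION_SLOTS = threading.BoundedSemaphore(value=2)

def run_model_inference(messages):
    """Placeholder for the actual call into the loaded model's generate() method."""
    return {"response": "..."}

def handle_completion_request(messages):
    # Hypothetical request handler: the semaphore gates access to the model so that
    # concurrent HTTP requests don't all hit the CPU/GPU at once.
    with GENERATION_SLOTS:
        return run_model_inference(messages)
```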

Here’s an overview of the available API endpoints as of today, followed by a quick client-side usage sketch:

1. `/completions` (POST): Generate completions for given messages.

2. `/completions_stream` (POST): Stream completions for given messages.

3. `/health` (GET): Check the health and get information about the loaded model.

4. `/hf_config_reader_api` (POST): Read values from persistent configuration memory.

5. `/hf_config_writer_api` (POST): Write values to the persistent configuration memory.

6. `/restart_server` (GET): Restart the LLM server.
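As a quick client-side illustration, here is how one might call these endpoints with the `requests` library. The host, port and request-payload shape are assumptions made for this example; consult the repository’s documentation for the exact schema:

```python
import requests

BASE_URL = "http://localhost:9069"  # assumed host and port; adjust to your setup

# Check server health and details about the loaded model
print(requests.get(f"{BASE_URL}/health").json())

# Request a completion; the payload shape shown is an assumed example, check the repo docs
payload = {"messages": [{"role": "user", "content": "Explain GGUF in one sentence."}]}
print(requests.post(f"{BASE_URL}/completions", json=payload).json())
```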

Check out the GitHub repository linked below to obtain the server, source code and detailed documentation on all the above aspects. As always, I welcome discussions and contributions!

A closing note on llama.cpp

Llama.cpp and its GGUFs will likely remain the best option for hybrid CPU+GPU inferencing, wherein model layers are distributed across the memory of these processors, enabling the utilization of all available RAM in a single system. I am actively looking into quantization techniques that will enable such hybrid inferencing in HF-Waitress too though, so stay tuned!

All said and done, llama.cpp is an amazing project and really represents open-source at its best: it’s a testament to what a community of willing contributors can create, and to the freedom and enablement afforded to all by open-source projects.

llama.cpp and its creator and contributors have been very inspiring to me on every level as a developer, and I wish the project continued success. It serves as my inspiration to build and open-source the tools I do!
