Shrinking Giants: Adapting Open Source AI for Everyday Devices

David Kolb
8 min read · Jan 10, 2024


A llama stands atop GPU stacks, overseeing a tech environment, blending whimsy with modern computing in a surreal depiction.
Image Midjourney V6 | David Kolb

2023 saw an explosion of open-source AI, with notable releases including Meta's Llama 2 and French startup Mistral AI's Mixtral 8x7B. While these models democratise access to AI, their substantial compute and memory requirements still pose infrastructure and cost challenges for many.

To address this barrier, several tech firms like Google and Apple are finding ways to train or compress these models so they can run efficiently on consumer-grade hardware. For example, Google recently introduced Gemini Nano. At the same time, Apple released its MLX framework, which utilises quantisation to enable speech recognition on Macs via OpenAI’s Whisper model.

Quantisation compresses large language models by simplifying the detailed numerical values representing the model’s learned knowledge. For example, a number like 0.12345678 might be approximated as 0.125. This makes the overall model representation less precise but crucially allows it to take up far less computer memory and run faster with no internet connection required.
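As a rough illustration of the idea, here is a minimal sketch that rounds a handful of weights onto a 16-level (4-bit) grid and reconstructs them. It is a toy example of my own, not the actual Q4_K_M scheme llama.cpp uses later in this post:

import numpy as np

# Toy illustration: map a handful of full-precision weights onto a
# 16-level (4-bit) integer grid, then reconstruct approximate values.
# This is a simplified sketch, not the Q4_K_M scheme used by llama.cpp.

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.1, size=8).astype(np.float32)

# A shared scale factor maps the weight range onto the signed 4-bit range -8..7
scale = np.abs(weights).max() / 7
quantised = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Reconstruct approximate weights from the integer codes
restored = quantised.astype(np.float32) * scale

print("original :", np.round(weights, 4))
print("restored :", np.round(restored, 4))
print("max error:", np.abs(weights - restored).max())

Each weight is then stored as a small integer plus a shared scale factor, which is where the memory saving comes from.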

So, in essence, quantisation trades off some marginal accuracy to enable models that would typically require expensive specialised hardware to now run smoothly on consumer devices like laptops and phones instead. The models still perform remarkably well but can now readily fit into applications on ubiquitous personal devices and platforms.

It's analogous to an artist working with a small colour palette: they can still convey the essence of an image, but some of the subtlety and realism available with a full spectrum of colours is lost.

Use Cases

  • AI on Consumer Devices: Enable real-time processing on phones/laptops without constant internet dependency
  • Startup Innovation: Develop cost-effective prototypes and minimum viable products powered by state-of-the-art AI
  • Research Tools: Support active learning and experimentation with quantised models on typical consumer devices

To see how this works, I ran an experiment using the open source project llama.cpp (Georgi Gerganov). I chose Meta's Llama 2 model because it is a leading open source conversational AI, freely available for research under a permissive licence. While pre-quantised Llama 2 models are emerging, I quantised the model myself for greater transparency and control, tuning it for my MacBook Pro's constraints.

The overall goal is to compress an Open Source AI model for feasible deployment on a MacBook Pro M1 16 GB while preserving accuracy as much as possible.

Prerequisites

A free Hugging Face account to download models.

https://huggingface.co

Accept Meta’s Llama license terms and get access approval before downloading the models.

https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Quantisation Steps

  1. Set up GPU environment on RunPod for quantisation acceleration.
  2. Download the 7B parameter Llama 2 model from Hugging Face.
  3. Configure the llama.cpp toolkit.
  4. Quantise Llama to the intermediate FP16 format.
  5. Quantise to 4-bit for maximum compression.
  6. Run inference.

1. Set up GPU environment for quantisation acceleration

While the end goal is to run compressed models on a MacBook Pro, quantising large AI models requires significant one-time compute resources for acceleration. For this quantisation exercise, I leveraged GPUs rented from RunPod's cloud service, as they provide the necessary performance and no personal data is involved.

Screenshot of RunPod Instance Status

Note: the larger the model, the more GPU memory is needed to quantise it. For this experiment, a GPU setup with an NVIDIA RTX 6000 (48 GB VRAM, 40 GB container volume, 900 GB disk volume) cost approximately $5 on RunPod. The service also enables convenient data transfers between cloud storage providers to stage the models.

https://www.runpod.io
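As a rough back-of-the-envelope estimate (my own figures, ignoring activations and conversion overhead), just holding the 7B weights in memory looks like this, which is why a 48 GB card leaves comfortable headroom:

# Back-of-the-envelope memory needed just to hold 7B parameters in memory.
# My own rough figures, ignoring activations and conversion overhead.
params = 7e9  # Llama 2 7B

for label, bytes_per_param in [("FP32", 4), ("FP16", 2), ("4-bit", 0.5)]:
    gigabytes = params * bytes_per_param / 1024**3
    print(f"{label:>5}: ~{gigabytes:.1f} GB")

# FP32: ~26.1 GB, FP16: ~13.0 GB, 4-bit: ~3.3 GB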

Once the GPU environment was deployed, I could access Jupyter notebooks or the SSH terminal to run the quantisation.

2. Download the 7B parameter Llama 2 model from Hugging Face

It can take a while to download the models, so my tip is to start this before you set up the llama.cpp toolkit and let the download run in the background.

Install the huggingface_hub library

pip install huggingface_hub

Generating Filenames for Different Model Formats

# Define the model ID for Hugging Face
model_id = 'meta-llama/Llama-2-7b-chat-hf'

# Extract the model name from the model_id by splitting the string
model_name = model_id.split('/')[-1]

# Append '-f16.gguf' to the model name for the FP16 formatted filename
filename_f16 = model_name + '-f16.gguf'

# Append '-Q4_K_M.gguf' to the model name for the 4-bit model filename
filename_q4_k_m = model_name + '-Q4_K_M.gguf'

Run this code to download the model

from huggingface_hub import snapshot_download

# Retrieve your Hugging Face token from your profile:
# Profile -> Settings -> Access Tokens
token = "<HF_TOKEN>"  # replace with your own access token

# Use snapshot_download to get the model.
# Specify the local directory, symlink usage, revision, cache directory
# and token.
snapshot_download(repo_id=model_id,
                  local_dir=model_name,
                  local_dir_use_symlinks=False,
                  revision="main",
                  cache_dir="/workspace/hf/",
                  token=token)

Note the cache_dir="/workspace/hf/" parameter, which places the download cache on your workspace volume rather than the default container volume.

Once downloaded, you can manually delete the HF cache to save space.

cd /workspace
rm -rf hf

3. Configure the llama.cpp toolkit

# Clone the llama.cpp repository from GitHub
git clone https://github.com/ggerganov/llama.cpp

# Navigate to the cloned repository directory
cd llama.cpp

# Install required Python packages from the requirements.txt file
pip install -r requirements.txt

# Compile the llama.cpp with GPU acceleration enabled
make LLAMA_CUBLAS=1

4. Quantise Llama to 16-bit floating point (FP16), the intermediate format before 4-bit integer quantisation

This step is required because llama.cpp's quantize tool works on a GGUF file, so the original weights are first converted to a 16-bit GGUF intermediate. Quantising in stages (32-bit -> 16-bit -> 4-bit) also preserves accuracy and numerical stability better than jumping directly from 32-bit to 4-bit.

import os

# Model file name as defined earlier.
# In this example the model was downloaded to '/workspace/Llama-2-7b-chat-hf'

# Construct the command to convert the model to FP16 format using llama.cpp
command = f"python llama.cpp/convert.py {model_name} \
--outfile llama.cpp/models/{filename_f16} \
--outtype f16"

# Execute the command using os.system
os.system(command)

Llama quantisation to 16-bit floats completed in approximately 10 minutes. Once it finished, a message confirmed that the quantised model file had been written to the volume.

Wrote llama.cpp/models/Llama-2-7b-chat-hf-f16.gguf
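A quick sanity check (my own addition, not part of the llama.cpp workflow) confirms the FP16 GGUF file exists and reports its size before moving on:

import os

# Sanity check: confirm the FP16 GGUF file was written and report its size
fp16_path = f"llama.cpp/models/{filename_f16}"
size_gb = os.path.getsize(fp16_path) / 1024**3
print(f"{fp16_path}: {size_gb:.2f} GB")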

5. Quantise Floating Point 16-bit to 4 bits to maximise compression

I used the Q4_K_M quantisation scheme because community experiments on Hugging Face have shown it strikes a good balance of compression and accuracy, given my 16 GB memory constraint.

# Construct the command for quantising the model.
# The './quantize' tool converts the FP16 model file
# to a 4-bit quantised version.
command = (
    f"./quantize ./models/{filename_f16} "
    f"./models/{filename_q4_k_m} "
    "Q4_K_M"
)

# Execute the quantisation command using os.system
os.system(command)

Quantisation to 4-bit completed in approximately 10 minutes and finished with this message:

llama_model_quantize_internal: model size  = 12853.02 MB
llama_model_quantize_internal: quant size = 3891.24 MB

Original 32-bit model: approximately 30 GB
FP16 model: approximately 12.55 GB
Quantised 4-bit model: approximately 3.80 GB
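Working through those numbers (a quick check of my own), the 4-bit file is roughly 70% smaller than the FP16 intermediate and about 87% smaller than the original 32-bit weights:

# Compression ratios worked out from the sizes reported above
fp32_mb = 30 * 1024       # approximate original 32-bit size
fp16_mb = 12853.02        # reported by llama_model_quantize_internal
q4_mb = 3891.24           # reported by llama_model_quantize_internal

print(f"FP16 -> 4-bit: {100 * (1 - q4_mb / fp16_mb):.1f}% smaller")
print(f"FP32 -> 4-bit: {100 * (1 - q4_mb / fp32_mb):.1f}% smaller")
# FP16 -> 4-bit: 69.7% smaller
# FP32 -> 4-bit: 87.3% smaller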

6. Run inference

To run inference on the quantised Llama model and test its performance, we utilise the ./main executable provided in the llama.cpp toolkit.

This tool allows passing in the compressed model file and input text prompts to generate language predictions. Some key parameters we can specify are:

-m - Path to the quantised model file
-p - Input prompt text
--color - Colourise the output
-c - Context size in tokens
--temp - Sampling temperature
--repeat_penalty - Penalty applied to repeated tokens
-n - Number of tokens to generate

For example, the following command runs the 4-bit compressed Llama model to generate a short intro based on the given blog title:

./main -m ./models/Llama-2-7b-chat-hf-Q4_K_M.gguf \
-p "Write a short intro to this blog: Shrinking Giants" \
--color -c 4096 --temp 0.7 --repeat_penalty 1.1
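If you want to drive the same binary from Python rather than the shell, a minimal sketch of my own (reusing the paths and prompt from the command above, in the spirit of the os.system calls used earlier) is to call ./main via subprocess and capture the generated text:

import subprocess

# Call the same ./main binary from Python and capture the generated text.
# Paths and prompt are taken from the example command above.
result = subprocess.run(
    [
        "./main",
        "-m", "./models/Llama-2-7b-chat-hf-Q4_K_M.gguf",
        "-p", "Write a short intro to this blog: Shrinking Giants",
        "-c", "4096",
        "--temp", "0.7",
        "--repeat_penalty", "1.1",
    ],
    capture_output=True,
    text=True,
)

print(result.stdout)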

I first tested inference performance on the Runpod GPU instance before transferring the quantised models to my local MacBook Pro to run. This ensured the toolkit and models were working properly before deployment.

I downloaded the model file to my MacBook using SCP. Alternatively, RunPod provides a cloud sync option to transfer files between RunPod and AWS S3, Google Cloud, Dropbox, Azure Blob or Backblaze B2.

scp username@runpod-hostname:transfer/models/Llama-2-7b-chat-hf-Q4_K_M.gguf \
~/Downloads

On the MacBook Pro M1 16 GB, the 4-bit quantised Llama model achieved fast inference performance on consumer hardware. Some highlight metrics include:

  • 1.4 seconds to load the model
  • 73 ms average sampling time per prompt
  • 12.59 tokens/sec for text generation (faster than human read speed)
  • 11.9 seconds total runtime

By quantising down to 4 bits, the model size was reduced enough to allow responsive inference on a personal laptop with output generated at about 13 tokens per second. This rate was actually faster than I could read the text — demonstrating good enough throughput for local applications.
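To put that in context, here is a rough conversion of my own from tokens per second to words per minute, assuming the common approximation of about 0.75 English words per token and a typical silent reading speed of 200-300 words per minute:

# Rough comparison of generation speed with human reading speed.
# Assumes ~0.75 English words per token and a typical silent reading
# speed of 200-300 words per minute.
tokens_per_second = 12.59
words_per_token = 0.75

words_per_minute = tokens_per_second * words_per_token * 60
print(f"~{words_per_minute:.0f} words per minute")  # roughly 570 wpm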

Screenshot from MacBook Terminal

In summary, through quantisation we attained a compressed model that runs efficiently on devices like a MacBook Pro with minimal accuracy loss.

Conclusion

This project demonstrated how quantisation can make large models accessible beyond the cloud, how complementary cloud and local device strategies play to each other's strengths, and why responsible governance matters. These themes could drive much of the open source AI dialogue moving forward.

Here are the key takeaways:

  1. Quantisation enables open source AI models to run efficiently on consumer devices, expanding access. By compressing the model from 32 bits down to 4 bits, quantisation reduced the Llama model's size by roughly 87% with minimal accuracy loss, allowing faster-than-reading-speed inference on a MacBook Pro.
  2. Combining cloud and local devices is an impactful approach. Leveraging cloud GPUs for the resource-intensive one-time quantisation step and deploying the compressed model locally provides a best-of-both-worlds solution. This balances capabilities, cost and privacy.
  3. Responsible governance is pivotal as AI proliferates. Initiatives like the EU’s AI Act, which promotes research while ensuring accountability, will help guide development. Structured oversight of open source projects enables innovation on top of an ethical foundation.
  4. Quantisation shows promise for enhancing AI efficiency, yet it remains at an early stage. Start with small-scale pilot projects or experiments to build hands-on understanding of how quantisation can be integrated and what benefits it may offer.

In my next post, I will showcase running the quantised Llama-2 model on a laptop for chat applications. This will demonstrate the real-world viability of deploying optimised open source AI on everyday consumer devices.


David Kolb

Innovation Strategist & Coach | Cyclist 🚴‍♀️ | Photographer 📸 | IDEO U Alumni Coach