AirLLM Unleashed

Haribaskar Dhanabalan
3 min read · Jan 21, 2024


Efficiently Running 70B LLM Inference on a 4GB GPU

Introduction

In the world of language models, size matters: larger models tend to deliver better results. Working with massive models, however, can be daunting, especially when it comes to inference on limited hardware such as a 4GB GPU. Enter AirLLM, an open-source library that makes it possible to run 70B-parameter large language models (LLMs) on a single 4GB GPU without quantization, distillation, or pruning, trading inference speed for a drastically smaller memory footprint. In this post, we’ll dive into how AirLLM works and how it changes the way we approach LLM inference.

AirLLM: Revolutionizing LLM Inference

AirLLM is a game-changer in the world of LLMs, allowing for the efficient execution of colossal models on relatively modest hardware. Here’s what sets AirLLM apart:

  1. Layer-wise Inference: AirLLM takes a “divide and conquer” approach, executing the model’s layers sequentially. During inference, each layer depends only on the output of the previous layer, so once a layer finishes its computation its weights can be released from memory, keeping only its output. As a result, the GPU memory needed at any moment is roughly that of a single layer, approximately 1.6GB for a 70B model (a rough back-of-envelope check follows this list).
  2. Flash Attention: To further optimize memory access and boost performance, AirLLM implements flash attention, ensuring efficient CUDA memory usage.
  3. Layer Sharding: AirLLM splits the model files by layer, simplifying loading and minimizing the memory footprint.
  4. Meta Device Feature: Leveraging HuggingFace Accelerate’s meta device feature, AirLLM loads the model’s structure without reading the weight data, so the loading step itself consumes essentially no memory.
  5. Quantization Options: AirLLM also offers optional quantization via a ‘compression’ parameter, supporting 4-bit or 8-bit block-wise quantization.
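As a rough sanity check of the per-layer figure in point 1, here is a back-of-envelope estimate. The layer count (80) and fp16 weights are assumptions based on Llama-2-70B-class architectures, not numbers taken from the AirLLM docs:

# Back-of-envelope estimate of per-layer GPU memory for a 70B model.
# Assumptions (not from the AirLLM docs): fp16 weights (2 bytes/param)
# and ~80 transformer layers, as in Llama-2-70B-class models.
params = 70e9              # total parameters
bytes_per_param = 2        # fp16
num_layers = 80            # typical for a 70B Llama-style model

total_gb = params * bytes_per_param / 1024**3
per_layer_gb = total_gb / num_layers

print(f"total weights: ~{total_gb:.0f} GB")      # ~130 GB
print(f"per layer:     ~{per_layer_gb:.1f} GB")  # ~1.6 GB

Keeping only one layer’s weights plus its activations resident at a time is what lets the whole pipeline fit on a 4GB card.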

How to Get Started with AirLLM

  1. Install the Package: Begin by installing the AirLLM pip package using the following command:
pip install airllm

  2. Initialize AirLLM: Initialize AirLLM by providing the HuggingFace repo ID of the model you wish to use, or its local path. Inference can then be performed much like with a regular transformers model.

from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use the model's local path...
# model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    # 'I like',
]

input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Model Compression for Lightning-Fast Inference

AirLLM takes inference speed to the next level with model compression. Based on block-wise quantization, this feature can accelerate inference by up to 3x without significant accuracy loss. To enable model compression:

  1. Ensure you have ‘bitsandbytes’ installed using pip install -U bitsandbytes.
  2. Update AirLLM to version 2.0.0 or later with pip install -U airllm.
  3. When initializing the model, set the ‘compression’ parameter to ‘4bit’ for 4-bit block-wise quantization or ‘8bit’ for 8-bit, as shown in the snippet below.
Ref: https://pypi.org/project/airllm/
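For example, enabling 4-bit compression only changes the model initialization; the rest of the inference code above stays the same. The snippet reuses the same Platypus2 model and the ‘compression’ parameter named in the steps above:

from airllm import AutoModel

# Enable 4-bit block-wise quantization via the 'compression' parameter
# (requires bitsandbytes and airllm >= 2.0.0).
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit')  # or '8bit'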

Configurations and More

When initializing the model, AirLLM supports several configuration options:

  • Choose compression options: 4bit or 8bit (default: None)
  • Enable profiling mode for time consumption analysis.
  • Specify a custom path for saving the split model.
  • Provide HuggingFace token for gated models.
  • Utilize prefetching to optimize model loading.
  • Save disk space by deleting the original downloaded model.

These configurations provide flexibility and control over your inference process.
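Putting a few of these together, an initialization might look like the sketch below. The keyword names (profiling_mode, layer_shards_saving_path, hf_token, prefetching, delete_original) follow the AirLLM README as I understand it and should be verified against the current documentation; the saving path and token are placeholders:

from airllm import AutoModel

# Illustrative only: parameter names follow the AirLLM README at the time
# of writing; check the current docs before relying on them.
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',                               # or '8bit'; default is None
    profiling_mode=True,                              # report time spent per step
    layer_shards_saving_path='/data/airllm_layers',   # placeholder path for split layers
    hf_token='YOUR_HF_TOKEN',                         # placeholder token for gated models
    prefetching=True,                                 # overlap layer loading with compute
    delete_original=True)                             # delete the original download to save disk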

Conclusion

AirLLM is a groundbreaking tool that empowers developers and researchers to harness the full potential of large language models on resource-constrained hardware. With its layer-wise inference, optional model compression, and ongoing updates, AirLLM is changing the way we work with LLMs. Try AirLLM today and see how far a 4GB GPU can go.

References:

  1. https://pypi.org/project/airllm/
  2. Colab: link
