AirLLM Unleashed
Efficiently Running 70B LLM Inference on a 4GB GPU
Introduction
In the realm of language models, size often matters, and larger models tend to deliver better results. However, working with massive language models can be daunting, especially when it comes to inference on limited hardware like a 4GB GPU. Enter AirLLM, a solution that enables 70B large language models (LLMs) to run on a single 4GB GPU without compromising the model’s accuracy. In this post, we’ll dive deep into the intricacies of AirLLM and how it transforms the way we approach LLM inference.
AirLLM: Revolutionizing LLM Inference
AirLLM is a game-changer in the world of LLMs, allowing for the efficient execution of colossal models on relatively modest hardware. Here’s what sets AirLLM apart:
- Layer-wise Inference: AirLLM takes a “divide and conquer” approach, executing the model’s layers sequentially. During inference, each layer depends only on the output of the previous one, so once a layer finishes its computation its weights can be released from GPU memory and only its output is kept. Consequently, the GPU memory needed at any moment is roughly that of a single layer, about 1.6GB for a 70B model (see the sketch after this list).
- Flash Attention: To further optimize memory access and boost performance, AirLLM implements flash attention, ensuring efficient CUDA memory usage.
- Layer Sharding: AirLLM divides model files by layers, simplifying the loading process and minimizing memory footprint.
- Meta Device Feature: Leveraging HuggingFace Accelerate’s meta device feature, AirLLM loads the model’s structure without reading the actual weight data, so no memory is consumed until a layer’s weights are genuinely needed.
- Quantization Options: AirLLM provides the flexibility to explore quantization with a ‘compression’ parameter, supporting 4-bit or 8-bit block-wise quantization.
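To make the layer-wise idea concrete, here is a minimal, illustrative sketch of sequential layer execution. It is not AirLLM’s actual implementation; the load_layer_shard and build_layer helpers are hypothetical placeholders for loading one layer’s weights from disk and constructing the corresponding layer module.
import torch

def run_layer_by_layer(hidden_states, num_layers, load_layer_shard, build_layer):
    # Illustrative only: process the model one layer at a time so that only a
    # single layer's weights (roughly 1.6GB for a 70B model) sit in GPU memory.
    for i in range(num_layers):
        layer = build_layer(load_layer_shard(i)).cuda()  # load this layer's weights from disk
        with torch.no_grad():
            hidden_states = layer(hidden_states)         # keep only the layer's output
        del layer                                        # drop the weights...
        torch.cuda.empty_cache()                         # ...and release the CUDA memory
    return hidden_states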
How to Get Started with AirLLM
1. Install the Package: Begin by installing the AirLLM pip package using the following command:
pip install airllm
2. Initialize AirLLM: Initialize AirLLM by providing the HuggingFace repo ID of the model you wish to use or its local path. Inference can then be performed similarly to a regular transformer model.
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
# model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    # 'I like',
]

input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Model Compression for Lightning-Fast Inference
AirLLM takes inference speed to the next level with model compression. Based on block-wise quantization, this feature can accelerate inference by up to 3x without significant accuracy loss. To enable model compression:
- Ensure you have ‘bitsandbytes’ installed:
pip install -U bitsandbytes
- Update AirLLM to version 2.0.0 or later:
pip install -U airllm
- When initializing the model, specify the ‘compression’ parameter as ‘4bit’ for 4-bit block-wise quantization or ‘8bit’ for 8-bit (see the example below).
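For instance, with bitsandbytes installed and AirLLM 2.0.0 or later, 4-bit compression can be enabled by passing the parameter at initialization, shown here with the same Platypus2 model used earlier:
from airllm import AutoModel

# pass compression='4bit' or '8bit' to enable block-wise quantization
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                                  compression='4bit')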
Configurations and More
When initializing the model, AirLLM supports various configurations like:
- Choose a compression option: 4bit or 8bit (default: None)
- Enable profiling mode for time consumption analysis.
- Specify a custom path for saving the split model.
- Provide HuggingFace token for gated models.
- Utilize prefetching to optimize model loading.
- Save disk space by deleting the original downloaded model.
These configurations provide flexibility and control over your inference process.
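As a rough sketch, a fully configured initialization might look like the following. The keyword names used here (profiling_mode, layer_shards_saving_path, hf_token, prefetching, delete_original) follow the AirLLM documentation, but confirm them against the version you have installed; the path and token values are placeholders.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',                          # 4-bit or 8-bit block-wise quantization (default: None)
    profiling_mode=True,                         # report time spent in each stage
    layer_shards_saving_path="/path/to/shards",  # custom location for the layer-split model
    hf_token="hf_...",                           # HuggingFace token for gated models
    prefetching=True,                            # prefetch the next layer while the current one runs
    delete_original=True)                        # delete the original download to save disk space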
Conclusion
AirLLM is a groundbreaking tool that empowers developers and researchers to harness the full potential of large language models on resource-constrained hardware. With its innovative approach to layer-wise inference, model compression, and ongoing updates, AirLLM is revolutionizing the way we work with LLMs. Try AirLLM today to experience efficient and lightning-fast inference like never before.