AirLLM Unleashed
Efficiently Running 70B LLM Inference on a 4GB GPU
Introduction
In the realm of language models, size often matters, and larger models tend to deliver better results. However, working with massive language models can be daunting, especially when it comes to inference on limited hardware like a 4GB GPU. Enter AirLLM, a solution that enables 70B large language models (LLMs) to run on a single 4GB GPU without compromising the model’s accuracy. In this post, we’ll dive deep into the intricacies of AirLLM and how it transforms the way we approach LLM inference.
AirLLM: Revolutionizing LLM Inference
AirLLM is a game-changer in the world of LLMs, allowing for the efficient execution of colossal models on relatively modest hardware. Here’s what sets AirLLM apart:
- Layer-wise Inference: AirLLM takes a “divide and conquer” approach, executing the model’s layers sequentially. During inference, each layer depends only on the output of the previous one, so once a layer finishes its computation its weights can be released from GPU memory and only its output is kept. Consequently, the GPU memory needed at any moment is roughly that of a single layer, about 1.6GB for a 70B model (see the sketch after this list).
- Flash Attention: To further optimize memory access and boost performance, AirLLM implements flash attention, ensuring efficient CUDA memory usage.
- Layer Sharding: AirLLM divides model files by layers, simplifying the loading process and minimizing memory footprint.
- Meta Device Feature: Leveraging HuggingFace Accelerate’s meta device feature, AirLLM loads the model’s structure without reading the actual weight data, so no memory is consumed until a layer’s weights are genuinely needed.
- Quantization Options: AirLLM provides the flexibility to explore quantization with a ‘compression’ parameter, supporting 4-bit or 8-bit block-wise quantization.
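To make the layer-wise idea concrete, here is a minimal, illustrative sketch of sequential layer execution. It is not AirLLM’s actual implementation; the load_layer_shard and build_layer helpers are hypothetical placeholders for loading one layer’s weights from disk and constructing the corresponding layer module.
import torch

def run_layer_by_layer(hidden_states, num_layers, load_layer_shard, build_layer):
    # Illustrative only: process the model one layer at a time so that only a
    # single layer's weights (roughly 1.6GB for a 70B model) sit in GPU memory.
    for i in range(num_layers):
        layer = build_layer(load_layer_shard(i)).cuda()  # load this layer's weights from disk
        with torch.no_grad():
            hidden_states = layer(hidden_states)         # keep only the layer's output
        del layer                                        # drop the weights...
        torch.cuda.empty_cache()                         # ...and release the CUDA memory
    return hidden_states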
How to Get Started with AirLLM
1. Install the Package: Begin by installing the AirLLM pip package using the following command:
pip install airllm
2. Initialize AirLLM: Initialize AirLLM by providing the HuggingFace repo ID of the model you wish to use or its local path. Inference can then be performed similarly to a regular transformer model.
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
# model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    # 'I like',
]

input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Model Compression for Lightning-Fast Inference
AirLLM takes inference speed to the next level with model compression. Based on block-wise quantization, this feature can accelerate inference by up to 3x without significant accuracy loss. To enable model compression:
- Ensure you have ‘bitsandbytes’ installed:
pip install -U bitsandbytes
- Update AirLLM to version 2.0.0 or later:
pip install -U airllm
- When initializing the model, specify the ‘compression’ parameter as ‘4bit’ for 4-bit block-wise quantization or ‘8bit’ for 8-bit (see the example below).
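For instance, with bitsandbytes installed and AirLLM 2.0.0 or later, 4-bit compression can be enabled by passing the parameter at initialization, shown here with the same Platypus2 model used earlier:
from airllm import AutoModel

# pass compression='4bit' or '8bit' to enable block-wise quantization
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                                  compression='4bit')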
Configurations and More
When initializing the model, AirLLM supports various configurations like:
- Choose a compression option: 4bit or 8bit (default: None)
- Enable profiling mode for time consumption analysis.
- Specify a custom path for saving the split model.
- Provide HuggingFace token for gated models.
- Utilize prefetching to optimize model loading.
- Save disk space by deleting the original downloaded model.
These configurations provide flexibility and control over your inference process.
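As a rough sketch, a fully configured initialization might look like the following. The keyword names used here (profiling_mode, layer_shards_saving_path, hf_token, prefetching, delete_original) follow the AirLLM documentation, but confirm them against the version you have installed; the path and token values are placeholders.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',                          # 4-bit or 8-bit block-wise quantization (default: None)
    profiling_mode=True,                         # report time spent in each stage
    layer_shards_saving_path="/path/to/shards",  # custom location for the layer-split model
    hf_token="hf_...",                           # HuggingFace token for gated models
    prefetching=True,                            # prefetch the next layer while the current one runs
    delete_original=True)                        # delete the original download to save disk space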
Conclusion
AirLLM is a groundbreaking tool that empowers developers and researchers to harness the full potential of large language models on resource-constrained hardware. With its innovative approach to layer-wise inference, model compression, and ongoing updates, AirLLM is revolutionizing the way we work with LLMs. Try AirLLM today to experience efficient and lightning-fast inference like never before.