LLM Inference on multiple GPUs with 🤗 Accelerate

Minimal working examples and performance benchmark

Geronimo
Nov 27, 2023

Large Language Models (LLMs) have revolutionised the field of natural language processing. As these models grow in size and complexity, the computational demands for inference also increase significantly. To tackle this challenge, leveraging multiple GPUs becomes essential.

This story covers

  • How to perform inference on multiple GPUs in parallel with the 🤗 Accelerate package
  • A straightforward approach with minimal working code examples
  • A performance benchmark showing the overhead introduced by using multiple GPUs

Hello World example

This section introduces the basic setup and a simple example to demonstrate multi-GPU “message passing” using Accelerate.

from accelerate import Accelerator
from accelerate.utils import gather_object

accelerator = Accelerator()

# each GPU creates a string
message=[ f"Hello this is GPU {accelerator.process_index}" ]

# collect the messages from all GPUs
messages=gather_object(message)

# output the messages only on the main process with accelerator.print()
accelerator.print(messages)

Output of accelerate launch hello_world.py:

['Hello this is GPU 0', 
'Hello this is GPU 1',
'Hello this is GPU 2',
'Hello this is GPU 3',
'Hello this is GPU 4']
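
By default, accelerate launch picks up the settings saved by accelerate config; the number of processes (one per GPU) can also be set explicitly with the standard --num_processes flag, e.g. accelerate launch --num_processes 5 hello_world.py for the five-GPU output above.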

Multi-GPU inference (simple)

The following is a simple, non-batched approach: each process loads its own copy of the model and works through its share of the prompts one at a time.

from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json

accelerator = Accelerator()

# 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all=[
"The King is dead. Long live the Queen.",
"Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
"The story so far: in the beginning, the universe was created.",
"It was a bright cold day in April, and the clocks were striking thirteen.",
"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
"The sweat wis lashing oafay Sick Boy; he wis trembling.",
"124 was spiteful. Full of Baby's venom.",
"As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
"I write this sitting in the kitchen sink.",
"We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
] * 10

# load a base model and tokenizer
model_path = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# sync GPUs and start the timer
accelerator.wait_for_everyone()
start=time.time()

# divide the prompt list onto the available GPUs
with accelerator.split_between_processes(prompts_all) as prompts:
    # store output of generations in dict
    results = dict(outputs=[], num_tokens=0)

    # have each GPU do inference, prompt by prompt
    for prompt in prompts:
        prompt_tokenized = tokenizer(prompt, return_tensors="pt").to("cuda")
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

        # remove prompt from output
        output_tokenized = output_tokenized[len(prompt_tokenized["input_ids"][0]):]

        # store outputs and number of tokens in results{}
        results["outputs"].append(tokenizer.decode(output_tokenized))
        results["num_tokens"] += len(output_tokenized)

    results = [results]  # transform to list, otherwise gather_object() will not collect correctly

# collect results from all the GPUs
results_gathered = gather_object(results)

if accelerator.is_main_process:
    timediff = time.time() - start
    num_tokens = sum([r["num_tokens"] for r in results_gathered])

    print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")

Performance

Using multiple GPUs introduces some coordination overhead: in this setup, throughput scales almost linearly up to 3 GPUs, levels off at 4, and drops slightly at 5 (see the quick calculation after the list). Performance will also depend on many other parameters, such as model size and quantisation, prompt length, number of generated tokens, and sampling strategy.

  • 1 GPU: 44 tokens/sec, time: 225.5s
  • 2 GPUs: 88 tokens/sec, time: 112.9s
  • 3 GPUs: 128 tokens/sec, time: 77.6s
  • 4 GPUs: 137 tokens/sec, time: 72.7s
  • 5 GPUs: 119 tokens/sec, time: 83.8s
meta-llama/Llama-2-7b, 100 prompts, 100 tokens generated per prompt, 1–5x NVIDIA GeForce RTX 3090 (power cap 290 W)
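
As a quick sanity check on the scaling, the throughput numbers above can be turned into speedup and per-GPU efficiency relative to the single-GPU run (plain arithmetic on the reported figures):

# tokens/sec from the list above, keyed by number of GPUs
throughput = {1: 44, 2: 88, 3: 128, 4: 137, 5: 119}

for n, tps in sorted(throughput.items()):
    speedup = tps / throughput[1]
    efficiency = speedup / n  # 1.0 would be perfect linear scaling
    print(f"{n} GPUs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")

This works out to roughly 97% efficiency at 3 GPUs, 78% at 4, and 54% at 5, which is where the curve flattens out and then dips.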

Multi-GPU inference (batched)

Outside of hello-world examples, you would use batched inference to speed things up. The script below is the same as above, except that the prompts are tokenized and generated in batches (of 16, here).

from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json

accelerator = Accelerator()

def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)

# 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all=[
"The King is dead. Long live the Queen.",
"Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
"The story so far: in the beginning, the universe was created.",
"It was a bright cold day in April, and the clocks were striking thirteen.",
"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
"The sweat wis lashing oafay Sick Boy; he wis trembling.",
"124 was spiteful. Full of Baby's venom.",
"As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
"I write this sitting in the kitchen sink.",
"We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
] * 10

# load a base model and tokenizer
model_path = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

# batch, left pad (for inference), and tokenize
def prepare_prompts(prompts, tokenizer, batch_size=16):
    batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
    batches_tok = []
    tokenizer.padding_side = "left"
    for prompt_batch in batches:
        batches_tok.append(
            tokenizer(
                prompt_batch,
                return_tensors="pt",
                padding='longest',
                truncation=False,
                pad_to_multiple_of=8,
                add_special_tokens=False).to("cuda")
        )
    tokenizer.padding_side = "right"
    return batches_tok

# sync GPUs and start the timer
accelerator.wait_for_everyone()
start=time.time()

# divide the prompt list onto the available GPUs
with accelerator.split_between_processes(prompts_all) as prompts:
    results = dict(outputs=[], num_tokens=0)

    # have each GPU do inference in batches
    prompt_batches = prepare_prompts(prompts, tokenizer, batch_size=16)

    for prompts_tokenized in prompt_batches:
        outputs_tokenized = model.generate(**prompts_tokenized, max_new_tokens=100)

        # remove prompt from gen. tokens
        outputs_tokenized = [tok_out[len(tok_in):]
                             for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized)]

        # count and decode gen. tokens
        num_tokens = sum([len(t) for t in outputs_tokenized])
        outputs = tokenizer.batch_decode(outputs_tokenized)

        # store in results{} to be gathered by accelerate
        results["outputs"].extend(outputs)
        results["num_tokens"] += num_tokens

    results = [results]  # transform to list, otherwise gather_object() will not collect correctly

# collect results from all the GPUs
results_gathered = gather_object(results)

if accelerator.is_main_process:
    timediff = time.time() - start
    num_tokens = sum([r["num_tokens"] for r in results_gathered])

    print(f"tokens/sec: {num_tokens//timediff}, time elapsed: {timediff}, num_tokens {num_tokens}")

Performance

Calling generate with batches of prompts speeds things up considerably, roughly a 12x improvement on a single GPU compared with the prompt-by-prompt version above. Adding GPUs helps again up to a point: throughput scales well to 4 GPUs and then plateaus at 5 (see the quick check after the list).

  • 1 GPU: 520 tokens/sec, time: 19.2s
  • 2 GPUs: 900 tokens/sec, time: 11.1s
  • 3 GPUs: 1205 tokens/sec, time: 8.2s
  • 4 GPUs: 1655 tokens/sec, time: 6.0s
  • 5 GPUs: 1658 tokens/sec, time: 6.0s
meta-llama/Llama-2-7b, 100 prompts, 100 tokens generated per prompt, batch size 16, 1–5x NVIDIA GeForce RTX 3090 (power cap 290 W)
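
The same kind of quick check makes the batching gain explicit (again, just arithmetic on the two benchmark lists):

# tokens/sec with and without batching, from the two benchmarks above
simple  = {1: 44, 2: 88, 3: 128, 4: 137, 5: 119}
batched = {1: 520, 2: 900, 3: 1205, 4: 1655, 5: 1658}

for n in sorted(batched):
    gain = batched[n] / simple[n]
    efficiency = batched[n] / (n * batched[1])  # relative to one batched GPU
    print(f"{n} GPUs: batching is {gain:.1f}x faster, parallel efficiency {efficiency:.0%}")

On a single GPU, batching is roughly a 12x speedup; with 4 GPUs the batched run still reaches about 80% parallel efficiency, and with 5 GPUs it drops to about 64%.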

Summary

  • Multi-GPU inference with Hugging Face’s Accelerate package gives a substantial throughput boost, especially when combined with batched generation.
  • The overhead of coordinating multiple GPUs becomes noticeable at higher GPU counts: scaling efficiency drops well below linear beyond 3–4 GPUs in these benchmarks.

For a production system serving an LLM on multiple GPUs, you should probably look into dedicated inference servers such as vLLM, 🤗 Text Generation Inference, or FastChat.

I hope you enjoyed this short story! If you have any feedback, additional ideas, or questions, feel free to leave a comment here or reach out on Twitter.
