Changing the GPU changes the behaviour of your LLM.

Anis Zakari
12 min read · May 27, 2024

Most tech people know that varying versions of dependencies can result in different behaviors. However, in the realm of Large Language Models, because we need a lot of compute, we heavily depend on GPUs for both training and inference tasks. Yet, few are truly aware that switching GPUs can also affect the output of your LLM.

So you’re trying to create two identical environments?
You can set the dependency versions.
You can use Dockerization.
You can set the LLM temperature to 0.
You can set whatever seed you want.
At the end of the day, none of this will guarantee identical outputs unless you use the exact same GPU model.

In this article, I’ll highlight this phenomenon with an experiment showing where differences occur and why.

Note: You can skip the code snippets if you’re not interested in reproducing the experiment (you can go directly to section 5, “Why are the answers generated by the same inputs and the same LLM so different across two GPUs?”); the conclusion will still be valuable for understanding what’s going on.

1. Why this article?

One day, I was discussing with some folks why OpenAI and Anthropic models aren’t deterministic by design. I explained that they might use a Mixture of Experts (MoE) approach, occasionally not routing tokens to the optimal experts because those experts are too busy handling other tokens, which results in inconsistent answers.
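
To make the MoE intuition concrete, here is a toy sketch of top-1 routing with a per-expert capacity limit. It is purely illustrative (real routers, such as those in Mixtral or Switch Transformers, differ in the details), but it shows how a token can land on a different expert depending on what else is in the batch:

import torch

def route_top1(scores: torch.Tensor, capacity: int) -> list[int]:
    """Toy router: send each token to its best expert that still has room."""
    load = [0] * scores.shape[1]
    choices = []
    for tok_scores in scores:
        for expert in torch.argsort(tok_scores, descending=True).tolist():
            if load[expert] < capacity:
                choices.append(expert)
                load[expert] += 1
                break
    return choices

# Four tokens that all prefer expert 0, but each expert can only take two of them.
scores = torch.tensor([[2.0, 1.0]] * 4)

print(route_top1(scores, capacity=2))
# >>> [0, 0, 1, 1]  (two tokens overflow to expert 1)
print([route_top1(scores[i:i + 1], capacity=2)[0] for i in range(4)])
# >>> [0, 0, 0, 0]  (every token gets its preferred expert when routed alone)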

Another factor could be OpenAI’s batching of queries for efficiency. The size of these batches can vary with the volume of incoming queries, which can alter the GPU computation strategies and lead to different outcomes.
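
To give a rough intuition of the batching effect (this is just a sketch of the phenomenon, not OpenAI’s actual stack), the very same activations multiplied by the very same weights can come out slightly different depending on the batch they are part of, because different shapes can trigger different GPU kernels and reduction orders. Whether a difference actually shows up depends on the kernels your hardware selects:

import torch

torch.manual_seed(0)

# Same "query" activations, same weights, computed alone vs. inside a batch.
# Requires a CUDA GPU; results depend on the kernels picked for each shape.
W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
filler = torch.randn(63, 4096, dtype=torch.float16, device="cuda")

alone = x @ W
in_batch = torch.cat([x, filler]) @ W

print(torch.equal(alone[0], in_batch[0]))
# >>> often False
print((alone[0] - in_batch[0]).abs().max().item())
# >>> a small but non-zero discrepancy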

The conversation turned intriguing when someone pointed out, “Different GPUs could also lead to different results, couldn’t they?”

When you think about it… when you use the OpenAI API, there is somewhere a remote machine that runs the computation on your behalf and returns you the result. Now, if the machine doesn’t always run on the same hardware, ultimately you won’t be getting the same output.

With that in mind other considerations can arise:

  • What if I have an LLM app in production and need to scale it to other instances with different GPUs: will it be that big of a deal?
  • What if the development environment has a GPU that is different from the one in production?

All these questions made me want to set up an experiment to highlight the phenomenon and see how significant the impact can be.

2. Setting up the experiment

To highlight the phenomenon, I’ll set up two identical environments, differing only in their GPUs: the Nvidia Tesla T4 in the first and the Nvidia A10G in the second. Then we’ll play a bit with Mistral-7b-v0.1 and see what happens.

To run the experiment in a notebook, follow these steps.

Set up the environment

1. Set the CUDA version:
!pip uninstall torchvision torchtext torchaudio torch -y
!pip install torchvision==0.16.0 torch==2.1.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html

2. Set the versions of transformers and related libraries:

!pip3 uninstall accelerate bitsandbytes transformers datasets -y
!pip3 install accelerate==0.28.0 bitsandbytes==0.43.0 transformers==4.39.3 datasets==2.18.0

3. Set random seeds:

# Setting the seed ensures consistent, reproducible results.
import random
import numpy as np
import torch
from transformers import set_seed

# Set seeds for reproducibility
random_seed = 42
np_seed = 42
torch_seed = 42
transformers_seed = 42

random.seed(random_seed)
np.random.seed(np_seed)
torch.manual_seed(torch_seed)
set_seed(transformers_seed)

Note 1: Setting only transformers.set_seed should be enough, but I wanted to be on the safe side.

Note 2: For this example, we’re using Python 3.10.

Load Mistral

To load the Mistral-7B-v0.1 model from Hugging Face, you must set your Hugging Face token in the environment variable HF_TOKEN.
We will use a quantized version of the model; that is, we lower the precision of the calculations to reduce the GPU’s memory footprint.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from time import time
import os

model_name = "mistralai/Mistral-7B-v0.1"
device = "cuda" # the device to load the model onto

# I'll keep it like this for simplicity, but it's better to put your token
# inside an .env file and load it with the load_dotenv function from the 'python-dotenv' lib.
os.environ["HF_TOKEN"] = "<YOUR HF TOKEN HERE>"

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="float16",
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="right",
    add_eos_token=False,
    add_bos_token=False,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=double_quant_config)

Using transformers pipeline

We’ll use pipeline from the transformers library to simplify generating output from the LLM.

To make generation deterministic, we want the model to consistently pick the most likely token from its vocabulary, so we can either set top_k=1 or set the temperature to a value very close to 0.

Also, for simplicity’s sake, we’ll set the max_new_tokens parameter to 1 so that the LLM completes our prompt with just one token.

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    add_special_tokens=False,
    max_new_tokens=1,
    temperature=0.00000000001,
    repetition_penalty=1.4,
)

sentence = "I enjoy walking in the"

response = pipe(sentence)[0]['generated_text']
print(response)

# >>> I enjoy walking in the woods

When we prompt the sequence “I enjoy walking in the”, the LLM outputs a single word: “woods”. If your LLM returns the same output, we can move on to the experiment.

3. The experiment results: T4 vs A10G

To access these two GPUs, I launched instances of ml.g4dn.xlarge (T4) and ml.g5.xlarge (A10G) through AWS SageMaker.

Let’s try a simple query:

# The prompt:
# Answer the question in a very concise way
# Question: What is so special with Large Langage Models ?
# Answer:
prompt = "<s>[INST]Answer the question in a very concise way[/INST] \nQuestion: What is so special with Large Langage Models ? \nAnswer:"
response = pipe(prompt)[0]['generated_text']
print(response)

The answer I get from both T4 and A10G is the same:

Question: What is so special with Large Langage Models ?  
Answer: They are able to generate text that looks like human-written. This means they can be used for many tasks, such as translation or summarization of texts (either from one language into another and vice versa). The model itself does not need any training data but only needs some examples how it should behave when generating new sentences based on its input sentence(s) which makes them much easier than other models because there's no need anymore having lots different datasets available beforehand!

So far so good. However, this was a short query. For RAG use cases, we typically send thousands of tokens. Let’s test with a larger query using the llama-2-arxiv-papers-chunked dataset hosted on Hugging Face.

In the following code, I’ll mimic how RAG works using retrieved chunks from indices 0, 4518, 4519, and 799 of the dataset. Chunks 4518 and 4519 discuss Llama 2, while the others do not. We expect the LLM to answer the question ‘What is so special about Llama 2?’ using this context. This prompt is approximately 1,400 tokens long.

# dataset
from datasets import load_dataset

dataset = load_dataset(
    "jamescalam/llama-2-arxiv-papers-chunked",
    split="train",
)

df = dataset.to_pandas()

indices = [0, 4518, 4519, 799]
retrieved_chunks = [f"title: {df.at[i, 'title']}\n{df.at[i, 'chunk']}" for i in indices]

retrieved_chunks_str = "\n\n".join(retrieved_chunks)

prompt = f"""\
<s>[INST]Context information is below.
---------------------
{retrieved_chunks_str}
---------------------
Given the context information and not prior knowledge, answer the user's query.
You should mention the titles of the documents your answer is based on.
[/INST]
## Query: What is so special about Llama 2 ?
## Answer:
"""

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    add_special_tokens=False,
    max_new_tokens=300,
    temperature=0.00000000001,
    repetition_penalty=1.4,
)

response = pipe(prompt)[0]['generated_text']
# str.strip() removes characters, not a prefix; removeprefix() drops the prompt cleanly
print(response.removeprefix(prompt))

The output of the T4 :

The main difference between Llama 2 and other large language model fine tunings like Bloom or Chinchilla is how they were trained. While these two methods train their models with data scraped off the internet, Llama 2 uses only text generated during training time itself - making sure there’s less chance of bias being introduced into the system due to external sources such as social media posts etc.. This also means you can trust what comes out more since everything inside will be consistent across different runs! Additionally, because each word token gets assigned an individual weight value instead just one global score per sentence; users get better control when trying specific tasks – something else lacking within competing systems today. Lastly yet importantly too ; unlike others who rely heavily upon expensive hardware resources needed constantly updating themselves every few months if not sooner than expected…

The output of the A10G:

The main difference between Llama 2 and other large language model fine tunings like Bloom or Chinchilla is how they were trained. While these two methods train their models with data scraped off the internet, Llama 2 uses only text generated during training time itself - making sure there’s less chance of bias being introduced into the system due to external sources such as social media posts etc.. This also means you can be more confident when asking questions related specifically towards topics covered within those texts since any biases would have already been removed beforehand! Additionally, because each word used here comes directly after another one instead having random words inserted randomly throughout sentences – meaning fewer errors occur while reading through them compared against traditional machine learning approaches where every single sentence must contain some kind of error correction mechanism built inside themselves. Finally yet importantly enough though? It doesn’t matter what type question someone asks; whether short form queries made via Twitter DM messages sent privately among friends & family members alike OR longform essays written down manually onto paper sheets then scanned digitally later…you will always get an accurate response back without fail thanks largely attributed solely too AI technology behind everything happening under hood!!!

That’s very interesting. At first glance, it isn’t noticeable since both answers start the same way. However, just after the “etc…”, they diverge.

On the T4 side: “etc… This also means you can trust what comes out more since everything inside will be consistent across different runs!…”

On the A10G side: “etc… This also means you can be more confident when asking questions related specifically towards topics covered within those texts…”

4. T4 Colab vs T4 SageMaker.

For those wondering whether two environments with the same GPU even yield the same results, I ran a test on the free tier of Colab, which provides a T4, and I also launched an ml.g4dn.xlarge (T4) notebook instance on SageMaker. The results were indeed identical.

5. Why are the answers generated by the same inputs and the same LLM so different across two GPUs?

The answers end up being quite different due to the autoregressive nature of the LLMs. Since the next token is chosen based on the previous ones, any tiny change causes a cascading reaction, leading to a butterfly effect.

Note that the answers aren’t based on the provided context as requested in the prompt. The LLM didn’t fully follow the instructions, but that’s not very important.

Because we set up the LLM to always choose the most probable token, we can be sure that the difference lies in how the probabilities are calculated across GPUs. Let’s examine those probabilities.

6. Exploring Probabilities.

To print the probability of each chosen token, we’ll bypass the pipeline and use the tokenizer and the model.generate method directly. This allows us to set return_dict_in_generate=True and output_scores=True. We can then compute the transition scores, normalize them, and convert them into probabilities.

# Move the inputs to the model's device (the quantized model lives on the GPU)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.00000000001,
    repetition_penalty=1.4,
    return_dict_in_generate=True,
    output_scores=True,
)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

input_length = inputs.input_ids.shape[1]
generated_tokens = outputs.sequences[:, input_length:]
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token id | token string | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {np.exp(score.cpu().numpy()):.2%}")

The code above will print each token’s ID, the decoded token, and its probability. I’ll include only the relevant part of the output, as the full output is quite lengthy.

T4 Output:

# T4
token id| token str | probability
----------------------------------
...
| 4345 | etc | 35.28%
| 568 | .. | 44.56%
| 851 | This | 36.57%
| 835 | also | 30.27%
| 2825 | means | 38.98%
| 368 | you | 24.24%
| 541 | can | 46.44%
| 4893 | trust | 18.74%
| 767 | what | 29.62%
| 3435 | comes | 44.17%
| 575 | out | 40.51%
...

A10G Output:

# A10G 
token id| token str | probability
----------------------------------
...
| 4345 | etc | 35.48%
| 568 | .. | 44.38%
| 851 | This | 36.40%
| 835 | also | 30.22%
| 2825 | means | 39.42%
| 368 | you | 24.29%
| 541 | can | 46.42%
| 347 | be | 18.62%
| 680 | more | 49.45%
| 10689 | confident | 57.50%
...

Okay, now it’s getting interesting. The probabilities for T4 and A10G are not exactly the same. Usually, this doesn’t affect the token ranking (you won’t notice anything in the generated sequences), but sometimes it does.

For example, on T4, “trust” has an 18.74% probability, while on A10G, “be” is favored at 18.62%. From this point, the generation will diverge due to the LLM’s autoregressive nature.

Note: Quantizing the LLM reduces calculation precision, making these differences more frequent.
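
To see how small a difference is needed, here is a tiny, made-up example (the logit values are illustrative, not taken from Mistral): a perturbation of a fraction of a percent is enough to swap two nearly tied candidates, and greedy decoding then takes a different path from that token onwards:

import torch

# Two nearly tied candidates (think "trust" vs "be") and a third, unlikely token.
logits = torch.tensor([2.31, 2.30, 0.50])
perturbed = logits + torch.tensor([0.00, 0.02, 0.00])  # a tiny GPU-dependent wobble

print(torch.argmax(logits).item())
# >>> 0  (the "trust" path)
print(torch.argmax(perturbed).item())
# >>> 1  (the "be" path)
print(torch.softmax(logits, dim=0))
print(torch.softmax(perturbed, dim=0))
# The probabilities barely move, but the greedy choice flips and generation diverges.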

Now, a legitimate question to ask is, ‘Why does the calculation differ depending on the GPU?’

7. Why does the calculation differ depending on the GPU?

I’m not a CUDA expert but I’ve done my research. The differing calculations between GPUs can be attributed to several factors:

Parallel Computation Handling:
GPUs are all about handling large amounts of computation in parallel efficiently. However, different GPUs may manage these parallel tasks differently, which affects the order of operations and memory accesses.

This matters because floating-point addition is not associative, especially when adding numbers with vastly different magnitudes, so the order in which values are combined changes the result. Non-associativity occurs when

(a + b) + c ≠ a + (b + c).


So computations are divided, processed independently, and then combined; since floating-point addition is non-associative, the way these parts are recombined affects the final result.

Here’s a simple example of a non-associative computation:

import torch
# Define three floating-point numbers in bfloat16 with a large difference in magnitude
a = torch.tensor(1e10, dtype=torch.bfloat16)
b = torch.tensor(-1e10, dtype=torch.bfloat16)
c = torch.tensor(1.0, dtype=torch.bfloat16)

# Calculate the sums in different orders
sum1 = (a + b) + c
sum2 = a + (b + c)
# Print the results in bfloat16
print(f"(a + b) + c in bfloat16: {sum1}")
# >>> 1.0
print(f"a + (b + c) in bfloat16: {sum2}")
# >>> 0.0

With LLMs, millions of calculations can lead to divergence due to small repeated inaccuracies, influencing word choice during sequence generation.
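
Here is a minimal sketch of that accumulation effect: summing the exact same numbers in two different orders, in low precision, typically does not produce the exact same total.

import torch

torch.manual_seed(0)
values = torch.randn(10_000, dtype=torch.bfloat16)

# Accumulate the same numbers forwards and backwards, in bfloat16.
total_forward = torch.tensor(0.0, dtype=torch.bfloat16)
for v in values:
    total_forward = total_forward + v

total_backward = torch.tensor(0.0, dtype=torch.bfloat16)
for v in values.flip(0):
    total_backward = total_backward + v

print(total_forward.item(), total_backward.item())
# >>> usually close, but not identical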

Hardware Architecture:
Different GPU models, such as the Nvidia Tesla T4 and Nvidia A10G, have different hardware architectures. These architectures are designed to optimise various aspects of performance, including parallel processing capabilities, memory bandwidth, and compute units.

For instance, the T4 uses the Turing architecture, while the A10G is based on the Ampere architecture.

Different architectures mean different implementations of floating-point arithmetic, memory access patterns, and other low-level operations. Even slight differences in these implementations can lead to variations in computation results.

For instance, an architecture optimized for higher precision may yield different results compared to one optimized for speed, even if both are performing the same floating-point operations.

Quantization Effects:
Quantizing a model reduces its precision to save memory and computational resources, but it also introduces additional sources of error. The impact of these errors can vary depending on the GPU’s handling of lower precision arithmetic.

Since quantization involves approximating numbers, different GPUs may handle these approximations differently, leading to variations in the probabilities of token predictions.
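
As an illustration of the rounding involved, here is a toy 4-bit quantizer (a simple symmetric scheme, not the NF4 format bitsandbytes actually uses): every weight is snapped to one of 16 levels, and that approximation error is then propagated slightly differently by each GPU.

import torch

def quantize_dequantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Toy symmetric 4-bit quantization: map weights to 16 levels, then back."""
    scale = w.abs().max() / 7  # signed 4-bit integer range: -8..7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale

torch.manual_seed(0)
w = torch.randn(8)
w_hat = quantize_dequantize_4bit(w)

print(w)
print(w_hat)
print((w - w_hat).abs().max())
# >>> the approximation error that every quantized layer carries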

8. Should I be concerned about scaling an LLM horizontally using multiple GPUs?

That’s an excellent question, thank you for asking! :)
If you are simply adding multiple instances of the same GPU (for example, scaling from a single A10G GPU to an instance with 4xA10G GPUs), is there a need for concern?

Well, there are several strategies when it comes to using multiple GPUs for inference:

The first strategy, when your model fits on one GPU, is to load a copy of the model on each GPU. For instance, if you send a list of four queries to the pipeline, each query might be processed by a different GPU. This means you’ll see the same output as you would using only one GPU, but with improved throughput.

A second strategy, typically used when a model can’t fit on one GPU, is sharding, which splits the model’s weights across GPUs. While in theory this could cause variations due to differences in computation distribution and execution, in practice (at least in my tests) sharding produced sequences and probabilities identical to those from a single GPU; my guess is that PyTorch strives for deterministic operations.
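
For reference, here is a minimal sketch of how such sharding is typically done with transformers and accelerate via device_map="auto" (assuming accelerate is installed and more than one GPU is visible):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # accelerate splits the layers across all visible GPUs
    torch_dtype="auto",
)

print(model.hf_device_map)
# >>> shows which layers landed on which GPU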

Conclusion:

We’ve shown that different GPUs can cause the LLM to output different results, even with the same environment, settings, and seed. This variability increases with longer prompts because they require more calculations, increasing the propagation of inaccuracies and promoting divergences between two GPUs. Furthermore, this effect is more pronounced with quantization.

I’m not saying that the consequences will always be catastrophic, but it’s something you should keep in mind when dealing with LLM deployments.

If you develop with a different GPU than the one used in production, set up tests to ensure the performance remains acceptable. This is also important if you plan to scale the LLM to a new instance that has different GPU(s).
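
As a starting point for such tests, here is a minimal sketch of a regression check. The reference_outputs.json file and its prompt-to-output mapping are hypothetical; exact match is the strictest criterion, and a semantic-similarity threshold is often more practical:

import json

def check_generations(pipe, reference_file: str = "reference_outputs.json") -> None:
    """Compare generations on a fixed prompt set against outputs captured on the reference GPU."""
    with open(reference_file) as f:
        references = json.load(f)  # hypothetical format: {prompt: expected_output}
    for prompt, expected in references.items():
        got = pipe(prompt)[0]["generated_text"]
        status = "OK  " if got == expected else "DIFF"
        print(f"[{status}] {prompt[:40]!r}")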

If you made it to the end, congrats! I hope you enjoyed this article. It’s my first one on Medium, so if you liked it, I would appreciate an upvote to encourage me to write more. Also, feel free to share your thoughts in the comments.

Anis Zakari

I’m a passionate ML/AI Engineer based in Paris. I am particularly interested in NLP, LLMs, Software Engineering, and Cloud Engineering.