Image created by the author using DALL-E 3 via Bing Chat

Speculative Decoding — Make LLM Inference Faster

Improve LLM inference speed by 2–3X without degrading any accuracy

Luv Bansal
6 min read · Apr 8, 2024

In this blog, we’ll discuss Speculative Decoding in detail: a method that improves LLM inference speed by around 2–3X without degrading accuracy. We’ll also implement Speculative Decoding and see how fast it is compared to a naive transformer implementation.

Speculative decoding speeds up text generation without changing the final output. It works by running two models in parallel and has been shown to deliver a 2–3X speed increase for language model inference.

Naive Autoregressive Sampling

Autoregressive Sampling

The standard way of generating text from a language model is autoregressive sampling, where decoding K tokens takes K serial runs of the model. This is exactly why inference from large autoregressive models like Transformers is slow.

In code:

def generate(prompt: str, tokens_to_generate: int) -> str:
    tokens = tokenize(prompt)
    for _ in range(tokens_to_generate):
        # Each new token requires one full forward pass of the model.
        next_token = model(tokens)
        tokens.append(next_token)
    return detokenize(tokens)

Speculative Decoding

Speculative Decoding can make an LLM run much faster without changing its results, promising 2–3X speedups of LLM inference by running two models in parallel:

  • Target Model: The main LLM we want to use for our task.
  • Small Draft Model: A smaller, lightweight LLM that runs alongside to help speed up the main LLM’s inference process.

Note

The target model and the draft model must both use the same tokenizer.
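One quick way to sanity-check this, reusing the Pythia checkpoints that appear later in this post (a purely illustrative snippet, not a required step):

from transformers import AutoTokenizer

# Illustrative check that the draft and target models share a vocabulary.
target_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
draft_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m-deduped")
assert target_tok.get_vocab() == draft_tok.get_vocab(), "tokenizers must match"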

There are two key ideas behind Speculative Decoding:

  1. In the image below, predicting the token ‘of’ is really easy and would most likely be predicted correctly by a much smaller model. The idea is therefore to use a smaller model for the easy tokens and the big model only for the more difficult ones.
Predicting the token ‘of’ is easy and can be handled by a much smaller model, while predicting ‘Edinburgh’ is comparatively difficult and a smaller model might get it wrong.

2. The second idea takes advantage of how Transformer models work. Although they typically generate one token at a time, they can process many tokens at once: while generating the next token, they compute a probability for every token in the sequence in a single forward pass. This lets the bigger model check all of the draft tokens in parallel. In the example below, the smaller model predicts “Toronto”, but the correct word is “Edinburgh”; the bigger model sees that the probability of “Toronto” is low and corrects it to “Edinburgh” (a small code sketch of this verification step follows below).

The smaller model predicts “Toronto”, but the correct word is “Edinburgh”; the bigger model sees that the probability of “Toronto” is low and corrects it to “Edinburgh”.
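To make this concrete, here is a small, illustrative sketch of the verification idea: a single forward pass of the target model returns logits for every position, so the probability the target model assigns to each draft token can be read off in one go. The prompt, the draft continuation, and the model choice below are assumptions made for this example only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative: score all draft tokens with ONE forward pass of the target model.
target_name = "EleutherAI/pythia-1.4b-deduped"
tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)

prompt_ids = tokenizer("The capital of Scotland is the city", return_tensors="pt").input_ids
draft_ids = tokenizer(" of Toronto", return_tensors="pt", add_special_tokens=False).input_ids

sequence = torch.cat([prompt_ids, draft_ids], dim=-1)
with torch.no_grad():
    logits = target(sequence).logits  # shape: (1, sequence_length, vocab_size)

# The logits at position i give the target model's distribution for token i + 1,
# so each draft token is scored by the logits of the position just before it.
num_draft = draft_ids.shape[-1]
target_probs = torch.softmax(logits[0, -num_draft - 1:-1, :], dim=-1)
for i, tok in enumerate(draft_ids[0]):
    p = target_probs[i, tok].item()
    print(f"target P({tokenizer.decode([tok.item()])!r}) = {p:.4f}")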

How Speculative Decoding Works

At the heart of the speculative decoding approach lie the following observations:

  • Hard language-modeling tasks often include easier subtasks that can be solved well by more efficient, lightweight models that take far less time to run.
  • When performing inference on an LLM, speculative decoding uses a smaller draft model to generate speculative tokens, and the target LLM then verifies those draft tokens.
  • With speculative execution, exact decoding from the large model can be produced faster: the larger model runs in parallel on the rough guesses from the smaller model, so several tokens can be generated in one forward pass of the larger model without changing the output distribution.

The speedup provided by speculative decoding heavily depends on the choice of the draft model.

Core Idea of Speculative Decoding

The core idea is:

(1) Use the more efficient small model Mq to generate γ completions.

(2) Use the target model Mp to evaluate all of the guesses and their respective probabilities from Mq in parallel, accepting all those that can lead to an identical distribution.

(3) Sample an additional token from an adjusted distribution to fix the first guess that was rejected, or to add one more token if all guesses are accepted. That way, each parallel run of the target model Mp produces at least one new token (so the number of serial runs of the target model can never, even in the worst case, exceed that of the simple autoregressive method), but it can potentially generate many new tokens, up to γ + 1, depending on how well Mq approximates Mp. A minimal code sketch of this loop follows the figure below.

Image Source: Accelerating Large Language Model Decoding with Speculative Sampling
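Here is a minimal, unoptimized sketch of one such step for Hugging Face causal LMs that share a tokenizer. The function and variable names are mine (not from the repository used below), batch size 1 is assumed, and KV caching is omitted for clarity.

import torch

def speculative_step(target, draft, input_ids, gamma=4):
    """One speculative decoding step: draft gamma tokens with the small model,
    then verify them with a single forward pass of the target model."""
    # (1) Autoregressively sample gamma draft tokens from the small model Mq.
    draft_ids = input_ids
    draft_dists = []
    for _ in range(gamma):
        with torch.no_grad():
            q = torch.softmax(draft(draft_ids).logits[:, -1, :], dim=-1)
        tok = torch.multinomial(q, num_samples=1)
        draft_dists.append(q)
        draft_ids = torch.cat([draft_ids, tok], dim=-1)

    # (2) Score the prompt plus all draft tokens with ONE forward pass of Mp.
    with torch.no_grad():
        target_probs = torch.softmax(target(draft_ids).logits, dim=-1)

    # (3) Accept each draft token x with probability min(1, p(x) / q(x)).
    n_prompt = input_ids.shape[-1]
    accepted = input_ids
    for i in range(gamma):
        tok = draft_ids[0, n_prompt + i]
        p = target_probs[0, n_prompt + i - 1, tok]
        q = draft_dists[i][0, tok]
        if torch.rand(1).item() < min(1.0, (p / q).item()):
            accepted = torch.cat([accepted, tok.view(1, 1)], dim=-1)
        else:
            # Rejected: resample from the adjusted distribution max(0, p - q).
            adjusted = torch.clamp(target_probs[0, n_prompt + i - 1] - draft_dists[i][0], min=0)
            adjusted = adjusted / adjusted.sum()
            fixed = torch.multinomial(adjusted, num_samples=1)
            return torch.cat([accepted, fixed.view(1, 1)], dim=-1)

    # All gamma drafts accepted: sample one extra token from the target model.
    extra = torch.multinomial(target_probs[0, -1], num_samples=1)
    return torch.cat([accepted, extra.view(1, 1)], dim=-1)

Each call advances the sequence by between one and γ + 1 tokens while requiring only a single forward pass of the target model, and the acceptance rule is what guarantees the overall sampling distribution matches that of the target model alone.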

Experiments

Evaluation setup of Speculative Decoding

I used an NVIDIA RTX 6000 Ada Generation GPU for running and evaluating the LLMs.

Code Implementation

Code for implementing Speculative Decoding and benchmarking it against autoregressive sampling.

I used the Speculative-Sampling GitHub repository for the implementation and benchmarking, and found it to be a very good reference for a Speculative Decoding implementation.

Speculative Decoding with Huggingface Transformers

Speculative Decoding can be easily implemented with the Hugging Face transformers library. Check out the following code snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

Parameters:

num_assistant_tokens: Defines the number of speculative tokens generated by the assistant (draft) model before being checked by the target model at each iteration. Higher values for num_assistant_tokens make generation more speculative: if the assistant model is performant, larger speed-ups can be reached; if it requires lots of corrections, the speed-up is lower.
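Depending on your transformers version, this can be set on the assistant model’s generation config before calling generate (a hedged example; check the documentation of your installed version, since defaults and available knobs have changed across releases):

# Assumption: recent transformers releases read num_assistant_tokens from the
# assistant model's generation config during assisted generation.
assistant_model.generation_config.num_assistant_tokens = 10  # draft 10 tokens per iteration
outputs = model.generate(**inputs, assistant_model=assistant_model)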

Benchmark Results


Results showing the speedup (as ratio) of speculative sampling over naive autoregressive sampling when experimenting with different Target and Draft models.

Results showing the speedup (as ratio) of speculative sampling over naive autoregressive sampling. These results are from different benchmarking runs.

The benchmark results above show that Speculative Decoding gives around a 2X speed increase over naive autoregressive sampling.
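A rough way to reproduce this kind of comparison with the Hugging Face snippet above is to time generate with and without the assistant model. This is a crude sketch (no warm-up, single run); the benchmarking repository linked above does this more carefully.

import time

def timed_generate(**kwargs):
    # Crude wall-clock timing helper for the snippet above (illustrative only).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

_, t_naive = timed_generate()
_, t_spec = timed_generate(assistant_model=assistant_model)
print(f"speculative decoding speedup: {t_naive / t_spec:.2f}x")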

Conclusions

  • The speed increase is around 2X in most cases, based on our results.
  • The draft model must be significantly smaller (10–50X smaller) to achieve inference acceleration.
  • The speedup ratio seems to increase as the target model size increases (provided the draft model is also relatively large). So the 2–2.5X speedup ratio mentioned in the DeepMind paper could also hold for a 70B target model with a 7B draft model.
  • However, in some instances the generation actually slows down. These cases need a closer look; one possible reason is that different models require specific prompt formats.

Future Work

1. Benchmark with different lookaheads

2. vLLM AND Speculative decoding Integration

Integrating speculative decoding into vLLM is in progress (I believe it’s in the final stage). I expect vLLM combined with speculative decoding to give a huge speed-up.

PR: vllm-public: [WIP] Speculative decoding using a draft model

3. Medusa

It is also a form of speculative sampling, but instead of using a separate draft model, it trains a few new parameters to predict multiple future tokens.

GitHub: Medusa repository

arXiv

Medusa adds extra “heads” to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training. During generation, these heads each produce multiple likely words for the corresponding position.

It addresses the challenges associated with speculative decoding by implementing the following ideas (a conceptual sketch of a Medusa-style head follows the list):

  • Instead of introducing a new model, we train multiple decoding heads on the same model.
  • The training is parameter-efficient so that even the “GPU-Poor” can do it. And since there is no additional model, there is no need to adjust the distributed computing setup.
  • Relaxing the requirement of matching the distribution of the original model makes non-greedy generation even faster than greedy decoding.
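To make the idea of extra decoding heads concrete, here is a conceptual sketch of what a Medusa-style head might look like. This is not the official Medusa code; the layer structure, hidden size, vocabulary size, and number of heads are placeholders.

import torch
import torch.nn as nn

class MedusaStyleHead(nn.Module):
    # One extra head: a small residual block over the base model's last hidden
    # state followed by a vocabulary projection, predicting a token k steps ahead.
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.lm_head(hidden_state + self.act(self.proj(hidden_state)))

# The frozen base LLM produces hidden_state as usual; only these heads are trained.
heads = nn.ModuleList([MedusaStyleHead(hidden_size=4096, vocab_size=32000) for _ in range(4)])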

References

Fast Inference from Transformers via Speculative Decoding (Google) - research paper

Accelerating Large Language Model Decoding with Speculative Sampling (DeepMind) - research paper

Decoding Speculative Decoding - research paper

Speculative Decoding Explained - YouTube video

Speculative Decoding: When Two LLMs are Faster than One - YouTube video

Speculative Sampling - blog post


Luv Bansal

ML Ops @Clarifai. All about Machine Learning, GenerativeAI and LLMs