Insights into Consistency Large Language Models (CLLMs)

Joe El Khoury - GenAI Engineer
10 min read · May 12, 2024


In the rapidly evolving field of artificial intelligence, the development of large language models (LLMs) like GPT (Generative Pre-trained Transformer) has dramatically changed how machines understand and generate human-like text. Traditionally, these models have decoded text sequentially, one token at a time. This approach, while effective, introduces significant delays, especially for longer texts. However, groundbreaking work by the research team from Hao-ai-lab, presented in this paper, has introduced a more efficient method known as Consistency Large Language Models (CLLMs), which promises to revolutionize this process by enabling parallel decoding.

For more information, please check this GitHub link.

Traditional Language Models and Their Limitations

Traditionally, large language models process text autoregressively, decoding each token sequentially based on the previous tokens. This method mimics how humans read and write — predicting the next word based on the previous context. However, this sequential processing can be slow, particularly for generating longer text sequences due to the linear nature of token generation:

Word_{n+1} = f(Word_1, Word_2, ..., Word_n)

where f is the predictive function of the model. This approach, known as autoregressive (AR) decoding, is traditionally visualized as generating one token at a time from a given sequence.
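
As a point of reference, here is a minimal sketch of greedy autoregressive decoding. It assumes a hypothetical causal language model that maps token ids of shape (1, seq_len) to logits of shape (1, seq_len, vocab_size), in the style of a Hugging Face model; the interface is an assumption for illustration only:

import torch

@torch.no_grad()
def autoregressive_decode(model, prompt_ids, n_new_tokens):
    # prompt_ids: (1, seq_len) tensor of token ids for the prompt.
    output_ids = prompt_ids
    for _ in range(n_new_tokens):
        logits = model(output_ids).logits            # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)    # greedy choice for the next token
        # Each forward pass yields exactly one new token.
        output_ids = torch.cat([output_ids, next_id.unsqueeze(-1)], dim=1)
    return output_ids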

Breakthrough with CLLMs: Parallel Decoding

The new family of models, called Consistency Large Language Models (CLLMs), introduces an efficient way to decode multiple tokens in parallel at each step. This parallel processing significantly reduces the overall steps needed to generate text by mimicking how humans think about entire phrases or sentences before beginning to write.

Conceptual Overview of CLLMs

CLLMs leverage the idea that pretrained LLMs can be taught to become efficient parallel decoders. By predicting chunks of words together, CLLMs minimize the latency involved in the text generation process. This approach can be conceptualized as:

Token_{n+1:n+m} = f(Token_1, Token_2, ..., Token_n)

where m is the number of tokens decoded in parallel, and f represents the CLLM's predictive function, capable of handling multiple tokens at once.
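
As a rough sketch of the contrast with sequential decoding, a parallel decoder refreshes a whole m-token block with a single forward pass. The snippet below assumes the same hypothetical causal-LM interface as before (token ids in, per-position logits out) and only illustrates the idea, not the actual CLLM code:

import torch

@torch.no_grad()
def parallel_decode_step(model, prompt_ids, draft_ids):
    # prompt_ids: (1, n) known context tokens; draft_ids: (1, m) current guess.
    input_ids = torch.cat([prompt_ids, draft_ids], dim=1)
    logits = model(input_ids).logits            # (1, n + m, vocab_size)
    # The logit at position i predicts token i + 1, so positions
    # n-1 ... n+m-2 re-predict the whole m-token block in one pass.
    start = prompt_ids.shape[1] - 1
    return logits[:, start:start + draft_ids.shape[1], :].argmax(dim=-1)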

Technical Deep Dive: Jacobi Decoding in CLLMs

One of the pivotal innovations in CLLMs is utilizing a technique known as Jacobi Decoding, inspired by numerical methods for solving systems of linear equations. This method allows the model to update its predictions for an entire block of words concurrently, rather than one word at a time.

Mathematical Formulation

Jacobi decoding transforms the sequential generation process into a parallel one by treating it as a system of (generally nonlinear) equations solved by fixed-point iteration. Suppose you have a sequence to decode; instead of processing it word by word, you iteratively solve:

x^{(k+1)} = T * x^{(k)} + c

where:
- x^{(k)} : Vector of tokens at iteration k
- T : Transformation matrix representing the transition probabilities between tokens
- c : Constant vector

In the context of CLLMs, this process is adapted to handle multiple tokens, leading to:

X_{n+1:n+m}^{(k+1)} = T * X_{n+1:n+m}^{(k)} + C

where:
- X_{n+1:n+m}^{(k)} : Block of m tokens starting from position n+1 at iteration k
- C : Adjusted constant vector for parallel decoding
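
To make the fixed-point idea concrete, here is a tiny numerical sketch of the classical linear Jacobi-style iteration x^{(k+1)} = T * x^{(k)} + c. The values are arbitrary and purely illustrative; the actual CLLM iteration operates on token sequences, not on a small matrix:

import torch

# Toy linear fixed-point iteration (illustrative values only).
T = torch.tensor([[0.1, 0.2],
                  [0.0, 0.3]])
c = torch.tensor([1.0, 2.0])

x = torch.zeros(2)                    # arbitrary initial guess
for k in range(50):
    x_new = T @ x + c                 # update every coordinate in parallel
    if torch.allclose(x_new, x):      # fixed point reached
        break
    x = x_new

print(k, x)  # converges quickly because T shrinks the error at every step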

Detailed Example of Jacobi Decoding

To better understand how Jacobi decoding works within CLLMs, let’s delve into a more detailed example. Consider the phrase “The quick brown fox jumps over the lazy dog.” Suppose we want to predict the sequence starting from “The quick brown …”. Instead of predicting each subsequent word one at a time, we aim to predict multiple words in one go.

1. Initialization: Start with the sequence “The quick brown ??? ??? ???”. Here, ‘???’ indicates unknown tokens that the model aims to predict.


# Starting sequence
initial_sequence = ["The", "quick", "brown", "???", "???", "???"]

2. First Jacobi Step:
The model looks at “The quick brown” and predicts the next three words in parallel.
Suppose its initial parallel guess is “fox”, “over”, and “the”: partly right and partly wrong, because each position is predicted without yet knowing what the other positions will be.


# First Jacobi iteration
predicted_sequence_1 = ["The", "quick", "brown", "fox", "over", "the"]

3. Refinement:
If the sequence has not converged (meaning the predicted words still change between iterations or do not yet meet a confidence threshold), the model refines these predictions.
In the next step, conditioned on the updated sequence, it corrects “over” to “jumps” and “the” to “over”.


# Second Jacobi iteration
predicted_sequence_2 = ["The", "quick", "brown", "fox", "jumps", "over"]

4. Convergence:
This process continues until the model’s predictions stabilize, producing “fox jumps over” after a few iterations, ideally in fewer steps than traditional sequential prediction.


# Converged sequence
predicted_sequence_final = ["The", "quick", "brown", "fox", "jumps", "over"]

The mathematical representation can be expanded as follows:

X_{4:6}^{(0)} = [
"???",
"???",
"???"
]

After the first Jacobi iteration:

X_{4:6}^{(1)} = T * X_{4:6}^{(0)} + C

where T and C are adjusted based on the context “The quick brown”. If the system converges after k steps:

X_{4:6}^{(k)} = [
"fox",
"jumps",
"over"
]
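
The same loop structure can be written directly over the toy string example. The predict_block function below is hypothetical (it stands in for the model's parallel re-prediction of the unknown slots) and is not part of any real library:

def jacobi_decode_block(context, guess, predict_block, max_iters=10):
    # context: known tokens, e.g. ["The", "quick", "brown"]
    # guess:   current guess for the unknown block, e.g. ["???", "???", "???"]
    for _ in range(max_iters):
        new_guess = predict_block(context, guess)  # re-predict all slots in parallel
        if new_guess == guess:                     # fixed point: nothing changed
            return new_guess
        guess = new_guess
    return guess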

Implementation of Jacobi Decoding

Below is a toy implementation of a Jacobi-style update loop using a simple feed-forward network, purely for illustration; it sketches the iterative refinement idea rather than the actual CLLM decoding code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JacobiDecoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, max_iters=10):
        super(JacobiDecoder, self).__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.max_iters = max_iters

    def forward(self, input_sequence, initial_predictions, T, C):
        """
        Jacobi-style decoding loop:
        - input_sequence: (batch, context_dim) tensor holding the fixed context
        - initial_predictions: (batch, hidden_size) initial guess for the unknown block
        - T: (hidden_size, hidden_size) transition matrix
        - C: (hidden_size,) constant vector
        input_size is expected to equal context_dim + hidden_size.
        """
        predictions = initial_predictions
        for _ in range(self.max_iters):
            # Re-encode the fixed context together with the current guess...
            hidden_state = F.relu(self.hidden(torch.cat((input_sequence, predictions), dim=1)))
            # ...and update the whole block of predictions in parallel.
            predictions = hidden_state @ T + C

        logits = self.output(predictions)
        return logits
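
A minimal usage example, with arbitrary dimensions chosen only to show the expected shapes:

# Example usage with illustrative dimensions.
context_dim, hidden_size, output_size, batch = 8, 16, 100, 2
decoder = JacobiDecoder(input_size=context_dim + hidden_size,
                        hidden_size=hidden_size,
                        output_size=output_size)

context = torch.randn(batch, context_dim)        # fixed context encoding
initial_guess = torch.zeros(batch, hidden_size)  # arbitrary starting guess
T = 0.1 * torch.eye(hidden_size)                 # toy transition matrix
C = torch.zeros(hidden_size)                     # toy constant vector

logits = decoder(context, initial_guess, T, C)   # shape: (batch, output_size)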

Improvements and Gains with CLLMs

The research teams from Shanghai Jiao Tong University and the University of California, San Diego have shown that pretrained LLMs can easily be taught to become efficient parallel decoders. This new family of parallel decoders, known as Consistency Large Language Models (CLLMs), reduces inference latency by efficiently decoding an n-token sequence at each inference step.

In their paper, the researchers state:
“Mimicking the cognitive process where humans form complete sentences in their minds before expressing them word by word can be effectively learned by simply fine-tuning pretrained LLMs.”

Specifically, CLLMs are trained to decode in parallel by mapping any randomly initialized n-token sequence to the same result that autoregressive (AR) decoding would produce, in as few steps as possible.

Experimental Results

Experiments demonstrate that CLLMs can achieve a 2.4 to 3.4 times improvement in generation speed over traditional models. This is achieved without increasing the computational overhead, a significant advantage over other fast inference methods like Medusa2 and Eagle, which often require additional memory or computational resources.

Practical Applications and Broader Implications

CLLMs are not just faster; they maintain the high quality of text generation, ensuring the generated responses are contextually relevant and coherent. This balance of speed and quality has several implications:

1. Interactive AI Systems: Faster text generation makes CLLMs ideal for real-time applications such as interactive chatbots and digital assistants.
For example, a chatbot using CLLMs could respond to user queries almost instantaneously, improving user experience and engagement.

2. Efficient Computing: By processing multiple words at once without extra memory overhead, CLLMs are more resource-efficient, making them suitable for deployment in environments with limited computational capacity.
For example, this is beneficial for mobile devices and embedded systems where resource conservation is crucial.

3. High Quality: The ability to decode in parallel while maintaining text quality means that CLLMs can be used in sensitive applications like medical diagnosis or legal advice, where accuracy is paramount.
For example, in a medical diagnosis application, CLLMs can quickly generate comprehensive reports based on symptoms described by a patient, ensuring timely and accurate diagnostic outcomes.

Real-World Application Example

Consider an AI-powered customer service chatbot using a traditional LLM. If a user asks a complex question, the bot might take several seconds to generate a response, leading to a subpar user experience. With CLLMs, the same bot could generate responses in a fraction of the time, significantly improving interaction quality.

For instance, if a user asks, “What are the store hours and do you have the new iPhone in stock?” a traditional model might sequentially predict the answer, while a CLLM could generate a complete response like “Our store is open from 9 AM to 9 PM, and yes, the new iPhone is available” much more rapidly.

Jacobi Decoding in CLLMs: Technical Insights

The researchers demonstrated that Jacobi Decoding can be used to transform the sequential generation process into a parallel process by treating it as a system of nonlinear equations. Here’s a detailed breakdown:

1. Initial Guess:
Given a prompt x and a pretrained LLM p(·|x), start by randomly guessing the next sequence of tokens (an n-token sequence, unless otherwise stated).
Then, feed the n-token sequence along with the prompt into the LLM for iterative updates.

2. Iteration and Refinement:
Continue the updates until the n-token sequence stabilizes and reaches a fixed point.
The Jacobi decoding process iteratively refines predictions until convergence.

3. Fixed Point and Jacobi Trajectory:
The n-token sequence eventually converges to the output generated by AR decoding with a greedy strategy.
The trajectory from the initial guess to the final AR result is called the “Jacobi Trajectory”.

4. One-Step Convergence:
By training the model using a loss function that encourages single-step convergence, the researchers were able to minimize the number of iterations required for convergence.

Example Implementation of Jacobi Decoding in CLLMs


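Here is a minimal sketch of the procedure described above applied to an actual causal language model. It assumes a Hugging Face-style interface (a model with config.vocab_size whose forward pass on token ids returns per-position logits) and uses greedy argmax updates; it is an illustration under those assumptions, not the authors' reference implementation:

import torch

@torch.no_grad()
def jacobi_decode(model, prompt_ids, n_tokens, max_iters=32):
    # prompt_ids: (1, prompt_len) token ids of the prompt.
    # Start from a random guess for the n unknown tokens (step 1 above).
    vocab_size = model.config.vocab_size
    guess = torch.randint(0, vocab_size, (1, n_tokens), device=prompt_ids.device)

    for _ in range(max_iters):
        input_ids = torch.cat([prompt_ids, guess], dim=1)
        logits = model(input_ids).logits
        # The logit at position i predicts token i + 1, so one forward pass
        # re-predicts every position of the guessed block (step 2 above).
        start = prompt_ids.shape[1] - 1
        new_guess = logits[:, start:start + n_tokens, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):   # fixed point reached (step 3 above)
            return new_guess
        guess = new_guess
    return guess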

Limitations of Jacobi Decoding

In practice, plain Jacobi Decoding accelerates LLMs only marginally, with an average speedup of just 1.05x. This is because it is challenging for the LLM to generate the correct tokens if errors exist in previous tokens.

1. Inaccurate Tokens: Most Jacobi iterations can only correct one token in the sequence, leading to longer trajectories.
2. Lookahead and Speculative Decoding: These methods attempt to alleviate the inefficiencies of Jacobi Decoding and traditional AR Decoding but require additional memory.

Overcoming Limitations with CLLMs

CLLMs address the limitations of Jacobi Decoding by introducing Consistency Training:

1. Jacobi Trajectory Preparation:
For each prompt, the researchers ran Jacobi decoding block by block until the entire response sequence was generated, recording the full Jacobi trajectory along the way.
Each sequence generated along the trajectory is counted as a data entry.

2. Training with Consistency and AR Losses:
The researchers combined two losses to fine-tune CLLMs:

  • Consistency Loss: Pushes the model to map any intermediate point on a Jacobi trajectory directly to the converged fixed point, so that many correct tokens are produced in a single step.
  • AR Loss: Prevents CLLMs from deviating from the target LLM to maintain generation quality.

import torch.nn as nn
import torch.nn.functional as F

class ConsistencyLoss(nn.Module):
    def __init__(self, kl_weight=0.1):
        super(ConsistencyLoss, self).__init__()
        self.kl_weight = kl_weight

    def forward(self, predictions, target_distributions):
        """
        Compute the consistency loss using KL divergence.
        - predictions: tensor of predicted logits
        - target_distributions: tensor of target probability distributions
          (e.g. the model's own output at the converged fixed point)
        """
        kl_div = F.kl_div(F.log_softmax(predictions, dim=-1),
                          target_distributions, reduction='batchmean')
        return self.kl_weight * kl_div

class ARLoss(nn.Module):
    def forward(self, predictions, targets):
        """
        Compute the AR (autoregressive) loss using cross-entropy.
        - predictions: (N, vocab_size) tensor of predicted logits
        - targets: (N,) tensor of target token ids
        """
        return F.cross_entropy(predictions, targets)
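
A hedged sketch of how the two terms might be combined during fine-tuning; the weights and input shapes are illustrative assumptions, not the paper's exact hyperparameters:

# Illustrative combination of the two objectives (weights are assumptions).
consistency_loss_fn = ConsistencyLoss(kl_weight=0.1)
ar_loss_fn = ARLoss()

def total_loss(jacobi_logits, fixed_point_probs, ar_logits, ar_targets, ar_weight=1.0):
    # jacobi_logits:     logits at an intermediate point on a Jacobi trajectory
    # fixed_point_probs: probability distribution at the converged fixed point (the target)
    # ar_logits/targets: standard next-token prediction pairs for the AR term
    return (consistency_loss_fn(jacobi_logits, fixed_point_probs)
            + ar_weight * ar_loss_fn(ar_logits, ar_targets))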

Experimental Results

The research team evaluated CLLMs on three specific domains:
1. Spider (Text-to-SQL)
2. Human-Eval (Python Code Completion) and GSM8k (Mathematics)
3. MT-bench (Open-Domain Conversations)

They used fine-tuned Deepseek-coder-7B-instruct, LLaMA-2-7B, or ABEL-7B-001 as target models, depending on the task. Both training and evaluation were conducted on NVIDIA A100 40GB servers.

Key Findings:

- Accelerated Inference: CLLMs offer a 2.4x to 3.4x speed improvement over target models and other baseline methods (Medusa2, Eagle).
- Efficient Memory Usage: No additional inference costs were incurred.

Conclusion: The Future is Parallel

The introduction of Consistency Large Language Models marks a significant evolution in the field of natural language processing. By harnessing the power of parallel decoding and Jacobi iteration, CLLMs offer a way to dramatically speed up text generation without compromising on quality. This makes them an exciting development for AI engineers and architects looking to build the next generation of intelligent applications.

As AI continues to integrate into various aspects of life, models like CLLMs ensure that systems can keep up with the demand for fast, reliable, and intelligent text generation, paving the way for more sophisticated and interactive AI systems. The future of language processing is parallel, and CLLMs are leading the way, promising enhancements in AI that could transform user interactions across various domains.

I’m Joe, and my ambition is to lead the way to industry 5.0 performance. I’m always interested in new opportunities, so don’t hesitate to contact me on my LinkedIn.
