NLP LLM Context Length Extension

Amanatullah · Published in The Deep Hub · Feb 26, 2024
  • The increasing application of Large Language Models (LLMs) across sectors has highlighted a significant challenge: their predefined context lengths. This limitation impacts efficiency, especially when applications require the processing of extensive documents or large-scale conversations. While directly training LLMs on longer contexts is a potential solution, it is not always efficient and can be resource-intensive.
  • This article will discuss various methods aimed at enhancing the context length of LLMs.
  • Context length is a key parameter that determines how useful an LLM is in practice. Reaching context lengths of up to 100K tokens is a notable milestone today, though how impressive that figure seems will inevitably shift as models and hardware progress.
  • One primary use-case for LLMs is to analyze a large set of custom data, such as company-specific documents or problem-related texts, and to answer queries specific to this dataset, rather than the generalized training data.
  • Let’s take a look at existing solutions that address context length limitations:
  • Summarization & Chained Prompts: Current approaches often involve the use of sophisticated summarization methods coupled with chained prompts.
  • Vector Databases: These are used to store embeddings of custom documents, which can then be queried based on similarity metrics.
  • Fine-Tuning with Custom Data: This method, while effective, is not universally accessible due to restrictions on certain commercial LLMs and complexities with open-source LLMs.
  • Customized LLMs: Developing smaller, data-centric LLMs is another solution, though it presents its own set of challenges.
  • Advantages of Extended Context Length:
  • An LLM with an expanded context length can offer more tailored and efficient interactions by processing user-specific data without the need for model recalibration. This on-the-fly learning approach, leveraging in-memory processing, has the potential to enhance accuracy, fluency, and creativity.
  • Analogy for Context: Similar to how computer RAM retains the operational context of software applications, an extended context length allows an LLM to maintain and process a broader scope of user data.
  • In this article, we aim to present a detailed examination of methods focused on increasing the context length, emphasizing their practical implications and benefits.

NTK-Aware Method

  • This method addresses the issue of extending the context window by considering the impact on the model’s ability to handle high-frequency components in the data.
  • In neural networks, especially those used in language processing, high-frequency components are crucial for capturing fine-grained details and nuances in the text.
  • The “NTK” in “NTK-Aware” refers to the Neural Tangent Kernel, a theoretical framework that describes how the output of a neural network changes in response to small changes in its parameters.
  • When extending the context window, the NTK-Aware method adjusts the positional encoding, rather than the model’s weights, so that the model does not lose its sensitivity to these high-frequency components. In practice this means rescaling the base of the rotary encoding so that low-frequency components are stretched far more than high-frequency ones, compensating for the loss of detail that naive stretching would cause on longer sequences.
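As a rough illustration, here is a minimal PyTorch sketch of how an NTK-aware adjustment is commonly applied to RoPE: instead of shrinking the position indices, the rotary base is enlarged so that the low-frequency (long-wavelength) components absorb most of the stretching while high-frequency components are barely touched. The function name and the scaling factor `alpha` are illustrative, not taken from any particular library.

```python
import torch

def ntk_aware_inv_freq(dim: int, base: float = 10000.0, alpha: float = 4.0) -> torch.Tensor:
    """RoPE inverse frequencies with an NTK-aware base adjustment (illustrative sketch)."""
    # Enlarge the base: base' = base * alpha^(dim / (dim - 2)).
    # Low frequencies get stretched by roughly `alpha`, high frequencies almost not at all.
    ntk_base = base * alpha ** (dim / (dim - 2))
    # Standard RoPE inverse frequencies, built from the enlarged base.
    return 1.0 / (ntk_base ** (torch.arange(0, dim, 2).float() / dim))

# Example: head dimension 128, alpha = 4 for roughly a 4x longer usable context.
inv_freq = ntk_aware_inv_freq(dim=128, alpha=4.0)
```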

Dynamic NTK Method

  • While the NTK-Aware method makes static adjustments based on NTK theory, the Dynamic NTK method takes this a step further by making these adjustments adaptable to the length of the input sequence.
  • This means that the model doesn’t just have a fixed setting for extended contexts; instead, it dynamically alters its processing based on the actual length of the sequence it’s dealing with at any given moment.
  • For instance, if the model is handling a sequence close to its original training length, the adjustments might be minimal. But as the sequence length increases, the Dynamic NTK method scales the adjustments accordingly.
  • This dynamic scaling is achieved by recomputing the positional-encoding parameters (in practice, the rotary base and hence the frequencies) as the sequence grows, rather than by changing the model’s trained weights or attention parameters.
  • The NTK-Aware method is about making specific adjustments to preserve the model’s ability to process high-frequency information in extended contexts, while the Dynamic NTK method allows these adjustments to vary depending on the length of the input, providing a more flexible and efficient way to handle varying context sizes.
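Building on the sketch above, here is what “dynamic” means in practice: the scaling factor is recomputed from the current sequence length, so sequences within the trained window are left untouched and longer ones get a progressively stronger adjustment. The formula mirrors the one quoted from the Reddit post later in this article; the names and defaults are illustrative.

```python
import torch

def dynamic_ntk_inv_freq(seq_len: int, dim: int = 128, base: float = 10000.0,
                         alpha: float = 2.0, original_max_len: int = 4096) -> torch.Tensor:
    """Recompute RoPE inverse frequencies with an alpha that grows with the sequence length."""
    if seq_len > original_max_len:
        # Dynamic NTK: effective alpha = alpha * (seq_len / original) - (alpha - 1)
        effective_alpha = alpha * seq_len / original_max_len - (alpha - 1)
        base = base * effective_alpha ** (dim / (dim - 2))
    # Within the trained window the base (and thus the encoding) is unchanged.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

short_ctx = dynamic_ntk_inv_freq(seq_len=2048)    # no adjustment
long_ctx = dynamic_ntk_inv_freq(seq_len=16384)    # stronger adjustment
```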

Extending Context Window of Large Language Models Via Position Interpolation

  • This paper introduces a technique called Position Interpolation (PI) to extend the context length of large language models (LLMs) like LLaMA without compromising their performance.
  • LLMs use positional encodings, such as RoPE (Rotary Position Embedding), to represent the order of tokens in a sequence. However, naively fine-tuning these models on longer contexts can be slow and ineffective, especially when extending the context length by a large factor (e.g., 8 times).
  • The key insight behind PI is that extrapolating positional encodings beyond the trained range can result in unstable and out-of-distribution attention scores. Instead, PI interpolates between the trained integer steps to create smooth and stable positional encodings.
  • To do this, PI downscales the positional indices before computing the positional encodings. For instance, if the original context length is 2048 and we want to handle 4096 tokens, PI rescales the indices from [0, 4096] down to [0, 2048], so they stay within the range seen during training. This effectively interpolates the positional encodings between the original integer steps, reducing the maximum relative distance and keeping the attention scores stable.
  • During fine-tuning, the model adapts quickly to the interpolated positional encodings, which are more stable than extrapolated ones. The authors prove that the interpolated attention score has a much smaller upper bound than the extrapolated attention score, ensuring that the model’s behavior remains consistent and predictable.
  • Experiments demonstrate that PI successfully extends models like LLaMA-7B to handle context lengths of up to 32768 with only 1000 training steps. Evaluations on various tasks, such as language modeling, question answering, and retrieval, confirm that the extended models effectively leverage long contexts without sacrificing performance on shorter contexts.
  • Thus, Position Interpolation offers a simple yet effective way to extend the context length of LLMs like LLaMA. By downscaling positional indices and interpolating between trained integer steps, PI creates smooth and stable positional encodings, enabling models to adapt to longer contexts without losing stability or performance.
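The core of PI can be sketched in a few lines: compute ordinary RoPE sine/cosine tables, but feed them downscaled (and therefore fractional) position indices. This is a minimal illustration assuming a pre-trained length of 2048 extended to 8192, not the authors’ reference implementation.

```python
import torch

def rope_sin_cos(positions: torch.Tensor, dim: int = 128, base: float = 10000.0):
    """Standard RoPE sine/cosine tables for the given (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(positions, inv_freq)            # (seq_len, dim / 2)
    return angles.sin(), angles.cos()

# Position Interpolation: squeeze the extended range of positions back into the
# range seen during pre-training by downscaling the indices.
original_len, extended_len = 2048, 8192
scale = original_len / extended_len                      # 0.25
positions = torch.arange(extended_len).float() * scale   # 0.0, 0.25, 0.5, ... up to 2047.75
sin, cos = rope_sin_cos(positions)
```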
  • A closely related, training-free variant, dynamic RoPE scaling, was proposed by u/emozilla on Reddit as “Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning”. It allows us to scale out the context length of models without fine-tuning by dynamically interpolating RoPE to represent longer sequences while preserving performance.
  • While it works well out of the box, performance can be further improved by additional fine-tuning. With RoPE scaling, companies can now easily extend open-source LLMs to the context lengths which work for their given use case.
  • From the Reddit post:
  • “When u/kaiokendev first posted about linearly interpolating RoPE for longer sequences, I (and a few others) had wondered if it was possible to pick the correct scale parameter dynamically based on the sequence length rather than having to settle for the fixed tradeoff of maximum sequence length vs. performance on shorter sequences. My idea was to use the exact position values for the first 2k context (after all, why mess with a good thing?) and then re-calculate the position vector for every new sequence length as the model generates token by token. Essentially, set scale to original model context length / current sequence length. This has the effect of slowly increasing scale as the sequence length increases.
  • I did some experiments and found that this has very strong performance, much better than simple linear interpolation. When u/bloc97 posted his NTK-Aware method, it was much closer to this dynamic linear scaling in terms of performance. Compared to dynamic linear scaling, NTK-Aware has higher perplexity for shorter sequences, but better perplexity at the tail end of the sequence lengths. Unfortunately, it also suffers from catastrophic perplexity blowup, just like regular RoPE and static linear scaling.
  • The main hyperparameter of NTK-Aware is α. Like static linear scaling, it represents a tradeoff between short/long sequence performance. So I thought, why not use the same dynamic scaling method with NTK-Aware? For Dynamic NTK, the scaling of α is set to (α * current sequence length / original model context length) - (α - 1). The idea again is to dynamically scale the hyperparameter as the sequence length increases.
  • This uses the same methodology as NTK-Aware (perplexity on GovReport test). You can check out all the code on GitHub.”
  • Hugging Face Transformers now supports RoPE-scaling (rotary position embeddings) to extend the context length of large language models like LLaMA, GPT-NeoX, or Falcon.
  • So in essence, RoPE scaling dynamically rescales relative position differences based on the input length, analogous to a rope stretching and contracting.
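In Hugging Face Transformers, RoPE scaling can typically be enabled through the `rope_scaling` configuration option. The snippet below is a sketch of that usage; the checkpoint name is only an example, and the exact keys and supported scaling types depend on your transformers version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # any RoPE-based checkpoint you have access to

tokenizer = AutoTokenizer.from_pretrained(model_id)
# "linear" applies Position-Interpolation-style scaling; "dynamic" applies Dynamic NTK scaling.
# factor=2.0 roughly doubles the usable context window (4096 -> ~8192 for LLaMA 2).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "dynamic", "factor": 2.0},
)
```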

Deeper Dive Into How LLaMA 2’s Context Window Is Increased

  • Why LLaMA 2 is a Preferred Choice for Large Context Windows:
  • LLaMA 2, despite initially appearing to have a smaller context window (4096 tokens, or approximately 3000 words) compared to models like ChatGPT, GPT-4, and Claude 2, offers significant advantages due to its open-source nature and its use of Rotary Positional Embeddings (RoPE).
  • Understanding the Typical Transformer Architecture:
  • Most transformer models, including LLaMA 2, consist of:
  1. Embeddings: Used to encode the text input.
  2. Transformer Blocks: Execute the primary processing tasks.
  3. Prediction Head: Tailored to the learning task at hand.
  • The context size, i.e., the amount of text the model can consider at once, is traditionally fixed by the size of the positional embedding, which is combined with the token embedding matrix to encode the input text.
  • Rotary Positional Embeddings (RoPE) in LLaMA 2:
  • LLaMA 2 uses Rotary Positional Embeddings (RoPE), distinguishing it from models that use fixed sinusoidal or learned absolute position encodings. RoPE modifies each attention layer so that the computed attention between input tokens depends solely on their distance from each other, rather than on their absolute positions in the sequence. This relative positioning allows far more flexible handling of context windows.
  • Extending the Context Window with Interpolation:
  • Meta, the developer of LLaMA 2, employs a technique to extend the context window by interpolating at non-integer positions, allowing the model to process text inputs much larger than its original window size while maintaining its performance level.
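To see why RoPE makes this possible, here is a small numerical sketch in plain PyTorch (the `rotate` helper is illustrative, not a library function): the attention score between two rotated vectors depends only on how far apart their positions are, and nothing prevents those positions from being fractional.

```python
import torch

def rotate(x: torch.Tensor, pos: float, base: float = 10000.0) -> torch.Tensor:
    """Apply a RoPE-style rotation to vector x at a (possibly non-integer) position."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    theta = pos * inv_freq
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * theta.cos() - x2 * theta.sin()
    out[..., 1::2] = x1 * theta.sin() + x2 * theta.cos()
    return out

q, k = torch.randn(64), torch.randn(64)
# Same relative distance (5 positions apart) at two different absolute offsets:
score_a = rotate(q, 10.0) @ rotate(k, 5.0)
score_b = rotate(q, 110.0) @ rotate(k, 105.0)
print(torch.allclose(score_a, score_b, atol=1e-3))  # True: only the distance matters
# Non-integer positions are perfectly valid inputs, which is what interpolation exploits:
score_c = rotate(q, 7.5) @ rotate(k, 2.5)
```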

Implementation:

  • The practical implementation of extending the context window involves rescaling the integer positions, and a minor modification to the model’s code can accomplish this. Although the model was not initially trained for the extended position range, it can be fine-tuned to adapt to the new context window and can be adjusted to the user’s needs, especially when it is fine-tuned on private data.
  • LLaMA 2’s approach to positional embeddings and its open-source nature make it a versatile choice for tasks requiring large context windows. With simple modifications and fine-tuning, it can adapt to varying needs while maintaining strong performance. The research and methodology involved can be explored further in the Position Interpolation paper discussed above.
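As a sketch of how small that code change can be, one option is to wrap whatever rotary-embedding module the model already uses and compress the position ids before they reach it. The class below and its `base_rope` interface are hypothetical; real libraries expose this differently, and recent versions of Hugging Face Transformers handle it for you via `rope_scaling`.

```python
import torch
import torch.nn as nn

class ScaledRotaryEmbedding(nn.Module):
    """Wrap an existing rotary-embedding module and rescale positions before it sees them."""

    def __init__(self, base_rope: nn.Module, scale: float):
        super().__init__()
        self.base_rope = base_rope   # hypothetical: maps position ids to RoPE sin/cos tables
        self.scale = scale           # original_context_length / extended_context_length, e.g. 0.5

    def forward(self, position_ids: torch.Tensor):
        # The "minor modification": compress positions back into the trained range.
        return self.base_rope(position_ids.float() * self.scale)
```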

Interpolation and How It Increases Context Length

  • What is interpolation at non-integer positions?
  • Interpolation is a mathematical method for determining unknown values between two known values. In the context of the LLaMA 2 model, “interpolating at non-integer positions” refers to a technique where the positions of tokens (pieces of data or text) are adjusted to fit within the model’s original context window, even if the data extends beyond it.
  • Instead of using whole numbers (integers) for positions, this method utilizes values between whole numbers (non-integers).
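A tiny worked example of what “non-integer positions” means in practice:

```python
# Squeezing 8 positions into an original window of 4 yields non-integer positions:
original_window, new_length = 4, 8
scale = original_window / new_length
positions = [i * scale for i in range(new_length)]
print(positions)  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5] -- all inside the trained range
```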

Why do they do this?

  • By using this interpolation method, LLaMA 2 can process text inputs that are much larger than its designed capacity, that is, its original window size.

What is the benefit?

  • The advantage of this technique is that despite processing larger chunks of data, the model doesn’t suffer in performance. It can handle more data while still operating effectively.
  • In simpler terms: Meta has used a clever method to let LLaMA 2 handle more data at once without slowing it down or reducing its effectiveness. They achieve this by adjusting the way data positions are calculated, using values between whole numbers.
  • The image below by Damien Benveniste illustrates the interpolation concept in detail.

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

  • The paper proposes an efficient method called LongLoRA to fine-tune large pre-trained language models like LLaMA2 to much longer context lengths, while retaining their original architectures. The key ideas are:
  1. Shift Short Attention (S2-Attn): During fine-tuning, standard full self-attention is very costly for long contexts. S2-Attn approximates the full attention using short, sparse attention within groups of tokens: it splits the sequence into groups, computes attention within each group, and shifts the groups by half a group size in half of the heads to allow information to flow across group boundaries. This is inspired by the Swin Transformer. S2-Attn enables efficient training while still allowing full attention at inference (a simplified sketch follows at the end of this section).
  2. Improved LoRA: Plain LoRA adapts only the attention weights, and for long contexts the gap to full fine-tuning remains large. LongLoRA shows that the embedding and normalization layers are key: although they hold few parameters, making them trainable bridges the gap.
  3. Compatibility with optimizations like FlashAttention-2: As S2-Attn resembles pre-training attention, optimizations like FlashAttention-2 still work at both train and inference. But many efficient attention mechanisms have large gaps to pre-training attention, making fine-tuning infeasible.
  4. Evaluation: LongLoRA extends the context of LLaMA2 7B to 100k tokens, 13B to 64k tokens, and 70B to 32k tokens on one 8x A100 machine. It achieves strong perplexity compared to full fine-tuning baselines, while being much more efficient. For example, for LLaMA2 7B with 32k context, LongLoRA reduces training time from 52 GPU hours to 24 hours.
  • The image below from the original paper displays shift short attention in action.
  • In summary, the key novelty is using shift short attention to enable efficient long context fine-tuning of pre-trained LLMs, while retaining their full attention at inference. Making select small parameter layers trainable is also an important finding. LongLoRA provides an efficient way for researchers to extend LLMs to longer contexts without extensive resources. The compatibility with optimizations is also a notable advantage.
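As promised above, here is a simplified, self-contained PyTorch sketch of the shift-short-attention idea. It ignores causal masking and the wrap-around handling at the shifted boundary, so it illustrates the grouping-and-shifting pattern rather than reproducing the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def s2_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, group_size: int) -> torch.Tensor:
    """Simplified S2-Attn: q, k, v are (batch, heads, seq_len, head_dim).

    Assumes an even number of heads and seq_len divisible by group_size.
    Half of the heads attend within plain groups; the other half attend within
    groups shifted by half a group, letting information cross group borders.
    """
    bsz, n_heads, seq_len, head_dim = q.shape
    half, shift = n_heads // 2, group_size // 2
    n_groups = seq_len // group_size

    def grouped_attn(q_, k_, v_):
        h = q_.shape[1]

        def split(x):
            # (bsz, h, seq, dim) -> (bsz * n_groups, h, group, dim): each group is its own batch entry
            return (x.reshape(bsz, h, n_groups, group_size, head_dim)
                     .transpose(1, 2)
                     .reshape(bsz * n_groups, h, group_size, head_dim))

        # Ordinary attention, restricted to each group (causal masking omitted for brevity)
        out = F.scaled_dot_product_attention(split(q_), split(k_), split(v_))
        return (out.reshape(bsz, n_groups, h, group_size, head_dim)
                   .transpose(1, 2)
                   .reshape(bsz, h, seq_len, head_dim))

    # First half of the heads: attention inside ordinary groups.
    out_plain = grouped_attn(q[:, :half], k[:, :half], v[:, :half])
    # Second half: roll the sequence by half a group, attend, then roll back.
    q_s, k_s, v_s = (x[:, half:].roll(-shift, dims=2) for x in (q, k, v))
    out_shift = grouped_attn(q_s, k_s, v_s).roll(shift, dims=2)
    return torch.cat([out_plain, out_shift], dim=1)

# Toy usage: batch 1, 8 heads, 4096 tokens, head_dim 64, groups of 1024 tokens.
q = torch.randn(1, 8, 4096, 64)
out = s2_attention(q, torch.randn_like(q), torch.randn_like(q), group_size=1024)
```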

Conclusion

  • So far we’ve seen several methodologies that help extend the context length of our LLMs: NTK-Aware and Dynamic NTK scaling, Position Interpolation and dynamic RoPE scaling, and LongLoRA. As more research emerges in this domain, we will keep this article updated with the findings!
