Extend the context length of Falcon40B to 10k

Chen Wu
11 min read · Jul 19, 2023


Introducing LongFalcon40B-OA

Chen Wu, Yin Song, Wei Yip Yap, Eden Duthie

AWS Prototyping Team

Introduction

Falcon40B, a state-of-the-art large language model developed by the Technology Innovation Institute (TII), was pre-trained on Amazon SageMaker with 40 billion parameters. It is a leading open-source Large Language Model (LLM) that demonstrates the significant potential of the open source community in driving AI innovation. Moreover, Falcon40B’s permissive Apache 2.0 license grants organisations substantial freedom to apply the model in commercial use cases, opening up a range of possibilities for industry adoption. Like most open source LLMs, Falcon40B comes with a pre-defined context window size of 2048 tokens. To enhance Falcon40B’s usefulness for customers in the telecommunication, finance, digital design, and legal industries, a context window larger than this default is needed. For instance, it is reasonable to expect an assistant (human or bot) to perform well in reading comprehension, summarisation, and question answering tasks based on a document that is 8,000 words in length.

This blog post showcases how to deploy the Falcon40B Large Language Model with a context window size of 10,000 tokens using the HuggingFace Text Generation Inference (TGI) environment, avoiding the need for fine-tuning. With just a few lines of Python code and a shell script, the Falcon40B model with the extended input context can efficiently run inference on lengthy inputs such as research papers, stories, or novels. TGI is HuggingFace’s open source LLM serving platform that powers the Hugging Face LLM Deep Learning Container on Amazon SageMaker.

To demonstrate the long context capability of our model, we work specifically with the Open-Assistant Falcon 40B Model (Falcon40B-OA), a fine-tuned version of the Falcon40B model released by OpenAssistant. Falcon40B-OA is available under the Apache 2.0 license and has proven to be effective in following human instructions and providing detailed, human-like responses.

Problem

What happens to Falcon40B when the input is longer than 2048 tokens?

To answer this question, we use the LLM prompt example given by bloc97 (shown below) to interact with the Falcon40B-OA model hosted on the TGI (v0.9.2) container running on an AWS g5.48x EC2 instance.

You are given this machine learning research paper, please read it carefully and answer the follow up question.
=== BEGIN ===
{document}
=== END OF FILE ===

Please give me a brief summary of this research paper in a few bullet points.

The {document} placeholder is filled with the entire textual content of the arXiv paper [2306.15595] Extending Context Window of Large Language Models via Positional Interpolation. The token count of this prompt is 5,775.
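
To reproduce this kind of token count for your own prompts, a short snippet using the model's tokenizer (via the HuggingFace transformers library) is enough; the file name long_prompt.txt below is just a placeholder for the assembled prompt text.

from transformers import AutoTokenizer

# Tokenizer shipped with the Open-Assistant Falcon 40B model
tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/falcon-40b-sft-top1-560")

# long_prompt.txt is a placeholder for the assembled prompt (template + document)
with open("long_prompt.txt") as f:
    prompt = f.read()

print(f"Prompt length: {len(tokenizer(prompt)['input_ids'])} tokens")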

We present the responses from the original Falcon40B-OA and our augmented LongFalcon40B-OA models side-by-side for comparison:

Falcon40B-OA’s response
not at a we" every� own?s and when would made cannot� should don need instead This through another instance). those factors within such? hass become may differentially� other when we not It all may did soа" did been todayет� when perfrom everyt may today an)’ without

LongFalcon40B-OA’s response

  • This paper presents a method called Position Interpolation (PI) for extending the context window size of large language models such as LLaMA (Touvron et al., 2023).
  • The main idea of PI is to interpolate the position embeddings of a Transformer model rather than extrapolating them, which can lead to more stable and efficient fine-tuning for longer context windows.
  • Experiments show that PI effectively extends the context window of LLaMA models to up to 32768 tokens, with improved performance on language modeling and long document summarization tasks.
  • Compared to direct fine-tuning, PI leads to faster convergence and more stable behavior, with better extrapolation performance.
  • Overall, this work provides a promising approach for extending the capabilities of large language models to longer sequences without requiring extensive re-training.

While the inference latency (time to the first token) for both models is nearly identical, around 6 seconds, the original Falcon40B-OA failed to produce a usable answer as the context provided far exceeded its window size (5775 vs. 2048). On the other hand, the LongFalcon40B-OA model was able to capture the main points of the paper and provided factual references.

By following the instructions provided in this blog post, readers will be able to deploy the LongFalcon40B-OA model and replicate the results presented above, enabling them to effectively address the context window problem.

Related Work

Recent advancements in LLMs, specifically LLaMA models, have opened new research opportunities for improving their context length and attention capabilities. One such development was first proposed by the SuperHOT author (kaiokendev), who discussed scaling down the frequency window by a factor of 0.25. This interpolation-based approach effectively increased the LLaMA context length by a factor of 4 even without fine-tuning the model, although fine-tuning further boosts the performance as reported by the author.

Building on this idea, Chen, Wong, Chen, and Tian (2023) proposed a similar approach but with a dynamically calculated scaling factor. Instead of using a fixed scale factor, the authors scaled the input position indices in a linear fashion to match the actual sequence length to be generated. Their method provides more flexibility in context length interpolation, as the scale factor is calculated as the ratio between the original maximum context length (L) and the actual sequence length (L’).
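
To make the idea concrete, here is a minimal sketch of linear position interpolation applied to RoPE angles (an illustration under the notation above, not the authors' code): position indices are compressed by L / L' so that a longer sequence still falls inside the original position range.

import torch

def linear_pi_angles(seq_len, max_positions=2048, head_dim=64, base=10000.0):
    """Rotation angles m * theta_j with linearly interpolated position indices."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    if seq_len > max_positions:
        # Compress positions by L / L' so they stay within the trained range.
        positions = positions * (max_positions / seq_len)
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)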

Another recent contribution, by bloc97, introduced a nonlinear interpolation scheme using tools from Neural Tangent Kernel (NTK) literature. This approach modifies the base of each RoPE component based on a hyperparameter alpha, allowing the network to better distinguish the order and positions of nearby tokens. The author argues that simply linearly rescaling the original space (as done in SuperHOT and Chen et al. 2023) is suboptimal, as it leads to crowding in the space of high-frequency RoPE components. The NTK-aware interpolation scheme changes the base, effectively changing the “spinning” speed of each RoPE component such that the low-frequency components slow down more than the high-frequency ones do.
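
The difference from linear rescaling can be illustrated in a few lines (again an illustration rather than bloc97's code): changing the base multiplies each component's frequency by a different factor, so the highest-frequency component keeps its per-token resolution while the lowest-frequency one slows down by roughly alpha.

import torch

head_dim, base, alpha = 64, 10000.0, 4.0

def rope_freqs(b):
    # theta_j = b^(-2(j-1)/d) for the d/2 RoPE components
    return 1.0 / (b ** (torch.arange(0, head_dim, 2).float() / head_dim))

linear_pi = rope_freqs(base) / alpha                                 # linear: every component shrinks by 1/alpha
ntk_aware = rope_freqs(base * alpha ** (head_dim / (head_dim - 2)))  # NTK-aware: change the base instead

# Highest-frequency component: linear interpolation crowds nearby tokens, NTK-aware does not.
print(linear_pi[0].item(), ntk_aware[0].item())    # 0.25 vs 1.0
# Lowest-frequency component: both approaches slow it down by roughly alpha.
print(linear_pi[-1].item(), ntk_aware[-1].item())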

Further extending this idea, emozilla developed a dynamically-scaled NTK-aware method that combines the advantages of both the linear and nonlinear interpolation methods. In this approach, the hyperparameter alpha is dynamically calculated as the scale ratio between L’ and L. This allows for improved attention capabilities while also providing greater flexibility, since the scaling adapts to the actual sequence length.

As Falcon40B is based on the Rotary Embedding, which is also used by the LLaMA model, it is natural to apply the above context expansion methods to this architecture. However, to the best of our knowledge as of this writing, no open-source implementation exists that extends Falcon40B with dynamic neural tangent kernel (dNTK) scaling for NLP tasks. Therefore, we adopt emozilla’s dynamic NTK-aware approach and integrate dNTK into Falcon40B-OA running on TGI. While the application of dNTK to Falcon40B has not been reported previously, we demonstrate the efficacy of this approach and present initial experiments using dNTK on Falcon40B-OA.

Quick-start

In this quick-start guide, we will demonstrate how to augment the Text Generation Inference (TGI) Large Language Model (LLM) serving code to enable fast Falcon40B inference with support for long input contexts. We will walk you through the step-by-step process of modifying the built-in RotaryEmbedding layer so that Falcon40B can handle long inputs while retaining TGI’s fast inference.

Step 1 — SSH onto a GPU instance (e.g. g5.48x) with sufficient GPU memory and the supporting software libraries described in the TGI Getting Started document. Since the model needs to support contexts of up to 10,000 tokens, we strongly recommend a total GPU memory capacity of >= 160 GB for a 16-bit float deployment. This is based on the per-GPU memory consumption summarised in Appendix D; the high demand for GPU memory is primarily driven by the large context size (> 10,000 tokens). For a quick comparison, a g5.12x instance with a total of 96 GiB of GPU RAM is sufficient for running the Falcon40B model with much shorter contexts (< 1,500 tokens). If you are comfortable with 8-bit inference, around 80 GB of GPU RAM (e.g. a g5.12x) is sufficient even for long contexts; however, the latency of continuous token generation with the 8-bit deployment is much higher, even on a g5.48x.
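
As a quick sanity check before launching, the GPU memory visible to PyTorch on the instance can be tallied with a few lines (this assumes PyTorch with CUDA support is already installed):

import torch

total_gib = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1024**3
print(f"{torch.cuda.device_count()} GPUs, {total_gib:.0f} GiB total GPU memory")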

Step 2 — We modify the layers.py Python module in the 0.9.2 version of TGI. Our change targets the code block that updates the cosine and sine caches in HuggingFace’s text-generation-inference model serving framework.

# Dynamic NTK-aware scaling: grow the RoPE base as the sequence length
# exceeds the original maximum number of position embeddings.
base = self.init_base * (seq_len / self.max_position_embeddings) ** (
    self.head_dim / (self.head_dim - 2)
)
# Recompute the inverse frequencies from the updated base.
self.inv_freq = 1.0 / (
    base ** (torch.arange(0, self.head_dim, 2, device=device, dtype=torch.float32) / self.head_dim)
)

In Appendix A, we discuss how this equation is derived. This change should work for any LLM that (1) uses rotary position embeddings (GPT-J, GPT-NeoX, LLaMA, etc.) and (2) has a 2048 maximum token size. The screenshot below highlights all the changes we have made to the layers.py Python module. To get the above code working, we also need to add a few member attributes to the PositionRotaryEmbedding class; readers can refer to the complete modified layers.py module file here.
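
Conceptually, the extra state that the base-update code relies on amounts to something like the sketch below. The attribute names match the snippet above, but the real PositionRotaryEmbedding class in TGI's layers.py contains more logic (cosine/sine caches, the forward pass), so treat this purely as an illustration.

import torch

class PositionRotaryEmbedding(torch.nn.Module):
    # Illustration only: attributes needed by the dynamic NTK base update.
    def __init__(self, inv_freq, head_dim, max_position_embeddings=2048, init_base=10000.0):
        super().__init__()
        self.inv_freq = inv_freq                                # current per-dimension inverse frequencies
        self.head_dim = head_dim                                # rotary dimension d
        self.max_position_embeddings = max_position_embeddings  # original context window (2048 for Falcon40B)
        self.init_base = init_base                              # original RoPE base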

Step 3 — Next we build a Docker container based on the TGI v0.9.2. The required Dockerfile for this container image has two lines:

FROM ghcr.io/huggingface/text-generation-inference:0.9.2
COPY layers.py /opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py

Building the new image involves running a one-line shell script, where we name our image falcon-lctx:0.9.2 to synchronise the version id with TGI.

docker build -t falcon-lctx:0.9.2 .

Step 4 — Launch the falcon-lctx:0.9.2 container by running the following example script

export model=OpenAssistant/falcon-40b-sft-top1-560
export num_shard=8
export volume=$PWD/data # share a volume with the Docker container
export max_input_length=16000 # assumes max_new_tokens = 384
export max_total_tokens=16384
export docker_image_id=falcon-lctx:0.9.2
docker run -d --gpus all --shm-size 1g -p 443:80 -v $volume:/data $docker_image_id \
  --model-id $model --num-shard $num_shard \
  --max-input-length $max_input_length \
  --max-total-tokens $max_total_tokens \
  --max-batch-prefill-tokens $max_total_tokens \
  --max-batch-total-tokens $max_total_tokens \
  --trust-remote-code \
  # --quantize bitsandbytes

The first line specifies the HuggingFace Hub model id. It can also be a local directory where the model has been downloaded: for example, if the model is saved under $PWD/data/my-falcon-oa, the model id should be written as export model=/data/my-falcon-oa, since the directory $PWD/data is mounted inside the Docker container under /data. Readers can uncomment the last line in order to load the model with 8-bit integers. Our profiling suggests that the peak memory consumption per GPU is around 21 GiB under 16-bit float and around 10 GiB under 8-bit int.

The launch script starts the container as a background daemon. Use docker logs {docker_ps_id} to monitor the running container. The following log messages indicate that the TGI server has started successfully.

2023-07-18T04:58:45.604351Z INFO text_generation_launcher: Starting Webserver
2023-07-18T04:58:45.818952Z WARN text_generation_router: router/src/main.rs:324: `--revision` is not set
2023-07-18T04:58:45.818986Z WARN text_generation_router: router/src/main.rs:325: We strongly advise to set it to a known supported commit.
2023-07-18T04:58:45.852549Z INFO text_generation_router: router/src/main.rs:346: Serving revision 7dfbb92e592c6f48f439c6c1472dbb24506384fc of model OpenAssistant/falcon-40b-sft-top1-560
2023-07-18T04:58:45.858712Z INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-18T04:59:06.493858Z INFO text_generation_router: router/src/main.rs:221: Connected
2023-07-18T04:59:06.493890Z WARN text_generation_router: router/src/main.rs:226: Invalid hostname, defaulting to 0.0.0.0

Step 5 — Follow the CURL command or Python API to send your long (or short) prompt to the TGI server, sit back, and receive your answer from the LongFalcon40B-OA model within a few seconds! The video below demonstrates a Web UI that uses the TGI Python API to interact with the LongFalcon40B-OA model hosted on the TGI container. The model was tasked with summarising a text excerpt of 10,023 tokens from the novel “Bleak House” by Charles Dickens, and it provided a concise and accurate summary of the novel’s background and plot. It is interesting to note that, despite the lack of background provided by the prompt, LongFalcon40B-OA identified the text as being from a novel or other literary work set in 19th century England.
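
If you prefer to call the server directly, a minimal request through the text_generation Python client (TGI's client package) could look like the following; adjust the host and port to match the -p 443:80 mapping in the launch script, and note that the prompt string is only a placeholder.

from text_generation import Client

# The launch script maps container port 80 to host port 443 (plain HTTP here).
client = Client("http://127.0.0.1:443", timeout=120)

prompt = "<|prompter|> {your long prompt here}<|endoftext|><|assistant|>"
response = client.generate(
    prompt,
    max_new_tokens=384,
    typical_p=0.2,
    truncate=16000,
    seed=1,
)
print(response.generated_text)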

Limitations

  • We are currently conducting quantitative evaluations on LongFalcon40B-OA against various benchmarks. Our preliminary findings seem to indicate that the model occasionally struggles to extract specific information when the answer lies towards the end of the long context. However, fine-tuning with longer contexts appears to mitigate these issues and lead to more robust results overall. We will report our findings to the community once the benchmarking process is complete.
  • Our solution utilises HuggingFace’s TGI serving container, which implements state-of-the-art attention mechanisms (e.g., FlashAttention and PagedAttention) to significantly improve latency and reduce memory usage. However, our current implementation supports inputs only up to ~10,000 tokens before running into GPU memory issues (these are not straightforward OOM errors). We will work with the TGI community to resolve this limitation in future iterations.
  • If we load the LongFalcon40B-OA model and perform inference using HuggingFace transformers (v4.29.2) with a 4-bit quantisation scheme, it also works and can generate useful results, as shown in Appendix C. However, it took nearly an hour to complete the answer given the 5,500-token input, and we were unable to test its performance with a 16K input. This is because the native Falcon modelling module does not (yet) fully support FlashAttention, which significantly increases the latency and memory footprint when processing large inputs. To mitigate this issue, we had to set use_cache to False, resulting in even longer inference times due to repeated computations. We therefore recommend the TGI container as a more efficient alternative for this solution.

References

For this work, we benefited greatly from reading the following references

https://kaiokendev.github.io/til#extending-context-to-8k

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

https://github.com/jquesnelle/scaled-rope/blob/master/scaled_rope/LlamaDynamicScaledRotaryEmbedding.py

https://arxiv.org/abs/2306.15595

https://kexue.fm/archives/9675

Acknowledgements

We thank our Amazonian colleagues — Verdi March and Josh Passenger — for their valuable feedback to improve this blog post.

Appendix

A — Derive the base update formula in Step 2

base = self.init_base * (seq_len / self.max_position_embeddings) ** (
    self.head_dim / (self.head_dim - 2)
)

Using the same notation as Chen, Wong, Chen, and Tian (2023), the Rotary Embedding rotates each pair of hidden dimensions by the angle m · θ_j, where m is the position index and the component frequencies are

θ_j = base^(-2(j-1)/d), for j = 1, …, d/2

with d the head dimension (self.head_dim) and base = 10000 in the original Falcon40B.

In the dNTK (by emozilla) method, the hyperparameter α is calculated dynamically as:

α = seq_len / max_position_embeddings

NTK-aware interpolation keeps the highest-frequency component (j = 1) unchanged while interpolating the lowest-frequency component (j = d/2) by a factor of α. In summary, the rule for updating the base is

base' = base · α^(d / (d - 2))

since then θ'_(d/2) = (base · α^(d/(d-2)))^(-(d-2)/d) = θ_(d/2) / α, while θ'_1 = 1 is unchanged. This is exactly what the above Python code snippet is doing.

B — TGI Client Hyper-parameters

typical_p=0.2
temperature=None  # defaults to 1 on the server side
seed=1
truncate=16000    # tokens beyond this count are truncated; we set it high for testing
                  # and suggest a smaller value (e.g. 11000) for regular workloads

All prompts are structured as follows before being sent to the LLM on the TGI server
<|prompter|> {prompt}<|endoftext|><|assistant|>
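
A tiny helper (purely illustrative) for applying this template before calling the client:

def build_prompt(user_prompt: str) -> str:
    # Wrap the raw request in the OpenAssistant chat template expected by Falcon40B-OA.
    return f"<|prompter|> {user_prompt}<|endoftext|><|assistant|>"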

C — HuggingFace transformers Implementation

We use the LLM prompt example given by bloc97 (shown in the Problem section) to interact with the Falcon40B-OA model loaded by HF Transformers running on an AWS g5.48x EC2 instance.

Falcon40B-OA’s response — 4bit quantization, Transformers 4.29.2
�� they you how this this this this them this this this this this this this this this this this this this this this this this this this this this this this this this this this this this this this this this the this this this this this this this this this this this this the the this this the same to determine this� your this the model

LongFalcon40B-OA response - 4bit quantization, Transformers 4.29.2
* Passkey retrieval experiments indicate that LLAMA models extended to 8192 context window size perform strongly outperform those extended to 2048 context window size.
* Direct fine-tuning methods.
* Long document summarization experiment.

Hyper-parameters used with Transformers 4.29.2

temperature=0.0,
top_k=20,
repetition_penalty=1.2,
pad_token_id=11, # tokenizer.eos_token,
use_cache=False
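
For completeness, here is a minimal sketch of how the model could be loaded and queried with these settings. It assumes a transformers/bitsandbytes installation with 4-bit support and sufficient CPU RAM for sharding the checkpoint; the load_in_4bit and device_map arguments are illustrative rather than a record of our exact script, and the prompt is a placeholder.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenAssistant/falcon-40b-sft-top1-560"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # the Falcon modelling code lives in the model repo
    load_in_4bit=True,       # bitsandbytes 4-bit quantisation
    device_map="auto",       # spread the shards across the available GPUs
)

prompt = "<|prompter|> {your long prompt here}<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=384,
    do_sample=False,          # temperature=0.0 above amounts to greedy decoding
    repetition_penalty=1.2,
    pad_token_id=11,          # tokenizer.eos_token
    use_cache=False,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))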

D — GPU memory usage statistics

This was measured on each rank (GPU) after the entire input context’s attention output is calculated but before the first new token is generated. Number of tokens in the input context = 16K, batch_size = 1. The numbers are averaged across all the 8 GPUs.

FP16 inference

torch.cuda.memory_allocated() = 9.99 GB
torch.cuda.memory_reserved() = 20.85 GB
torch.cuda.max_memory_reserved() = 20.84 GB

INT8 inference

torch.cuda.memory_allocated() = 5.34 GB
torch.cuda.memory_reserved() = 9.81 GB
torch.cuda.max_memory_reserved() = 13.31 GB
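
For reference, per-rank numbers like these can be collected with a small helper (a sketch; we assume it is called on each rank right after the prefill pass):

import torch

def log_gpu_memory(rank: int) -> None:
    gib = 1024 ** 3  # bytes per GiB
    print(
        f"rank {rank}: "
        f"allocated={torch.cuda.memory_allocated() / gib:.2f} GB, "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GB, "
        f"max_reserved={torch.cuda.max_memory_reserved() / gib:.2f} GB"
    )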
