Extending Context Length in Large Language Models
How to turn your Llama into a Giraffe
Context length refers to the maximum number of tokens the model can remember when generating text. A longer context window allows the model to understand long-range dependencies in text better. Models with longer contexts can build connections between ideas far apart in the text, generating more globally coherent outputs.
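To see what that limit looks like in practice, here is a minimal sketch using the Hugging Face transformers library and the public GPT-2 tokenizer; the model choice, the prompt, and the truncation step are assumptions for the example, not anything specific to Llama.

```python
# A minimal sketch, assuming the `transformers` library and the GPT-2 tokenizer,
# whose context window is 1024 tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "A long document about giraffes. " * 500  # illustrative, deliberately too long
token_ids = tokenizer(prompt)["input_ids"]

max_context = tokenizer.model_max_length  # 1024 for GPT-2
print(f"Prompt uses {len(token_ids)} tokens; the model can attend to {max_context} at once.")

# Anything beyond the context window has to be dropped; here we keep the most recent tokens.
visible_ids = token_ids[-max_context:]
```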
During training, the model processes text in fixed-length chunks, or windows. Models need to be trained on lengthy texts to actually leverage long contexts: training sequences must contain documents, books, articles, and other sources that run to thousands of tokens.
The length of training data sets a limit on usable context length.
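To make the chunking concrete, here is a minimal sketch that slices one long token stream into fixed-length training windows; the 4096-token window size and the random token ids are illustrative stand-ins, not any particular training pipeline.

```python
import random

def make_training_windows(token_ids, window_size=4096):
    """Slice one long token stream into non-overlapping fixed-length sequences."""
    return [
        token_ids[start:start + window_size]
        for start in range(0, len(token_ids) - window_size + 1, window_size)
    ]

# Dummy corpus: 100k random token ids standing in for tokenized, concatenated documents.
corpus = [random.randrange(32_000) for _ in range(100_000)]
sequences = make_training_windows(corpus, window_size=4096)
print(len(sequences), "training windows of", len(sequences[0]), "tokens each")
# -> 24 training windows of 4096 tokens each
```

If every window is only 512 tokens long, the model never sees a dependency longer than 512 tokens, no matter how large its architecture could make the context.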
So, why don’t we train models on longer sequences?
Not so fast.
Increasing context length increases the number of possible token combinations the model must learn to predict accurately.
This enables more robust long-range modeling but also requires more memory and processing power, leading to higher training costs.
Without any optimization, the attention computation scales quadratically with context length, meaning that a 4096-token model needs roughly 64 times more attention compute than a 512-token model.
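The 64x figure falls straight out of the quadratic term: the attention score matrix holds one entry per pair of tokens, so its size grows with the square of the sequence length. A quick back-of-the-envelope sketch (the sequence lengths are illustrative):

```python
# Back-of-the-envelope check of the quadratic claim: the attention score matrix
# is (sequence length x sequence length), so doubling the context quadruples it.
def attention_matrix_entries(seq_len: int) -> int:
    return seq_len * seq_len

base = attention_matrix_entries(512)
for seq_len in (512, 1024, 2048, 4096):
    ratio = attention_matrix_entries(seq_len) / base
    print(f"{seq_len:>5} tokens -> {ratio:>4.0f}x the attention work of a 512-token window")
# The last line reports 64x the work of a 512-token window.
```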