Foundation Model 101 — Is a Large Context Window a Trend?

Changsha Ma
3 min read · May 10, 2023


The context window of a large language model (LLM) is the range of tokens the model can consider when generating a response to a prompt. GPT models started with a 2K window (GPT-3) and have grown to 32K (GPT-4). The size of the context window has a significant impact on a model’s performance and usefulness across applications. With a larger context window, an LLM can handle lengthier inputs, such as entire documents, and comprehend the full scope of an article. This lets it produce more contextually relevant responses by drawing on a more comprehensive understanding of the input.
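
To make “tokens” concrete, here is a minimal sketch that counts how many tokens a prompt consumes, using the open-source tiktoken library; the prompt text and the 32K budget are illustrative assumptions (cl100k_base is the encoding used by GPT-4):

```python
# Minimal sketch: how much of a model's context window a prompt uses.
# The prompt and the 32K budget below are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

prompt = "The context window is measured in tokens, not characters."
num_tokens = len(enc.encode(prompt))

context_window = 32_000  # e.g., a GPT-4-32K-sized budget
print(f"{num_tokens} tokens used, "
      f"{context_window - num_tokens} left for context and response")
```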

Figure: context window sizes of different GPT model versions

So, is having a large context window a trend, with the idea that the bigger the window, the better? I tend to think not.

One obstacle to a large context window is that cost grows quadratically with the number of tokens: doubling the context length from 4K to 8K isn’t 2x as expensive, but 4x. Consequently, very long inputs can significantly slow down the model’s computation.
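
To see where the quadratic term comes from, here is a toy NumPy sketch of single-head self-attention (all sizes are illustrative assumptions). Vanilla attention materializes an n × n score matrix, so doubling n quadruples it:

```python
# Toy single-head self-attention in NumPy, showing where the
# quadratic cost comes from. All sizes are illustrative assumptions.
import numpy as np

def attention(q, k, v):
    # scores has shape (n, n): every token attends to every token,
    # so doubling the sequence length quadruples this matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 64
q = k = v = np.random.randn(512, d)
out = attention(q, k, v)  # works, but watch the n x n term grow:

for n in (4_000, 8_000):
    print(f"n={n:>5}: score matrix has {n * n:>12,} entries")
# n= 4000:   16,000,000 entries
# n= 8000:   64,000,000 entries -> 4x the cost for 2x the tokens
```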

There is research investigating how to make the cost near-linear in the number of tokens, which would be a significant advancement. However, I still don’t believe it is sufficient, for two reasons.
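
As one hedged illustration of this line of research, sliding-window (local) attention of the kind used in Longformer-style models lets each token attend to at most w neighbors, shrinking the score matrix from n × n to roughly n × w, which is linear in n (the window size below is an assumed value):

```python
# Sketch of why sliding-window (local) attention is near-linear.
# The window size w is an illustrative assumption.
def full_attention_entries(n: int) -> int:
    return n * n                # every token attends to all n tokens

def windowed_attention_entries(n: int, w: int = 512) -> int:
    return n * min(w, n)        # each token attends to <= w neighbors

for n in (4_000, 8_000, 16_000):
    print(f"n={n:>6}: full={full_attention_entries(n):>13,}  "
          f"windowed={windowed_attention_entries(n):>11,}")
# Full attention quadruples when n doubles; windowed attention
# merely doubles -- near-linear in the number of tokens.
```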

First, increasing the context window size doesn’t eliminate LLM hallucinations. While more context reduces errors caused by missing context, it also creates opportunities for mistakes within the context itself, such as the model latching onto an irrelevant passage in a long document.

Figure: Vygotsky’s Zone of Proximal Development (ZPD)

Second, expanding the context window alone contradicts Lev Vygotsky’s ZPD theory, a highly influential theory in education. According to ZPD theory, the key to bringing learners to the next level is to identify their zone (for LLMs, through prompt engineering) and to scaffold with tailored instruction (for LLMs, fine-tuning). A teacher wouldn’t hand a student a 100-page book (a long context) and expect them to answer arbitrary questions about it (text generation). Instead, the right way is to instruct (fine-tune) the student (the LLM) so they build the knowledge (model skills). Merely expanding the context window constrains an LLM to its “current understanding zone”: you cannot count on an LLM to execute tasks outside its zone.

Even for in-zone tasks, it doesn’t make sense that repeating similar tasks never reduces the average cost. Imagine you asked your intern to complete a task on Monday, and it took the intern 2.5 hours to get familiar with the task and half an hour to execute it. Then you asked the intern to do a similar task on Tuesday, but the intern acted as if they had never heard of it before and spent another 3 hours. It’s as though their memory reset overnight, and you’re left feeling like you’re losing your mind.
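
The intern analogy translates directly into back-of-the-envelope numbers. Here is a hedged sketch in which every price and token count is invented for illustration: a long context is re-paid on every call, while a one-time fine-tune amortizes across calls:

```python
# Back-of-the-envelope comparison of two ways to give an LLM a skill.
# All prices and token counts are invented, illustrative assumptions.

CONTEXT_TOKENS  = 30_000   # the "100-page book" re-sent on every call
TASK_TOKENS     = 500      # the actual question plus answer
PRICE_PER_TOKEN = 0.00003  # hypothetical per-token price
FINETUNE_COST   = 50.0     # hypothetical one-time fine-tuning cost

def long_context_cost(calls: int) -> float:
    # The model "re-reads the book" every single time.
    return calls * (CONTEXT_TOKENS + TASK_TOKENS) * PRICE_PER_TOKEN

def finetuned_cost(calls: int) -> float:
    # Pay once to build the skill, then send only the task.
    return FINETUNE_COST + calls * TASK_TOKENS * PRICE_PER_TOKEN

for calls in (10, 100, 1_000):
    print(f"{calls:>5} calls: long-context ${long_context_cost(calls):>8.2f}"
          f"  fine-tuned ${finetuned_cost(calls):>7.2f}")
# The long-context bill grows with every repetition; the fine-tuned
# model's average cost per task keeps falling.
```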

In summary, we need to strike a balance between building model skills and model usage, instead of relying solely on expanding the context window to millions of tokens.


Changsha Ma

AI Practitioner @ AWS | CS PhD | Connecting dots from research and open source innovations | https://www.linkedin.com/in/changsha-ma-9ba7a485/