Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction.
What is a context window?
The context window is the maximum amount of text a language model can "see" at once. It's like the model's working memory—everything it needs to consider must fit within this window.
The context window includes:
- System prompt (instructions to the model)
- Conversation history (previous messages)
- User's current input
- Any documents or context you provide
- Space for the model's response
Measured in tokens (roughly 0.75 words each), context windows range from 4,000 tokens in older models to over 1 million in newer ones.
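To see how much of the window a given prompt will use, you can count tokens before sending it. A minimal sketch using the tiktoken library (cl100k_base is an OpenAI encoding; other model families tokenize differently, so treat the count as an estimate):

```python
# Estimate how many tokens a prompt will use before sending it.
# Assumes `pip install tiktoken`; cl100k_base is the encoding used by
# several OpenAI models, so counts for other model families are approximate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

system_prompt = "You are a helpful assistant."
user_input = "Summarize the attached report in three bullet points."
document = "..."  # the document text you plan to include

total = sum(len(enc.encode(part)) for part in (system_prompt, user_input, document))
print(f"Prompt uses roughly {total} tokens")
```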
If you exceed the limit:
- Some APIs truncate older messages
- Some return an error
- Quality may degrade even before hitting the limit
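When a conversation is about to overflow the window, a common client-side strategy is to drop the oldest turns while always keeping the system prompt. A rough sketch; the token heuristic, budget, and message format are assumptions for illustration:

```python
# Trim the oldest conversation turns so the prompt fits a token budget.
# The budget and the {"role": ..., "content": ...} message shape are
# assumptions for illustration; adjust for your model and API.
def count_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Swap in a real tokenizer
    # (e.g. tiktoken) for accurate counts.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int = 8000) -> list[dict]:
    system, rest = messages[0], messages[1:]
    kept: list[dict] = []
    used = count_tokens(system["content"])
    # Walk backwards so the most recent turns are kept first.
    for msg in reversed(rest):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```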
Why does the context window matter?
Long documents Want to analyze a 100-page document? You need a model with enough context to hold it. A 4K context model can hold only about 6 pages at a time; a 200K model can hold roughly 300 pages.
Extended conversations As conversations grow, earlier messages must fit in the context. Long customer support sessions or multi-turn dialogues require larger windows.
Complex tasks Tasks requiring lots of context—comparing multiple documents, analyzing codebases, processing data—need room for all relevant information.
RAG systems Retrieved documents consume context space. Larger windows mean you can include more relevant context.
Code understanding Analyzing interconnected code files requires seeing many files simultaneously. Large context windows enable understanding entire codebases.
Context windows by model
| Model | Context Window |
|---|---|
| GPT-3.5 Turbo | 4K or 16K tokens |
| GPT-4 | 8K or 32K tokens |
| GPT-4 Turbo | 128K tokens |
| GPT-4o | 128K tokens |
| Claude 3 Haiku | 200K tokens |
| Claude 3 Sonnet | 200K tokens |
| Claude 3 Opus | 200K tokens |
| Gemini 1.5 Pro | 1M+ tokens |
| Llama 3 8B | 8K tokens |
| Llama 3 70B | 8K tokens (128K in Llama 3.1) |
What the numbers mean:
- 4K tokens ≈ 3,000 words ≈ 6 pages
- 32K tokens ≈ 24,000 words ≈ 48 pages
- 128K tokens ≈ 96,000 words ≈ 192 pages
- 200K tokens ≈ 150,000 words ≈ 300 pages
- 1M tokens ≈ 750,000 words ≈ 1,500 pages (several novels)
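These figures follow from two rough conversion factors: about 0.75 words per token and about 500 words per page. A quick sanity check of the arithmetic:

```python
# Back-of-the-envelope conversions used in the list above.
WORDS_PER_TOKEN = 0.75   # rough average for English text
WORDS_PER_PAGE = 500     # rough single-spaced page

def tokens_to_pages(tokens: int) -> float:
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE

for window in (4_000, 32_000, 128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens ~= {tokens_to_pages(window):,.0f} pages")
```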
Context window limitations
Bigger isn't always better
Lost in the middle problem: Research shows models pay more attention to the beginning and end of context, sometimes missing information in the middle. A 200K context doesn't mean perfect recall of 200K tokens.
Speed and cost: Longer contexts take longer to process and cost more. A 100K-token prompt costs roughly 100 times as much in input tokens as a 1K-token prompt, and it takes correspondingly longer to process.
Quality degradation: As context grows, response quality can decrease. The model has more information but may struggle to identify what's most relevant.
Not true memory: Context window isn't persistent memory. Each API call starts fresh—you must resend conversation history every time.
Effective context varies: A model might accept 200K tokens but perform best with 50K. Test with your actual use case.
Managing context effectively
Prioritize information Put the most important context at the beginning or end. Don't bury critical information in the middle.
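One way to apply this is to assemble prompts so that instructions and the question bracket the bulk material. A minimal sketch; the field names and layout are illustrative, not a required format:

```python
# Assemble a prompt so the most important pieces sit at the start and end,
# with bulk reference material in the middle.
def build_prompt(instructions: str, question: str, reference_docs: list[str]) -> str:
    middle = "\n\n".join(reference_docs)
    return (
        f"{instructions}\n\n"                  # critical guidance first
        f"Reference material:\n{middle}\n\n"   # bulk content in the middle
        f"Question (answer using the material above): {question}"  # critical ask last
    )
```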
Summarize history For long conversations, periodically summarize earlier exchanges rather than keeping verbatim history.
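A sketch of that pattern, assuming a placeholder `summarize` helper that would call the model to condense older turns (the turn threshold is arbitrary):

```python
# Fold older turns into a running summary once the history grows too long.
def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, send these turns to the model with a
    # "summarize this conversation" instruction and return its reply.
    return "Summary of earlier conversation: " + " / ".join(t[:40] for t in turns)

def compact_history(history: list[str], keep_recent: int = 6) -> list[str]:
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```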
Use RAG Instead of stuffing everything in context, retrieve only relevant portions of large document sets.
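A stripped-down version of the idea: score stored chunks against the query and pack only the best ones into the prompt, stopping at a token budget. Real RAG systems use embedding similarity and a vector store; the word-overlap scoring and the budget here are stand-ins:

```python
# Retrieve only the most relevant chunks instead of sending everything.
# Word-overlap scoring stands in for embedding similarity; the 2,000-token
# budget is an arbitrary assumption.
def score(query: str, chunk: str) -> int:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def select_context(query: str, chunks: list[str], budget_tokens: int = 2000) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(chunk) // 4  # rough token estimate
        if used + cost > budget_tokens:
            continue
        selected.append(chunk)
        used += cost
    return selected
```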
Chunk strategically Process long documents in chunks, synthesizing results. Map-reduce patterns work well.
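A minimal map-reduce sketch, assuming a placeholder `call_model` function for the actual LLM call and a crude character-based chunker:

```python
# Map-reduce over a long document: summarize each chunk ("map"), then
# summarize the summaries ("reduce").
def call_model(prompt: str) -> str:
    # Placeholder: replace with a real API call to your model.
    return f"[model output for a {len(prompt)}-char prompt]"

def chunk(text: str, max_chars: int = 8000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_reduce_summary(document: str) -> str:
    partials = [call_model(f"Summarize:\n\n{part}") for part in chunk(document)]    # map
    return call_model("Combine these summaries into one:\n\n" + "\n\n".join(partials))  # reduce
```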
Prune ruthlessly Remove unnecessary context. Every token you cut saves cost and removes potential noise.
Monitor usage Track how much context you're actually using. You might be surprised.
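Most provider APIs report token usage on each response, so you can log it rather than guess. A sketch assuming a usage dict with prompt and completion counts; the exact field names vary by provider, so check your client's response object:

```python
# Track context usage per request so growth is visible before it becomes
# a problem. Field names here are assumptions; adapt to your API client.
import logging

logging.basicConfig(level=logging.INFO)

def log_usage(request_id: str, usage: dict, window: int = 128_000) -> None:
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    pct = 100 * (prompt + completion) / window
    logging.info("%s: %d prompt + %d completion tokens (%.1f%% of window)",
                 request_id, prompt, completion, pct)
```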
Test at scale What works with 10K tokens might fail at 100K. Test at realistic context sizes.
The future of context windows
Context windows keep growing:
- 2022: a 4K window was standard
- 2023: 32K-100K became available
- 2024: 1M+ tokens in production
New techniques enabling longer context:
- Sparse attention: Focus on relevant parts of context, not everything
- Memory architectures: Separate long-term memory from working context
- Retrieval augmentation: Dynamic retrieval instead of static context
- Compression: Represent information more efficiently
What this enables:
- Analyzing entire codebases in one prompt
- Processing full books or research papers
- Maintaining truly long-term conversations
- Complex multi-document reasoning
The real limit: Even with infinite context, there are practical limits—cost, latency, and the model's ability to effectively use that information. Effective context management remains important regardless of window size.
Related Terms
Tokens
The basic units that language models use to process text, typically representing parts of words, whole words, or punctuation.
Large Language Model (LLM)
A neural network trained on massive text datasets that can understand and generate human-like language.
Retrieval-Augmented Generation (RAG)
A technique that enhances AI responses by retrieving relevant information from external knowledge sources before generating an answer.