Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction, including both input and output.
The context window counts everything the model sees in one interaction: the system prompt, conversation history, knowledge base content, and the model's own response. Think of it as the model's working memory. Anything that does not fit inside it is invisible to the model.
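As a rough illustration of how these pieces share one budget, the sketch below adds up the tokens used by each part and checks the total against a window size. The function names are hypothetical, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer, so treat the numbers as approximate.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not a real tokenizer)."""
    return (len(text) + 3) // 4


def fits_in_context(system_prompt, history, knowledge, max_response_tokens, context_window):
    """Return (fits, tokens_used) for one request.

    Input and output share the same window, so the space reserved
    for the model's response is counted alongside the input.
    """
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(turn) for turn in history)
    used += sum(estimate_tokens(passage) for passage in knowledge)
    used += max_response_tokens
    return used <= context_window, used


fits, used = fits_in_context(
    system_prompt="You are a helpful support agent.",
    history=["Hi, I need help with my order.", "Sure, what is the order number?"],
    knowledge=["Orders ship within two business days."],
    max_response_tokens=500,
    context_window=4096,
)
print(fits, used)
```

In practice a provider's own tokenizer gives exact counts; the structure of the check stays the same.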
Context window sizes have grown dramatically: the original GPT-3 handled 2,048 tokens (~1,500 words), GPT-3.5 handled 4,096 tokens (~3,000 words), GPT-4 Turbo supports 128,000 tokens (~96,000 words), Claude 3.5 Sonnet supports 200,000 tokens (~150,000 words), and Gemini 1.5 Pro supports up to 2,000,000 tokens (~1,500,000 words).
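The word counts quoted for these window sizes follow a common rule of thumb of roughly 0.75 English words per token. The short sketch below applies that ratio; the ratio is an approximation that varies by language, content, and tokenizer, and the function name is hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text (assumption)


def approx_words(tokens: int) -> int:
    """Convert a token count to an approximate English word count."""
    return round(tokens * WORDS_PER_TOKEN)


for tokens in (128_000, 200_000, 2_000_000):
    print(f"{tokens:,} tokens is roughly {approx_words(tokens):,} words")
```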
Larger context windows enable longer conversations without losing context, processing of entire documents or codebases, inclusion of more knowledge base content for accurate answers, and complex multi-step tasks that require extensive context.
However, more context isn't always better. Models can experience "lost in the middle" effects where information in the middle of a long context is less likely to be recalled. Processing longer contexts is also more expensive and slower.
For AI agent builders, context window management is crucial. Every conversation turn, system prompt instruction, and knowledge retrieval result consumes tokens. Effective agents use techniques like RAG to selectively include only relevant knowledge rather than stuffing the entire knowledge base into the context.
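To make the idea of selective inclusion concrete, here is a minimal sketch that ranks knowledge base passages against a query and keeps only the relevant ones that fit a token budget. It uses plain word overlap as the relevance score, which is a deliberate simplification: production RAG systems typically use embedding similarity instead. All function names are hypothetical, and the token estimate is the same 4-characters-per-token assumption, not a real tokenizer.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def overlap_score(query: str, passage: str) -> int:
    """Count distinct words shared by query and passage (stand-in for embedding similarity)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return len(query_words & passage_words)


def select_passages(query: str, passages: list[str], token_budget: int) -> list[str]:
    """Pick the highest-scoring passages that fit within token_budget."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    selected, used = [], 0
    for passage in ranked:
        if overlap_score(query, passage) == 0:
            break  # remaining passages share no words with the query
        cost = estimate_tokens(passage)
        if used + cost <= token_budget:
            selected.append(passage)
            used += cost
    return selected


knowledge_base = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "To request refunds, email support with your order number.",
]
print(select_passages("how do refunds work", knowledge_base, token_budget=1000))
```

Only the two refund passages reach the prompt; the unrelated passage never consumes tokens.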
Token optimization strategies include: concise system prompts, summarizing long conversation histories, chunking knowledge base documents, and using retrieval (RAG) to include only relevant passages.
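One simple way to keep conversation history within budget is sketched below. Note the substitution: instead of summarizing older turns, this version drops the oldest turns and keeps the most recent ones that fit. A summarizing variant would replace the dropped turns with a short recap. The function names are hypothetical, and the token estimate is a 4-characters-per-token assumption rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def trim_history(turns: list[str], token_budget: int) -> list[str]:
    """Keep the most recent turns whose combined size fits token_budget."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > token_budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "User: What plans do you offer?",
    "Agent: We offer Free, Pro, and Team plans.",
    "User: How much is Pro?",
]
print(trim_history(history, token_budget=20))
```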
Related Terms
Tokens
Fundamentals: The basic units that language models use to process text, typically words, word pieces, or characters that the model reads and generates.
Large Language Model (LLM)
Fundamentals: A neural network trained on massive text datasets that can understand and generate human-like language, powering modern AI assistants and agents.
Retrieval-Augmented Generation (RAG)
Techniques: A technique that enhances AI responses by retrieving relevant information from external knowledge sources before generating an answer.
Token Optimization
Architecture: Strategies and techniques for reducing the number of tokens consumed when interacting with AI models, lowering costs and improving performance.