Token Optimization

**Token optimization** refers to strategies and techniques for reducing the number of tokens consumed when interacting with large language models (LLMs), directly impacting both cost and performance.

Why It Matters

LLM APIs charge per token for both input and output. A single Claude Opus 4 request with a 100k-token context can cost a few dollars, and careful token usage can often reduce costs by 10-100x.
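As a rough illustration of why this adds up, the cost of a request scales linearly with token counts. The prices below are assumptions for a large frontier model; always check the provider's current rate card.

```python
# Back-of-the-envelope cost estimate for a single LLM request.
# Prices are illustrative assumptions (USD per million tokens).
INPUT_PRICE_PER_M = 15.00   # input-token rate (assumed)
OUTPUT_PRICE_PER_M = 75.00  # output tokens are typically priced higher (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

print(f"${request_cost(100_000, 2_000):.2f}")  # 100k context, 2k response -> $1.65
print(f"${request_cost(5_000, 2_000):.2f}")    # trimmed to 5k context -> ~$0.22
```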

Common Strategies

1. Targeted Retrieval (vs. Context Stuffing)

Instead of loading entire documents into context, use semantic search to retrieve only relevant snippets.

  • Tools: QMD, RAG pipelines, vector databases
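A minimal sketch of the idea: rank document chunks by similarity to the query and put only the top-k into the prompt. The `embed` function below is a toy hashing-trick bag-of-words stand-in for a real embedding model or API.

```python
# Targeted retrieval sketch: only the most relevant chunks enter the context.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy embedding (hashed bag-of-words); swap in a real embedding model/API.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    def score(chunk: str) -> float:
        c = embed(chunk)
        return float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-9))
    return sorted(chunks, key=score, reverse=True)[:k]

docs = [
    "Refunds are processed within five business days.",
    "The API rate limit is 60 requests per minute.",
    "Support is available Monday through Friday.",
]
print(top_k_chunks("How long do refunds take?", docs, k=1))
```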

2. Prompt Compression

Remove unnecessary words, whitespace, and formatting from prompts without losing meaning.
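A trivial version of this is collapsing redundant whitespace and blank lines before sending the prompt; more aggressive schemes also drop filler words or summarize context. The snippet below is only a sketch of the cheapest case.

```python
import re

def compress_prompt(prompt: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines without changing meaning."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in prompt.splitlines()]
    return "\n".join(line for line in lines if line)

raw = """
    Please   summarize the following    document.


    Keep it    short.
"""
print(compress_prompt(raw))
# Please summarize the following document.
# Keep it short.
```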

3. Caching

Cache LLM responses for repeated queries. Many providers also offer prompt caching, which bills repeated context at a reduced rate.
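A minimal local response cache, keyed on a hash of the model and prompt, might look like the sketch below; `call_llm` is a placeholder for whatever client function actually hits the API. (Provider-side prompt caching is configured separately, per API.)

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, call_llm) -> str:
    """Return a cached response for identical (model, prompt) pairs."""
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(model, prompt)  # only pay for the first call
    return _cache[key]
```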

4. Model Selection

Use smaller, cheaper models for routine tasks and reserve expensive models for complex reasoning (a simple routing sketch follows the list below).

  • Routine: Sonnet, Kimi K2, local models
  • Complex: Opus, o1-preview
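One common pattern is a small router that defaults to a cheap model and escalates only when a heuristic flags the task as complex. The keyword/length heuristic and the model names below are placeholders, not recommendations.

```python
# Illustrative model router: cheap by default, expensive only when needed.
CHEAP_MODEL = "small-model"       # placeholder name
EXPENSIVE_MODEL = "frontier-model"  # placeholder name

COMPLEX_HINTS = ("prove", "derive", "multi-step", "plan", "debug")

def pick_model(task: str) -> str:
    looks_complex = len(task) > 2000 or any(h in task.lower() for h in COMPLEX_HINTS)
    return EXPENSIVE_MODEL if looks_complex else CHEAP_MODEL

print(pick_model("Summarize this paragraph."))         # small-model
print(pick_model("Derive the closed-form solution."))  # frontier-model
```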

5. Output Length Control

Explicitly request concise responses when brevity suffices.
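In practice this means both asking for brevity in the prompt and capping output tokens in the request. The payload shape below mirrors common chat-completion APIs but is generic and illustrative; field names vary by provider.

```python
def concise_request(question: str, max_tokens: int = 150) -> dict:
    """Build a request that asks for brevity and caps billed output tokens."""
    return {
        "messages": [
            {"role": "system", "content": "Answer in at most three sentences."},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,  # hard ceiling on output length
    }

payload = concise_request("What is prompt caching?")
```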
