Tokens
The basic units that language models use to process text, typically representing parts of words, whole words, or punctuation.
What are tokens?
Tokens are the basic units that language models use to process text. Rather than reading character by character or word by word, LLMs break text into tokens—chunks that balance efficiency and meaning.
A token might be:
- A whole word: "hello" → 1 token
- Part of a word: "tokenization" → "token" + "ization" → 2 tokens
- Punctuation: "!" → 1 token
- A space: " " → often combined with the next word
Rule of thumb for English:
- 1 token ≈ 4 characters
- 1 token ≈ 0.75 words
- 100 tokens ≈ 75 words
This varies by language. Chinese, Japanese, and Korean typically use more tokens per character than English.
How does tokenization work?
Tokenization algorithms break text into tokens using learned patterns. The most common approach is Byte Pair Encoding (BPE):
- Start with individual characters as tokens
- Find the most frequent pair of tokens
- Merge that pair into a new token
- Repeat until reaching a target vocabulary size
This creates a vocabulary where:
- Common words are single tokens: "the", "is", "and"
- Less common words are split: "tokenization" → "token" + "ization"
- Rare words might be character-level: "qxyz" → "q" + "x" + "y" + "z"
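The merge loop is simple enough to sketch in a few lines of Python. This is a toy illustration of the algorithm above, not a real tokenizer; production BPE implementations (such as tiktoken's) operate on bytes and are trained on huge corpora:

from collections import Counter

def bpe_merge_step(tokens):
    # Count every adjacent pair, then fuse the most frequent one
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    best = max(pairs, key=pairs.get)
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best

tokens = list("low lower lowest")  # step 1: individual characters
for _ in range(3):                 # steps 2-4: merge the top pair, repeat
    tokens, pair = bpe_merge_step(tokens)
    print(pair, tokens)

After a few merges, frequent fragments like "low" fuse into single tokens, which is exactly how common words end up as one token each.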
Different models use different tokenizers:
- GPT-4 uses cl100k_base (~100K token vocabulary)
- Claude uses its own tokenizer
- Llama uses SentencePiece
Important: To count tokens accurately, you must use the same tokenizer the model uses. Text that is 100 tokens under GPT-4's tokenizer might be 110 tokens under another.
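You can observe this divergence directly with tiktoken, which ships several OpenAI encodings (exact counts depend on the text and encoding versions):

import tiktoken

text = "Tokenization varies between models."
for name in ("cl100k_base", "p50k_base"):  # GPT-4's encoding vs. an older one
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))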
Why do tokens matter?
Pricing
Most AI APIs charge per token:
- Input tokens (your prompt)
- Output tokens (the response)
GPT-4 might cost $0.01 per 1K input tokens and $0.03 per 1K output tokens. At those rates, a 10K-token prompt with a 2K-token response costs $0.16.
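That arithmetic is worth wrapping in a helper if you do it often. A minimal sketch, using the illustrative rates above (check your provider's current pricing):

def estimate_cost(input_tokens, output_tokens,
                  input_rate=0.01, output_rate=0.03):
    # Rates are in dollars per 1K tokens
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

print(f"${estimate_cost(10_000, 2_000):.2f}")  # $0.16, matching the example above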
Context limits
Every model has a maximum context window measured in tokens:
- GPT-3.5: 4K or 16K tokens
- GPT-4: 8K or 32K tokens (128K for GPT-4 Turbo)
- Claude 3: 200K tokens
- Gemini 1.5: 1M+ tokens
Your prompt + desired response must fit within this limit.
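A sketch of a pre-flight check along these lines, counting the prompt with tiktoken and reserving a budget for the response (the window size and reserve are assumptions you would tune per model):

import tiktoken

def fits_context(prompt, context_window, reserve_for_response=1_000):
    # True if the prompt plus a response budget fits within the model's window
    enc = tiktoken.encoding_for_model("gpt-4")
    return len(enc.encode(prompt)) + reserve_for_response <= context_window

print(fits_context("Summarize this report: ...", context_window=8_192))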
Processing speed
More tokens mean longer processing time. A 100K-token input takes longer to process than a 1K-token input, even on models that support contexts that large.
Quality implications
Very long contexts can degrade response quality. Models may "lose" information in the middle of long documents (the "lost in the middle" problem).
How to count tokens
Online tools
- OpenAI Tokenizer: platform.openai.com/tokenizer
- Anthropic Console shows token counts
- Various third-party tools
Programmatically
For OpenAI models:
import tiktoken

# Look up the encoding used by the model you're targeting
encoder = tiktoken.encoding_for_model("gpt-4")
tokens = encoder.encode("Hello, world!")
print(len(tokens))  # Output: 4
Quick estimates
- Characters ÷ 4 ≈ tokens
- Words × 1.3 ≈ tokens
- 1 page of text ≈ 500 tokens
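Those heuristics as a throwaway helper (rough English-only estimates; use a real tokenizer when precision matters):

def rough_token_estimate(text):
    # Two rules of thumb for English text; they usually land in the same ballpark
    return {"by_chars": len(text) / 4, "by_words": len(text.split()) * 1.3}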
Accurate counting matters when:
- Operating near context limits
- Optimizing for cost
- Building production applications
- Working with structured prompts
Managing token usage
Reduce input tokens:
- Summarize long documents before including them
- Include only relevant context, not everything
- Use concise prompts—every word counts
- Remove redundant instructions
Reduce output tokens:
- Ask for concise responses: "Answer in 2-3 sentences"
- Request specific formats: "Respond with only the answer"
- Use structured output to eliminate fluff
- Set the max_tokens parameter to limit response length (see the sketch below)
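For example, with the OpenAI Python client (assuming the openai package is installed and OPENAI_API_KEY is set in your environment):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize tokenization in one sentence."}],
    max_tokens=60,  # hard cap on output tokens; the reply is truncated if it runs longer
)
print(response.choices[0].message.content)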
Optimize for cost:
- Use cheaper models for simple tasks
- Cache common responses
- Batch similar requests
- Monitor usage and set alerts
Handle context limits:
- Implement chunking for long documents
- Use RAG to retrieve only relevant sections
- Summarize conversation history for long chats
- Consider models with larger context windows
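A minimal chunking sketch along those lines, splitting on characters with overlap so text cut at a boundary appears in both chunks (the sizes are illustrative; divide characters by ~4 for a rough token budget):

def chunk_text(text, chunk_size=8_000, overlap=400):
    # Character-based chunks with overlap; ~8,000 chars is roughly 2,000 tokens
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks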
Token tips and tricks
Whitespace matters
" hello" (with a leading space) and "hello" are different tokens. Inconsistent spacing can affect outputs.
Numbers are expensive
"123456789" might be 3+ tokens, while writing the same number out in words takes many more. Choose representations wisely.
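Both of the previous two tips are easy to verify with tiktoken (token IDs and counts vary by encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("hello"), enc.encode(" hello"))  # different token IDs
print(len(enc.encode("123456789")))               # a handful of tokens
print(len(enc.encode("one hundred twenty-three million")))  # noticeably more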
Code is token-hungry
Code with long variable names, comments, and whitespace uses many tokens. Minified code uses fewer but is harder to process.
JSON structure
{"name":"John","age":30}
Uses fewer tokens than:
{
  "name": "John",
  "age": 30
}
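You can measure the difference directly (exact counts depend on the encoding):

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
data = {"name": "John", "age": 30}
compact = json.dumps(data, separators=(",", ":"))
pretty = json.dumps(data, indent=2)
print(len(enc.encode(compact)), len(enc.encode(pretty)))  # compact is smaller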
Prompt caching
Some APIs (like Anthropic's) cache common prompt prefixes, reducing cost for repeated prompts with the same system instructions.
Special tokens
Models use special tokens for structure: <|system|>, <|user|>, etc. These count toward limits and aren't always visible to you.
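tiktoken treats these markers specially: by default, encode() raises an error if the input text contains one, and you must opt in to have it encoded as a special token:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Opting in maps the marker to a single reserved token ID
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))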
Related Terms
Large Language Model (LLM)
A neural network trained on massive text datasets that can understand and generate human-like language.
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction.
Embeddings
Numerical representations of text, images, or other data that capture semantic meaning in a format AI models can process.