Fundamentals

Tokens

The basic units that language models use to process text, typically representing parts of words, whole words, or punctuation.

What are tokens?

Tokens are the basic units that language models use to process text. Rather than reading character by character or word by word, LLMs break text into tokens—chunks that balance efficiency and meaning.

A token might be:

  • A whole word: "hello" → 1 token
  • Part of a word: "tokenization" → "token" + "ization" → 2 tokens
  • Punctuation: "!" → 1 token
  • A space: " " → usually attached to the following word (" world" is one token)
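
To see these splits concretely, you can run a tokenizer yourself. Below is a small sketch using OpenAI's tiktoken library and its cl100k_base encoding; other tokenizers will split the same text differently.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["hello", "tokenization", "!"]:
    ids = enc.encode(text)
    # Decode each token id back to its text piece to show the split
    print(text, "->", [enc.decode([i]) for i in ids])
# "tokenization" typically splits into two pieces, e.g. "token" + "ization"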

Rule of thumb for English:

  • 1 token ≈ 4 characters
  • 1 token ≈ 0.75 words
  • 100 tokens ≈ 75 words

This varies by language. Chinese, Japanese, and Korean typically use more tokens per character than English.

How does tokenization work?

Tokenization algorithms break text into tokens using learned patterns. The most common approach is Byte Pair Encoding (BPE):

  1. Start with individual characters as tokens
  2. Find the most frequent pair of tokens
  3. Merge that pair into a new token
  4. Repeat until reaching a target vocabulary size
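
As an illustration, here is a toy version of that merge loop in Python. It is a minimal sketch for building intuition, not how production tokenizers are actually implemented (they work on bytes and train on far larger corpora):

from collections import Counter

def bpe_train(words, num_merges):
    # Start with each word as a sequence of single characters (step 1)
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):  # step 4 is this loop
        # Count every adjacent pair of symbols across the corpus (step 2)
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Merge the winning pair into a single symbol everywhere (step 3)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_train(["low", "lower", "lowest"] * 10, num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]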

This creates a vocabulary where:

  • Common words are single tokens: "the", "is", "and"
  • Less common words are split: "tokenization" → "token" + "ization"
  • Rare words might be character-level: "qxyz" → "q" + "x" + "y" + "z"

Different models use different tokenizers:

  • GPT-4 uses cl100k_base (~100K token vocabulary)
  • Claude uses its own tokenizer
  • Llama uses SentencePiece

Important: token counts are tokenizer-specific. Text that's 100 tokens under GPT-4's tokenizer might be 110 under another, so to count accurately you must use the same tokenizer the model uses.
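
You can observe this divergence with tiktoken, which ships several encodings. A quick sketch (exact counts depend on which encodings you compare):

import tiktoken

text = "Tokenization varies between models."
for name in ["cl100k_base", "gpt2"]:
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
# The same sentence usually produces a different token count under each encoding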

Why do tokens matter?

Pricing

Most AI APIs charge per token:

  • Input tokens (your prompt)
  • Output tokens (the response)

GPT-4 might cost $0.01 per 1K input tokens and $0.03 per 1K output tokens. At those rates, a 10K-token prompt with a 2K-token response costs (10 × $0.01) + (2 × $0.03) = $0.16.
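
A small helper makes that arithmetic explicit. The rates below are the example figures above, not current pricing:

def estimate_cost(input_tokens, output_tokens, input_rate=0.01, output_rate=0.03):
    # Rates are dollars per 1K tokens
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

print(round(estimate_cost(10_000, 2_000), 2))  # 0.16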

Context limits

Every model has a maximum context window, measured in tokens:

  • GPT-3.5: 4K or 16K tokens
  • GPT-4: 8K or 32K tokens (128K for GPT-4 Turbo)
  • Claude 3: 200K tokens
  • Gemini 1.5: 1M+ tokens

Your prompt + desired response must fit within this limit.
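
Before sending a request, it can help to check the fit. A sketch, assuming tiktoken and an 8K window (use whatever window your model documents):

import tiktoken

def fits_context(prompt, max_response_tokens, context_window=8192, model="gpt-4"):
    # The prompt plus the room reserved for the response must fit in the window
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(prompt)) + max_response_tokens <= context_window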

Processing speed

More tokens mean longer processing time: a 100K-token input takes longer to process than a 1K-token input, even on models whose context window allows it.

Quality implications

Very long contexts can degrade response quality: models may "lose" information in the middle of long documents (the "lost in the middle" problem).

How to count tokens

Online tools

  • OpenAI Tokenizer: platform.openai.com/tokenizer
  • Anthropic Console shows token counts
  • Various third-party tools

Programmatically

For OpenAI models:

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
tokens = encoder.encode("Hello, world!")
print(len(tokens))  # Output: 4

Quick estimates

  • Characters ÷ 4 ≈ tokens
  • Words × 1.3 ≈ tokens
  • 1 page of text ≈ 500 tokens
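
When an exact count isn't needed, those heuristics are easy to encode. A rough sketch, valid for English text only:

def rough_token_estimate(text):
    # Average the character-based and word-based heuristics above
    by_chars = len(text) / 4
    by_words = len(text.split()) * 1.3
    return int((by_chars + by_words) / 2)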

Accurate counting matters when:

  • You're near context limits
  • Optimizing for cost
  • Building production applications
  • Working with structured prompts

Managing token usage

Reduce input tokens:

  • Summarize long documents before including them
  • Include only relevant context, not everything
  • Use concise prompts—every word counts
  • Remove redundant instructions
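
A common pattern is trimming context to a fixed token budget before it enters the prompt. A sketch using tiktoken; note that cutting a token list can leave a partial word at the boundary:

import tiktoken

def truncate_to_budget(text, budget, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    # Keep only the first `budget` tokens and decode back to text
    return enc.decode(tokens[:budget])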

Reduce output tokens:

  • Ask for concise responses: "Answer in 2-3 sentences"
  • Request specific formats: "Respond with only the answer"
  • Use structured output to eliminate fluff
  • Set the max_tokens parameter to cap response length
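
With OpenAI's Python SDK, for example, the cap looks like the sketch below (assumes the v1 SDK and an API key in your environment; the model and prompt are placeholders). Note that max_tokens truncates the reply rather than making it more concise:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize tokenization in 2-3 sentences."}],
    max_tokens=100,  # hard cap on output tokens; text beyond this is cut off
)
print(response.choices[0].message.content)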

Optimize for cost:

  • Use cheaper models for simple tasks
  • Cache common responses
  • Batch similar requests
  • Monitor usage and set alerts
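
Caching can be as simple as keying on the exact prompt text. A minimal sketch, where complete_fn is a hypothetical stand-in for your API call:

import hashlib

_cache = {}

def cached_complete(prompt, complete_fn):
    # Identical prompts are served from the cache instead of a new API call
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = complete_fn(prompt)
    return _cache[key]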

Handle context limits:

  • Implement chunking for long documents
  • Use RAG to retrieve only relevant sections
  • Summarize conversation history for long chats
  • Consider models with larger context windows
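
Token-based chunking is straightforward once you have a tokenizer. A sketch using tiktoken; real pipelines often tune the overlap so sentences aren't split awkwardly:

import tiktoken

def chunk_by_tokens(text, chunk_size=1000, overlap=100, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    step = chunk_size - overlap
    # Slide a window across the token list, decoding each slice back to text
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]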

Token tips and tricks

Whitespace matters

" hello" (with a leading space) and "hello" are different tokens, so inconsistent spacing can affect outputs.
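
You can verify this directly with tiktoken (a quick sketch using the cl100k_base encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("hello"), enc.encode(" hello"))
# The two forms encode to different token ids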

Numbers are expensive

"123456789" might be 3+ tokens, while spelling the same number out in words takes many more. Choose representations wisely.

Code is token-hungry

Code with long variable names, comments, and whitespace uses many tokens. Minified code uses fewer but is harder to process.

JSON structure

{"name":"John","age":30}

Uses fewer tokens than:

{
  "name": "John",
  "age": 30
}
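
A quick comparison with tiktoken (a sketch; exact counts depend on the tokenizer):

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
data = {"name": "John", "age": 30}
compact = json.dumps(data, separators=(",", ":"))
pretty = json.dumps(data, indent=2)
print(len(enc.encode(compact)), len(enc.encode(pretty)))
# The indented form spends extra tokens on newlines and spaces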

Prompt caching

Some APIs (Anthropic's, for example) cache common prompt prefixes, reducing cost for repeated prompts that share the same system instructions.

Special tokens

Models use special tokens for structure: <|system|>, <|user|>, and so on. These count toward limits and aren't always visible to you.