Pre-training

The initial phase of training AI models on large datasets to learn general patterns before specializing for specific tasks.

What is pre-training?

Pre-training is the initial phase where AI models learn general patterns from large, diverse datasets before being adapted for specific tasks.

The idea: Rather than training a model from scratch for each task, first train a general-purpose model on massive data. This "pre-trained" model becomes a starting point for many downstream applications.

For language models:

  • Data: Trillions of tokens from the internet, books, and code
  • Task: Predict the next word (or fill in masked words)
  • Result: Model learns language, facts, reasoning patterns

Why it works: Predicting text requires understanding language structure, world knowledge, and reasoning. A model that can accurately predict what comes next has learned a lot about how language and concepts work.
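
A toy illustration of the objective (not the architecture): next-word prediction from raw counts. A real model learns a neural approximation of this distribution over trillions of tokens:

    from collections import Counter, defaultdict

    # Tiny corpus; real pre-training data is trillions of tokens.
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count which word follows which (a bigram model).
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    # Estimate P(next word | "the") from the counts.
    total = sum(counts["the"].values())
    print({word: count / total for word, count in counts["the"].items()})
    # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}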

How pre-training works

Data collection: Gather massive datasets. For GPT-style models:

  • Web pages (Common Crawl)
  • Books and articles
  • Code repositories
  • Wikipedia
  • All filtered for quality (see the sketch below)
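
Quality filtering at this scale is mostly heuristic. A minimal sketch with made-up thresholds (production pipelines also deduplicate, detect language, and strip boilerplate):

    # Hypothetical heuristic filter; thresholds are illustrative, not from any real pipeline.
    def keep(doc: str) -> bool:
        words = doc.split()
        if len(words) < 50:                         # too short to be useful
            return False
        if len(set(words)) / len(words) < 0.3:      # highly repetitive (spam, boilerplate)
            return False
        mean_len = sum(len(w) for w in words) / len(words)
        return 2 <= mean_len <= 12                  # crude gibberish check

    raw_docs = ["buy now " * 100, "a short note"]
    corpus = [doc for doc in raw_docs if keep(doc)]   # both toy examples fail the checks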

Training objective:

Causal language modeling (GPT-style): Predict the next token given previous tokens. "The cat sat on the ___" → "mat"
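
In code, the causal objective is just cross-entropy between each position's prediction and the token that actually follows. A sketch in PyTorch, with random logits standing in for a real model's output:

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 100, 8
    tokens = torch.randint(0, vocab_size, (1, seq_len))   # toy token ids
    logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model output

    # Position t predicts token t+1, so shift the targets by one.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )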

Masked language modeling (BERT-style): Predict masked tokens from context. "The cat [MASK] on the mat" → "sat"
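
The masked objective differs only in how inputs and targets are built: corrupt some positions, predict the originals there, ignore the rest. A simplified sketch (BERT's actual recipe masks 15% of tokens with an 80/10/10 mask/random/keep split, skipped here):

    import torch

    vocab_size, mask_id = 100, 0
    tokens = torch.randint(1, vocab_size, (1, 12))   # toy token ids (0 reserved for [MASK])

    is_masked = torch.rand(tokens.shape) < 0.15      # choose ~15% of positions
    inputs = tokens.clone()
    inputs[is_masked] = mask_id                      # the model sees [MASK] here
    labels = tokens.clone()
    labels[~is_masked] = -100                        # -100 is ignored by cross_entropy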

Training process (see the sketch after these steps):

  1. Tokenize text into integer token IDs
  2. Feed through neural network
  3. Compare prediction to actual next token
  4. Calculate loss (how wrong)
  5. Backpropagate to update weights
  6. Repeat billions of times
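
A minimal loop covering steps 1-6, assuming a stand-in model (a real LM uses a Transformer, and the random tokens in step 1 would come from a tokenizer):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, d_model = 100, 32
    # Stand-in model: embedding + linear head instead of a full Transformer.
    model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(1000):                            # 6. real runs repeat billions of times
        tokens = torch.randint(0, vocab_size, (8, 16))  # 1. tokenized batch (random here)
        logits = model(tokens)                          # 2. forward pass
        loss = F.cross_entropy(                         # 3-4. compare to next token, compute loss
            logits[:, :-1].reshape(-1, vocab_size),
            tokens[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()                                 # 5. backpropagate
        opt.step()                                      # ...and update the weights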

Scale:

  • GPT-3: ~300 billion tokens, thousands of GPUs, weeks of training
  • GPT-4: estimated trillions of tokens

What pre-training teaches

Language structure: Grammar, syntax, common phrases, writing styles. Models learn to produce fluent text.

World knowledge: Facts, concepts, relationships. "Paris is the capital of France" encoded in weights.

Reasoning patterns: Logical inference, cause and effect, problem-solving approaches.

Task patterns: Question-answer format, instruction following, summarization. Models see many examples of each.

Limitations:

  • Knowledge cutoff: Only knows what was in training data
  • Biases: Reflects biases in training data
  • Hallucination: Can generate plausible but false information
  • No real understanding: Pattern matching, not true comprehension

Pre-trained models are remarkably capable but have consistent failure modes that downstream applications must address.

After pre-training

Raw pre-trained model: Can complete text, but is not optimized for following instructions or being helpful.

Instruction fine-tuning: Train on instruction-response pairs, e.g. "Summarize this article: [text]" → "[summary]". This makes the model better at following directions.
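
Concretely, each example pairs a prompt with a desired response, flattened into a single sequence. The template below is illustrative; real datasets and chat formats vary by model:

    # Illustrative record and template, not any specific dataset's format.
    example = {
        "instruction": "Summarize this article:",
        "input": "[article text]",
        "output": "[summary]",
    }
    text = (
        f"{example['instruction']}\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    # Fine-tuning applies the same next-token loss, often only over the response tokens.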

RLHF (Reinforcement Learning from Human Feedback): Human raters compare outputs. Model learns to prefer responses humans rate higher.
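
At the heart of RLHF is a reward model trained on those comparisons. A minimal sketch of the standard pairwise (Bradley-Terry) loss, with made-up scores:

    import torch
    import torch.nn.functional as F

    # Reward-model scores for (chosen, rejected) response pairs; numbers are stand-ins.
    r_chosen = torch.tensor([1.2, 0.3, 2.1])
    r_rejected = torch.tensor([0.4, 0.9, 1.5])

    # Pairwise loss: pushes chosen scores above rejected ones.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()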

Constitutional AI: Train model to evaluate its own outputs against principles and improve.

Domain fine-tuning: Specialize for specific domains: medical, legal, code, etc.

The pipeline: Pre-training → Instruction tuning → RLHF → (Optional: Domain fine-tuning)

ChatGPT, Claude, and other assistants go through all these stages. The pre-trained model is just the starting point.

Pre-training in practice

Who does pre-training: only organizations with massive resources, including:

  • OpenAI (GPT series)
  • Anthropic (Claude)
  • Google (Gemini)
  • Meta (Llama)
  • Mistral, Cohere, etc.

Cost:

  • GPT-3: ~$4.6M compute cost
  • GPT-4: Estimated $50-100M+
  • Requires specialized infrastructure, data pipelines, engineering

Most organizations don't pre-train: instead, they use existing pre-trained models (see the sketch after this list) via:

  • APIs (OpenAI, Anthropic)
  • Open-source models (Llama, Mistral)
  • Fine-tuning existing models
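
For example, loading an open-weights model with the Hugging Face transformers library (the model name is illustrative; some models require accepting a license first):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"   # example open-weights model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tok("The capital of France is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    print(tok.decode(out[0], skip_special_tokens=True))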

When pre-training makes sense:

  • Control over training data (privacy, quality)
  • A unique domain that existing models cover poorly
  • Cost optimization at extreme scale
  • Research purposes

For 99% of applications: Start with existing pre-trained models and adapt through prompting, RAG, or fine-tuning.