Architecture

Attention Mechanism

A technique that allows AI models to focus on the relevant parts of the input while processing it, enabling better understanding of context and relationships.

What is the attention mechanism?

The attention mechanism is a technique that allows AI models to focus on relevant parts of the input when producing each part of the output.

The problem it solves: Earlier models processed sequences left-to-right and tended to forget earlier content. For "The cat sat on the mat because it was comfortable," resolving "it" requires looking back to "cat" or "mat" from earlier in the sentence.

How attention helps: Instead of fixed left-to-right processing, attention lets the model look at all positions and decide which are relevant for each output.

Intuition: When you read "it" in the sentence above, your brain automatically thinks back to "cat" or "mat" to understand the reference. Attention models this: when processing "it," the model attends to "cat" (or "mat").

The 2017 paper "Attention Is All You Need" showed that attention alone, without recurrence, could power state-of-the-art sequence models. The resulting transformer architecture now underlies virtually all major LLMs.

How attention works

Query, Key, Value: Each token gets three representations:

  • Query (Q): "What am I looking for?"
  • Key (K): "What information do I have?"
  • Value (V): "What should I return?"

The process:

  1. For each position, compute attention scores: how much should this position attend to every other position?
  2. Scores = Query · Key (dot product measures similarity)
  3. Normalize scores (softmax) so they sum to 1
  4. Weighted sum of Values using scores

Example: Processing "it" in "The cat sat because it was tired":

  • "it"'s Query matches "cat"'s Key strongly
  • High attention score for "cat"
  • Output heavily influenced by "cat"'s Value

Formula:

Attention(Q, K, V) = softmax(QK^T / √d) × V

The √d scaling keeps the dot products from growing too large as the dimension increases; without it, the softmax saturates and gradients become very small.
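A minimal NumPy sketch of this formula (the token count, dimensions, and random projection matrices below are illustrative, not taken from any real model):

  import numpy as np

  def softmax(x, axis=-1):
      # Subtract the row max before exponentiating, for numerical stability
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      # Q: (seq_q, d), K: (seq_k, d), V: (seq_k, d_v)
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)          # step 2: query-key similarity
      weights = softmax(scores, axis=-1)     # step 3: each row sums to 1
      return weights @ V, weights            # step 4: weighted sum of Values

  # Toy example: 5 tokens with 8-dimensional representations
  rng = np.random.default_rng(0)
  x = rng.normal(size=(5, 8))
  W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
  out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
  print(weights.shape)   # (5, 5): one attention distribution per token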

Multi-head attention

The insight: Different kinds of relationships matter at the same time (subject-verb agreement, coreference, syntactic structure), and a single attention pattern can't capture all of them.

Multi-head attention: Run multiple attention computations in parallel, each with different learned parameters.

Example:

  • Head 1: Tracks subject-verb relationships
  • Head 2: Tracks adjective-noun relationships
  • Head 3: Tracks coreference (pronouns to referents)
  • Head 4: Tracks sentiment-relevant words

In practice: GPT-3 has 96 attention heads per layer. Each head learns different patterns; their outputs are concatenated and combined through a learned projection.

Why it works: No single attention pattern works for all relationships. Multiple heads capture different aspects of how words relate.
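Continuing the NumPy sketch above, a rough illustration of the split-into-heads / concatenate pattern (real implementations vectorize this, and the output projection W_o is learned):

  def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
      # x: (seq, d_model); each weight matrix: (d_model, d_model)
      seq, d_model = x.shape
      d_head = d_model // num_heads
      # Project, then split the last dimension into per-head chunks
      split = lambda t: t.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
      Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
      # Each head runs the same attention computation on its own slice
      heads = [attention(Q[h], K[h], V[h])[0] for h in range(num_heads)]
      # Concatenate head outputs and mix them with a learned projection
      return np.concatenate(heads, axis=-1) @ W_o

  W_o = rng.normal(size=(8, 8))
  y = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=2)
  print(y.shape)   # (5, 8): same shape as the input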

Self-attention in transformers

Self-attention: Each position attends to all positions in the same sequence (including itself).

Cross-attention: Positions in one sequence attend to positions in another (used in encoder-decoder models).
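Reusing the attention function and toy tensors from the sketch above, the only difference between the two is where Q, K, and V come from (the encoder/decoder inputs here are made up for illustration):

  # Self-attention: Q, K, and V all come from the same sequence
  self_out, _ = attention(x @ W_q, x @ W_k, x @ W_v)

  # Cross-attention: Q from the decoder sequence, K and V from the encoder output
  decoder_x = rng.normal(size=(3, 8))   # e.g. 3 tokens generated so far
  encoder_x = rng.normal(size=(7, 8))   # e.g. 7 encoded input tokens
  cross_out, _ = attention(decoder_x @ W_q, encoder_x @ W_k, encoder_x @ W_v)
  print(cross_out.shape)   # (3, 8): one output per decoder position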

In transformers: Self-attention is the core operation. Each layer:

  1. Self-attention
  2. Residual connection
  3. Layer normalization
  4. Feed-forward network
  5. Residual connection
  6. Layer normalization
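A simplified sketch of that ordering, continuing the NumPy code above (this is the original post-norm layout; many newer models normalize before each sublayer instead, and real layers use learnable scale and bias in the normalization):

  def layer_norm(x, eps=1e-5):
      # Normalize each token's vector to zero mean and unit variance
      mean = x.mean(axis=-1, keepdims=True)
      std = x.std(axis=-1, keepdims=True)
      return (x - mean) / (std + eps)

  def transformer_layer(x, attn_params, ffn_params):
      # Steps 1-3: self-attention, residual connection, layer normalization
      x = layer_norm(x + multi_head_attention(x, *attn_params))
      # Steps 4-6: feed-forward network, residual connection, layer normalization
      W1, W2 = ffn_params
      return layer_norm(x + np.maximum(0, x @ W1) @ W2)

  W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
  y = transformer_layer(x, (W_q, W_k, W_v, W_o, 2), (W1, W2))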

Stacking layers: Early layers tend to capture local patterns; later layers capture more global, abstract patterns. GPT-3, for example, stacks 96 such layers, and larger models use even more.

Causal attention: For generation, positions can only attend to previous positions (can't see the future). This enables autoregressive generation.
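A sketch of how the causal constraint is typically implemented: future positions get a score of -inf before the softmax, so their attention weights become exactly zero (reusing softmax and the toy tensors from the first sketch):

  def causal_attention(Q, K, V):
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)
      # Mask everything above the diagonal: position i can't see positions j > i
      mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
      weights = softmax(np.where(mask, -np.inf, scores), axis=-1)
      return weights @ V, weights

  out, w = causal_attention(x @ W_q, x @ W_k, x @ W_v)
  print(np.triu(w, k=1).max())   # 0.0: no weight on future positions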

Impact of attention

Before attention:

  • RNNs processed sequentially—slow to train
  • Long-range dependencies were hard to learn
  • Gradients vanished over long sequences

After attention:

  • All positions processed in parallel—fast training
  • Direct connections between any positions
  • Enables massive models and datasets

Scaling: Attention enabled scaling that wasn't possible with RNNs. GPT-3's 175 billion parameters would be impractical with sequential processing.

Interpretability: Attention weights show what the model focuses on. Helps understand (somewhat) why models make decisions.

Extensions:

  • Sparse attention: Reduce computation for long sequences
  • Flash attention: Memory-efficient implementation
  • Linear attention: O(n) instead of O(n²) complexity

Attention is arguably the most important architectural innovation in modern AI.