Attention Mechanism
A technique that allows AI models to focus on relevant parts of input when processing, enabling better understanding of context and relationships.
What is the attention mechanism?
The attention mechanism is a technique that allows AI models to focus on relevant parts of the input when producing each part of the output.
The problem it solves: Earlier models processed sequences strictly left-to-right, compressing everything seen so far into a fixed-size state, so earlier content was easily forgotten. For "The cat sat on the mat because it was comfortable," understanding "it" requires remembering "cat" from six words back.
How attention helps: Instead of fixed left-to-right processing, attention lets the model look at all positions and decide which are relevant for each output.
Intuition: When you read "it" in the sentence above, your brain automatically thinks back to "cat" or "mat" to understand the reference. Attention models this: when processing "it," the model attends to "cat" (or "mat").
The 2017 paper "Attention Is All You Need" showed that attention alone, without recurrence, could power state-of-the-art models. The resulting transformer architecture now underlies virtually all major LLMs.
How attention works
Query, Key, Value: Each token gets three representations (sketched in code after this list):
- Query (Q): "What am I looking for?"
- Key (K): "What information do I have?"
- Value (V): "What should I return?"
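A minimal NumPy sketch of how these come about (toy dimensions, with random matrices standing in for learned projection weights):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                 # toy sizes; real models use hundreds or thousands
X = rng.normal(size=(5, d_model))   # embeddings for a 5-token sequence

# Learned projection matrices (random placeholders here)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # queries: "What am I looking for?"
K = X @ W_k   # keys:    "What information do I have?"
V = X @ W_v   # values:  "What should I return?"
```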
The process (see the code sketch after these steps):
- For each position, compute attention scores: how much should this position attend to every other position?
- Scores = Query · Key (dot product measures similarity)
- Normalize scores (softmax) so they sum to 1
- Weighted sum of Values using scores
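A minimal, self-contained NumPy sketch of these four steps (the function name and toy inputs are illustrative); it implements exactly the formula given further below:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # steps 1-2: dot-product similarity, scaled
    weights = softmax(scores)         # step 3: normalize so each row sums to 1
    return weights @ V                # step 4: weighted sum of value vectors

# Toy usage: 5 tokens with 4-dimensional queries, keys, values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out = attention(Q, K, V)   # shape (5, 4): one output vector per token
```

Each output row is a blend of value vectors, weighted by how well that token's query matched every key.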
Example: Processing "it" in "The cat sat because it was tired":
- "it"'s Query matches "cat"'s Key strongly
- High attention score for "cat"
- Output heavily influenced by "cat"'s Value
Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
where d_k is the dimension of the key vectors. The √d_k scaling keeps dot products from growing with dimension; without it, large scores push the softmax into regions where gradients become vanishingly small.
Multi-head attention
The insight: Several kinds of relationships matter at once: subject-verb agreement, coreference, syntactic structure. A single attention pattern can't capture them all.
Multi-head attention: Run multiple attention computations in parallel, each with different learned parameters.
Example (illustrative; heads in real models learn their own, often less interpretable, patterns):
- Head 1: Tracks subject-verb relationships
- Head 2: Tracks adjective-noun relationships
- Head 3: Tracks coreference (pronouns to referents)
- Head 4: Tracks sentiment-relevant words
In practice: The largest GPT-3 model has 96 attention heads per layer. Each head learns different patterns. The heads' outputs are concatenated and projected back to the model dimension.
Why it works: No single attention pattern works for all relationships. Multiple heads capture different aspects of how words relate.
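A compact sketch of the mechanics under toy dimensions (all weights are random placeholders for learned parameters): the model dimension is split across heads, attention runs independently in each head, and the heads' outputs are concatenated and projected:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Split the model dimension into n_heads independent subspaces
    def split(M):
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention scores
    out = softmax(scores) @ Vh                              # per-head outputs
    # Concatenate the heads back together, then apply the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o

# Toy usage: 6 tokens, d_model=8, 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
Y = multi_head_attention(X, *W, n_heads=2)   # shape (6, 8)
```

With d_model = 8 and 2 heads, each head works in its own 4-dimensional subspace and can learn a different attention pattern.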
Self-attention in transformers
Self-attention: Each position attends to all positions in the same sequence (including itself).
Cross-attention: Positions in one sequence attend to positions in another (used in encoder-decoder models).
In transformers: Self-attention is the core operation. Each layer:
- Computes self-attention
- Adds a residual connection
- Applies layer normalization
- Applies a feed-forward network
- Adds another residual connection
- Applies layer normalization
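The following sketch wires these six steps together in NumPy (this is the post-norm ordering of the original transformer; many newer models normalize before each sub-layer instead, and the layer norm here omits the learned scale and shift for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def transformer_layer(X, attn_W, ffn_W1, ffn_W2):
    # Steps 1-3: self-attention, residual connection, layer normalization
    X = layer_norm(X + self_attention(X, *attn_W))
    # Steps 4-6: feed-forward network (two linear maps with a ReLU),
    # residual connection, layer normalization
    ffn = np.maximum(0, X @ ffn_W1) @ ffn_W2
    return layer_norm(X + ffn)

# Toy usage: 4 tokens, d_model=8, FFN hidden size 16
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
attn_W = [rng.normal(size=(8, 8)) for _ in range(3)]
Y = transformer_layer(X, attn_W, rng.normal(size=(8, 16)), rng.normal(size=(16, 8)))
```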
Stacking layers: Early layers tend to capture local patterns; later layers capture more global ones. The largest GPT-3 stacks 96 such layers.
Causal attention: For text generation, each position can attend only to earlier positions (it can't see the future), which makes autoregressive generation possible.
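A sketch of how the causal constraint is typically enforced: scores for future positions are set to negative infinity before the softmax, so their attention weights come out exactly zero (toy inputs):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)                  # exp(-inf) = 0, so masked positions get zero weight
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions: row i may only attend to columns 0..i
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores) @ V

# Toy usage: each token's output depends only on itself and earlier tokens
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out = causal_attention(Q, K, V)
```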
Impact of attention
Before attention:
- RNNs processed tokens one at a time, so training was slow
- Long-range dependencies were hard to learn
- Gradients vanished over long sequences
After attention:
- All positions are processed in parallel, so training is fast
- Direct connections between any positions
- Enables massive models and datasets
Scaling: Attention enabled scaling that wasn't possible with RNNs. Training a model the size of GPT-3 (175 billion parameters) would be impractical with purely sequential processing.
Interpretability: Attention weights show what the model focuses on, offering a partial (and contested) window into why models make the decisions they do.
Extensions:
- Sparse attention: Attend only to a subset of positions, cutting computation on long sequences
- FlashAttention: An exact, memory-efficient implementation that minimizes reads and writes to GPU memory
- Linear attention: Approximations that reduce complexity from O(n²) to O(n)
Attention is arguably the most important architectural innovation in modern AI.
Related Terms
Transformer
The neural network architecture that powers most modern AI language models, using attention mechanisms to process sequences efficiently.
Large Language Model (LLM)
A neural network trained on massive text datasets that can understand and generate human-like language.
Neural Network
A computing system inspired by the human brain, using interconnected nodes (neurons) to learn patterns from data.