Architecture

Attention Mechanism

A technique that allows AI models to focus on the relevant parts of the input while processing it, enabling better understanding of context and relationships.

What is the attention mechanism?

The attention mechanism is a technique that allows AI models to focus on relevant parts of the input when producing each part of the output.

The problem it solves: Earlier models processed sequences left-to-right and tended to forget earlier content. For "The cat sat on the mat because it was comfortable," resolving "it" requires looking back to "cat" or "mat" from earlier in the sentence.

How attention helps: Instead of fixed left-to-right processing, attention lets the model look at all positions and decide which are relevant for each output.

Intuition: When you read "it" in the sentence above, your brain automatically thinks back to "cat" or "mat" to understand the reference. Attention models this: when processing "it," the model attends to "cat" (or "mat").

The 2017 paper "Attention Is All You Need" showed that attention alone, without recurrence, could power state-of-the-art sequence models. The resulting transformer architecture now underlies virtually all major LLMs.

How attention works

Query, Key, Value: Each token gets three representations:

  • Query (Q): "What am I looking for?"
  • Key (K): "What information do I have?"
  • Value (V): "What should I return?"

The process:

  1. For each position, compute attention scores: how much should this position attend to every other position?
  2. Scores = Query · Key (dot product measures similarity)
  3. Normalize scores (softmax) so they sum to 1
  4. Weighted sum of Values using scores

Example: Processing "it" in "The cat sat because it was tired":

  • "it"'s Query matches "cat"'s Key strongly
  • High attention score for "cat"
  • Output heavily influenced by "cat"'s Value

Formula:

Attention(Q, K, V) = softmax(QK^T / √d) × V

The √d scaling keeps the dot products from growing too large as the dimension increases; without it, the softmax saturates and gradients become very small.
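A minimal NumPy sketch of this formula (the token count, dimensions, and random projection matrices below are illustrative, not taken from any real model):

  import numpy as np

  def softmax(x, axis=-1):
      # Subtract the row max before exponentiating, for numerical stability
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      # Q: (seq_q, d), K: (seq_k, d), V: (seq_k, d_v)
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)          # step 2: query-key similarity
      weights = softmax(scores, axis=-1)     # step 3: each row sums to 1
      return weights @ V, weights            # step 4: weighted sum of Values

  # Toy example: 5 tokens with 8-dimensional representations
  rng = np.random.default_rng(0)
  x = rng.normal(size=(5, 8))
  W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
  out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
  print(weights.shape)   # (5, 5): one attention distribution per token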

Multi-head attention

The insight: Different kinds of relationships matter at the same time (subject-verb agreement, coreference, syntactic structure), and a single attention pattern can't capture all of them.

Multi-head attention: Run multiple attention computations in parallel, each with different learned parameters.

Example:

  • Head 1: Tracks subject-verb relationships
  • Head 2: Tracks adjective-noun relationships
  • Head 3: Tracks coreference (pronouns to referents)
  • Head 4: Tracks sentiment-relevant words

In practice: GPT-3 has 96 attention heads per layer. Each head learns different patterns; their outputs are concatenated and combined through a learned projection.

Why it works: No single attention pattern works for all relationships. Multiple heads capture different aspects of how words relate.
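Continuing the NumPy sketch above, a rough illustration of the split-into-heads / concatenate pattern (real implementations vectorize this, and the output projection W_o is learned):

  def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
      # x: (seq, d_model); each weight matrix: (d_model, d_model)
      seq, d_model = x.shape
      d_head = d_model // num_heads
      # Project, then split the last dimension into per-head chunks
      split = lambda t: t.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
      Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
      # Each head runs the same attention computation on its own slice
      heads = [attention(Q[h], K[h], V[h])[0] for h in range(num_heads)]
      # Concatenate head outputs and mix them with a learned projection
      return np.concatenate(heads, axis=-1) @ W_o

  W_o = rng.normal(size=(8, 8))
  y = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=2)
  print(y.shape)   # (5, 8): same shape as the input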

Self-attention in transformers

Self-attention: Each position attends to all positions in the same sequence (including itself).

Cross-attention: Positions in one sequence attend to positions in another (used in encoder-decoder models).
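Reusing the attention function and toy tensors from the sketch above, the only difference between the two is where Q, K, and V come from (the encoder/decoder inputs here are made up for illustration):

  # Self-attention: Q, K, and V all come from the same sequence
  self_out, _ = attention(x @ W_q, x @ W_k, x @ W_v)

  # Cross-attention: Q from the decoder sequence, K and V from the encoder output
  decoder_x = rng.normal(size=(3, 8))   # e.g. 3 tokens generated so far
  encoder_x = rng.normal(size=(7, 8))   # e.g. 7 encoded input tokens
  cross_out, _ = attention(decoder_x @ W_q, encoder_x @ W_k, encoder_x @ W_v)
  print(cross_out.shape)   # (3, 8): one output per decoder position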

In transformers: Self-attention is the core operation. Each layer:

  1. Self-attention
  2. Residual connection
  3. Layer normalization
  4. Feed-forward network
  5. Residual connection
  6. Layer normalization
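A simplified sketch of that ordering, continuing the NumPy code above (this is the original post-norm layout; many newer models normalize before each sublayer instead, and real layers use learnable scale and bias in the normalization):

  def layer_norm(x, eps=1e-5):
      # Normalize each token's vector to zero mean and unit variance
      mean = x.mean(axis=-1, keepdims=True)
      std = x.std(axis=-1, keepdims=True)
      return (x - mean) / (std + eps)

  def transformer_layer(x, attn_params, ffn_params):
      # Steps 1-3: self-attention, residual connection, layer normalization
      x = layer_norm(x + multi_head_attention(x, *attn_params))
      # Steps 4-6: feed-forward network, residual connection, layer normalization
      W1, W2 = ffn_params
      return layer_norm(x + np.maximum(0, x @ W1) @ W2)

  W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
  y = transformer_layer(x, (W_q, W_k, W_v, W_o, 2), (W1, W2))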

Stacking layers: Early layers tend to capture local patterns; later layers capture more global, abstract patterns. GPT-3, for example, stacks 96 such layers, and larger models use even more.

Causal attention: For generation, positions can only attend to previous positions (can't see the future). This enables autoregressive generation.
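A sketch of how the causal constraint is typically implemented: future positions get a score of -inf before the softmax, so their attention weights become exactly zero (reusing softmax and the toy tensors from the first sketch):

  def causal_attention(Q, K, V):
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)
      # Mask everything above the diagonal: position i can't see positions j > i
      mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
      weights = softmax(np.where(mask, -np.inf, scores), axis=-1)
      return weights @ V, weights

  out, w = causal_attention(x @ W_q, x @ W_k, x @ W_v)
  print(np.triu(w, k=1).max())   # 0.0: no weight on future positions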

Impact of attention

Before attention:

  • RNNs processed sequentially—slow to train
  • Long-range dependencies were hard to learn
  • Gradients vanished over long sequences

After attention:

  • All positions processed in parallel—fast training
  • Direct connections between any positions
  • Enables massive models and datasets

Scaling: Attention enabled scaling that wasn't possible with RNNs. GPT-3's 175 billion parameters would be impractical with sequential processing.

Interpretability: Attention weights show what the model focuses on. Helps understand (somewhat) why models make decisions.

Extensions:

  • Sparse attention: Reduce computation for long sequences
  • Flash attention: Memory-efficient implementation
  • Linear attention: O(n) instead of O(n²) complexity

Attention is arguably the most important architectural innovation in modern AI.