Definition
Attention is the core mechanism of Transformers: it lets each token dynamically focus on (or "attend to") other tokens in the sequence based on relevance. In self-attention, each token queries every token in the same sequence, itself included, to determine how much each one influences its representation.
The Intuition
Consider: "The bank by the river was steep."
To understand what "bank" means, the model must look at "river": the two words are semantically linked. Self-attention gives the model a direct channel between any two tokens, so the same link would still work if the words were far apart.
The Attention Formula
For each token, three vectors are computed:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain / advertise?"
- Value (V): "What information do I carry?"
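As a rough illustration of where the three vectors come from, the sketch below projects each token's embedding through three weight matrices to obtain Q, K, and V. The toy sizes, the random values, and the helper names (`matmul`, `random_matrix`) are assumptions made for illustration; in a real model the matrices are learned and the arithmetic runs in a tensor library.

```python
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, d_k = 4, 8, 4  # hypothetical toy sizes

X = random_matrix(n_tokens, d_model, rng)  # one embedding row per token
W_Q = random_matrix(d_model, d_k, rng)     # query projection
W_K = random_matrix(d_model, d_k, rng)     # key projection
W_V = random_matrix(d_model, d_k, rng)     # value projection

# Every token gets its own query, key, and value vector.
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

print(len(Q), len(Q[0]))  # number of tokens, size of each query vector
```

All three projections read the same input `X`; only the weight matrices differ, which is what lets one token play the roles of query, key, and value at once.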
These vectors are combined by scaled dot-product attention:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```

Step by step:
1. QK^T: dot product between every query-key pair → raw relevance scores (attention scores)
2. / √d_k: scale by the square root of the key dimension, so large dot products do not saturate the softmax and shrink its gradients
3. softmax(...): convert each row of scores into weights that sum to 1 → attention weights
4. × V: weighted sum of value vectors → output for each token

Multi-Head Attention
Instead of one attention computation, run H independent attention "heads" in parallel:

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_H) × W_O
where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)
```

Each head has its own projection matrices, so different heads can learn to attend to different types of relationships. Illustrative examples (these roles emerge during training; they are not manually assigned):
- Head 1: syntactic subject-verb agreement
- Head 2: coreference resolution
- Head 3: semantic similarity
- Head 4: positional relationships

Causal (Masked) Attention
For autoregressive generation (GPT, LLaMA), each token can attend only to itself and to previous tokens, never to future ones:

```
[The] → can attend to: [The]
[cat] → can attend to: [The], [cat]
[sat] → can attend to: [The], [cat], [sat]
```

This is implemented by setting the entries above the diagonal of the QK^T matrix to -∞ before the softmax, so future positions receive zero weight. The mask is called the causal mask, a specific kind of attention mask.

Attention Patterns (What Models Learn)
Interpretability research has found that attention heads often develop specialized roles:

| Pattern Type | Description |
|-------------|-------------|
| Induction heads | Complete patterns seen earlier in context (in-context learning) |
| Previous token heads | Attend strongly to the immediately preceding token |
| Syntactic heads | Capture dependency relations (subject, object, modifier) |
| Positional heads | Attend based on relative distance |
| Semantic heads | Link semantically related tokens regardless of distance |

Computational Complexity

| Aspect | Value | Implication |
|--------|-------|-------------|
| Attention compute per layer | O(n² × d) | Quadratic in sequence length → long contexts are expensive |
| Memory for KV cache | O(n × d × layers) | Long contexts consume lots of memory |
| Per-token compute at decode | O(n × d) | Each new token must attend to all previous tokens |

This quadratic scaling is why large context windows (e.g., 200K tokens) require significant compute.

Attention Variants

Grouped Query Attention (GQA)
- Query heads are split into groups; each group shares a single key/value head
- Used by: LLaMA 3, Mistral, Gemma
- Reduces KV cache memory with minimal quality loss

Multi-Query Attention (MQA)
- All query heads share a single key head and a single value head
- Smallest KV cache of the multi-head variants
- Used by: PaLM, Falcon

Sliding Window Attention
- Each token attends only to a window of W nearby tokens
- Used by: Mistral 7B, where stacking layers lets information travel beyond a single window
- Reduces attention cost from O(n²) to O(n×W) for long contexts

Flash Attention
- An implementation optimization, not an architectural change: the attention output is the same
- Restructures the computation with tiling so that it is memory-bandwidth efficient
- Reported to be 2–4× faster than naive attention; enables longer sequences
- Now widely supported in major frameworks

Linear Attention (Mamba, RWKV)
- Replaces quadratic attention with O(n) recurrent-style computation (Mamba is a state-space model; RWKV is an RNN-style design)
- Large efficiency gain for very long sequences
- Still maturing; quality slightly below Transformer attention on many benchmarks

Attention vs. Memory
Attention is often described as giving models a "memory", but that memory is temporary:
- Self-attention sees everything in the current context window
- It has no memory of previous conversations
- Attention weights are recomputed from the current context on every forward pass; nothing is written back into the model
- The KV cache speeds up generation by reusing the keys and values of earlier tokens, but it does not persist across sessions

Visualizing Attention
Attention weights can be visualized as heatmaps:
- Row = query token (which token is attending)
- Column = key token (which token is being attended to)
- Color = attention weight (how strongly)

Tools: BertViz, TransformerLens, attention_visualizer

Related concepts: Transformer, Embeddings, KV Cache, Context Window, Latent Space, Multi-Head Attention, Flash Attention
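To make the heatmap view concrete, here is a minimal sketch that computes softmax(QK^T / √d_k) with a causal mask and prints the weights as a text grid, one row per query token and one column per key token. The toy vectors and the helper names (`softmax`, `attention_weights`) are made up for illustration; a real visualization tool would read the weights from a trained model instead.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K, causal=False):
    """Return softmax(QK^T / sqrt(d_k)), optionally masking future positions."""
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        # Scaled dot product of query i with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Positions j > i are in the future: -inf becomes weight 0 after softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights

tokens = ["The", "cat", "sat"]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical toy query vectors
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical toy key vectors

W = attention_weights(Q, K, causal=True)

# Text "heatmap": rows are query tokens, columns are key tokens.
print(" " * 6 + "".join(f"{t:>6}" for t in tokens))
for tok, row in zip(tokens, W):
    print(f"{tok:>6}" + "".join(f"{w:6.2f}" for w in row))
```

With the causal mask on, everything above the diagonal prints as 0.00, which is the lower-triangular shape typically seen in heatmaps from autoregressive models.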
Why Attention Is the Key Innovation
| Before Attention (RNNs) | With Attention (Transformers) |
|------------------------|------------------------------|
| Information compressed into fixed-size state | Direct access to any token in context |
| Gradients tend to vanish over long sequences | Direct path between any two tokens, which eases gradient flow |
| Sequential processing | Parallel across all tokens during training |
| Hard to capture long-range dependencies | Any two tokens in context can interact within a single layer |
| Limited scalability | Scales to trillions of parameters |