Attention / Self-Attention — FDE@ProdAI Blog

Definition

Attention is the core mechanism of Transformers. It lets each token dynamically focus on (or "attend to") other tokens in the sequence based on relevance. In self-attention, the queries, keys, and values all come from the same sequence: each token scores every token in that sequence, itself included, to decide how much each one contributes to its updated representation.

The Intuition

Consider: "The bank by the river was steep."

To understand what "bank" means, the model must look at "river": the two words are semantically linked even though they are not adjacent. Self-attention gives the model a direct channel to connect any two tokens regardless of distance.

The Attention Formula

For each token, three vectors are computed from its embedding using learned projection matrices:

Query (Q): "What am I looking for?"
Key (K): "What do I contain / advertise?"
Value (V): "What information do I carry?"

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Step by step:

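1. QK^T: dot product between every query-key pair → raw relevance scores (attention scores)

2. / √d_k: scale by the square root of the key dimension d_k. Without it, the dot products grow in magnitude with d_k, softmax saturates, and its gradients become vanishingly small

3. softmax(...): applied to each row, converts that query's scores into non-negative weights summing to 1 → attention weights

4. × V: weighted sum of value vectors → output for each token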
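The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under assumed toy sizes (`n`, `d_k`, `d_v` and the random inputs are made up for the example), not a production implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q and K have shape (n, d_k); V has shape (n, d_v).
    Returns (output, weights) with shapes (n, d_v) and (n, n).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T                 # step 1: raw relevance scores, shape (n, n)
    scores = scores / np.sqrt(d_k)   # step 2: scale by sqrt(d_k)
    weights = softmax(scores)        # step 3: each row sums to 1
    output = weights @ V             # step 4: weighted sum of value vectors
    return output, weights

# Toy inputs; sizes are illustrative assumptions.
rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
output, weights = attention(Q, K, V)
```

Row i of `weights` says how much token i draws from every token, and row i of `output` is the resulting mix of value vectors.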

Multi-Head Attention

Instead of one attention computation, run H independent attention "heads" in parallel:

MultiHead(Q, K, V) = Concat(head_1, ..., head_H) × W_O

where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)

Each head has its own projection matrices, so different heads can learn to attend to different types of relationships, for example:

Head 1: syntactic subject-verb agreement
Head 2: coreference resolution
Head 3: semantic similarity
Head 4: positional relationships
...etc (emergent, not manually assigned)
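As a rough sketch of the MultiHead formula, the version below uses one common layout: a single (d_model, d_model) matrix per projection, with each head taking a contiguous slice of columns. That layout, the function name, and the argument names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Multi-head self-attention over one sequence X of shape (n, d_model).

    W_Q, W_K, W_V, W_O all have shape (d_model, d_model). Head i uses
    columns [i * d_head, (i + 1) * d_head) of W_Q, W_K and W_V.
    """
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_Q)
    K = split_heads(X @ W_K)
    V = split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (H, n, n)
    weights = softmax(scores, axis=-1)                     # one weight matrix per head
    heads = weights @ V                                    # (H, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1, ..., head_H)
    return concat @ W_O, weights
```

Because every head has its own `(n, n)` weight matrix, two heads can attend to completely different tokens for the same query position.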

Causal (Masked) Attention

For autoregressive generation (GPT, LLaMA), each token can attend only to itself and earlier tokens, never to future ones:

[The] → can attend to: [The]

[cat] → can attend to: [The], [cat]

[sat] → can attend to: [The], [cat], [sat]

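This is implemented by setting the entries strictly above the diagonal of the QK^T score matrix to -∞ before softmax, so those positions receive zero weight. This is called the causal mask; it is one kind of attention mask (padding masks are another).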
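A small NumPy sketch of the masking step, assuming a single head and an illustrative function name (`causal_attention_weights` is not from any library):

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights where token i can only see tokens j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal: positions j > i are future tokens.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Stable softmax; exp(-inf) evaluates to 0, so masked entries get zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)
```

The first row always ends up as `[1, 0, 0, ...]`, matching the "[The] → can attend to: [The]" case.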

Attention Patterns (What Models Learn)

Interpretability research has identified attention heads with specialized roles, including:

| Pattern Type | Description |
|-------------|-------------|
| Induction heads | Complete patterns seen earlier in context (in-context learning) |
| Previous token heads | Attend strongly to the immediately preceding token |
| Syntactic heads | Capture dependency relations (subject, object, modifier) |
| Positional heads | Attend based on relative distance |
| Semantic heads | Link semantically related tokens regardless of distance |

Computational Complexity

| Aspect | Value | Implication |
|--------|-------|-------------|
| Attention complexity | O(n² × d) | Quadratic in sequence length → long contexts are expensive |
| Memory for KV cache | O(n × d × layers) | Long contexts consume lots of memory |
| Per-token at decode | O(n × d) | Each new token must attend to all previous tokens |

Doubling the sequence length roughly quadruples the attention compute, which is why large context windows (for example, 200K tokens) are expensive to serve.
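To make the scaling concrete, here is a back-of-the-envelope calculation. It counts only the two matrix products inside attention for a single head and treats each multiply-add as 2 FLOPs; the head dimension of 128 is an assumed value.

```python
def attention_flops(n, d):
    """Approximate FLOPs for QK^T plus weights x V over a full sequence.

    Each of the two matrix products costs about 2 * n * n * d FLOPs.
    """
    return 4 * n * n * d

def decode_step_flops(n, d):
    """Approximate FLOPs for one new token attending to n cached tokens."""
    return 4 * n * d

for n in (1_000, 2_000, 200_000):
    print(n, attention_flops(n, d=128), decode_step_flops(n, d=128))
```

Going from 1,000 to 2,000 tokens multiplies the full-sequence cost by 4 but the per-token decode cost only by 2.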

Attention Variants

Grouped Query Attention (GQA)

Query heads are split into groups, and each group shares one key/value head
Used by: LLaMA 3, Mistral, Gemma
Reduces KV cache memory with minimal quality loss
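One way to picture GQA is to repeat each key/value head so that a whole group of query heads reads from it. The sketch below assumes that consecutive query heads form a group; the function name and tensor shapes are illustrative.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d). K, V: (n_kv_heads, n, d).

    n_q_heads must be a multiple of n_kv_heads.
    """
    n_q_heads, n, d = Q.shape
    n_kv_heads = K.shape[0]
    group_size = n_q_heads // n_kv_heads
    # Repeat each K/V head so consecutive query heads in a group share it.
    K_rep = np.repeat(K, group_size, axis=0)
    V_rep = np.repeat(V, group_size, axis=0)
    scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V_rep
```

Only the `n_kv_heads` original K/V heads need to be cached. With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention.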

Multi-Query Attention (MQA)

All query heads share a single key/value head (GQA with one group)
Smallest KV cache among these head-sharing schemes
Used by: PaLM, Falcon
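The memory effect of sharing key/value heads is easy to estimate. The configuration below (32 layers, head dimension 128, 8,192 tokens, 2 bytes per value) is a hypothetical model chosen for round numbers, not the spec of any model named above.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_tokens * n_layers * n_kv_heads * d_head * bytes_per_value

config = dict(n_tokens=8192, n_layers=32, d_head=128)  # hypothetical model

mha = kv_cache_bytes(n_kv_heads=32, **config)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # 4 query heads per K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **config)   # all query heads share one K/V head

print(mha / 2**30, gqa / 2**30, mqa / 2**30)   # 4.0, 1.0 and 0.125 GiB
```

The cache shrinks in direct proportion to the number of key/value heads, which is why these variants matter most at long context lengths and large batch sizes.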

Sliding Window Attention

Each token only attends to a window of W nearby tokens
Used by: Mistral 7B (the window is applied in every layer; stacking layers lets information travel farther than W tokens)
Reduces attention cost from O(n²) to O(n×W) for long contexts
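A sliding window can be expressed as a boolean mask. This sketch assumes the causal variant used in decoder-only models, where the window covers the current token and the W-1 tokens before it; `sliding_window_mask` is an illustrative name.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j.

    Token i sees itself and up to `window - 1` earlier tokens.
    """
    i = np.arange(n)[:, None]   # query positions as a column
    j = np.arange(n)[None, :]   # key positions as a row
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
print(mask.astype(int))
```

Each row has at most `window` allowed positions, so the number of scored pairs grows as n × W instead of n².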

Flash Attention

An implementation optimization (not an architectural change)
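Computes exact attention in tiles that fit in fast on-chip memory, so the full n × n score matrix is never written to GPU main memory
Reported speedups are roughly 2–4× over a standard implementation, and the lower memory use enables longer sequences
Now available in most major frameworks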
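The arithmetic behind the tiling idea is a running ("online") softmax. The NumPy sketch below shows only that arithmetic: it tiles over keys and values, keeps a running max and running sum per query, and checks against the naive result. The real kernels also tile the queries and run fused on the GPU, which this sketch does not attempt; the function names are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block_size=2):
    """Process keys/values block by block without forming the n x n matrix."""
    n, d = Q.shape
    running_max = np.full((n, 1), -np.inf)   # largest score seen so far per query
    running_sum = np.zeros((n, 1))           # softmax denominator so far
    acc = np.zeros((n, V.shape[-1]))         # unnormalized output so far
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = Q @ K_blk.T / np.sqrt(d)    # only (n, block_size) at a time
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)  # rescale earlier partial results
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ V_blk
        running_max = new_max
    return acc / running_sum
```

The two functions agree up to floating-point error, which is the sense in which this is an implementation optimization rather than an approximation.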

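Linear Attention and Related Models (Mamba, RWKV)

Replace quadratic attention with computation that is O(n) in sequence length, carried in a fixed-size recurrent state
Strictly, Mamba is a selective state-space model and RWKV is a recurrent architecture; neither uses softmax attention
Attractive for very long sequences because the per-token cost does not grow with context length
Still maturing; quality is often reported slightly below Transformer attention on many benchmarks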
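To show where the O(n) cost comes from, here is a sketch of kernelized linear attention, a simpler relative of the models named above (it is not how Mamba or RWKV are implemented). The feature map elu(x) + 1 is one choice from the literature and is an assumption here. The recurrent form keeps a fixed-size state and matches the quadratic form with the same kernel.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive feature map (an illustrative choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention in O(n) time with a fixed-size state."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d)                  # running sum of phi(k_t)
    out = np.zeros((n, V.shape[-1]))
    phi_Q, phi_K = feature_map(Q), feature_map(K)
    for t in range(n):
        S += np.outer(phi_K[t], V[t])
        z += phi_K[t]
        out[t] = (phi_Q[t] @ S) / (phi_Q[t] @ z)
    return out

def linear_attention_quadratic(Q, K, V):
    """Same result computed the O(n^2) way, for comparison."""
    phi_Q, phi_K = feature_map(Q), feature_map(K)
    sim = np.tril(phi_Q @ phi_K.T)   # causal: keep only j <= i
    return (sim / sim.sum(axis=-1, keepdims=True)) @ V
```

The state `S` and `z` have the same size no matter how many tokens have been processed, which is the source of both the efficiency and the limited capacity to recall specific earlier tokens.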

Attention vs. Memory

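Attention is often described as giving models a "memory", but that memory is temporary:

Self-attention sees everything in the current context window
It has no memory of previous conversations unless their text is placed back into the context
Attention weights are not stored in the model; they are recomputed from the current context on every forward pass
The KV cache avoids recomputing keys and values for earlier tokens during generation, but it is not long-term memory and does not normally persist across sessions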
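The role of the KV cache during generation can be sketched for a single head and layer. `KVCache` and `decode_step` are hypothetical names for this example, not a library API.

```python
import numpy as np

class KVCache:
    """Keys and values of the tokens processed so far (one head, one layer)."""

    def __init__(self, d_k, d_v):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_v))

    def append(self, k, v):
        self.K = np.vstack([self.K, k[None, :]])
        self.V = np.vstack([self.V, v[None, :]])

def decode_step(q, k, v, cache):
    """Attend from the newest token to itself and all cached tokens."""
    cache.append(k, v)
    scores = cache.K @ q / np.sqrt(q.shape[-1])   # one score per cached token
    scores = scores - scores.max()
    w = np.exp(scores)
    w = w / w.sum()
    return w @ cache.V
```

Each step reuses the stored keys and values instead of recomputing them, and gives the same output as the matching row of full causal attention. A new `KVCache` starts empty, so nothing carries over unless the earlier text is fed in again.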

Visualizing Attention

Attention weights can be visualized as heatmaps:

Row = query token (which token is attending)
Column = key token (which token is being attended to)
Color = attention weight (how strongly)

Tools: BertViz, TransformerLens, attention_visualizer
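For a quick look without any plotting library, the row/column convention above can be rendered as text. This is a toy sketch: `text_heatmap` and its shading characters are made up for the example and are no substitute for the tools listed.

```python
def text_heatmap(weights, tokens, shades=" .:*#"):
    """Render attention weights as text.

    Rows are query tokens, columns are key tokens, and denser
    characters mean larger weights.
    """
    lines = []
    for token, row in zip(tokens, weights):
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        lines.append(f"{token:>6} |{cells}|")
    return "\n".join(lines)

# Hand-written causal weights for a three-token sequence.
weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.1, 0.2, 0.7],
]
print(text_heatmap(weights, ["The", "cat", "sat"]))
```

With causal weights like these, the blank upper-right triangle is the causal mask showing up in the picture.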

Why Attention Is the Key Innovation

| Before Attention (RNNs) | With Attention (Transformers) |
|------------------------|------------------------------|
| Information compressed into fixed-size state | Direct access to any token in context |
| Gradients vanish over long sequences | Short gradient path between any two tokens |
| Sequential processing | Parallel across the sequence during training |
| Hard to capture long-range dependencies | Any two tokens are one attention step apart |
| Limited scalability | Scales to trillions of parameters |

Related Concepts

Transformer, Embeddings, KV Cache, Context Window, Latent Space, Multi-Head Attention, Flash Attention