
Attention / Self-Attention

Definition

Attention is the core mechanism of Transformers that allows each token to dynamically focus on (or "attend to") other tokens in the sequence based on relevance. Self-attention means each token queries all other tokens in the same sequence to determine how much each one influences its representation.

The Intuition

Consider: "The bank by the river was steep."

To understand what "bank" means, the model must look at "river": the two words are semantically linked. Self-attention gives the model a direct channel to connect any two tokens, whether they are adjacent or far apart.

The Attention Formula

For each token, three vectors are computed:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain / advertise?"
  • Value (V): "What information do I carry?"

```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```

Step by step:

1. QK^T: dot product between every query-key pair → raw relevance scores (attention scores)
2. / √d_k: scale the scores so that large dot products do not saturate the softmax, which would leave near-zero gradients
3. softmax(...): convert each row of scores to weights that sum to 1 → attention weights
4. × V: weighted sum of value vectors → output for each token
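
The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it uses nested lists instead of a tensor library, and the toy `Q`, `K`, `V` values are made up for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q, K: n x d_k, V: n x d_v. Returns (outputs, weights).
    """
    scale = math.sqrt(len(K[0]))
    weights = []
    for q in Q:
        # Steps 1-2: dot product with every key, divided by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1
        weights.append(softmax(scores))
    # Step 4: each output row is a weighted sum of the value vectors
    d_v = len(V[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return outputs, weights

# Toy input: 3 tokens, d_k = d_v = 2 (illustrative numbers only)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens.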

Multi-Head Attention

Instead of one attention computation, run H independent attention "heads" in parallel:

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_H) × W_O
where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)
```

Each head has its own projection matrices, so different heads can learn to attend to different types of relationships. For example:

  • Head 1: syntactic subject-verb agreement
  • Head 2: coreference resolution
  • Head 3: semantic similarity
  • Head 4: positional relationships
  • ...etc (emergent, not manually assigned)
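
As a rough sketch of the MultiHead formula, the code below runs H = 2 heads over the same input and concatenates their outputs. It assumes self-attention (Q, K, and V are all projections of one input `X`), uses random matrices where a real model would use learned weights, and the helper names (`matmul`, `rand_matrix`, `multi_head`) are illustrative.

```python
import math
import random

def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V"""
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    # Stand-in for learned weights
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """X: n x d_model. heads: list of (W_Q, W_K, W_V), each d_model x d_head."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concat(head_1, ..., head_H): join the per-head rows side by side
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d_model, H = 4, 8, 2
d_head = d_model // H
X = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(H)]
W_O = rand_matrix(H * d_head, d_model, rng)
out = multi_head(X, heads, W_O)
print(len(out), len(out[0]))  # 4 8
```

Splitting `d_model` evenly across heads (`d_head = d_model // H`) is a common choice that keeps the total cost close to that of a single full-width head.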

Causal (Masked) Attention

For autoregressive generation (GPT, LLaMA), each token can attend only to itself and earlier tokens, never to future ones:

```
[The] → can attend to: [The]
[cat] → can attend to: [The], [cat]
[sat] → can attend to: [The], [cat], [sat]
```

This is implemented by setting the entries above the diagonal of the QK^T matrix to -∞ before the softmax, so future positions receive zero weight. The mask is called the causal mask, one kind of attention mask.
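
A small sketch of the masking step, assuming the raw scores are already computed. The function name `causal_softmax` is illustrative, and the scores are all set to zero so the effect of the mask is easy to see.

```python
import math

def causal_softmax(scores):
    """Apply a causal mask to an n x n score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Positions j > i are future tokens: replace their score with -inf
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # always finite, because j == i is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["The", "cat", "sat"]
scores = [[0.0] * 3 for _ in tokens]  # equal raw scores for every pair
for tok, row in zip(tokens, causal_softmax(scores)):
    print(tok, [round(w, 2) for w in row])
# The [1.0, 0.0, 0.0]
# cat [0.5, 0.5, 0.0]
# sat [0.33, 0.33, 0.33]
```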

Attention Patterns (What Models Learn)

Research shows different attention heads develop specialized roles:

| Pattern Type | Description |
|-------------|-------------|
| Induction heads | Complete patterns seen earlier in context (in-context learning) |
| Previous token heads | Attend strongly to the immediately preceding token |
| Syntactic heads | Capture dependency relations (subject, object, modifier) |
| Positional heads | Attend based on relative distance |
| Semantic heads | Link semantically related tokens regardless of distance |

Computational Complexity

| Aspect | Cost | Implication |
|--------|------|-------------|
| Attention compute (per layer) | O(n² × d) | Quadratic in sequence length → long contexts are expensive |
| Memory for KV cache | O(n × d × layers) | Long contexts consume lots of memory |
| Per-token at decode (per layer, with KV cache) | O(n × d) | Each new token must attend to all previous tokens |

Here n is the sequence length and d is the model dimension. This quadratic scaling is why large context windows (e.g. 200K tokens) require significant compute.
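
To make the scaling concrete, here is a back-of-the-envelope calculation. The model shape (32 layers, d = 4096, 16-bit values, a full K and V vector per token per layer) is a hypothetical configuration chosen for illustration, not a description of any specific model.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Keys and values (the factor 2) for every token at every layer."""
    return 2 * n_tokens * n_layers * d_model * bytes_per_value

def score_matrix_entries(n_tokens):
    """Entries in one n x n attention score matrix (one head, one layer)."""
    return n_tokens * n_tokens

for n in (2_000, 20_000, 200_000):
    gib = kv_cache_bytes(n, n_layers=32, d_model=4096) / 2**30
    entries = score_matrix_entries(n)
    print(f"n={n:>7}: KV cache {gib:8.2f} GiB, score entries {entries:.1e}")
```

Growing the context 10× grows the KV cache 10× but the score matrix 100×, which is the difference between the linear and quadratic rows of the table.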

Attention Variants

Grouped Query Attention (GQA)

  • Query heads are split into groups, and each group shares one key/value head
  • Used by: LLaMA 3, Mistral, Gemma
  • Reduces KV cache memory with minimal quality loss

Multi-Query Attention (MQA)

  • All query heads share a single K and V head
  • Smallest KV cache of the multi-head variants
  • Used by: PaLM, Falcon
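
The difference between standard multi-head attention (MHA), GQA, and MQA comes down to how many K/V heads exist and which one each query head reads. The sketch below shows that mapping for a hypothetical 8-query-head layer. The contiguous grouping and the function names are assumptions for illustration.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    assert n_query_heads % n_kv_heads == 0
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_ratio(n_query_heads, n_kv_heads):
    """KV cache size relative to one K/V head per query head."""
    return n_kv_heads / n_query_heads

n_q = 8
for name, n_kv in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for(h, n_q, n_kv) for h in range(n_q)]
    print(name, mapping, kv_cache_ratio(n_q, n_kv))
# MHA [0, 1, 2, 3, 4, 5, 6, 7] 1.0
# GQA [0, 0, 0, 0, 1, 1, 1, 1] 0.25
# MQA [0, 0, 0, 0, 0, 0, 0, 0] 0.125
```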

Sliding Window Attention

  • Each token attends only to a window of W nearby tokens
  • Used by: Mistral 7B (stacked layers carry information beyond the window)
  • Reduces attention cost from O(n²) to O(n×W) for long contexts
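
A sliding window can be expressed as a mask, much like the causal mask. The sketch below assumes a causal window in which each token sees itself and the W − 1 tokens before it. The function name is illustrative.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j.

    Assumes a causal window: token i sees itself and the
    (window - 1) tokens immediately before it.
    """
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]

mask = sliding_window_mask(n=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx

allowed_pairs = sum(sum(row) for row in mask)
print(allowed_pairs)  # 15, bounded by n * W = 18 rather than growing with n²
```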

Flash Attention

  • An implementation optimization (not an architectural change); it computes exact attention
  • Processes Q, K, and V in tiles so the full n×n score matrix is never written to GPU memory
  • Reported as roughly 2–4× faster than a naive implementation; enables longer sequences
  • Available in most major frameworks
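
The key trick behind tiling is that softmax can be computed incrementally, keeping only a running maximum, a running denominator, and a running weighted sum. The sketch below shows that idea for a single query in plain Python. It is not the actual Flash Attention GPU kernel, and the names and toy data are illustrative.

```python
import math

def naive_row(q, K, V):
    """Reference: softmax(q K^T / sqrt(d)) V with all scores materialized."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]

def tiled_row(q, K, V, block=2):
    """Same result, visiting K/V one block at a time with running statistics."""
    scale = math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [1.0, 2.0]
K = [[0.5, 1.0], [2.0, -1.0], [0.0, 3.0], [1.5, 0.5], [-1.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
ref, tiled = naive_row(q, K, V), tiled_row(q, K, V, block=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, tiled)))  # True
```

Because only one block of scores exists at a time, memory use depends on the block size rather than on the sequence length.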

Linear Attention (Mamba, RWKV)

  • Replaces quadratic attention with recurrent-style computation that is O(n) in sequence length
  • Keeps a fixed-size state, so per-token cost does not grow with context length
  • Still maturing; quality is often reported slightly below Transformer attention on many benchmarks
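
To show what "recurrent-style" means, the sketch below implements a generic kernelized form of causal linear attention, with the feature map elu(x) + 1 as an assumed choice. It is neither Mamba's nor RWKV's actual formulation. It only illustrates how a fixed-size running state can replace the n×n score matrix.

```python
import math

def phi(x):
    """Feature map elu(x) + 1, which keeps every feature positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(Q, K, V):
    """Causal linear attention with a fixed-size running state: O(n) in length."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d_k))
        outputs.append(
            [sum(fq[a] * S[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return outputs

def quadratic_reference(Q, K, V):
    """The same weights computed pairwise, O(n^2), for comparison."""
    outputs = []
    for i, q in enumerate(Q):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(K[j]))) for j in range(i + 1)]
        total = sum(w)
        outputs.append(
            [sum(w[j] * V[j][b] for j in range(i + 1)) / total for b in range(len(V[0]))]
        )
    return outputs

Q = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.9], [0.4, 0.4]]
K = [[0.1, 0.8], [-0.3, 0.5], [0.6, -0.2], [0.9, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
fast, slow = linear_attention(Q, K, V), quadratic_reference(Q, K, V)
print(all(abs(a - b) < 1e-9 for r1, r2 in zip(fast, slow) for a, b in zip(r1, r2)))  # True
```

The state `S` and `z` have sizes that depend only on the head dimensions, which is why per-token cost stays constant as the sequence grows.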

Attention vs. Memory

Attention is often described as giving models a "memory", but that memory is temporary:

  • Self-attention sees everything in the current context window
  • It has no memory of previous conversations
  • It is computed from the current context on each forward pass
  • The KV cache avoids recomputing keys and values for earlier tokens during generation, but it does not persist across sessions
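
A minimal sketch of how a KV cache behaves during generation, assuming a single head and toy vectors. The `KVCache` class and its `step` method are hypothetical names, not a real library API.

```python
import math

def attend(q, keys, values):
    """One query over all cached keys/values: O(n x d) work."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

class KVCache:
    """Holds keys and values for the current sequence only."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Store this token's key/value once, then attend over everything cached
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
steps = [([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
         ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
         ([1.0, 1.0], [1.0, 1.0], [2.0, 2.0])]
for q, k, v in steps:
    out = cache.step(q, k, v)
print(len(cache.keys))  # 3

new_session = KVCache()  # a new session starts with nothing cached
print(len(new_session.keys))  # 0
```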

Visualizing Attention

Attention weights can be visualized as heatmaps:

  • Row = query token (which token is attending)
  • Column = key token (which token is being attended to)
  • Color = attention weight (how strongly)
  • Tools: BertViz, TransformerLens, attention_visualizer

Why Attention Is the Key Innovation

| Before Attention (RNNs) | With Attention (Transformers) |
|------------------------|------------------------------|
| Information compressed into fixed-size state | Direct access to any token in context |
| Gradients vanish over long sequences | Short gradient path between any two tokens |
| Sequential processing | Parallel processing across tokens during training |
| Hard to capture long-range dependencies | Any two tokens can interact within a single layer |
| Limited scalability | Has scaled to hundreds of billions of parameters and beyond |

Related Concepts

  • Transformer, Embeddings, KV Cache, Context Window, Latent Space, Multi-Head Attention, Flash Attention
