Intermediate·4 min read

KV Cache (Key-Value Cache)

The KV Cache (Key-Value Cache) is an optimization that stores the computed attention Key and Value matrices for all previously processed tokens during

Definition

The KV Cache (Key-Value Cache) is an optimization that stores the computed attention Key and Value matrices for all previously processed tokens during inference, so they don't need to be recomputed when generating each new token. It is the reason autoregressive generation is feasible at reasonable speed.

Why It's Necessary

In self-attention, every token attends to every previous token. Without caching, generating token N requires re-processing all N-1 previous tokens from scratch — O(N²) total work:

`

Without KV Cache:

Token 1: process 1 token

Token 2: re-process 2 tokens

Token 3: re-process 3 tokens

...

Token N: re-process N tokens

Total work: O(N²) ← catastrophically slow

`

With KV Cache:

`

Prefill: process all P prompt tokens once → store K,V for all P tokens

Token 1: compute K,V for 1 new token only + attend to cached K,V

Token 2: compute K,V for 1 new token only + attend to cached K,V

...

Total decode work: O(P) prefill + O(N) decode ← manageable

`

What Gets Cached

For each Transformer layer, each attention head, at each sequence position:

  • K (Key matrix): what the token "advertises" about itself
  • V (Value matrix): the actual information the token carries
  • These are computed once and stored in GPU memory.

    `

    KV Cache Size = 2 × layers × heads × head_dim × seq_len × batch_size × bytes_per_value

    `

    Example (LLaMA 3 8B, fp16, seq_len=8192, batch_size=1):

    `

    2 × 32 × 8 × 128 × 8192 × 1 × 2 bytes = ~1.07 GB

    `

    For a 200K token context: ~26 GB just for KV cache of one request.

    Memory is the Bottleneck

    At long sequences, KV cache dominates GPU memory:

  • Model weights: fixed
  • KV cache: grows linearly with sequence length × batch size
  • This is why long-context inference is expensive and why context window limits exist.

    Prompt Caching (Prefix Caching)

    An extension that caches KV states for repeated prompt prefixes across API requests:

    `

    Request 1: [System Prompt (2000 tokens)] + [User question A]

    → Compute + cache KV for system prompt

    → Compute KV for question A

    Request 2: [Same System Prompt (2000 tokens)] + [User question B]

    → Retrieve cached KV for system prompt (instant, ~90% cost reduction)

    → Compute KV for question B only

    `

    Supported by:

  • Anthropic Claude: explicit cache control (cache_control: {"type": "ephemeral"})
  • OpenAI: automatic prefix caching for repeated prefixes
  • Google Gemini: automatic context caching
  • Savings: up to 90% cost reduction for the cached portion, plus 85% latency reduction on prefill.

    KV Cache Optimizations

    PagedAttention (vLLM)

  • Inspired by OS virtual memory / paging
  • KV cache divided into fixed-size "pages"
  • Pages allocated dynamically, non-contiguous in memory
  • Result: eliminates memory fragmentation → 2–4× higher throughput
  • Foundation of vLLM's efficiency advantage
  • Chunked Prefill

  • Process the prompt in chunks rather than all at once
  • Reduces memory spikes during prefill phase
  • Allows interleaving prefill and decode for better GPU utilization
  • Sliding Window / Eviction

  • For very long contexts, evict old KV entries when memory fills
  • Only keep the most recent W tokens' KV states
  • Used by: Mistral's sliding window attention
  • Multi-Query Attention (MQA) / Grouped Query Attention (GQA)

  • Reduce KV cache size by sharing K,V across multiple query heads
  • GQA (LLaMA 3, Mistral): groups of query heads share one K,V pair
  • Reduces KV cache size by 4–8× with minimal quality loss
  • Practical Impact on Deployment

    | Scenario | KV Cache Implication |

    |----------|---------------------|

    | Short conversations | KV cache small, not a bottleneck |

    | Long documents (RAG) | Large KV cache, consider prompt caching |

    | High concurrency | Multiple users' KV caches compete for memory → batching tradeoffs |

    | Streaming | KV cache grows token by token during generation |

    | Agent loops | Each tool result extends context → cache grows per step |

    KV Cache and Context Window Costs

    Why long-context APIs cost more:

  • Computing attention over a long context: O(n²) for prefill
  • Storing KV cache: O(n) GPU memory
  • Long prompts are "prefill heavy" → high TTFT, high memory
  • Prompt caching makes repeated long prompts ~10× cheaper.

    Related Concepts

  • Inference, Attention, Context Window, Latency, Prompt Caching, Memory, vLLM, PagedAttention

Go Deeper With Live Instruction

This topic is covered in depth in our llm engineering program (Session 11).