Definition
The KV Cache (Key-Value Cache) is an optimization that stores the computed attention Key and Value matrices for all previously processed tokens during inference, so they don't need to be recomputed when generating each new token. It is the reason autoregressive generation is feasible at reasonable speed.
Why It's Necessary
In self-attention, every token attends to every previous token. Without caching, generating token N means re-processing all N-1 previous tokens from scratch, so the total number of token computations grows as O(N²):
```
Without KV Cache:
Token 1: process 1 token
Token 2: re-process 2 tokens
Token 3: re-process 3 tokens
...
Token N: re-process N tokens
Total work: 1 + 2 + ... + N = O(N²) token computations ← catastrophically slow
```
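With KV Cache:
```
Prefill: process all P prompt tokens once → store K,V for all P tokens
Token 1: compute K,V for 1 new token only + attend to cached K,V
Token 2: compute K,V for 1 new token only + attend to cached K,V
...
Total token computations: O(P) prefill + O(N) decode ← manageable
```
Each decode step still attends over every cached position, so per-step attention cost grows with context length, but no earlier token is ever re-processed.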
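The two schedules above can be compared in a small, dependency-free sketch. It is a toy single-head attention layer with random weights, not any library's implementation: the names `matvec`, `attend`, `decode_without_cache`, and `decode_with_cache` are made up for illustration, and prefill is simulated by feeding tokens one at a time. The counters track how many tokens have their K,V projected in total.

```python
import math
import random

def matvec(x, W):
    """Multiply row vector x (length d) by matrix W (d x d), using plain lists."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

random.seed(0)
D = 8  # toy model width (assumption for the demo)
W_q, W_k, W_v = ([[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
                 for _ in range(3))
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(10)]

def decode_without_cache(xs):
    """Re-project K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [matvec(x, W_k) for x in xs[:t]]
        values = [matvec(x, W_v) for x in xs[:t]]
        projections += t  # t tokens re-processed at step t
        outputs.append(attend(matvec(xs[t - 1], W_q), keys, values))
    return outputs, projections

def decode_with_cache(xs):
    """Project each token once and append its K, V to a growing cache."""
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in xs:
        k_cache.append(matvec(x, W_k))
        v_cache.append(matvec(x, W_v))
        projections += 1  # only the new token is processed
        outputs.append(attend(matvec(x, W_q), k_cache, v_cache))
    return outputs, projections

slow_out, slow_work = decode_without_cache(tokens)
fast_out, fast_work = decode_with_cache(tokens)
print(slow_work, fast_work)  # 55 10
```

Both paths produce the same attention outputs for all 10 positions; only the amount of repeated projection work differs (1 + 2 + ... + 10 = 55 versus 10).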
What Gets Cached
For each Transformer layer and each attention head, at every sequence position:
- K (Key vector): what the token "advertises" about itself
- V (Value vector): the actual information the token carries
These are computed once and stored in GPU memory.
```
KV Cache Size = 2 × layers × kv_heads × head_dim × seq_len × batch_size × bytes_per_value
```
The leading 2 counts K and V separately. Here kv_heads is the number of key/value heads, which in GQA models is smaller than the number of query heads.
Example (LLaMA 3 8B with 32 layers, 8 KV heads, head_dim 128, fp16, seq_len=8192, batch_size=1):
```
2 × 32 × 8 × 128 × 8192 × 1 × 2 bytes = ~1.07 GB
```
At the same per-token cost (128 KiB per token), a 200K-token context needs ~26 GB of KV cache for a single request.
Memory is the Bottleneck
At long sequences, KV cache dominates GPU memory:
- Model weights: fixed
- KV cache: grows linearly with sequence length × batch size
This is why long-context inference is expensive and why context window limits exist.
Prompt Caching (Prefix Caching)
An extension that caches KV states for repeated prompt prefixes across API requests:
```
Request 1: [System Prompt (2000 tokens)] + [User question A]
→ Compute + cache KV for system prompt
→ Compute KV for question A
Request 2: [Same System Prompt (2000 tokens)] + [User question B]
→ Retrieve cached KV for system prompt (near-instant, up to ~90% cost reduction)
→ Compute KV for question B only
```
Supported by:
- Anthropic Claude: explicit cache control (`cache_control: {"type": "ephemeral"}`)
- OpenAI: automatic prefix caching for repeated prefixes
- Google Gemini: automatic context caching
Savings: up to 90% cost reduction for the cached portion, plus up to 85% latency reduction on prefill.
KV Cache Optimizations
PagedAttention (vLLM)
- Inspired by OS virtual memory / paging
- KV cache divided into fixed-size "pages"
- Pages allocated dynamically, non-contiguous in memory
- Result: nearly eliminates memory fragmentation → 2–4× higher throughput reported
- Foundation of vLLM's efficiency advantage
Chunked Prefill
- Process the prompt in chunks rather than all at once
- Reduces memory spikes during prefill phase
- Allows interleaving prefill and decode for better GPU utilization
Sliding Window / Eviction
- For very long contexts, evict old KV entries when memory fills
- Only keep the most recent W tokens' KV states
- Used by: Mistral's sliding window attention
Multi-Query Attention (MQA) / Grouped Query Attention (GQA)
- Reduce KV cache size by sharing K,V across multiple query heads
- GQA (LLaMA 3, Mistral): groups of query heads share one K,V head; MQA is the extreme case where all query heads share a single K,V head
- Reduces KV cache size by 4–8× relative to full multi-head attention, with minimal quality loss
Practical Impact on Deployment
| Scenario | KV Cache Implication |
|----------|---------------------|
| Short conversations | KV cache small, not a bottleneck |
| Long documents (RAG) | Large KV cache, consider prompt caching |
| High concurrency | Multiple users' KV caches compete for memory → batching tradeoffs |
| Streaming | KV cache grows token by token during generation |
| Agent loops | Each tool result extends context → cache grows per step |
KV Cache and Context Window Costs
Why long-context APIs cost more:
- Computing attention over a long context: O(n²) for prefill
- Storing KV cache: O(n) GPU memory
- Long prompts are "prefill heavy" → high time to first token (TTFT), high memory
Prompt caching makes repeated long prompts up to ~10× cheaper for the cached portion.
Related terms: Inference, Attention, Context Window, Latency, Prompt Caching, Memory, vLLM, PagedAttention
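For a quick check of how these memory costs scale, the sizing formula from the What Gets Cached section can be turned into a small estimator. This is an illustrative sketch only: `kv_cache_bytes` is a hypothetical helper, not part of any library, and the model shape (32 layers, 8 KV heads, head_dim 128, fp16) is the one assumed in the worked example there. The last line assumes a hypothetical variant without GQA in which all 32 query heads keep their own K,V.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed for K and V across all layers; the leading 2 covers K plus V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Shape assumed from the worked example: 32 layers, 8 KV heads, head_dim 128, fp16.
size_8k = kv_cache_bytes(32, 8, 128, 8192)
size_200k = kv_cache_bytes(32, 8, 128, 200_000)
# Hypothetical comparison: the same model with 32 KV heads (no GQA sharing).
size_8k_mha = kv_cache_bytes(32, 32, 128, 8192)

print(f"{size_8k / 1e9:.2f} GB")      # 1.07 GB
print(f"{size_200k / 1e9:.1f} GB")    # 26.2 GB
print(f"{size_8k_mha / 1e9:.2f} GB")  # 4.29 GB
```

Because every factor enters linearly, doubling the batch size or the sequence length doubles the cache, while cutting KV heads from 32 to 8 shrinks it by 4×.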