KV Cache (Key-Value Cache) — FDE@ProdAI Blog

Definition

The KV Cache (Key-Value Cache) is an optimization that stores the computed attention Key and Value matrices for all previously processed tokens during inference, so they don't need to be recomputed when generating each new token. It is the reason autoregressive generation is feasible at reasonable speed.

Why It's Necessary

In causal self-attention, every token attends to itself and to every previous token. Without caching, each generation step re-runs the model over the entire sequence so far, so producing N tokens costs 1 + 2 + ... + N = O(N²) token forward passes in total:

Without KV Cache:

Token 1: process 1 token

Token 2: re-process 2 tokens

Token 3: re-process 3 tokens

...

Token N: re-process N tokens

Total work: O(N²) ← catastrophically slow

With KV Cache:

Prefill: process all P prompt tokens once → store K,V for all P tokens

Token 1: compute K,V for 1 new token only + attend to cached K,V

Token 2: compute K,V for 1 new token only + attend to cached K,V

...

Total work: O(P) prefill + O(N) decode token passes ← manageable (each decode step still attends over every cached position)
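
The two tallies above can be checked with a few lines of arithmetic. This is a minimal sketch that counts token forward passes only (it ignores the per-step attention cost over cached positions). The function names are illustrative, and it assumes the first generated token comes out of the prefill pass.

```python
def passes_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Token forward passes when every step re-processes the whole sequence."""
    # Step i (1-based) re-runs the prompt plus the i - 1 tokens generated so far.
    return sum(prompt_len + i - 1 for i in range(1, new_tokens + 1))


def passes_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Token forward passes with a KV cache: one prefill, then one token per step."""
    # Prefill yields the first new token; each later token needs one more pass.
    return prompt_len + max(new_tokens - 1, 0)


if __name__ == "__main__":
    prompt_len, new_tokens = 1000, 500
    print("without cache:", passes_without_cache(prompt_len, new_tokens))
    print("with cache:   ", passes_with_cache(prompt_len, new_tokens))
```

For a 1000-token prompt and 500 generated tokens, the uncached count is in the hundreds of thousands, while the cached count stays close to the sequence length.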

What Gets Cached

For each Transformer layer and each key/value head, at each sequence position:

K (Key vector): what the token "advertises" about itself
V (Value vector): the actual information the token carries

These are computed once and stored in GPU memory.
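
To make "computed once and stored" concrete, here is a toy single-head, single-layer sketch in pure Python. It is an illustration, not how any inference engine is written: the `KVCache` class, the random projection matrices, and the tiny dimensions are all assumptions made for the example. It checks that attending over cached K and V gives the same output as recomputing K and V for the whole sequence.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def last_output_no_cache(tokens, Wq, Wk, Wv):
    """Recompute K and V for every token, then attend from the last token."""
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    return attend(matvec(Wq, tokens[-1]), keys, values)


class KVCache:
    """Append-only store of per-token K and V vectors for one head (toy version)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token, Wq, Wk, Wv):
        # Only the new token's K and V are computed; older entries are reused.
        self.keys.append(matvec(Wk, token))
        self.values.append(matvec(Wv, token))
        return attend(matvec(Wq, token), self.keys, self.values)


def demo(seq_len=6, dim=4, seed=0):
    """Return the largest difference between cached and uncached outputs."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
    cache = KVCache()
    max_diff = 0.0
    for i, tok in enumerate(tokens):
        cached = cache.step(tok, Wq, Wk, Wv)
        full = last_output_no_cache(tokens[: i + 1], Wq, Wk, Wv)
        max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))
    return max_diff


if __name__ == "__main__":
    print("max difference:", demo())
```

The cached path does one K and one V projection per step, while the uncached path redoes them for every token in the sequence.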

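KV Cache Size = 2 × layers × kv_heads × head_dim × seq_len × batch_size × bytes_per_value

The leading 2 counts K and V. kv_heads is the number of key/value heads, which in GQA models is smaller than the number of query heads.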

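Example (LLaMA 3 8B: 32 layers, 8 KV heads, head_dim 128; fp16, seq_len=8192, batch_size=1):

2 × 32 × 8 × 128 × 8192 × 1 × 2 bytes = 1,073,741,824 bytes ≈ 1.07 GB

That is 128 KiB per token. At the same per-token rate, a 200K-token context would need ~26 GB of KV cache for a single request.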
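
The formula is easy to turn into a small calculator. The sketch below is just the arithmetic above; the function name and defaults are illustrative, and `kv_heads` means the number of key/value heads (8 in the example), not the number of query heads.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 accounts for storing both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    # Figures from the example above: 32 layers, 8 KV heads, head_dim 128, fp16.
    for seq_len in (8192, 200_000):
        size = kv_cache_bytes(32, 8, 128, seq_len)
        print(f"seq_len={seq_len}: {size / 1e9:.2f} GB")
```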

Memory is the Bottleneck

At long sequences, KV cache dominates GPU memory:

Model weights: fixed
KV cache: grows linearly with sequence length × batch size

This is a large part of why long-context inference is expensive, and one practical reason context window limits exist.
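
A rough budget shows how quickly the cache crowds out concurrency. The sketch below is back-of-the-envelope only: the 80 GB GPU, 16 GB of weights, 8 GB reserve for activations and overhead, and the 131,072 bytes-per-token figure (from the LLaMA 3 8B example) are assumptions chosen for illustration.

```python
def max_concurrent_sequences(gpu_bytes: int, weight_bytes: int, reserve_bytes: int,
                             kv_bytes_per_token: int, seq_len: int) -> int:
    """How many full-length sequences fit in the memory left after weights."""
    usable = gpu_bytes - weight_bytes - reserve_bytes
    if usable <= 0:
        return 0
    return usable // (kv_bytes_per_token * seq_len)


if __name__ == "__main__":
    GB = 10**9
    for seq_len in (8192, 131072):
        n = max_concurrent_sequences(80 * GB, 16 * GB, 8 * GB, 131072, seq_len)
        print(f"seq_len={seq_len}: {n} concurrent sequences")
```

Under these assumptions, a 16× longer context cuts the number of sequences that fit by roughly the same factor, while the weights stay the same size.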

Prompt Caching (Prefix Caching)

An extension that caches KV states for repeated prompt prefixes across API requests:

Request 1: [System Prompt (2000 tokens)] + [User question A]

→ Compute + cache KV for system prompt

→ Compute KV for question A

Request 2: [Same System Prompt (2000 tokens)] + [User question B]

→ Retrieve cached KV for system prompt (a lookup instead of a recompute; cached tokens are billed at a discount)

→ Compute KV for question B only

Supported by:

Anthropic Claude: explicit cache control (cache_control: {"type": "ephemeral"})
OpenAI: automatic prefix caching for repeated prefixes
Google Gemini: context caching (explicit cached content, with implicit caching on newer models)

Savings: up to 90% cost reduction for the cached portion, plus up to 85% latency reduction on prefill. Exact discounts and minimum cacheable lengths vary by provider and model.
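
The request flow above can be mimicked with a toy prefix cache. This is only a sketch of the idea, not how any provider implements it: the `PrefixCache` class is hypothetical, it stores a placeholder instead of real KV tensors, and it only matches an exact, whole prefix.

```python
import hashlib


class PrefixCache:
    """Toy cache keyed by a hash of the prefix token IDs (hypothetical design)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(token_ids):
        joined = ",".join(str(t) for t in token_ids)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def process(self, prefix_ids, suffix_ids):
        """Return how many tokens needed fresh KV computation for this request."""
        key = self._key(prefix_ids)
        computed = len(suffix_ids)  # the new suffix is always computed
        if key not in self._store:
            # Cache miss: compute the prefix too, then remember it.
            self._store[key] = len(prefix_ids)  # placeholder for real KV tensors
            computed += len(prefix_ids)
        return computed


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = list(range(2000))      # stand-in for 2000 system prompt tokens
    question_a = list(range(5000, 5012))   # 12 tokens
    question_b = list(range(6000, 6015))   # 15 tokens
    print(cache.process(system_prompt, question_a))  # miss: prefix + question
    print(cache.process(system_prompt, question_b))  # hit: question only
```

Real systems also have to handle expiry, partial prefix matches, and memory limits, which this sketch leaves out.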

KV Cache Optimizations

PagedAttention (vLLM)

Inspired by OS virtual memory / paging
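KV cache divided into fixed-size blocks ("pages"), each holding the K,V for a fixed number of tokens
Blocks allocated on demand, non-contiguous in memory, and tracked through a per-sequence block table
Result: nearly eliminates wasted memory from fragmentation and over-reservation, so more requests fit per batch; the vLLM authors report 2–4× higher throughput than earlier serving systems at similar latency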
Foundation of vLLM's efficiency advantage
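
The paging idea can be sketched with a tiny block allocator. This is an illustration inspired by the description above, not vLLM's actual code: the `BlockAllocator` class and its method names are hypothetical, and it tracks only block IDs, not tensors.

```python
class BlockAllocator:
    """Toy paged KV allocator: sequences map to lists of physical block IDs."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block IDs
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # Current blocks are full (or none exist yet): grab any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4, block_size=16)
    for _ in range(17):
        slot = alloc.append_token("a")
    print("seq a blocks:", alloc.block_tables["a"], "last slot:", slot)
    alloc.free("a")
    print("free blocks after release:", len(alloc.free_blocks))
```

Because a sequence only holds the blocks it has actually filled, a 17-token sequence with 16-token blocks wastes at most one partially used block rather than a full pre-reserved context.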

Chunked Prefill

Process the prompt in chunks rather than all at once
Reduces memory spikes during prefill phase
Allows interleaving prefill and decode for better GPU utilization
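
One way to picture the interleaving is a per-step token budget shared between ongoing decodes and the next slice of a new prompt. The sketch below is a simplified scheduling model, not any engine's real scheduler: `plan_steps` and `token_budget` are hypothetical names, and it assumes the number of decoding requests stays constant while the prompt is prefilled.

```python
def plan_steps(prompt_len: int, num_decoding: int, token_budget: int):
    """Return (prefill_tokens, decode_tokens) per step until the prompt is done."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for at least one prefill token")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Each decoding request consumes one token of budget; the rest is prefill.
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((chunk, num_decoding))
        remaining -= chunk
    return steps


if __name__ == "__main__":
    for step in plan_steps(prompt_len=1000, num_decoding=8, token_budget=512):
        print(step)
```

With these numbers, the 1000-token prompt is prefilled in two steps while the eight running requests keep producing a token on every step instead of stalling behind one large prefill.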

Sliding Window / Eviction

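For very long contexts, drop old KV entries instead of letting the cache grow without bound
Only keep the most recent W tokens' KV states, so cache size is capped at W regardless of sequence length
Tradeoff: tokens outside the window can no longer be attended to directly
Used by: Mistral 7B's sliding window attention, where the window is part of the model architecture and the cache acts as a fixed-size rolling buffer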
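
A fixed-size rolling buffer is straightforward to sketch with `collections.deque`. The `SlidingWindowKVCache` class below is a hypothetical illustration that stores placeholder K and V values; it is not taken from any model's implementation.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps K,V entries for only the most recent `window` positions."""

    def __init__(self, window: int):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=window)

    def append(self, position, k, v):
        self.entries.append((position, k, v))

    def positions(self):
        return [pos for pos, _, _ in self.entries]


if __name__ == "__main__":
    cache = SlidingWindowKVCache(window=4)
    for pos in range(10):
        cache.append(pos, k=f"k{pos}", v=f"v{pos}")
    print(cache.positions())
```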

Multi-Query Attention (MQA) / Grouped Query Attention (GQA)

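Reduce KV cache size by sharing K,V across multiple query heads
MQA: all query heads share a single K,V head
GQA (LLaMA 3, Mistral): each group of query heads shares one K,V head
Reduces KV cache size by the ratio of query heads to KV heads (typically 4–8× for GQA) with minimal quality loss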
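
The head sharing comes down to an integer division. The sketch below uses illustrative function names and assumes consecutive query heads are grouped together; the 32 query heads and 8 KV heads match the LLaMA 3 8B example used earlier.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Index of the shared K,V head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def cache_reduction(num_q_heads: int, num_kv_heads: int) -> float:
    """KV cache shrink factor relative to one K,V head per query head."""
    return num_q_heads / num_kv_heads


if __name__ == "__main__":
    mapping = [kv_head_for_query_head(q, 32, 8) for q in range(32)]
    print(mapping)
    print("GQA reduction:", cache_reduction(32, 8))
    print("MQA reduction:", cache_reduction(32, 1))
```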

Practical Impact on Deployment

| Scenario | KV Cache Implication |
|----------|---------------------|
| Short conversations | KV cache small, not a bottleneck |
| Long documents (RAG) | Large KV cache, consider prompt caching |
| High concurrency | Multiple users' KV caches compete for memory → batching tradeoffs |
| Streaming | KV cache grows token by token during generation |
| Agent loops | Each tool result extends context → cache grows per step |

KV Cache and Context Window Costs

Why long-context APIs cost more:

Computing attention over a long context: O(n²) for prefill
Storing KV cache: O(n) GPU memory
Long prompts are "prefill heavy" → high TTFT, high memory

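With a 90% discount on cached tokens, prompt caching makes the cached portion of a repeated long prompt ~10× cheaper; the uncached suffix and the output tokens are still billed at the normal rate.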
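
How close a whole request gets to the 10× figure depends on how much of the prompt is cached. The sketch below is simple arithmetic under stated assumptions: cached input tokens are priced at 10% of the normal rate (the 90% discount mentioned earlier), and cache-write surcharges and output tokens are ignored. The function name is illustrative.

```python
def relative_input_cost(cached_tokens: int, new_tokens: int,
                        cached_price_ratio: float = 0.1) -> float:
    """Input cost with caching, as a fraction of the cost without caching."""
    total = cached_tokens + new_tokens
    if total == 0:
        raise ValueError("request has no input tokens")
    return (cached_tokens * cached_price_ratio + new_tokens) / total


if __name__ == "__main__":
    # A 2000-token cached system prompt followed by a 200-token question.
    print(f"{relative_input_cost(2000, 200):.3f}")
    # A prompt with nothing cached costs the full amount.
    print(f"{relative_input_cost(0, 200):.3f}")
```

The larger the cached share of the prompt, the closer the overall input cost gets to the discounted rate.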

Related Concepts

Inference, Attention, Context Window, Latency, Prompt Caching, Memory, vLLM, PagedAttention