Definition
Prompt caching is an optimization where the LLM provider precomputes and stores the KV (key-value) cache for a repeated portion of the prompt — typically the system prompt or large context documents — so that subsequent requests reuse this computation instead of reprocessing it from scratch.
The Problem It Solves
In production applications, many requests share the same large prefix:
- A 10,000-token system prompt sent with every API call
- A 200-page document sent with every question about it
- A large codebase sent with every code review request
Without caching: pay full input token cost + full prefill compute on EVERY request.
With caching: pay full cost ONCE, then ~10% of input cost on cache hits.
For a prompt that costs $0.03 to process, 10,000 requests compare as follows:
- Without caching: $0.03 per request × 10,000 requests = $300
- With caching: $0.03 first request + $0.003 × 9,999 remaining ≈ $30.03
Provider behavior at a glance:
- Anthropic Claude: cache writes cost 25% more than the base input price (one-time cost per cache entry); cache reads cost 10% of the base input price (paid on every cache hit)
- OpenAI: no explicit opt-in required; 50% discount on cached input tokens; the cache has a TTL (5–60 minutes); works best for system prompts and long repeated contexts
A cached prefix stops being reused when:
- The cached portion changes even by one character
- The cache TTL expires (5 minutes by default for Claude, 5–60 minutes for OpenAI, configurable for Gemini)
- API version changes affect tokenization
Related terms: KV Cache, Context Window, Latency, Token Costs, System Prompt, API
Cost and Latency Savings
| Metric | Without Cache | With Cache (hit) |
|--------|--------------|-----------------|
| Input token cost | 100% | ~10% of cached portion |
| Prefill latency | Full compute | ~85% reduction |
| TTFT | Baseline | Much lower on cache hits |
Example: a 10,000-token system prompt at $3/M input tokens costs $0.03 per uncached request and about $0.003 per cache hit.
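The arithmetic behind this example can be reproduced with a short sketch. The function name and its parameters are illustrative, and it assumes the first request pays the full price while every later request hits the cache at roughly 10% of the input price; real pricing varies by provider and model.

```python
def total_input_cost(prompt_tokens, price_per_million, requests,
                     cached=False, read_multiplier=0.10, write_multiplier=1.0):
    """Total input cost in dollars for `requests` calls sharing one prompt prefix."""
    full = prompt_tokens * price_per_million / 1_000_000  # one uncached pass
    if not cached or requests == 0:
        return full * requests
    # Assumption: the first request populates the cache and all later ones hit it.
    return full * write_multiplier + full * read_multiplier * (requests - 1)


uncached = total_input_cost(10_000, 3.00, 10_000)
with_cache = total_input_cost(10_000, 3.00, 10_000, cached=True)
print(f"without caching: ${uncached:,.2f}")
print(f"with caching:    ${with_cache:,.2f}")
```

Setting `write_multiplier=1.25` models a provider that charges a premium for the initial cache write; at this request volume the total barely moves.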
How It Works Technically
1. On the first request, the provider processes the entire prompt and stores KV states for the cacheable prefix
2. On subsequent requests with the same prefix, the stored KV states are retrieved
3. Only the new portion (user query + any changed parts) is computed
4. The response is generated using the cached + new KV states
The cache is keyed on the exact byte sequence — any change to the cached portion invalidates the cache.
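The four steps and the exact-match rule can be illustrated with a toy model. This is only a sketch: `ToyPrefixCache` is a hypothetical class, it counts characters instead of computing real KV states, and a provider's actual keying scheme is not public in this detail.

```python
import hashlib


class ToyPrefixCache:
    """Illustrative stand-in for a provider-side cache keyed on the exact prefix."""

    def __init__(self):
        self._store = {}  # prefix hash -> placeholder for stored KV states
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix: str) -> str:
        # Hash the exact bytes, so a one-character change yields a different key.
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def process(self, prefix: str, suffix: str) -> int:
        """Return how many characters had to be 'computed' for this request."""
        key = self._key(prefix)
        if key in self._store:
            self.hits += 1
            return len(suffix)  # steps 2-3: reuse stored state, compute only the new part
        self.misses += 1
        self._store[key] = f"kv-states-for-{len(prefix)}-chars"  # step 1: store
        return len(prefix) + len(suffix)


cache = ToyPrefixCache()
system = "You are an expert code reviewer. " * 100
first = cache.process(system, "Review auth.py")          # miss: whole prompt computed
second = cache.process(system, "Review db.py")           # hit: only the query computed
changed = cache.process(system + "!", "Review db.py")    # miss: prefix differs by one character
print(first, second, changed, cache.hits, cache.misses)
```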
Provider Implementations
Anthropic Claude (Explicit Cache Control)
Developers explicitly mark what to cache using cache_control:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer. [10,000 token instructions]",
            "cache_control": {"type": "ephemeral"},  # ← cache this
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "[Large codebase - 50,000 tokens]",
                    "cache_control": {"type": "ephemeral"},  # ← cache this too
                },
                {"type": "text", "text": "Review the authentication module."},
            ],
        }
    ],
)

# Check cache usage
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
```
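Cache pricing (Claude): a cache write costs 25% more than the base input price, and a cache read costs 10% of the base input price.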
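With a 25% premium on cache writes and reads billed at 10% of the base input price, caching pays for itself as soon as the prefix is reused once. The sketch below computes the relative cost of the cached prefix; the function name is illustrative, and it assumes the first request writes the cache and every later request hits it before the TTL expires.

```python
def caching_cost_ratio(requests, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of the cached prefix with caching, relative to sending it uncached.

    Assumes one cache write followed by (requests - 1) cache hits.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    with_cache = write_multiplier + read_multiplier * (requests - 1)
    return with_cache / requests


for n in (1, 2, 10, 100):
    print(f"{n:>3} requests: {caching_cost_ratio(n):.3f}x the uncached cost")
```

A ratio above 1 for a single request reflects the write premium, so a prefix that is never reused costs more with caching than without it.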
OpenAI (Automatic)
OpenAI automatically caches prompts when the same prefix appears in multiple requests.
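Because caching is automatic, the practical way to confirm it is working is to inspect the usage data returned with each response. The helper below is a sketch that assumes the usage block has been converted to a plain dict exposing `prompt_tokens` and `prompt_tokens_details.cached_tokens`, the field names reported by the Chat Completions API at the time of writing. The function name and both sample payloads are made up for illustration.

```python
def cached_token_share(usage: dict) -> float:
    """Fraction of prompt tokens served from cache, given a usage dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


# Hypothetical usage payloads: a first request (miss) and a repeat (hit).
first_call = {"prompt_tokens": 10_050, "prompt_tokens_details": {"cached_tokens": 0}}
repeat_call = {"prompt_tokens": 10_060, "prompt_tokens_details": {"cached_tokens": 9_984}}

print(f"first call:  {cached_token_share(first_call):.1%} cached")
print(f"repeat call: {cached_token_share(repeat_call):.1%} cached")
```

A share that stays at zero across repeated requests usually means the shared prefix is not byte-identical or falls below the provider's minimum cacheable length.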
Google Gemini (Explicit Context Caching)
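```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

# large_document is a placeholder for the content to cache (long text or an uploaded file).
# Context caching may require a pinned model version; check the current Gemini docs.
cached_content = caching.CachedContent.create(
    model="gemini-1.5-pro",
    contents=[large_document],
    ttl=datetime.timedelta(hours=1),
)
model = genai.GenerativeModel.from_cached_content(cached_content)
response = model.generate_content("What is the main finding?")
```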
What to Cache (Best Practices)
High Value to Cache
| Content | Reason |
|---------|--------|
| System prompts > 1,000 tokens | Reused every request |
| Large reference documents | Sent with every Q&A request |
| Codebase for code review | Many questions about same code |
| Conversation history (long) | Grows with conversation |
| Tool/function definitions | Same set for all requests |
Cannot/Should Not Cache
| Content | Reason |
|---------|--------|
| User-specific content | Changes per user |
| Dynamic data (today's prices, live info) | Changes over time |
| Request-specific context | Unique per request |
| Short system prompts (<1K tokens) | Overhead not worth it |
Cache Invalidation
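The cache is invalidated when any part of the cached prefix changes, when its TTL expires, or when an API version change alters tokenization.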
Best practice: structure your prompt so the stable portion comes FIRST:
```
[Large stable system prompt]   ← cache this
[Large stable document]        ← cache this
[Dynamic user query]           ← NOT cached (changes each request)
```
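The effect of ordering can be checked by measuring how much of two prompts is shared from the start, since only a common leading prefix can be served from cache. This is a character-level sketch with made-up prompt text and hypothetical helper names; providers match on tokens, but the principle is the same.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STABLE = "SYSTEM PROMPT... " * 50 + "REFERENCE DOCUMENT... " * 50


def stable_first(query: str) -> str:
    return STABLE + "\nUser: " + query


def dynamic_first(query: str) -> str:
    return "User: " + query + "\n" + STABLE


q1, q2 = "What changed in v2?", "Summarize section 3."
good = shared_prefix_len(stable_first(q1), stable_first(q2))
bad = shared_prefix_len(dynamic_first(q1), dynamic_first(q2))
print(f"stable content first:  {good} shared characters")
print(f"dynamic content first: {bad} shared characters")
```

With the query placed first, the two prompts diverge almost immediately, so none of the large stable content is reusable.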
Compound Effect: Multi-Turn with Caching
In a long conversation, cache the system prompt + conversation history up to the current turn:
```
Turn 1: cache system (10K tokens) + user message
Turn 2: read cache + turn 1 + user message
Turn 3: read cache + turns 1-2 + user message
...
Turn N: 90% cost reduction on the system prompt portion for every turn
```
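The compounding can be estimated with a small simulation. It is a sketch under stated assumptions: every turn arrives before the cache expires, each turn appends a fixed number of tokens, reads are billed at 10% and writes at 125% of the base input price, and costs are expressed in units of one base-price input token. The function name and parameters are illustrative.

```python
def conversation_cost(turns, system_tokens, tokens_per_turn,
                      read_multiplier=0.10, write_multiplier=1.25, cached=True):
    """Input cost of a conversation, in units of one base-price input token.

    Each turn resends the system prompt plus all earlier turns. With caching,
    the prefix seen on the previous turn is read from cache and only the newly
    appended tokens are written.
    """
    total = 0.0
    cached_prefix = 0  # tokens already cached by earlier turns
    for turn in range(1, turns + 1):
        prompt = system_tokens + tokens_per_turn * turn
        if not cached:
            total += prompt
            continue
        new_tokens = prompt - cached_prefix
        total += cached_prefix * read_multiplier + new_tokens * write_multiplier
        cached_prefix = prompt
    return total


with_cache = conversation_cost(20, 10_000, 500)
without = conversation_cost(20, 10_000, 500, cached=False)
print(f"caching costs {with_cache / without:.0%} of the uncached total")
```

The first turn is more expensive with caching because of the write premium; the savings come from every later turn reading the accumulated prefix at the reduced rate.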