Definition

Prompt caching is an optimization where the LLM provider precomputes and stores the KV (key-value) cache for a repeated portion of the prompt — typically the system prompt or large context documents — so that subsequent requests reuse this computation instead of reprocessing it from scratch.

The Problem It Solves

In production applications, many requests share the same large prefix:

A 10,000-token system prompt sent with every API call
A 200-page document sent with every question about it
A large codebase sent with every code review request

Without caching: pay full input token cost + full prefill compute on EVERY request.

With caching: pay full cost ONCE, then ~10% of input cost on cache hits.

Cost and Latency Savings

| Metric | Without Cache | With Cache (hit) |

|--------|--------------|-----------------|

| Input token cost | 100% | ~10% of cached portion |

| Prefill latency | Full compute | ~85% reduction |

| TTFT | Baseline | Much lower on cache hits |

Example: 10,000-token system prompt at $3/M tokens:

Without caching: $0.03 per request × 10,000 requests = $300
With caching: $0.03 first request + $0.003 × 9,999 remaining = $33

How It Works Technically

1. On the first request, the provider processes the entire prompt and stores KV states for the cacheable prefix

2. On subsequent requests with the same prefix, the stored KV states are retrieved

3. Only the new portion (user query + any changed parts) is computed

4. The response is generated using the cached + new KV states

The cache is keyed on the exact byte sequence — any change to the cached portion invalidates the cache.

Provider Implementations

Anthropic Claude (Explicit Cache Control)

Developers explicitly mark what to cache using cache_control:

`python

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(

model="claude-sonnet-4-6",

max_tokens=1024,

system=[

{

"type": "text",

"text": "You are an expert code reviewer. [10,000 token instructions]",

"cache_control": {"type": "ephemeral"} # ← cache this

}

messages=[

{

"role": "user",

"content": [

{

"type": "text",

"text": "[Large codebase - 50,000 tokens]",

"cache_control": {"type": "ephemeral"} # ← cache this too

{"type": "text", "text": "Review the authentication module."}

]

}

]

)

Check cache usage

usage = response.usage

print(f"Cache read tokens: {usage.cache_read_input_tokens}")

print(f"Cache write tokens: {usage.cache_creation_input_tokens}")

Cache pricing (Claude):

Cache write: 25% more than base input price (one-time cost)
Cache read: 10% of base input price (paid on every cache hit)

OpenAI (Automatic)

OpenAI automatically caches prompts when the same prefix appears in multiple requests:

No explicit opt-in required
50% discount on cached input tokens
Cache has a TTL (5–60 minutes)
Works best for system prompts and long repeated contexts

Google Gemini (Explicit Context Caching)

`python

import google.generativeai as genai

cached_content = genai.caching.CachedContent.create(

model="gemini-1.5-pro",

contents=[large_document],

ttl=datetime.timedelta(hours=1)

)

model = genai.GenerativeModel.from_cached_content(cached_content)

response = model.generate_content("What is the main finding?")

What to Cache (Best Practices)

High Value to Cache

| Content | Reason |

|---------|--------|

| System prompts > 1,000 tokens | Reused every request |

| Large reference documents | Sent with every Q&A request |

| Codebase for code review | Many questions about same code |

| Conversation history (long) | Grows with conversation |

| Tool/function definitions | Same set for all requests |

Cannot/Should Not Cache

| Content | Reason |

|---------|--------|

| User-specific content | Changes per user |

| Dynamic data (today's prices, live info) | Changes over time |

| Request-specific context | Unique per request |

| Short system prompts (<1K tokens) | Overhead not worth it |

Cache Invalidation

The cache is invalidated when:

The cached portion changes even by one character
The cache TTL expires (5 minutes for OpenAI, configurable for Gemini)
API version changes affect tokenization

Best practice: structure your prompt so the stable portion comes FIRST:

[Large stable system prompt] ← cache this

[Large stable document] ← cache this

[Dynamic user query] ← NOT cached (changes each request)

Compound Effect: Multi-Turn with Caching

In a long conversation, cache the system prompt + conversation history up to the current turn:

Turn 1: cache system (10K tokens) + user message

Turn 2: read cache + turn 1 + user message

Turn 3: read cache + turns 1-2 + user message

...

Turn N: 90% cost reduction on the system prompt portion for every turn

Related Concepts

KV Cache, Context Window, Latency, Token Costs, System Prompt, API

Prompt Caching