Anthropic's prompt caching can eliminate 85-90% of your input token cost on repeated system prompts and tool definitions. Here's exactly how to enable it, what the gotchas are, and where it actually saves money.
Last month I ran a side-by-side test on an AI agent that processes about 4,000 requests a day. The agent has a long system prompt (roughly 2,800 tokens of rules, tool definitions, and examples) that gets sent with every single call. Before prompt caching: $47/day. After enabling caching on that system prompt block: $6.80/day.
That's not a rounding error. That's an 85% cost reduction with a single configuration change and zero changes to the agent's behavior.
Here's exactly how prompt caching works and how to set it up without the gotchas.
---
Anthropic's prompt caching works at the prefix level. When you send a request, the API checks whether a prefix of your messages exactly matches a previously-cached prefix. If it does, those cached tokens are served from a KV store instead of re-processed through the full model — and you pay a dramatically lower per-token rate for them.
The pricing structure (as of mid-2026 on Claude 3.5 Sonnet):
The cache lasts **5 minutes** between requests (with the TTL resetting on each hit). For any agent that gets called more often than every 5 minutes — which is most production agents — this is almost always a win.
---
The key is the `cache_control` block. You add it as a "breakpoint" at the end of any message block you want cached. The API caches everything **up to and including** that breakpoint.
import anthropic
client = anthropic.Anthropic()
# Your long system prompt - tool definitions, rules, examples, etc.
SYSTEM_PROMPT = """
You are a support agent for Acme Corp...
[2,800 tokens of rules, tool definitions, persona, examples]
"""
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # <-- this is the entire setup
}
],
messages=[
{"role": "user", "content": user_message}
]
)
# Check what actually happened
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")The `cache_creation_input_tokens` field tells you a cache was written (you pay the 25% premium). On subsequent calls within 5 minutes, `cache_read_input_tokens` will be populated instead, and you pay $0.30/M instead of $3.00/M.
---
**High-ROI scenarios:**
1. **Large system prompts repeated on every call.** If your system prompt is 1,000+ tokens and you're calling the API more than once every 5 minutes, caching it is almost always net positive.
2. **Tool defin