dev.to, July 1, 2026

Prompt caching cut my Claude API bill by 85%. Here's the exact setup.

Anthropic's prompt caching can eliminate 85-90% of your input token cost on repeated system prompts and tool definitions. Here's exactly how to enable it, what the gotchas are, and where it actually saves money.

Last month I ran a side-by-side test on an AI agent that processes about 4,000 requests a day. The agent has a long system prompt (roughly 2,800 tokens of rules, tool definitions, and examples) that gets sent with every single call. Before prompt caching: $47/day. After enabling caching on that system prompt block: $6.80/day.

That's not a rounding error. That's an 85% cost reduction with a single configuration change and zero changes to the agent's behavior.

Here's exactly how prompt caching works and how to set it up without the gotchas.

---

What prompt caching actually does (and doesn't do)

Anthropic's prompt caching works at the prefix level. When you send a request, the API checks whether the start of your prompt (tool definitions, then the system prompt, then messages, in that order) exactly matches a prefix that was cached earlier. If it does, the cached tokens are served from a KV store instead of being re-processed through the full model, and you pay a dramatically lower per-token rate for them.
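A toy model can make the exact-match rule concrete. The sketch below is not how Anthropic implements caching; `prefix_key` and `ToyPrefixCache` are hypothetical names invented for illustration, and the only behavior it models is "hash everything up to the breakpoint and look for an identical entry."

```python
import hashlib


def prefix_key(blocks, breakpoint_index):
    """Hash every block up to and including the breakpoint."""
    digest = hashlib.sha256()
    for block in blocks[: breakpoint_index + 1]:
        digest.update(block.encode("utf-8"))
        digest.update(b"\x00")  # separator so block boundaries matter
    return digest.hexdigest()


class ToyPrefixCache:
    """Illustrative only: remembers which prefixes it has already seen."""

    def __init__(self):
        self._seen = set()

    def lookup(self, blocks, breakpoint_index):
        key = prefix_key(blocks, breakpoint_index)
        if key in self._seen:
            return "hit"
        self._seen.add(key)
        return "miss (cache written)"


cache = ToyPrefixCache()
system = "You are a support agent for Acme Corp..."

# Same system prompt, different user messages: the text after the
# breakpoint does not affect the cached prefix.
first = cache.lookup([system, "Where is my order?"], breakpoint_index=0)
second = cache.lookup([system, "Cancel my subscription."], breakpoint_index=0)

# One extra space inside the cached prefix is enough to miss.
third = cache.lookup([system + " ", "Where is my order?"], breakpoint_index=0)

print(first, second, third, sep="\n")
```

The practical takeaway is to keep anything that varies per request (timestamps, user names, request IDs) out of the cached block, since any change inside the prefix produces a different key.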

The pricing structure (as of mid-2026, Claude Sonnet tier):

  • **Normal input tokens:** $3.00 per million
  • **Cache write (first use, or cache miss):** $3.75 per million (a 25% premium to write the cache)
  • **Cache read (cache hit):** $0.30 per million (90% discount vs. normal)

The cache lasts **5 minutes** by default, and the TTL resets on each hit. For any agent that gets called more often than every 5 minutes (which is most production agents), this is almost always a win.
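To see how those three rates interact, here is a rough cost sketch for the cached prefix alone. It uses the per-million prices listed above and the 2,800-token, 4,000-requests-a-day figures from the introduction; the 95% hit rate is an assumed value for illustration, not a measurement, and the function names are made up for this example. It ignores output tokens and the uncached user message, so it will not reproduce a full daily bill.

```python
# USD per million input tokens, from the price list above.
PRICE_INPUT = 3.00
PRICE_CACHE_WRITE = 3.75
PRICE_CACHE_READ = 0.30


def prefix_cost_uncached(prefix_tokens, requests):
    """Cost of sending the prefix at the normal input rate every time."""
    return prefix_tokens * requests * PRICE_INPUT / 1_000_000


def prefix_cost_cached(prefix_tokens, requests, hit_rate):
    """Cost when a fraction `hit_rate` of requests read from the cache
    and the remainder pay the cache-write rate."""
    hits = requests * hit_rate
    misses = requests - hits
    weighted_rate = hits * PRICE_CACHE_READ + misses * PRICE_CACHE_WRITE
    return prefix_tokens * weighted_rate / 1_000_000


def break_even_hit_rate():
    """Hit rate at which cached and uncached cost are equal:
    h * read + (1 - h) * write == input."""
    return (PRICE_CACHE_WRITE - PRICE_INPUT) / (PRICE_CACHE_WRITE - PRICE_CACHE_READ)


uncached = prefix_cost_uncached(2_800, 4_000)
cached = prefix_cost_cached(2_800, 4_000, hit_rate=0.95)  # assumed hit rate

print(f"Prefix cost without caching: ${uncached:.2f}/day")
print(f"Prefix cost with caching:    ${cached:.2f}/day")
print(f"Break-even hit rate:         {break_even_hit_rate():.1%}")
```

At these prices the break-even point sits at a hit rate of roughly 22%: below that, the 25% write premium costs more than the reads save.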

---

The exact API call

The key is the `cache_control` block. You add it as a "breakpoint" at the end of any content block you want cached. The API caches everything **up to and including** that breakpoint.

```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

# Your long system prompt - tool definitions, rules, examples, etc.
SYSTEM_PROMPT = """
You are a support agent for Acme Corp...
[2,800 tokens of rules, tool definitions, persona, examples]
"""

# Placeholder for the per-request user input
user_message = "Where is my order?"

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # <-- this is the entire setup
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ]
)

# Check what actually happened
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```

The `cache_creation_input_tokens` field tells you a cache was written (you pay the 25% premium). On subsequent calls within 5 minutes, `cache_read_input_tokens` will be populated instead, and you pay $0.30/M instead of $3.00/M.
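One way to keep an eye on this in production is to turn the three usage fields into a status label and a per-request input cost. The helper below is a sketch: `summarize_usage` and `PRICES` are hypothetical names, the prices are the ones quoted earlier, and `SimpleNamespace` stands in for `response.usage` so the example runs without an API call. It relies on `input_tokens` counting only the tokens that were neither written to nor read from the cache.

```python
from types import SimpleNamespace

# USD per million input tokens, as quoted earlier in the article.
PRICES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}


def summarize_usage(usage):
    """Return (status, input_cost_usd) for one response's usage object."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = usage.input_tokens

    if read > 0:
        status = "hit"       # at least part of the prefix came from cache
    elif written > 0:
        status = "write"     # cache miss, prefix was stored for next time
    else:
        status = "uncached"  # nothing was cached on this request

    cost = (
        uncached * PRICES["input"]
        + written * PRICES["cache_write"]
        + read * PRICES["cache_read"]
    ) / 1_000_000
    return status, cost


# Stand-ins for response.usage on a first call and a follow-up call.
first_call = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=2800, cache_read_input_tokens=0
)
later_call = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=0, cache_read_input_tokens=2800
)

write_status, write_cost = summarize_usage(first_call)
hit_status, hit_cost = summarize_usage(later_call)
print(write_status, f"${write_cost:.5f}")
print(hit_status, f"${hit_cost:.5f}")
```

If every request logs as `uncached`, the breakpoint is probably not taking effect, for example because the cached block is shorter than the model's minimum cacheable length.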

---

Where it saves money and where it doesn't

**High-ROI scenarios:**

1. **Large system prompts repeated on every call.** If your system prompt is at least 1,024 tokens (the minimum cacheable length on Sonnet-class models; shorter prompts are not cached) and you're calling the API more than once every 5 minutes, caching it is almost always net positive.

2. **Tool defin
