Anthropic prompt caching, explained: cache_control markers, the two-tier write premium, and when it actually pays off

How Anthropic's prompt cache works mechanically — the ephemeral cache_control marker, the two-tier write premium (1.25x for 5-min TTL, 2x fo

Anthropic's prompt caching is one of the highest-ROI LLM cost-reduction techniques shipped in the last two years, but the mechanics aren't immediately obvious from the docs. The pricing is non-uniform — a write premium on first writes balanced against a 90% discount on reads — and the marker syntax requires explicit opt-in rather than firing automatically the way OpenAI's does. **The summary: tag the stable portion of your prompt with `cache_control: { type: "ephemeral" }`, pay 1.25x normal input price on the first request (5-minute TTL) or 2x (1-hour TTL), then 0.10x on every subsequent request within the cache TTL. Break-even on the 5-minute TTL arrives at the second cache hit; the 1-hour TTL takes a few more hits to pay back but survives much longer between requests. For most production workloads with a system prompt over a few hundred tokens, the discount kicks in by the second customer interaction.** This post walks through the mechanics, the math, the gotchas, and the production patterns that turn the marker into actual savings.

The parent guide [AI API caching](/guides/ai-api-caching) covers the broader caching strategy; this article goes one level into Anthropic's specific implementation.

What it caches and why

Prompt caching is provider-side prefix-attention caching. When you send a request to Anthropic with `cache_control: { type: "ephemeral" }` on part of the prompt, Anthropic hashes the leading content up to that marker, checks an internal cache, and serves the cached attention state if a match exists. The actual model run still happens — Claude still generates the response token-by-token — but the expensive prefix-attention computation is skipped.

The "cache" here is not the response. It's the work the model does to encode the static context into the model's internal representation. Most production LLM workloads carry a long stable prefix (system prompt + retrieved context + tool definitions) followed by a short variable suffix (the user message). Re-encoding the stable prefix on every request is wasted compute. Anthropic charges less for the cached portion because it's doing less work.

The pricing math

The numbers that matter:

| Token category | Price multiplier (vs base input price) | Notes |

|---|---|---|

| Normal input (uncached) | 1.0x | Standard input pricing |

| Cache write — 5-minute TTL (default) | 1.25x | 25% premium for the short-window cache |

| Cache write — 1-hour TTL (extended) | 2.0x | 100% premium for the long-window cache |

| Cache read (subsequent requests within TTL) | 0.10x | The 90% discount — the wedge, same for either TTL |

| Output | normal output pricing | Unchanged |

The break-even threshold is when cumulative savings from cache reads exceed the one-time write premium. On the 5-minute TTL, two cache hits net out as (1.25 + 0.10) / 2 = 0.675x — already a 32.5% saving on the cached portion. Three hits drops the average to 0.483x (a 52% saving). The asymptotic limit as the cache stays warm forev

Read full article on dev.to

// related articles

Cognizant Anthropic

AnthropicJul 28

Import AI 466: The bitter lesson for robotics, AIs complete week-long programming tasks; and OpenAI's accidental AI hacker

Import AIJul 27

Karparthy removed Anthropic from his bio

RedditJul 26