dev.to · June 8, 2026

Prompt caching is the cheapest Claude optimization. Nobody measures it.

Every Claude response carries cache-hit data. Most apps log it nowhere — and pay for it.

---

title: Prompt caching is the cheapest Claude optimization. Nobody measures it.

published: true

canonical_url: https://ferhatatagun.com/blog/prompt-caching-nobody-measures

description: Every Claude response carries cache-hit data. Most apps log it nowhere — and pay for it.

tags: claude, anthropic, llm, observability

cover_image:

---

Pull up the last week of Anthropic API bills from any team shipping a Claude-powered product. Two out of three of them are paying for context they could be reading from cache for one-tenth the price. Most of them don't know it, because the dashboard doesn't tell them and the SDKs don't either — by the time the response lands, the only number anyone looks at is `output_tokens`, and even then mostly when something seems expensive.

The information is in every response. Anthropic puts it in `usage`:

```json
"usage": {
  "input_tokens": 312,
  "cache_creation_input_tokens": 4180,
  "cache_read_input_tokens": 0,
  "output_tokens": 187
}
```

Four numbers. The first time a cached prompt runs you pay 1.25× the input price to *write* the cache. Every subsequent call within the TTL pays 0.1× to *read* it. The ratio between those two lines is the difference between a $3,000/month bill and a $300/month one. And almost no one is graphing it.
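To make the multipliers concrete, here is a minimal sketch that turns one `usage` block into an input-side cost. The function names and the `usd_per_million_input_tokens` parameter are hypothetical, and the 1.25× and 0.1× multipliers are taken from the paragraph above rather than from a price sheet, so check current pricing before trusting the dollar figure.

```python
# Multipliers as quoted in this article (5-minute cache); treat as assumptions.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


def billed_input_equivalent(usage):
    """Input-side cost expressed in 'uncached input token' units."""
    uncached = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    return (
        uncached
        + written * CACHE_WRITE_MULTIPLIER
        + read * CACHE_READ_MULTIPLIER
    )


def input_cost_usd(usage, usd_per_million_input_tokens):
    """Dollar cost of the input side; the base price is supplied by the caller."""
    return billed_input_equivalent(usage) * usd_per_million_input_tokens / 1_000_000


# The usage block shown above: the cache is being written.
first_call = {
    "input_tokens": 312,
    "cache_creation_input_tokens": 4180,
    "cache_read_input_tokens": 0,
    "output_tokens": 187,
}
# A hypothetical follow-up inside the TTL: the same prefix is read back.
second_call = {
    "input_tokens": 312,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 4180,
    "output_tokens": 187,
}

print(billed_input_equivalent(first_call))
print(billed_input_equivalent(second_call))
```

Under those assumptions the write call bills like roughly 5,537 uncached input tokens and the read call like roughly 730. Same prompt, same answer, about a seventh of the input cost.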

**TL;DR**

  • Every Claude response carries cache-hit data in `usage`. Most apps log it nowhere.
  • The call that writes the cache pays `1.25× input` for the cached tokens, a `0.25×` surcharge; every hit after that pays `0.1× input`. A single read inside the TTL already recovers the surcharge.
  • The cache TTL is 5 minutes by default. A request pattern that fires once every six minutes is paying the write penalty forever and getting zero benefit.
  • The fix is observability, not code: graph cache hit ratio over time, alert when it dips, and you'll find the bug before the invoice does.
  • A 150-line browser tool is enough to do this for any project that streams from the Messages API.
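The six-minute trap is easy to see in a toy simulation. The sketch below is not a model of Anthropic's cache internals: it assumes one fixed cached prefix, a 300-second TTL, and that the TTL is refreshed every time the prefix is written or read. `simulate` and `hit_ratio` are hypothetical helper names.

```python
TTL_SECONDS = 300  # the 5-minute default mentioned above


def simulate(request_times, ttl=TTL_SECONDS):
    """Label each request 'write' or 'read' for a single cached prefix.

    Assumption: every write or read resets the expiry to now + ttl.
    """
    outcomes = []
    expires_at = None
    for t in request_times:
        if expires_at is not None and t < expires_at:
            outcomes.append("read")
        else:
            outcomes.append("write")
        expires_at = t + ttl
    return outcomes


def hit_ratio(outcomes):
    """Share of requests that were served from cache."""
    if not outcomes:
        return 0.0
    return outcomes.count("read") / len(outcomes)


# Ten requests, one every six minutes versus one every four minutes.
every_six_minutes = simulate([i * 360 for i in range(10)])
every_four_minutes = simulate([i * 240 for i in range(10)])

print(hit_ratio(every_six_minutes))   # 0.0
print(hit_ratio(every_four_minutes))  # 0.9
```

Two minutes of difference in request spacing moves the hit ratio from zero to ninety percent. That is the number worth graphing.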

What the four numbers actually mean

When you send a request with `cache_control: { type: "ephemeral" }` somewhere in your messages, the API checks if it's seen an identical prefix in the last 5 minutes. There are three outcomes:

1. **Cache miss, new content.** The full prompt is processed normally. `input_tokens` reflects the uncached portion; `cache_creation_input_tokens` reflects what got written into cache for next time.

2. **Cache hit.** The cached prefix is read at 10% of the input price. `cache_read_input_tokens` shows what was read; `input_tokens` is just the new suffix.

3. **TTL expired.** Same shape as a miss — you pay the creation surcharge again.

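So a single response tells you exactly whether you read from cache or paid to write it. Not "approximately." Exactly. Per request. For free. Telling a first-time miss from an expired TTL takes one more piece of state: whether you have sent that prefix before.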
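The three outcomes map onto a few lines of code. This is a sketch, not an official classifier: the label strings, the function names, and `prefix_id` (for example, a hash of the cached prefix) are all hypothetical. Because a first-time miss and an expired TTL have the same shape in `usage`, separating them needs a record of which prefixes you have already sent. The `partial_hit` label covers responses that report both a read and a write, which is an extra case beyond the three listed above.

```python
def classify_response(usage):
    """Classify one response from its usage block alone."""
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read and written:
        return "partial_hit"  # part of the prefix read, a new part written
    if read:
        return "hit"
    if written:
        return "write"        # first-time miss or expired TTL: same shape
    return "uncached"         # no cache activity reported


def classify_with_history(usage, prefix_id, seen_prefixes):
    """Split 'write' into 'miss_new' and 'ttl_expired' using local history.

    prefix_id is a hypothetical key that identifies the cached prefix.
    """
    kind = classify_response(usage)
    if kind == "write":
        kind = "ttl_expired" if prefix_id in seen_prefixes else "miss_new"
    seen_prefixes.add(prefix_id)
    return kind


seen = set()
write_usage = {"input_tokens": 312, "cache_creation_input_tokens": 4180,
               "cache_read_input_tokens": 0}
read_usage = {"input_tokens": 312, "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 4180}

for usage in (write_usage, read_usage, write_usage):
    print(classify_with_history(usage, "system-prompt-v1", seen))
```

Log that label next to every request and the hit ratio falls out of a simple count.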

The pricing math (Sonnet 4.5, June 2026) shapes up like this for a 5,000-token system prompt that gets queried once and then again four minutes later:

| Scenario | First call | Second call | Total |
|----------|------------|-------------|-------|
