Prompt caching in production — what 60-72% Claude savings actually looks like
Anonymized cost data from five production builds. When prompt caching fires, when it doesn't, the four patterns we ship by default, and the math we actually measured.
Across our five shipped builds, prompt caching cut Claude spend 60-72% on the workloads where the system prompt was static or near-static. The savings showed up as soon as the same system prompt was reused at least three times within five minutes. We default to prompt caching on every Wolrix AI build that ships with claude-sonnet-4-6 or higher. Where it didn't fire — short-context one-shots, RAG with high-cardinality retrieval, dynamic personalization that mutates the prompt — we stopped fighting it and routed those workloads to gpt-4o or gemini-flash instead.
What prompt caching is, briefly
Anthropic's prompt caching feature lets you mark sections of a prompt with cache_control breakpoints. The first call writes those tokens to a 5-minute cache; subsequent calls within the window read from the cache at roughly 10% of the input-token price. The cache is keyed on exact byte equality and walks the prompt left-to-right — the first cache miss invalidates everything after it.
That last detail is the whole game. If you put your static system prompt at the top and your variable user message at the bottom, caching fires correctly. If you sprinkle dynamic content into the middle of the system prompt — a timestamp, a user-specific greeting, a session ID — the cache never warms.
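A minimal sketch of that shape, using the Python SDK: the static system prompt sits above the breakpoint as the cached prefix, and everything per-call stays in the user message. The model string is the one named in this post and the file name is illustrative.

```python
# Minimal sketch: static system prompt cached, variable content in the user
# message so the cached prefix stays byte-identical across calls.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = open("system_prompt.txt").read()  # static: persona, tools, schema, constraints

def ask(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # model named in this post; swap for your tier
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache everything up to this block
            }
        ],
        # Anything that changes per call (user text, timestamps, session IDs)
        # belongs below the cached prefix, never inside it.
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```

The only thing the cache key cares about is that the bytes above the breakpoint never change between calls.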
The four patterns we ship by default
1. Static system prompt
The easiest win. A multi-thousand-token system prompt that defines persona, tools, output schema, and constraints, marked cache_control: ephemeral. The first request pays the cache-write premium; every request that hits the cache within the 5-minute window pays roughly 10%, and each hit resets the window. On the MSP triage build, the system prompt was 4,800 tokens of routing rules, response templates, and tool definitions. Caching dropped our per-message cost from $0.0145 to $0.0042 at sustained throughput — a 71% cut.
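For a feel for where a cut like that comes from, here is a back-of-the-envelope input-side cost model. The $/MTok figures are assumptions standing in for Sonnet-class list prices (uncached input, ~125% cache write, ~10% cache read), and it ignores output tokens, so it won't reproduce the per-message figures above exactly.

```python
# Back-of-the-envelope cost model for the static-system-prompt pattern.
# Prices are assumptions (per million input tokens); check the current rate card.
PRICE_INPUT = 3.00        # $/MTok, uncached input
PRICE_CACHE_WRITE = 3.75  # $/MTok, warm-up call, ~125% of input
PRICE_CACHE_READ = 0.30   # $/MTok, cache hits, ~10% of input

def per_call_cost(system_tokens: int, user_tokens: int, system_rate: float) -> float:
    # System prefix billed at the given rate; the short user message is always uncached.
    return (system_tokens * system_rate + user_tokens * PRICE_INPUT) / 1_000_000

# 4,800-token system prompt (the MSP triage build) plus a ~300-token user message:
print(per_call_cost(4_800, 300, PRICE_INPUT))        # ≈ $0.0153  no caching
print(per_call_cost(4_800, 300, PRICE_CACHE_WRITE))  # ≈ $0.0189  warm-up call
print(per_call_cost(4_800, 300, PRICE_CACHE_READ))   # ≈ $0.0023  cache hits
```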
2. Cached domain knowledge for RAG-style retrieval
When the retrieved corpus is bounded (say, the 40-page operations manual a healthcare team references on every call), we cache the whole corpus. It looks like RAG from the outside but it's actually just a long-context prompt with the entire knowledge base cached at the top. For corpora under ~50k tokens we've consistently found this beats vector-DB retrieval on quality and cost. The cache hit rate runs 85-95% across a busy session.
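A sketch of that shape, assuming the manual fits comfortably in context; the file name and instruction text are illustrative. The breakpoint sits after the corpus, so the instructions and the manual are cached together and only the question is billed at the full input rate.

```python
# Cached-corpus pattern: the whole operations manual lives at the top of the
# prompt behind a cache breakpoint; only the question varies per call.
import anthropic

client = anthropic.Anthropic()
MANUAL = open("operations_manual.txt").read()  # bounded corpus, well under ~50k tokens

def answer(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {"type": "text", "text": "Answer strictly from the manual below."},
            {
                "type": "text",
                "text": MANUAL,
                "cache_control": {"type": "ephemeral"},  # caches instructions + manual as one prefix
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
```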
3. Long document chunks
For document-extraction workflows — invoice parsing, PDF triage, contract review — we cache the schema, the field definitions, and the few-shot examples. The variable part is the document itself. On the legal-IP build, the schema + examples ran 6,200 tokens. Cache hit rate after the warm-up call was effectively 100% across each batch.
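Structurally it looks like this, with illustrative names: the cached block is the schema-plus-examples text, and the document is the only thing that changes from call to call.

```python
# Extraction pattern: schema, field definitions, and few-shot examples are
# cached; the document itself is the per-call payload.
import anthropic

client = anthropic.Anthropic()
EXTRACTION_RULES = open("invoice_schema_and_examples.txt").read()  # static within a batch

def extract(document_text: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": EXTRACTION_RULES,
                "cache_control": {"type": "ephemeral"},  # warm once, hit for the rest of the batch
            }
        ],
        messages=[{"role": "user", "content": f"<document>\n{document_text}\n</document>"}],
    )

def extract_batch(documents: list[str]) -> list:
    # The first call pays the cache-write premium; every following document
    # reads the cached rules as long as calls stay within the TTL.
    return [extract(doc) for doc in documents]
```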
4. Few-shot example libraries
For classification and routing tasks where accuracy matters, we maintain a library of 20-40 labeled few-shot examples. The library lives at the top of the prompt, cached. Adding examples becomes cheap: the per-call cost stops scaling with the size of the library. On the dispatch-SMS build, this is what made it economical to run Claude on every single inbound ticket.
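A sketch of how that cached block could be assembled, with made-up tickets and labels; the point is that the example library renders to one static text block behind the breakpoint.

```python
# Few-shot routing pattern: the labeled example library is rendered once into
# a single cached block; per-ticket cost is dominated by the short ticket text.
EXAMPLES = [
    {"text": "Server rack overheating in suite 4", "label": "urgent-dispatch"},
    {"text": "Please update my billing address", "label": "back-office"},
    # ...20-40 of these in a real library
]

FEW_SHOT_BLOCK = "\n\n".join(
    f"Ticket: {ex['text']}\nLabel: {ex['label']}" for ex in EXAMPLES
)

system_blocks = [
    {"type": "text", "text": "Classify each ticket into exactly one label."},
    {
        "type": "text",
        "text": FEW_SHOT_BLOCK,
        "cache_control": {"type": "ephemeral"},  # adding examples only re-pays the cache write once
    },
]
```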
Where prompt caching didn't fire — and what we ship instead
Three workload shapes where caching either didn't help or actively hurt:
- Traffic gaps longer than the 5-minute TTL. A low-volume internal tool that fires every 8-15 minutes never reuses the cached prompt — it expires before the next call. Net: paying the cache-write premium with no payoff. We pulled caching off these endpoints.
- Dynamic per-call personalization in the system prompt. If you splice a username, timestamp, or session ID into the system prompt itself, cache hits cap at zero. Move the dynamic content into the user message, or accept that this workload is cache-hostile.
- Streaming agent loops with branching tool calls. When the prompt mutates mid-turn (tool results getting spliced back in), the cache invalidates at each step. We still use caching here for the static prefix, but the savings drop to 30-40%.
For the workloads where caching didn't help and the volume was high, we routed to gpt-4o (cheaper per-token input, no cache premium) or gemini-flash (cheapest by 5-10x for sub-quality-critical work). This is what multi-model routing actually means in practice — not picking a winner per build, but picking a winner per workload shape inside the build.
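In code, that decision can be as small as a lookup on the workload's shape. This is a sketch, not our production router: the thresholds echo the reuse rule above (a cacheable prefix, reused at least three times inside the TTL), the 1,024-token floor is an assumption about the model tier's minimum cacheable prompt, and the model names are the ones discussed in this post.

```python
# Per-workload routing sketch: cache-friendly work stays on Claude,
# cache-hostile high-volume work goes to a cheaper non-cached model.
def pick_model(static_prefix_tokens: int, calls_per_5_min: float, quality_critical: bool) -> str:
    cache_friendly = static_prefix_tokens >= 1024 and calls_per_5_min >= 3
    if cache_friendly:
        return "claude-sonnet-4-6"   # caching pays for itself
    if quality_critical:
        return "gpt-4o"              # no cache premium, solid quality
    return "gemini-flash"            # cheapest tier for non-critical bulk work
```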
The cost math we actually measured
Across the five production builds, monthly Claude spend on cache-friendly workloads dropped 60-72% once caching went in (all figures anonymized to ranges).
The savings aren't hypothetical and they aren't one-off. They show up on the next billing cycle and they keep showing up. The only build where caching didn't pull its weight was a low-traffic internal admin tool — the cache expired between calls and we just turned it off.
What you need to know before turning it on
- Cache breakpoints matter. Anthropic allows up to four cache_control breakpoints per request. Put them at the boundary between sections that change at different cadences — never inside a section.
- First request is more expensive. The cache-write tier costs ~125% of normal input pricing. Plan for the warm-up: don't cache on a single-shot path.
- 5-minute TTL is short. Build for sustained traffic, not bursts. For burst workloads, you can preload the cache with a warm-up call when a session starts.
- Token estimation lies a little. The cache_read_input_tokens field in the response is the source of truth. Bake it into your cost telemetry from day one.
- Observability is non-optional. Cache hit rate is the number that tells you if your prompt design is right. We log cache_read_input_tokens, cache_creation_input_tokens, and input_tokens on every call and put them in a dashboard. If hit rate drops below 70% on a busy workload, the prompt has drifted.
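A telemetry sketch of that logging, assuming the Python SDK's usage object; the dashboard wiring is up to you, and the 70% threshold mirrors the rule above.

```python
# Pull the usage fields off every response and derive a cache hit rate.
def record_usage(response) -> dict:
    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = usage.input_tokens  # tokens billed at the normal input rate
    total_input = cache_read + cache_write + uncached
    hit_rate = cache_read / total_input if total_input else 0.0
    return {
        "cache_read_input_tokens": cache_read,
        "cache_creation_input_tokens": cache_write,
        "input_tokens": uncached,
        "cache_hit_rate": round(hit_rate, 3),  # alert if this drops below ~0.70 on a busy workload
    }
```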
Why this matters for clients
Most AI dev shops ship a Claude integration and hand you the API key. Six weeks later you get a bill that's 3x what you expected. That's not a model problem; that's a prompt-design problem. We default to caching on every multi-tenant build, expose per-tenant cost telemetry in the admin, and treat the Anthropic invoice the same way we treat the Stripe invoice — it's part of the production envelope, not a footnote.
If you have an AI build that's costing more than it should, this is the first thing we'd look at on a free architecture audit. The fix is usually under a day.
Wolrix is led by Uros — one technical lead per project, with proven specialists when the scope demands. If your project fits the shape — $10K-$50K, 2-8 weeks, NDA-first — book a 15-min intro on Calendly or get a free architecture audit at /free-audit.
All numbers in this post are anonymized to ranges from real shipped work. Cache hit rate, savings percentage, and monthly spend reflect actual measured production data across five builds, 2024-2026.