Last updated:

Prompt caching

Reusing the model's processing of a shared prompt prefix across multiple requests to cut input-token cost.

How it works

Prompt caching is a provider-side optimization that reduces the cost and latency of repeated long prompt prefixes. When a request arrives at the provider's inference servers, the provider hashes the leading portion of the prompt (typically the system message and any static context) and checks an internal cache. If the prefix has been seen recently, the provider serves the cached attention state rather than recomputing it from scratch — input tokens covered by the cache hit are billed at a discounted rate, often 50% to 90% off the normal input price.

The mechanic is distinct from response-level caching. A prompt cache doesn't return a cached response — the LLM still generates fresh output for the user's actual question. What it caches is the expensive prefix-attention computation that's the same across every request sharing that system prompt. This matters because most production LLM traffic carries a stable system message of several hundred to several thousand tokens; without prompt caching every request pays full input-token price for re-processing the same prefix.

Two implementations: Anthropic vs OpenAI

Anthropic uses explicit cache-control markers. The caller marks portions of the prompt with a cache_control: { type: "ephemeral" } block; the provider caches that block for ~5 minutes (or up to 1 hour with the extended TTL option). Cache-hit tokens are billed at 10% of the normal input price (a 90% discount); the first request that creates a cache entry pays a one-time write premium of 25% above normal input price. Break-even is typically within the second request on a stable system prompt.

OpenAI caches automatically — no marker needed. When a prompt prefix matches a recent request, the provider returns a cached_tokens count in the usage block of the response; cached tokens are billed at 50% of the normal input price. Minimum cached-prefix length is 1,024 tokens, after which caching kicks in at 128-token boundaries. There's no write premium; the discount is pure savings.

When it matters

Prompt caching pays off any time a request carries a stable system prompt or static context of more than a few hundred tokens — which describes the majority of production LLM workloads. Retrieval-augmented chat assistants with cached document context, multi-turn agents with stable tool definitions, customer-support bots with long instructional prefixes, and structured-output generators with fixed schema descriptions all benefit. On Anthropic, customers with stable prefixes ≥1,024 tokens see input-token cost cuts of 60-80% within the first day of deploying prompt caching; on OpenAI, the automatic 50% discount on cached input lands without any code change.

It doesn't pay off when prompts are short, when system content varies per request, or when the request pattern is too sparse to keep the cache warm. The 5-minute Anthropic cache expires fast under low traffic; OpenAI's automatic cache has similar TTL behavior. Workloads of one request every 10 minutes won't see the benefit even when the prompt is stable.

Pass-through vs absorbed savings

AI gateways and proxies sit between the application and the provider, which raises a billing question: does the gateway keep the prompt-caching discount as margin, or pass it through to the customer? The honest answer varies. Some proxies abstract the cached-token count out of their billing model entirely and customers pay the gateway's standard markup on every input token regardless of cache state. Others pass the discount through by computing customer cost against the same cached/uncached token breakdown the provider returned. Worth checking when evaluating any gateway.

Prism passes prompt-caching savings through. When a request returns with cache_read_tokens from Anthropic or cached_tokens from OpenAI, the customer's cost is calculated against the discounted base — then Prism's markup applies on top of the discounted figure. Net effect: when the provider gives a 90% discount on cached input, the customer's bill drops by ~90% on that input slice too. The X-Prism-Native-Cache-Saved-Cents response header surfaces the actual saving per request.

Operational notes

The prefix-matching algorithm is provider-internal and not exposed; what works in practice is keeping the system prompt and any stable context as the very first messages in the array, then appending the user message last. Even a one-character drift in the leading content invalidates the cache hit. For Anthropic, marking system content with the ephemeral cache-control block is mandatory for the cache to engage; for OpenAI, no marker is needed but the prefix must be at least 1,024 tokens. Both implementations skip caching on short prefixes.

See your savings before you sign up

Run our calculator on your own workload. Real provider rates, real cache math, no email gate.

Frequently asked questions

How is prompt caching different from semantic caching?
Different layer entirely. Prompt caching (provider-native) caches the prefix attention state on the provider's side and discounts the input tokens; the LLM still generates fresh output for each request. Semantic caching (gateway-side) caches the full response keyed by embedded prompt similarity, so a cache hit returns the previous response without calling the LLM. The two stack cleanly — prompt caching reduces the cost of the requests that DO reach the model, semantic caching avoids reaching the model at all.
Do I need to do anything to enable OpenAI's prompt caching?
No code change needed. OpenAI's prompt cache is automatic on prompts ≥1,024 tokens; the discount appears in the usage block of the response as cached_tokens billed at 50% of the normal input price. The only requirement is that your system prompt + leading context is stable across requests — even minor variations like a timestamp in the prefix invalidate the cache hit.
What's the break-even on Anthropic prompt caching given the 25% write premium?
On most production workloads, break-even arrives at the second request that hits the cache. The first request pays 1.25x normal input price (a 25% premium for the write), but every subsequent hit within the 5-minute TTL pays 0.10x. Two hits net out to (1.25 + 0.10) ÷ 2 = 0.675x normal cost, already a 32% saving. Three hits drop the average to 0.483x, around 52% saved. The longer the cache stays warm, the closer total cost approaches the steady-state 0.10x.
Does prompt caching work with the OpenAI Chat Completions streaming API?
Yes. The cached_tokens count is returned in the final usage chunk of the stream (with stream_options.include_usage set). Streaming and prompt caching are independent — the cache discount applies regardless of whether the response is streamed or returned as a single JSON object.