Last updated:

Provider-native caching

Caching primitives built into the AI provider (Anthropic's cache_control blocks, OpenAI's automatic prompt cache) rather than gateway-layer caching.

How it works

Provider-native caching is a server-side optimisation by the LLM provider (Anthropic, OpenAI, increasingly others) that caches the prefix-attention computation of repeated prompt prefixes. When a request arrives with a prompt prefix the provider has seen recently, the provider serves the cached attention state rather than recomputing it — and discounts the input tokens covered by the cache hit. The discount is substantial: Anthropic discounts cache-read tokens to 10% of the normal input price (a 90% discount); OpenAI discounts cached tokens to 50%.

The mechanic is distinct from gateway-side response caching. Response caching avoids the model call entirely (return a cached response instead). Provider-native caching makes the model call cheaper — the model still runs and generates fresh output, but the input-token cost is reduced. The two layers stack cleanly.

The two implementations

Anthropic uses explicit cache-control markers. The caller marks portions of the request (typically the system message and any static context) with cache_control: { type: "ephemeral" }. The first request that creates a cache entry pays a 25% write premium above normal input price; subsequent reads within the 5-minute default TTL (or 1-hour extended TTL) pay the 10% rate. Break-even arrives at the second request that hits the cache on the same prefix.

OpenAIcaches automatically with no caller-side configuration. Cache engages on prompts ≥1,024 tokens; cached tokens come back in the response's usage block as `cached_tokens`, billed at 50% of normal input price. No write premium, no marker needed.

The pass-through question

When an AI gateway proxies provider calls, the discount goes to the gateway, not directly to the customer. The gateway has two options: pass the discount through (customer bill drops on cache hits) or absorb it as margin (customer pays the gateway's standard markup regardless of provider cache state). Different gateways handle this differently. Worth verifying when evaluating.

Prism passes provider-native cache discounts through. The customer's bill is calculated against the discounted base — `cache_read_tokens` from Anthropic or `cached_tokens` from OpenAI are billed at the provider's discounted rate, then Prism's standard markup applies on top of the discounted figure. The `X-Prism-Native-Cache-Saved-Cents` response header surfaces the actual saving per request.

When it engages

Provider-native caching pays off any time a request carries a stable prefix of meaningful length (typically system message + retrieved context + tool definitions ≥1,024 tokens on OpenAI; a few hundred tokens with the cache-control marker on Anthropic). Most production LLM workloads carry stable prefixes — customer support bots with long instructional system messages, RAG applications with retrieved-context preambles, multi-turn agents with stable tool definitions. The cache discount applies to the prefix portion of every such request.

See your savings before you sign up

Run our calculator on your own workload. Real provider rates, real cache math, no email gate.

Frequently asked questions

How is this different from prompt caching as Anthropic and OpenAI describe it?
Same thing, different name. "Provider-native caching" is the gateway-evaluator-side term for what providers call "prompt caching" (Anthropic) or "cached tokens" (OpenAI). The wording differs; the mechanic is the same.
Do I need to do anything special to enable this?
On Anthropic, yes — mark portions of your prompt with cache_control: { type: 'ephemeral' }. On OpenAI, no — caching is automatic on prompts ≥1,024 tokens. The Anthropic write premium (25% above normal) makes the choice deliberate: if you don't expect repeated hits within the 5-minute window, skip the marker.
Does the gateway have to do anything, or is this transparent?
The gateway has to pass cache-control markers through cleanly (for Anthropic) and read cached-token counts back from response usage blocks (for both). And on the billing side, the gateway has to decide whether to pass the discount to the customer or absorb it. Prism does both — markers pass through, discounts pass to the customer.
Is this the same as semantic caching?
No — provider-native caching is server-side prefix caching that discounts the model call. Semantic caching is gateway-side response caching that avoids the model call entirely (return a cached response based on embedding similarity). They solve different problems and stack: provider-native discounts the calls that do reach the model; semantic prevents many calls from reaching the model in the first place.