Provider-native caching
Caching primitives built into the AI provider (Anthropic's cache_control blocks, OpenAI's automatic prompt cache) rather than gateway-layer caching.
How it works
Provider-native caching is a server-side optimisation by the LLM provider (Anthropic, OpenAI, increasingly others) that caches the prefix-attention computation of repeated prompt prefixes. When a request arrives with a prompt prefix the provider has seen recently, the provider serves the cached attention state rather than recomputing it — and discounts the input tokens covered by the cache hit. The discount is substantial: Anthropic discounts cache-read tokens to 10% of the normal input price (a 90% discount); OpenAI discounts cached tokens to 50%.
The mechanic is distinct from gateway-side response caching. Response caching avoids the model call entirely (return a cached response instead). Provider-native caching makes the model call cheaper — the model still runs and generates fresh output, but the input-token cost is reduced. The two layers stack cleanly.
The two implementations
Anthropic uses explicit cache-control markers. The caller marks portions of the request (typically the system message and any static context) with cache_control: { type: "ephemeral" }. The first request that creates a cache entry pays a 25% write premium above normal input price; subsequent reads within the 5-minute default TTL (or 1-hour extended TTL) pay the 10% rate. Break-even arrives at the second request that hits the cache on the same prefix.
OpenAIcaches automatically with no caller-side configuration. Cache engages on prompts ≥1,024 tokens; cached tokens come back in the response's usage block as `cached_tokens`, billed at 50% of normal input price. No write premium, no marker needed.
The pass-through question
When an AI gateway proxies provider calls, the discount goes to the gateway, not directly to the customer. The gateway has two options: pass the discount through (customer bill drops on cache hits) or absorb it as margin (customer pays the gateway's standard markup regardless of provider cache state). Different gateways handle this differently. Worth verifying when evaluating.
Prism passes provider-native cache discounts through. The customer's bill is calculated against the discounted base — `cache_read_tokens` from Anthropic or `cached_tokens` from OpenAI are billed at the provider's discounted rate, then Prism's standard markup applies on top of the discounted figure. The `X-Prism-Native-Cache-Saved-Cents` response header surfaces the actual saving per request.
When it engages
Provider-native caching pays off any time a request carries a stable prefix of meaningful length (typically system message + retrieved context + tool definitions ≥1,024 tokens on OpenAI; a few hundred tokens with the cache-control marker on Anthropic). Most production LLM workloads carry stable prefixes — customer support bots with long instructional system messages, RAG applications with retrieved-context preambles, multi-turn agents with stable tool definitions. The cache discount applies to the prefix portion of every such request.