Prompt caching
Reusing the model's processing of a shared prompt prefix across multiple requests to cut input-token cost.
How it works
Prompt caching is a provider-side optimization that reduces the cost and latency of repeated long prompt prefixes. When a request arrives at the provider's inference servers, the provider hashes the leading portion of the prompt (typically the system message and any static context) and checks an internal cache. If the prefix has been seen recently, the provider serves the cached attention state rather than recomputing it from scratch — input tokens covered by the cache hit are billed at a discounted rate, often 50% to 90% off the normal input price.
The mechanic is distinct from response-level caching. A prompt cache doesn't return a cached response — the LLM still generates fresh output for the user's actual question. What it caches is the expensive prefix-attention computation that's the same across every request sharing that system prompt. This matters because most production LLM traffic carries a stable system message of several hundred to several thousand tokens; without prompt caching every request pays full input-token price for re-processing the same prefix.
Two implementations: Anthropic vs OpenAI
Anthropic uses explicit cache-control markers. The caller marks portions of the prompt with a cache_control: { type: "ephemeral" } block; the provider caches that block for ~5 minutes (or up to 1 hour with the extended TTL option). Cache-hit tokens are billed at 10% of the normal input price (a 90% discount); the first request that creates a cache entry pays a one-time write premium of 25% above normal input price. Break-even is typically within the second request on a stable system prompt.
OpenAI caches automatically — no marker needed. When a prompt prefix matches a recent request, the provider returns a cached_tokens count in the usage block of the response; cached tokens are billed at 50% of the normal input price. Minimum cached-prefix length is 1,024 tokens, after which caching kicks in at 128-token boundaries. There's no write premium; the discount is pure savings.
When it matters
Prompt caching pays off any time a request carries a stable system prompt or static context of more than a few hundred tokens — which describes the majority of production LLM workloads. Retrieval-augmented chat assistants with cached document context, multi-turn agents with stable tool definitions, customer-support bots with long instructional prefixes, and structured-output generators with fixed schema descriptions all benefit. On Anthropic, customers with stable prefixes ≥1,024 tokens see input-token cost cuts of 60-80% within the first day of deploying prompt caching; on OpenAI, the automatic 50% discount on cached input lands without any code change.
It doesn't pay off when prompts are short, when system content varies per request, or when the request pattern is too sparse to keep the cache warm. The 5-minute Anthropic cache expires fast under low traffic; OpenAI's automatic cache has similar TTL behavior. Workloads of one request every 10 minutes won't see the benefit even when the prompt is stable.
Pass-through vs absorbed savings
AI gateways and proxies sit between the application and the provider, which raises a billing question: does the gateway keep the prompt-caching discount as margin, or pass it through to the customer? The honest answer varies. Some proxies abstract the cached-token count out of their billing model entirely and customers pay the gateway's standard markup on every input token regardless of cache state. Others pass the discount through by computing customer cost against the same cached/uncached token breakdown the provider returned. Worth checking when evaluating any gateway.
Prism passes prompt-caching savings through. When a request returns with cache_read_tokens from Anthropic or cached_tokens from OpenAI, the customer's cost is calculated against the discounted base — then Prism's markup applies on top of the discounted figure. Net effect: when the provider gives a 90% discount on cached input, the customer's bill drops by ~90% on that input slice too. The X-Prism-Native-Cache-Saved-Cents response header surfaces the actual saving per request.
Operational notes
The prefix-matching algorithm is provider-internal and not exposed; what works in practice is keeping the system prompt and any stable context as the very first messages in the array, then appending the user message last. Even a one-character drift in the leading content invalidates the cache hit. For Anthropic, marking system content with the ephemeral cache-control block is mandatory for the cache to engage; for OpenAI, no marker is needed but the prefix must be at least 1,024 tokens. Both implementations skip caching on short prefixes.