Question 1

How is prompt caching different from semantic caching?

Accepted Answer

Different layer entirely. Prompt caching (provider-native) caches the attention state of the prompt prefix on the provider's side and discounts the matching input tokens; the LLM still generates fresh output for each request. Semantic caching (gateway-side) caches the full response, keyed by the similarity of the embedded prompt, so a cache hit returns the previous response without calling the LLM. The two stack cleanly: prompt caching reduces the cost of the requests that DO reach the model, and semantic caching avoids reaching the model at all.
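To make the gateway-side half concrete, here is a minimal sketch of a semantic cache lookup in Python. Everything in it is illustrative: `SemanticCache`, `toy_embed`, `answer`, and the 0.95 threshold are hypothetical names and values, and the bag-of-letters embedding is only a stand-in for a real embedding model. It shows the one property that matters here, namely that a hit returns a stored response and the model is never called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text):
    """Stand-in embedding (letter counts). A real gateway would call an embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class SemanticCache:
    """Hypothetical in-memory cache keyed by prompt-embedding similarity."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold  # assumed cutoff, tune per workload
        self.entries = []  # list of (vector, response) pairs

    def get(self, prompt):
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))


def answer(prompt, cache, call_llm):
    """Return (response, was_cache_hit). The LLM is only called on a miss."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit, True
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response, False
```

On a miss, the request still goes to the provider, which is where provider-native prompt caching can discount the stable prefix.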

Question 2

Do I need to do anything to enable OpenAI's prompt caching?

Accepted Answer

No code change is needed. OpenAI's prompt cache applies automatically to prompts of 1,024 tokens or more; the discount shows up in the usage block of the response as cached_tokens, which are billed at 50% of the normal input price. The only requirement is that your system prompt and leading context stay identical across requests: even a minor variation such as a timestamp in the prefix prevents a cache hit.
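As a rough illustration of how the discount can be read back, the sketch below computes the effective input cost from a usage block. It assumes the usage block has been parsed into a plain dict with `cached_tokens` nested under `prompt_tokens_details`; the function name `input_cost`, the price argument, and the sample numbers are hypothetical, and the 0.5 multiplier simply mirrors the 50% figure quoted above.

```python
def input_cost(usage, price_per_token, cached_multiplier=0.5):
    """Effective input cost for one response, given its usage block as a dict.

    cached_multiplier=0.5 reflects the 50% figure in the answer above;
    treat it as an assumption and adjust it to your model's pricing.
    """
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    uncached = prompt_tokens - cached
    return uncached * price_per_token + cached * price_per_token * cached_multiplier


# Made-up usage block for illustration only.
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 100,
    "prompt_tokens_details": {"cached_tokens": 1920},
}
print(input_cost(sample_usage, price_per_token=1.0))
```

Logging `cached_tokens` per request is also a cheap way to notice when something unstable, such as a timestamp, has crept into the prefix: the count drops to zero.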

Question 3

What's the break-even on Anthropic prompt caching given the 25% write premium?

Accepted Answer

On most production workloads, break-even arrives at the second request, which is the first one to hit the cache. The first request pays 1.25x the normal input price (a 25% premium for the cache write), and every subsequent hit within the 5-minute TTL pays 0.10x. Two requests (one write, one hit) average (1.25 + 0.10) ÷ 2 = 0.675x normal cost, already a saving of about 32%. Three requests (one write, two hits) drop the average to (1.25 + 0.10 + 0.10) ÷ 3 ≈ 0.483x, around 52% saved. The longer the cache stays warm, the closer the average cost gets to the steady-state 0.10x.
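The arithmetic above generalizes to any number of requests. The sketch below counts the cache-writing request as request 1 and every later request as a hit inside the TTL; the function name `average_cost_multiplier` is hypothetical, and the 1.25 and 0.10 multipliers are the ones quoted in the answer.

```python
def average_cost_multiplier(requests, write=1.25, read=0.10):
    """Average input-cost multiplier over `requests` calls sharing one cached prefix.

    Request 1 pays the write premium; requests 2..n are assumed to be cache hits
    that all land inside the TTL.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    return (write + read * (requests - 1)) / requests


for n in (1, 2, 3, 10, 100):
    avg = average_cost_multiplier(n)
    print(f"{n:>3} requests: {avg:.3f}x average, {(1 - avg) * 100:.1f}% saved")
```

A negative "saved" value for a single request is the write premium showing up: a prefix that is cached but never reused costs more than not caching it.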

Question 4

Does prompt caching work with the OpenAI Chat Completions streaming API?

Accepted Answer

Yes. The cached_tokens count is returned in the final usage chunk of the stream, provided stream_options.include_usage is set to true. Streaming and prompt caching are independent: the cache discount applies whether the response is streamed or returned as a single JSON object.
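A minimal sketch of picking the count out of a stream follows. It assumes each chunk has already been parsed from the stream into a plain dict, that content chunks carry no usage, and that the usage chunk nests `cached_tokens` under `prompt_tokens_details`; the helper name `cached_tokens_from_stream` and the sample chunks are hypothetical.

```python
def cached_tokens_from_stream(chunks):
    """Return cached_tokens from the usage-bearing chunk, or None if no usage arrived."""
    cached = None
    for chunk in chunks:
        usage = chunk.get("usage")
        if usage:  # content chunks are assumed to carry no usage block
            details = usage.get("prompt_tokens_details") or {}
            cached = details.get("cached_tokens", 0)
    return cached


# Simulated stream for illustration: two content chunks, then the usage chunk.
sample_chunks = [
    {"choices": [{"delta": {"content": "Hel"}}], "usage": None},
    {"choices": [{"delta": {"content": "lo"}}], "usage": None},
    {
        "choices": [],
        "usage": {
            "prompt_tokens": 1500,
            "completion_tokens": 2,
            "prompt_tokens_details": {"cached_tokens": 1024},
        },
    },
]
print(cached_tokens_from_stream(sample_chunks))
```

A return value of None suggests usage reporting was not enabled on the request, which is different from a genuine cache miss reported as 0.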
