AI API caching: the complete 2026 guide
Last updated:
· 21 min readThree layers of LLM response caching, when each wins, what they cost, and how to operate them in production. Measured numbers, no vendor estimates.
AI API caching is the layered strategy of avoiding redundant LLM provider calls — first by recognising byte-identical requests (exact match), then near-duplicate requests by meaning (semantic match), then by relying on the provider's own prompt-prefix cache to discount the calls that still go through. Running all three layers correctly cuts a typical production AI bill by 30–60% with sub-millisecond cache lookup overhead. This guide walks through the mechanics of each layer, when each one wins, the false-positive risks that ruin naive implementations, the economic math, and how to operate the whole stack in production. It's written for engineers actively deciding whether to build, buy, or both.
Why caching is non-negotiable in 2026
Two facts about production LLM traffic make caching the single highest-leverage cost lever available:
Most traffic repeats. The same support question, asked by ten different users in slightly different words. The same system prompt prepended to every request. The same cron job firing every five minutes. The same RAG retrieval against a slowly-changing knowledge base. In any deployed AI product, somewhere between 30% and 70% of requests carry semantic redundancy with a previous request from the same week — sometimes the same hour.
Provider pricing scales linearly with tokens. There is no volume discount that materially changes the slope. A call costing $0.015 today costs $0.015 the millionth time you make it. Caching is the only structural mechanism that breaks the linear scaling.
The combination means caching is not an optimization — it's the bare minimum cost discipline for any AI product past hobbyist scale. The interesting question is not whether to cache, but which layers to run and how to operate them without introducing correctness problems.
The three layers
Production-grade AI API caching is three distinct layers solving three distinct problems. They stack cleanly; they're not alternatives.
Layer 1 — Exact-match caching
What it does: computes a deterministic fingerprint of the full request (SHA-256 over normalised messages, model name, temperature, top_p, max_tokens, and any other request parameters), looks up the fingerprint in a key-value store, and serves the cached response if the fingerprint exists.
Where it lives: typically Redis or an equivalent low-latency key-value store, sized to a few hundred MB to a few GB depending on traffic volume.
Latency: sub-8ms p95 round-trip in a same-region deployment. The lookup itself is a single Redis GET; the only meaningful cost is the SHA-256 computation, which is microseconds.
Hit rate in production AI traffic: typically 5–15%. The catch is that most production prompts carry some per-request variation — a user name, a session ID, a recently-retrieved RAG passage, a current timestamp — that makes byte-identical matches rare. Exact-match cache hits dominate on deterministic workloads (cron jobs, ETL pipelines, evaluation runs) and tail off rapidly on user-facing chat traffic.
Correctness: provably perfect. An exact-match cache returns the same response if and only if the request was byte-identical to a previous request. The only correctness concern is invalidation when the underlying answer changes — addressed via TTL or explicit purge.
The fingerprinting discipline that makes exact-match cache hits more common than the byte-level pattern would suggest is normalisation. Sort message keys deterministically. Strip whitespace consistently. Resolve nullable fields to the same canonical absent-or-zero state. Don't include request-id or timestamp fields in the fingerprint. Two requests that are trivially equivalent should fingerprint to the same hash.
See Prompt cache fingerprinting pitfalls for the edge cases we hit during Prism's v1.1 build.
Layer 2 — Semantic caching
What it does: embeds the user's prompt with a sentence-embedding model (typically a small fast one like BGE-small-en-v1.5 at 384 dimensions, or text-embedding-3-small at 1536), queries a vector database for the nearest stored embedding, and serves the cached response associated with that stored embedding if the cosine similarity exceeds a threshold.
Where it lives: a vector database — Upstash Vector, Pinecone, Qdrant, pgvector, etc. The embeddings + their associated cached responses are stored together (or referenced by ID into the same key-value store as Layer 1).
Latency: 20–40ms p95 round-trip including the embedding inference. The embedding call dominates the latency budget; the vector index lookup itself is sub-5ms with a well-configured HNSW index.
Hit rate in production AI traffic: typically 25–50% on top of whatever exact-match caught. Customer support chatbots see the high end (40–60% — paraphrases of the same FAQ question land within similarity threshold). Tool-using agents with variable retrieval contexts see the low end (15–25%). Pure code-generation workloads often see <10% because variable names and code structure dominate the embedding space over intent.
Correctness: probabilistic. The cosine similarity threshold is the dial that trades hit rate for false-positive rate. The deeper-dive on threshold engineering lives in Exact vs semantic caching for LLMs; the summary is: 0.95 is the defensible default for general workloads, 0.92 is aggressive but workable for narrow domains, and anything below 0.90 starts returning confidently-wrong answers at concerning rates.
The threshold-tuning discipline that makes this safe is sampled validation. Run at 0.95. Periodically pull 100 random hits, have a human (or a stronger LLM-as-judge) verify whether the cached response was appropriate to the new prompt. If false-positive rate is <2%, you can experiment with lowering the threshold to recover more hits. If it's >5%, raise it. Never tune the threshold by intuition.
Layer 3 — Provider-native prompt caching (passthrough)
What it does: different layer entirely. Doesn't cache the response — caches the prefix-attention computation on the provider's inference servers. When a request hits the provider with a stable system prompt or static context the provider has seen recently, the provider returns the same response it would have anyway but bills the cached portion of the input at a discounted rate.
Where it lives: Anthropic and OpenAI both ship this server-side. As a customer (or as a gateway), you don't run any infrastructure for it — you opt in (Anthropic, via explicit cache_control markers) or it engages automatically (OpenAI, on prompts ≥1,024 tokens).
Latency: ~zero overhead at lookup time. The provider's internal cache lookup is microseconds; the savings are billing-side.
Savings:
- Anthropic discounts cache-read tokens to 10% of normal input price (a 90% discount). First request that creates a cache entry pays a 25% write premium; subsequent reads within the 5-minute default TTL (or 1-hour extended TTL) pay the 10% rate. Break-even arrives at the second hit on any stable prefix.
- OpenAI discounts cached tokens to 50% of normal input price. No write premium, no caller-side configuration. Engages automatically on prompts ≥1,024 tokens with the cache TTL roughly mirroring Anthropic's default range.
Hit pattern: every request with a stable system prompt longer than ~few hundred tokens (Anthropic) or 1,024 tokens (OpenAI) hits the provider cache, regardless of whether Layers 1 and 2 caught the request. Layer 3 is independent of Layers 1 and 2.
The pass-through question: when an AI gateway sits between your application and the provider, the gateway can either keep the provider-native discount as margin or pass it through to the customer. The honest answer varies by gateway; worth verifying for any product you adopt. Prism passes it through — when a request returns with cache_read_tokens from Anthropic or cached_tokens from OpenAI, the customer's bill is calculated against the discounted base, and the X-Prism-Native-Cache-Saved-Cents response header surfaces the savings.
See the prompt-caching glossary entry for the per-provider mechanics in shorter form.
How the layers compose
The three layers run in order, short-circuiting on a hit:
incoming request
↓
[Layer 1] exact-match fingerprint lookup ← sub-8ms p95
↓ miss
[Layer 2] semantic embed + vector search ← 20-40ms p95
↓ miss
[Layer 3] dispatch to provider with
cache_control markers attached ← provider call + native cache discount on hit
↓
response back to caller; result stored in Layers 1 + 2 for future
A hit in Layer 1 short-circuits Layers 2 and 3 entirely (the cached response is returned directly). A hit in Layer 2 short-circuits Layer 3. A Layer 3 hit still pays the provider call's latency and its (discounted) cost, but at substantially less than full-token billing.
The composition matters because the layers don't dominate each other. On a workload where Layer 1 catches 10% of traffic and Layer 2 catches another 35% (i.e. 35% of the remaining 90%, or about 31% of total traffic), Layer 3 still applies to the remaining 59% of requests that miss both upstream layers. Roughly 90% of total production traffic typically touches at least one caching layer when all three are running — the question is which one.
The economic model
Caching's unit economics are unusually favourable because the cost of running the cache is rounding-error against the avoided spend.
Worked example
Suppose you operate an LLM-backed support chatbot built on Claude Sonnet:
- 20,000 chat completions per day
- Average prompt: 800 input tokens, 300 output tokens
- Provider list price (illustrative): $3 per million input, $15 per million output
- Bill without caching: 20,000 × (800 × $3 + 300 × $15) / 1,000,000 = $138/day (≈$4,200/month)
Now layer in caching:
| Layer | Hit rate | Saved per day | Notes |
|---|---|---|---|
| Layer 1 (exact) | 8% | $11.04 | conservative; deterministic + duplicate-submit traffic |
| Layer 2 (semantic at 0.95) | 38% of the remainder | $48.18 | typical support-bot hit rate |
| Layer 3 (provider-native, Anthropic 90% discount on input) | applies to the 54% that miss both above | $19.20 | savings on the input-token portion of the still-dispatched calls |
| Total avoided per day | — | $78.42 | |
| Bill after caching | — | $59.58/day (≈$1,820/month) |
Net reduction: ~57%.
VERIFY (founder): replace this worked example with one drawn from a real Prism customer profile or a representative aggregated shape, using current provider pricing. The numbers above are reasonable industry-typical but worth grounding in production data.
What the caching infrastructure costs
| Component | Approximate cost |
|---|---|
| Redis (managed, ~5GB) | ~$30/month |
| Vector index (Upstash Vector, ~500K vectors) | ~$30/month |
| Embedding inference (BGE-small on a small VM or sidecar) | ~$10/month at the volume above |
| Total infra | ~$70/month |
The Layer 2 embedding inference cost at the call level is around $0.00002 per embed — 20,000 calls per day costs $0.40/day in embedding. Negligible against the $48/day saved on cache hits, and negligible against the $0.40/day embed costs themselves dominate over the vector-index ops cost.
The infrastructure is paid back in under a day's traffic. The break-even threshold is laughably low — even at 0.5% combined Layer 1 + Layer 2 hit rate, the caching stack pays for itself.
Where the math gets interesting
The math is favourable across the board, but the workload-shape sensitivity is real:
- Chatbot + FAQ + documentation Q&A workloads: ~50% bill reduction typical, ~60% achievable with tuning.
- Tool-using agent workloads: ~25–35% reduction. Lower because intent variability is higher and exact matches are rare.
- Code-generation workloads: ~15–25% reduction. Layer 1 catches deterministic test/eval cases; Layer 2 underperforms because variable names dominate the embedding space; Layer 3 carries most of the weight.
- One-shot transformation workloads (translation, summarisation of fresh content): <10% reduction. Each request is genuinely novel; Layer 3 is the only meaningful contributor.
The discipline is: instrument hit rate per layer, sample false-positive rate, and don't push semantic threshold below validated-safe levels.
Building this yourself
If you're tempted to build, here's what you actually need to put together. The components are not exotic; the integration discipline is what kills naive implementations.
Storage:
- Key-value store for Layer 1 + the response payloads (Redis is the default; KeyDB or DragonflyDB for higher throughput).
- Vector database with HNSW or equivalent for Layer 2 (pgvector, Qdrant, Pinecone, Upstash Vector, Weaviate — all work).
Embedding inference:
- A small fast sentence-embedding model (BGE-small-en-v1.5 at 384-dim, gte-small, MiniLM-L6-v2). Run as a sidecar HTTP service or co-located with the cache process; the per-call latency budget is ~10ms.
- Hosted alternative: text-embedding-3-small from OpenAI. Higher dimension, more accurate, more expensive per call and adds a network hop.
Fingerprint normalisation library:
- The single most-bug-prone part of the system. Canonical ordering of message keys, consistent treatment of nullable parameters, whitespace normalisation, deterministic JSON serialisation. Test exhaustively.
TTL + invalidation:
- TTL is the simple lever: cached responses expire after some duration (5 minutes for time-sensitive content, 24 hours+ for stable knowledge-base content).
- Explicit invalidation requires pattern-based purge support from your KV store. Redis SCAN + DEL works for moderate scale; production deployments often want a tag-based system where cache entries carry tags and you purge by tag.
Pass-through wiring for Layer 3:
- Anthropic: attach
cache_control: { type: "ephemeral" }blocks to the portions of the messages array you want cached (typically the system message and any static context). Without the marker, no caching. - OpenAI: nothing to do at request time. Read
cached_tokensfrom the response's usage block to know what hit. - Customer billing: subtract the cached-token discount from the cost you bill (if you're a gateway), or just enjoy the lower bill (if you're the application).
Observability:
- Per-request: which layer (if any) served the response, the cache age, the similarity score (for Layer 2 hits), and the dollar savings.
- Aggregate: hit rate per layer per task type per day. Use this to tune threshold and TTL.
- Sampled validation: a periodic job that pulls 100 Layer 2 hits and runs a stronger LLM as a judge on whether the cached response was appropriate.
Failure modes:
- Cache poisoning by storing an erroneous response. Mitigation: only cache successful responses with non-zero token counts; never cache mid-stream errors.
- Threshold drift: a workload that was safe at 0.95 yesterday might creep into false-positive territory as the user base broadens. Re-validate quarterly.
- Embedding model upgrade: re-embedding the entire cache after switching models is the only safe migration path. Plan for it.
If you'd rather not build all of this, Prism ships all three layers running concurrently on every Pro+ request with the operational discipline baked in. Skip to the next section if you're evaluating; the rest of this section assumes you're committed to building.
Operating it in production
Three operational habits separate caching deployments that work from caching deployments that quietly degrade.
1. Monitor false-positive rate continuously.
The most insidious failure mode is a Layer 2 semantic cache that returns wrong-but-similar answers. The user gets bad information, doesn't realise it came from a cache, blames the product. You may not know it's happening for weeks. The mitigation is sampled validation: every day or every week, pull 100 random Layer 2 hits and judge them. Track the false-positive rate as a first-class metric, alongside hit rate.
2. Stratify hit rate by task type.
Hit rate averaged across all traffic is a useless number. A 30% aggregate hit rate could be 50% on chatbot traffic, 15% on agent traffic, and 5% on code generation — and the right tuning move depends on which is happening. Tag every request with its task type at ingest, and track per-task-type hit rate.
3. Validate threshold against actual traffic, not against intuition.
The 0.95 default is conservative and works for most workloads. But the "right" threshold for your specific workload depends on how your prompts cluster in embedding space — which depends on your domain, your users' phrasing variability, and your prompt structure. Run a sampled threshold-sweep monthly: at the current threshold, what's the false-positive rate? At 0.93, what would the rate be (estimated from a sample at 0.93)? Pick the threshold that maximises hit rate subject to false-positive rate <2%.
Build vs buy
There's a clean economic argument both ways.
Build if:
- You have unique workload shape that benefits from custom fingerprinting (e.g. you want to ignore certain request fields that are noise for your application).
- You've already deployed Redis + a vector database and the marginal cost is low.
- You need on-prem deployment for compliance reasons that exclude managed AI gateways.
- You have engineering time to spare on the normalisation + threshold-tuning discipline.
Buy if:
- You want to compress weeks of build + tune + validate into days.
- You're paying for caching infrastructure already AND for engineer time AND not getting the false-positive monitoring discipline right.
- You want provider-native pass-through to land for free (most homegrown caches skip Layer 3 entirely because the per-provider integration takes work).
- The marginal cost of the gateway is less than the engineering time you'd spend building.
A reasonable benchmark: if your monthly LLM spend is below ~$2,000, the per-request markup on a managed gateway probably costs less than the engineering hours to build the stack. Above $20,000/month, the in-house economics start to win on token markup if you have the engineering capacity to operate it well. The middle range is where the build-vs-buy decision is workload-specific.
How Prism implements this
Prism runs all three layers on every Pro+ request. The relevant design choices, for engineers evaluating:
- Layer 1 storage: Upstash Redis (Mumbai region), sub-8ms p95 lookup. Fingerprint is SHA-256 over the normalised messages array + model + temperature + top_p + max_tokens. Per-account scoping by default; per-project scoping configurable on Pro/Team.
- Layer 2 storage: Upstash Vector (Mumbai region), HNSW index, BGE-small-en-v1.5 embeddings (384-dim). Default cosine threshold 0.95; Pro/Team customers tune per-project via
X-Prism-Cache-Thresholdheader. Embedding inference runs on a sidecar container so an embedding-side spike can't take the API process down. - Layer 3 pass-through: Anthropic and OpenAI cache hits are detected from the upstream
cache_read_tokensandcached_tokensresponse fields. The customer's bill is calculated against the discounted base — Prism's markup applies on top of the discounted figure, so the provider discount is passed through cleanly. TheX-Prism-Native-Cache-Saved-Centsresponse header surfaces the saving per request. - TTL: default 1 hour. Configurable 60s–7d on Free + BYOK, 60s–30d on Pro/Team via
X-Prism-Cache-TTLheader. - Edge replication: cache entries propagate to Cloudflare Workers KV globally via the
prism-edgeWorker. International cache hits from Singapore land in 184ms vs 484ms direct to Mumbai origin. - Observability: every response carries
X-Prism-Cache-Status(one ofmiss,hit-exact,hit-semantic,bypass),X-Prism-Cache-Similarity(0.0–1.0 on Layer 2 hits),X-Prism-Cache-Age-Seconds(cache entry age on hits), andX-Prism-Cache-Saved-Cents(dollar savings vs uncached). The cache inspector at/dashboard/cacheshows hit-rate-at-threshold curves and lets you simulate tuning before committing.
VERIFY (founder): confirm the BGE-small dimension (384) and the embedding model name string match production. Confirm the X-Prism-Cache header names match what's emitted from
services/cache_headers.py. Confirm the latency p95s match recent telemetry.
The savings calculator models the expected impact on your own workload using the same pricing inputs Prism uses internally.
Decision framework
If you're picking what to do for your workload:
- Always run Layer 1. Cost is trivial. Hits are pure wins. Correctness is guaranteed.
- Run Layer 2 if your workload has paraphrasable intent. Customer support, in-product help, FAQ, documentation Q&A — yes. Pure tool-calling agents with high-cardinality context — probably not.
- Start at threshold 0.95. Instrument false-positive rate. Tune. Default is conservative on purpose.
- Wire up Layer 3 for any workload with a stable system prompt of meaningful length. Anthropic ≥a few hundred tokens with
cache_controlmarkers; OpenAI ≥1,024 tokens automatic. - Stratify hit-rate monitoring by task type. Aggregate hit rate is misleading.
- Validate threshold quarterly. Workloads evolve; thresholds that were safe last quarter may not be safe this quarter.
The economics on response caching for LLM APIs are unusually favourable — false-positive risk is the only real cost, and that's a discipline problem, not an unsolvable one.
Where to go next
If you're building: start with Prompt cache fingerprinting pitfalls for the normalisation discipline that prevents most Layer 1 bugs, then Exact vs semantic caching for LLMs for the threshold-tuning detail.
If you're comparing AI gateways: Prism vs Portkey, Prism vs Helicone, and the broader AI gateway comparison guide cover the surface.
If you want to model your savings: the savings calculator takes your token volume + workload shape and outputs expected per-month bill reduction.
If you want to skip the rest and just start: signup at ssimplifi.com is free, no credit card. Bring your own provider key and most of Pro unlocks within fair-use (v1.9, rolling out shortly).
Frequently asked questions
Should I always cache LLM responses?
Not always, but almost always. The exception is workloads where every request is genuinely novel and a cache hit would be miraculous — one-shot transformations on fresh content, real-time market analysis, custom report generation from per-request inputs. For everything else — and "everything else" is the large majority of production LLM traffic — caching pays for itself in days.
What's a realistic AI API caching hit rate?
Across all three layers combined, typically 40–70% of requests touch at least one cache layer in a well-tuned production deployment. Layer 1 alone catches 5–15%, Layer 2 catches another 25–50% of the remainder on workloads with paraphrasable intent, Layer 3 (provider-native pass-through) discounts the input-token portion of the calls that miss both upstream layers. Workload shape dominates — chatbot/FAQ at the high end, pure tool-using agents at the low end.
Does semantic caching ever return the wrong answer?
It can, and that's the central engineering problem. The defence is the cosine similarity threshold (0.95 default for general workloads) plus continuous sampled false-positive monitoring. A correctly-tuned and validated semantic cache holds false-positive rate below 2% on most workloads; an untuned one can hit 10–15% on broad chat traffic. Don't deploy semantic caching without the validation discipline.
Is Anthropic prompt caching worth the 25% write premium?
Yes, on any workload where the same stable prefix is reused. The break-even is at the second request — 1.25x + 0.10x averaged is 0.675x, already a 32% saving. Three hits drops the average to 0.483x. Long-running stable prefixes (system prompts, retrieved-context preambles, tool definitions) hit dozens of times within the 5-minute default TTL, so the practical effective discount approaches the 0.10x rate.
Can I run semantic caching without an embedding model?
No — that's the definition. What you can do is run Layer 1 + Layer 3 only, which catches a real chunk of traffic with no embedding dependency. This is often the right starting point for teams adding caching to an existing system; add Layer 2 once Layer 1 + Layer 3 are operating cleanly.
How does cache invalidation work?
Two mechanisms, used together. TTL expires entries automatically after a configured duration; most deployments default to 1 hour, longer for stable knowledge-base content, shorter for time-sensitive responses. Explicit invalidation purges entries matching a pattern when source-of-truth content changes — Prism supports this via the cache inspector at /dashboard/cache on Pro+; in homegrown systems, tag-based eviction is the cleanest pattern.
Do I need a vector database for semantic caching?
Practically yes. Vector similarity search at any meaningful scale requires an index (HNSW or equivalent). Options include managed services (Upstash Vector, Pinecone), self-hosted (Qdrant, Weaviate, pgvector on existing Postgres), and embedded (LanceDB). Don't try to do this with cosine-distance scans over a flat array past ~10K entries.
Will the cache hit rate decay over time?
The cache itself doesn't decay, but hit rate can drift if your traffic mix changes. As you onboard new user cohorts, expand into new feature areas, or change your prompt structure, the cache may need re-tuning. The operational habit is: re-validate hit rate by task type and false-positive rate quarterly; tune threshold and TTL based on what changed.
What happens to caching on streaming requests?
Cache hits are returned as non-streaming JSON regardless of the request's stream=true flag — serving a cached response as a fake stream is possible but adds complexity for no real benefit. Cache writes on streaming responses are stored after the stream completes; partial or errored streams are never cached. This is the same discipline most production caching layers follow.
Is the gateway's caching layer better than the provider's own prompt caching?
They solve different problems and stack. Gateway-side Layers 1 + 2 avoid the provider call entirely; provider-side Layer 3 discounts the calls that still go through. Use both. The "vs" framing is a category mistake.
Want this content in shorter form? See the semantic cache glossary entry, prompt-caching glossary entry, or model your workload's savings in the savings calculator.
Deep dives on ai api caching
Five cluster posts unpack the sub-topics of this pillar. Each ships independently as part of the content calendar.