Exact vs semantic cache

Exact caches require byte-identical inputs; semantic caches use embeddings + cosine similarity to match near-equivalent prompts.

The 60-second answer

An exact-match cache returns a stored response when a new request is byte-identical to a previous one: same fingerprint, same answer. Fast (sub-10ms p95), correct by definition, and hits about 5-15% of production LLM traffic. A semantic cache returns a stored response when a new request's embedding is close enough to a previous embedding: same meaning, even if the words differ. Slower (20-40ms p95 including the embedding inference), probabilistic on correctness, and hits another 25-50% of traffic on workloads with paraphrasable intent. Production deployments run both; they catch overlapping but distinct slices of traffic.

How exact-match works

Compute a deterministic fingerprint of the request — typically a SHA-256 hash over the canonicalised messages array plus relevant request parameters (model, temperature, top_p, max_tokens). Look up the fingerprint in a key-value store; if present, return the cached response. The discipline is fingerprint normalisation: two trivially-equivalent requests (different whitespace, different field ordering, optional fields present-vs-absent) need to hash to the same key. Without that discipline, hit rates stay near zero.
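
The normalisation step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `DEFAULT_PARAMS` table, the default values in it, and the choice to collapse whitespace inside message content are assumptions made for the example, and whitespace collapsing in particular is unsafe for prompts where spacing is meaningful (code, tables).

```python
import hashlib
import json

# Hypothetical defaults for this sketch: a parameter the caller omits is filled
# in, so "absent" and "explicitly set to the default" produce the same key.
DEFAULT_PARAMS = {"temperature": 1.0, "top_p": 1.0, "max_tokens": None}


def canonicalise(request: dict) -> dict:
    """Reduce a request to the fields that affect the response, in normal form."""
    messages = [
        {
            "role": m["role"],
            # Assumption: runs of whitespace are not significant in this workload.
            "content": " ".join(m["content"].split()),
        }
        for m in request["messages"]
    ]
    params = {}
    for key, default in DEFAULT_PARAMS.items():
        value = request.get(key, default)
        # Coerce sampling parameters so 1 and 1.0 serialise identically.
        if key in ("temperature", "top_p") and value is not None:
            value = float(value)
        params[key] = value
    return {"model": request["model"], "messages": messages, **params}


def fingerprint(request: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(
        canonicalise(request),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this in place, two requests that differ only in field order, stray whitespace, or an omitted default hash to the same key, while a change to any sampling parameter yields a different key.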

How semantic-match works

Embed the user's prompt with a sentence-embedding model (BGE-small at 384 dimensions is the common default; OpenAI text-embedding-3-small at 1536 is the alternative). Query a vector database for the nearest stored embedding. If the cosine similarity exceeds a threshold (0.95 is the standard production default), return the cached response associated with that stored embedding. The threshold is the trade-off between hit rate and false-positive rate: a lower threshold catches more hits but accepts more wrong answers.
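
The lookup logic can be sketched without any external dependency. In the example below the embeddings are assumed to be computed elsewhere, and a linear scan over an in-memory list stands in for the vector database; the names `cosine_similarity` and `semantic_lookup` are hypothetical, chosen for illustration only.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: Sequence[float],
    store: List[Tuple[Sequence[float], str]],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the cached response of the nearest stored embedding, or None.

    `store` holds (embedding, cached_response) pairs. A real deployment would
    ask a vector database for the nearest neighbour instead of scanning.
    """
    best_score = -1.0
    best_response = None
    for vec, response in store:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # Serve from cache only when similarity exceeds the threshold.
    return best_response if best_score > threshold else None
```

Lowering `threshold` makes the final comparison pass more often, which is exactly the hit-rate versus false-positive trade described above.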

When each wins

Exact-match wins on deterministic workloads (cron jobs, evaluation runs, regression tests where the same prompts fire repeatedly), correctness-critical contexts (legal, medical, financial where a wrong answer is liability), and short-prompt high-volume scenarios where the embedding cost would dominate.

Semantic-match wins on workloads where users phrase the same question multiple ways (customer support, FAQ, documentation Q&A), knowledge-grounded LLM apps where the answer space changes slowly, and any workload where the savings from each avoided provider call exceed the embedding-inference cost (almost always: the break-even hit rate is below 0.5%).
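
The break-even claim follows from simple arithmetic: every request pays for an embedding, and only hits save a provider call, so the semantic layer pays for itself once the hit rate exceeds the ratio of the two costs. The sketch below works this through. The dollar figures in the usage lines are illustrative assumptions, not measured provider rates, and the model ignores vector-store query cost.

```python
def break_even_hit_rate(embed_cost: float, llm_call_cost: float) -> float:
    """Hit rate at which embedding every request costs as much as the hits save."""
    if llm_call_cost <= 0:
        raise ValueError("llm_call_cost must be positive")
    return embed_cost / llm_call_cost


def net_saving_per_request(
    hit_rate: float, embed_cost: float, llm_call_cost: float
) -> float:
    """Expected saving per request: hits avoid a call, every request pays to embed."""
    return hit_rate * llm_call_cost - embed_cost


# Illustrative assumptions: $0.00001 per embedding, $0.01 per avoided LLM call.
rate = break_even_hit_rate(embed_cost=0.00001, llm_call_cost=0.01)
print(f"break-even hit rate: {rate:.2%}")
```

Under those assumed prices the break-even sits well under 0.5%; plugging in your own per-call and per-embedding costs shows where it lands for a given workload.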

Combined effect

In production AI gateway deployments running both, exact-match catches ~10% of traffic, semantic catches another 30-40% of the remainder, and provider-native prompt-caching (a third layer that lives provider-side) discounts much of what's left. The combined effect is typically a 40-60% reduction in the total LLM bill on workloads where caching applies. Different layers, different mechanics, designed to stack.
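
Because the semantic layer only sees requests the exact layer missed, the two hit rates compose rather than add. A small sketch of that arithmetic, reading "30-40% of the remainder" as the semantic hit rate on exact-match misses (the function name is hypothetical, and provider-side prompt-cache discounts are left out because the text gives no figure for them):

```python
def combined_hit_rate(exact_rate: float, semantic_rate_on_misses: float) -> float:
    """Fraction of all requests served from cache by the two layers together."""
    # Semantic lookup only runs on the (1 - exact_rate) share that missed.
    return exact_rate + (1.0 - exact_rate) * semantic_rate_on_misses


low = combined_hit_rate(0.10, 0.30)
high = combined_hit_rate(0.10, 0.40)
print(f"requests served from cache: {low:.0%} to {high:.0%}")
```

With the figures quoted above, the two cache layers together serve roughly 37-46% of requests; the provider-native discount on the remaining calls accounts for the rest of the bill reduction.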

Frequently asked questions

Can I run only exact-match and skip semantic?

Yes, and it's a reasonable starting point. Exact-match is cheap, correct, and catches the deterministic-traffic slice. Adding semantic later (when paraphrasable intent becomes a real workload) is straightforward. The opposite (running only semantic and skipping exact-match) is rarely the right call, because exact-match is free of the false-positive risk semantic carries.

What's the false-positive rate on semantic caching at threshold 0.95?

Typically 1-3% on production workloads with broad intent diversity, lower on narrow-domain workloads (e.g. a chatbot for one product's documentation). A rate higher than that (5% or more) usually means the threshold is too low for the workload or the embedding model isn't a good fit for the domain. Re-validate quarterly via sampled human judgment.

Which layer matters more for cost reduction?

Semantic typically catches more traffic, so it dominates in raw savings. Exact-match has higher value-per-hit because it's correct by definition (no false-positive risk). On most workloads, semantic delivers the larger absolute savings; exact-match delivers more reliable savings. Production deployments run both because the value of stacking them exceeds either alone.

Does Prism run both?

Yes: exact-match in Redis with SHA-256 fingerprints, semantic in Upstash Vector with BGE-small embeddings at cosine threshold 0.95, plus a third layer (provider-native passthrough) that captures Anthropic and OpenAI prompt-cache discounts. All three concurrently, by default, on every paid request.