Exact vs semantic cache
Exact caches require byte-identical inputs; semantic caches use embeddings + cosine similarity to match near-equivalent prompts.
The 60-second answer
An exact-match cache returns a stored response when a new request is byte-identical to a previous one: same fingerprint, same answer. It is fast (sub-10ms p95), correct by definition, and hits about 5-15% of production LLM traffic. A semantic cache returns a stored response when a new request's embedding is close enough to a previous embedding: same meaning, even if the words differ. It is slower (20-40ms p95 including the embedding inference), probabilistic on correctness, and hits another 25-50% of traffic on workloads with paraphrasable intent. Production deployments run both; they catch overlapping but distinct slices of traffic.
How exact-match works
Compute a deterministic fingerprint of the request — typically a SHA-256 hash over the canonicalised messages array plus relevant request parameters (model, temperature, top_p, max_tokens). Look up the fingerprint in a key-value store; if present, return the cached response. The discipline is fingerprint normalisation: two trivially-equivalent requests (different whitespace, different field ordering, optional fields present-vs-absent) need to hash to the same key. Without that discipline, hit rates stay near zero.
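The normalisation step can be sketched as follows. This is a minimal illustration, not a reference implementation: the names fingerprint, ExactCache and KEY_PARAMS are hypothetical, the in-memory dict stands in for a real key-value store, message contents are assumed to be plain strings, and collapsing whitespace is an assumption that may be wrong for prompts where spacing carries meaning (code, for example).

```python
import hashlib
import json

# Parameters named in the text as part of the key; a real gateway may need more.
KEY_PARAMS = ("model", "temperature", "top_p", "max_tokens")


def fingerprint(request: dict) -> str:
    """Return a SHA-256 hex digest over a canonical form of the request."""
    # Assumption: each message has string "role" and "content" fields.
    # Runs of whitespace are collapsed so trivial spacing differences hash alike.
    messages = [
        {"role": m["role"], "content": " ".join(m["content"].split())}
        for m in request["messages"]
    ]
    # Treat an optional parameter that is absent the same as one set to None.
    params = {k: request[k] for k in KEY_PARAMS if request.get(k) is not None}
    # sort_keys removes field-ordering differences; compact separators
    # remove formatting differences in the serialised form.
    canonical = json.dumps(
        {"messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactCache:
    """In-memory stand-in for the key-value store."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, request: dict):
        # Returns None on a miss.
        return self._store.get(fingerprint(request))

    def put(self, request: dict, response: str) -> None:
        self._store[fingerprint(request)] = response
```

With this sketch, two requests that differ only in field order, extra spaces, or an optional parameter passed as None produce the same key. It does not equate an omitted parameter with an explicitly supplied default value; that would need knowledge of each provider's defaults.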
How semantic-match works
Embed the user's prompt with a sentence-embedding model (BGE-small at 384 dimensions is the common default; OpenAI text-embedding-3-small at 1536 is the alternative). Query a vector database for the nearest stored embedding. If the cosine similarity exceeds a threshold (0.95 is the standard production default), return the cached response associated with that stored embedding. The threshold sets the trade-off between hit rate and false-positive rate: a lower threshold catches more hits but accepts more wrong answers.
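The lookup described above might look like the sketch below. It is illustrative only: SemanticCache is a hypothetical class, the embed callable is supplied by the caller and would wrap whichever embedding model is in use, and a brute-force scan over a Python list stands in for the vector database's nearest-neighbour query.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy semantic cache: a linear scan replaces the vector database."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self._embed = embed          # assumption: caller provides the embedding function
        self._threshold = threshold  # 0.95 is the default named in the text
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((self._embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        if not self._entries:
            return None
        query = self._embed(prompt)
        # Score every stored embedding and keep the nearest one.
        best_score, best_response = max(
            ((cosine_similarity(query, vec), resp) for vec, resp in self._entries),
            key=lambda pair: pair[0],
        )
        # Only a similarity above the threshold counts as a hit.
        return best_response if best_score > self._threshold else None
```

Lowering the threshold argument makes get return stored answers for less similar prompts, which is the hit-rate versus false-positive trade described above.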
When each wins
Exact-match wins on deterministic workloads (cron jobs, evaluation runs, regression tests where the same prompts fire repeatedly), correctness-critical contexts (legal, medical, financial where a wrong answer is liability), and short-prompt high-volume scenarios where the embedding cost would dominate.
Semantic-match wins on workloads where users phrase the same question multiple ways (customer support, FAQ, documentation Q&A), knowledge-grounded LLM apps where the answer space changes slowly, and any workload where the expected savings from avoided LLM calls exceed the embedding-inference cost paid on every request (almost always: the break-even hit rate is below 0.5%).
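The break-even reasoning can be made concrete with a small calculation. Every request pays the embedding cost, while only hits save an LLM call, so the break-even hit rate is the ratio of the two. The function names are hypothetical, and the sketch ignores vector-database lookup cost and latency.

```python
def break_even_hit_rate(embedding_cost: float, llm_call_cost: float) -> float:
    """Hit rate at which embedding spend equals the LLM spend avoided."""
    if llm_call_cost <= 0:
        raise ValueError("llm_call_cost must be positive")
    return embedding_cost / llm_call_cost


def net_saving_per_request(hit_rate: float, embedding_cost: float, llm_call_cost: float) -> float:
    """Expected saving per request: hits avoid an LLM call, every request pays for an embedding."""
    return hit_rate * llm_call_cost - embedding_cost
```

As an illustration with made-up prices (not measurements): an embedding that costs $0.00002 per request against an LLM call that costs $0.01 gives a break-even hit rate of 0.2%, and any hit rate above that yields a positive net saving.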
Combined effect
In production AI gateway deployments running both, exact-match catches ~10% of traffic, semantic catches another 30-40% of the remainder, and provider-native prompt-caching (a third layer that lives provider-side) discounts much of what's left. The combined effect is typically a 40-60% reduction in the total LLM bill on workloads where caching applies. Different layers, different mechanics, designed to stack.
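Because the semantic layer only sees what the exact layer missed, the two hit rates compose rather than add. A short sketch of that arithmetic, using a hypothetical helper and the figures quoted above:

```python
def stacked_hit_rate(exact_rate: float, semantic_rate_of_remainder: float) -> float:
    """Share of all traffic served from cache when semantic runs on exact-match misses."""
    return exact_rate + (1.0 - exact_rate) * semantic_rate_of_remainder


# Figures from the text: ~10% exact, then 30-40% of the remainder.
low = stacked_hit_rate(0.10, 0.30)
high = stacked_hit_rate(0.10, 0.40)
```

With these inputs the two gateway layers together serve roughly 37% to 46% of requests from cache. The provider-side prompt-caching discount is not modelled here, and the sketch assumes every request costs the same, so it is not a bill estimate on its own.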