Last updated:

Semantic cache

A cache that returns prior responses when a new prompt is semantically similar (not just byte-identical) to a previously cached one.

How it works

A semantic cache stores responses keyed by the meaning of the input instead of by its bytes. When a new request arrives, the cache embeds it into a vector (typically 384 or 768 dimensions using a sentence-embedding model like BGE-small or text-embedding-3-small), then runs an approximate-nearest-neighbor search against the cached embeddings. If a match exists with cosine similarity above a threshold (commonly 0.95), the cached response is returned without calling the LLM.

The mechanic differs from an exact cache in one crucial way: an exact cache hashes the full request and only returns a hit if the hash matches byte-for-byte. A semantic cache tolerates rephrasing, whitespace differences, capitalization, punctuation, and even meaningful synonyms — anything the embedding model treats as semantically equivalent. The cost is a small embedding-inference latency on every lookup (typically 10-50ms) and the risk of false positives when two distinct intents happen to embed similarly.

When it matters

Semantic caching pays off when the workload contains near-duplicate prompts: customer-support bots receiving "how do I reset my password" in twenty phrasings; documentation Q&A where the same conceptual question gets asked daily; recommendation engines where similar user profiles produce similar prompts. Internal benchmarks across Prism's customer base show 30-60% hit rates on these workloads — far above the 5-15% an exact cache catches alone.

It doesn't pay off — and can hurt — on workloads where small phrasing differences carry semantic weight: legal document analysis, code generation with precise specifications, scientific writing. For those, an exact cache plus provider-native prompt caching delivers most of the savings without the false-positive risk.

Threshold tuning

The cosine-similarity threshold is the dial that trades hit rate for accuracy. At 0.99 the cache only fires on near-identical queries — essentially an expensive exact cache. At 0.85 the cache fires aggressively, often returning a stale or off-topic response. The sweet spot for most production workloads is between 0.93 and 0.97; Prism defaults to 0.95 and lets Pro/Team customers tune the threshold per-project on the dashboard. A second knob worth tuning is TTL: short-lived (5 minutes) for time-sensitive prompts, longer (24 hours or more) for stable knowledge-base queries.

How Prism implements it

Prism's semantic cache uses BGE-small-en-v1.5 (384-dim) for embeddings and Upstash Vector for the index, namespaced per project. The embedding pass runs on a sidecar container so an embedding-side spike can't take the API process down. Every cache hit emits the cosine similarity in an X-Prism-Cache-Similarity response header so customers can audit hit quality. See the AI API Caching guide for the full implementation + measured economics.

See your savings before you sign up

Run our calculator on your own workload. Real provider rates, real cache math, no email gate.

Frequently asked questions

Won't semantic caching return wrong answers when prompts are different but similar?
It can — that's the false-positive risk. The defense is the cosine-similarity threshold: 0.95 is high enough that meaningful intent differences (e.g., 'cancel my subscription' vs 'how do I cancel a subscription') produce embedding distances above the cutoff. Prism lets customers tune this per-project + provides the X-Prism-Cache-Similarity header on every hit so you can audit and adjust.
How is semantic cache different from prompt caching?
Different layer. Semantic caching matches near-duplicate prompts and returns a cached response without calling the LLM. Prompt caching (Anthropic's cache_control blocks, OpenAI's automatic prompt cache) reduces the input-token cost of repeated system-prompt prefixes when the LLM IS called. They stack — Prism does both.
What's a realistic semantic-cache hit rate?
Depends entirely on workload. Customer-support bots and FAQ-style chatbots see 30-60% hit rates. Coding assistants with specific specifications see 5-15%. Workloads where every prompt is genuinely unique (legal analysis, custom report generation) see near-zero. Run the cache-hit-rate estimator to model your workload before betting on a number.
What embedding model should we use for the cache?
For most workloads, a 384-dim sentence-embedding model like BGE-small-en-v1.5 is the right choice — fast, cheap, sufficient quality for similarity matching. OpenAI's text-embedding-3-small is the popular alternative at 1536 dims (more accurate but slower + more expensive per embed). The model needs to be consistent: re-embedding the entire cache after switching models is the only way to migrate.