Semantic cache
A cache that returns prior responses when a new prompt is semantically similar (not just byte-identical) to a previously cached one.
How it works
A semantic cache stores responses keyed by the meaning of the input instead of by its bytes. When a new request arrives, the cache embeds it into a vector using a sentence-embedding model (384 dimensions for BGE-small, 1536 by default for text-embedding-3-small), then runs an approximate-nearest-neighbor search against the cached embeddings. If the closest match has a cosine similarity above a threshold (commonly 0.95), the cached response is returned without calling the LLM.
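The lookup step can be sketched in a few lines. This is a minimal illustration, not Prism's implementation: it assumes embeddings are already computed, uses a linear scan in place of an approximate-nearest-neighbor index, and the names CacheEntry and lookup are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheEntry:
    embedding: list[float]  # vector produced by the embedding model
    response: str           # cached LLM response


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(query_embedding, entries, threshold=0.95):
    """Return (response, similarity) for the closest entry if it meets the
    threshold, otherwise None (a miss, so the caller would call the LLM).

    Linear scan for clarity; a real cache would query a vector index.
    """
    best_entry, best_sim = None, -1.0
    for entry in entries:
        sim = cosine_similarity(query_embedding, entry.embedding)
        if sim > best_sim:
            best_entry, best_sim = entry, sim
    if best_entry is not None and best_sim >= threshold:
        return best_entry.response, best_sim
    return None
```

On a miss, the caller would send the request to the LLM and store the new embedding and response as a fresh entry.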
The mechanism differs from an exact cache in one crucial way: an exact cache hashes the full request and only returns a hit if the hash matches byte-for-byte. A semantic cache tolerates rephrasing, whitespace differences, capitalization, punctuation, and even synonyms: anything the embedding model treats as semantically equivalent. The cost is a small embedding-inference latency on every lookup (typically 10-50ms) and the risk of false positives when two distinct intents happen to embed similarly.
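For contrast, here is a sketch of how an exact cache might derive its key. The function name, the model name, and the choice of SHA-256 over a JSON payload are assumptions for illustration; the point is only that a one-character change produces a different key.

```python
import hashlib
import json


def exact_cache_key(model: str, prompt: str) -> str:
    """Hash the full request; any byte-level difference yields a new key."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = exact_cache_key("example-model", "How do I reset my password?")
b = exact_cache_key("example-model", "how do I reset my password?")
c = exact_cache_key("example-model", "How do I reset my password?")

# a == c (identical bytes), but a != b (only the capitalization differs),
# so an exact cache treats the second prompt as a miss.
```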
When it matters
Semantic caching pays off when the workload contains near-duplicate prompts: customer-support bots receiving "how do I reset my password" in twenty phrasings; documentation Q&A where the same conceptual question gets asked daily; recommendation engines where similar user profiles produce similar prompts. Internal benchmarks across Prism's customer base show 30-60% hit rates on these workloads — far above the 5-15% an exact cache catches alone.
It doesn't pay off — and can hurt — on workloads where small phrasing differences carry semantic weight: legal document analysis, code generation with precise specifications, scientific writing. For those, an exact cache plus provider-native prompt caching delivers most of the savings without the false-positive risk.
Threshold tuning
The cosine-similarity threshold is the dial that trades hit rate for accuracy. At 0.99 the cache only fires on near-identical queries, which makes it essentially an expensive exact cache. At 0.85 the cache fires aggressively and often returns a response to a different question than the one asked. The sweet spot for most production workloads is between 0.93 and 0.97; Prism defaults to 0.95 and lets Pro/Team customers tune the threshold per project on the dashboard. A second knob worth tuning is TTL: short (5 minutes) for time-sensitive prompts, longer (24 hours or more) for stable knowledge-base queries.
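One way to choose a threshold is to replay a hand-labeled sample of prompt pairs and measure what each candidate value would have done. The sketch below assumes such a sample exists as (similarity, same_intent) pairs; the function name and the sample values are invented for illustration and are not measurements.

```python
def evaluate_threshold(samples, threshold):
    """Return (hit_rate, false_positive_rate) for one threshold.

    samples: list of (similarity, same_intent) pairs, where same_intent is
    True when a human judged the cached answer correct for the new prompt.
    hit_rate: share of all samples that would be served from the cache.
    false_positive_rate: share of those hits whose intent actually differed.
    """
    if not samples:
        return 0.0, 0.0
    hits = [same for sim, same in samples if sim >= threshold]
    hit_rate = len(hits) / len(samples)
    false_positives = sum(1 for same in hits if not same)
    fp_rate = false_positives / len(hits) if hits else 0.0
    return hit_rate, fp_rate


# Made-up sample, purely to show the shape of the trade-off.
labeled = [
    (0.995, True), (0.97, True), (0.96, True), (0.94, False),
    (0.90, True), (0.87, False), (0.80, False), (0.60, False),
]
for t in (0.99, 0.95, 0.85):
    print(t, evaluate_threshold(labeled, t))
```

On this toy sample, lowering the threshold from 0.95 to 0.85 doubles the hit rate but makes a third of the hits wrong, which is the trade-off described above.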
How Prism implements it
Prism's semantic cache uses BGE-small-en-v1.5 (384-dim) for embeddings and Upstash Vector for the index, namespaced per project. The embedding pass runs on a sidecar container so an embedding-side spike can't take the API process down. Every cache hit emits the cosine similarity in an X-Prism-Cache-Similarity response header so customers can audit hit quality. See the AI API Caching guide for the full implementation and measured economics.
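A client-side audit of hit quality could read that header from each response. The sketch below assumes the header value is a plain decimal string and that the header is absent on cache misses; neither detail is confirmed here, and the helper name is hypothetical.

```python
from typing import Mapping, Optional

SIMILARITY_HEADER = "X-Prism-Cache-Similarity"


def cache_similarity(headers: Mapping[str, str]) -> Optional[float]:
    """Return the similarity reported for a cache hit, or None when the
    header is missing or not numeric (assumed here to mean a cache miss)."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == SIMILARITY_HEADER.lower():
            try:
                return float(value)
            except ValueError:
                return None
    return None
```

Logging this value alongside the prompt makes it possible to spot hits that sit just above the threshold and review them by hand.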