Question 1

Won't semantic caching return wrong answers when prompts are different but similar?

Accepted Answer

It can — that's the false-positive risk. The defense is the cosine-similarity threshold: 0.95 is high enough that meaningful intent differences (e.g., 'cancel my subscription' vs 'how do I cancel a subscription') produce embedding distances above the cutoff. Prism lets customers tune this per-project + provides the X-Prism-Cache-Similarity header on every hit so you can audit and adjust.

Question 2

How is semantic cache different from prompt caching?

Accepted Answer

Different layer. Semantic caching matches near-duplicate prompts and returns a cached response without calling the LLM. Prompt caching (Anthropic's cache_control blocks, OpenAI's automatic prompt cache) reduces the input-token cost of repeated system-prompt prefixes when the LLM IS called. They stack — Prism does both.

Question 3

What's a realistic semantic-cache hit rate?

Accepted Answer

Depends entirely on workload. Customer-support bots and FAQ-style chatbots see 30-60% hit rates. Coding assistants with specific specifications see 5-15%. Workloads where every prompt is genuinely unique (legal analysis, custom report generation) see near-zero. Run the cache-hit-rate estimator to model your workload before betting on a number.

Question 4

What embedding model should we use for the cache?

Accepted Answer

For most workloads, a 384-dim sentence-embedding model like BGE-small-en-v1.5 is the right choice — fast, cheap, sufficient quality for similarity matching. OpenAI's text-embedding-3-small is the popular alternative at 1536 dims (more accurate but slower + more expensive per embed). The model needs to be consistent: re-embedding the entire cache after switching models is the only way to migrate.

Semantic cache

How it works

When it matters

Threshold tuning

How Prism implements it

See your savings before you sign up

Frequently asked questions

Related reading

All glossary terms

Read the guides