Speculative routing

Firing the same request to two AI providers in parallel and returning the first successful response — a tail-latency reduction technique.

How it works

Speculative parallel routing is a latency-hedging mechanic for LLM API requests: the gateway fires the primary model call and the first healthy fallback simultaneously, returns whichever response arrives first, and cancels the loser mid-flight. The technique trades ~1.3x average token cost for a meaningful reduction in p99 latency, especially when one provider degrades while another stays healthy.

The mechanic is borrowed from speculative execution in distributed systems: when waiting for an answer is expensive, do the work twice and take whichever answer comes back first. For LLM gateways, "expensive" is user-perceived latency on a slow provider; "doing the work twice" is two provider calls.
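A minimal sketch of that race, assuming each provider is wrapped as an async callable that takes a prompt and returns text. `ProviderCall` and `speculative_call` are illustrative names, not any gateway's actual API. The sketch returns the first successful response, keeps waiting if the first finisher failed, and cancels whatever is still running.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical shape: an async callable taking a prompt and returning text.
ProviderCall = Callable[[str], Awaitable[str]]


async def speculative_call(
    prompt: str, primary: ProviderCall, fallback: ProviderCall
) -> str:
    """Race two provider calls; return the first success, cancel the loser."""
    pending = {
        asyncio.create_task(primary(prompt)),
        asyncio.create_task(fallback(prompt)),
    }
    errors: list[BaseException] = []
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                exc = task.exception()
                if exc is None:
                    return task.result()  # first successful response wins
                errors.append(exc)  # a failure does not end the race
        raise RuntimeError(f"both providers failed: {errors!r}")
    finally:
        # Best-effort cancel of whatever is still in flight.
        for task in pending:
            task.cancel()
        if pending:
            await asyncio.gather(*pending, return_exceptions=True)
```

Cancelling the task only stops work on the gateway side. The provider keeps generating until it notices the closed connection, which is where the extra token cost comes from.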

What it buys you

Lower p99 latency. When one provider returns 200s in 8 seconds and the other in 800ms, the customer sees the 800ms response. Without speculative routing, every request routed to the slow provider (half of them under an even split) would have waited the full 8 seconds.
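The effect on the tail can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a benchmark: the 8-second and 800ms figures come from the example above, while the ±20% jitter, the even traffic split and the function names are invented for illustration.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def simulate(n=10_000, seed=0):
    """Return (unhedged, hedged) latency samples in seconds."""
    rng = random.Random(seed)
    unhedged, hedged = [], []
    for _ in range(n):
        slow = 8.0 * rng.uniform(0.8, 1.2)  # degraded provider (assumed jitter)
        fast = 0.8 * rng.uniform(0.8, 1.2)  # healthy provider (assumed jitter)
        # Unhedged: an even split sends half the requests to each provider.
        unhedged.append(slow if rng.random() < 0.5 else fast)
        # Hedged: both calls race and the caller sees the quicker one.
        hedged.append(min(slow, fast))
    return unhedged, hedged


if __name__ == "__main__":
    unhedged, hedged = simulate()
    print("unhedged p99:", round(percentile(unhedged, 99), 2))
    print("hedged   p99:", round(percentile(hedged, 99), 2))
```

Because the hedged latency is the minimum of the two calls, it can never exceed what either provider alone would have delivered for that request.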

Resilience to single-provider degradation. Provider outages and capacity issues that would otherwise propagate to the application as slow responses or 5xx errors get smoothed over by the healthy second provider. The customer doesn't notice the degradation.

Implicit failover. If one provider returns an error, the other's success still propagates. The error path is "both failed" rather than "primary failed, retry against secondary."

What it costs

Token cost. The loser provider keeps generating tokens until the cancel propagates over the HTTP connection. That is typically a few hundred milliseconds' worth of tokens, which get billed normally. Empirically the average token cost is ~1.3x a serial call; better on short generations (less time for the loser to generate), worse on long generations.
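The multiplier is plain accounting, sketched below. The function names are hypothetical and the token counts are made-up inputs. The 1.3 default and the $5K/month figure are the ones quoted in this article.

```python
def effective_multiplier(records):
    """Total billed tokens divided by what the winners alone would have cost.

    records: list of (winner_billed_tokens, loser_billed_tokens) pairs,
    one pair per speculative request (hypothetical log format).
    """
    winner_total = sum(winner for winner, _ in records)
    loser_total = sum(loser for _, loser in records)
    if winner_total == 0:
        raise ValueError("no winner tokens recorded")
    return (winner_total + loser_total) / winner_total


def hedging_overhead(unhedged_spend, multiplier=1.3):
    """Extra spend attributable to losers, given the unhedged baseline."""
    return unhedged_spend * (multiplier - 1)


if __name__ == "__main__":
    # Made-up sample: losers billed 30% of what the winners were billed.
    print(effective_multiplier([(1000, 300), (500, 150)]))
    # The $5K/month example from the FAQ, treated as the unhedged baseline.
    print(round(hedging_overhead(5000), 2))
```

Measuring the multiplier from billing records rather than assuming 1.3x matters because the ratio shifts with generation length, as noted above.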

Streaming complexity. Mid-stream speculation is messy. If both providers are streaming, the gateway has to demux two SSE streams, the customer sees first-token from whichever responds first, and switching mid-stream is hard. Most production implementations skip streaming entirely under speculative routing.

When it makes sense

Speculative routing earns its complexity when:

  • Sport mode / max-quality tier. The caller has already declared they prefer quality over cost; spending 30% more tokens for latency hedging fits the existing trade.
  • p99 latency is a real product concern. Customer-facing UX where slow tail responses cause real friction. Less useful for batch workloads where tail latency doesn't matter.
  • Provider reliability varies. If your primary provider is rock-solid, the hedging benefit is small. If your traffic depends on a less-reliable secondary provider, hedging pays off.

Failover vs speculative routing

Failover and speculative are adjacent but different. Failover is sequential — try provider A, if it fails try B. Speculative is parallel — try A and B at the same time, take the first. Failover's cost is paid only on failure (healthy requests cost nothing extra); speculative's cost is paid on every request (the loser's tokens are wasted regardless). Production gateways often run both: speculative on sport-mode tier for latency hedging, failover on all paid traffic as a reliability baseline. See multi-provider failover for the sequential side.
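For contrast, sequential failover can be sketched in a few lines. As before, `ProviderCall` and `failover_call` are illustrative names and each provider is assumed to be an async callable that raises on failure. The second provider is only contacted after the first one fails, so a healthy request costs nothing extra.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# Hypothetical shape: an async callable taking a prompt and returning text.
ProviderCall = Callable[[str], Awaitable[str]]


async def failover_call(prompt: str, providers: Sequence[ProviderCall]) -> str:
    """Try providers in order; move to the next only when one raises."""
    last_error: Optional[Exception] = None
    for call in providers:
        try:
            return await call(prompt)
        except Exception as exc:  # any provider error triggers the next attempt
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

The trade is visible in the control flow: failover adds the failed attempt's latency to the request but no cost when things are healthy, while the parallel race adds cost to every request but no waiting.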

How Prism implements it

Prism's v1.5 Router Hardening pillar adds speculative routing to sport mode on Pro+ accounts. The dispatcher fires the primary provider call and the first healthy fallback (determined by capability-tier matching from the same routing table that drives failover); whichever finishes first wins. The loser's task is cancelled best-effort: the cancel only takes effect once the HTTP connection to the losing provider is closed, which takes a few hundred milliseconds. Provider health is observed only for the winner (the loser was racing fairly; cancelling it isn't evidence of unhealthiness).
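Picking "the first healthy fallback" by capability tier could look roughly like the sketch below. This is not Prism's code: the `Route` type, the tier labels, the provider names and the shape of the health map are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    provider: str  # hypothetical provider identifier
    tier: str      # hypothetical capability tier label


def first_healthy_fallback(
    primary: Route, table: list[Route], healthy: dict[str, bool]
) -> Optional[Route]:
    """Return the first route in table order that matches the primary's tier,
    belongs to a different provider, and is currently marked healthy."""
    for route in table:
        if route.provider == primary.provider:
            continue
        if route.tier == primary.tier and healthy.get(route.provider, False):
            return route
    return None  # no eligible partner: the caller would make a single call


if __name__ == "__main__":
    table = [
        Route("provider-a", "frontier"),
        Route("provider-b", "small"),
        Route("provider-c", "frontier"),
    ]
    health = {"provider-a": True, "provider-b": True, "provider-c": True}
    print(first_healthy_fallback(table[0], table, health))
```

Treating providers with no health record as unhealthy is a conservative choice in this sketch; a real dispatcher might default the other way.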

Free and balanced/eco modes don't get speculative — the cost overhead isn't worth it on cheap models, and the hedging benefit is smaller when the underlying provider latency is already fast.

Frequently asked questions

Is speculative routing the same as multi-model synthesis / fusion mode?
No, the selection logic differs. Speculative fires N providers and takes the FIRST response (latency hedging). Fusion mode fires N providers and synthesises across ALL responses with a judge model (quality wedge). Same fan-out shape, different goal. Prism uses the same underlying dispatch primitive for both, with different selection logic on top.

How does the loser's cancel work?
Best-effort, via asyncio task cancellation on the gateway side and HTTP connection close. The actual cancel takes a few hundred milliseconds to propagate over the HTTP connection; during that window the loser provider keeps generating tokens, which get billed. In practice this means ~1.3x effective token cost averaged across all speculative calls.

Is there an outage scenario where speculative routing makes things worse?
If both providers are slow, speculative produces a slow response and pays for both calls. The cost is real but the impact is limited: the user was getting a slow response anyway, and the speculative version just costs more. The case where speculative is unambiguously worse is healthy traffic, where the primary would have responded in 500ms and the speculative call costs 1.3x the tokens for no latency improvement.

Why doesn't every paid request get speculative routing?
Cost. The 30% token overhead on every request is meaningful at scale: on traffic that would cost $5K/month unhedged, that's $1,500/month in hedging cost. Eco and balanced modes typically don't need it (cheap models are fast); sport mode does (the customer already declared they prefer quality + speed over cost). Restricting speculative to sport-mode + Pro+ keeps the cost where the value is.