Speculative routing
Firing the same request to two AI providers in parallel and returning the first successful response — a tail-latency reduction technique.
How it works
Speculative parallel routing is a latency-hedging mechanic for LLM API requests: the gateway fires the primary model call and the first healthy fallback simultaneously, returns whichever response arrives first, and cancels the loser mid-flight. The technique trades ~1.3x average token cost for a meaningful reduction in p99 latency, especially when one provider degrades while another stays healthy.
The mechanic is borrowed from speculative execution in distributed systems: when waiting for an answer is expensive, do the work twice and take the first answer back. For LLM gateways, "expensive" is user-perceived latency on a slow provider; "doing the work twice" is two provider calls.
What it buys you
Lower p99 latency. When one provider returns 200s in 8 seconds and the other in 800ms, the customer sees the 800ms response. Without speculative, the 50% of requests routed to the slow provider would have seen the 8-second response.
Resilience to single-provider degradation.Provider outages and capacity issues that would otherwise propagate to the application as slow responses or 5xx errors get smoothed over by the healthy second provider. The customer doesn't notice the degradation.
Implicit failover.If one provider returns an error, the other's success still propagates. The error path is "both failed" rather than "primary failed, retry against secondary."
What it costs
Token cost.The loser provider keeps generating tokens until the cancel propagates over the HTTP connection — typically a few hundred milliseconds' worth of tokens, which get billed normally. Empirically the average token cost is ~1.3x a serial call; better on short generations (less time for the loser to generate), worse on long generations.
Streaming complexity. Mid-stream speculation is messy. If both providers are streaming, the gateway has to demux two SSE streams, the customer sees first-token from whichever responds first, and switching mid-stream is hard. Most production implementations skip streaming entirely under speculative routing.
When it makes sense
Speculative routing earns its complexity when:
- Sport mode / max-quality tier. The caller has already declared they prefer quality over cost; spending 30% more tokens for latency hedging fits the existing trade.
- p99 latency is a real product concern. Customer-facing UX where slow tail responses cause real friction. Less useful for batch workloads where tail latency doesn't matter.
- Provider reliability varies. If your primary provider is rock-solid, the hedging benefit is small. If your traffic depends on a less-reliable secondary provider, hedging pays off.
Failover vs speculative routing
Failover and speculative are adjacent but different. Failover is sequential — try provider A, if it fails try B. Speculative is parallel — try A and B at the same time, take the first. Failover's cost is paid only on failure (healthy requests cost nothing extra); speculative's cost is paid on every request (the loser's tokens are wasted regardless). Production gateways often run both: speculative on sport-mode tier for latency hedging, failover on all paid traffic as a reliability baseline. See multi-provider failover for the sequential side.
How Prism implements it
Prism's v1.5 Router Hardening pillar adds speculative routing to sport mode on Pro+ accounts. The dispatcher fires the primary provider call and the first healthy fallback (determined by capability-tier matching from the same routing table that drives failover); whichever finishes first wins. The loser's task is cancelled best-effort — actual cancellation requires the loser to close the HTTP connection, which takes a few hundred milliseconds. Provider health is observed only for the winner (the loser was racing fairly; cancelling it isn't evidence of unhealthiness).
Free and balanced/eco modes don't get speculative — the cost overhead isn't worth it on cheap models, and the hedging benefit is smaller when the underlying provider latency is already fast.