
Multi-provider failover

Automatically routing a request to a backup AI provider when the primary returns an error or times out.

How it works

Multi-provider failover is the reliability mechanism by which an AI gateway, when its chosen provider fails or times out, automatically retries the request against a different provider that hosts an equivalent model. The failure is invisible to the caller, who gets a successful response from a provider they did not pick. The mechanism absorbs provider outages, capacity issues, and transient errors that would otherwise propagate to the application as 5xx responses.

The simplest pattern: define a primary provider and a sequence of fallback providers per model class. On a failure (5xx response, rate-limit response such as HTTP 429, timeout, or connection error), dispatch to the next provider in the sequence. Repeat until a call succeeds or the fallback chain is exhausted. Modern gateways add provider health monitoring: recent failure rates per provider are tracked in a rolling Redis window, and unhealthy providers are skipped rather than retried.
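
The sequential pattern above can be sketched in a few lines of Python. This is a minimal illustration, not any particular gateway's implementation: `ProviderError`, `should_fail_over`, and `call_with_failover` are hypothetical names, and the health check is passed in as a plain callback rather than read from Redis.

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error type; status is None for timeouts and connection errors."""

    def __init__(self, status: Optional[int] = None):
        super().__init__(f"provider call failed (status={status})")
        self.status = status


def should_fail_over(err: ProviderError) -> bool:
    # Timeouts and connection errors carry no HTTP status in this sketch.
    if err.status is None:
        return True
    # 5xx responses and rate limits (HTTP 429) count as failures;
    # other 4xx responses would fail the same way on any provider.
    return err.status == 429 or 500 <= err.status <= 599


def call_with_failover(
    chain: Sequence[Tuple[str, Callable[[str], str]]],
    request: str,
    is_healthy: Callable[[str], bool] = lambda name: True,
) -> Tuple[str, str]:
    """Try each (name, call) pair in order, primary first."""
    last_error = None
    for name, call in chain:
        if not is_healthy(name):
            continue  # skip unhealthy providers instead of retrying into them
        try:
            return name, call(request)
        except ProviderError as err:
            if not should_fail_over(err):
                raise
            last_error = err
    raise RuntimeError("fallback chain exhausted") from last_error
```

Returning the provider name alongside the response is a design choice for this sketch: the caller does not need it, but logging which provider actually served the request makes failover events visible to operators.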

Failover vs routing

Failover and routing are adjacent but distinct concepts. Routing is "which model should this request go to" — proactive selection on every request based on intent, task type, mode, or policy. Failover is "the model I sent it to is unhealthy, send it somewhere else" — reactive recovery after a failed attempt. A production AI gateway needs both: routing picks the primary model, failover handles the case where the primary doesn't respond cleanly.

The failover chain

The structural decision is what counts as an "equivalent model" for failover purposes. Three patterns:

Capability-tier matching. Models are grouped into capability buckets (small / medium / large / frontier). On failover, the gateway picks a model in the same bucket from a different provider. A Claude Sonnet failure fails over to GPT-4o; a GPT-4o-mini failure fails over to Claude Haiku. This is what Prism uses (v1.5 hardening pillar): a 6-bucket index keyed in `router.MODEL_CAPABILITY`.

Fixed equivalent mapping. Each model has a hand-coded fallback equivalent. Less flexible than capability-tier matching but easier to reason about for small catalogs.

No model swap, just retry. Fall back to a different provider hosting the same model (where multiple providers offer the same open-weights model). Common in OpenRouter-style aggregation across providers offering Llama or DeepSeek deployments.
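
As an illustration of the first pattern, capability-tier matching reduces to a lookup over a catalog that records each model's provider and bucket. The catalog below is hypothetical: the model keys and bucket assignments are illustrative only, and this is not the contents of `router.MODEL_CAPABILITY`.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical catalog: model -> (provider, capability bucket).
# Bucket assignments here are assumptions chosen to match the examples above.
CATALOG: Dict[str, Tuple[str, str]] = {
    "claude-sonnet": ("anthropic", "large"),
    "gpt-4o": ("openai", "large"),
    "claude-haiku": ("anthropic", "small"),
    "gpt-4o-mini": ("openai", "small"),
}


def equivalent_models(
    failed_model: str,
    catalog: Dict[str, Tuple[str, str]] = CATALOG,
    unhealthy: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return same-bucket models hosted by a different, healthy provider."""
    failed_provider, bucket = catalog[failed_model]
    return [
        model
        for model, (provider, model_bucket) in catalog.items()
        if model_bucket == bucket
        and provider != failed_provider
        and provider not in unhealthy
    ]
```

A fixed equivalent mapping would replace the bucket comparison with a direct `model -> fallback` dictionary, and the same-model pattern would match on model name while still requiring a different provider.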

Streaming failover

This is the hard edge case. If a streaming response fails mid-stream (the provider drops the connection after returning partial tokens), failover is operationally complex: the gateway has to decide whether to abort the original stream, whether to start a fresh stream on the fallback provider, and how to communicate the change to the caller. Most production gateways skip mid-stream failover and instead fail cleanly to the caller, who can retry. Failover on non-streaming responses is the well-defined path.
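
The "fail cleanly" behavior can be sketched as a relay that forwards chunks and converts a dropped connection into one explicit error for the caller. This is a minimal sketch under assumptions: `StreamInterrupted` and `relay_stream` are hypothetical names, and the upstream drop is modeled as Python's built-in `ConnectionError`.

```python
from typing import Iterable, Iterator


class StreamInterrupted(Exception):
    """Hypothetical error surfaced to the caller when a stream dies mid-response."""

    def __init__(self, chunks_sent: int):
        super().__init__(f"stream interrupted after {chunks_sent} chunks")
        self.chunks_sent = chunks_sent


def relay_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Forward provider chunks; do not attempt failover once streaming has begun."""
    sent = 0
    try:
        for chunk in chunks:
            yield chunk
            sent += 1
    except ConnectionError as err:
        # No silent switch to another provider here: the caller already holds
        # partial output, so it gets a clear error and decides whether to retry.
        raise StreamInterrupted(sent) from err
```

Reporting how many chunks were delivered lets the caller decide whether to discard the partial output before retrying.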

Failover vs speculative parallel routing

A more aggressive pattern: speculative parallel routing fires the primary and the first fallback simultaneously on every request, returns whichever finishes first, and cancels the loser. It costs ~1.3x token spend in exchange for p99 latency hedging. This is a different mechanic from failover, which is sequential. See speculative-routing for the deeper dive. Prism runs speculative routing on sport-mode requests for Pro+ accounts; failover applies to all paid traffic regardless of mode.
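
The race-and-cancel mechanic can be illustrated with `asyncio`. This is a sketch only: `speculative_call` is a hypothetical helper, the two providers are modeled as plain coroutines, and treating an early failure as "keep waiting for the other call" is an assumption of this example, not something the pattern requires.

```python
import asyncio
from typing import Awaitable, Callable

ProviderCall = Callable[[str], Awaitable[str]]


async def speculative_call(primary: ProviderCall, fallback: ProviderCall, request: str) -> str:
    """Fire both calls at once and return the first successful response."""
    pending = {
        asyncio.create_task(primary(request)),
        asyncio.create_task(fallback(request)),
    }
    last_error = None
    try:
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
        raise RuntimeError("both providers failed") from last_error
    finally:
        # Cancel the loser so it stops running once a winner is chosen.
        for task in pending:
            task.cancel()
```

Cancelling the losing task only stops work on the gateway side; whether the provider still bills for tokens generated before cancellation depends on the provider, which is consistent with the extra token spend noted above.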

Frequently asked questions

Should every request have a failover chain?
Most production deployments enable failover on all paid traffic. The cost is near-zero (failover only fires on a failed call; healthy calls pay nothing extra) and the upside is real: roughly 3-5% of production traffic typically hits at least one provider error per week. Free-tier traffic often skips failover for cost reasons; paid traffic almost universally has it on.
Does failover affect response time on healthy traffic?
No. Failover is a recovery path. On healthy requests, the primary provider responds successfully and failover never fires. Only when the primary fails does failover engage, adding the latency of the second provider call. On a failover-recovered request, the caller's perceived latency is the time spent on the failed first attempt plus the fallback call: roughly 2x a normal call, and more if the primary runs all the way to its timeout before the fallback starts.
What about provider rate limits — does failover help with those?
Yes. A provider rate-limit response (typically HTTP 429) is treated as a failure by the failover layer, and the request retries against the next provider. This is one of the most common failover triggers in practice — provider rate limits hit frequently on bursty workloads, and silently failing over to a different provider keeps the application running smoothly.
How does Prism's failover work?
Prism's v1.5 Router Hardening pillar implements capability-tier failover with rolling-window provider health monitoring in Redis. On a primary failure, the gateway looks up the failed model's capability bucket, identifies an equivalent model on a different (healthy) provider, and retries. Unhealthy providers are skipped entirely on subsequent requests until their health score recovers. Sport-mode on Pro+ adds speculative parallel routing on top (fire two in parallel, take the first response).
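
For readers who want a concrete picture of rolling-window health monitoring in general, the sketch below keeps the window in process memory instead of Redis. It is not Prism's implementation: `ProviderHealth`, its thresholds, and the `min_samples` guard are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class ProviderHealth:
    """In-memory stand-in for a rolling failure-rate window per provider."""

    def __init__(
        self,
        window_seconds: float = 60.0,
        max_failure_rate: float = 0.5,
        min_samples: int = 5,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.window_seconds = window_seconds
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples
        self._clock = clock
        # provider -> deque of (timestamp, succeeded), oldest first
        self._events: Dict[str, Deque[Tuple[float, bool]]] = defaultdict(deque)

    def record(self, provider: str, succeeded: bool) -> None:
        self._events[provider].append((self._clock(), succeeded))

    def is_healthy(self, provider: str) -> bool:
        events = self._events[provider]
        cutoff = self._clock() - self.window_seconds
        while events and events[0][0] < cutoff:
            events.popleft()  # drop outcomes that have aged out of the window
        if len(events) < self.min_samples:
            return True  # too little recent data to mark a provider unhealthy
        failures = sum(1 for _, succeeded in events if not succeeded)
        return failures / len(events) <= self.max_failure_rate
```

Because old outcomes age out of the window, a provider marked unhealthy becomes eligible again once its recent failures expire, which matches the "until their health score recovers" behavior described above.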