LLM routing

Dynamically choosing which AI model handles each request based on task type, latency budget, and cost tolerance.

How it works

LLM routing is the decision logic that picks which model handles each request when an application has access to multiple LLMs. A router sits between application code and one or more provider APIs; on every incoming request, it evaluates routing signals (the request itself, headers, metadata, prior history) and selects a model. The decision can be deterministic (a fixed rule like "use GPT-4o for code, Claude Sonnet for everything else"), policy-driven (per-project rules from a governance layer), or learned (a classifier that infers task type and matches against a routing table).
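As a minimal sketch of the deterministic case, the fixed rule quoted above can be written as a single function. The `Request` type, the `metadata["task"]` tag, and the model labels are illustrative assumptions, not any provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    # Assumption: callers may attach a task tag such as {"task": "code"}.
    metadata: dict = field(default_factory=dict)


def route_by_rule(request: Request) -> str:
    """Fixed rule from the text: GPT-4o for code, Claude Sonnet otherwise."""
    if request.metadata.get("task") == "code":
        return "gpt-4o"
    return "claude-sonnet"


print(route_by_rule(Request("Refactor this function", {"task": "code"})))
print(route_by_rule(Request("Summarise this email")))
```

A policy-driven or learned router would keep the same shape (request in, model name out) and only change how the decision is made.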

The mechanic matters because no single model dominates the cost-quality frontier across all workloads. Frontier models like Claude Opus or GPT-5 produce the best output but are 50-100x more expensive than small fast models like GPT-4o-mini, Claude Haiku, or Llama-3.1-8B. A router that picks the right model per request captures most of the quality from the expensive models on the requests that need it, while routing the rest to cheap models — typically cutting average per-request cost by 40-70% without measurable quality degradation on the workloads that don't need frontier capability.
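To see where a saving of that size can come from, here is a small worked calculation. The numbers are hypothetical: a frontier model at 1.0 cost unit per request, a small model 50x cheaper, and 40% of traffic that genuinely needs the frontier model.

```python
def blended_cost(frontier_cost: float, cheap_cost: float, frontier_share: float) -> float:
    """Average per-request cost when frontier_share of traffic uses the frontier model."""
    return frontier_share * frontier_cost + (1 - frontier_share) * cheap_cost


# Hypothetical inputs, chosen only to illustrate the arithmetic.
FRONTIER_COST = 1.0
CHEAP_COST = FRONTIER_COST / 50

baseline = blended_cost(FRONTIER_COST, CHEAP_COST, 1.0)  # everything on the frontier model
routed = blended_cost(FRONTIER_COST, CHEAP_COST, 0.4)    # 0.4 * 1.0 + 0.6 * 0.02
saving = 1 - routed / baseline

print(f"routed cost: {routed:.3f}, saving: {saving:.1%}")
```

Under these assumptions the routed cost is 0.412 units per request, a saving of about 59%. The result is driven almost entirely by the share of traffic that can safely go to the cheap model.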

Common routing strategies

Mode-based routing exposes a small set of named modes (e.g. eco / balanced / sport, or fast / quality / max-quality) and lets the caller declare which mode the request needs via a header or kwarg. The router maps mode + task classification to a model. This is the simplest abstraction for application developers — declare intent at the call site, let the router pick the specific model.
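A mode-based router can be sketched as a two-key lookup. The table below is hypothetical: the mode and task names come from the text, but the model names are placeholders rather than a benchmarked mapping.

```python
MODES = ("eco", "balanced", "sport")

# Hypothetical routing table: task type -> mode -> placeholder model name.
ROUTING_TABLE = {
    "simple":    {"eco": "small-model", "balanced": "small-model",    "sport": "mid-model"},
    "code":      {"eco": "small-model", "balanced": "mid-model",      "sport": "frontier-model"},
    "reasoning": {"eco": "mid-model",   "balanced": "frontier-model", "sport": "frontier-model"},
    "complex":   {"eco": "mid-model",   "balanced": "frontier-model", "sport": "frontier-model"},
}


def pick_model(task_type: str, mode: str = "balanced") -> str:
    """Map (task type, mode) to a model; reject unknown inputs explicitly."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if task_type not in ROUTING_TABLE:
        raise ValueError(f"unknown task type: {task_type!r}")
    return ROUTING_TABLE[task_type][mode]


print(pick_model("code", "eco"))
print(pick_model("reasoning"))
```

The caller only ever names a mode; swapping a model in or out is a table edit, not a change at every call site.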

Task-classifier routing runs a small fast classifier (often a fine-tuned mini-LM or an embedding-based similarity score against a labelled corpus) on the incoming prompt, predicts a task type (simple / code / reasoning / complex), and looks up the right model in a routing table. Classification typically adds 5-20ms of latency and a fraction of a cent per call; the savings from picking the right model typically outweigh that overhead by 50-100x.
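The similarity-based variant can be illustrated with a toy classifier. This sketch swaps real embeddings for bag-of-words vectors so that it runs with the standard library only; the labelled examples and the `simple` fallback are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical labelled corpus: one example per task type.
LABELLED = [
    ("code", "write a python function to fix this bug in the code"),
    ("reasoning", "prove step by step why the argument holds"),
    ("simple", "translate this sentence or summarise this short email"),
    ("complex", "design a multi stage migration plan with trade offs and risks"),
]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: word counts over lowercase tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(prompt: str, default: str = "simple") -> str:
    """Return the label of the most similar example, or default if nothing overlaps."""
    query = embed(prompt)
    score, label = max((cosine(query, embed(text)), label) for label, text in LABELLED)
    return label if score > 0 else default


print(classify("fix the bug in this python function"))
print(classify("summarise this email"))
```

The predicted label is then the key into the routing table; a production classifier would replace `embed` with a real embedding model or a fine-tuned small LM.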

Cost-quality optimisation routing uses an explicit cost-budget or quality-floor constraint per request, then picks the cheapest model that meets the floor (or the highest-quality model under the budget). Requires per-model quality benchmarks for the workloads in question, refreshed as model catalogs evolve.
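Both selection rules reduce to a filter followed by a min or max. In this sketch the catalog entries, costs, and quality scores are made-up placeholders; in practice the quality column would come from the per-workload benchmarks mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_request: float  # any consistent unit, e.g. USD
    quality: float           # benchmark score in [0, 1] for this workload


# Hypothetical catalog with illustrative numbers only.
CATALOG = [
    ModelProfile("small-model", 0.001, 0.70),
    ModelProfile("mid-model", 0.010, 0.85),
    ModelProfile("frontier-model", 0.050, 0.95),
]


def cheapest_meeting_floor(catalog: List[ModelProfile], quality_floor: float) -> Optional[ModelProfile]:
    """Cheapest model whose quality is at least quality_floor, or None."""
    eligible = [m for m in catalog if m.quality >= quality_floor]
    return min(eligible, key=lambda m: m.cost_per_request, default=None)


def best_under_budget(catalog: List[ModelProfile], budget: float) -> Optional[ModelProfile]:
    """Highest-quality model whose cost is within budget, or None."""
    eligible = [m for m in catalog if m.cost_per_request <= budget]
    return max(eligible, key=lambda m: m.quality, default=None)


print(cheapest_meeting_floor(CATALOG, 0.80))
print(best_under_budget(CATALOG, 0.02))
```

Returning `None` when no model qualifies forces the caller to decide explicitly whether to relax the constraint or reject the request.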

Policy + governance routing applies per-project rules — denied models, forced model-by-task overrides, or compliance-based selection (e.g. force EU-resident models for GDPR-sensitive workloads). Usually layered on top of one of the other strategies as a constraint, not as the primary routing logic.
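Treating policy as a constraint layer might look like the following sketch, which filters a ranked candidate list produced by some other strategy. The `ProjectPolicy` fields, the region map, and the model names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical data-residency map for placeholder models.
MODEL_REGIONS = {"small-model": "us", "eu-model": "eu", "frontier-model": "us"}


@dataclass
class ProjectPolicy:
    denied_models: Set[str] = field(default_factory=set)
    forced_by_task: Dict[str, str] = field(default_factory=dict)
    required_region: Optional[str] = None


def apply_policy(candidates: List[str], task_type: str, policy: ProjectPolicy) -> str:
    """Pick the first candidate (best first) that the project policy allows."""
    forced = policy.forced_by_task.get(task_type)
    if forced is not None:
        return forced  # a forced override wins over the primary strategy
    for model in candidates:
        if model in policy.denied_models:
            continue
        if policy.required_region and MODEL_REGIONS.get(model) != policy.required_region:
            continue
        return model
    raise LookupError("no candidate model satisfies the project policy")


gdpr = ProjectPolicy(denied_models={"frontier-model"}, required_region="eu")
print(apply_policy(["frontier-model", "small-model", "eu-model"], "simple", gdpr))
```

Because the policy only filters or overrides, the primary routing strategy can change without the governance rules being rewritten.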

What good routing looks like in production

The signature of well-engineered routing is that average request cost drops materially without an increase in user-visible quality complaints or feedback regressions. Production deployments typically see 40-60% cost reduction on workloads with mixed complexity, and negligible quality impact on the simple-task slice that gets routed to cheap models. The wins compound with caching: routing reduces the cost of calls that aren't cached, while caching avoids many of those calls entirely.

To make routing failures visible, instrument three things: a per-request log of routing decisions (so you can audit "why did this go to the cheap model?"), quality feedback captured per response (thumbs-up/down plus comments, tied back to the routing decision), and model-by-task hit-rate dashboards that show whether the routing table is still valid as model catalogs evolve.
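One way to wire these three pieces together is sketched below. The record fields are assumptions, and "hit rate" is interpreted here as the thumbs-up rate per (task type, model) pair, which is only one possible reading.

```python
import time
import uuid
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def routing_record(task_type: str, mode: str, model: str, reason: str) -> dict:
    """One log entry per request, so a decision can be audited later."""
    return {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "task_type": task_type,
        "mode": mode,
        "model": model,
        "reason": reason,
    }


def attach_feedback(record: dict, thumbs_up: bool, comment: str = "") -> dict:
    """Tie user feedback back to the routing decision it applies to."""
    return {**record, "feedback": {"thumbs_up": thumbs_up, "comment": comment}}


def thumbs_up_rate(records: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Share of positive feedback per (task_type, model), ignoring unrated requests."""
    totals = defaultdict(lambda: [0, 0])  # key -> [thumbs_up, rated]
    for rec in records:
        feedback = rec.get("feedback")
        if feedback is None:
            continue
        key = (rec["task_type"], rec["model"])
        totals[key][0] += int(feedback["thumbs_up"])
        totals[key][1] += 1
    return {key: up / rated for key, (up, rated) in totals.items()}


log = [
    attach_feedback(routing_record("simple", "eco", "small-model", "table lookup"), True),
    attach_feedback(routing_record("simple", "eco", "small-model", "table lookup"), False, "too terse"),
    routing_record("code", "balanced", "mid-model", "table lookup"),
]
print(thumbs_up_rate(log))
```

A falling rate for one cell of the routing table is the signal that the cell needs to be re-benchmarked.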

Routing vs failover

Routing is "which model should I send this to" — proactive selection per request. Failover is "the model I sent it to is unhealthy, send it to a different one" — reactive recovery after a failed request. Both belong in production gateways; they solve different problems. Prism's router picks the model based on mode + task classification, then the failover layer takes over if the chosen provider returns 5xx or times out. See multi-provider failover for the recovery side.
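The split between the two layers can be sketched as follows: the router has already chosen `primary`, and the failover loop only runs when a call fails. `ProviderError`, `call_model`, and the model names are hypothetical stand-ins, not Prism's implementation.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Stands in for a 5xx response or a timeout from a provider."""


def call_with_failover(
    prompt: str,
    primary: str,
    fallbacks: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try the routed model first, then each fallback; return (model_used, reply)."""
    last_error = None
    for model in [primary, *fallbacks]:
        try:
            return model, call_model(model, prompt)
        except ProviderError as exc:
            last_error = exc  # reactive step: move on to the next candidate
    raise RuntimeError("all providers failed") from last_error


def fake_call(model: str, prompt: str) -> str:
    # Simulate an unhealthy primary provider.
    if model == "frontier-model":
        raise ProviderError("503 from provider")
    return f"{model} answered: {prompt}"


used, reply = call_with_failover("hello", "frontier-model", ["mid-model"], fake_call)
print(used, "|", reply)
```

Only `ProviderError` triggers a retry here; other exceptions propagate, so a bug in the request itself is not masked by silently switching models.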

How Prism implements it

Prism's router combines mode-based + task-classifier routing. The caller sets X-Prism-Mode to one of eco / balanced / sport on each request. A small classifier (fine-tuned for LLM workload taxonomy) tags the request with one of four task types — simple, code, reasoning, or complex — and the router looks up the (task_type, mode) cell in a routing table calibrated from a measured benchmark across 23 models on 8 providers. Pro+ accounts can pin specific models via X-Prism-Model-Prefer when they want direct control. The router decision lands in X-Prism-Model on every response, plus the task classification in X-Prism-Task-Type. Sport mode on Pro+ also fires speculative parallel routing — two providers in parallel, first response wins — to hedge p99 latency under provider degradation.
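From the caller's side, the header contract described above could be handled with two small helpers. Only the header names and the three mode values come from the text; the helper names, the format of the X-Prism-Model-Prefer value, and the example response are assumptions, and no real endpoint is contacted.

```python
from typing import Dict, Mapping, Optional

VALID_MODES = {"eco", "balanced", "sport"}


def build_prism_headers(mode: str, prefer_model: Optional[str] = None) -> Dict[str, str]:
    """Request headers: the mode, plus an optional model pin."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    headers = {"X-Prism-Mode": mode}
    if prefer_model is not None:
        # Assumption: the pin is passed as a plain model name.
        headers["X-Prism-Model-Prefer"] = prefer_model
    return headers


def read_routing_decision(response_headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Extract the router's decision; HTTP header names are case-insensitive."""
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return {
        "model": lowered.get("x-prism-model"),
        "task_type": lowered.get("x-prism-task-type"),
    }


print(build_prism_headers("eco"))
# Hypothetical response headers, for illustration only.
print(read_routing_decision({"x-prism-model": "some-model", "X-Prism-Task-Type": "code"}))
```

Logging the output of `read_routing_decision` alongside each response gives the per-request audit trail without any extra API calls.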

Frequently asked questions

Do I need an LLM router if I'm only using one model?
No — routing is what you reach for when you have multiple models to choose from. With a single model, the routing decision is moot. Where it pays off is the moment you add a second model (typically a cheaper fast one to handle simple tasks) and want to direct requests intelligently rather than calling the same expensive model for everything.
How is LLM routing different from load balancing?
Load balancing distributes requests across multiple instances of the same model (or replica pools of the same provider) for throughput and reliability. Routing picks between different models with different capabilities and costs. Both can run in the same gateway — Prism, for instance, routes between models per mode + task type, then load-balances across multiple keys on the chosen provider if you've configured a pool.
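For contrast with routing, load balancing across a pool of keys for one provider can be as simple as round-robin. The `KeyPool` class and the key names are hypothetical, and real pools usually also track rate limits and key health.

```python
import itertools
from typing import Iterable


class KeyPool:
    """Round-robin over API keys for a single provider."""

    def __init__(self, keys: Iterable[str]):
        keys = list(keys)
        if not keys:
            raise ValueError("key pool must not be empty")
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)


pool = KeyPool(["key-a", "key-b"])
print([pool.next_key() for _ in range(3)])
```

The model has already been chosen by the time the pool is consulted, which is the distinction the answer above draws.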
Does the classifier overhead make routing slower than just calling a model directly?
Only trivially. The classifier typically adds 5-20ms and a fraction of a cent per call, against model calls that run 200-2000ms and cost 0.1-50¢ each. That overhead is in the noise; the savings from choosing the right model outweigh it by 50-100x.
Can I just hand-write a routing rules engine instead?
Yes, and many teams do start there. A fixed mapping like 'GPT-4o for code, Sonnet for everything else' captures most of the routing wins with zero ML infrastructure. The case for a classifier-based router shows up later, when the rule set gets unwieldy or when the task distribution is more varied than a hand-written rule set can cleanly handle. Most production setups end up combining both: explicit overrides for known cases, classifier-driven routing for the rest.