We added 5 providers and the router got smarter

v1.7-A triples Prism's model catalog — from 7 models on 3 providers to 23 models on 8 providers, all direct integrations. Routing-table rewrite based on a 552-call benchmark suite. The wedge in practice.

Tags: ai, api, providers, routing, benchmark, developer-tools, infrastructure

The hardest version of "we added more models" is the boring one: a marketplace adds providers because more is more. A control plane adds providers because each one earns its slot in the routing table by being measurably the right pick for some class of request. The first version is easy. The second is the only one worth shipping.

This week we shipped v1.7-A. Prism now routes across 23 models on 8 providers, all direct integrations, no marketplace markup. The seven incumbent models (Claude Opus/Sonnet/Haiku, GPT-4o/4o-mini, Gemini 2.5 Pro/Flash) are joined by 16 new models from five new providers: Groq, DeepSeek, Fireworks, Cerebras, and Mistral. Ten model families span the catalog: Claude, GPT, Gemini, Llama, Qwen, DeepSeek, Mistral, GLM, Kimi, and GPT-OSS.

This is the post that explains why each one, and what changed in the auto-router because of it.

The wedge it sharpens

Prism's positioning is "the gateway that picks the model for you." Every other gateway makes the developer pick. We classify the request, look at the mode header (eco / balanced / sport), and route. That's been true since v1.0. What's been less true, until this week, is that we had enough models for the picking to be interesting.

Seven incumbents from three providers is a starter catalog. You can do eco/balanced/sport routing with seven models, but the choices were narrow: claude-haiku for eco, claude-sonnet for balanced, claude-opus for sport, repeated across task types. Cheap-and-fast meant Anthropic's smallest model. There wasn't a real alternative to Claude in the eco bucket. The auto-router could pick — but the picks looked more like "Anthropic by default" than "the right model for this request."

23 models changes that. There is now a genuinely fast eco-class option (Llama 3.1 8B on Groq, sub-second response, ten cents per million tokens). There's a frontier-class option that isn't Claude or GPT (Qwen 235B on Cerebras, or DeepSeek V4 Pro). There's a code-specialized model (Codestral). There's a reasoning specialist (Magistral Medium). When the router classifies your request as "code" and you've asked for sport mode, "the right model" is no longer "whatever Anthropic's biggest is." It's a model that's actually built for code.

The routing table got 9 distinct models across 12 cells, up from 4 across 12 in v1.0. Six different providers are now in the auto-routing pool. That's the picking-for-you story made real.

Why no Together AI, no OpenRouter, no marketplace

When you're adding providers to a catalog there's a tempting shortcut: integrate one marketplace and you suddenly have 200+ models. Together AI hosts most of the popular open-weight models. OpenRouter has 300. Either one would have given us an instant catalog without writing five separate adapters.

We deliberately didn't take that path. Prism's positioning, since the v2 roadmap was locked in April, is control plane, not marketplace. The distinction matters: a control plane owns the routing decision and the customer relationship; a marketplace is a middleman that takes a cut for each call routed through it. If we proxy 80% of our traffic through Together or OpenRouter, our cost structure is wrapped around theirs and our routing decisions are constrained by their hosting choices. That's not a wedge we want.

So every one of the five new providers is a direct integration. Adapter file in services/providers/, API key in EC2 env, billing.py prices read from their actual pricing pages. Each one is a 401 away from being our problem if it fails. That's the price of the positioning. It also means Llama 3.3 70B is on Groq AND on Cerebras AND on Fireworks (at the time of integration), and we pick which one to use for which routing slot based on their actual strengths — Groq for cheap-and-fast Llama, Cerebras for sub-100ms inference, Fireworks for specialty models like Kimi and GLM that aren't elsewhere. Three direct relationships instead of one marketplace relationship.

That choice is reversible if it stops making sense: an OpenRouter-style marketplace integration ships fast if we ever need it. But for v1.7-A, eight direct providers is the call.

The benchmark that drove the routing table

You don't rewrite a production routing table from intuition. We wrote a benchmark suite (scripts/benchmark_models.py) that does four things:

  1. Fires 3 prompts per task type (simple, code, reasoning, complex) at every model in the catalog. 12 prompts × 23 models = 276 calls. (We tried 10 prompts per task as a higher-confidence pass; it ran out of prepaid founder balance ~240 calls in and only gave us full data for 5 incumbent models. The 3-prompt MVP run is what actually drove the routing table.)
  2. Captures latency, cost, and the response text for each call.
  3. Sends each response to a judge model (Claude Sonnet) with a 1-10 rubric prompt asking "how well does this response answer the original prompt?"
  4. Aggregates per-(model, task_type) average quality, average latency, and average cost; then picks the right model per (task_type × mode) cell based on the appropriate cost/quality tradeoff.
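Step 4 can be sketched roughly as follows. This is an illustration, not the contents of scripts/benchmark_models.py: the record layout, the model names, the numbers, and the per-mode selection rules (a quality floor for eco, a median-cost cap for balanced) are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up judged calls; field names and values are illustrative assumptions.
records = [
    {"model": "model-a", "task_type": "code", "quality": 9, "latency_s": 2.0, "cost_usd": 0.010},
    {"model": "model-a", "task_type": "code", "quality": 8, "latency_s": 2.4, "cost_usd": 0.012},
    {"model": "model-b", "task_type": "code", "quality": 8, "latency_s": 0.5, "cost_usd": 0.001},
    {"model": "model-b", "task_type": "code", "quality": 8, "latency_s": 0.7, "cost_usd": 0.001},
    {"model": "model-c", "task_type": "code", "quality": 5, "latency_s": 0.3, "cost_usd": 0.0005},
]


def aggregate(rows):
    """Average quality, latency, and cost per (model, task_type)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["task_type"])].append(row)
    return {
        key: {
            "quality": mean(r["quality"] for r in group),
            "latency_s": mean(r["latency_s"] for r in group),
            "cost_usd": mean(r["cost_usd"] for r in group),
        }
        for key, group in groups.items()
    }


def pick(stats, task_type, mode, quality_floor=7.0):
    """Pick one model for a (task_type, mode) cell. The rules are hypothetical."""
    candidates = {m: s for (m, t), s in stats.items() if t == task_type}
    if not candidates:
        raise ValueError(f"no benchmark data for task type {task_type!r}")
    if mode == "sport":
        # Highest average quality; ties go to the cheaper model.
        return max(candidates, key=lambda m: (candidates[m]["quality"], -candidates[m]["cost_usd"]))
    if mode == "eco":
        # Cheapest model that clears the quality floor (or cheapest overall if none do).
        pool = {m: s for m, s in candidates.items() if s["quality"] >= quality_floor} or candidates
        return min(pool, key=lambda m: pool[m]["cost_usd"])
    if mode == "balanced":
        # Best quality among models at or below the median candidate cost.
        cap = median(s["cost_usd"] for s in candidates.values())
        pool = {m: s for m, s in candidates.items() if s["cost_usd"] <= cap}
        return max(pool, key=lambda m: pool[m]["quality"])
    raise ValueError(f"unknown mode {mode!r}")
```

Running aggregate(records) and then pick(...) once per task type and mode fills the 12 cells. The real tradeoff rules may weigh latency or use different thresholds.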

The first time we ran it, every model scored identically. That was suspicious. It turned out Prism's semantic cache was working too well — the first model to answer a given prompt populated the cache, and every subsequent model with X-Prism-Model-Prefer set was getting the cached response from the first one, because the semantic cache hash doesn't include the model name. We bypassed cache for the benchmark by adding X-Prism-Cache: off and re-ran. The second run actually exercised each model. Caching that aggressively is a production feature; in a benchmark context it's a bug we papered over with a header.
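The failure mode is easy to reproduce in miniature. The sketch below uses an exact-match hash as a stand-in for the semantic cache (the real one presumably matches on similarity rather than exact text), and the class and function names are hypothetical. The key deliberately ignores the model, so the second model gets the first model's answer unless the X-Prism-Cache: off header is present.

```python
import hashlib


def cache_key(prompt: str) -> str:
    # The key is built from the prompt only; the model name is not part of it.
    return hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()


class PromptCache:
    """Toy stand-in for a response cache keyed on the prompt alone."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, prompt, model, call, headers=None):
        headers = headers or {}
        bypass = headers.get("X-Prism-Cache", "").lower() == "off"
        key = cache_key(prompt)
        if not bypass and key in self._store:
            return self._store[key]  # may have been produced by a different model
        response = call(model, prompt)
        if not bypass:
            self._store[key] = response
        return response


def fake_call(model, prompt):
    # Placeholder for a real provider call.
    return f"[{model}] answer"


cache = PromptCache()
first = cache.get_or_call("What is 2+2?", "model-a", fake_call)
second = cache.get_or_call("What is 2+2?", "model-b", fake_call)
fresh = cache.get_or_call("What is 2+2?", "model-b", fake_call, headers={"X-Prism-Cache": "off"})
```

Here second comes back as model-a's cached answer, which is why every model scored identically, while fresh actually exercises model-b.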

The actual numbers — every model's quality, latency, and cost per task — are committed to the repo at benchmarks/v1.7-A-2026-05-22/ for anyone who wants to argue with our picks.

What changed for live traffic

The new routing table went live on 2026-05-22. From the customer side, the visible changes are:

Free tier eco-mode calls used to all route to a Claude Haiku family model. Now they route to Llama 3.1 8B on Groq for simple and reasoning tasks, Llama 3.1 8B on Cerebras for code, and Llama 3.3 70B on Groq for complex. Cost-per-call dropped between 50% and 95% depending on task. Customer-visible response stays the same (the eco mode benchmarks showed equivalent quality at this token-count range). Our markup margin is preserved.

Pro+ sport-mode calls used to default to Claude Opus across every task. Now they diversify: Opus stays as sport for simple and reasoning (where it's still the highest scorer), but sport for code is Mistral Medium (the highest code scorer among the providers that are live; DeepSeek V4 Flash scored higher but isn't funded yet), and sport for complex is Gemini Pro (the only model that scored above 9 on long-context multi-step prompts). The benchmark surfaced what the homepage marketing was already claiming: different tasks want different models, and "best regardless of cost" depends on what "best" means for THIS request.
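Put together, the eco and sport rows described above amount to a lookup keyed on (task_type, mode). The sketch below is an assumption about shape, not the production table: groq-llama-8b, groq-llama-70b, claude-opus, and gemini-pro are identifiers that appear elsewhere in this post, cerebras-llama-8b and mistral-medium are hypothetical names, and the balanced row is left out because this post doesn't list it.

```python
from typing import Optional

# (task_type, mode) -> model ID. Only the cells described in this post are filled in.
ROUTING_TABLE = {
    ("simple", "eco"): "groq-llama-8b",
    ("reasoning", "eco"): "groq-llama-8b",
    ("code", "eco"): "cerebras-llama-8b",   # hypothetical ID for Llama 3.1 8B on Cerebras
    ("complex", "eco"): "groq-llama-70b",
    ("simple", "sport"): "claude-opus",
    ("reasoning", "sport"): "claude-opus",
    ("code", "sport"): "mistral-medium",    # hypothetical ID for Mistral Medium
    ("complex", "sport"): "gemini-pro",
}


def route(task_type: str, mode: str) -> Optional[str]:
    """Return the model for a classified request, or None if the cell isn't listed here."""
    return ROUTING_TABLE.get((task_type, mode))
```

With this table, route("code", "sport") returns "mistral-medium", and any balanced lookup returns None because that row is not filled in.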

Direct dispatch via X-Prism-Model-Prefer works for any of the 23 models, with one tier rule: Free tier can direct-dispatch any incumbent (Claude, GPT, Gemini) but the five new providers are gated to Pro+. Free's mode-based routing is unaffected — the auto-router can pick from the full catalog regardless of tier.
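The tier rule for direct dispatch is small enough to state as code. This is a sketch under assumptions: the provider keys and the tier label "free" are hypothetical, and any tier other than "free" is treated as Pro or higher.

```python
# Providers behind the seven incumbent models (Claude, GPT, Gemini).
INCUMBENT_PROVIDERS = {"anthropic", "openai", "google"}
# The five providers added in v1.7-A.
NEW_PROVIDERS = {"groq", "deepseek", "fireworks", "cerebras", "mistral"}


def can_direct_dispatch(tier: str, provider: str) -> bool:
    """Gate X-Prism-Model-Prefer only; mode-based auto-routing is not checked here."""
    if provider in INCUMBENT_PROVIDERS:
        return True
    if provider in NEW_PROVIDERS:
        return tier != "free"
    return False  # unknown provider
```

Note that this check applies only to explicit model preferences. A Free-tier eco request can still be auto-routed to a Groq model, because the auto-router doesn't pass through this gate.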

Failover got a structural change. The v1.0 failover map was 7×7, one entry per "if model X fails, try model Y on provider Z." That doesn't scale to 23×23 = 529 entries. We replaced it with a capability-tier index: every model is tagged small / medium / large / frontier / code / reasoning / long-context, and the failover function picks an equivalent-tier model from a different provider. The fallback chain for claude-sonnet (large) used to be gpt-4o then gemini-pro — two candidates. Now it's gpt-4o, gemini-pro, groq-llama-70b, groq-llama4-scout, groq-gpt-oss, fireworks-glm-5p1 — six candidates across four providers. Provider failures are much less likely to surface as customer-visible failures.
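A capability-tier index like this can be sketched in a few lines. The data structure and function name below are hypothetical, the tier assigned to each model is an assumption chosen to reproduce the claude-sonnet chain above, and deepseek-v4-pro is an illustrative ID included only to show how an excluded provider is skipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    provider: str
    tier: str


# Catalog order doubles as failover priority in this sketch.
CATALOG = [
    Model("claude-sonnet", "anthropic", "large"),
    Model("gpt-4o", "openai", "large"),
    Model("gemini-pro", "google", "large"),
    Model("groq-llama-70b", "groq", "large"),
    Model("groq-llama4-scout", "groq", "large"),
    Model("groq-gpt-oss", "groq", "large"),
    Model("fireworks-glm-5p1", "fireworks", "large"),
    Model("deepseek-v4-pro", "deepseek", "large"),  # hypothetical ID and tier
    Model("claude-haiku", "anthropic", "small"),
    Model("groq-llama-8b", "groq", "small"),
]

# Providers wired up but not funded; skipped during candidate selection.
EXCLUDED_PROVIDERS = {"deepseek"}


def failover_candidates(failed, catalog=CATALOG, excluded=EXCLUDED_PROVIDERS):
    """Same-tier models on a different, non-excluded provider, in catalog order."""
    origin = next(m for m in catalog if m.name == failed)
    return [
        m.name
        for m in catalog
        if m.tier == origin.tier
        and m.provider != origin.provider
        and m.provider not in excluded
    ]
```

Adding a model means adding one tagged row rather than a new row and column in an N×N map, which is the point of the change.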

What's not in v1.7-A

Three deliberate omissions worth being explicit about.

No xAI in production. The Grok-3 / Grok-2 adapters are in the codebase, the env-var slot is in config.py, the routing table slot is reserved. We don't have credits funded on the account yet. xAI gives $25 free with an active X account; the account exists but the credit isn't claimed. That's a 5-minute action item; it just hasn't happened yet.

No Perplexity Sonar. Same shape — credit-card-gated signup, deferred. Sonar models have built-in web search which is a routing category we don't currently serve at all; integrating it will expand the routing taxonomy (a new task_type beyond simple/code/reasoning/complex) rather than just add another model. Worth doing right, in its own release.

No DeepSeek-routed traffic yet. DeepSeek V4 Flash and V4 Pro are in the adapter layer, in MODEL_PROVIDER, in MODEL_PRICES. They benchmarked beautifully (V4 Flash scored 10/10 on code, the highest in the catalog). The account just needs $5 of credit to activate. Our position: don't fund a provider account until there's revenue justifying it. First paying customer who'd benefit from DeepSeek, we top up. Until then DeepSeek sits in EXCLUDED_PROVIDERS and the router skips it during failover candidate selection. It's all wiring; flipping it on is a one-line change.

What's worth the boring honesty

The benchmark that drove the routing table used 3 prompts per task type, not 10. The data is noisy: small models occasionally scored 10/10 on simple prompts that any modern LLM nails, and that inflated their cross-task averages. The 10-prompts-per-task pass would have been tighter, but it only completed for 5 incumbent models, and even 40 total prompts isn't a comprehensive eval suite. The right way to keep tuning this is to watch real production traffic, capture feedback (the thumbs-up endpoint shipped in v1.3 collects this), and re-run the benchmark with categories the real traffic actually hits.

The catalog choices were also made under what's-shipping-this-week constraints. Groq's catalog evolves; some of the models in our routing table today (Llama 4 Scout) didn't exist when v1.6 went out; some of the ones we considered (Mixtral) are no longer in their catalog. The right artifact to trust is the /v1/public/models endpoint, which reads from the live code; everything in this blog post is a snapshot of what shipped on 2026-05-22.

What this is laying groundwork for

The wedge being sharper matters for two adjacent things on the roadmap.

Multi-model synthesis (gap #8 in competitive-gaps.md). OpenRouter shipped Fusion in March: fan out the same prompt to N models, use a Judge model to synthesize the strongest parts of each response into a final answer. We have the dispatch infrastructure from speculative routing (v1.5) and now we have a real catalog to fan out across. The infrastructure exists; the only missing piece is the Judge step. That's a v1.7-B candidate.

Customer trust in the routing decision. "Prism picks the model for you" is only persuasive if the customer can verify the picking. The /models page now exists with live data — what's in the catalog, which routing slot each model fills, which providers are active vs deferred. The "explain my route" debug endpoint (gap #7) is the next layer down: per-request, why did Prism pick THIS model. That's also a v1.7 candidate. Both make the abstraction less black-box.

Try it

If you already have a Prism key, mode-based routing is unchanged in shape — set X-Prism-Mode: eco or balanced or sport and the new routing table picks the right model. To force a specific model:

curl -X POST https://api.ssimplifi.com/v1/chat/completions \
  -H "Authorization: Bearer prism_sk_..." \
  -H "Content-Type: application/json" \
  -H "X-Prism-Mode: balanced" \
  -H "X-Prism-Model-Prefer: groq-llama-70b" \
  -d '{"messages": [{"role": "user", "content": "Explain the second law of thermodynamics in two sentences."}]}'
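The same request can be built from Python with only the standard library. This is a sketch, not an official client: the helper name is hypothetical, the Content-Type header is the usual one for a JSON body and is assumed here, and the response format isn't described in this post, so the example stops at constructing the request.

```python
import json
import urllib.request

API_URL = "https://api.ssimplifi.com/v1/chat/completions"


def build_request(api_key, prompt, mode="balanced", prefer=None):
    """Build (but do not send) a chat completion request with Prism routing headers."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",  # assumed; standard for JSON bodies
        "X-Prism-Mode": mode,
    }
    if prefer:
        # Optional: force a specific model instead of letting the router pick.
        headers["X-Prism-Model-Prefer"] = prefer
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


req = build_request(
    "prism_sk_...",
    "Explain the second law of thermodynamics in two sentences.",
    prefer="groq-llama-70b",
)
```

To send it, pass req to urllib.request.urlopen with a real key. Omit prefer to let the mode header drive the pick.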

If you don't have a key, signup gives you 50K free input tokens daily, eco mode unlocked. That's the new groq-llama-8b path — cheapest 8B Llama serving on the planet, our spend, no card required.

The live catalog: ssimplifi.com/models. The benchmark data: in the repo at benchmarks/v1.7-A-2026-05-22/. The wedge: now real.