Multi-region LLM serving in 2026: edge cache replication, latency budgets, and the cost of going global
The complete guide to serving LLM API traffic globally: edge gateway patterns, cache replication, latency budget engineering, and measured numbers from production.
Multi-region LLM serving is the practice of fronting an AI API with infrastructure close to your users — typically a Cloudflare Worker, Vercel Edge Function, or equivalent at every point-of-presence — and replicating the cache layer globally so that the second request from any region hits warm cache regardless of where origin lives. Done well, it cuts international cache-hit latency from ~700ms to under 200ms while keeping origin-billing complexity bounded to one region. This guide walks through the architectural patterns, the latency-budget math, what to replicate vs what to centralise, and the measured numbers from running this in production. Written for engineers picking where to put the gateway and what to do at the edge versus origin.
Why this matters in 2026
Three forces converged to make multi-region LLM serving a meaningful engineering surface rather than a niche concern:
1. AI products go global before they go big. Unlike B2B SaaS where international expansion is a deliberate sales motion, AI products show up in international traffic from day one — a developer in São Paulo, a chatbot embedded on a site indexed by Google in Tokyo, an internal tool deployed across an engineering team in Berlin. The product owner doesn't choose to go international; the traffic just appears.
2. LLM provider regions are concentrated. Most LLM providers — OpenAI, Anthropic, Google — have inference capacity in 2-4 regions globally, dominated by US-East and EU. A request from Singapore to OpenAI's US-East endpoint pays ~250ms of network latency before the model even starts processing. A request from Mumbai to Anthropic's US-West pays similar. The bottleneck is geography, not the model.
3. Cache hits are the latency win. When 30-60% of production AI traffic can be served from cache, the question of where the cache lives becomes a first-order latency lever. A cache hit served from a single origin region still pays the full network round trip to the user (~250ms from a distant region), so much of the benefit of skipping the model call is lost in transit. A cache hit served from an edge PoP near the user is a sub-200ms response, fast enough that users perceive the AI as instant.
The shape that emerges is: origin runs in one region (where the provider integration, billing, and database live), edges run everywhere (handling auth + cache lookup + simple short-circuits), and cache replicates globally with eventual consistency. Different cuts of this pattern, and the latency math behind them, are what the rest of this guide unpacks.
The 4-tier latency map
Before designing edge architecture, it's useful to know how the latency contributions stack up. Measured numbers, from production Prism traffic on the Mumbai origin + Cloudflare edge architecture (verify against your own infra; numbers are illustrative for non-Prism deployments):
| Tier | Path | Measured p95 |
|---|---|---|
| Cache hit at edge PoP | User → nearest Cloudflare PoP → KV lookup → return | <100ms in-region, ~180-220ms cross-region |
| Cache hit at origin | User → edge → origin region → cache lookup → return | ~480ms from APAC to Mumbai origin |
| Cache miss, in-region | User → edge → origin → provider call (same region) | ~800-1,200ms |
| Cache miss, cross-region | User → edge → origin (Mumbai) → provider (US-East) | ~1,600-2,200ms |
| Cache miss with provider-native passthrough | as above, with the cached prefix discount on input tokens | ~1,400-1,800ms (slight improvement on the input-processing side) |
The numbers are workload-dependent (model choice, prompt size, output token count all dominate the cache-miss latency), but the ratios hold up. A cache hit at edge is roughly 4-10x faster than a cache miss at origin. Multi-region serving exists to make that 4-10x apply to as much traffic as possible.
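To see how the tiers combine for a given traffic mix, a rough blended-latency calculation can help. The sketch below is illustrative only: the function name and the traffic shares are assumptions, the latencies are approximate midpoints from the table above, and a share-weighted mean of p95 figures is an approximation rather than a true p95.

```typescript
// Rough blended latency: a share-weighted mean over the tiers a request can land in.
// Tier names and shares are hypothetical inputs; use your own measurements.
interface Tier {
  name: string;
  share: number;     // fraction of traffic served by this tier, 0..1
  latencyMs: number; // representative latency for the tier
}

function blendedLatencyMs(tiers: Tier[]): number {
  const totalShare = tiers.reduce((sum, t) => sum + t.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`tier shares must sum to 1, got ${totalShare}`);
  }
  return tiers.reduce((sum, t) => sum + t.share * t.latencyMs, 0);
}

// Assumed mix: 10% edge hits, 30% origin hits, 60% cross-region misses.
const withEdge = blendedLatencyMs([
  { name: "edge hit", share: 0.1, latencyMs: 200 },
  { name: "origin hit", share: 0.3, latencyMs: 480 },
  { name: "cross-region miss", share: 0.6, latencyMs: 1900 },
]);

// Same traffic without an edge tier: every hit is served from origin.
const withoutEdge = blendedLatencyMs([
  { name: "origin hit", share: 0.4, latencyMs: 480 },
  { name: "cross-region miss", share: 0.6, latencyMs: 1900 },
]);

console.log({ withEdge, withoutEdge });
```

With a mix like this the mean barely moves, because misses dominate it. The gain from the edge tier shows up in the latency of the hits themselves, which is why per-tier percentiles are more informative than a single blended number.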
Architectural patterns
Three patterns dominate production AI-gateway deployments. The choice depends on team size, traffic volume, and how much operational control you want.
Pattern 1 — Centralised origin, no edge
Origin runs in one region (typically wherever the founder is, or wherever the biggest customer concentration is). Every request from anywhere globally lands at origin via DNS routing.
- Pros: dead simple. One thing to deploy, one thing to debug. One database, one cache, one billing path. Latency for in-region traffic is great.
- Cons: international latency is bad. A Singapore user hitting a Mumbai origin pays ~80ms of network latency on top of every request, hit or miss. A US user hitting the same origin pays ~250ms.
- When to use: Day 0. Most products start here and stay here until international traffic becomes a real signal. There's no shame in this; premature edge complexity is a real cost.
Pattern 2 — Centralised origin + global edge cache
Origin still runs in one region. An edge layer (Cloudflare Workers + Workers KV, or Vercel Edge Functions + Vercel KV, or Fastly Compute@Edge + Fastly KV-Store, etc.) runs at every PoP globally. The edge handles auth and cache lookup. On a hit, the response is served from the nearest PoP without ever touching origin. On a miss, the request proxies through to origin, which calls the provider and stores the response — origin then writes the cache entry to the global KV, which propagates to every PoP within ~60 seconds.
- Pros: the 4-10x cache-hit latency improvement, globally. Origin is still one region (one database, one billing path) so the operational complexity stays bounded. KV propagation is eventually consistent, which is fine for cache (a slightly stale entry is acceptable).
- Cons: the edge layer is a new operational surface. You need to think about KV propagation delay, edge cold starts (sub-50ms on modern platforms but real), and the auth path — the edge needs enough credentials to verify API keys without round-tripping to origin.
- When to use: when international traffic crosses ~10% of total volume, or when measured international p95 latency starts being mentioned in customer feedback.
Pattern 3 — Multi-region origin with active-active database
The fully-distributed pattern. Origin runs in multiple regions (US, EU, APAC). Each region has its own database replica with cross-region replication. Each region has its own cache. Each request routes to the nearest origin via DNS or anycast, hits a local cache and a local database, makes provider calls from a region close to the provider, returns to the user.
- Pros: lowest possible latency for any user, any region. Resilience against single-region outages.
- Cons: complexity explodes. Cross-region database replication is a major engineering investment (conflict resolution, replication lag, split-brain handling). Billing and quota systems need to span regions consistently. Provider routing per-region means tracking provider-region health independently. Most teams that go here have an SRE team larger than most AI startups.
- When to use: when you're running >$100K/month in LLM spend with global distribution, OR when single-region failure is a real existential risk. Almost never the right starting point.
Pattern 2 deep dive — the edge cache layer
Pattern 2 is the realistic engineering target for most teams running international AI traffic. The detailed mechanics:
Edge auth
The edge needs to verify the customer's API key before serving any cached response — otherwise you'd be serving authenticated content to unauthenticated callers. There are two patterns:
- Edge-local key store. The edge has its own minimal copy of the API key validation table, kept in sync via the global KV. Sub-10ms validation. Adds operational complexity (key-revocation lag).
- Origin round-trip on first request. The edge proxies the first request of a session to origin, which validates and returns a session token; the edge then accepts the token for subsequent requests in the session. Lower auth complexity, higher first-request latency.
Most production AI gateways use the first pattern with revocation propagation in the KV-replication window (sub-60-second).
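A minimal sketch of the edge-local key store check, under stated assumptions: the record shape, the function names, and the choice to store only a SHA-256 hash of each key are all hypothetical, and a `Map` stands in for the replicated KV. It uses `node:crypto` to keep the example synchronous; on a Workers-style runtime the Web Crypto `crypto.subtle.digest` call is the native alternative.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape for the edge-local key store (not a real schema).
interface KeyRecord {
  projectId: string;
  revoked: boolean;
}

// Stand-in for the replicated KV: maps SHA-256(api key) -> record.
type KeyStore = Map<string, KeyRecord>;

// Store and look up hashes so raw API keys never sit in the edge store.
function hashKey(apiKey: string): string {
  return createHash("sha256").update(apiKey, "utf8").digest("hex");
}

type AuthResult =
  | { ok: true; projectId: string }
  | { ok: false; reason: "unknown_key" | "revoked" };

// Validate entirely from local state: no round trip to origin.
function validateAtEdge(store: KeyStore, apiKey: string): AuthResult {
  const record = store.get(hashKey(apiKey));
  if (!record) return { ok: false, reason: "unknown_key" };
  if (record.revoked) return { ok: false, reason: "revoked" };
  return { ok: true, projectId: record.projectId };
}
```

In this sketch a revocation takes effect at a given PoP only once the updated record has replicated there, which is the revocation lag described above.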
Cache lookup at edge
The edge computes the same SHA-256 fingerprint that origin would compute (over normalized messages + model + temperature + top_p + max_tokens) and looks it up in the global KV. If the entry is present, the edge returns it. If it is absent, the edge proxies the request through to origin.
The trick that matters: the edge and origin must compute identical fingerprints. A discrepancy here is a silent bug — the edge thinks the cache is empty, origin thinks the cache has the entry, every request goes to origin even when cached. Fix: ship the fingerprint code as a shared library (TypeScript/JS for the edge, Python for origin) generated from a single spec.
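One way the edge side of such a shared fingerprint could look is sketched below. The field list follows the text; everything else is an assumption made for illustration, including the type names, the whitespace-trimming normalisation rule, and the array-based canonical form. The real rules would come from the shared spec.

```typescript
import { createHash } from "node:crypto";

interface ChatMessage {
  role: string;
  content: string;
}

interface CacheableRequest {
  model: string;
  messages: ChatMessage[];
  temperature: number;
  top_p: number;
  max_tokens: number;
}

// Assumed normalisation: trim surrounding whitespace on each message.
// An array fixes the field order, so the output does not depend on the
// order in which the caller happened to list the object keys.
function canonicalize(req: CacheableRequest): string {
  return JSON.stringify([
    req.model,
    req.messages.map((m) => [m.role, m.content.trim()]),
    req.temperature,
    req.top_p,
    req.max_tokens,
  ]);
}

function fingerprint(req: CacheableRequest): string {
  return createHash("sha256").update(canonicalize(req), "utf8").digest("hex");
}
```

Number formatting is a typical source of silent mismatches: `JSON.stringify(1.0)` yields `1` in JavaScript while Python's `json.dumps(1.0)` yields `1.0`, so the spec has to pin down the serialisation. A set of golden request/fingerprint pairs run against both implementations in CI is one way to catch drift.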
Cache propagation
Origin writes to the cache when a new response comes back from the provider. The write goes to: local origin cache (Redis, immediate consistency), and the global KV (eventual consistency, ~60-second propagation window). PoPs around the world pick up the write as KV propagation completes.
The propagation window is the trade-off. During that window, the same request from two different PoPs may both miss the edge cache, both round-trip to origin, both compute the same response, and both write to the KV. The cache is technically populated twice (or N times for N PoPs that miss in the window) before the propagation completes. This is wasteful but not incorrect, and the wasted compute is a small fraction of total traffic at any meaningful scale.
What lives at origin
Even with a robust edge layer, several things stay at origin:
- The provider calls themselves. The edge shouldn't call OpenAI or Anthropic directly, because you don't want your provider API keys distributed to 200 PoPs. Provider calls happen from origin, and the response is then cached, returned, and written to the global KV for future edge hits.
- Database writes for usage logging. The edge logs hits to a queue; origin drains the queue and writes to the usage_logs table. Some platforms (Cloudflare Analytics Engine, etc.) let you push directly to a stream from the edge — that's fine too. The point is the durable record lives in a centralised system.
- Billing and quota enforcement. Per-account balance checks and rate-limit counters live at origin because they need transactional consistency. The edge can serve cached responses without touching them (cached responses are $0 cost), but cache-miss requests check balance at origin before dispatching to the provider.
- Embedding inference for semantic caching. Semantic cache adds an embedding-call cost on every miss; running that at the edge is theoretically possible but adds complexity. Most production deployments keep it at origin and accept that semantic-cache lookups are a "miss at edge, hit at origin" pattern.
Latency budget engineering
Latency budget is the engineering discipline of allocating milliseconds to each stage of the request and refusing to exceed the total. For an AI gateway aiming at a 200ms p95 cache-hit latency:
| Stage | Budget |
|---|---|
| User-to-edge network | 30ms (Cloudflare claims sub-50ms to 95% of internet users; real-world more like 30-80ms p95) |
| Edge auth | 5ms (cached validation) |
| Edge cache lookup (Workers KV) | 30ms p95 |
| Edge response assembly | 5ms |
| Edge-to-user response | 30ms (return path; usually mirrors the inbound) |
| Total cache hit at edge | ~100ms p95 |
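Cache-miss latency is dominated by the provider call (200-2000ms depending on model and prompt size), which the gateway cannot shorten, so the optimisation surface is almost entirely the cache-hit path. The stage budgets above sum to ~100ms, which leaves about 100ms of headroom under the 200ms target for longer network legs. Every millisecond shaved off the cache-hit path is a millisecond saved on every request the edge serves from cache.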
The discipline that holds up: timestamp every stage with console.log or equivalent, accumulate the deltas, surface them in observability. If the cache-hit p95 starts drifting, the timestamps tell you which stage moved. Most regressions land in one of: edge cold start (a deployment), KV propagation lag (a regional incident), or auth-validation latency (cache eviction in the edge-local key store).
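The per-stage timestamping described above can be sketched as a small timer that records a mark after each stage and reports which stages exceeded their budget. The budget values come from the table; the class, function, and stage names are hypothetical. The two network legs cannot be observed from inside the edge function, so in practice only the auth, lookup, and assembly stages would be marked there.

```typescript
// Per-stage budgets in milliseconds, taken from the budget table.
const BUDGET_MS: Record<string, number> = {
  userToEdge: 30,
  auth: 5,
  kvLookup: 30,
  assemble: 5,
  edgeToUser: 30,
};

// Sum of all stage budgets: 30 + 5 + 30 + 5 + 30 = 100.
const totalBudgetMs = Object.values(BUDGET_MS).reduce((a, b) => a + b, 0);

class StageTimer {
  private readonly now: () => number;
  private readonly marks: Array<{ stage: string; at: number }> = [];

  // The clock is injectable so the timer can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.marks.push({ stage: "start", at: this.now() });
  }

  // Call once immediately after each stage completes.
  mark(stage: string): void {
    this.marks.push({ stage, at: this.now() });
  }

  // Time spent in each stage: the delta between consecutive marks.
  deltas(): Record<string, number> {
    const out: Record<string, number> = {};
    for (let i = 1; i < this.marks.length; i++) {
      out[this.marks[i].stage] = this.marks[i].at - this.marks[i - 1].at;
    }
    return out;
  }
}

// Names of the measured stages that exceeded their budget.
function overBudget(
  deltas: Record<string, number>,
  budget: Record<string, number>,
): string[] {
  return Object.keys(deltas).filter((s) => s in budget && deltas[s] > budget[s]);
}
```

A handler would call `timer.mark("auth")`, `timer.mark("kvLookup")`, and so on, then emit `timer.deltas()` to whatever log or metrics sink is in use, so that a drifting p95 can be traced to a single stage.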
What semantic caching at edge looks like
The honest answer is: most teams don't run semantic caching at edge. It requires either an embedding inference at the edge (which a Cloudflare Worker can do via Workers AI, but adds 100-200ms) or a round-trip to origin for embedding (which defeats the point of being at the edge).
The pragmatic pattern: exact-match cache at edge, semantic cache at origin. Exact-match catches 5-15% of traffic, the slice that benefits most from sub-100ms edge latency anyway (cron jobs, deterministic queries, duplicate-submit user actions). Semantic catches another 25-50% at origin, where latency is dominated by the embedding inference anyway and the extra origin-region round-trip is marginal by comparison.
When edge-side embedding gets cheaper (Workers AI is steadily improving), this calculus shifts. Today, the split is the right call.
Failure modes worth designing for
Multi-region deployments add new failure modes. The patterns to anticipate:
Edge-to-origin failure. The PoP can't reach origin (network partition, origin overload, origin region outage). Two options: return a 5xx from the edge (fails fast and keeps the freshness guarantee, but rejects the request), or serve a stale cache entry if one is present (weakens the freshness guarantee but keeps responses flowing). The right choice depends on your workload: a chatbot can serve stale; a real-time pricing system cannot.
KV propagation lag spike. A regional issue at the KV provider causes propagation to slow from 60s to 30 minutes. Cache writes happen at origin but don't reach edges. The visible symptom: cache hit rate at edges drops. The fix: monitor cache-hit-rate per-PoP, alert on drift, route around the affected provider if needed.
Cross-region cost surprises. Cache writes propagate to every PoP. If you have 200 PoPs and your cache turnover is high, you may be paying for 200x the KV bandwidth you expected. Mitigation: monitor KV write volume per day, set explicit caps, use TTL aggressively to keep cache turnover bounded.
Provider-region misalignment. Origin is in Mumbai but you're calling OpenAI's US-East endpoint. The provider call adds ~200ms of east-west latency. Mitigation: pick origin region with provider proximity in mind, or use providers with more region options (Google has APAC capacity; Anthropic increasingly does too).
Edge cold start. The first request to a PoP that hasn't served traffic recently pays a cold-start penalty (typically 20-50ms on Cloudflare Workers; much higher on AWS Lambda@Edge). Mitigation: keep PoPs warm with synthetic traffic, or accept the occasional cold-start hit (it's rare in production).
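For the edge-to-origin failure case, the choice between returning a 5xx and serving stale can be written as a small pure decision function. This is a sketch under assumptions: the types, the policy fields, and the function name are hypothetical, and it presumes cache entries are stored with a physical expiry longer than their logical TTL, since otherwise no stale entry would still exist to serve.

```typescript
interface CachedEntry {
  body: string;
  storedAtMs: number; // when origin wrote the entry
  ttlMs: number;      // logical freshness window
}

interface StalePolicy {
  allowStale: boolean; // e.g. true for a chatbot, false for real-time pricing
  maxStaleMs: number;  // how far past the TTL a stale entry may still be served
}

type EdgeDecision =
  | { kind: "serve_fresh"; body: string }
  | { kind: "proxy_to_origin" }
  | { kind: "serve_stale"; body: string; ageMs: number }
  | { kind: "fail"; status: 503 };

function decideAtEdge(
  entry: CachedEntry | undefined,
  originReachable: boolean,
  policy: StalePolicy,
  nowMs: number,
): EdgeDecision {
  const ageMs = entry ? nowMs - entry.storedAtMs : 0;
  // Fresh hit: serve from the edge regardless of origin health.
  if (entry && ageMs <= entry.ttlMs) {
    return { kind: "serve_fresh", body: entry.body };
  }
  // Miss or expired entry with a healthy origin: the normal proxy path.
  if (originReachable) {
    return { kind: "proxy_to_origin" };
  }
  // Origin unreachable: fall back to a stale entry only if the policy allows it.
  if (entry && policy.allowStale && ageMs <= entry.ttlMs + policy.maxStaleMs) {
    return { kind: "serve_stale", body: entry.body, ageMs };
  }
  return { kind: "fail", status: 503 };
}
```

In a real handler `originReachable` is usually learned by attempting the proxy and catching a timeout or connection error. Keeping the decision separate from the I/O makes the per-workload policy easy to test.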
How Prism implements multi-region serving
Prism runs Pattern 2: origin in Mumbai (AWS EC2, ap-south-1) with a Cloudflare Worker (prism-edge) fronting the API at every Cloudflare PoP globally. Cache entries replicate via Workers KV with eventual consistency.
- Origin: EC2 t3.small in Mumbai. FastAPI behind nginx. Single-region by design — keeps billing, observability, and provider routing simple.
- Edge: prism-edge Worker deployed to all Cloudflare PoPs. Handles auth (against an edge-local key store synced via KV) and exact-match cache lookup. Sub-100ms p95 in-region; ~180-220ms p95 cross-region.
- Cache replication: Workers KV is the global cache substrate. Cache writes at origin write through to KV; edges pick up the write within ~60 seconds. Cache TTL configurable per-project on Pro+.
- Semantic cache: lives at origin (Mumbai) only. Cross-region semantic-cache hits pay the round-trip; this is intentional because edge-side embedding inference adds more latency than the round-trip saves.
- Provider calls: always from origin. The customer's provider keys (BYOK arriving in v1.9) and Prism's managed keys never leave Mumbai.
- Measured numbers (v1.6 deployment): Singapore-origin cache hits at 184ms; Upstash-only Mumbai-origin fallback at 484ms; pre-edge centralised architecture (deprecated 2026-05) at ~700ms.
Build vs buy
If you're tempted to build Pattern 2 yourself rather than adopt a gateway that ships it:
Build if:
- You already have a Cloudflare account, Workers experience, and a global KV provider relationship
- You have a specific compliance or data-residency requirement that doesn't fit a managed gateway
- You operate at scale where the per-request markup on a managed gateway exceeds the engineering cost
Buy if:
- You want a working edge layer in days rather than weeks
- The semantic-cache + provider-passthrough + observability surface is value you'd build anyway
- You're not certain about long-term geographic distribution and want flexibility
Pattern 2 is non-trivial engineering — 2-4 weeks for a competent team to deploy and stabilise, plus ongoing operational discipline (monitoring KV writes, edge cache hit rates, propagation lag). Most teams running this in-house have a dedicated platform engineer; teams without that capacity are better served by a managed AI gateway that ships the layer as a product.
Decision framework
If you're deciding what to do for your workload:
- Don't go multi-region until international traffic is real. Pattern 1 (single region) is the right answer until you're seeing >10% of traffic from outside your origin region.
- When you do go global, jump to Pattern 2. Multi-region origin (Pattern 3) is rarely worth the complexity at any sub-enterprise scale.
- Edge handles cache + auth; origin handles everything else. Don't try to put the model calls or the database at the edge.
- Exact-match cache at edge, semantic at origin. The right tradeoff today; revisit when edge-side embedding gets faster.
- Latency budgets matter. Allocate explicitly; measure per-stage; alert on drift.
- Plan for KV propagation lag and edge-to-origin failures. They'll happen.
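The pattern-selection part of this framework can be condensed into a few lines. The thresholds (10% international traffic, $100K/month LLM spend) are the ones quoted in this guide; the type, field, and function names are hypothetical, and the result is a starting point for discussion rather than a rule.

```typescript
type Pattern =
  | "pattern-1-centralised"
  | "pattern-2-edge-cache"
  | "pattern-3-multi-region";

interface Workload {
  internationalTrafficShare: number;        // 0..1, traffic from outside the origin region
  internationalLatencyComplaints: boolean;  // p95 latency showing up in customer feedback
  monthlyLlmSpendUsd: number;
  globallyDistributed: boolean;
  singleRegionOutageIsExistential: boolean;
}

function recommendPattern(w: Workload): Pattern {
  // Pattern 3 only for very large global spend or existential single-region risk.
  if (
    w.singleRegionOutageIsExistential ||
    (w.monthlyLlmSpendUsd > 100000 && w.globallyDistributed)
  ) {
    return "pattern-3-multi-region";
  }
  // Pattern 2 once international traffic is a real signal.
  if (w.internationalTrafficShare > 0.1 || w.internationalLatencyComplaints) {
    return "pattern-2-edge-cache";
  }
  // Otherwise stay single-region.
  return "pattern-1-centralised";
}
```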
The economics of edge-fronted AI gateways are unusually favourable for international workloads: the latency improvement is dramatic, the engineering work is bounded, and the cost stays small relative to AI spend (Cloudflare Workers + KV pricing scales sub-linearly with traffic).
Where to go next
If you're comparing AI gateways that ship multi-region as a product: Prism vs Portkey, Prism vs Cloudflare AI Gateway (when published), and Prism vs Helicone.
If you're combining edge serving with the broader caching wedge: AI API caching walks through the three-layer cache stack that lives behind the edge.
If you want concrete numbers from real workloads: the savings calculator models your workload's impact at the cache layer; the latency story is workload-specific and harder to model generically.
Frequently asked questions
Should I use Cloudflare Workers AI to run the LLM at the edge instead?
For some workloads, yes. Workers AI runs small open-weights models at the edge with sub-100ms first-token latency — appropriate for classification, routing decisions, simple summarisation. Not appropriate for the full Claude / GPT / Gemini-class workload that's the bread-and-butter of most AI products. The hybrid pattern is: small at edge (classification, routing), large at origin (the actual customer-visible inference). That's a different architecture than what this guide covers.
Why not use Vercel Edge Functions instead of Cloudflare Workers?
Both work. Vercel Edge runs on Cloudflare's underlying network, so the global latency profile is similar. Choice between them depends on the rest of your stack — if you're already on Vercel for the frontend, Vercel Edge is the natural fit; if you're running your own infrastructure outside Vercel, Cloudflare Workers + Workers KV gives more direct control. Prism uses Cloudflare Workers directly because Workers KV is the cache substrate we want to control.
What's a realistic edge cache-hit rate?
Same as the per-layer cache-hit rate at origin. The edge doesn't increase hit rate; it just serves hits faster. Exact-match catches 5-15% of traffic; semantic catches another 25-50% on workloads with paraphrasable intent. The edge layer applies to the exact-match share (since semantic stays at origin in the typical pattern).
Can I run origin in multiple regions for resilience without going full multi-region (Pattern 3)?
Pattern 2.5: origin in one primary region, a read-only standby in a second region, with manual failover. Possible but operationally unusual — most teams either run single-region (Pattern 1/2) and accept the regional risk, or go full multi-region for global products. The middle ground is usually a Pattern 3 starter that hasn't finished propagating the database side.
Does the edge layer add any cost beyond Cloudflare Workers + KV pricing?
Yes — operational time. Monitoring, alerting on KV propagation lag, debugging when a PoP misbehaves, managing edge-local secrets. Plan for ~2-4 hours/month of platform-engineer time at steady state, more during incident response. Cloudflare Workers + KV themselves cost very little (Workers free tier covers most small deployments; KV is sub-$10/month at moderate scale).
How does this interact with provider rate limits?
Provider rate limits live at origin (since that's where provider calls happen). The edge layer doesn't change them. If you're bumping against provider rate limits, the fix is provider-side (request rate-limit increases, key rotation, multiple keys) — not edge-side.
What about GDPR / data residency?
The edge layer can hold cached responses globally, which means responses to requests originally made by EU users may be replicated to non-EU PoPs. For most chat/Q&A workloads this is fine. For workloads handling regulated data (PII, health information), you may need to restrict cache replication to specific PoPs or skip the edge layer entirely for those projects. Cloudflare supports region-restricted KV namespaces; configure accordingly.
Are there providers with truly global inference today?
Increasingly yes — Google's Vertex AI has APAC regions; Anthropic added APAC; OpenAI has limited APAC capacity. The choice of provider region affects where your origin should live: optimise for "origin close to the provider you call most" to minimise origin-to-provider latency on cache-miss requests.
If you're reading this because international traffic is hurting your AI product's p95 — the savings calculator models the broader cost story, and the AI API caching guide covers the layers that sit behind the edge.