The LLM cost reduction playbook: 14 techniques, measured impact, ranked by ROI
Last updated:
· 14 min readEvery meaningful technique for cutting an LLM bill, ranked by measured production impact. From quick wins (caching, routing) to structural changes (model selection, prompt engineering).
There are 14 techniques that actually reduce LLM API spend in production, ranked here by measured ROI: caching catches 30-60%; routing the right model per request catches 30-50%; provider-native prompt caching catches 40-90% of input-token cost on stable-prefix workloads; prompt compression and structured output catch 10-30%; batching, streaming discipline, and quota-aware retries catch 5-15% each. Stacked together they routinely cut bills in half, sometimes more. This playbook is the technique-by-technique walkthrough — what each one does, how much it saves on which workloads, and the order to implement them. Written for engineers who want to cut a real bill, not for vendors selling theory.
The framing
Before walking through techniques, the framing that makes them rank-orderable: every dollar saved on LLM spend comes from one of three buckets.
1. Don't call the model at all. Caching is the king of this bucket — if you can return a cached response, you avoid the provider call entirely. Memoisation, deterministic routing rules, and certain async-batching patterns also fall here.
2. Call a cheaper model. Routing techniques live here. The right small model for a simple task costs 50-100x less than the right frontier model. Even modest task-type routing captures most of this wedge.
3. Call the model with fewer tokens. Prompt compression, output capping, structured output, provider-native prompt caching (which doesn't avoid the call but discounts the input tokens). Every saved token at the input or output is real money.
The techniques below are organised by which bucket they primarily belong to, ranked within each bucket by typical production ROI. Stack across all three buckets for the largest impact.
Bucket 1 — Don't call the model
Technique 1: Exact-match response caching
ROI: high. Effort: low. The cheapest, safest technique to deploy. Fingerprint each request (SHA-256 over canonicalised messages + model + sampling parameters), store the response in a key-value store like Redis, return on byte-identical repeats. Hit rate in production AI traffic typically 5-15%; the wins are pure savings (provably correct) with sub-10ms p95 lookup overhead. See exact vs semantic caching for the deep dive.
Technique 2: Semantic response caching
ROI: high. Effort: medium. Embed the prompt, look up nearest stored embedding, return cached response if cosine similarity exceeds a threshold (0.95 default). Catches 25-50% of remaining traffic on workloads with paraphrasable intent (customer support, FAQ, documentation Q&A). Stacks with exact-match. The complexity is threshold tuning + false-positive monitoring; the discipline is documented in the semantic cache glossary entry.
Technique 3: Memoisation at the function layer
ROI: medium. Effort: low. Above the gateway, in your application code: if the same input function is called repeatedly with the same arguments, memoise. A standard programming-language pattern that doesn't require any LLM-specific infrastructure. Catches the cases where the LLM call is wrapped in a function that the application calls multiple times per session.
Technique 4: Deterministic routing rules to skip the model
ROI: medium. Effort: low. For specific known patterns, don't even consider the LLM. "If the user types 'help', show the help menu — don't send it to the model." Sounds obvious; commonly missed because the LLM has become the default for everything. Audit your app for cases where a rule-based response is both faster and free.
Bucket 2 — Call a cheaper model
Technique 5: Task-type routing
ROI: very high. Effort: medium. Classify each incoming request by task type (simple / code / reasoning / complex), look up the right model in a routing table calibrated per task. A frontier model for reasoning-heavy work, a fast small model for extraction or formatting. Production deployments running this see 40-60% cost reduction on workloads with mixed task complexity. The discipline is detailed in task-type routing.
Technique 6: Mode-driven routing (eco / balanced / sport)
ROI: high. Effort: low. A simpler form of routing: the caller declares cost-quality intent via a mode flag, the gateway maps mode + task type to the right model. Doesn't require app-layer classification — just a header on each request. The right starting point for teams not ready to build a classifier; covers most of the savings of full task-type routing for most workloads.
Technique 7: Model downsizing on cache-miss
ROI: medium. Effort: low. When the cache misses, retry the request with a cheaper model. Worst-case: the cheap model gives a worse response and the customer notices. Mitigation: use only on workloads where quality is high but not critical (internal tools, low-stakes summarisation). Not the right call for customer-facing critical paths.
Technique 8: Provider redundancy + cheapest-route arbitrage
ROI: medium. Effort: medium-high. If multiple providers host equivalent models (e.g. Llama 3 hosted on Groq, Cerebras, Fireworks, Together), pick the cheapest healthy one per request. The price spread can be 2-3x for the same model across hosts; arbitrage captures that. The cost is provider-relationship complexity — managing multiple API keys, watching for individual provider degradation, handling per-provider rate limits.
Bucket 3 — Call with fewer tokens
Technique 9: Provider-native prompt caching
ROI: very high on stable-prefix workloads. Effort: low. Anthropic discounts cache-read tokens to 10% of normal input price (a 90% discount, with a 25% write premium on first request); OpenAI discounts cached tokens to 50% automatically on prompts ≥1,024 tokens. Pays off any time your system prompt or static context is stable across requests, which is most production traffic. The deep dive: prompt caching and provider-native caching.
Technique 10: Prompt compression
ROI: medium-high. Effort: medium. Reduce input tokens by compressing the prompt: trim redundant context, remove irrelevant retrievals, summarise lengthy chat history before sending. Specific techniques: sliding window with summary (keep the last N messages + a summary of older ones), tools like LLMLingua that compress prompts via a smaller model before sending to the large one. Production deployments see 20-40% input-token reduction with negligible quality impact when done well. Done poorly, quality regression is real.
Technique 11: Output capping with max_tokens
ROI: high on verbose workloads. Effort: trivial. Set max_tokens aggressively. If the response can be 200 tokens, don't leave the cap at the default 4096. Output tokens cost 4-5x input tokens; constraining output is one of the highest-leverage changes you can make in 30 seconds. The trade is occasionally truncated responses; the mitigation is sensible per-task max_tokens defaults.
Technique 12: Structured output / JSON mode
ROI: medium. Effort: low. Using JSON mode or structured output (vs free-form text) typically reduces output tokens by 30-50% because the LLM doesn't pad with explanatory prose. The trade: structured output is more constrained in style. Works well for extraction, classification, function-calling-shaped tasks. Doesn't work for "explain this to me like I'm five" workloads.
Cross-bucket techniques
Technique 13: Async batching with Batch API
ROI: very high for tolerant workloads. Effort: medium. OpenAI's Batch API offers 50% discount on chat completions when you submit a batch of requests and accept up to 24-hour processing latency. Same Anthropic on a similar pattern. For workloads that don't need real-time response (offline analysis, bulk classification, evaluation runs), this is one of the highest-leverage techniques available. Doesn't apply to anything user-facing where latency matters.
Technique 14: Streaming discipline + cancellation
ROI: small but real. Effort: small. Some applications send a long-running streaming request and then stop using the response before it completes — but the provider keeps generating tokens until the connection closes. Disciplined streaming: cancel as soon as the application has what it needs (the user navigated away, the downstream function got its answer, etc.). Modest savings (5-10% on workloads with frequent cancellations); meaningful at scale.
How they stack
The compound savings story is what makes the LLM cost-reduction discipline so favourable. A worked example: a 20K req/day support chatbot on Claude Sonnet, starting at $138/day baseline.
| Layer | Technique | Cumulative savings | Running total |
|---|---|---|---|
| Bucket 1 | Exact + semantic caching (~43% combined) | $59/day | $79/day |
| Bucket 3 | Anthropic prompt caching on system prompt (~60% input-token reduction on the calls that miss caching) | $9/day | $70/day |
| Bucket 2 | Task-type routing (~35% of remaining traffic to cheaper models with no quality impact) | $11/day | $59/day |
| Bucket 3 | max_tokens tuned per task type (~15% output-token reduction) | $5/day | $54/day |
| Total stacked | — | $84/day saved (~61%) | $54/day from $138/day |
VERIFY (founder): replace the worked example with one drawn from real Prism customer data or representative aggregated patterns at current provider pricing. The illustrative numbers are reasonable but worth grounding in production.
The point isn't the specific numbers — it's the shape. Each technique adds incremental savings on top of the previous ones, and the discipline compounds. The 61% reduction in this example is normal for well-instrumented production deployments; teams that don't deploy any of these techniques pay 2.5x what they need to.
Implementation order
The right order to deploy these techniques depends on team capacity, but the canonical sequence:
Week 1 — the quick wins (Bucket 1 + 3 cheapest):
- Set sensible
max_tokensper task type (Technique 11) — 30 minutes of work - Audit for rule-based skip-the-LLM patterns (Technique 4) — half a day
- Deploy exact-match caching (Technique 1) — half a day with a managed gateway, 2-3 days self-built
- Enable structured output where applicable (Technique 12) — varies by use case
By end of week 1, you should see 15-30% cost reduction on instrumented workloads.
Week 2-3 — the routing wedge (Bucket 2): 5. Deploy mode-driven or task-type routing (Techniques 5, 6) 6. Wire provider-native prompt caching for Anthropic + OpenAI traffic (Technique 9)
By end of week 3, cumulative reduction typically 40-55%.
Week 4+ — the optimisation continuation: 7. Deploy semantic caching with threshold tuning (Technique 2) 8. Evaluate Batch API for tolerant workloads (Technique 13) 9. Investigate prompt compression (Technique 10) — requires more careful quality monitoring
By month 2, well-instrumented workloads land at 50-65% reduction sustainably.
What doesn't work
Equally important: techniques that are talked about but don't actually reduce cost meaningfully.
Switching to a "cheaper provider" for everything. Across-the-board price arbitrage doesn't typically work because the quality differences across price tiers are real. Per-task routing captures the right cheap-model wins; per-everything routing trades quality for marginal savings.
Aggressive model quantisation / self-hosting open-weights. Theoretically cheap; practically expensive in engineering time and operational overhead unless you're at >$50K/month in LLM spend with serious infrastructure capacity.
Pre-emptively summarising user input. Adds an LLM call to the pipeline, often costs more than it saves unless your inputs are pathologically long.
Switching the entire stack to embeddings + retrieval to avoid LLM calls. Useful when applicable; not a general-purpose cost-reduction lever.
How Prism implements the wedge
Prism ships the highest-ROI techniques as default-on product features:
- Layer 1 + Layer 2 caching automatic on every Pro+ request — exact-match (Technique 1) + semantic (Technique 2) running concurrently
- Provider-native passthrough (Technique 9) with the discount passed through to the customer, not absorbed
- Mode-driven routing (Technique 6) via
X-Prism-Modeheader — eco/balanced/sport map to the right model per task type, calibrated from a benchmark - Per-feature cost attribution so you can see which Bucket-1/2/3 levers are saving the most on which features
- Public savings counter +
X-Prism-Cache-Saved-Centsheader per response — measurable savings, not vendor estimates
What Prism doesn't (yet) ship as a managed feature: prompt compression (Technique 10) and Batch API integration (Technique 13). Both are roadmap candidates; today both are application-layer concerns.
VERIFY (founder): confirm Prism feature mapping above matches the production tier matrix. Add or remove items based on what's actually live.
Decision framework
If you're standing up LLM cost-reduction discipline on a real team:
- Start with attribution. You can't reduce what you can't see. See LLM observability for the framework.
- Quick wins first. max_tokens + cache + rule-based skips deliver 15-30% in week one.
- Routing next. Mode-driven or task-type routing adds another 20-30% in week two.
- Provider-native passthrough is free money. Wire it for Anthropic + OpenAI immediately.
- Semantic caching is the wedge for chat/FAQ workloads. Tune threshold and monitor false positives.
- Batch API for offline workloads. 50% discount; deploy when applicable.
- Don't over-engineer. Prompt compression and self-hosting are the right call only for specific large-scale workloads.
The combined impact of these techniques is the difference between an LLM bill that scales linearly with usage and one that scales sub-linearly. Engineering the latter is the cost discipline that makes AI products defensible at scale.
Where to go next
For the techniques deepest dives: AI API caching (Techniques 1 + 2 + 9), LLM observability (attribution + monitoring), LLM budget governance (the FinOps layer on top of cost reduction).
For platform comparisons: Prism vs Portkey, Prism vs Helicone, Prism vs LiteLLM, and the AI gateway comparison.
For modelling your specific workload: savings calculator.
Frequently asked questions
How much can I realistically reduce my LLM bill?
40-60% on workloads with mixed task complexity and stable system prompts. 20-30% on workloads where every request is genuinely novel (e.g. one-shot transformation of fresh content). Above 60% requires aggressive engineering — Batch API for tolerant workloads, prompt compression, or moving to self-hosted open-weights for high-volume slices.
Which technique has the best ROI?
Provider-native prompt caching (Technique 9) is the highest-ROI quick win — 90% discount on cached tokens for Anthropic, 50% for OpenAI, with minimal engineering work. The next-highest is exact-match caching (Technique 1) — cheap, safe, immediate. Both should be deployed in week one.
Can I get to zero LLM cost?
No, except for workloads that don't actually need LLMs. The point of cost reduction is sub-linear scaling, not zero — the LLM call has to happen for cache misses, novel requests, and reasoning-heavy work. The right question is "what's the lowest cost per useful response" rather than "how do I avoid paying."
Does routing actually preserve quality?
When done well, yes. Task-type routing sends simple requests to small fast models and complex reasoning to frontier models — the small models handle the simple work fine; the frontier models earn their price on the work that needs them. Quality regressions show up when routing pushes too much work to small models without measuring downstream feedback. Pair routing with feedback capture.
Are these techniques compatible with all providers?
Mostly. Caching techniques (1, 2) work across all providers. Provider-native prompt caching (9) is provider-specific — Anthropic and OpenAI implement it; not all alternatives do. Routing (5, 6) requires either a gateway that supports multi-provider or your own routing layer. Batch API (13) is OpenAI + Anthropic; other providers vary. Most techniques apply broadly.
What's the relationship between cost reduction and quality?
Inversely tied at the margins, neutral or positive at the foundations. Caching (1, 2) is quality-neutral on hits, quality-negative on false positives — tune threshold to keep false positives <2%. Routing is quality-positive when done well (the right model per task often beats one model for everything) and quality-negative when over-aggressive. Output capping is quality-negative if set too tight. Most teams optimise to keep quality flat or rising while cost falls; that's achievable with disciplined instrumentation.
Is Batch API worth the latency overhead?
For any workload that doesn't need real-time response, yes — 50% discount is substantial. Common applications: offline analytics on logged user data, bulk classification of incoming items, evaluation runs against datasets. The 24-hour latency rules out interactive workloads; for batch workflows it's a clean win.
What about self-hosting open-weights models?
Depends on scale. Self-hosting Llama or Mistral on your own GPUs makes economic sense above ~$30-50K/month in equivalent provider spend, assuming you have infrastructure operations capacity. Below that, the engineering + ops cost exceeds the savings. Above that, it can be the largest cost-reduction lever available — but it's a strategic platform decision, not a quick win.
The techniques above stack with each other and with the broader AI FinOps discipline. See LLM budget governance for the layer that adds attribution + budgets + policy on top.
Deep dives on llm cost reduction
Five cluster posts unpack the sub-topics of this pillar. Each ships independently as part of the content calendar.