What is llm cost reduction?

14 techniques ranked by ROI, each with measured savings on real workloads. Prism covers this topic from the perspective of an AI API proxy that ships measured production data on every request — not vendor estimates.

How does Prism handle llm cost reduction?

Prism is an OpenAI-compatible AI API proxy that addresses llm cost reduction directly. See the deep-dive posts in this guide for the per-sub-topic implementation details, or jump to the savings calculator to model the impact on your workload.

LLM Cost Reduction — Prism guide

Every meaningful technique for cutting an LLM bill, ranked by measured production impact. From quick wins (caching, routing) to structural changes (model selection, prompt engineering).

There are 14 techniques that actually reduce LLM API spend in production, ranked here by measured ROI: caching catches 30-60%; routing the right model per request catches 30-50%; provider-native prompt caching catches 40-90% of input-token cost on stable-prefix workloads; prompt compression and structured output catch 10-30%; batching, streaming discipline, and quota-aware retries catch 5-15% each. Stacked together they routinely cut bills in half, sometimes more. This playbook is the technique-by-technique walkthrough — what each one does, how much it saves on which workloads, and the order to implement them. Written for engineers who want to cut a real bill, not for vendors selling theory.

The framing

Before walking through techniques, the framing that makes them rank-orderable: every dollar saved on LLM spend comes from one of three buckets.

1. Don't call the model at all. Caching is the king of this bucket — if you can return a cached response, you avoid the provider call entirely. Memoisation, deterministic routing rules, and certain async-batching patterns also fall here.

2. Call a cheaper model. Routing techniques live here. The right small model for a simple task costs 50-100x less than the right frontier model. Even modest task-type routing captures most of this wedge.

3. Call the model with fewer tokens. Prompt compression, output capping, structured output, provider-native prompt caching (which doesn't avoid the call but discounts the input tokens). Every saved token at the input or output is real money.

The techniques below are organised by which bucket they primarily belong to, ranked within each bucket by typical production ROI. Stack across all three buckets for the largest impact.

Bucket 1 — Don't call the model

Technique 1: Exact-match response caching

ROI: high. Effort: low. The cheapest, safest technique to deploy. Fingerprint each request (SHA-256 over canonicalised messages + model + sampling parameters), store the response in a key-value store like Redis, return on byte-identical repeats. Hit rate in production AI traffic typically 5-15%; the wins are pure savings (provably correct) with sub-10ms p95 lookup overhead. See exact vs semantic caching for the deep dive.

Technique 2: Semantic response caching

ROI: high. Effort: medium. Embed the prompt, look up nearest stored embedding, return cached response if cosine similarity exceeds a threshold (0.95 default). Catches 25-50% of remaining traffic on workloads with paraphrasable intent (customer support, FAQ, documentation Q&A). Stacks with exact-match. The complexity is threshold tuning + false-positive monitoring; the discipline is documented in the semantic cache glossary entry.

Technique 3: Memoisation at the function layer

ROI: medium. Effort: low. Above the gateway, in your application code: if the same input function is called repeatedly with the same arguments, memoise. A standard programming-language pattern that doesn't require any LLM-specific infrastructure. Catches the cases where the LLM call is wrapped in a function that the application calls multiple times per session.

Technique 4: Deterministic routing rules to skip the model

ROI: medium. Effort: low. For specific known patterns, don't even consider the LLM. "If the user types 'help', show the help menu — don't send it to the model." Sounds obvious; commonly missed because the LLM has become the default for everything. Audit your app for cases where a rule-based response is both faster and free.

Bucket 2 — Call a cheaper model

Technique 5: Task-type routing

ROI: very high. Effort: medium. Classify each incoming request by task type (simple / code / reasoning / complex), look up the right model in a routing table calibrated per task. A frontier model for reasoning-heavy work, a fast small model for extraction or formatting. Production deployments running this see 40-60% cost reduction on workloads with mixed task complexity. The discipline is detailed in task-type routing.

Technique 6: Mode-driven routing (eco / balanced / sport)

ROI: high. Effort: low. A simpler form of routing: the caller declares cost-quality intent via a mode flag, the gateway maps mode + task type to the right model. Doesn't require app-layer classification — just a header on each request. The right starting point for teams not ready to build a classifier; covers most of the savings of full task-type routing for most workloads.

Technique 7: Model downsizing on cache-miss

ROI: medium. Effort: low. When the cache misses, retry the request with a cheaper model. Worst-case: the cheap model gives a worse response and the customer notices. Mitigation: use only on workloads where quality is high but not critical (internal tools, low-stakes summarisation). Not the right call for customer-facing critical paths.

Technique 8: Provider redundancy + cheapest-route arbitrage

ROI: medium. Effort: medium-high. If multiple providers host equivalent models (e.g. Llama 3 hosted on Groq, Cerebras, Fireworks, Together), pick the cheapest healthy one per request. The price spread can be 2-3x for the same model across hosts; arbitrage captures that. The cost is provider-relationship complexity — managing multiple API keys, watching for individual provider degradation, handling per-provider rate limits.

Bucket 3 — Call with fewer tokens

Technique 9: Provider-native prompt caching

ROI: very high on stable-prefix workloads. Effort: low. Anthropic discounts cache-read tokens to 10% of normal input price (a 90% discount, with a 25% write premium on first request); OpenAI discounts cached tokens to 50% automatically on prompts ≥1,024 tokens. Pays off any time your system prompt or static context is stable across requests, which is most production traffic. The deep dive: prompt caching and provider-native caching.

Technique 10: Prompt compression

ROI: medium-high. Effort: medium. Reduce input tokens by compressing the prompt: trim redundant context, remove irrelevant retrievals, summarise lengthy chat history before sending. Specific techniques: sliding window with summary (keep the last N messages + a summary of older ones), tools like LLMLingua that compress prompts via a smaller model before sending to the large one. Production deployments see 20-40% input-token reduction with negligible quality impact when done well. Done poorly, quality regression is real.

Technique 11: Output capping with max_tokens

ROI: high on verbose workloads. Effort: trivial. Set max_tokens aggressively. If the response can be 200 tokens, don't leave the cap at the default 4096. Output tokens cost 4-5x input tokens; constraining output is one of the highest-leverage changes you can make in 30 seconds. The trade is occasionally truncated responses; the mitigation is sensible per-task max_tokens defaults.

Technique 12: Structured output / JSON mode

ROI: medium. Effort: low. Using JSON mode or structured output (vs free-form text) typically reduces output tokens by 30-50% because the LLM doesn't pad with explanatory prose. The trade: structured output is more constrained in style. Works well for extraction, classification, function-calling-shaped tasks. Doesn't work for "explain this to me like I'm five" workloads.

Cross-bucket techniques

Technique 13: Async batching with Batch API

ROI: very high for tolerant workloads. Effort: medium. OpenAI's Batch API offers 50% discount on chat completions when you submit a batch of requests and accept up to 24-hour processing latency. Same Anthropic on a similar pattern. For workloads that don't need real-time response (offline analysis, bulk classification, evaluation runs), this is one of the highest-leverage techniques available. Doesn't apply to anything user-facing where latency matters.

Technique 14: Streaming discipline + cancellation

ROI: small but real. Effort: small. Some applications send a long-running streaming request and then stop using the response before it completes — but the provider keeps generating tokens until the connection closes. Disciplined streaming: cancel as soon as the application has what it needs (the user navigated away, the downstream function got its answer, etc.). Modest savings (5-10% on workloads with frequent cancellations); meaningful at scale.

How they stack

The compound savings story is what makes the LLM cost-reduction discipline so favourable. A worked example: a 20K req/day support chatbot on Claude Sonnet, starting at $138/day baseline.

Layer	Technique	Cumulative savings	Running total
Bucket 1	Exact + semantic caching (~43% combined)	$59/day	$79/day
Bucket 3	Anthropic prompt caching on system prompt (~60% input-token reduction on the calls that miss caching)	$9/day	$70/day
Bucket 2	Task-type routing (~35% of remaining traffic to cheaper models with no quality impact)	$11/day	$59/day
Bucket 3	max_tokens tuned per task type (~15% output-token reduction)	$5/day	$54/day
Total stacked	—	$84/day saved (~61%)	$54/day from $138/day

VERIFY (founder): replace the worked example with one drawn from real Prism customer data or representative aggregated patterns at current provider pricing. The illustrative numbers are reasonable but worth grounding in production.

The point isn't the specific numbers — it's the shape. Each technique adds incremental savings on top of the previous ones, and the discipline compounds. The 61% reduction in this example is normal for well-instrumented production deployments; teams that don't deploy any of these techniques pay 2.5x what they need to.

Implementation order

The right order to deploy these techniques depends on team capacity, but the canonical sequence:

Week 1 — the quick wins (Bucket 1 + 3 cheapest):

Set sensible max_tokens per task type (Technique 11) — 30 minutes of work
Audit for rule-based skip-the-LLM patterns (Technique 4) — half a day
Deploy exact-match caching (Technique 1) — half a day with a managed gateway, 2-3 days self-built
Enable structured output where applicable (Technique 12) — varies by use case

By end of week 1, you should see 15-30% cost reduction on instrumented workloads.

Week 2-3 — the routing wedge (Bucket 2): 5. Deploy mode-driven or task-type routing (Techniques 5, 6) 6. Wire provider-native prompt caching for Anthropic + OpenAI traffic (Technique 9)

By end of week 3, cumulative reduction typically 40-55%.

Week 4+ — the optimisation continuation: 7. Deploy semantic caching with threshold tuning (Technique 2) 8. Evaluate Batch API for tolerant workloads (Technique 13) 9. Investigate prompt compression (Technique 10) — requires more careful quality monitoring

By month 2, well-instrumented workloads land at 50-65% reduction sustainably.

What doesn't work

Equally important: techniques that are talked about but don't actually reduce cost meaningfully.

Switching to a "cheaper provider" for everything. Across-the-board price arbitrage doesn't typically work because the quality differences across price tiers are real. Per-task routing captures the right cheap-model wins; per-everything routing trades quality for marginal savings.

Aggressive model quantisation / self-hosting open-weights. Theoretically cheap; practically expensive in engineering time and operational overhead unless you're at >$50K/month in LLM spend with serious infrastructure capacity.

Pre-emptively summarising user input. Adds an LLM call to the pipeline, often costs more than it saves unless your inputs are pathologically long.

Switching the entire stack to embeddings + retrieval to avoid LLM calls. Useful when applicable; not a general-purpose cost-reduction lever.

How Prism implements the wedge

Prism ships the highest-ROI techniques as default-on product features:

Layer 1 + Layer 2 caching automatic on every Pro+ request — exact-match (Technique 1) + semantic (Technique 2) running concurrently
Provider-native passthrough (Technique 9) with the discount passed through to the customer, not absorbed
Mode-driven routing (Technique 6) via X-Prism-Mode header — eco/balanced/sport map to the right model per task type, calibrated from a benchmark
Per-feature cost attribution so you can see which Bucket-1/2/3 levers are saving the most on which features
Public savings counter + X-Prism-Cache-Saved-Cents header per response — measurable savings, not vendor estimates

What Prism doesn't (yet) ship as a managed feature: prompt compression (Technique 10) and Batch API integration (Technique 13). Both are roadmap candidates; today both are application-layer concerns.

VERIFY (founder): confirm Prism feature mapping above matches the production tier matrix. Add or remove items based on what's actually live.

Decision framework

If you're standing up LLM cost-reduction discipline on a real team:

Start with attribution. You can't reduce what you can't see. See LLM observability for the framework.
Quick wins first. max_tokens + cache + rule-based skips deliver 15-30% in week one.
Routing next. Mode-driven or task-type routing adds another 20-30% in week two.
Provider-native passthrough is free money. Wire it for Anthropic + OpenAI immediately.
Semantic caching is the wedge for chat/FAQ workloads. Tune threshold and monitor false positives.
Batch API for offline workloads. 50% discount; deploy when applicable.
Don't over-engineer. Prompt compression and self-hosting are the right call only for specific large-scale workloads.

The combined impact of these techniques is the difference between an LLM bill that scales linearly with usage and one that scales sub-linearly. Engineering the latter is the cost discipline that makes AI products defensible at scale.

Where to go next

For the techniques deepest dives: AI API caching (Techniques 1 + 2 + 9), LLM observability (attribution + monitoring), LLM budget governance (the FinOps layer on top of cost reduction).

For platform comparisons: Prism vs Portkey, Prism vs Helicone, Prism vs LiteLLM, and the AI gateway comparison.

For modelling your specific workload: savings calculator.

Frequently asked questions

How much can I realistically reduce my LLM bill?

40-60% on workloads with mixed task complexity and stable system prompts. 20-30% on workloads where every request is genuinely novel (e.g. one-shot transformation of fresh content). Above 60% requires aggressive engineering — Batch API for tolerant workloads, prompt compression, or moving to self-hosted open-weights for high-volume slices.

Which technique has the best ROI?

Provider-native prompt caching (Technique 9) is the highest-ROI quick win — 90% discount on cached tokens for Anthropic, 50% for OpenAI, with minimal engineering work. The next-highest is exact-match caching (Technique 1) — cheap, safe, immediate. Both should be deployed in week one.

Can I get to zero LLM cost?

No, except for workloads that don't actually need LLMs. The point of cost reduction is sub-linear scaling, not zero — the LLM call has to happen for cache misses, novel requests, and reasoning-heavy work. The right question is "what's the lowest cost per useful response" rather than "how do I avoid paying."

Does routing actually preserve quality?

When done well, yes. Task-type routing sends simple requests to small fast models and complex reasoning to frontier models — the small models handle the simple work fine; the frontier models earn their price on the work that needs them. Quality regressions show up when routing pushes too much work to small models without measuring downstream feedback. Pair routing with feedback capture.

Are these techniques compatible with all providers?

Mostly. Caching techniques (1, 2) work across all providers. Provider-native prompt caching (9) is provider-specific — Anthropic and OpenAI implement it; not all alternatives do. Routing (5, 6) requires either a gateway that supports multi-provider or your own routing layer. Batch API (13) is OpenAI + Anthropic; other providers vary. Most techniques apply broadly.

What's the relationship between cost reduction and quality?

Inversely tied at the margins, neutral or positive at the foundations. Caching (1, 2) is quality-neutral on hits, quality-negative on false positives — tune threshold to keep false positives <2%. Routing is quality-positive when done well (the right model per task often beats one model for everything) and quality-negative when over-aggressive. Output capping is quality-negative if set too tight. Most teams optimise to keep quality flat or rising while cost falls; that's achievable with disciplined instrumentation.

Is Batch API worth the latency overhead?

For any workload that doesn't need real-time response, yes — 50% discount is substantial. Common applications: offline analytics on logged user data, bulk classification of incoming items, evaluation runs against datasets. The 24-hour latency rules out interactive workloads; for batch workflows it's a clean win.

What about self-hosting open-weights models?

Depends on scale. Self-hosting Llama or Mistral on your own GPUs makes economic sense above ~$30-50K/month in equivalent provider spend, assuming you have infrastructure operations capacity. Below that, the engineering + ops cost exceeds the savings. Above that, it can be the largest cost-reduction lever available — but it's a strategic platform decision, not a quick win.

The techniques above stack with each other and with the broader AI FinOps discipline. See LLM budget governance for the layer that adds attribution + budgets + policy on top.

The LLM cost reduction playbook: 14 techniques, measured impact, ranked by ROI

The framing

Bucket 1 — Don't call the model

Technique 1: Exact-match response caching

Technique 2: Semantic response caching

Technique 3: Memoisation at the function layer

Technique 4: Deterministic routing rules to skip the model

Bucket 2 — Call a cheaper model

Technique 5: Task-type routing

Technique 6: Mode-driven routing (eco / balanced / sport)

Technique 7: Model downsizing on cache-miss

Technique 8: Provider redundancy + cheapest-route arbitrage

Bucket 3 — Call with fewer tokens

Technique 9: Provider-native prompt caching

Technique 10: Prompt compression

Technique 11: Output capping with max_tokens

Technique 12: Structured output / JSON mode

Cross-bucket techniques

Technique 13: Async batching with Batch API

Technique 14: Streaming discipline + cancellation

How they stack

Implementation order

What doesn't work

How Prism implements the wedge

Decision framework

Where to go next

Frequently asked questions

Deep dives on llm cost reduction

See your savings before you sign up

Frequently asked questions

Related reading

AI API Caching

OpenAI Cost Optimization

Llm cost reduction techniques ranked by roi