LLM observability in 2026: what you need, what you don't, and the framework for picking
Last updated:
16 min read
Three layers of LLM observability (gateway, platform, eval-driven): what each captures, when each matters, and how to pick the stack for your workload.
LLM observability in 2026 is a three-layer discipline: gateway-layer observability (cost, latency, cache status per request, captured by the proxy in front of providers); platform-layer observability (span-level traces, parent sessions, custom metadata, captured by an instrumentation SDK in the application); and evaluation-driven observability (dataset experiments, LLM-as-judge scoring, prompt-version A/B testing, focused on quality rather than throughput). Each layer answers different questions, and most production teams run two or three of them simultaneously. This guide is the framework for picking the right stack — what to capture, what tools fit each layer, what feedback signals matter, and how to combine them without paying for overlap.
Why a three-layer model
Every observability discussion in the LLM space gets tangled because three distinct categories of instrumentation are sold under the same name. Untangling them first makes the rest of the decision tractable.
Gateway-layer observability lives in the proxy between application code and providers. Because every request flows through the gateway, each one automatically gets a row in a usage table: cost, latency, model used, mode chosen, cache status (hit/miss/which layer), token counts, errors. This is the cheapest observability layer to add (any gateway provides it), and it cleanly answers the "what did we just spend / how slow was it" questions.
Platform-layer observability lives in an SDK embedded in the application code. The SDK instruments LLM calls (and increasingly, agent step trees, tool invocations, parent sessions, custom metadata) and emits traces to a central platform — Langfuse, LangSmith, Helicone, or similar. The platform aggregates the traces, runs span-tree visualisation, supports filtering by session/user/feature, and exposes a queryable data model. This is the right layer for "what was the model thinking / how did the agent decide" questions and for the deeper context that a gateway can't see (the gateway sees individual provider calls; the platform sees the surrounding business logic).
Evaluation-driven observability is platform-layer plus quality scoring. Datasets of representative prompts, periodic LLM-as-judge runs against held-out examples, prompt-version A/B testing, human annotation queues. The mechanic is observation in service of quality improvement, not just visibility. Useful for teams shipping production agents where prompt-engineering iteration is a continuous discipline rather than a one-time setup.
A team can run any one of these, any two, or all three. The right combination depends on what questions you're trying to answer and how mature the LLM workload is.
Layer 1 — Gateway observability
The minimum useful instrumentation for any production LLM workload. If you're running an AI API gateway (Prism, Portkey, Helicone, LiteLLM, Cloudflare AI Gateway, OpenRouter), gateway-layer observability is included by default.
What it captures:
- Per-request cost (provider cost + gateway markup)
- Latency (p50, p95, p99 per provider, per model, per task type)
- Cache hit/miss (which layer caught the hit, similarity score on semantic hits)
- Token counts (input, output, cached tokens for provider-native cache hits)
- Errors (error code, retry behaviour, failover triggers)
- Per-project attribution (when tagged via header)
What it answers:
- "Did we just spend an unusual amount on this call?"
- "Which model is slow today?"
- "What's our cache-hit rate by workload?"
- "Which team's traffic spiked overnight?"
What it doesn't answer:
- "What was the agent thinking at each step?"
- "Did this prompt version perform better than the previous one?"
- "Why did the model produce that specific wrong answer?"
For pure-API workloads (single-call request-response patterns), gateway-layer is often all you need. For agent workloads, you'll want Layer 2 on top.
Prism's gateway-layer observability
Prism (and most modern gateways) ship gateway-layer observability as a built-in product surface. The relevant features:
- Per-request explorer at /dashboard/usage: every request as a row, with cost, latency, cache status, model, mode, tokens, error if any. Filterable by date, project, feature tag.
- Feature-tag attribution via the X-Prism-Tags header (up to 10 tags per request): drives per-feature, per-team, per-environment cost dashboards on Pro+.
- Cache analytics at /dashboard/cache: hit rate by layer (exact / semantic / provider-native), by task type, by model, with hit-rate-at-threshold curves so you can simulate tuning.
- Per-response headers with per-request signals: X-Prism-Cache-Status, X-Prism-Cache-Saved-Cents, X-Prism-Cost-Cents, X-Prism-Task-Type, X-Prism-Model. The headers are how application code can react to gateway-layer signals in real time.
- Feedback capture via POST /v1/feedback: thumbs-up/down + rating + comment + tag per request, correlated by the X-Prism-Feedback-Id returned in the response.
VERIFY (founder): confirm the feedback endpoint shape (/v1/feedback), the feedback ID header name, and the cache analytics endpoint (/dashboard/cache vs another path).
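As a rough illustration of reacting to the per-response headers listed above, the sketch below pulls them out of a response's header mapping. The header names come from this article; the value formats (a plain string for cache status, numeric strings for the cents fields) and the GatewaySignals / read_gateway_signals names are assumptions made for the example, not a documented Prism API.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class GatewaySignals:
    """Hypothetical container for the per-response gateway headers."""
    cache_status: Optional[str]
    cost_cents: Optional[float]
    cache_saved_cents: Optional[float]
    task_type: Optional[str]
    model: Optional[str]
    feedback_id: Optional[str]


def _to_float(value: Optional[str]) -> Optional[float]:
    """Parse a numeric header value; return None if absent or malformed."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def read_gateway_signals(headers: Mapping[str, str]) -> GatewaySignals:
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return GatewaySignals(
        cache_status=h.get("x-prism-cache-status"),
        # Assumption: the cents headers carry plain numeric strings.
        cost_cents=_to_float(h.get("x-prism-cost-cents")),
        cache_saved_cents=_to_float(h.get("x-prism-cache-saved-cents")),
        task_type=h.get("x-prism-task-type"),
        model=h.get("x-prism-model"),
        feedback_id=h.get("x-prism-feedback-id"),
    )


signals = read_gateway_signals(
    {"X-Prism-Cache-Status": "hit", "X-Prism-Cost-Cents": "0.42"}
)
```

Missing or malformed headers come back as None rather than raising, so application code can treat the signals as optional.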
Layer 2 — Platform observability
The layer that captures what the gateway can't see: the business logic surrounding the LLM call. Used by teams shipping agents, multi-step workflows, or any system where one user action produces multiple LLM calls connected by application state.
What it captures:
- Span-level traces — each LLM call as a span, parent functions as parent spans, tool invocations as child spans
- Sessions — group multiple traces under one user/session ID
- Custom metadata — anything the application wants to attach (user ID, feature name, A/B variant, retrieval source)
- Scoring — LLM-as-judge results, human ratings, automated quality signals
What it answers:
- "Walk me through what happened in this specific agent session"
- "Which prompt versions are users rating highly?"
- "What's the failure pattern when the agent gets confused?"
- "Which retrieval sources lead to the best outcomes?"
What it doesn't do (typically):
- Sit inline in the request path. Platform observability is parallel — the SDK sends traces to the platform; the model call still happens via whatever provider or gateway the app calls.
- Enforce. The platform observes; it doesn't block requests, enforce budgets, or fail over. That's gateway territory.
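To make the span, session, and metadata vocabulary above concrete, here is a toy in-memory tracer. It is not the Langfuse or LangSmith SDK; the Tracer and Span names are hypothetical, and a real SDK would also export spans to a backend and handle threads and async code, which this sketch does not.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]
    metadata: Dict[str, Any] = field(default_factory=dict)
    started_at: float = 0.0
    ended_at: float = 0.0


class Tracer:
    """Toy tracer: records spans and parent links for one session."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.spans: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **metadata: Any) -> Iterator[Span]:
        # The innermost open span (if any) becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(name, uuid.uuid4().hex, parent_id, dict(metadata), time.monotonic())
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.ended_at = time.monotonic()
            self._stack.pop()


tracer = Tracer(session_id="session-123")
with tracer.span("handle_user_message", user_id="u-1") as root:
    with tracer.span("retrieve_documents", source="kb") as retrieval:
        pass  # a retrieval call would go here
    with tracer.span("llm_call", model="example-model") as llm:
        pass  # the provider or gateway call would go here
```

The resulting tree (one parent, two children) is what the gateway cannot reconstruct on its own: it would only see the single provider call inside llm_call.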
Picking a platform-layer tool
The major players:
- Langfuse — open-source (MIT), self-hostable. Hobby tier free, Core $29/mo, Enterprise $2,499+/mo. Strong on traces + evaluations + prompt management. Pro+ has SOC 2 + ISO27001 + HIPAA support. See Prism vs Langfuse.
- LangSmith — proprietary, LangChain's commercial product. Developer $0, Plus $39/seat/mo, Enterprise custom. Strong on agent-specific features (fleet, sandboxes, SmithDB for trace queries) and LangChain-native instrumentation. See Prism vs LangSmith.
- Helicone — managed SaaS with self-hostable proxy code. Observability-first gateway (so it does Layer 1 + Layer 2 in one product). See Prism vs Helicone.
- Custom build on OpenTelemetry — both Langfuse and LangSmith accept OTel traces. Roll your own platform-layer observability if you have OTel infrastructure already and want to stay self-hosted.
The choice turns on: open-source preference (Langfuse), LangChain-framework usage (LangSmith), unified gateway+observability product (Helicone), or self-hosting via OTel.
Layer 3 — Evaluation-driven observability
The most engineering-heavy and most quality-oriented layer. Built around the idea that observability isn't enough — you need to deliberately test prompts and models against representative cases, score the outputs, and iterate on prompts based on what you learn.
Components:
- Datasets — curated sets of representative prompts + expected outcomes
- Online evaluators — LLM-as-judge scoring functions that run on a fraction of production traffic
- Offline experiments — testing a prompt or model change against a dataset before rolling it out
- Human annotation queues — queues of low-confidence or sampled production responses for human review
- A/B testing — running two prompt variants in parallel and comparing scoring distributions
LangSmith and Langfuse both ship the full surface. Custom builds against OpenTelemetry + a scoring framework (DeepEval, RAGAS, Promptfoo) are possible but represent significant engineering investment.
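A minimal sketch of the offline-experiment component, assuming a dataset of input/expected pairs and an exact-match scorer. The fake_model stub stands in for a real model call, and run_experiment is a hypothetical helper, not an API from LangSmith, Langfuse, or any scoring framework; a production setup would typically swap in an LLM-as-judge scorer.

```python
from statistics import mean
from typing import Callable, Dict, List

Dataset = List[Dict[str, str]]


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected answer, ignoring case."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    dataset: Dataset,
    prompt_template: str,
    call_model: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every dataset example through the prompt and return the mean score."""
    scores = []
    for example in dataset:
        prompt = prompt_template.format(input=example["input"])
        scores.append(scorer(call_model(prompt), example["expected"]))
    return mean(scores)


CAPITALS = {"France": "Paris", "Japan": "Tokyo"}


def fake_model(prompt: str) -> str:
    # Stub model: terse only when the prompt explicitly asks for one word.
    answer = CAPITALS[next(c for c in CAPITALS if c in prompt)]
    return answer if "one word" in prompt else f"The capital is {answer}."


dataset = [
    {"input": "France", "expected": "Paris"},
    {"input": "Japan", "expected": "Tokyo"},
]
baseline_score = run_experiment(dataset, "What is the capital of {input}?", fake_model)
candidate_score = run_experiment(
    dataset, "Answer in one word. What is the capital of {input}?", fake_model
)
```

Comparing baseline_score with candidate_score before rollout is the whole point of the exercise: the prompt change is judged against the same held-out cases rather than by eyeballing a few responses.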
When evaluation-driven observability matters:
- You're shipping a production agent where prompt quality directly affects user outcomes
- You're iterating prompts continuously and need to know whether a change helped or hurt
- You have user feedback signals (thumbs / ratings) and want to correlate them back to prompt versions
When it doesn't:
- Your LLM workload is single-call request-response with no prompt iteration discipline
- You're in early product-market-fit mode where prompt iteration is ad-hoc
- The cost of the evaluation infrastructure exceeds the value of the quality gains
Most teams adopt evaluation-driven observability after they have gateway-layer and platform-layer in place — it's the layer that pays off once you have enough volume to make the experiments statistically meaningful.
What good cost attribution looks like
A specific slice of Layer 1 worth deep-diving because it's the highest-leverage observability practice for AI workloads: per-feature and per-team cost attribution.
The mechanic: every LLM API call carries a request tag (e.g. X-Prism-Tags: team=growth,feature=onboarding-chat,env=production). The gateway persists the tags on the usage log row. The dashboard aggregates by tag — daily / weekly / monthly cost per team, per feature, per environment.
Without this, "AI is expensive" is the conversation; with it, "the onboarding-chat feature is using 60% of our AI budget and we should look at it" is the conversation. The difference in actionability is enormous.
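The aggregation step can be sketched in a few lines, assuming usage rows that pair the raw tag string with a cost in cents. The comma-separated key=value format follows the example header in this article; the parse_tags and cost_by_tag helpers and the sample rows are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_tags(header_value: str) -> Dict[str, str]:
    """Parse 'team=growth,feature=onboarding-chat' into a dict."""
    tags: Dict[str, str] = {}
    for part in header_value.split(","):
        key, sep, value = part.strip().partition("=")
        if sep and key and value:  # skip malformed fragments
            tags[key] = value
    return tags


def cost_by_tag(rows: Iterable[Tuple[str, float]], tag_key: str) -> Dict[str, float]:
    """Sum cost (in cents) per value of one tag key."""
    totals: Dict[str, float] = defaultdict(float)
    for header_value, cost_cents in rows:
        value = parse_tags(header_value).get(tag_key, "(untagged)")
        totals[value] += cost_cents
    return dict(totals)


# Hypothetical usage rows: (tag header value, cost in cents).
rows = [
    ("team=growth,feature=onboarding-chat,env=production", 60.0),
    ("team=growth,feature=search,env=production", 25.0),
    ("team=support,feature=ticket-summary,env=production", 15.0),
]
per_feature = cost_by_tag(rows, "feature")
per_team = cost_by_tag(rows, "team")
```

With these sample rows, onboarding-chat accounts for 60 of the 100 cents, which is exactly the kind of per-feature statement the attribution tags make possible.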
What to tag:
- Team (which group is accountable)
- Feature (which product capability drove the call)
- Environment (production / staging / development)
- Optionally: experiment ID (for A/B tests), user cohort (for free vs paid users), parent flow (for multi-step processes)
The discipline that makes it work:
- Agree the schema once
- Ship it as a shared client wrapper, not as ad-hoc inline header strings
- Lint against drift (feature=chat vs feature=chatbot vs feature=user_chat are three different aggregates of the same workload)
- Audit quarterly: prune unused tags, normalise drift
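One way to get the "shared client wrapper" and "lint against drift" points in a single place is to build the header from a fixed schema and reject anything outside it. The schema contents and the build_tags_header name below are assumptions for illustration; only the X-Prism-Tags header name and the 10-tag limit come from this article.

```python
from typing import Dict

# Hypothetical agreed schema: every key is required, values are enumerated.
TAG_SCHEMA = {
    "team": {"growth", "support"},
    "feature": {"onboarding-chat", "search"},
    "env": {"production", "staging", "development"},
}
MAX_TAGS = 10  # per-request tag limit stated in this article


def build_tags_header(**tags: str) -> Dict[str, str]:
    """Validate tags against the schema and return the header to attach."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"at most {MAX_TAGS} tags per request")
    for key, value in tags.items():
        allowed = TAG_SCHEMA.get(key)
        if allowed is None:
            raise ValueError(f"unknown tag key: {key!r}")
        if value not in allowed:
            raise ValueError(f"unknown value {value!r} for tag {key!r}")
    missing = TAG_SCHEMA.keys() - tags.keys()
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    # Emit keys in schema order so identical tags always serialise identically.
    value = ",".join(f"{key}={tags[key]}" for key in TAG_SCHEMA)
    return {"X-Prism-Tags": value}


headers = build_tags_header(team="growth", feature="onboarding-chat", env="production")
```

Because a call with feature="chatbot" raises instead of silently creating a new aggregate, drift is caught at the call site rather than in the quarterly audit.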
Latency observability — the percentiles that matter
The other deep-dive worth covering: latency tracking. Average latency is misleading. The percentiles that matter:
- p50 (median) — what most users experience. The number to report in a casual conversation.
- p95 — the slow tail. 1 in 20 requests is slower than this. Optimise here.
- p99 — the truly slow tail. 1 in 100 requests. Surfaces provider degradation issues.
Per-provider, per-model latency dashboards are the most useful Layer 1 surface for routing decisions. If Anthropic's p95 latency drifts from 1,200ms to 1,800ms over a week, you want to know — that's a routing-table-revision signal. Per-mode latency (eco/balanced/sport in Prism's vocabulary) is the next dimension worth dashboarding.
Why not p99.9 or p99.99? Sample size. The tail beyond p99.9 is 0.1% of requests, so at 10,000 requests a day it is defined by just 10 samples, and the noise dominates the signal. Real production deployments track p50/p95/p99 and treat the further tail as an SRE problem rather than an optimisation problem.
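For reference, the three percentiles can be computed from raw latency samples with the nearest-rank method, sketched below. Dashboards may use interpolating estimators instead, so treat this as one reasonable definition rather than the one any particular gateway uses; the sample data is synthetic.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return the p50/p95/p99 latencies for one provider/model bucket."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Synthetic latencies: 1 ms, 2 ms, ..., 100 ms.
summary = latency_summary(list(range(1, 101)))
```

Running the same summary per provider, per model, and per mode gives the dimensions discussed above; the average is deliberately left out.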
Feedback capture — the quality signal
The other side of observability is quality signal capture. Users have opinions on whether responses were good; the question is whether your system catches and acts on them.
The mechanic: every response returns a feedback ID (Prism uses X-Prism-Feedback-Id header). The application surfaces a UI for the user to react — thumbs up/down, 1-5 rating, free-text comment, optional category tag. The reaction posts back to the gateway/platform correlated by the feedback ID. The dashboard aggregates feedback by model, prompt version, feature, team.
The discipline:
- Capture friction-free — one click for thumbs, optional comment
- Don't over-prompt — quality feedback fatigue is real; ask for it on a small fraction of responses, not every one
- Tag the feedback with what you want to slice by (prompt version, model, feature)
- Close the loop: act on patterns visible in the aggregate. If thumbs-down spikes on a specific feature, fix the prompt or escalate to a higher-quality model.
Without feedback, observability is one-sided — you see what happened, not whether users liked it. With it, the observability stack becomes a continuous-improvement engine.
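The "close the loop" step amounts to aggregating reactions by whatever you tagged them with. The sketch below computes a thumbs-down rate per prompt version from an in-memory list; the Feedback record shape is an assumption for the example and does not describe the payload of any real feedback endpoint.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Feedback:
    """Hypothetical feedback record, keyed by the ID returned with the response."""
    feedback_id: str
    thumbs_up: bool
    prompt_version: str
    feature: str
    comment: Optional[str] = None


def thumbs_down_rate(events: Iterable[Feedback], key: str) -> Dict[str, float]:
    """Fraction of thumbs-down reactions, grouped by one Feedback attribute."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [total, thumbs_down]
    for event in events:
        bucket = counts[getattr(event, key)]
        bucket[0] += 1
        if not event.thumbs_up:
            bucket[1] += 1
    return {group: down / total for group, (total, down) in counts.items()}


events = [
    Feedback("f1", True, "v1", "chat"),
    Feedback("f2", True, "v1", "chat"),
    Feedback("f3", True, "v1", "search"),
    Feedback("f4", False, "v1", "chat", comment="too long"),
    Feedback("f5", True, "v2", "chat"),
    Feedback("f6", False, "v2", "chat"),
]
by_version = thumbs_down_rate(events, "prompt_version")
```

With small counts like these the rates are noisy, so in practice a minimum sample size per group is worth enforcing before acting on a spike.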
How Prism implements LLM observability
Prism ships Layer 1 (gateway-layer observability) in depth, part of Layer 3 (per-request feedback capture, but not full eval-driven tooling), and defers Layer 2 to dedicated platforms (Langfuse, LangSmith) when teams need it.
Specifically:
- Per-request explorer at /dashboard/usage: filterable by date, project, model, provider, mode, cache status, feature tag. Export to CSV on Paid+.
- Per-feature attribution via X-Prism-Tags (up to 10 tags per request). Pro+ unlocks the per-feature dashboard.
- Latency analytics at /dashboard/usage/latency: p50/p95/p99 (Pro/Team) per provider, per model, per mode.
- Cache analytics at /dashboard/cache: hit rate per layer, hit-rate-at-threshold curves, top hits + misses by prompt fingerprint pattern.
- Audit log at /dashboard/usage → Audit tab: append-only record of policy changes, budget changes, and enforcement firings. 30-day retention on Pro, 365-day on Team.
- Feedback capture via POST /v1/feedback, correlated by X-Prism-Feedback-Id. Aggregated by model + prompt version + tag in the dashboard.
- Provider health dashboard: rolling-window success rate + latency per provider, used internally for failover routing decisions but visible to customers on Team tier.
VERIFY (founder): confirm the dashboard paths (/dashboard/usage/latency, /dashboard/cache, audit tab location). Confirm the Pro vs Team feature splits for latency percentiles + audit retention.
What Prism doesn't ship: span-level tracing, dataset experiments, LLM-as-judge online evaluators, prompt-version A/B testing infrastructure. Those are Layer 2 / Layer 3 territory and best served by Langfuse, LangSmith, or a custom build. The natural production architecture for agent-heavy teams: Prism for Layer 1 (gateway-layer + cost engineering) + Langfuse or LangSmith for Layer 2/3 (span tracing + evaluation).
Build vs buy
The build-vs-buy decision for observability is layer-specific.
Layer 1 (gateway-layer): buy it bundled with whatever AI gateway you adopt. Building Layer 1 on top of direct provider calls is a substantial engineering investment (per-request logging, dashboard surface, attribution machinery, cache analytics) that's already shipped as a product feature by every credible gateway. Almost never the right build.
Layer 2 (platform-layer): build is plausible if you have OpenTelemetry infrastructure already and want to send LLM traces alongside the rest of your application telemetry. Buy (Langfuse, LangSmith, Helicone) is faster to adopt and gives you LLM-specific affordances (span types for agent steps, tool calls, retrievals) that generic OTel doesn't.
Layer 3 (evaluation-driven): buy unless you have a serious ML platform team. The eval infrastructure — datasets, scorers, experiment runners, human annotation queues — is substantial engineering and the buy products (Langfuse, LangSmith) are mature.
Decision framework
If you're setting up LLM observability on a real team:
- Start with Layer 1. Whatever AI gateway you adopt gives it to you. Don't skip the attribution tags — they cost nothing to add and unlock everything downstream.
- Add Layer 2 when you have agent workloads. Span-level tracing matters for multi-step agent debugging. For pure-API workloads, Layer 1 is often enough.
- Add Layer 3 when prompt iteration becomes continuous. Datasets + LLM-as-judge + A/B testing pays off when you're shipping prompt changes weekly and need to know whether each one helped.
- Don't pay for overlap. Layer 1 + Layer 2 from two different vendors is fine; Layer 1 from two different gateways is wasteful.
- Capture feedback even when you can't act on it yet. The data accumulates; once you have signal, the closed-loop analysis becomes possible.
The cost of LLM observability scales nicely with how mature the workload is. Layer 1 is essentially free (bundled with the gateway). Layer 2 is moderate (a $29-$200/month managed product or self-hosted OSS). Layer 3 is the most engineering-heavy, paid back when prompt iteration is genuinely continuous.
Where to go next
For comparison-page depth on the major observability platforms:
- Prism vs Langfuse — gateway vs OSS observability platform
- Prism vs LangSmith — gateway vs LangChain's commercial observability + eval product
- Prism vs Helicone — gateway-with-deep-observability comparison
For the cost-engineering side that observability informs: AI API caching + LLM budget governance.
For the routing primitive that benefits from observability data: task-type routing + multi-provider failover.
Frequently asked questions
What's the minimum useful LLM observability instrumentation?
Per-request cost broken down by feature tag, and p95 latency per provider per model. Those two are the actionable layer that drives most of the cost-engineering and routing decisions. Cache-hit rate by workload and error rate by provider are the close seconds.
Do I need Langfuse or LangSmith if I'm using Prism?
Depends on the workload. If you're running pure-API workloads (one user message → one model response, no agent step trees), Prism's gateway-layer observability is often sufficient. If you're running agents, multi-step workflows, or doing serious prompt iteration with A/B testing, you'll want Layer 2 (Langfuse or LangSmith) alongside Prism. They solve different problems at different layers.
What's the difference between observability and evaluation?
Observability is "what happened" — capture data, surface dashboards, debug incidents. Evaluation is "how good was it" — score outputs against expected outcomes, A/B test prompts, measure quality. Layer 2 (platforms) does observability deeply; Layer 3 (eval) layers on top. Both come from the same vendor in most cases (Langfuse and LangSmith both ship both layers).
Should I capture feedback on every response or sample?
Sample the active prompts. Asking for detailed feedback on every response causes fatigue: users stop responding and the signal degrades. A common pattern: show a passive thumbs-up/thumbs-down control on every response (zero-friction, ~10% capture rate), and actively prompt for a rating + comment on 5% of responses. Adjust based on response rates.
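One simple way to pick the sampled responses is to hash the feedback ID into a bucket, so the decision is deterministic and needs no stored state. This is a sketch under that assumption; the function name is hypothetical and the 5% default mirrors the figure above.

```python
import hashlib


def should_prompt_for_rating(feedback_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of responses for a rating prompt."""
    digest = hashlib.sha256(feedback_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# The same ID always yields the same decision, so retries and re-renders agree.
decision = should_prompt_for_rating("fb-0001")
```

Hashing by user ID instead of feedback ID would sample users rather than responses, which is a different trade-off: fewer people are asked, but the same people are asked every time.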
How does cost attribution actually work?
At the gateway layer, every request carries one or more tags (e.g. X-Prism-Tags: feature=chat,team=growth,env=production). The gateway persists these on the usage log row. The dashboard aggregates by tag — per-feature cost per day, per-team monthly spend, etc. The discipline that makes it work is consistent tagging via a shared client wrapper.
What latency percentiles should I track?
p50, p95, p99. p50 (median) is what most users experience; p95 is the slow tail (1 in 20 requests); p99 is the truly slow tail (1 in 100). Further percentiles (p99.9, p99.99) have noisy sample sizes at most production volumes; treat them as SRE problems, not optimisation targets.
Can I use OpenTelemetry for LLM observability?
Yes. Both Langfuse and LangSmith accept OTel traces. If your application already emits OTel spans, you can add LLM-specific span attributes (model, tokens, cost) and route them to a Langfuse/LangSmith backend without adopting a vendor-specific SDK. The trade-off is that some LLM-specific affordances (span types for agent steps, tool calls) need manual annotation when using generic OTel.
Does observability slow requests down?
Negligibly at the gateway layer — the proxy is already in the request path, so logging adds microseconds. Negligibly at the platform layer — SDK trace emission is typically async / non-blocking. The only meaningful overhead comes from synchronous quality-scoring evaluators (Layer 3) that block on a judge LLM call; those should run async or on sampled traffic to avoid impacting user-perceived latency.
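The "run evaluators async" advice can be sketched with asyncio: the response is returned immediately and the judge call is scheduled as a background task. The judge function here is a stub that sleeps instead of calling a model, and all names are illustrative.

```python
import asyncio
from typing import List, Set

scores: List[float] = []


async def judge(response: str) -> float:
    """Stand-in for a judge LLM call; a real one would take much longer."""
    await asyncio.sleep(0.01)
    return 1.0 if response else 0.0


async def score_in_background(response: str) -> None:
    scores.append(await judge(response))


async def handle_request(prompt: str, background: Set[asyncio.Task]) -> str:
    response = f"echo: {prompt}"  # stand-in for the real model call
    task = asyncio.create_task(score_in_background(response))
    background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return response  # returned without waiting for the judge


async def main():
    background: Set[asyncio.Task] = set()
    response = await handle_request("hi", background)
    scored_at_return = len(scores)  # judge has not finished yet
    await asyncio.gather(*background)  # drain before shutdown
    return response, scored_at_return, len(scores)


response, scored_at_return, scored_after_drain = asyncio.run(main())
```

The user-facing path never awaits the judge; the only place the code blocks on scoring is the drain at shutdown, which keeps evaluator latency out of the request.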
Observability and cost engineering are complementary disciplines. Read the AI API caching guide and the LLM budget governance guide for the cost side. The AI gateway comparison covers the gateway choice that drives Layer 1.
Deep dives on LLM observability
Five cluster posts unpack the sub-topics of this pillar. Each ships independently as part of the content calendar.