What is LLM observability?

What to instrument first, what to skip, and the framework for picking tools. Prism covers this topic from the perspective of an AI API proxy that ships measured production data on every request — not vendor estimates.

How does Prism handle LLM observability?

Prism is an OpenAI-compatible AI API proxy that addresses LLM observability directly. See the deep-dive posts in this guide for the per-sub-topic implementation details, or jump to the savings calculator to model the impact on your workload.

LLM Observability — Prism guide

Three layers of LLM observability — gateway, platform, eval-driven — what each captures, when each matters, and how to pick the stack for your workload.

LLM observability in 2026 is a three-layer discipline: gateway-layer observability (cost, latency, cache status per request, captured by the proxy in front of providers); platform-layer observability (span-level traces, parent sessions, custom metadata, captured by an instrumentation SDK in the application); and evaluation-driven observability (dataset experiments, LLM-as-judge scoring, prompt-version A/B testing, focused on quality rather than throughput). Each layer answers different questions, and most production teams run two or three of them simultaneously. This guide is the framework for picking the right stack — what to capture, what tools fit each layer, what feedback signals matter, and how to combine them without paying for overlap.

Why a three-layer model

Every observability discussion in the LLM space gets tangled because three distinct categories of instrumentation are sold under the same name. Untangling them first makes the rest of the decision tractable.

Gateway-layer observability lives in the proxy between application code and providers. Because every request passes through the gateway, each one automatically gets a row in a usage table: cost, latency, model used, mode chosen, cache status (hit/miss/which layer), token counts, errors. This is the cheapest observability layer to add (any gateway provides it) and answers the "what did we just spend / how slow was it" questions cleanly.

Platform-layer observability lives in an SDK embedded in the application code. The SDK instruments LLM calls (and increasingly, agent step trees, tool invocations, parent sessions, custom metadata) and emits traces to a central platform — Langfuse, LangSmith, Helicone, or similar. The platform aggregates the traces, runs span-tree visualisation, supports filtering by session/user/feature, and exposes a queryable data model. This is the right layer for "what was the model thinking / how did the agent decide" questions and for the deeper context that a gateway can't see (the gateway sees individual provider calls; the platform sees the surrounding business logic).

Evaluation-driven observability is platform-layer observability plus quality scoring: datasets of representative prompts, periodic LLM-as-judge runs against held-out examples, prompt-version A/B testing, and human annotation queues. The point is observation in service of quality improvement, not just visibility. It is useful for teams shipping production agents where prompt-engineering iteration is a continuous discipline rather than a one-time setup.

A team can run any one of these, any two, or all three. The right combination depends on what questions you're trying to answer and how mature the LLM workload is.

Layer 1 — Gateway observability

The minimum useful instrumentation for any production LLM workload. If you're running an AI API gateway (Prism, Portkey, Helicone, LiteLLM, Cloudflare AI Gateway, OpenRouter), gateway-layer observability is included by default.

What it captures:

Per-request cost (provider cost + gateway markup)
Latency (p50, p95, p99 per provider, per model, per task type)
Cache hit/miss (which layer caught the hit, similarity score on semantic hits)
Token counts (input, output, cached tokens for provider-native cache hits)
Errors (error code, retry behaviour, failover triggers)
Per-project attribution (when tagged via header)

What it answers:

"Did we just spend an unusual amount on this call?"
"Which model is slow today?"
"What's our cache-hit rate by workload?"
"Which team's traffic spiked overnight?"

What it doesn't answer:

"What was the agent thinking at each step?"
"Did this prompt version perform better than the previous one?"
"Why did the model produce that specific wrong answer?"

For pure-API workloads (single-call request-response patterns), gateway-layer is often all you need. For agent workloads, you'll want Layer 2 on top.
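
To make the "What it captures" list concrete, here is a minimal sketch of the kind of per-request row a gateway might persist. The `UsageRow` type, its field names, and the `cache_hit_rate` helper are illustrative assumptions, not Prism's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UsageRow:
    """One gateway usage-log row (hypothetical schema, for illustration only)."""
    request_id: str
    provider: str
    model: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    provider_cost_cents: float
    gateway_markup_cents: float
    cache_status: str = "miss"         # e.g. "miss", "exact", "semantic", "provider-native"
    error_code: Optional[str] = None   # None when the request succeeded
    tags: dict = field(default_factory=dict)  # attribution tags, e.g. {"team": "growth"}

    @property
    def total_cost_cents(self) -> float:
        # Per-request cost = provider cost + gateway markup
        return self.provider_cost_cents + self.gateway_markup_cents


def cache_hit_rate(rows: List[UsageRow]) -> float:
    """Fraction of rows served from any cache layer."""
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if r.cache_status != "miss")
    return hits / len(rows)
```

Most of the Layer 1 questions listed above reduce to a filter or a group-by over rows of roughly this shape.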

Prism's gateway-layer observability

Prism (and most modern gateways) ship gateway-layer observability as a built-in product surface. The relevant features:

Per-request explorer at /dashboard/usage — every request as a row, with cost, latency, cache status, model, mode, tokens, error if any. Filterable by date, project, feature tag.
Feature-tag attribution via X-Prism-Tags header (up to 10 tags per request) — drives per-feature, per-team, per-environment cost dashboards on Pro+.
Cache analytics at /dashboard/cache — hit rate by layer (exact / semantic / provider-native), by task type, by model, with hit-rate-at-threshold curves so you can simulate tuning.
Per-response headers with per-request signals — X-Prism-Cache-Status, X-Prism-Cache-Saved-Cents, X-Prism-Cost-Cents, X-Prism-Task-Type, X-Prism-Model. The headers are how application code can react to gateway-layer signals in real time.
Feedback capture via POST /v1/feedback — thumbs-up/down + rating + comment + tag per request, correlated by X-Prism-Feedback-Id returned in the response.
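
As a rough illustration of application code reacting to these signals, the sketch below reads the response headers named above into a plain dictionary. The header names come from this guide; the value formats (plain strings, decimal cents) and the `read_prism_signals` helper are assumptions.

```python
from typing import Mapping, Optional


def _to_float(value: Optional[str]) -> Optional[float]:
    """Parse a numeric header value, returning None when absent or malformed."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def read_prism_signals(headers: Mapping[str, str]) -> dict:
    """Extract gateway signals from response headers.

    Header names are taken from the guide; the value formats are assumed.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "cache_status": lowered.get("x-prism-cache-status"),
        "cache_saved_cents": _to_float(lowered.get("x-prism-cache-saved-cents")),
        "cost_cents": _to_float(lowered.get("x-prism-cost-cents")),
        "task_type": lowered.get("x-prism-task-type"),
        "model": lowered.get("x-prism-model"),
    }
```

A caller could, for example, log `cost_cents` next to its own request ID, or skip a retry when `cache_status` shows the response was served from cache.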

Layer 2 — Platform observability

The layer that captures what the gateway can't see: the business logic surrounding the LLM call. Used by teams shipping agents, multi-step workflows, or any system where one user action produces multiple LLM calls connected by application state.

What it captures:

Span-level traces — each LLM call as a span, parent functions as parent spans, tool invocations as child spans
Sessions — group multiple traces under one user/session ID
Custom metadata — anything the application wants to attach (user ID, feature name, A/B variant, retrieval source)
Scoring — LLM-as-judge results, human ratings, automated quality signals

What it answers:

"Walk me through what happened in this specific agent session"
"Which prompt versions are users rating highly?"
"What's the failure pattern when the agent gets confused?"
"Which retrieval sources lead to the best outcomes?"

What it doesn't do (typically):

Sit inline in the request path. Platform observability is parallel — the SDK sends traces to the platform; the model call still happens via whatever provider or gateway the app calls.
Enforce. The platform observes; it doesn't block requests, enforce budgets, or fail over. That's gateway territory.
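
The span and session model can be sketched with a small tree structure. This is a simplified stand-in for what an instrumentation SDK records, not any vendor's SDK type; the `Span` class, its fields, and the example names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A single traced operation (hypothetical shape, not a vendor SDK type)."""
    name: str
    kind: str                      # e.g. "function", "llm", "tool"
    session_id: str                # groups traces belonging to one user session
    metadata: dict = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, kind: str, **metadata) -> "Span":
        """Create a child span that inherits the parent's session."""
        span = Span(name, kind, self.session_id, dict(metadata))
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> List[str]:
    """Flatten the span tree into indented lines for a quick textual view."""
    lines = ["  " * depth + f"{span.kind}:{span.name}"]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


# Parent function -> LLM call -> tool invocation, all under one session.
root = Span("answer_question", "function", session_id="sess-1", metadata={"feature": "chat"})
llm = root.child("plan", "llm", model="example-model")
llm.child("search_docs", "tool", retrieval_source="kb")
```

Calling `render(root)` yields one indented line per span, which is essentially what a platform's span-tree view shows, with filtering by session or metadata layered on top.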

Picking a platform-layer tool

The major players:

Langfuse — open-source (MIT), self-hostable. Hobby tier free, Core $29/mo, Enterprise $2,499+/mo. Strong on traces + evaluations + prompt management. Pro+ has SOC 2 + ISO27001 + HIPAA support. See Prism vs Langfuse.
LangSmith — proprietary, LangChain's commercial product. Developer $0, Plus $39/seat/mo, Enterprise custom. Strong on agent-specific features (fleet, sandboxes, SmithDB for trace queries) and LangChain-native instrumentation. See Prism vs LangSmith.
Helicone — managed SaaS with self-hostable proxy code. Observability-first gateway (so it does Layer 1 + Layer 2 in one product). See Prism vs Helicone.
Custom build on OpenTelemetry — both Langfuse and LangSmith accept OTel traces. Roll your own platform-layer observability if you have OTel infrastructure already and want to stay self-hosted.

The choice turns on: open-source preference (Langfuse), LangChain-framework usage (LangSmith), unified gateway+observability product (Helicone), or self-hosting via OTel.

Layer 3 — Evaluation-driven observability

The most engineering-heavy and most quality-oriented layer. Built around the idea that observability isn't enough — you need to deliberately test prompts and models against representative cases, score the outputs, and iterate on prompts based on what you learn.

Components:

Datasets — curated sets of representative prompts + expected outcomes
Online evaluators — LLM-as-judge scoring functions that run on a fraction of production traffic
Offline experiments — testing a prompt or model change against a dataset before rolling it out
Human annotation queues — queues of low-confidence or sampled production responses for human review
A/B testing — running two prompt variants in parallel and comparing scoring distributions

LangSmith and Langfuse both ship the full surface. Custom builds against OpenTelemetry + a scoring framework (DeepEval, RAGAS, Promptfoo) are possible but represent significant engineering investment.
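
A minimal sketch of an offline experiment, assuming a dataset of input/expected pairs and a scoring function. The `run_experiment` helper, the toy `exact_match` scorer, and the two stand-in variants are hypothetical; a real setup would call a model in each variant and typically an LLM-as-judge in the scorer.

```python
from statistics import mean
from typing import Callable, Dict, List

# A dataset row pairs a representative prompt input with an expected outcome.
Dataset = List[Dict[str, str]]


def run_experiment(
    dataset: Dataset,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Run one prompt/model variant over the dataset and return its mean score."""
    return mean(score(generate(row["input"]), row["expected"]) for row in dataset)


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: case-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Stand-ins for two prompt variants; in practice each would call a model.
def variant_a(text: str) -> str:
    return text.upper()


def variant_b(text: str) -> str:
    return text[::-1]


dataset = [
    {"input": "paris", "expected": "PARIS"},
    {"input": "level", "expected": "level"},
]
scores = {
    "a": run_experiment(dataset, variant_a, exact_match),
    "b": run_experiment(dataset, variant_b, exact_match),
}
```

Comparing `scores["a"]` with `scores["b"]` before rollout is the offline half of the loop; an A/B test applies the same comparison to sampled production traffic.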

When evaluation-driven observability matters:

You're shipping a production agent where prompt quality directly affects user outcomes
You're iterating prompts continuously and need to know whether a change helped or hurt
You have user feedback signals (thumbs / ratings) and want to correlate them back to prompt versions

When it doesn't:

Your LLM workload is single-call request-response with no prompt iteration discipline
You're in early product-market-fit mode where prompt iteration is ad-hoc
The cost of the evaluation infrastructure exceeds the value of the quality gains

Most teams adopt evaluation-driven observability after they have gateway-layer and platform-layer in place — it's the layer that pays off once you have enough volume to make the experiments statistically meaningful.

What good cost attribution looks like

A specific slice of Layer 1 worth deep-diving because it's the highest-leverage observability practice for AI workloads: per-feature and per-team cost attribution.

The mechanic: every LLM API call carries a request tag (e.g. X-Prism-Tags: team=growth,feature=onboarding-chat,env=production). The gateway persists the tags on the usage log row. The dashboard aggregates by tag — daily / weekly / monthly cost per team, per feature, per environment.

Without this, "AI is expensive" is the conversation; with it, "the onboarding-chat feature is using 60% of our AI budget and we should look at it" is the conversation. The difference in actionability is enormous.
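
The aggregation step can be sketched as follows, assuming the comma-separated `key=value` tag format shown in the example header. The `parse_tags` and `cost_by_tag` helpers and the cost figures are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_tags(header_value: str) -> Dict[str, str]:
    """Parse 'team=growth,feature=chat' into a dict, skipping malformed pairs."""
    tags = {}
    for pair in header_value.split(","):
        key, sep, value = pair.strip().partition("=")
        if sep and key and value:
            tags[key] = value
    return tags


def cost_by_tag(rows: Iterable[Tuple[str, float]], dimension: str) -> Dict[str, float]:
    """Sum cost (in cents) per value of one tag dimension, e.g. 'feature'."""
    totals = defaultdict(float)
    for header_value, cost_cents in rows:
        totals[parse_tags(header_value).get(dimension, "untagged")] += cost_cents
    return dict(totals)


# (tag header, cost in cents) pairs standing in for usage-log rows.
rows = [
    ("team=growth,feature=onboarding-chat,env=production", 60.0),
    ("team=growth,feature=search,env=production", 25.0),
    ("team=platform,feature=search,env=staging", 15.0),
]
by_feature = cost_by_tag(rows, "feature")
share = by_feature["onboarding-chat"] / sum(by_feature.values())
```

With these invented numbers, `share` comes out at 0.6, which is the "60% of our AI budget" style of statement the tags make possible.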

What to tag:

Team (which group is accountable)
Feature (which product capability drove the call)
Environment (production / staging / development)
Optionally: experiment ID (for A/B tests), user cohort (for free vs paid users), parent flow (for multi-step processes)

The discipline that makes it work:

Agree the schema once
Ship it as a shared client wrapper, not as ad-hoc inline header strings
Lint against drift (feature=chat vs feature=chatbot vs feature=user_chat are three different aggregates of the same workload)
Audit quarterly — prune unused tags, normalise drift
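
One way to enforce the discipline above is a shared helper that every caller uses to build the tag header. The sketch below assumes a small, centrally agreed schema; `TAG_SCHEMA`, its allowed values, and `build_tag_header` are hypothetical.

```python
from typing import Dict

# Hypothetical agreed schema: allowed values per required tag key.
TAG_SCHEMA = {
    "team": {"growth", "platform"},
    "feature": {"onboarding-chat", "search"},
    "env": {"production", "staging", "development"},
}
MAX_TAGS = 10  # per-request limit stated in this guide


def build_tag_header(**tags: str) -> Dict[str, str]:
    """Validate tags against the shared schema and return the request header."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"at most {MAX_TAGS} tags per request")
    missing = TAG_SCHEMA.keys() - tags.keys()
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    for key, value in tags.items():
        allowed = TAG_SCHEMA.get(key)
        # Rejecting unknown values is what catches drift such as chat vs chatbot.
        if allowed is not None and value not in allowed:
            raise ValueError(f"unknown value {value!r} for tag {key!r}")
    # Sort keys so the same tag set always yields the same header string.
    header_value = ",".join(f"{k}={tags[k]}" for k in sorted(tags))
    return {"X-Prism-Tags": header_value}
```

Failing fast in the wrapper moves the drift check to development time, so the quarterly audit has less to clean up.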

Latency observability — the percentiles that matter

The other deep-dive worth covering: latency tracking. Average latency is misleading. The percentiles that matter:

p50 (median) — what most users experience. The number to report in a casual conversation.
p95 — the slow tail. 1 in 20 requests is slower than this. Optimise here.
p99 — the truly slow tail. 1 in 100 requests. Surfaces provider degradation issues.
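
For reference, here is a small sketch of how these percentiles can be computed from raw latency samples using the nearest-rank method. This is one of several common percentile definitions; dashboards may interpolate instead, so exact values can differ slightly.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # 100 synthetic samples: 1..100 ms
summary = {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}
```

On this synthetic data `summary` is `{"p50": 50, "p95": 95, "p99": 99}`; real latency distributions are skewed, which is why the mean tends to hide the tail.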

Per-provider, per-model latency dashboards are the most useful Layer 1 surface for routing decisions. If Anthropic's p95 latency drifts from 1,200ms to 1,800ms over a week, you want to know — that's a routing-table-revision signal. Per-mode latency (eco/balanced/sport in Prism's vocabulary) is the next dimension worth dashboarding.

Why not p99.9 or p99.99? Sample size. At most production volumes, p99.9 sits in a region where you have only a handful of samples and the noise dominates the signal. Real production deployments track p50/p95/p99 and treat the further tail as an SRE problem rather than an optimisation problem.
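
The sample-size argument can be checked with simple arithmetic: the expected number of requests slower than a percentile is the request count times the remaining fraction. The window size below is a hypothetical figure for one provider and model pair.

```python
def samples_beyond(percentile: float, request_count: int) -> float:
    """Expected number of requests slower than the given percentile."""
    return request_count * (100 - percentile) / 100


# Hypothetical slice: 5,000 requests for one provider/model pair in a window.
window = 5_000
tail_sizes = {p: samples_beyond(p, window) for p in (95, 99, 99.9)}
```

For this window the tails hold roughly 250, 50, and 5 requests respectively, so a p99.9 estimate would rest on about five data points.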

Feedback capture — the quality signal

The other side of observability is quality signal capture. Users have opinions on whether responses were good; the question is whether your system catches and acts on them.

The mechanic: every response returns a feedback ID (Prism uses X-Prism-Feedback-Id header). The application surfaces a UI for the user to react — thumbs up/down, 1-5 rating, free-text comment, optional category tag. The reaction posts back to the gateway/platform correlated by the feedback ID. The dashboard aggregates feedback by model, prompt version, feature, team.

The discipline:

Capture friction-free — one click for thumbs, optional comment
Don't over-prompt — quality feedback fatigue is real; ask for it on a small fraction of responses, not every one
Tag the feedback with what you want to slice by (prompt version, model, feature)
Close the loop: act on patterns visible in the aggregate. If thumbs-down spikes on a specific feature, fix the prompt or escalate to a higher-quality model.

Without feedback, observability is one-sided — you see what happened, not whether users liked it. With it, the observability stack becomes a continuous-improvement engine.
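
A minimal sketch of the client side of this loop: building the body that would be posted back with the feedback ID. The JSON field names and the `build_feedback_payload` helper are assumptions for illustration; the guide specifies only the kinds of signal (thumbs, rating, comment, tag), not the wire format.

```python
import json
from typing import Optional


def build_feedback_payload(
    feedback_id: str,
    thumbs_up: Optional[bool] = None,
    rating: Optional[int] = None,
    comment: Optional[str] = None,
    tag: Optional[str] = None,
) -> str:
    """Serialise one user reaction as JSON. Field names are hypothetical."""
    if thumbs_up is None and rating is None:
        raise ValueError("provide at least a thumbs signal or a rating")
    if rating is not None and not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    payload = {"feedback_id": feedback_id}
    # Only include fields the user actually supplied, keeping one-click feedback small.
    optional = {"thumbs_up": thumbs_up, "rating": rating, "comment": comment, "tag": tag}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(payload, sort_keys=True)
```

Accepting a bare thumbs signal keeps the one-click path cheap, while the optional `tag` carries whatever you want to slice by later, such as a prompt version.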

How Prism implements LLM observability

Prism ships Layer 1 (gateway-layer observability) in depth, covers part of Layer 3 (per-request feedback capture, but not full eval-driven tooling), and defers Layer 2 to dedicated platforms (Langfuse, LangSmith) when teams need it.

Specifically:

Per-request explorer at /dashboard/usage — filterable by date, project, model, provider, mode, cache status, feature tag. Export to CSV on Paid+.
Per-feature attribution via X-Prism-Tags (up to 10 tags per request). Pro+ unlocks the per-feature dashboard.
Latency analytics at /dashboard/usage/latency — p50/p95/p99 (Pro/Team) per provider, per model, per mode.
Cache analytics at /dashboard/cache — hit rate per layer, hit-rate-at-threshold curves, top hits + misses by prompt fingerprint pattern.
Audit log at /dashboard/usage → Audit tab — append-only record of policy changes, budget changes, and enforcement firings. 30-day retention on Pro, 365-day on Team.
Feedback capture via POST /v1/feedback correlated by X-Prism-Feedback-Id. Aggregated by model + prompt version + tag in the dashboard.
Provider health dashboard — rolling-window success rate + latency per provider, used internally for failover routing decisions but visible to customers on Team tier.

What Prism doesn't ship: span-level tracing, dataset experiments, LLM-as-judge online evaluators, prompt-version A/B testing infrastructure. Those are Layer 2 / Layer 3 territory and best served by Langfuse, LangSmith, or a custom build. The natural production architecture for agent-heavy teams: Prism for Layer 1 (gateway-layer + cost engineering) + Langfuse or LangSmith for Layer 2/3 (span tracing + evaluation).

Build vs buy

The build-vs-buy decision for observability is layer-specific.

Layer 1 (gateway-layer): buy it bundled with whatever AI gateway you adopt. Building Layer 1 on top of direct provider calls is a substantial engineering investment (per-request logging, dashboard surface, attribution machinery, cache analytics) that's already shipped as a product feature by every credible gateway. Almost never the right build.

Layer 2 (platform-layer): build is plausible if you have OpenTelemetry infrastructure already and want to send LLM traces alongside the rest of your application telemetry. Buy (Langfuse, LangSmith, Helicone) is faster to adopt and gives you LLM-specific affordances (span types for agent steps, tool calls, retrievals) that generic OTel doesn't.

Layer 3 (evaluation-driven): buy unless you have a serious ML platform team. The eval infrastructure — datasets, scorers, experiment runners, human annotation queues — is substantial engineering and the buy products (Langfuse, LangSmith) are mature.

Decision framework

If you're setting up LLM observability on a real team:

Start with Layer 1. Whatever AI gateway you adopt gives it to you. Don't skip the attribution tags — they cost nothing to add and unlock everything downstream.
Add Layer 2 when you have agent workloads. Span-level tracing matters for multi-step agent debugging. For pure-API workloads, Layer 1 is often enough.
Add Layer 3 when prompt iteration becomes continuous. Datasets + LLM-as-judge + A/B testing pay off when you're shipping prompt changes weekly and need to know whether each one helped.
Don't pay for overlap. Layer 1 + Layer 2 from two different vendors is fine; Layer 1 from two different gateways is wasteful.
Capture feedback even when you can't act on it yet. The data accumulates; once you have signal, the closed-loop analysis becomes possible.

The cost of LLM observability scales nicely with how mature the workload is. Layer 1 is essentially free (bundled with the gateway). Layer 2 is moderate (a $29-$200/month managed product or self-hosted OSS). Layer 3 is the most engineering-heavy, paid back when prompt iteration is genuinely continuous.

Where to go next

For comparison-page depth on the major observability platforms:

Prism vs Langfuse — gateway vs OSS observability platform
Prism vs LangSmith — gateway vs LangChain's commercial observability + eval product
Prism vs Helicone — gateway-with-deep-observability comparison

For the cost-engineering side that observability informs: AI API caching + LLM budget governance.

For the routing primitive that benefits from observability data: task-type routing + multi-provider failover.

Frequently asked questions

What's the minimum useful LLM observability instrumentation?

Per-request cost broken down by feature tag, and p95 latency per provider per model. Those two are the actionable layer that drives most of the cost-engineering and routing decisions. Cache-hit rate by workload and error rate by provider are the close seconds.

Do I need Langfuse or LangSmith if I'm using Prism?

Depends on the workload. If you're running pure-API workloads (one user message → one model response, no agent step trees), Prism's gateway-layer observability is often sufficient. If you're running agents, multi-step workflows, or doing serious prompt iteration with A/B testing, you'll want Layer 2 (Langfuse or LangSmith) alongside Prism. They solve different problems at different layers.

What's the difference between observability and evaluation?

Observability is "what happened" — capture data, surface dashboards, debug incidents. Evaluation is "how good was it" — score outputs against expected outcomes, A/B test prompts, measure quality. Layer 2 (platforms) does observability deeply; Layer 3 (eval) layers on top. Both come from the same vendor in most cases (Langfuse and LangSmith both ship both layers).

Should I capture feedback on every response or sample?

Sample. Asking for feedback on every response causes fatigue: users stop responding and the signal degrades. A common pattern is to show a passive thumbs-up/thumbs-down control on every response (zero friction, ~10% capture rate) and to actively prompt for a rating plus comment on about 5% of responses. Adjust based on response rates.
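
One way to implement the sampled prompt is a deterministic hash of the response ID, so the same response always gets the same decision across retries and page reloads. The `should_prompt_for_rating` helper and the ID format are hypothetical; the 5% default mirrors the figure above.

```python
import hashlib


def should_prompt_for_rating(response_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically pick roughly `sample_rate` of responses for a rating prompt."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Over many IDs the prompted share should land close to the sample rate.
prompted = sum(should_prompt_for_rating(f"resp-{i}") for i in range(10_000))
```

Hashing rather than calling a random number generator also makes the sample reproducible when you later join feedback back to usage rows.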

How does cost attribution actually work?

At the gateway layer, every request carries one or more tags (e.g. X-Prism-Tags: feature=chat,team=growth,env=production). The gateway persists these on the usage log row. The dashboard aggregates by tag — per-feature cost per day, per-team monthly spend, etc. The discipline that makes it work is consistent tagging via a shared client wrapper.

What latency percentiles should I track?

p50, p95, p99. p50 (median) is what most users experience; p95 is the slow tail (1 in 20 requests); p99 is the truly slow tail (1 in 100). Further percentiles (p99.9, p99.99) have noisy sample sizes at most production volumes; treat them as SRE problems, not optimisation targets.

Can I use OpenTelemetry for LLM observability?

Yes. Both Langfuse and LangSmith accept OTel traces. If your application already emits OTel spans, you can add LLM-specific span attributes (model, tokens, cost) and route them to a Langfuse/LangSmith backend without adopting a vendor-specific SDK. The trade-off is that some LLM-specific affordances (span types for agent steps, tool calls) need manual annotation when using generic OTel.
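
As a sketch of what "LLM-specific span attributes" might look like, the helper below builds a flat attribute dictionary. The `gen_ai.*` keys are modelled on the OpenTelemetry GenAI semantic conventions, which are still evolving, so check them against the version your backend expects. The cost key and the `llm_span_attributes` helper are hypothetical.

```python
from typing import Dict, Union

AttributeValue = Union[str, int, float]


def llm_span_attributes(
    model: str, input_tokens: int, output_tokens: int, cost_cents: float
) -> Dict[str, AttributeValue]:
    """Build a flat attribute dict suitable for attaching to an existing span."""
    return {
        # Keys modelled on the OTel GenAI semantic conventions (verify for your version).
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # No standard cost key is assumed here; this name is hypothetical.
        "app.llm.cost_cents": cost_cents,
    }
```

With the OpenTelemetry Python API, a dictionary like this can be passed to `set_attributes` on the active span; the manual annotation mentioned above amounts to calls of this kind.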

Does observability slow requests down?

Negligibly at the gateway layer — the proxy is already in the request path, so logging adds microseconds. Negligibly at the platform layer — SDK trace emission is typically async / non-blocking. The only meaningful overhead comes from synchronous quality-scoring evaluators (Layer 3) that block on a judge LLM call; those should run async or on sampled traffic to avoid impacting user-perceived latency.
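
The asynchronous pattern can be sketched with `asyncio`: the handler schedules the judge call as a background task and returns the answer without waiting for it. The `judge` function is a stand-in for a slow LLM-as-judge call, and all names here are hypothetical.

```python
import asyncio

scores = {}        # request_id -> judge score, filled in after the response is sent
_pending = set()   # keeps references so background tasks are not garbage-collected


async def judge(request_id: str, answer: str) -> None:
    """Stand-in for a slow LLM-as-judge call (hypothetical scorer)."""
    await asyncio.sleep(0.05)
    scores[request_id] = 1.0 if answer else 0.0


async def handle_request(request_id: str, prompt: str) -> str:
    answer = f"echo: {prompt}"  # placeholder for the real model call
    # Schedule scoring without awaiting it, so the user is not blocked on the judge.
    task = asyncio.create_task(judge(request_id, answer))
    _pending.add(task)
    task.add_done_callback(_pending.discard)
    return answer


async def main():
    answer = await handle_request("req-1", "hello")
    returned_before_scoring = "req-1" not in scores
    await asyncio.gather(*_pending)  # demo only: wait so the script exits cleanly
    return answer, returned_before_scoring
```

Running `asyncio.run(main())` shows the answer is available before the score exists. In production the same idea is usually implemented with a queue or a worker, combined with sampling so only a fraction of traffic is scored.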

Observability and cost engineering are complementary disciplines. Read the AI API caching guide and the LLM budget governance guide for the cost side. The AI gateway comparison covers the gateway choice that drives Layer 1.
