LLM observability

Instrumentation that captures per-request latency, cost, tokens, cache status, errors, and feedback — the data plane for AI cost engineering.

How it works

LLM observability is the instrumentation that captures the per-request data needed to understand what your LLM-calling application is doing: input and output tokens, cost in cents, latency, which model was called, cache hit status, whether the request failed and at which provider, and (in mature setups) quality feedback signals from users or evaluators. Aggregates such as p50/p95/p99 latency and error rate by provider are computed from these records. The data flows from a proxy layer or SDK middleware into a time-series store, where it is joined against application-level tags (which feature made the call? which team? which user cohort?) to drive cost attribution, capacity planning, and quality assessment.
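As a rough illustration of the SDK-middleware path, the sketch below wraps a single model call, times it, and appends one record per request to an in-memory list standing in for the time-series store. Everything named here is an assumption made for the example: the `RequestRecord` fields, the `observe` helper, the response shape (a dict with `input_tokens`, `output_tokens`, and `cache_hit`), and the placeholder prices, which are not real provider rates.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Placeholder prices in US cents per 1M tokens (hypothetical, not real rates).
PRICE_CENTS_PER_MTOK: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 300.0, "output": 1500.0},
}


@dataclass
class RequestRecord:
    request_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool
    error: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def cost_cents(self) -> float:
        # List-price cost; cache discounts are not modeled in this sketch.
        price = PRICE_CENTS_PER_MTOK[self.model]
        return (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


def observe(
    call: Callable[[], Dict[str, Any]],
    *,
    request_id: str,
    provider: str,
    model: str,
    tags: Dict[str, str],
    sink: List[RequestRecord],
) -> Dict[str, Any]:
    """Run call(), time it, and append one RequestRecord to sink."""
    start = time.perf_counter()
    response: Dict[str, Any] = {}
    error: Optional[str] = None
    try:
        response = call()
        return response
    except Exception as exc:
        error = type(exc).__name__  # record the failure, then re-raise
        raise
    finally:
        # Runs on success and on failure, so every request leaves a record.
        sink.append(
            RequestRecord(
                request_id=request_id,
                provider=provider,
                model=model,
                input_tokens=response.get("input_tokens", 0),
                output_tokens=response.get("output_tokens", 0),
                latency_ms=(time.perf_counter() - start) * 1000.0,
                cache_hit=response.get("cache_hit", False),
                error=error,
                tags=tags,
            )
        )
```

A request with 1,000 input tokens and 200 output tokens would come out at 0.6 cents under these placeholder prices. A proxy layer would capture the same fields without code changes in the application, at the cost of routing traffic through another hop.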

Conceptually, it is the LLM-specific application of general observability principles (metrics, logs, traces): the same shape as APM tools for HTTP traffic (Datadog, New Relic, Honeycomb), but adapted to the dimensions that matter for LLM workloads. The dimensions that don't transfer cleanly from HTTP APM are tokens (no HTTP analog), model choice (a routing decision per request, not a static configuration), the cache layer (no direct analog in HTTP cache-control), and prompt quality (an entirely new concern with no APM precedent).

What to instrument first

For a team starting from zero, this instrumentation order delivers the most value per hour of effort:

(1) Cost per request, broken down by feature tag. This unlocks AI FinOps decisions.
(2) Latency p95 per provider per model. This unlocks routing decisions.
(3) Cache hit rate per cache layer. This unlocks cache tuning.
(4) Error rate per provider. This unlocks reliability decisions.
(5) Quality feedback (thumbs-up/down or 1-5 ratings tied to request IDs). This closes the loop between observability and optimization.
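The first four metrics can all be derived from the same stream of per-request records. The sketch below shows one way to do that in plain Python. The record keys (`feature`, `provider`, `model`, `cost_cents`, `latency_ms`, `cache_layer`, `cache_hit`, `error`) are assumptions for the example, as is the simplification that each request consults exactly one cache layer. The percentile uses the nearest-rank method; a production store would more likely use its own quantile function.

```python
import math
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records: Iterable[Dict[str, Any]]) -> Dict[str, Dict]:
    cost_by_feature = defaultdict(float)    # feature -> total cents
    latencies = defaultdict(list)           # (provider, model) -> [ms, ...]
    cache = defaultdict(lambda: [0, 0])     # cache layer -> [hits, lookups]
    errors = defaultdict(lambda: [0, 0])    # provider -> [errors, requests]

    for r in records:
        cost_by_feature[r["feature"]] += r["cost_cents"]
        latencies[(r["provider"], r["model"])].append(r["latency_ms"])
        cache[r["cache_layer"]][0] += int(r["cache_hit"])
        cache[r["cache_layer"]][1] += 1
        errors[r["provider"]][0] += int(r["error"] is not None)
        errors[r["provider"]][1] += 1

    return {
        "cost_cents_by_feature": dict(cost_by_feature),
        "p95_latency_ms": {k: p95(v) for k, v in latencies.items()},
        "cache_hit_rate": {k: hits / n for k, (hits, n) in cache.items()},
        "error_rate": {k: errs / n for k, (errs, n) in errors.items()},
    }
```

Grouping latency by the (provider, model) pair rather than by provider alone keeps a slow model from hiding behind a fast one on the same provider, which matters when the numbers feed routing decisions.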

Most teams stop at (1) and (2) because (3)-(5) require a deliberate product surface (a feedback API, a tagging convention, a quality-review workflow). That's fine for early-stage teams but leaves real money on the table. Prism ships all five primitives in the proxy layer: every request automatically logs cost, latency, and cache status; X-Prism-Tags drives feature attribution; and the feedback API closes the loop with thumbs or ratings tied to X-Prism-Feedback-Id.
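To make the closed loop concrete, the sketch below joins feedback events to request records by request ID and reports a thumbs-up rate per model. It is a generic illustration, not Prism's feedback API, whose request and response shapes are not described here. The function name and the `request_id`, `model`, and `thumbs_up` keys are assumptions for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def approval_rate_by_model(
    records: Iterable[Dict[str, Any]],
    feedback: Iterable[Dict[str, Any]],
) -> Dict[str, float]:
    """Share of thumbs-up feedback per model, joined on request_id."""
    model_by_request = {r["request_id"]: r["model"] for r in records}
    tally = defaultdict(lambda: [0, 0])  # model -> [thumbs_up, total]

    for fb in feedback:
        model = model_by_request.get(fb["request_id"])
        if model is None:
            continue  # feedback for a request we have no record of
        tally[model][0] += int(fb["thumbs_up"])
        tally[model][1] += 1

    return {model: up / total for model, (up, total) in tally.items()}
```

The same join works for any dimension stored on the request record, such as feature tag or prompt version, which is what lets quality signals feed back into routing and caching choices.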

Gateway-layer observability vs eval-platform observability

The LLM observability market has two distinct shapes, and they solve different problems. Gateway-layer observability (Helicone, Prism, Portkey-observability) instruments every production request: it is high-volume, real-time, and focused on cost and operational signals. Eval-platform observability (LangSmith, Langfuse, Braintrust) instruments evaluation traces and prompt-engineering iterations: it is lower-volume, deeper-context, and focused on quality and prompt improvement. Mature teams use both; early teams pick one based on which problem dominates today (cost surprises point to gateway-layer; quality drift points to eval-platform).

Frequently asked questions

Is LLM observability different from regular application observability?
Same principles (metrics, logs, traces), different dimensions. Regular APM instruments HTTP request shape, latency, and error rate. LLM observability adds per-request tokens, model choice, cache layer, cost in cents, and prompt-quality signals, dimensions that don't exist in HTTP traffic. Tools like Datadog have started adding LLM-specific instrumentation, but LLM-native observability platforms (Helicone, Prism, Langfuse) ship more out of the box.
Do I need a separate tool or can I just use Datadog?
Datadog (or any APM) can capture latency and error rate from HTTP-call instrumentation, but it won't natively understand tokens, cache status, or model choice — you'd have to add custom instrumentation. For early-stage teams that already pay for Datadog, custom instrumentation is faster than adding a new vendor. For teams without existing APM, an LLM-native tool ships everything out of the box.
What's the difference between Helicone, LangSmith, and Prism's observability?
Helicone is gateway-layer observability: high-volume production instrumentation focused on cost and operations. LangSmith is eval-platform observability: lower-volume evaluation traces focused on quality and prompt iteration. Prism is gateway-layer with built-in feedback capture, closer to Helicone's shape but unified with caching, routing, and governance in one product. Mature teams often run two of these; early teams pick based on which problem dominates today.
What's the minimum useful instrumentation?
Per-request cost broken down by feature tag, and p95 latency per provider per model. Those two unlock the cost-engineering and routing decisions that drive most of the savings. Cache-hit rate and error rate are close seconds. Quality feedback (thumbs/ratings) is high-leverage but requires deliberate product surface to capture, so most teams add it later.