LLM observability
Instrumentation that captures per-request latency, cost, tokens, cache status, errors, and feedback — the data plane for AI cost engineering.
How it works
LLM observability is the instrumentation that captures the per-request data needed to understand what your LLM-calling application is doing: input and output tokens, cost in cents, latency (rolled up to p50/p95/p99), which model was called, cache hit status, error status (rolled up to an error rate by provider), and, in mature setups, quality feedback signals from users or evaluators. The data flows from a proxy layer or SDK middleware into a time-series store, where it's joined against application-level tags (which feature called this? which team? which user cohort?) to drive cost attribution, capacity planning, and quality assessment.
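As a rough illustration of the per-request record described above, here is a minimal sketch. The field names, the model names, and the price table are all hypothetical placeholders; they are not taken from any particular provider or product, and real per-token prices differ and change over time.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD per one million tokens (assumption, not real pricing).
PRICE_PER_MTOK = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


@dataclass
class RequestRecord:
    """One row per LLM call, as a proxy or SDK middleware might log it."""
    request_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool
    error: bool
    # Application-level tags used later for attribution (feature, team, cohort).
    tags: dict[str, str] = field(default_factory=dict)

    def cost_cents(self) -> float:
        """Token cost in cents, derived from the hypothetical price table."""
        price = PRICE_PER_MTOK[self.model]
        usd_times_1m = (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        )
        # USD per 1M tokens -> cents: divide by 1_000_000, multiply by 100.
        return usd_times_1m / 10_000


record = RequestRecord(
    request_id="req-001",
    provider="example-provider",
    model="example-small",
    input_tokens=2_000,
    output_tokens=1_000,
    latency_ms=840.0,
    cache_hit=False,
    error=False,
    tags={"feature": "search", "team": "growth"},
)
print(f"{record.request_id}: {record.cost_cents():.4f} cents")
```

Keeping tags on the record itself, rather than in a separate lookup, is one way to make the later join against feature, team, or cohort a simple group-by.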
Conceptually, it's the LLM-specific application of general observability principles (metrics, logs, traces): the same shape as APM tools for HTTP traffic (Datadog, New Relic, Honeycomb), but adapted to the dimensions that matter for LLM workloads. The dimensions that don't transfer cleanly from HTTP APM are tokens (no HTTP analog), model choice (a routing decision per request, not a static configuration), the cache layer (no direct analog to HTTP Cache-Control), and prompt quality (an entirely new concern with no APM precedent).
What to instrument first
For a team starting from zero, the instrumentation order that delivers the most value per hour of effort is: (1) cost per request, broken down by feature tag, which unlocks AI FinOps decisions; (2) p95 latency per provider per model, which unlocks routing decisions; (3) cache hit rate per cache layer, which unlocks cache tuning; (4) error rate per provider, which unlocks reliability decisions; (5) quality feedback (thumbs-up/down or 1-5 ratings tied to request IDs), which closes the loop between observability and optimization.
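The five signals above can be sketched as simple aggregations over a request log. This is an illustrative sketch only: the log rows, field names, and feedback format are hypothetical, a single cache layer is assumed, and the percentile uses the nearest-rank method (a real time-series store would typically compute these for you).

```python
import math
from collections import defaultdict

# Hypothetical request log rows (assumption: one cache layer, cost already in cents).
REQUESTS = [
    {"id": "r1", "feature": "search", "provider": "p1", "model": "m1",
     "cost_cents": 0.25, "latency_ms": 400, "cache_hit": False, "error": False},
    {"id": "r2", "feature": "search", "provider": "p1", "model": "m1",
     "cost_cents": 0.0, "latency_ms": 20, "cache_hit": True, "error": False},
    {"id": "r3", "feature": "summarize", "provider": "p2", "model": "m2",
     "cost_cents": 1.5, "latency_ms": 1200, "cache_hit": False, "error": False},
    {"id": "r4", "feature": "summarize", "provider": "p2", "model": "m2",
     "cost_cents": 0.0, "latency_ms": 300, "cache_hit": False, "error": True},
]
# Hypothetical feedback events: +1 for thumbs-up, -1 for thumbs-down.
FEEDBACK = [{"request_id": "r1", "rating": 1}, {"request_id": "r3", "rating": -1}]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cost_by_feature(requests):
    """(1) Total cost in cents per feature tag."""
    totals = defaultdict(float)
    for r in requests:
        totals[r["feature"]] += r["cost_cents"]
    return dict(totals)


def p95_latency_by_model(requests):
    """(2) p95 latency per (provider, model) pair."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r["provider"], r["model"])].append(r["latency_ms"])
    return {key: percentile(vals, 95) for key, vals in groups.items()}


def cache_hit_rate(requests):
    """(3) Fraction of requests served from cache."""
    return sum(r["cache_hit"] for r in requests) / len(requests)


def error_rate_by_provider(requests):
    """(4) Fraction of failed requests per provider."""
    counts = defaultdict(lambda: [0, 0])  # provider -> [errors, total]
    for r in requests:
        counts[r["provider"]][0] += r["error"]
        counts[r["provider"]][1] += 1
    return {p: errs / total for p, (errs, total) in counts.items()}


def mean_rating_by_feature(requests, feedback):
    """(5) Join feedback to requests by ID, then average ratings per feature."""
    feature_of = {r["id"]: r["feature"] for r in requests}
    ratings = defaultdict(list)
    for fb in feedback:
        ratings[feature_of[fb["request_id"]]].append(fb["rating"])
    return {f: sum(v) / len(v) for f, v in ratings.items()}


print(cost_by_feature(REQUESTS))
print(p95_latency_by_model(REQUESTS))
print(cache_hit_rate(REQUESTS))
print(error_rate_by_provider(REQUESTS))
print(mean_rating_by_feature(REQUESTS, FEEDBACK))
```

Note that (1) and (5) only work if the feature tag and the request ID are captured at request time, which is why a tagging convention and a feedback path have to be designed in rather than bolted on.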
Most teams stop at (1) and (2) because (3)-(5) require a deliberate product surface (a feedback API, a tagging convention, a quality-review workflow). That's fine for early-stage teams but leaves real money on the table. Prism ships all five primitives in the proxy layer: every request automatically logs cost, latency, and cache status; X-Prism-Tags drives feature attribution; and the feedback API closes the loop with thumbs/ratings tied to X-Prism-Feedback-Id.
Gateway-layer observability vs eval-platform observability
The LLM observability market has two distinct shapes, and they solve different problems. Gateway-layer observability (Helicone, Prism, Portkey-observability) instruments every production request: it's high-volume, real-time, and focused on cost and operational signals. Eval-platform observability (LangSmith, Langfuse, Braintrust) instruments evaluation traces and prompt-engineering iterations: it's lower-volume, deeper-context, and focused on quality and prompt improvement. Mature teams use both; early teams pick one based on which problem dominates today (cost surprises point to gateway-layer; quality drift points to eval-platform).