jatin.blog ~ $
$ cat ai-engineering/llm-observability.md

Production Tracing and Observability for LLM Systems

Distributed tracing for LLM apps in 2026: span shape, OTel GenAI semantics, OpenInference, sampling, and the LangSmith/Langfuse/Phoenix decision.

Jatin Bansal@blog:~/ai-engineering$ open llm-observability

A team ships an agent on Tuesday. By Friday, support is escalating a thread of complaints — answers are slow, occasionally truncated, and the assistant keeps inventing order IDs. The engineer pulls the application logs. Each row is a model call with input length, output length, latency, and total cost — nothing else. Which step of the agent loop emitted the bad order ID? No idea, because each turn is one log line and the loop ran five turns. Did the retriever return the right document? No idea, because retrieval is in a separate service whose logs are in a different bucket. Was the cache cold? No idea, because nothing instrumented the cache_read_input_tokens field. Two hours into the investigation the team is rebuilding what a trace would have shown them in thirty seconds. This is the failure mode every LLM system hits the first time something goes wrong in production, and the discipline that prevents it is distributed tracing — but adapted to the shape of an LLM application, not bolted on from a microservices playbook.

Opening bridge

Yesterday’s piece on LLM-as-judge closed out the evaluation top tier: rubrics, biases, calibration. The judge is the offline regression measurement; production observability is its online counterpart. The eval-driven development article left a placeholder in its workflow (“Log every model call with its inputs, retrieved context, output, latency, token counts, and cost. The production-tracing layer is a separate concern (covered later in this subtree)”), and the agent harness anatomy article named telemetry as the seventh harness duty, dropping a one-liner about Langfuse, Arize Phoenix, LangSmith, and OpenTelemetry collectors without working through the shape. Today’s piece makes good on both promises. It covers the load-bearing internals: what an LLM trace must capture, what schema it speaks, how it gets sampled, and which platform fits which workload.

Definition

LLM observability is the practice of capturing structured, queryable telemetry from every model call, tool invocation, retrieval, and evaluator pass that runs inside an LLM application — at a span granularity that lets one engineer reconstruct any failed turn in under five minutes, against a schema portable enough that the choice of vendor isn’t a one-way door. Three properties separate it from “we log the API calls.” First, the unit is the span, not the request. A single user turn produces a tree of spans — the parent turn, the model call(s), the tool call(s), the retriever call(s), the judge call(s) — linked by trace ID and parent span ID, the same shape the W3C trace-context spec defines for any distributed system. Second, the payload is structured. Prompts, completions, tool arguments, tool results, retrieved chunks, token counts, model name, cache state, latency — every field a typed attribute on the span, queryable in the trace store. Third, the schema is portable. The OpenTelemetry GenAI semantic conventions and OpenInference define vendor-agnostic attribute names for gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.prompt.0.content, and so on, so a span emitted by a LangChain agent looks identical to one emitted by a raw Anthropic call.

Intuition

The mental model: an LLM application is a distributed system whose nodes happen to be model calls and tool invocations instead of microservices. The operational primitives transfer one-to-one. A trace ID is the unit you’d cite to a teammate when escalating a bug. A span is the unit you’d attach an SLO to. Parent/child relationships are how you reconstruct the causal chain. Attributes are the structured fields you’d query in a log aggregator. Sampling is the way you afford keeping production traffic observable without sending every payload to durable storage. PII redaction is the policy boundary every regulated workload runs into the moment it ships.

The piece that doesn’t transfer cleanly is the cost of a payload. A microservice trace is dominated by metadata; an LLM span carries the prompt and the completion, which are typically the largest payloads in your stack. A nightly judged eval over 500 rows × 4 rubrics × 4 candidate models stores 8K payloads, each averaging tens of kilobytes. That is hundreds of megabytes per night, all of which has to be queryable for at least a few months, so the retained volume reaches tens of gigabytes. The economics push the design toward aggressive sampling at the edge, lossless capture for the sample, and a sidecar pipeline that ships PII-scrubbed payloads to durable cold storage. The trace is the artifact, not the line in the application log.

The other thing that doesn’t transfer: in a microservices trace, the thing being traced is deterministic; in an LLM trace, the thing being traced is statistical. A span saying “the model emitted X” is a sample from a distribution. The trace is the evidence; the eval suite is the metric. Observability tells you what happened on this specific turn; evals tell you what happens on average across the regression set. The two are complementary — observability without evals leaves you debugging without baselines, and evals without observability leaves you measuring averages without the per-trace evidence to investigate a regression.

The distributed-systems parallel

The closest analogue is OpenTelemetry tracing as it landed in microservices between 2017 and 2021. Before OTel, every vendor (Datadog, New Relic, Lightstep, Zipkin, Jaeger) shipped a proprietary trace format, instrumentation libraries were lock-in vectors, and “switch APM vendors” meant re-instrumenting the application. OTel’s bet was that the data model — span, trace, context propagation, semantic conventions — could be standardized across vendors so that instrumentation became a portability layer, not a lock-in surface. The LLM observability space is at the same inflection point in 2026. The OpenTelemetry GenAI semantic conventions, developed since April 2024 and partially stable as of March 2026, define gen_ai.client spans (stable; one per LLM round trip) and gen_ai.agent spans (experimental; one per agent invocation). Datadog now consumes OTel GenAI spans natively, Langfuse is built on OTel, and Arize’s OpenInference is converging with the OTel conventions rather than competing with them. The teams that instrument against OTel today are the ones that can switch backends without rewriting in 2027.

The deeper parallel is the log-vs-metric-vs-trace pillar split familiar from SRE. The three pillars are not interchangeable: logs are the unstructured catch-all, metrics are the cheap aggregates for dashboards and alerts, traces are the per-request reconstruction surface. LLM observability needs all three. Metrics — judge mean per category per day, cache hit rate, p95 latency, dollars per session — are the dashboard. Logs — the raw request and response, scrubbed and stored — are the long-tail forensic archive. Traces — the per-turn span tree with payloads, latencies, and parent relationships — are the investigation surface. A pipeline that captures only one pillar is the pipeline that goes blind on the failure modes the other two would have caught.

There’s a real disanalogy. Microservice traces converge over time toward a stable shape as the architecture stabilizes; LLM traces diverge as the application grows tool catalogs, memory layers, evaluator pipelines, and multi-agent fan-out. The span schema has to grow with the application surface, and an under-specified attribute set (no cache_creation_input_tokens vs cache_read_input_tokens, no per-tool latency, no retrieved-chunk IDs) silently degrades the investigation surface. Treat the schema as a versioned artifact and review it whenever a new capability ships.
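One way to treat the schema as a versioned artifact is a small manifest of required attributes per span type, checked in CI or at export time. The sketch below is an illustration under assumptions: the `SPAN_SCHEMA_VERSION` label, the `REQUIRED_ATTRIBUTES` table, and the `missing_attributes` helper are hypothetical names, and the attribute sets are examples drawn from this article rather than a complete schema.

```python
# Hypothetical schema manifest: required attribute names per span type.
SPAN_SCHEMA_VERSION = "2026-05-01"  # illustrative label; bump it on every schema change

REQUIRED_ATTRIBUTES = {
    "generation": {
        "gen_ai.request.model",
        "gen_ai.usage.input_tokens",
        "gen_ai.usage.output_tokens",
        "gen_ai.usage.cache_read_input_tokens",
        "gen_ai.usage.cache_creation_input_tokens",
    },
    "tool": {"tool.name", "latency_ms"},
    "retrieval": {"retrieval.query", "retrieval.documents"},
}


def missing_attributes(span_type: str, attributes: dict) -> list:
    """Return the required attribute names absent from a span, sorted."""
    required = REQUIRED_ATTRIBUTES.get(span_type, set())
    return sorted(required - set(attributes))


# A generation span that forgot the cache counters.
incomplete = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 300,
}
gaps = missing_attributes("generation", incomplete)
```

Running a check like this against a sample of exported spans whenever a new capability ships turns a silently degraded investigation surface into a failing test.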

Mechanics: the span shape

A working LLM trace for an agent turn has six span types. The relationships between them are what turn flat logs into a debuggable surface.

  1. The session span. One per user conversation. Holds tenant ID, user ID, session start/end timestamps, total cost and tokens rolled up from children. The query unit when support escalates a thread.
  2. The turn span (parent). One per user → assistant exchange. Descendants are everything that ran inside that exchange. Carries the user message, the final assistant message, the cumulative cost, the cumulative latency, and signal flags (error, budget_breach, compaction_triggered). Root of the turn’s subtree: the direct parent of the first generation span and an ancestor of every other span in the turn.
  3. The generation span (LLM call). The OTel gen_ai.client span. One per messages.create call to a provider. Attributes: gen_ai.request.model, gen_ai.system (anthropic/openai/…), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_read_input_tokens, gen_ai.usage.cache_creation_input_tokens, gen_ai.response.finish_reasons, latency, cost. Payload as gen_ai.prompt.{N}.content and gen_ai.completion.{N}.content events (or, in the OpenInference convention, llm.input_messages and llm.output_messages as JSON-encoded attributes).
  4. The tool call span. One per tool_use block dispatched. Attributes: tool.name, tool.arguments (JSON), tool.result (JSON), tool.error (if any), latency, retry count. The child of the generation span that emitted the tool_use, which is in turn the parent of the next generation span that consumes the tool_result. This is the loop body of the agent loop article made visible.
  5. The retrieval span. One per retriever call. Attributes: retrieval.query, retrieval.top_k, retrieval.documents (list of {id, score, snippet} records), latency. Child of the tool-call span when retrieval is exposed as a tool; child of the turn span directly when retrieval runs in a JIT context loop outside the model’s tool surface.
  6. The evaluator/judge span. One per LLM-judge call. Attributes: eval.rubric, eval.score, eval.reasoning, judge model name. Linked to the turn span by the same trace ID but typically run after the turn closes — sampled traces get re-scored offline, and the eval spans append to the existing trace rather than living in a separate store. This is the bridge between observability and the eval pyramid: the judge is offline, but the trace it scored is the production artifact.

The non-obvious requirement is the parent linkage discipline. A turn span with three generation spans as siblings (no causal ordering) is much less debuggable than a turn span with a chain of generation → tool-call → generation spans, where each generation’s parent is the tool-call whose result it consumed. The chain encodes the ReAct loop’s iteration count, and lets you ask “show me turns where the second generation emitted an error” — the kind of query that catches the failure mode where the loop driver is hiding an exception three frames deep.
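To make the linkage discipline concrete, here is a minimal, dependency-free sketch of a span tree and the "second generation errored" query. The `Span` dataclass and both helper functions are hypothetical illustrations, not any trace store's API; a real backend would express the same query in its own query language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    kind: str                  # "turn", "generation", or "tool"
    parent_id: Optional[str]   # None only for the root of the tree
    error: bool = False


def generation_chain(spans: list, turn_id: str) -> list:
    """Follow parent links down from the turn span; return generations in causal order."""
    children = {}
    for span in spans:
        if span.parent_id is not None:
            children.setdefault(span.parent_id, []).append(span)
    ordered = []
    frontier = [turn_id]
    while frontier:
        current = frontier.pop(0)
        for child in children.get(current, []):
            if child.kind == "generation":
                ordered.append(child)
            frontier.append(child.span_id)
    return ordered


def second_generation_failed(spans: list, turn_id: str) -> bool:
    """True when the turn has at least two generations and the second one errored."""
    chain = generation_chain(spans, turn_id)
    return len(chain) >= 2 and chain[1].error


# generation -> tool-call -> generation, chained through parent IDs.
trace = [
    Span("turn-1", "turn", None),
    Span("gen-1", "generation", "turn-1"),
    Span("tool-1", "tool", "gen-1"),
    Span("gen-2", "generation", "tool-1", error=True),
]
```

If the three child spans were flat siblings of the turn, the only ordering signal left would be timestamps, and "the second generation" would stop being a structural fact of the trace.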

Mechanics: OTel GenAI conventions vs OpenInference

Two semantic conventions compete in 2026. Choose deliberately; they are converging but not identical.

OpenTelemetry GenAI semantic conventions are the OTel SIG’s standardization, developed since April 2024. The gen_ai.client span is stable as of late 2025; the gen_ai.agent span and a number of attribute names are still experimental in mid-2026. The convention uses log-event attachments for prompt/completion content (so payloads can be sampled separately from span attributes), which fits OTel’s existing log pipeline shape but makes naive trace UIs awkward — the message content is in events, not attributes. Vendor support: Datadog, Honeycomb, Langfuse, and increasingly Phoenix.

OpenInference is Arize’s parallel convention, originally designed for Arize Phoenix and shipped as the default instrumentation for LangChain, LlamaIndex, and many SDKs. OpenInference uses span attributes (llm.input_messages, llm.output_messages, llm.token_count.*, tool.name, tool.parameters) rather than events, which is friendlier to most trace viewers but bloats individual spans. The convention covers retrieval (retrieval.documents), agents (openinference.span.kind set to AGENT), and embeddings (embedding.embeddings) in addition to the basic LLM call.

In practice, the two are close enough that a translation layer between them is straightforward, and OpenInference’s roadmap tracks the OTel work explicitly. The pragmatic rule: if your target backend is Phoenix or Arize AX, instrument against OpenInference; if your target is anything else (Langfuse, Datadog, LangSmith, Honeycomb, a vanilla OTel collector), instrument against OTel GenAI conventions. Auto-instrumentation libraries (the OpenInference SDKs, the OTel auto-instrumentations) handle most of the per-framework wiring; the choice is mostly about which attribute names land in your queries.
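A translation layer of the kind described can be as small as a rename table applied at export time. The sketch below is deliberately partial: the three pairs shown are ones where the two conventions carry the same scalar value, the `OPENINFERENCE_TO_OTEL` table and `to_otel_genai` function are hypothetical names, and message payloads (attributes on one side, events on the other) need more than a rename.

```python
# Partial, illustrative mapping; verify against both specs before relying on it.
OPENINFERENCE_TO_OTEL = {
    "llm.model_name": "gen_ai.request.model",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
}


def to_otel_genai(attributes: dict) -> dict:
    """Rename known OpenInference keys; pass every other key through unchanged."""
    return {OPENINFERENCE_TO_OTEL.get(key, key): value for key, value in attributes.items()}


converted = to_otel_genai({
    "llm.model_name": "example-model",
    "llm.token_count.prompt": 1200,
    "tool.name": "get_order_status",
})
```

Keeping the table in one module means dashboards and alerts can be written against a single set of names regardless of which instrumentation library emitted the span.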

Mechanics: sampling and PII

Two policies determine whether the trace pipeline is sustainable or a cost runaway.

Sampling. A naive pipeline that stores every prompt and completion runs into payload-cost economics fast. The pattern that survives: head-based sampling for cheap aggregates, tail-based sampling for forensic traces, plus 100% retention of error/anomaly traces. Concretely:

  • Every span is summarized into a metric row (no payload, just attributes) and shipped to the metrics store. Cheap, sub-millisecond, every turn.
  • A configurable fraction of traces (1-10% in production, 100% in staging) have their payloads shipped to the trace store. Selection is random by trace ID, so all spans in a sampled trace are kept together.
  • Any trace with an error flag, a budget_breach, an eval.score below a threshold, or a user-submitted thumbs-down is force-sampled at 100% regardless of the random sample. Forensic capture for the cases you actually need to debug.
  • The metric store is queried for trends; the trace store is queried for investigation. A bug report arrives with a session ID; the trace store has the session’s spans; the metric store doesn’t.

The tail-based variant requires a buffer at the collector that holds spans until the trace closes, then samples based on aggregate criteria. That is useful when “errors” aren’t known at span start. Production teams typically combine the two: head-based for the baseline sample, tail-based with the OTel collector’s tail_sampling processor for the error/anomaly augmentation.
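The keep/drop decision above can be sketched in a few lines. This is an illustration under assumptions, not a collector implementation: `head_sampled` and `keep_trace` are hypothetical helpers, the judge threshold of 0.5 is arbitrary, and a real deployment would express the forced-retention rules as collector policy rather than application code.

```python
import hashlib


def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic keep/drop derived from the trace ID, so every span in a trace agrees."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform 64-bit integer
    return bucket < int(rate * 2**64)            # integer compare avoids float rounding at rate=1.0


def keep_trace(
    trace_id: str,
    rate: float,
    *,
    error: bool = False,
    budget_breach: bool = False,
    thumbs_down: bool = False,
    judge_score=None,
    judge_threshold: float = 0.5,   # arbitrary illustrative threshold
) -> bool:
    """Tail-style decision: anomaly signals force retention, otherwise fall back to the head sample."""
    low_score = judge_score is not None and judge_score < judge_threshold
    forced = error or budget_breach or thumbs_down or low_score
    return forced or head_sampled(trace_id, rate)
```

Hashing the trace ID rather than drawing a random number per span is what keeps a sampled trace whole: every service that sees the same trace ID reaches the same decision without coordination.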

PII. Prompts contain user data; completions contain model-generated data that sometimes reflects the user data verbatim. The minimum-viable policy is a scrubbing layer between instrumentation and export with three parts: entity detection (regex first for emails, phone numbers, and credit-card patterns; an NER pass for names and addresses if the workload warrants it), a deterministic hash for fields that need to be queryable across spans without being identifying (e.g. user IDs), and a versioned scrubbing manifest checked into the repo so changes go through review. The GDPR right-to-be-forgotten story applies here too: a trace containing a deleted user’s data is a deletion target the same way the memory store is.

A subtler PII issue: judged eval spans append after the fact to a production trace, and the judge’s reasoning may quote the user’s content. If your scrubbing runs at instrumentation time but the judge runs at eval time, the judge’s output bypasses the scrubber. Add a separate scrub pass on judge-emitted attributes, or run the judge against the already-scrubbed payload and accept the (usually small) quality hit.
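A regex-first scrubber with deterministic pseudonyms might look like the sketch below. Everything named here is an assumption for illustration: the `SCRUB_MANIFEST_VERSION` label, the three deliberately simple patterns (a US-style phone format, 16-digit cards), and the keyed-hash helper. A production manifest would carry more rules and an NER pass.

```python
import hashlib
import hmac
import re

SCRUB_MANIFEST_VERSION = "v1"  # hypothetical; changes go through code review

# Ordered rules: the first alternative that matches at a position wins.
RULES = [
    ("email", r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    ("card", r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    ("phone", r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
]
# One combined pattern means one pass, so replacement tokens are never re-scanned.
COMBINED = re.compile("|".join(f"(?P<{label}>{pattern})" for label, pattern in RULES))


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed deterministic hash: the same input yields the same token, so spans stay joinable."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:12]


def scrub(text: str, key: bytes) -> str:
    """Replace each detected entity with <label:pseudonym>."""
    return COMBINED.sub(lambda m: f"<{m.lastgroup}:{pseudonymize(m.group(), key)}>", text)


scrubbed = scrub("Call 555-123-4567 or mail a@b.co", key=b"demo-key")
```

The same `scrub` function can be applied a second time to judge-emitted attributes such as the reasoning text, which covers the case where the judge quotes user content after the instrumentation-time pass has already run.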

Mechanics: cost and cache attribution

Two attributes that production teams find themselves wishing they had captured from day one: cache state and cost attribution.

Cache state. The prompt-caching article made the case for cache_creation_input_tokens vs cache_read_input_tokens being first-class. The corresponding observability requirement: every generation span tags those two counts as separate attributes, and a derived cache_hit_rate = cache_read / (cache_read + cache_creation + cache_miss_input) metric lands on the per-session dashboard. Here cache_miss_input is the uncached remainder of the prompt, which Anthropic’s usage object reports as input_tokens, separately from the two cache counts. The agent harness anatomy article flagged that a harness without cache telemetry runs at 5-10× the cost it should, a gap that stays invisible until the invoice arrives. The trace is the only place to see which turn killed the cache, which is the only way to fix the assembly bug at its source.
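The hit-rate formula and the "which turn killed the cache" question can both be computed from per-span token counts. The sketch assumes the provider reports uncached input separately from cache reads and cache writes; `cache_hit_rate` and `first_cache_break` are hypothetical helper names, and the per-step dicts stand in for generation-span attributes.

```python
from typing import Optional


def cache_hit_rate(uncached_input: int, cache_read: int, cache_creation: int) -> float:
    """Share of prompt tokens served from cache; 0.0 when there were no prompt tokens."""
    total = uncached_input + cache_read + cache_creation
    return cache_read / total if total else 0.0


def first_cache_break(steps: list) -> Optional[int]:
    """Index of the first step whose cache reads fell to zero after an earlier hit."""
    seen_hit = False
    for index, usage in enumerate(steps):
        if usage["cache_read"] > 0:
            seen_hit = True
        elif seen_hit:
            return index
    return None


session = [
    {"cache_read": 0, "cache_creation": 900},     # cold start writes the prefix
    {"cache_read": 900, "cache_creation": 50},    # warm
    {"cache_read": 0, "cache_creation": 1000},    # prefix changed: cache rebuilt
]
broken_at = first_cache_break(session)
```

A step flagged by `first_cache_break` is where to start looking for a prompt-assembly change that invalidated the cached prefix.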

Cost. Per-span cost in dollars (computed from token counts and a pricing table) lets the trace store answer “show me the top 10 most expensive turns this week,” “what’s the cost distribution by tool,” “which tenants are heavy users.” Tagging the session span with tenant_id and the generation span with model.id and model.tier enables per-tenant per-model rollups that finance asks for once a quarter. Pricing tables change quarterly across providers; pin the pricing-table version into the trace store at the span attribute level (cost.pricing_version) so historical comparisons stay coherent across rate changes.
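As a sketch of version-pinned cost attribution: the pricing tables below are placeholder figures keyed by a version label, and `span_cost` and `rollup` are hypothetical helpers. The point is only that the version travels with the span, so a later rate change does not rewrite history.

```python
from collections import defaultdict

# version -> model -> ($/MTok input, $/MTok output); all figures are placeholders.
PRICING_TABLES = {
    "2026-01": {"example-model": (3.00, 15.00)},
    "2026-05": {"example-model": (2.50, 12.00)},
}


def span_cost(model: str, input_tokens: int, output_tokens: int, pricing_version: str) -> dict:
    """Cost attributes for one generation span, pinned to the pricing table that produced them."""
    in_price, out_price = PRICING_TABLES[pricing_version][model]
    usd = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return {"cost.usd": usd, "cost.pricing_version": pricing_version}


def rollup(spans: list, key: str) -> dict:
    """Sum cost.usd grouped by an attribute such as tenant_id or model.id."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost.usd"]
    return dict(totals)


old = span_cost("example-model", 1_000_000, 1_000_000, "2026-01")
new = span_cost("example-model", 1_000_000, 1_000_000, "2026-05")
```

The same token counts yield different dollar figures under the two versions, which is why a comparison across a rate change has to filter or normalize on `cost.pricing_version`.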

Code: instrumented Python harness with Langfuse

Langfuse’s Python SDK is the cleanest entry point for a self-hostable, OTel-native trace pipeline. The harness below wraps an Anthropic agent with the loop body from the agent loop article, emits the span shape above, and ships to a Langfuse backend. Install: pip install langfuse anthropic.

python
# pip install langfuse anthropic
import hashlib
import re
import time

from anthropic import Anthropic
from langfuse import get_client

# Langfuse client picks up LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST env vars.
langfuse = get_client()
anthropic_client = Anthropic()

MODEL = "claude-sonnet-4-6"
PRICING = {"claude-sonnet-4-6": (3.00, 15.00)}  # $/MTok input, output

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(text: str) -> str:
    """Naive PII scrubbing; production uses a real NER pipeline."""
    # Replace each email with a short hash: queryable across spans, not identifying.
    return EMAIL_RE.sub(
        lambda m: f"<email:{hashlib.sha256(m.group().encode()).hexdigest()[:8]}>",
        text,
    )

def run_turn(session_id: str, tenant_id: str, user_msg: str, tools: list, dispatch):
    # Turn span is the parent of every other span in this iteration.
    with langfuse.start_as_current_observation(
        as_type="span",
        name="agent.turn",
        input={"user_msg": scrub(user_msg)},
        metadata={"session_id": session_id, "tenant_id": tenant_id},
    ) as turn:
        messages = [{"role": "user", "content": user_msg}]
        total_cost = 0.0
        for step in range(10):  # step cap; see agent-loop article
            t0 = time.monotonic()
            # Generation span — OTel-style gen_ai.client attributes.
            with langfuse.start_as_current_observation(
                as_type="generation",
                name="anthropic.messages.create",
                model=MODEL,
                input=[{"role": m["role"], "content": scrub(str(m["content"]))} for m in messages],
                metadata={"step": step},
            ) as gen:
                resp = anthropic_client.messages.create(
                    model=MODEL, max_tokens=1024, tools=tools, messages=messages,
                )
                latency_ms = int((time.monotonic() - t0) * 1000)
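                # Token + cache attribution. Anthropic reports cache reads and cache
                # writes separately from the (uncached) input_tokens count.
                in_tok = resp.usage.input_tokens
                out_tok = resp.usage.output_tokens
                cache_read = getattr(resp.usage, "cache_read_input_tokens", 0) or 0
                cache_create = getattr(resp.usage, "cache_creation_input_tokens", 0) or 0
                in_price, out_price = PRICING[MODEL]
                # Assumed multipliers: cache reads billed at 0.1x and cache writes at
                # 1.25x the base input price. Check them against the current price list.
                cost = (
                    in_tok * in_price
                    + cache_read * in_price * 0.10
                    + cache_create * in_price * 1.25
                    + out_tok * out_price
                ) / 1e6
                total_cost += cost
                gen.update(
                    output=scrub(str(resp.content)),
                    usage_details={
                        "input": in_tok, "output": out_tok,
                        "cache_read_input_tokens": cache_read,
                        "cache_creation_input_tokens": cache_create,
                    },
                    cost_details={"total": cost},  # numeric values only
                    metadata={
                        "latency_ms": latency_ms,
                        "finish_reason": resp.stop_reason,
                        "pricing_version": "2026-05",  # pins the table used for this cost
                    },
                )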

            # Append assistant turn.
            messages.append({"role": "assistant", "content": resp.content})
            if resp.stop_reason != "tool_use":
                turn.update(output=scrub(str(resp.content)), metadata={"total_cost_usd": total_cost})
                return resp

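            # Dispatch each tool_use block as its own span. The generation span above
            # has already closed, so these spans are children of the turn span; the
            # step index in metadata ties each tool call to the generation that asked for it.
            tool_results = []
            for block in resp.content:
                if block.type != "tool_use":
                    continue
                with langfuse.start_as_current_observation(
                    as_type="span",
                    name=f"tool.{block.name}",
                    input=scrub(str(block.input)),  # tool arguments can carry user data
                ) as tool_span:
                    t1 = time.monotonic()
                    try:
                        result = dispatch(block.name, block.input)
                        is_error = False
                    except Exception as e:
                        result = str(e)
                        is_error = True
                    tool_span.update(
                        output=None if is_error else scrub(str(result)),
                        metadata={
                            "step": step,
                            "latency_ms": int((time.monotonic() - t1) * 1000),
                            "is_error": is_error,
                        },
                        level="ERROR" if is_error else "DEFAULT",
                        # Keep the error text on the span instead of discarding it.
                        status_message=scrub(str(result)) if is_error else None,
                    )
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "content": str(result), "is_error": is_error,
                })
            messages.append({"role": "user", "content": tool_results})

        # Step cap reached without a final answer: flag the turn and return nothing.
        turn.update(
            metadata={"budget_breach": True, "reason": "step_cap"},
            level="WARNING",
        )
        return None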

# Flush at process exit for short-lived applications.
import atexit; atexit.register(langfuse.flush)

Three things to flag. First, every span is opened with start_as_current_observation, which uses Python’s context-manager protocol to ensure parents are set correctly even on exception paths. A custom span manager that forgets this drops the parent linkage on error, which is exactly when you need the trace. Note that the generation span here closes before its tools are dispatched, so the tool spans are recorded as children of the turn span rather than of the generation that requested them; the step index on each generation span preserves the ordering. Second, usage_details carries the cache attribution as separate fields so the cache-hit-rate metric is computable per span without re-parsing the payload. Third, the pricing version is pinned at the span level, so when the provider changes pricing, historical traces still reconstruct the cost they incurred at the time, not the cost they would incur today.

Code: TypeScript with OTel and the Vercel AI SDK

On the TypeScript side, the Vercel AI SDK supports OpenTelemetry tracing natively via the experimental_telemetry option, which emits OTel GenAI spans against any OTLP-compatible backend. The harness below uses @opentelemetry/sdk-node for span export, the AI SDK’s generateText and tool primitives for the loop body, and a custom span around each tool dispatch. Install: npm install ai @ai-sdk/anthropic zod @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http.

typescript
// npm install ai @ai-sdk/anthropic zod @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http
import { createHash } from "node:crypto";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { generateText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// OTel SDK: ships to any OTLP collector (Langfuse, Phoenix, Honeycomb, Datadog).
const sdk = new NodeSDK({
  // With no explicit `url`, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself and
  // appends the /v1/traces path. An explicit `url` is used verbatim, so passing the
  // base endpoint there would post to the wrong path.
  traceExporter: new OTLPTraceExporter(),
  serviceName: "support-agent",
});
sdk.start();
const tracer = trace.getTracer("support-agent");

function scrub(text: string): string {
  // Hash emails; production uses NER.
  return text.replace(
    /[\w.+-]+@[\w-]+\.[\w.-]+/g,
    (m) => `<email:${createHash("sha256").update(m).digest("hex").slice(0, 8)}>`,
  );
}

const getOrderStatus = tool({
  description: "Look up the status of a customer order by ID.",
  inputSchema: z.object({ order_id: z.string() }),
  execute: async ({ order_id }) =>
    tracer.startActiveSpan(
      "tool.get_order_status",
      { attributes: { "tool.name": "get_order_status", "tool.parameters": JSON.stringify({ order_id }) } },
      async (span) => {
        try {
          const result = await lookupOrder(order_id); // app-specific
          span.setAttribute("tool.output", JSON.stringify(result));
          return result;
        } catch (e: unknown) {
          span.recordException(e as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw e;
        } finally {
          span.end();
        }
      },
    ),
});

async function runTurn(sessionId: string, tenantId: string, userMsg: string) {
  return tracer.startActiveSpan(
    "agent.turn",
    {
      attributes: {
        "session.id": sessionId,
        "tenant.id": tenantId,
        "input.value": scrub(userMsg),
      },
    },
    async (turnSpan) => {
      try {
        const { text, totalUsage, finishReason, providerMetadata } = await generateText({
          model: anthropic("claude-sonnet-4-6"),
          tools: { get_order_status: getOrderStatus },
          messages: [{ role: "user", content: userMsg }],
          stopWhen: ({ steps }) => steps.length >= 10,  // step cap
          // AI SDK emits OTel GenAI spans for every generation step.
          experimental_telemetry: {
            isEnabled: true,
            functionId: "agent-turn",
            metadata: { session_id: sessionId, tenant_id: tenantId },
            // With these flags on, the SDK records raw prompts/completions on its own
            // spans, which bypasses scrub(). Set both to false when payloads must stay
            // out of the trace.
            recordInputs: true,
            recordOutputs: true,
          },
        });
        // providerMetadata reflects the final step only.
        const cacheRead =
          (providerMetadata?.anthropic?.cacheReadInputTokens as number) ?? 0;
        const cacheCreate =
          (providerMetadata?.anthropic?.cacheCreationInputTokens as number) ?? 0;
        turnSpan.setAttributes({
          // totalUsage sums every step of the loop; `usage` covers only the last step.
          "gen_ai.usage.input_tokens": totalUsage.inputTokens ?? 0,
          "gen_ai.usage.output_tokens": totalUsage.outputTokens ?? 0,
          "gen_ai.usage.cache_read_input_tokens": cacheRead,
          "gen_ai.usage.cache_creation_input_tokens": cacheCreate,
          "agent.finish_reason": finishReason,
          "output.value": scrub(text),
        });
        return text;
      } catch (e: unknown) {
        turnSpan.recordException(e as Error);
        turnSpan.setStatus({ code: SpanStatusCode.ERROR });
        throw e;
      } finally {
        turnSpan.end();
      }
    },
  );
}

declare function lookupOrder(id: string): Promise<{ status: string }>;

The Vercel AI SDK’s experimental_telemetry block does most of the heavy lifting — it emits an ai.generateText span per top-level call, an ai.generateText.doGenerate span per LLM round-trip with OTel GenAI attributes, and an ai.toolCall span per dispatched tool. The custom tool.get_order_status span layered on top adds tool-specific attributes that the auto-instrumentation doesn’t know about (parameter values, tool output, errors). The result is a span tree that any OTLP-compatible backend can render: Langfuse, Phoenix, Honeycomb, Datadog, or a vanilla Jaeger for development.

Build vs buy: the 2026 platform landscape

Five platforms anchor the production conversation. The right one depends on framework lock-in tolerance, deployment model, eval rigor, and whether observability has to integrate with an existing APM.

LangSmith. Path of least friction for teams on LangGraph or LangChain: auto-instrumentation is one import away, traces include node-by-node state diffs and replay against new model versions, and the eval workflow (datasets → experiments → comparison) is integrated rather than bolted on. Trade-off: closed-source, hosted by default (self-hosting is reserved for the enterprise plan), and noticeably less useful when the application isn’t built on LangChain. If your stack is LangGraph end-to-end, LangSmith is the default.

Langfuse. The open-source leader. Self-hostable (Postgres + ClickHouse), framework-agnostic, built on OTel with native auto-instrumentation for LangChain, LlamaIndex, Vercel AI SDK, Anthropic, OpenAI, and most agent frameworks. Strong on cost attribution, prompt management, and self-hosting (the cloud product runs the same code as the open-source release). The default choice when you want vendor-flexibility, data sovereignty, or both.

Arize Phoenix. Arize built ML observability before LLMs were a thing, and Phoenix inherits that rigor: the eval product, drift detection, and dataset-comparison surface are deeper than Langfuse’s. Uses OpenInference as the native convention (built on OpenTelemetry’s span model, with its own attribute names rather than the gen_ai.* set). Phoenix is open-source and self-hostable; Arize AX is the enterprise hosted product. The right choice for teams whose evaluation needs run ahead of the basic-trace use case.

Datadog LLM Observability / Honeycomb LLM Observability. Native OTel GenAI consumers. The right choice when the platform team already runs an APM and the LLM workload is one service among many — same dashboards, same alerting, same on-call rotation. Trade-off: less LLM-specific affordance (prompt management, dataset curation, judge-as-a-service) than a purpose-built platform.

Helicone. Proxy-based, drop-in for OpenAI/Anthropic-compatible APIs. Lowest friction to ship (set a base URL, you get a trace), but the proxy boundary limits attribute richness — tool calls and retrieval steps that don’t go through the proxy aren’t in the trace.

The decision rule that holds up: pick the observability platform by framework + team shape, not by feature checklist. LangSmith if you’re LangGraph end-to-end. Langfuse if you want OSS or self-host. Phoenix if your evaluation rigor is the bottleneck. Datadog/Honeycomb if it has to integrate with an existing APM. Helicone if you need a trace today and will deepen later. Instrument against OTel GenAI conventions either way — the schema is portable, the backend isn’t a one-way door, and the day you switch vendors the migration cost is bounded.

Trade-offs, failure modes, gotchas

Auto-instrumentation versus manual spans. Auto-instrumentation (the OpenInference SDKs, Langfuse’s @observe decorator, the AI SDK’s experimental_telemetry) catches the easy 80% — every LLM call gets a span, every tool dispatched through the framework gets a span. Manual instrumentation handles the long tail: custom retrieval flows, judged eval spans, side-effects in tool implementations. A pipeline that relies only on auto-instrumentation captures what the framework knows about; a pipeline that only uses manual instrumentation grows brittle. Use both.

Sampling at instrumentation versus sampling at export. Two different things; teams confuse them. Instrumentation-time sampling — “only emit spans for 5% of turns” — saves both CPU on the application and bandwidth to the collector, but loses the ability to retroactively up-sample for error traces because the trace was never emitted. Export-time sampling — emit everything, drop at the collector — costs more CPU and bandwidth, but the collector can decide which traces to keep based on tail signals (errors, latency, judge score). Production pipelines use export-time sampling with the OTel collector’s tail-sampling processor for forensic capture, falling back to instrumentation-time sampling only when the application’s emit cost becomes a bottleneck.

Payload size is the dominant cost driver, not span count. A trace store priced per span will quote you one number; the actual storage cost is dominated by the payloads (prompts, completions, retrieved chunks). A nightly judged eval over a 500-row golden set with four rubrics emits ~8K spans but stores ~200MB of payload. Plan capacity from the payload side, not the span side.
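
A quick back-of-the-envelope check on those figures, as a sketch. The span count and payload size are the approximate numbers from the paragraph above; the decimal megabyte and the 30-night month are assumptions.

```python
def avg_payload_bytes_per_span(total_payload_bytes: int, span_count: int) -> float:
    """Average stored payload per span."""
    return total_payload_bytes / span_count


def payload_bytes_over(nightly_payload_bytes: int, nights: int = 30) -> int:
    """Payload accumulated over a retention window of `nights` nightly runs."""
    return nightly_payload_bytes * nights


NIGHTLY_SPANS = 8_000           # "~8K spans" from the example above
NIGHTLY_PAYLOAD = 200 * 10**6   # "~200MB", assuming 1 MB = 10^6 bytes

per_span = avg_payload_bytes_per_span(NIGHTLY_PAYLOAD, NIGHTLY_SPANS)
per_month = payload_bytes_over(NIGHTLY_PAYLOAD)

# prints "25 KB per span, 6 GB per 30 nights"
print(f"{per_span / 1000:.0f} KB per span, {per_month / 10**9:.0f} GB per 30 nights")
```

At roughly 25 KB of payload per span, the span count says very little about storage; the retention window multiplied by payload volume is the number to plan against.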

Don’t trace through the eval suite by default. The sampling pattern from the eval-driven development article (run the cheap layer on every PR, the judged layer nightly) implies the eval suite emits thousands of LLM calls per night. Routing those through the production trace pipeline poisons your production trace costs and dashboard SLOs with eval traffic that has different characteristics. Either tag eval spans with environment=eval and filter them out of production dashboards, or run evals against a separate trace endpoint with its own retention policy.
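
Both options come down to one predicate on the span. A minimal sketch, assuming spans are plain dicts with an `attributes` map; the endpoint URLs are hypothetical placeholders, and only the `environment=eval` tag comes from the text above.

```python
from typing import Any, Dict, Iterable, List

Span = Dict[str, Any]

# Hypothetical endpoints, used only to illustrate the routing option.
PROD_ENDPOINT = "https://traces.example.com/prod"
EVAL_ENDPOINT = "https://traces.example.com/eval"


def is_eval_span(span: Span) -> bool:
    """True when the span carries the environment=eval tag."""
    return span.get("attributes", {}).get("environment") == "eval"


def endpoint_for(span: Span) -> str:
    """Option 2: send eval traffic to its own endpoint and retention policy."""
    return EVAL_ENDPOINT if is_eval_span(span) else PROD_ENDPOINT


def production_only(spans: Iterable[Span]) -> List[Span]:
    """Option 1: keep one store, filter eval spans out of production views."""
    return [s for s in spans if not is_eval_span(s)]
```

Either way the tag has to be set where the eval harness creates the span. A span that arrives untagged is indistinguishable from production traffic.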

OpenInference vs OTel attribute naming is a portability tax. The two conventions overlap heavily but not entirely; llm.input_messages (OpenInference) and gen_ai.prompt.0.content (OTel) carry the same data with different names. A pipeline that hard-codes one set of attribute names in dashboards or alerts will fight a translation tax the day it migrates backends. Either commit fully to OTel GenAI conventions and accept that Phoenix-native features may need a translation layer, or commit to OpenInference and accept that vanilla OTel backends won’t auto-render the LLM-specific UI without configuration.
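
One way to contain the tax is to keep the attribute names in a single mapping table instead of scattering them across dashboards and alerts. The sketch below is illustrative only: the table holds just the one pair named above, and it assumes a flat key rename is enough. Real conventions can differ in structure as well as in names, so a production translation layer would need more than this.

```python
from typing import Any, Dict

# Illustrative mapping: the single pair below is the one named in the text.
# A real table would be filled in from the two specifications.
OPENINFERENCE_TO_OTEL: Dict[str, str] = {
    "llm.input_messages": "gen_ai.prompt.0.content",
}


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse table (assumes the mapping is one-to-one)."""
    return {v: k for k, v in mapping.items()}


def translate_attributes(attrs: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename known keys; pass unknown keys through unchanged."""
    return {mapping.get(key, key): value for key, value in attrs.items()}
```

Dashboards and alerts then query whichever side of the table the backend speaks, and a backend migration changes the table instead of every query.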

The judge span has to live in the same trace as the production turn. A common anti-pattern is running the judge in a separate pipeline that emits to a separate store, then trying to reconstruct the cross-store join when a regression hits. The right shape is to append the judge span to the existing trace using the production trace ID — the judge runs offline, but it scored a specific turn, and the trace is the natural artifact to attach the score to. Most trace stores support late-arriving spans against an existing trace ID; use that.
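
The shape of a late-arriving judge span can be sketched without any SDK. This is a hypothetical record type, not a real trace store's API: the `SpanRecord` fields and the `judge.rubric` and `judge.score` attribute names are assumptions. The part that matters is that the trace ID is copied from the production turn and not generated fresh.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SpanRecord:
    """Hypothetical span record; field names are illustrative."""
    trace_id: str
    span_id: str
    parent_span_id: str
    name: str
    start_ns: int
    end_ns: int
    attributes: Dict[str, Any] = field(default_factory=dict)


def make_judge_span(
    production_trace_id: str,
    scored_span_id: str,
    rubric: str,
    score: float,
) -> SpanRecord:
    """Build a judge span that attaches to an existing production trace."""
    now = time.time_ns()
    return SpanRecord(
        trace_id=production_trace_id,   # reuse the production trace ID
        span_id=secrets.token_hex(8),   # fresh 8-byte span ID, 16 hex chars
        parent_span_id=scored_span_id,  # the span of the turn being scored
        name="judge",
        start_ns=now,
        end_ns=now,
        # Hypothetical attribute names for the score payload.
        attributes={"judge.rubric": rubric, "judge.score": score},
    )
```

Whether the store accepts a span hours after the trace closed is backend-specific, so check the late-arrival window before relying on this pattern.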

Multi-agent traces look like graphs, not threads. The multi-agent orchestration article flagged this explicitly: when agent A hands off to B, which hands back to A, the trace structure is a graph traversal rather than a linear log. Most modern viewers (Langfuse, LangSmith, Phoenix) render these correctly, but custom dashboards built against an assumed linear trace shape break. If you’re building a multi-agent system, validate that the trace UI handles the topology before you commit.

Trace privacy crosses tenant boundaries the same way the memory store does. A trace from tenant A that a support engineer queries while debugging holds the same data the memory privacy article covers, and it needs the same controls: RBAC over traces, audit logs over trace access, deletion when a tenant offboards, and regional data residency for traces of EU users. The trace store is a system of record for PII, not just an operational artifact, and the GDPR/SOC 2 boundary applies.

Open the trace before opening the codebase. When an incident comes in, the reflex of every engineer trained in microservices is to read the code first. For LLM systems, the trace tells you more than the code does — what the model actually said, which tools it actually called, what the retriever actually returned. The code is a hypothesis about behavior; the trace is the behavior. Train the team to open the trace viewer first.

Further reading

  • Eval-Driven Development for LLM Systems — the offline counterpart to today’s online discipline. The trace store captures what happened; the eval suite captures what should happen on average. Both are required; neither replaces the other.
  • LLM-as-Judge: Pointwise and Pairwise — the evaluator that runs against sampled traces. The judge span attaches to the production trace by the same trace ID, which is what makes per-trace investigation actionable when a regression hits.
  • Drift Detection and Regression Testing for LLM Systems — the control loop the trace store feeds. Input-distribution drift, output-feature drift, and concept drift all read from the spans this article defines; the regression-testing protocol for model upgrades runs on top of the same substrate.
  • Human-in-the-Loop Feedback Loops for LLM Systems — what the trace store is for once observability is wired. Capturing structured user feedback against trace IDs, sampling production traces into an annotation queue, and routing the resulting labels back into evals, prompts, retrieval, and (optionally) the model.