Building an AI Agents Observability Stack from Scratch
Running a single LLM call is simple to debug: you have an input, a prompt, and an output. Running an agent that orchestrates multiple LLM calls, tool invocations, and conditional branches is a different operational category. When an agent produces a wrong answer or gets stuck in a loop, you need to answer: which step failed, why did it fail, and was the failure deterministic or stochastic? Without observability infrastructure, the answer is always "I have no idea, let me run it again."
"Observability for AI agents is not a nice-to-have. When your agent makes 15 LLM calls to complete a task and produces a wrong answer, you need to know which call introduced the error, what the intermediate state was, and whether that error is reproducible. Without traces, you are debugging a black box with no instruments."
— Shreya Shankar, PhD Researcher, UC Berkeley, on production ML observability (2024)
The trace ID foundation
Every agent execution needs a trace ID that propagates through every step. This is the single most important investment in observability because it makes everything else possible. Without trace IDs, you cannot correlate a user complaint with the specific execution that caused it, you cannot aggregate latency by execution path, and you cannot build dashboards that distinguish failure modes.
Generate a UUID at the entry point of each agent execution and pass it explicitly through every sub-call. Log it with every LLM API request and response. Log it with every tool call. If you are using LangChain, LlamaIndex, or similar frameworks, most provide callbacks or hooks where you can inject this context.
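A minimal sketch of that propagation in plain Python, without any framework. The names here (`log_event`, `call_llm`, `call_tool`, `run_agent`, and the in-memory `EVENTS` list) are hypothetical stand-ins for illustration; a real system would call an actual LLM client and ship the records to a log pipeline.

```python
import json
import logging
import uuid

logger = logging.getLogger("agent")

# In-memory copy of every record, only so the sketch is easy to inspect.
EVENTS: list[dict] = []


def log_event(trace_id: str, step: str, **fields) -> dict:
    """Emit one structured log record that always carries the trace ID."""
    record = {"trace_id": trace_id, "step": step, **fields}
    EVENTS.append(record)
    logger.info(json.dumps(record))
    return record


def call_llm(trace_id: str, prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    log_event(trace_id, "llm_request", prompt=prompt)
    response = f"echo: {prompt}"
    log_event(trace_id, "llm_response", response=response)
    return response


def call_tool(trace_id: str, name: str, args: dict) -> dict:
    # Hypothetical stand-in for a real tool invocation.
    log_event(trace_id, "tool_call", tool=name, args=args)
    return {"ok": True}


def run_agent(task: str) -> tuple[str, str]:
    # Generate the trace ID once, at the entry point, and pass it explicitly.
    trace_id = str(uuid.uuid4())
    log_event(trace_id, "start", task=task)
    answer = call_llm(trace_id, task)
    call_tool(trace_id, "search", {"query": task})
    log_event(trace_id, "end")
    return trace_id, answer
```

Returning the trace ID to the caller means it can be attached to the user-facing response, which is what makes it possible to go from a complaint back to the exact execution.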
Token counting and cost attribution
LLM costs are proportional to tokens processed. Agent systems with multi-turn reasoning, long context windows, and multiple model calls can burn tokens fast. Without per-trace token accounting, you have no visibility into which execution paths are expensive, which users or use cases drive disproportionate cost, or when a prompt change accidentally doubled your context size.
Most LLM API responses return token counts in the response metadata. Capture these and log them with the trace ID. Aggregate daily by execution type and by model. Whatever your daily inference spend, you should be able to say which agent workflows account for each 20 percent of it.
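One way the aggregation could look, as a sketch. The `Usage` record, the model names, and the prices in `PRICES` are all made up for illustration; substitute the token fields and rates your provider actually reports.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    trace_id: str
    workflow: str       # execution type, e.g. "support" or "research"
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical USD prices per 1M tokens as (input, output); not real rates.
PRICES = {"model-a": (1.00, 2.00), "model-b": (10.00, 30.00)}


def cost_usd(u: Usage) -> float:
    """Cost of a single LLM call from its token counts."""
    in_price, out_price = PRICES[u.model]
    return (u.input_tokens * in_price + u.output_tokens * out_price) / 1_000_000


def cost_by(records: list[Usage], key: str) -> dict[str, float]:
    """Sum cost grouped by any Usage field: 'workflow', 'model', or 'trace_id'."""
    totals: dict[str, float] = defaultdict(float)
    for u in records:
        totals[getattr(u, key)] += cost_usd(u)
    return dict(totals)
```

Calling `cost_by(records, "workflow")` and `cost_by(records, "model")` over one day's records gives the two daily breakdowns described above; grouping by `"trace_id"` surfaces individual executions that were unusually expensive.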
Latency percentiles by step
Average latency is misleading for agent systems because it hides the tail. A p99 of 45 seconds on an agent that averages 8 seconds tells you there is a failure mode that gives roughly 1 in 100 executions a terrible experience. Track p50, p90, and p99 latency for each step type separately. The step with the worst p99 is almost always where your reliability investment will pay off most.
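A small sketch of per-step percentiles, assuming latency samples arrive as `(step_type, seconds)` pairs. It uses the nearest-rank definition of a percentile, which is one of several common conventions; a metrics backend may interpolate differently. The function names are illustrative.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group samples by step type and report p50, p90, and p99 for each."""
    by_step: dict[str, list[float]] = defaultdict(list)
    for step, seconds in samples:
        by_step[step].append(seconds)
    return {
        step: {f"p{p}": percentile(values, p) for p in (50, 90, 99)}
        for step, values in by_step.items()
    }
```

With only a handful of samples per step, p99 collapses to the maximum, so the tail numbers only become meaningful once each step type has accumulated a few hundred observations.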
Failure classification
Not all failures are equal. LLM calls fail in categorically different ways: the model times out (infrastructure), returns malformed JSON (output parsing failure), produces a syntactically valid but semantically wrong answer (capability failure), or refuses to complete the task (policy failure). These require different interventions. Classify the failure type at capture time so your dashboards can separate infrastructure reliability from model capability issues.
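The four categories above can be sketched as an enum plus a classifier that runs when the result is captured. Everything here is an assumption for illustration: the `classify` signature, the `REFUSAL_MARKERS` phrases (a crude heuristic, not a reliable refusal detector), the expectation that output is JSON, and the `passed_check` flag, which stands in for whatever downstream evaluation decides that a well-formed answer is wrong.

```python
import json
from enum import Enum
from typing import Optional


class FailureType(Enum):
    INFRASTRUCTURE = "infrastructure"   # timeouts, connection errors
    OUTPUT_PARSING = "output_parsing"   # malformed JSON
    CAPABILITY = "capability"           # well-formed but wrong
    POLICY = "policy"                   # model refused the task


# Hypothetical refusal phrases; tune these for the models you actually use.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i'm unable to")


def classify(
    error: Optional[Exception],
    raw_output: Optional[str],
    passed_check: Optional[bool] = None,
) -> Optional[FailureType]:
    """Return the failure type for one LLM call, or None if nothing failed."""
    if isinstance(error, (TimeoutError, ConnectionError)) or raw_output is None:
        return FailureType.INFRASTRUCTURE
    if any(marker in raw_output.lower() for marker in REFUSAL_MARKERS):
        return FailureType.POLICY
    try:
        json.loads(raw_output)
    except json.JSONDecodeError:
        return FailureType.OUTPUT_PARSING
    if passed_check is False:
        return FailureType.CAPABILITY
    return None
```

Logging `FailureType.value` alongside the trace ID lets a dashboard split failure rates by category, so a spike in timeouts is not mistaken for a drop in model quality.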
LangSmith vs self-hosted
LangSmith provides trace capture, evaluation tooling, and a UI for LangChain-based agent systems. The hosted version is operationally simple and the evaluation features are genuinely useful for iterating on prompts. The cost scales with trace volume, and sending your traces to a third party has data governance implications that matter for enterprise deployments.
For self-hosted observability, OpenTelemetry with a Jaeger or Grafana Tempo backend handles distributed tracing well. Arize Phoenix is purpose-built for LLM observability and can be self-hosted. The right answer depends on your scale and data requirements. At low volume, LangSmith's time-to-value is hard to beat. At high volume or with sensitive data, the self-hosted path makes more sense financially and operationally.
By the numbers
| Metric | Finding | Source |
|---|---|---|
| Teams with full LLM call tracing in production | Only 19% | Arize AI State of LLM Observability, 2024 |
| Average time to diagnose an agent failure without tracing | 4–8 hours | LangChain Developer Survey, 2024 |
| Impact of proactive hallucination monitoring | Up to 40% fewer escalations | Weights & Biases LLMOps Report, 2024 |