
How to Monitor and Debug LLM Agents in Production

Debugging an autonomous agent feels like investigating a crime scene. You arrive after the event, armed only with logs, trying to piece together the agent's "thought process."

The Anatomy of an Agent Failure

Traditional software fails via exceptions or crashes. LLM agents usually fail silently but catastrophically: the workflow completes without error, but the result is a hallucinated document or an endless loop of retries.

The Infinite Loop

The agent encounters an unexpected API error, fails to parse it, and repeatedly attempts the exact same action, burning through tokens until a hard timeout hits.
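One practical defense is to fingerprint each tool call and abort when the same call repeats. A minimal sketch of the idea — the `LoopGuard` class, threshold, and tool names here are hypothetical, not part of any framework:

```python
import hashlib
import json

class LoopGuard:
    """Aborts an agent run when the same tool call repeats too often."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = {}

    def check(self, tool_name, arguments):
        # Fingerprint the call: same tool + same args = same key
        key = hashlib.sha256(
            (tool_name + json.dumps(arguments, sort_keys=True)).encode()
        ).hexdigest()
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] > self.max_repeats:
            raise RuntimeError(
                f"Loop detected: '{tool_name}' called {self.counts[key]} "
                "times with identical arguments"
            )

guard = LoopGuard(max_repeats=2)
guard.check("query_db", {"table": "orders"})
guard.check("query_db", {"table": "orders"})
# A third identical call would raise RuntimeError instead of burning tokens
```

Raising an exception here is deliberate: it converts a silent token-burning loop into a hard failure your error monitoring can actually see.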

Hallucinated Parameters

The agent decides to call a database query tool but invents table names or SQL syntax that completely break the downstream system.
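Validating arguments against a known schema before execution catches this class of failure cheaply. A minimal sketch, assuming a hypothetical database tool with a fixed set of tables (the names are illustrative):

```python
ALLOWED_TABLES = {"orders", "customers", "invoices"}  # hypothetical schema

def validate_query_args(args):
    """Reject hallucinated table names before the query ever runs."""
    table = args.get("table")
    if table not in ALLOWED_TABLES:
        raise ValueError(
            f"Agent requested unknown table '{table}'; "
            f"allowed: {sorted(ALLOWED_TABLES)}"
        )
    return args

validate_query_args({"table": "orders"})          # passes through unchanged
# validate_query_args({"table": "usr_accounts"})  # would raise ValueError
```

The same pattern extends to any tool: validate at the boundary, and feed the validation error back to the agent so it can self-correct instead of breaking the downstream system.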

Implementing Agent Telemetry

To debug an agent, you need context-rich telemetry. You must intercept the workflow at three critical boundaries:

1. The LLM Boundary (Prompt/Completion Tracing)

Record exactly what was sent to the LLM (including the entire system prompt, injected memory, and conversation history) and what raw string or JSON the LLM returned.

from langfuse.openai import OpenAI  # drop-in wrapper that traces OpenAI calls
from langfuse.decorators import observe

llm_client = OpenAI()  # reads OPENAI_API_KEY from the environment

@observe()
def execute_agent_step(task, context):
    # @observe automatically captures latency, token usage, and
    # generation output for this specific node in the graph
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": task}]
    )
    return response.choices[0].message.content

2. The Tool Execution Boundary

When the agent invokes an external tool (e.g., Python REPL, Web Search), you must capture the exact arguments passed to the tool and the raw data the tool returned to the agent.

Debugging Tip: If an agent makes a weird decision in Step 4, it is almost always because the Tool Output from Step 3 was malformed, paginated improperly, or too dense for the LLM to comprehend.
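A thin decorator at this boundary captures both sides of every call. The sketch below uses Python's standard `logging`; the `traced_tool` name and the `web_search` stand-in are hypothetical, not a specific framework's API:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.tools")

def traced_tool(fn):
    """Log the exact arguments and raw return value of every tool call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            # Truncate the result so one huge payload doesn't flood the logs
            log.info("tool=%s args=%r kwargs=%r result=%s elapsed=%.3fs",
                     fn.__name__, args, kwargs, repr(result)[:500],
                     time.monotonic() - start)
            return result
        except Exception as exc:
            log.error("tool=%s args=%r kwargs=%r error=%r",
                      fn.__name__, args, kwargs, exc)
            raise

    return wrapper

@traced_tool
def web_search(query=""):
    return f"results for {query}"  # stand-in for a real search call
```

With this in place, the malformed Step 3 output that confused the agent in Step 4 is sitting verbatim in your logs.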

3. The Vector Retrieval Boundary

If your agent uses RAG to fetch context before deciding, log the retrieved chunks and their similarity scores. A surprisingly high percentage of agent "hallucinations" are just the agent accurately summarizing bad data retrieved from the vector database.
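Logging this boundary can be as simple as wrapping the retriever. The sketch below assumes a retriever callable that returns `(chunk, score)` pairs; the function names and the stand-in retriever are hypothetical:

```python
def log_retrieval(query, retriever, top_k=4):
    """Record retrieved chunks and scores so 'hallucinations' can be
    traced back to bad data in the vector database."""
    hits = retriever(query, top_k)  # assumed: returns [(chunk, score), ...]
    for rank, (chunk, score) in enumerate(hits, start=1):
        print(f"[retrieval] rank={rank} score={score:.3f} chunk={chunk[:80]!r}")
    return hits

def fake_retriever(query, top_k):
    # Stand-in for a real vector-DB query
    return [("Refund policy: 30 days...", 0.91),
            ("Shipping times vary...", 0.47)][:top_k]

hits = log_retrieval("refund policy", fake_retriever)
```

When an answer looks hallucinated, check this log first: a low top score or an off-topic chunk means the retrieval failed, not the model.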

Building an Agent Debugging Dashboard

In a production environment, you should be tracking these specific KPIs on your dashboard:

  • Average Steps per Task: Spikes indicate agents are getting confused and looping.
  • Tool Error Rate: The percentage of tool calls that return exceptions back to the agent.
  • Token Burn Rate per Session: Critical for identifying runaway agents.
  • Human Intervention Rate: How often does the system escalate to a human operator?
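The KPIs above can be computed from plain session records. A minimal sketch, assuming a hypothetical per-session log format:

```python
def session_kpis(sessions):
    """Compute dashboard KPIs from a list of session records.

    Each session is assumed to look like:
    {"steps": int, "tool_calls": int, "tool_errors": int,
     "tokens": int, "escalated": bool}
    """
    n = len(sessions)
    total_calls = sum(s["tool_calls"] for s in sessions)
    return {
        "avg_steps_per_task": sum(s["steps"] for s in sessions) / n,
        "tool_error_rate": sum(s["tool_errors"] for s in sessions) / total_calls,
        "avg_token_burn": sum(s["tokens"] for s in sessions) / n,
        "human_intervention_rate": sum(s["escalated"] for s in sessions) / n,
    }

kpis = session_kpis([
    {"steps": 4, "tool_calls": 3, "tool_errors": 0,
     "tokens": 1200, "escalated": False},
    {"steps": 12, "tool_calls": 9, "tool_errors": 3,
     "tokens": 9800, "escalated": True},
])
```

Averages hide runaway sessions, so in practice you would also alert on per-session maximums, not just the means shown here.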

Replay Debugging

The most advanced debugging technique in AgentOps is Deterministic Replay. By saving the exact tool outputs and external API responses from a failed session, you can re-run the agent locally against the mock data. This allows you to tweak the system prompt and verify if the agent would handle the situation correctly "the second time around."
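A replay harness can be as simple as a cursor over recorded tool outputs that fails loudly when the replayed agent diverges from the recording. A minimal sketch, with a hypothetical one-JSON-object-per-line recording format:

```python
import json

class ReplayHarness:
    """Serve recorded tool outputs instead of hitting live APIs."""

    def __init__(self, recording_path):
        # Assumed recording format: one JSON object per line,
        # {"tool": ..., "args": ..., "output": ...}, in call order
        with open(recording_path) as f:
            self.recorded = [json.loads(line) for line in f]
        self.cursor = 0

    def call(self, tool, args):
        """Return the recorded output; fail loudly if the agent diverges."""
        entry = self.recorded[self.cursor]
        self.cursor += 1
        if entry["tool"] != tool:
            raise AssertionError(
                f"Replay diverged at call {self.cursor}: recording has "
                f"{entry['tool']!r}, agent called {tool!r}"
            )
        return entry["output"]
```

Wire the agent's tool dispatcher to `harness.call` instead of the real tools, tweak the system prompt, and re-run: if the agent now makes a different tool call, the divergence error tells you exactly where its behavior changed.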

Advanced Telemetry: LangGraph State Inspection

If you are using state-machine architectures like LangGraph, debugging becomes slightly easier because the agent's logic is explicitly separated from the LLM prompt. By injecting breakpoints into the Graph's edges, you can inspect the exact JSON state before the agent acts.

from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")
config = {"configurable": {"thread_id": "ticket-12345"}}  # identifies this session's checkpoint thread

# Compile the graph so it pauses before the dangerous node
app = workflow.compile(checkpointer=memory, interrupt_before=["execute_trade"])

# The graph runs and stops right before the dangerous action
result = app.invoke({"ticket_id": "12345"}, config)

# We can inspect what the agent PLANS to do
print(app.get_state(config).values["proposed_action"])

By combining interruption breakpoints with human-in-the-loop approvals, you can monitor agents safely in production while gathering a dataset of "agent mistakes vs human corrections" to fine-tune future models.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK