How to Monitor and Debug LLM Agents in Production

Debugging an autonomous agent feels like investigating a crime scene. You arrive after the event, armed only with logs, trying to piece together the agent's "thought process."

The Anatomy of an Agent Failure

Traditional software fails via exceptions or crashes. LLM agents usually fail silently: no error is raised, yet the workflow either completes with a hallucinated document or degenerates into an endless loop of retries.

The Infinite Loop

The agent encounters an unexpected API error, fails to parse it, and repeatedly attempts the exact same action, burning through tokens until a hard timeout hits.
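One way to catch this pattern early is to fingerprint every tool call and stop the run once the same call repeats too often. The sketch below is a minimal illustration, not a feature of any particular framework; the class name `RepeatedActionGuard` and the `max_repeats` default are assumptions made for this example.

```python
import hashlib
import json
from collections import Counter


class RepeatedActionGuard:
    """Hypothetical guard: stops a run when one tool call repeats too often."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def _fingerprint(self, tool_name: str, arguments: dict) -> str:
        # Sort keys so logically identical calls produce the same hash.
        payload = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def check(self, tool_name: str, arguments: dict) -> None:
        """Record one attempted call; raise once it exceeds max_repeats."""
        key = self._fingerprint(tool_name, arguments)
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise RuntimeError(
                f"Loop suspected: {tool_name} called {self._seen[key]} times "
                "with identical arguments"
            )
```

Calling `guard.check(tool_name, arguments)` before each tool execution turns a silent token burn into an explicit error well before the hard timeout.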

Hallucinated Parameters

The agent decides to call a database query tool but invents table names or SQL syntax that completely break the downstream system.
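A common mitigation is to have the tool accept structured arguments instead of raw SQL and to validate them against the known schema before anything reaches the database. The following is a minimal sketch under that assumption; `KNOWN_SCHEMA`, its table and column names, and the argument shape (`table`, `columns`, `limit`) are all hypothetical.

```python
# Hypothetical schema: table name -> set of valid column names.
KNOWN_SCHEMA = {
    "orders": {"id", "customer_id", "total"},
    "customers": {"id", "email"},
}


def validate_query_args(args: dict, schema: dict = KNOWN_SCHEMA) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    table = args.get("table")
    if table not in schema:
        # Without a valid table there is nothing else worth checking.
        return [f"unknown table: {table!r}"]

    errors = []
    unknown = sorted(set(args.get("columns", [])) - schema[table])
    if unknown:
        errors.append(f"unknown columns on {table}: {unknown}")

    limit = args.get("limit", 100)
    if not isinstance(limit, int) or not 1 <= limit <= 1000:
        errors.append(f"limit must be an integer between 1 and 1000, got {limit!r}")
    return errors
```

Returning the error list to the agent as the tool's output, rather than raising, gives the model a chance to correct its own call on the next step.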

Implementing Agent Telemetry

To debug an agent, you need context-rich telemetry. You must intercept the workflow at three critical boundaries:

1. The LLM Boundary (Prompt/Completion Tracing)

Record exactly what was sent to the LLM (including the entire system prompt, injected memory, and conversation history) and what raw string or JSON the LLM returned.

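# Import paths below are for the Langfuse Python SDK v2; newer SDK versions
# expose observe from the top-level langfuse package instead.
from langfuse.decorators import observe
from langfuse.openai import OpenAI  # Langfuse's drop-in wrapper around the OpenAI client

# Assumes OPENAI_API_KEY and the LANGFUSE_* credentials are set in the environment
llm_client = OpenAI()

@observe()
def execute_agent_step(task, context):
    # @observe() records this function's inputs (task, context), its return
    # value, and its latency. The wrapped client nests the model call inside
    # that trace, including the messages sent and the token usage.
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": task}]
    )
    return response.choices[0].message.content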

2. The Tool Execution Boundary

When the agent invokes an external tool (e.g., Python REPL, Web Search), you must capture the exact arguments passed to the tool and the raw data the tool returned to the agent.

Debugging Tip: If an agent makes a weird decision in Step 4, inspect the Tool Output from Step 3 first. The usual culprit is output that was malformed, improperly paginated, or too long and dense for the LLM to use reliably.
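Capturing this boundary can be as simple as wrapping every tool in a decorator that logs one structured record per call. The sketch below uses only the standard library; the `traced_tool` decorator, the `agent.tools` logger name, and the `web_search` stand-in are assumptions for illustration, not part of any agent framework.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("agent.tools")


def traced_tool(func):
    """Log the arguments, raw output or error, and latency of each tool call."""

    @functools.wraps(func)
    def wrapper(**kwargs):
        record = {"tool": func.__name__, "arguments": kwargs}
        started = time.perf_counter()
        try:
            result = func(**kwargs)
            record["output"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and failure, so every call leaves a record.
            record["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
            logger.info(json.dumps(record, default=str))

    return wrapper


@traced_tool
def web_search(query: str, max_results: int = 3) -> list:
    # Hypothetical stand-in for a real search tool.
    return [f"result {i} for {query}" for i in range(max_results)]
```

Tools are called with keyword arguments only, which keeps the logged record aligned with the argument names the LLM actually produced.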

3. The Vector Retrieval Boundary

If your agent uses RAG to fetch context before deciding, log the retrieved chunks and their similarity scores. Many apparent agent "hallucinations" turn out to be the agent accurately summarizing bad data retrieved from the vector database.
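As a rough illustration of what such a retrieval log could contain, the sketch below scores each chunk against the query embedding and flags weak matches. It assumes chunks arrive as `(chunk_id, text, vector)` tuples and that cosine similarity is the metric; the function names and the `min_score` default of 0.75 are hypothetical and would need tuning for a real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_retrieval_log(query_vec, chunks, min_score=0.75):
    """Return one log record per retrieved chunk, best match first."""
    records = []
    for chunk_id, text, vec in chunks:
        score = cosine_similarity(query_vec, vec)
        records.append(
            {
                "chunk_id": chunk_id,
                "score": round(score, 4),
                "preview": text[:80],  # enough to eyeball relevance in a trace
                "below_threshold": score < min_score,
            }
        )
    records.sort(key=lambda r: r["score"], reverse=True)
    return records
```

Storing the text preview next to the score matters because a high score alone does not show whether the chunk was actually useful.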

Building an Agent Debugging Dashboard

In a production environment, you should be tracking these specific KPIs on your dashboard. For each metric, set an alert threshold so your on-call engineer is paged before a runaway agent burns through your entire monthly API budget:

KPI	Healthy Baseline	Alert Threshold	Likely Root Cause
Avg Steps / Task	< 8 steps	> 15 steps	Infinite loop, missing tool, or ambiguous task
Tool Error Rate	< 2 %	> 8 %	Schema drift, downstream API change, or bad prompt
Token Burn / Session	< 12 k tokens	> 50 k tokens	Runaway retry loop or oversized memory injection
Human Escalation Rate	< 5 %	> 20 %	Task scope too broad or confidence threshold for autonomous action set too high
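The alert thresholds in the table can be encoded as a small check that runs over each metrics snapshot. This is a minimal sketch: the metric key names and the `breached_kpis` function are assumptions for illustration, and the paging integration itself is left out.

```python
# Alert thresholds from the table above; rates are expressed as fractions.
ALERT_THRESHOLDS = {
    "avg_steps_per_task": 15,
    "tool_error_rate": 0.08,
    "tokens_per_session": 50_000,
    "human_escalation_rate": 0.20,
}


def breached_kpis(metrics: dict, thresholds: dict = ALERT_THRESHOLDS) -> list:
    """Return the names of all KPIs whose value exceeds its alert threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        # A missing metric is treated as 0 here; a stricter version
        # might alert on missing data instead.
        if metrics.get(name, 0) > limit
    )
```

A non-empty result is the signal to page; the table's "Likely Root Cause" column tells the on-call engineer where to look first.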

Incident Response Runbook

When an alert fires, follow this 5-step runbook to triage in under 10 minutes:

1. Identify the session: Pull the trace ID from your alerting system and open it in Langfuse / LangSmith.
2. Check the last tool call: Did the tool return an error? Was the argument schema correct? 80 % of failures are explained at this step.
3. Check vector retrieval: Were the retrieved chunks relevant? High cosine scores don't always mean useful content.
4. Replay locally: Use saved tool outputs as mocks and re-run the agent graph locally with the original user prompt.
5. Fix the root cause: Patch the tool description, adjust the system prompt, or add a guard rail. Then add this case to your golden evaluation dataset.

Replay Debugging

The most advanced debugging technique in AgentOps is Deterministic Replay. By saving the exact tool outputs and external API responses from a failed session, you can re-run the agent locally against that recorded data instead of live systems. Only the environment is deterministic: the LLM's own output can still vary between runs. This lets you tweak the system prompt and verify whether the agent handles the situation correctly "the second time around."
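A replay harness can be sketched as a stand-in tool layer that serves the recorded outputs in order and fails loudly when the agent departs from the recorded session. The record shape (`tool`, `arguments`, `output`) and the names `ReplayTools` and `ReplayDivergence` are assumptions for this example.

```python
class ReplayDivergence(Exception):
    """Raised when the replayed agent stops following the recorded session."""


class ReplayTools:
    """Serves recorded tool outputs in order instead of calling real tools."""

    def __init__(self, recorded_calls):
        self._calls = list(recorded_calls)
        self._cursor = 0

    def call(self, tool_name, arguments):
        if self._cursor >= len(self._calls):
            raise ReplayDivergence(
                f"no recording left for call {self._cursor}: {tool_name}"
            )
        recorded = self._calls[self._cursor]
        if recorded["tool"] != tool_name or recorded["arguments"] != arguments:
            raise ReplayDivergence(
                f"step {self._cursor}: recorded {recorded['tool']}"
                f"({recorded['arguments']}), agent tried {tool_name}({arguments})"
            )
        self._cursor += 1
        return recorded["output"]

    @property
    def exhausted(self):
        """True once every recorded call has been replayed."""
        return self._cursor == len(self._calls)
```

A divergence is not necessarily a failure. After a prompt change, it marks the exact step where the agent's behavior changed, which is usually the step worth inspecting.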

Advanced Telemetry: LangGraph State Inspection

If you are using state-machine architectures like LangGraph, debugging becomes slightly easier because the agent's logic is explicitly separated from the LLM prompt. By setting breakpoints before specific nodes in the graph, you can inspect the exact state before the agent acts.

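from langgraph.checkpoint.sqlite import SqliteSaver

# `workflow` is your StateGraph, built elsewhere.
# The checkpointer keys saved state by thread_id, so every call needs one.
config = {"configurable": {"thread_id": "ticket-12345"}}

# In recent langgraph-checkpoint-sqlite releases, from_conn_string is a context manager
with SqliteSaver.from_conn_string(":memory:") as memory:
    app = workflow.compile(checkpointer=memory, interrupt_before=["execute_trade"])

    # The graph runs and stops right before the dangerous action
    result = app.invoke({"ticket_id": "12345"}, config)

    # We can inspect what the agent PLANS to do
    print(app.get_state(config).values["proposed_action"])

    # After approval, resume from the checkpoint by passing None as the input:
    # result = app.invoke(None, config)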

By combining interruption breakpoints with human-in-the-loop approvals, you can monitor agents safely in production while gathering a dataset of "agent mistakes vs human corrections" to fine-tune future models.
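One lightweight way to collect that dataset is to append a record each time a reviewer approves or changes a proposed action. The sketch below writes JSON Lines with the standard library; the field names and both helper functions are hypothetical, and a real pipeline would likely store more context from the trace.

```python
import json
from datetime import datetime, timezone


def build_correction_record(trace_id, proposed_action, final_action, reviewer):
    """Pair what the agent proposed with what the human actually approved."""
    return {
        "trace_id": trace_id,
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
        "reviewer": reviewer,
        "proposed_action": proposed_action,
        "final_action": final_action,
        # True when the reviewer changed the agent's plan.
        "was_corrected": proposed_action != final_action,
    }


def append_jsonl(path, record):
    """Append one record as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Keeping the approved-without-change cases alongside the corrections gives later evaluation or fine-tuning work both positive and negative examples.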