LangSmith: X-Ray for Agents
Dec 30, 2025 • 20 min read
Agents loop. Agents retry. Agents call tools, get bad responses, recover, and sometimes spiral into infinite loops. When an agent fails, it usually fails silently: five steps deep in a chain with no obvious error trace. LangSmith gives you a visual waterfall of every LLM call, every tool invocation, every token spent, and every intermediate reasoning step in your agent's execution, turning debugging sessions that would take hours in raw logs into minutes in the UI.
1. One-Line Integration
# Sign up free at smith.langchain.com → Create API key
# .env file (or set in your environment)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...your_key
LANGCHAIN_PROJECT=my-rag-agent # Groups all traces for this project
# That's it! Every LangChain call is now traced automatically.
# Re-run your existing agent code — it sends traces to LangSmith.
# What gets captured automatically:
# ✓ Full prompts (with system prompt, chat history)
# ✓ Model used + temperature + max_tokens
# ✓ Input/output tokens + cost estimate
# ✓ Latency per step
# ✓ Tool names + inputs + outputs
# ✓ Retriever queries + results
# ✓ Full chain call tree (parent → child relationships)
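If you'd rather configure tracing from code than from a .env file, the same three variables can be set with os.environ before your chains run. A minimal sketch (the key shown is a placeholder, not a real key):

```python
import os

# Same switches as the .env file above, set from Python instead.
# These must be set before your chains execute so tracing picks them up.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__placeholder"  # placeholder, not a real key
os.environ["LANGCHAIN_PROJECT"] = "my-rag-agent"
```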
# ✓ Any errors with full tracebacks
2. The @traceable Decorator for Custom Code
You don't need to use LangChain's built-in chains to benefit from LangSmith. The @traceable decorator wraps any Python function:
from langsmith import traceable
from openai import OpenAI
client = OpenAI()
# Trace individual tools
@traceable(run_type="tool", name="search_product_db")
def search_products(query: str, max_results: int = 5) -> list[dict]:
    """This function call will appear as a 'Tool' span in LangSmith."""
    results = product_database.search(query, limit=max_results)
    return results  # LangSmith records the query + results
# Trace the retrieval step separately
@traceable(run_type="retriever", name="rag_retriever")
def retrieve_chunks(query: str) -> list[str]:
    chunks = vectordb.similarity_search(query, k=5)
    return [c.page_content for c in chunks]
# Trace the main LLM call
@traceable(run_type="llm", name="answer_generation")
def generate_answer(context: list[str], question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer based on the provided context only."
        }, {
            "role": "user",
            "content": f"Context:\n{chr(10).join(context)}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content
# Top-level chain — LangSmith shows all sub-calls nested under this
@traceable(run_type="chain", name="rag_pipeline")
def rag_query(question: str) -> str:
    # LangSmith shows: rag_pipeline → rag_retriever, answer_generation
    chunks = retrieve_chunks(question)
    answer = generate_answer(chunks, question)
    return answer
3. Debugging Common Agent Failures
# In LangSmith UI: click any failed trace and look for these patterns:
# Failure Pattern 1: Tool Output Parsing Error
# In trace, look for a 'parse_tool_call' span with error
# Root cause: LLM returned non-JSON when you expected JSON
# Fix: Add explicit JSON instruction + response_format={"type": "json_object"}
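The response_format flag fixes this on the model side; on the parsing side, a defensive extractor that strips markdown fences before calling json.loads keeps the chain from crashing. A minimal sketch (the helper name is illustrative, not a library function):

```python
import json
import re

def parse_tool_json(raw: str) -> dict:
    """Best-effort JSON extraction from an LLM reply.

    Handles the two failure modes most often visible in traces:
    a ```json fenced block, and stray prose around the object.
    """
    text = raw.strip()
    # Strip a markdown code fence if the model added one
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the first {...} span in the text
        brace = re.search(r"\{.*\}", text, re.DOTALL)
        if brace:
            return json.loads(brace.group(0))
        raise
```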
# Failure Pattern 2: Infinite Tool Loop
# Look for repeated tool calls in the trace (same tool called 5+ times)
# Root cause: No max_iterations set, LLM stuck in reasoning loop
# Fix:
from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=10,              # Hard limit on tool calls
    max_execution_time=60,          # Timeout after 60 seconds
    early_stopping_method="force",  # Force an answer after max_iterations
    handle_parsing_errors=True,     # Don't crash on format errors
)
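If you roll your own tool loop instead of using AgentExecutor, the same guard is a few lines. A sketch with hypothetical call_llm and run_tool callables (not real APIs):

```python
def run_agent_loop(call_llm, run_tool, question: str,
                   max_iterations: int = 10) -> str:
    """Hand-rolled tool loop with the same hard iteration cap
    that AgentExecutor's max_iterations provides."""
    scratchpad = []
    for _ in range(max_iterations):
        action = call_llm(question, scratchpad)
        if action["type"] == "final_answer":
            return action["content"]
        observation = run_tool(action["tool"], action["input"])
        scratchpad.append((action, observation))
    # Budget exhausted: force a stop instead of looping forever
    return "Agent stopped: iteration limit reached."
```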
# Failure Pattern 3: Retriever Returns Irrelevant Chunks
# In trace, look at 'retriever' spans → expand "outputs"
# Compare retrieved chunks to the user's actual question
# Fix: Check your embedding model, increase k, add reranking
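Production reranking usually means a cross-encoder (Cohere Rerank, a sentence-transformers CrossEncoder), but the idea can be shown with a simple lexical-overlap scorer. A toy sketch, not a real reranker:

```python
def rerank_by_overlap(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Toy reranker: score each chunk by word overlap with the question.
    Stands in for a real cross-encoder reranking stage."""
    q_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(q_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]
```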
# Failure Pattern 4: Token Budget Exceeded
# LangSmith shows > 100k input tokens for a single call
# Root cause: Chat history not being truncated
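Before reaching for a memory class, you can confirm the diagnosis with a rough truncation pass over the raw message list. A sketch assuming roughly 4 characters per token (a crude estimate; real counting would use a tokenizer):

```python
def truncate_history(messages: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Drop the oldest non-system messages until a rough token
    estimate (~4 characters per token) fits the budget."""
    def estimate(msgs):
        return sum(len(m["content"]) for m in msgs) // 4

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and estimate(system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest turn first
    return system + rest
```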
from langchain.memory import ConversationSummaryBufferMemory
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=3000,  # Summarize when history exceeds 3k tokens
)
4. Dataset Evaluation (Regression Testing)
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
ls_client = Client()
# Step 1: Create your golden dataset in LangSmith
# (or import from an existing trace by clicking "Add to Dataset")
dataset = ls_client.create_dataset(
    dataset_name="rag-qa-golden-set-v1",
    description="200 curated Q&A pairs with verified correct answers",
)
# Add examples
examples = [
    {"inputs": {"question": "What is our refund policy?"},
     "outputs": {"answer": "Our refund policy allows returns within 30 days..."}},
    # ... 199 more examples
]
ls_client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
# Step 2: Run evaluation when you change your prompt or model
def run_rag(inputs: dict) -> dict:
    """Your agent wrapped for evaluation."""
    answer = rag_query(inputs["question"])
    return {"answer": answer}
# Built-in evaluators
correctness_evaluator = LangChainStringEvaluator("labeled_criteria", config={
    "criteria": {"correctness": "Is this answer factually correct based on the reference?"}
})
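Alongside the built-in evaluators, evaluate also accepts plain functions; the classic signature takes the run and the reference example. A sketch of a simple exact-match evaluator (field names assume the dataset schema above):

```python
def exact_match(run, example) -> dict:
    """Custom evaluator: 1.0 if the model's answer equals the
    reference answer (case-insensitive), else 0.0."""
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    return {
        "key": "exact_match",
        "score": float(predicted.strip().lower() == reference.strip().lower()),
    }
```

Pass it in the same list as the built-in evaluator: evaluators=[correctness_evaluator, exact_match].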
results = evaluate(
    run_rag,
    data="rag-qa-golden-set-v1",
    evaluators=[correctness_evaluator],
    experiment_prefix="gpt4o-vs-haiku",  # Compare experiments in LangSmith UI
    num_repetitions=1,
)
5. Production Monitoring: Quality Drift Alerts
# LangSmith monitors production traces automatically
# Set up alerts in the UI: Rules → Create Rule
# Example alert rules:
# 1. "Alert if error rate > 5% in any 1-hour window"
# 2. "Alert if median latency > 5 seconds for 10 consecutive traces"
# 3. "Alert if LLM output length drops below 50 tokens" (truncation sign)
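Rule 1 above is just an error rate over a trailing time window; the arithmetic can be sketched in a few lines (thresholds mirror the example rule, the trace dicts are a hypothetical shape):

```python
from datetime import datetime, timedelta

def error_rate_alert(traces: list[dict], window_hours: float = 1.0,
                     threshold: float = 0.05) -> bool:
    """Return True if the error rate within the trailing window
    exceeds the threshold (the '5% in 1 hour' rule above)."""
    cutoff = datetime.utcnow() - timedelta(hours=window_hours)
    recent = [t for t in traces if t["timestamp"] >= cutoff]
    if not recent:
        return False
    errors = sum(1 for t in recent if t["error"])
    return errors / len(recent) > threshold
```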
# Add user feedback to production traces
from langsmith import Client
ls_client = Client()
# Your API route: POST /api/feedback
async def record_feedback(trace_id: str, score: int, comment: str):
    ls_client.create_feedback(
        run_id=trace_id,  # From X-LangSmith-Trace response header
        key="user_satisfaction",
        score=score,      # 0 or 1 (thumbs down/up)
        comment=comment,
    )
# LangSmith shows feedback % in the dashboard — track quality over time
Frequently Asked Questions
Is LangSmith free?
LangSmith offers a generous free Developer tier with 3,000 traces/month and 1 project. The Team plan ($39/seat/month) adds unlimited traces, custom evaluators, annotation queues, and team sharing. For monitoring production AI applications, the Team plan is essentially mandatory — 3,000 traces/month is consumed in a few hours by any real application.
Does LangSmith work with non-LangChain code?
Yes — the @traceable decorator works with any Python code: raw OpenAI calls, Anthropic calls, custom retrieval code, custom agents. You can even trace LlamaIndex pipelines by wrapping them with @traceable. The LangSmith SDK is model-agnostic — it doesn't require LangChain at all.
Conclusion
LangSmith transforms agent debugging from log archaeology into interactive investigation. The combination of automatic LangChain tracing, the @traceable decorator for custom code, dataset evaluation for regression testing, and production monitoring adds up to a complete observability platform for LLM applications. For teams shipping agents to production, LangSmith can cut debugging time on production incidents by 80% or more: the visual call tree makes immediately obvious the failures that would otherwise take hours to trace through logs.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.