
LangSmith: X-Ray for Agents

Dec 30, 2025 • 20 min read

Agents loop. Agents retry. Agents call tools, get bad responses, recover, and sometimes spiral into infinite loops. When an agent fails, it usually fails silently — five steps deep in a chain with no obvious error trace. LangSmith gives you a visual waterfall of every LLM call, every tool invocation, every token spent, and every intermediate reasoning step in your agent's execution, so debugging that would take hours of combing logs takes minutes in the UI.

1. One-Line Integration

# Sign up free at smith.langchain.com → Create API key

# .env file (or set in your environment)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...your_key
LANGCHAIN_PROJECT=my-rag-agent  # Groups all traces for this project

# That's it! Every LangChain call is now traced automatically.
# Re-run your existing agent code — it sends traces to LangSmith.

# What gets captured automatically:
# ✓ Full prompts (with system prompt, chat history)
# ✓ Model used + temperature + max_tokens
# ✓ Input/output tokens + cost estimate
# ✓ Latency per step
# ✓ Tool names + inputs + outputs
# ✓ Retriever queries + results
# ✓ Full chain call tree (parent → child relationships)
# ✓ Any errors with full tracebacks
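The same configuration can also be set from Python before the first traced call, which is handy in notebooks where a .env file isn't loaded automatically (the key below is a placeholder):

```python
import os

# Programmatic equivalent of the .env file above. Must run before the
# first LangChain/LangSmith call in the process. Placeholder key shown.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__...your_key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-agent"
```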

2. The @traceable Decorator for Custom Code

You don't need to use LangChain's built-in chains to benefit from LangSmith. The @traceable decorator wraps any Python function:

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

# Trace individual tools
@traceable(run_type="tool", name="search_product_db")
def search_products(query: str, max_results: int = 5) -> list[dict]:
    """This function call will appear as a 'Tool' span in LangSmith."""
    results = product_database.search(query, limit=max_results)
    return results  # LangSmith records the query + results

# Trace the retrieval step separately
@traceable(run_type="retriever", name="rag_retriever")  
def retrieve_chunks(query: str) -> list[str]:
    chunks = vectordb.similarity_search(query, k=5)
    return [c.page_content for c in chunks]

# Trace the main LLM call
@traceable(run_type="llm", name="answer_generation")
def generate_answer(context: list[str], question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer based on the provided context only."
        }, {
            "role": "user",
            "content": f"Context:\n{chr(10).join(context)}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content

# Top-level chain — LangSmith shows all sub-calls nested under this
@traceable(run_type="chain", name="rag_pipeline")
def rag_query(question: str) -> str:
    # LangSmith shows: rag_pipeline → rag_retriever, answer_generation
    chunks = retrieve_chunks(question)
    answer = generate_answer(chunks, question)
    return answer

3. Debugging Common Agent Failures

# In LangSmith UI: click any failed trace and look for these patterns:

# Failure Pattern 1: Tool Output Parsing Error
# In trace, look for a 'parse_tool_call' span with error
# Root cause: LLM returned non-JSON when you expected JSON
# Fix: Add explicit JSON instruction + response_format={"type": "json_object"}
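As a client-side stopgap for the first pattern, a tolerant parser can recover JSON that the model wrapped in prose. A stdlib sketch (the parse_tool_args name is hypothetical, not a LangChain helper):

```python
import json
import re

def parse_tool_args(raw: str) -> dict:
    """Parse tool arguments from an LLM reply, tolerating extra prose.

    Try strict JSON first; if that fails, extract the first {...} block.
    Raises ValueError if no JSON object can be recovered.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError(f"No JSON object found in: {raw[:80]!r}")

# A reply wrapped in prose still parses:
parse_tool_args('Sure! Here you go: {"query": "refunds", "k": 5}')
# → {'query': 'refunds', 'k': 5}
```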

# Failure Pattern 2: Infinite Tool Loop  
# Look for repeated tool calls in the trace (same tool called 5+ times)
# Root cause: No max_iterations set, LLM stuck in reasoning loop
# Fix:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=10,             # Hard limit on tool calls
    max_execution_time=60,         # Timeout after 60 seconds  
    early_stopping_method="force", # Force an answer after max_iterations
    handle_parsing_errors=True,    # Don't crash on format errors
)
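If you've hand-rolled your agent loop instead of using AgentExecutor, the same two guardrails are easy to sketch yourself. Everything below (the run_agent_loop name and the step callable contract) is hypothetical scaffolding, not a LangChain API:

```python
import time

def run_agent_loop(step, max_iterations=10, max_execution_time=60.0):
    """Run a custom agent loop with a hard iteration cap and a wall-clock
    budget, mirroring AgentExecutor's max_iterations / max_execution_time.

    `step(i)` is your agent-step callable and must return either
    ("answer", text) or ("tool_call", payload).
    """
    deadline = time.monotonic() + max_execution_time
    for i in range(max_iterations):
        if time.monotonic() > deadline:
            return "Stopped: time budget exhausted."
        kind, value = step(i)
        if kind == "answer":
            return value
    # Like early_stopping_method="force": stop and return something usable
    return "Stopped: max iterations reached."
```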

# Failure Pattern 3: Retriever Returns Irrelevant Chunks
# In trace, look at 'retriever' spans → expand "outputs"
# Compare retrieved chunks to the user's actual question
# Fix: Check your embedding model, increase k, add reranking
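Before wiring in a real reranker, a naive lexical rerank is a quick way to confirm the diagnosis: if word overlap alone fixes the ordering, your embeddings are the problem. A stdlib sketch (a production system would use a cross-encoder reranker instead):

```python
def rerank_by_overlap(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Re-order retrieved chunks by word overlap with the query.

    A naive lexical reranker: crude, but it catches retrievals that
    share no vocabulary with the question at all.
    """
    q_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]
```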

# Failure Pattern 4: Token Budget Exceeded
# LangSmith shows > 100k input tokens for a single call
# Root cause: Chat history not being truncated
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=3000,  # Summarize when history exceeds 3k tokens
)
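If you're not using LangChain's memory classes, the same budget idea is easy to hand-roll. A sketch using the rough 4-characters-per-token heuristic (swap in a real tokenizer like tiktoken for exact counts):

```python
def truncate_history(messages: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Keep the system message plus the newest messages that fit a rough
    token budget (~4 characters per token). Assumes messages[0] is the
    system message, as in the chat format used above.
    """
    system, rest = messages[:1], messages[1:]
    kept, budget = [], max_tokens * 4  # budget expressed in characters
    for msg in reversed(rest):         # walk newest to oldest
        budget -= len(msg["content"])
        if budget < 0:
            break
        kept.append(msg)
    return system + list(reversed(kept))
```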

4. Dataset Evaluation (Regression Testing)

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

ls_client = Client()

# Step 1: Create your golden dataset in LangSmith
# (or import from an existing trace by clicking "Add to Dataset")
dataset = ls_client.create_dataset(
    dataset_name="rag-qa-golden-set-v1",
    description="200 curated Q&A pairs with verified correct answers"
)

# Add examples
examples = [
    {"inputs": {"question": "What is our refund policy?"}, 
     "outputs": {"answer": "Our refund policy allows returns within 30 days..."}},
    # ... 199 more examples
]
ls_client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)

# Step 2: Run evaluation when you change your prompt or model
def run_rag(inputs: dict) -> dict:
    """Your agent wrapped for evaluation."""
    answer = rag_query(inputs["question"])
    return {"answer": answer}

# Built-in evaluators
correctness_evaluator = LangChainStringEvaluator("labeled_criteria", config={
    "criteria": {"correctness": "Is this answer factually correct based on the reference?"}
})

results = evaluate(
    run_rag,
    data="rag-qa-golden-set-v1",
    evaluators=[correctness_evaluator],
    experiment_prefix="gpt4o-vs-haiku",  # Compare experiments in LangSmith UI
    num_repetitions=1,
)
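Built-in evaluators can be mixed with plain Python functions: evaluate() also accepts any callable that takes a run and its reference example and returns a score dict. A sketch, with stand-in objects for local testing (the SimpleNamespace values below mimic LangSmith's Run/Example just enough to try the function):

```python
from types import SimpleNamespace

def contains_reference(run, example) -> dict:
    """Score 1 if the reference answer's opening phrase appears in the
    model's output. Pass it in evaluate(..., evaluators=[contains_reference])
    alongside the built-in evaluators.
    """
    prediction = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    hit = reference[:30].lower() in prediction.lower()
    return {"key": "contains_reference", "score": int(hit)}

# Local smoke test with stand-ins for LangSmith's Run and Example objects:
run = SimpleNamespace(outputs={"answer": "Our refund policy allows returns within 30 days."})
example = SimpleNamespace(outputs={"answer": "Our refund policy allows returns"})
contains_reference(run, example)  # {"key": "contains_reference", "score": 1}
```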

5. Production Monitoring: Quality Drift Alerts

# LangSmith monitors production traces automatically
# Set up alerts in the UI: Rules → Create Rule

# Example alert rules:
# 1. "Alert if error rate > 5% in any 1-hour window"
# 2. "Alert if median latency > 5 seconds for 10 consecutive traces"  
# 3. "Alert if LLM output length drops below 50 tokens" (truncation sign)

# Add user feedback to production traces
from langsmith import Client
ls_client = Client()

# Your API route: POST /api/feedback
async def record_feedback(trace_id: str, score: int, comment: str):
    ls_client.create_feedback(
        run_id=trace_id,         # Run ID captured server-side, e.g. via get_current_run_tree()
        key="user_satisfaction",
        score=score,             # 0 or 1 (thumbs down/up)
        comment=comment,
    )
# LangSmith shows feedback % in the dashboard — track quality over time

Frequently Asked Questions

Is LangSmith free?

LangSmith offers a generous free Developer tier with 3,000 traces/month and 1 project. The Team plan ($39/seat/month) adds unlimited traces, custom evaluators, annotation queues, and team sharing. For monitoring production AI applications, the Team plan is essentially mandatory — 3,000 traces/month is consumed in a few hours by any real application.

Does LangSmith work with non-LangChain code?

Yes — the @traceable decorator works with any Python code: raw OpenAI calls, Anthropic calls, custom retrieval code, custom agents. You can even trace LlamaIndex pipelines by wrapping them with @traceable. The LangSmith SDK is model-agnostic — it doesn't require LangChain at all.

Conclusion

LangSmith transforms agent debugging from log archaeology into interactive investigation. The combination of automatic LangChain tracing, the @traceable decorator for custom code, dataset evaluation for regression testing, and production monitoring makes it a complete observability platform for LLM applications. For teams shipping agents to production, the visual call tree can cut incident debugging from hours of log tracing to minutes: failures that hide in logs are immediately obvious on screen.


Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

Tags: GPT-4o, LangChain, Next.js, Vector DBs, RAG, Vercel AI SDK