Observability: Tracing the Chain
Dec 30, 2025 • 20 min read
A RAG pipeline isn't a single function call — it's a chain: User → Router → Query Rewriter → Embedding → Vector DB → Reranker → LLM → Parser. If the full pipeline takes 8 seconds, which step is the bottleneck? If the answer quality is poor, is it the retrieval or the generation? Log-based debugging can't answer these questions. Distributed tracing gives you a waterfall chart of every step's latency, inputs, outputs, and token costs.
1. The Three Pillars of LLM Observability
- Traces: End-to-end execution paths through your pipeline. Captures the full sequence of operations for a single user request.
- Spans: Individual segments within a trace (e.g., "vector_search", "llm_generation"). Each span has a start time, duration, inputs, outputs, and metadata.
- Metrics: Aggregated measurements over time — P50/P95/P99 latency, token costs per request, cache hit rates, error rates.
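Percentile metrics like P50/P95/P99 are just quantiles over a window of latency samples. A minimal sketch using only Python's standard library (the sample values here are made up for illustration):

```python
import statistics

# Hypothetical per-request latencies (seconds) collected over a window
latencies = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 6.5, 1.2, 0.95]

# statistics.quantiles with n=100 returns 99 cut points;
# index 49 is P50, index 94 is P95, index 98 is P99
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.2f}s  P95={p95:.2f}s  P99={p99:.2f}s")
```

In production you would not compute these from raw samples in-process; a Prometheus histogram (section 6) records bucketed counts and the quantiles are estimated at query time.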
2. OpenTelemetry (OTEL): Instrument Your Code
OpenTelemetry is the vendor-neutral standard for distributed tracing. Instrument your RAG pipeline by wrapping each step in a span:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup — send traces to Jaeger or any OTLP-compatible backend
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def rag_pipeline(user_query: str) -> str:
    # Root span — represents the entire user request
    with tracer.start_as_current_span("rag_request") as root_span:
        root_span.set_attribute("user.query", user_query)
        root_span.set_attribute("session.id", "sess_abc123")

        # Span 1: Query embedding
        with tracer.start_as_current_span("embed_query") as embed_span:
            query_embedding = embedding_model.encode(user_query)
            embed_span.set_attribute("model", "text-embedding-3-small")
            embed_span.set_attribute("embedding.dim", len(query_embedding))

        # Span 2: Vector search
        with tracer.start_as_current_span("vector_search") as search_span:
            docs = chroma_collection.query(
                query_embeddings=[query_embedding],
                n_results=5
            )
            retrieved = docs["documents"][0]
            search_span.set_attribute("docs.retrieved", len(retrieved))
            search_span.set_attribute("collection.name", "product_docs")
            search_span.set_attribute("top_score", docs["distances"][0][0])

        # Span 3: Context reranking (optional)
        with tracer.start_as_current_span("reranking") as rerank_span:
            reranked_docs = reranker.rerank(user_query, retrieved)
            rerank_span.set_attribute("docs.after_rerank", len(reranked_docs))

        # Span 4: LLM generation
        with tracer.start_as_current_span("llm_generation") as llm_span:
            context = "\n".join(reranked_docs)
            response = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "Answer based on context: " + context},
                    {"role": "user", "content": user_query}
                ]
            )
            answer = response.choices[0].message.content
            llm_span.set_attribute("model", "gpt-4o")
            llm_span.set_attribute("tokens.prompt", response.usage.prompt_tokens)
            llm_span.set_attribute("tokens.completion", response.usage.completion_tokens)
            # Rough blended estimate; gpt-4o prices input and output tokens differently
            llm_span.set_attribute("cost.usd", response.usage.total_tokens * 0.000005)

        root_span.set_attribute("answer.length", len(answer))
        return answer

3. LangSmith: Zero-Config Tracing for LangChain Apps
If you use LangChain, LangSmith provides automatic tracing with zero instrumentation code:
# Set these environment variables before importing LangChain — that's it
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_PROJECT"] = "production-rag-v2"
# Now ALL LangChain calls are automatically traced to LangSmith
from langchain_openai import ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model="gpt-4o")
vectorstore = Chroma(...)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
# This call is now fully traced — you'll see every step in LangSmith dashboard
result = qa_chain.invoke({"query": "What is the return policy?"})
# LangSmith captures: input, all intermediate steps, retrieved docs,
# LLM prompt, response, tokens used, latency at each step, errors

4. Arize Phoenix: Local Trace Visualization
Arize Phoenix is a free, open-source, local trace viewer designed specifically for LLM apps:
pip install arize-phoenix openinference-instrumentation-openai
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Start local Phoenix server
session = px.launch_app()  # Opens at http://localhost:6006

# Export spans to Phoenix's OTLP HTTP endpoint — without an exporter,
# instrumented spans would never reach the Phoenix UI
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)

# Auto-instrument OpenAI calls (also supports LangChain, LlamaIndex)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
trace_api.set_tracer_provider(tracer_provider)

# Now all OpenAI calls appear in Phoenix UI with:
# - Full prompt text and response
# - Token counts and costs
# - Latency breakdown
# - Embeddings visualized in 2D (UMAP projection)
# - RAG document retrieval context

5. Setting Up Jaeger for OTEL Traces
# docker-compose.yml — local Jaeger setup
services:
  jaeger:
    image: jaegertracing/all-in-one:1.52
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC receiver
      - "4318:4318"    # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

# Access Jaeger UI at http://localhost:16686
# You'll see a waterfall chart like:
# rag_request ──────────────────────────── 4,850ms
# ├── embed_query ──── 120ms
# ├── vector_search ── 90ms
# ├── reranking ────── 340ms
# └── llm_generation ──────────────────── 4,300ms ← BOTTLENECK

6. Prometheus Metrics for Aggregated Monitoring
from prometheus_client import Histogram, Counter, start_http_server
import time

# Define metrics
rag_latency = Histogram(
    'rag_request_duration_seconds',
    'End-to-end RAG pipeline latency',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
rag_token_cost = Counter('rag_token_cost_usd', 'Total LLM token cost in USD')
retrieval_quality = Histogram('rag_top_doc_score', 'Top retrieved document relevance score')

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)

def instrumented_rag(query: str) -> str:
    start = time.time()
    docs = vectorstore.similarity_search_with_score(query, k=5)
    top_score = docs[0][1] if docs else 0
    retrieval_quality.observe(top_score)
    response = llm.invoke(build_prompt(query, docs))
    tokens = response.usage_metadata.get("total_tokens", 0)
    rag_token_cost.inc(tokens * 0.000005)  # rough blended per-token rate
    duration = time.time() - start
    rag_latency.observe(duration)
    return response.content

7. What to Alert On
- P95 latency > 5 seconds: Users abandon requests. Alert and investigate the slowest steps in traces.
- Top retrieval score < 0.6: The query returned poor matches — possible retrieval failure or missing documents.
- LLM error rate > 2%: Rate limit hits, context length exceeded, or model outages.
- Cost per request > $0.05: Unusual prompt lengths or excessive tool calls — investigate trace.
- Cache hit rate < 30%: Semantic cache not effective — tune similarity threshold.
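The thresholds above translate directly into Prometheus alerting rules. A sketch for the latency alert, assuming the metric names from section 6 and a standard Prometheus/Alertmanager setup (the group and alert names are illustrative):

```yaml
groups:
  - name: rag_alerts
    rules:
      - alert: RagHighP95Latency
        # Estimate P95 from the histogram buckets over the last 5 minutes
        expr: histogram_quantile(0.95, rate(rag_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "RAG P95 latency above 5s; check the slowest spans in Jaeger"
```

The `for: 10m` clause keeps a brief latency spike from paging anyone; the alert fires only if the condition holds continuously.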
Frequently Asked Questions
LangSmith vs Phoenix vs Jaeger — which should I use?
LangSmith for LangChain-based apps: zero setup and the best LangChain-specific features, but it's SaaS (trace data leaves your infrastructure). Arize Phoenix for privacy-sensitive local development or open-source OTEL tracing. Jaeger for production deployments with existing OTEL infrastructure; it integrates with Grafana and Prometheus.
How do I trace across microservices?
Use OTEL's trace context propagation. The parent span's trace and span IDs are passed to downstream services in the W3C traceparent HTTP header, allowing the full cross-service trace to be correlated in Jaeger. OTEL's HTTP instrumentation libraries (for example, the requests instrumentor on the client and the FastAPI or Flask instrumentors on the server) inject and extract this header automatically.
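The traceparent header follows the W3C Trace Context format: four hyphen-separated lowercase-hex fields (version, trace ID, parent span ID, flags). A minimal parser sketch to show what a downstream service actually receives (the helper name is mine, not part of any OTEL API):

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C Trace Context traceparent header into its fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,           # "00" for the current spec version
        "trace_id": trace_id,         # 16-byte trace ID, hex-encoded
        "parent_span_id": parent_id,  # 8-byte ID of the calling span
        "sampled": bool(int(flags, 16) & 0x01),  # sampling decision flag
    }

# Example header in the shape OTEL would propagate
fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(fields["trace_id"], fields["sampled"])
```

Every span the downstream service creates becomes a child of `parent_span_id` under the same `trace_id`, which is what lets Jaeger stitch the cross-service waterfall together.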
Conclusion
Observability transforms debugging AI pipelines from guesswork into a systematic process. A waterfall trace that shows vector search taking 90ms and LLM generation taking 4,300ms immediately tells you where to focus optimization effort. As your RAG system scales to thousands of users, the latency histograms and cost counters in Prometheus become essential tools for capacity planning and cost management. Don't wait for users to report slowness — build observability in from day one.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.