LLMOps: From Prototype to Production
Dec 30, 2025 • 20 min read
Building an AI agent demo that works 80% of the time in a Jupyter notebook takes a weekend. Building one that serves 10,000 concurrent users at under 500ms latency, costs less than $0.01 per query, doesn't hallucinate on edge cases, and fails gracefully under adversarial inputs — that's LLMOps. The gap between demo and production in AI systems is far larger than in traditional software, because LLM failures are probabilistic, context-dependent, and often silent. This guide covers the operational tooling you need to bridge that gap.
1. Tracing: The X-Ray for Agent Failures
# LangSmith: Automatic tracing for LangChain and any LLM calls
# Captures: inputs, outputs, latency, token counts, tool calls, and errors
# for every step in your agent's execution
pip install langsmith
# 1. Setup (just environment variables — no code changes needed for LangChain)
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__your_key_here"
os.environ["LANGCHAIN_PROJECT"] = "my-production-agent" # Groups runs in dashboard
# 2. Your existing LangChain code now automatically traces every step
from langchain_openai import ChatOpenAI  # the old langchain.chat_models path is deprecated
from langchain.agents import create_tool_calling_agent, AgentExecutor
llm = ChatOpenAI(model="gpt-4o")
# `tools` and `prompt` are your own tool list and agent prompt, defined elsewhere
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
# This call's entire trace is visible in LangSmith dashboard:
result = executor.invoke({"input": "What's the weather in Paris?"})
# Dashboard shows:
# - User input → retrieval chunks → constructed prompt → LLM response
# - Which tools were called and with what arguments
# - Token usage and cost breakdown per step
# - Total latency with per-step breakdown
# 3. Manual tracing for non-LangChain code
from langsmith import traceable
@traceable(name="custom_rag_pipeline", run_type="chain")
def my_custom_rag(query: str) -> str:
    """Any function decorated with @traceable is automatically traced."""
    chunks = retrieve_from_vector_db(query)

    @traceable(name="llm_synthesis", run_type="llm")
    def synthesize(context: str, question: str) -> str:
        return openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Context: {context}\n\n{question}"}],
        ).choices[0].message.content

    return synthesize("\n".join(chunks), query)
# 4. Add custom metadata to traces for filtering in dashboard
from langsmith import get_current_run_tree
# Call this inside a @traceable function; outside a traced run it returns None
run = get_current_run_tree()
if run:
    run.metadata["user_id"] = "user_12345"
    run.metadata["session_id"] = "session_abc"
    run.metadata["environment"] = "production"
2. Evaluations: CI/CD for AI Quality
pip install ragas datasets
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Does the answer stay true to retrieved context?
    answer_relevancy,   # Does the answer address the question?
    context_recall,     # Did retrieval fetch the right documents?
    context_precision,  # Were the retrieved chunks actually relevant to the question?
)
from datasets import Dataset
# Build your evaluation dataset (50-200 Q&A pairs with golden answers)
eval_data = {
    "question": [
        "What is our refund policy?",
        "How long does shipping take?",
        "Do you offer volume discounts?",
    ],
    "answer": [
        your_agent.query("What is our refund policy?"),
        your_agent.query("How long does shipping take?"),
        your_agent.query("Do you offer volume discounts?"),
    ],
    "contexts": [
        retrieve_context("What is our refund policy?"),
        retrieve_context("How long does shipping take?"),
        retrieve_context("Do you offer volume discounts?"),
    ],
    "ground_truth": [
        "30-day full refund policy, no questions asked",
        "3-5 business days standard, 1-2 days express",
        "10% off for orders over $500, 20% off for orders over $2000",
    ],
}
dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset=dataset, metrics=[
    faithfulness, answer_relevancy, context_recall, context_precision,
])
print(results)
# {'faithfulness': 0.94, 'answer_relevancy': 0.88,
# 'context_recall': 0.91, 'context_precision': 0.87}
# CI/CD integration: Fail deployment if quality drops
THRESHOLDS = {
    "faithfulness": 0.90,      # <90% means the model is hallucinating
    "answer_relevancy": 0.85,  # <85% means answers are drifting off-topic
    "context_recall": 0.88,    # <88% means retrieval is missing relevant docs
}
for metric, threshold in THRESHOLDS.items():
    score = results[metric]
    if score < threshold:
        raise ValueError(f"EVAL FAILURE: {metric} = {score:.2f} (threshold: {threshold})")
# This blocks deployment if quality regresses after prompt/model changes
print("All evals passed — safe to deploy")
3. Semantic Caching: 50-70% Latency and Cost Reduction
pip install redis langchain-community openai
from langchain_community.cache import RedisSemanticCache  # older releases: langchain.cache
from langchain_openai import OpenAIEmbeddings
from langchain.globals import set_llm_cache
# Traditional exact-match caching doesn't work for LLMs:
# "Who is Apple's CEO?" and "Who runs Apple?" are different strings but semantically identical
# Semantic caching uses vector similarity to find equivalent queries:
embeddings = OpenAIEmbeddings()
# Set global LangChain cache — automatically applied to all LLM calls
set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=embeddings,
    score_threshold=0.2,  # Maximum vector distance for a cache hit in recent
                          # langchain-community releases (lower = stricter matching;
                          # check your version's docs before tuning).
                          # Loosening it gives more hits but more false positives;
                          # tightening it gives fewer hits but safer answers.
))
# Now every LLM call checks the cache first:
llm = ChatOpenAI(model="gpt-4o")
# First call: cache miss → LLM called, result cached (~1000ms, $0.02)
response1 = llm.invoke("Who is the CEO of Apple?")
# Second call with paraphrased question: cache HIT → instant, free!
response2 = llm.invoke("Who runs Apple Inc.?") # Same answer served from cache
response3 = llm.invoke("Who is Apple's current chairman?") # Careful: a different question!
# If the threshold is too loose, this returns the cached CEO answer, a false positive.
# For high-volume applications, also implement application-level caching:
import hashlib
def semantic_cache_key(query: str) -> str:
    """Create a cache key by normalizing the query.

    Simple version: hash the normalized text (exact match after normalization).
    Advanced version: embed the query and snap to the nearest cluster centroid.
    """
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()
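The "advanced version" can be sketched as a small in-process cache. This is an illustrative sketch, not a library API: `embed` is any callable mapping a query string to an embedding vector (an OpenAI embeddings client, for instance), injected so the sketch stays provider-agnostic.

```python
import math
from typing import Callable, Optional

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """In-process semantic cache using a linear scan over stored embeddings.

    Fine for thousands of entries; swap in a vector index (Redis, FAISS)
    for larger workloads.
    """

    def __init__(self, embed: Callable, threshold: float = 0.92):
        self.embed = embed          # query -> embedding vector
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # (embedding, cached answer) pairs

    def get(self, query: str) -> Optional[str]:
        vec = self.embed(query)
        best, best_sim = None, -1.0
        for cached_vec, answer in self.entries:
            sim = cosine(vec, cached_vec)
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

The linear scan keeps the sketch dependency-free; the trade-off against a proper vector index is O(n) lookups, which is why Redis is the right home for this at production volume.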
# Cost math example:
# Without semantic cache: 1000 queries/day × $0.02/query = $20/day = $600/month
# With semantic cache (60% hit rate): 400 × $0.02 = $8/day = $240/month
# Semantic cache server cost: ~$20/month
# Net saving: $340/month from caching alone
4. Guardrails: Protecting Against Misuse
pip install nemoguardrails
# NeMo Guardrails: Programmatic input/output filtering with LLM-based evaluation
# config/rails.co (Colang config language):
"""
define user ask about politics
"what do you think about the election?"
"tell me about political parties"
define bot refuse to discuss politics
"I'm a product support assistant and can only help with product-related questions."
define flow
user ask about politics
bot refuse to discuss politics
"""
from nemoguardrails import RailsConfig, LLMRails
config = RailsConfig.from_path("./config")
rails = LLMRails(config)
# Usage in your FastAPI endpoint
from fastapi import FastAPI
app = FastAPI()

@app.post("/chat")
async def chat_endpoint(user_input: str, user_id: str):
    # Rate limiting (get_request_count is your own counter, e.g. Redis-backed)
    if get_request_count(user_id, window_minutes=60) > 50:
        return {"error": "Rate limit exceeded: 50 requests per hour"}
    # Guardrails evaluation (uses an LLM to classify intent)
    response = await rails.generate_async(
        messages=[{"role": "user", "content": user_input}]
    )
    # Cost tracking (log_token_usage is your own logging helper)
    log_token_usage(user_id, response.get("usage", {}))
    return {"response": response["content"]}
Frequently Asked Questions
How do I calculate the ROI of semantic caching vs always calling the LLM?
Measure your cache hit rate after 1 week of operation (Redis provides cache hit stats). For enterprise support bots, hit rates of 40-70% are common — users ask similar questions about the same features repeatedly. At a 50% hit rate on a workload of 10,000 queries/day at $0.02/query, you save $100/day = $3,000/month. Redis Cluster costs $100-500/month. Net ROI is usually 5-10x in the first month. The semantic similarity threshold is the key parameter to tune: measure the false positive rate (similar questions with different correct answers) and adjust to minimize them.
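The break-even arithmetic in this answer generalizes to a one-function calculator. A minimal sketch; the figures are the illustrative ones from the answer above, not benchmarks:

```python
def caching_roi(queries_per_day: int, cost_per_query: float,
                hit_rate: float, cache_cost_per_month: float) -> dict:
    """Monthly LLM spend with and without a semantic cache (30-day month)."""
    baseline = queries_per_day * cost_per_query * 30
    with_cache = baseline * (1 - hit_rate) + cache_cost_per_month
    return {
        "baseline_monthly": baseline,
        "with_cache_monthly": with_cache,
        "net_saving": baseline - with_cache,
    }

# Figures from above: 10,000 queries/day, $0.02/query, 50% hit rate,
# assuming a $300/month Redis Cluster (mid-range of the $100-500 estimate)
roi = caching_roi(10_000, 0.02, 0.50, cache_cost_per_month=300)
```

Re-run this with your measured hit rate after the first week; the hit rate, not the infrastructure cost, dominates the result.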
What's the right evaluation cadence for production systems?
Run automated evals on every pull request that changes prompts, models, or retrieval logic — treat eval failure as a blocking error, not a warning. Additionally, run evals weekly against a "canary" dataset that includes examples of known failure modes discovered in production. For regression detection, maintain a "golden dataset" of cases where your system previously failed and was fixed — if any of these regress, block deployment. The goal is making eval failures cheap to discover (in CI) rather than expensive (in production).
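The golden-dataset gate described above can be as simple as a list of previously-fixed cases with a minimum score per case. A sketch under stated assumptions: `score_case` is a stand-in for whatever grader you use (a RAGAS metric, an LLM judge, exact match), and the case names are hypothetical.

```python
def check_golden_dataset(cases: list, score_case) -> list:
    """Return the IDs of regressed golden cases; an empty list means safe to deploy.

    Each case is a dict: {"id": ..., "question": ..., "min_score": ...}.
    score_case(question) -> float is your grader of choice.
    """
    regressions = []
    for case in cases:
        if score_case(case["question"]) < case["min_score"]:
            regressions.append(case["id"])
    return regressions

# In CI: block deployment on any regression
golden = [
    {"id": "refund-edge-case",
     "question": "Can I get a refund after 31 days?",
     "min_score": 0.90},
]
# failed = check_golden_dataset(golden, score_case=my_grader)
# assert not failed, f"Golden-dataset regressions: {failed}"
```

Every production incident that gets fixed should add one entry to `golden`, so the dataset grows into a record of everything your system once got wrong.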
Conclusion
LLMOps is the engineering discipline that separates AI demos from AI products. Tracing with LangSmith makes every agent decision auditable and debuggable. Evaluation frameworks (RAGAS) provide quantitative quality gates that prevent regressions from reaching production. Semantic caching can reduce both latency and cost by 50-70% for high-volume applications. Guardrails protect against topic drift, prompt injection, and adversarial misuse. Together, these tools transform probabilistic AI behavior into a manageable operational system with measurable quality metrics and cost controls.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.