AI Evaluation: Measuring Performance
Dec 29, 2025 • 20 min read
You cannot improve what you cannot measure. "Vibe checking" your LLM (running a few manual tests and calling it good) is not engineering. It's guesswork that will hurt you when you switch models, change your retrieval strategy, or deploy an update that silently regresses quality. Evals are automated test suites that give you a reproducible quality score you can track over time.
1. Why AI Evals Are Different from Unit Tests
Traditional software has deterministic outputs: given input X, you always get output Y. LLMs are probabilistic. The same prompt can produce different outputs on each run, and "correct" is often subjective. This requires a different testing philosophy:
- No exact-match assertions: `assert output == expected` fails for free-form text. You need semantic similarity or LLM-graded correctness.
- Statistical evaluation: Run each test case multiple times and check that quality scores exceed a threshold (e.g., faithfulness > 0.85 on 90% of runs).
- Human-labeled golden sets: You need ground-truth Q&A pairs to evaluate model answers against. This dataset is your most valuable asset.
- Regression tracking: Track eval scores over time; a prompt change that improves one metric often hurts another.
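The "statistical evaluation" point above can be sketched as a small harness: score each case several times and require a pass rate, not a single lucky run. This is a minimal illustrative sketch; `passes_statistically`, the thresholds, and the stubbed `score_fn` are all hypothetical names, not a library API:

```python
import statistics

def passes_statistically(score_fn, case, n_runs=5, threshold=0.85, min_pass_rate=0.9):
    """Run a probabilistic eval several times; pass only if enough
    individual runs clear the quality threshold."""
    scores = [score_fn(case) for _ in range(n_runs)]
    pass_rate = sum(s >= threshold for s in scores) / n_runs
    return pass_rate >= min_pass_rate, statistics.mean(scores)

# Example with a stubbed scorer; a real score_fn would call an LLM judge.
ok, mean_score = passes_statistically(lambda case: 0.9, {"q": "..."}, n_runs=3)
print(ok, round(mean_score, 2))  # True 0.9
```

The pass-rate criterion is the key design choice: it makes a flaky borderline case fail loudly instead of passing whenever the sampler gets lucky.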
2. LLM-as-a-Judge: The Core Pattern
The most scalable evaluation technique: use a strong model (e.g., GPT-4o) to grade the outputs of your weaker/cheaper model. An LLM judge captures nuance that regex or string matching cannot:
```python
from openai import OpenAI
import json

client = OpenAI()

def llm_judge(question: str, context: str, answer: str) -> dict:
    """
    Uses GPT-4o to grade an answer on multiple dimensions.
    Returns scores 0-1 for faithfulness, relevance, and completeness.
    """
    prompt = f"""You are an expert evaluator. Grade this AI answer on three criteria.

QUESTION: {question}
CONTEXT PROVIDED TO AI: {context}
AI ANSWER: {answer}

Respond in JSON:
{{
  "faithfulness": <0-1, is the answer factually supported by the context?>,
  "relevance": <0-1, does the answer address the question?>,
  "completeness": <0-1, is the answer complete or missing key info?>,
  "reasoning": "<one sentence explaining the scores>"
}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Grade a set of Q&A pairs
test_cases = [
    {"q": "What is our refund policy?", "context": "Refunds within 30 days.", "a": "You can get a refund within 30 days of purchase."},
    {"q": "Do you offer student discounts?", "context": "Enterprise pricing available.", "a": "Yes, we offer 50% student discounts."},  # Hallucination!
]

for case in test_cases:
    scores = llm_judge(case["q"], case["context"], case["a"])
    print(f"Faithfulness: {scores['faithfulness']:.2f} | {case['a'][:50]}")
```

3. RAG-Specific Metrics with Ragas
Ragas is the standard library for evaluating RAG pipelines. It measures quality at every stage of retrieval and generation:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is the answer grounded in retrieved context?
    answer_relevancy,   # Does the answer address the question?
    context_recall,     # Did retrieval find the right documents?
    context_precision,  # Are retrieved docs actually useful?
)
from datasets import Dataset

# Your test data (golden Q&A pairs + actual RAG outputs)
eval_data = {
    "question": ["What is the return policy?", "How do I cancel?"],
    "answer": ["30-day returns accepted.", "Cancel via Settings > Account."],  # Your RAG output
    "contexts": [["Our policy: 30-day returns..."], ["To cancel your account..."]],  # Retrieved docs
    "ground_truth": ["Items can be returned within 30 days.", "Go to Settings then Account to cancel."],  # Human labels
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(results)
# Example output: {'faithfulness': 0.89, 'answer_relevancy': 0.91, 'context_recall': 0.85, 'context_precision': 0.78}
```

Understanding Ragas Metrics
| Metric | What It Measures | Low Score Means… |
|---|---|---|
| Faithfulness | Is the answer grounded in context? | Model is hallucinating (making up facts) |
| Answer Relevancy | Does the answer address the question? | Model is off-topic or too vague |
| Context Recall | Did retrieval find the right docs? | Retrieval is missing key documents |
| Context Precision | Are retrieved docs actually useful? | Too much noise in retrieval results |
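One way to act on the table above is to map each below-threshold metric to its likely fix when a run completes. This is a hypothetical helper; the `DIAGNOSES` wording and thresholds are illustrative suggestions, not Ragas output:

```python
# Map each Ragas metric to a likely remediation when its score is low.
DIAGNOSES = {
    "faithfulness": "Model is hallucinating: tighten grounding instructions in the prompt.",
    "answer_relevancy": "Answers are off-topic or vague: check prompt focus and question handling.",
    "context_recall": "Retrieval is missing key documents: revisit chunking or try hybrid search.",
    "context_precision": "Too much retrieval noise: add a reranker or reduce top-k.",
}

def diagnose(results: dict, thresholds: dict) -> list[str]:
    """Return one actionable line per metric that fell below its threshold."""
    return [
        f"{metric} = {results.get(metric, 0.0):.2f}: {DIAGNOSES[metric]}"
        for metric, floor in thresholds.items()
        if results.get(metric, 0.0) < floor
    ]

report = diagnose(
    {"faithfulness": 0.89, "context_precision": 0.62},
    {"faithfulness": 0.85, "context_precision": 0.75},
)
print(report)  # Only context_precision is flagged
```

Printing a diagnosis next to the raw number makes eval failures actionable for whoever reads the CI log, not just whoever wrote the pipeline.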
4. Building a Golden Eval Dataset
Your eval dataset is the foundation of all quality measurement. Build it carefully:
```python
# Step 1: Generate question-answer pairs from your documents
def generate_eval_set(documents: list[str], n_per_doc: int = 5) -> list[dict]:
    eval_pairs = []
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Generate {n_per_doc} Q&A pairs from this document.
These will be used to test a RAG system. Include edge cases.
Respond in JSON: {{"pairs": [{{"question": "...", "ground_truth_answer": "..."}}]}}
Document: {doc}"""
            }],
            response_format={"type": "json_object"},
        )
        # json_object mode returns an object, so the pairs are wrapped in a key
        pairs = json.loads(response.choices[0].message.content)["pairs"]
        eval_pairs.extend(pairs)
    # Step 2: Human review: remove bad/ambiguous examples before trusting this set
    # Save to file for reproducibility
    with open("eval_dataset.json", "w") as f:
        json.dump(eval_pairs, f, indent=2)
    return eval_pairs
```

5. CI/CD Integration: Automated Quality Gates
Run evals automatically on every pull request and block merges that regress quality below your thresholds:
```yaml
# .github/workflows/eval.yml
name: RAG Quality Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAG Evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pip install ragas datasets
          python scripts/run_evals.py
```
```python
# scripts/run_evals.py
import sys

THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}

# `dataset` and the metrics are set up as in section 3
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])

failed = []
for metric, threshold in THRESHOLDS.items():
    score = results[metric]
    if score < threshold:
        failed.append(f"{metric}: {score:.2f} < {threshold}")
        print(f"❌ FAILED: {metric} = {score:.2f} (threshold: {threshold})")
    else:
        print(f"✅ PASSED: {metric} = {score:.2f}")

if failed:
    print("\nEval gate FAILED. Fix these issues before merging.")
    sys.exit(1)  # Non-zero exit blocks the merge

print("\n✅ All eval gates passed.")
```

6. TruLens: Comprehensive Agent Evaluation
TruLens provides a broader evaluation framework for LLM applications with a built-in dashboard:
```python
from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI as TruOpenAI
from trulens.apps.langchain import TruChain

session = TruSession()
provider = TruOpenAI()

# Define feedback functions
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons) \
    .on_input_output()
f_qa_relevance = Feedback(provider.relevance_with_cot_reasons) \
    .on_input_output()

# Wrap your LangChain RAG chain
tru_recorder = TruChain(
    your_rag_chain,
    app_name="Production RAG",
    feedbacks=[f_groundedness, f_qa_relevance],
)

# Run through your test set: TruLens records everything
with tru_recorder as recording:
    for question in test_questions:
        your_rag_chain.invoke({"input": question})

# Launch the dashboard
session.run_dashboard()  # Opens at http://localhost:8501
```

Frequently Asked Questions
How many test cases do I need?
Minimum 50-100 for statistical reliability. Your eval scores will have high variance below 50 examples. For production systems, aim for 200-500 labeled examples covering: typical queries, edge cases, adversarial inputs, and domain-specific terminology. Scale as your product grows.
How often should I run evals?
On every code change via CI/CD for regression detection. Weekly on a fresh sample of production queries to detect real-world drift. After any model, prompt, or retrieval change. Before any major release.
What score thresholds should I target?
For faithfulness: 0.85+ is good, 0.90+ is production-ready. For answer relevancy: 0.80+. For context recall: 0.75+. These are starting points β calibrate against human preferences for your specific use case by running LLM scores alongside a human eval on 50 examples.
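The calibration step above (running LLM scores alongside a human eval on ~50 examples) boils down to checking rank correlation between the two. A minimal sketch with made-up scores; `spearman` is hand-rolled here for self-containment (no tie handling) rather than taken from SciPy:

```python
def spearman(xs, ys):
    """Spearman rank correlation for two equal-length score lists (no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

llm_scores   = [0.9, 0.4, 0.8, 0.7, 0.3]   # hypothetical judge outputs
human_scores = [1.0, 0.2, 0.9, 0.6, 0.1]   # hypothetical human labels
print(f"Spearman correlation: {spearman(llm_scores, human_scores):.2f}")
# prints: Spearman correlation: 1.00
```

If the correlation on your sample is weak, the judge's absolute thresholds are meaningless for your domain; fix the judge prompt before tuning the numbers.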
Conclusion
Building an evaluation pipeline transforms AI development from art to engineering. You can now iterate confidently: change your chunking strategy, swap your embedding model, or optimize your prompts, and know immediately whether quality improved or regressed. The teams shipping the most reliable AI products aren't the ones with the best models; they're the ones with the best evals.
Vivek
AI Engineer. Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.