Evaluating RAG Pipelines

You can't improve what you don't measure. In AI Engineering, "it vibes well" is not a metric.

1. The "RAG Triad"

To thoroughly evaluate a retrieval system, you need to check multiple relationships between your Query, Context, and Answer.

Faithfulness

Context ➜ Answer

Is the answer grounded in the retrieved documents? If the model says "Sky is green" but the docs say "Sky is blue", Faithfulness is 0.

Answer Relevance

Query ➜ Answer

Does the answer actually address the user's question? A faithful answer ("The sky is blue") is useless if the user asked "What is the capital of France?".

Context Precision

Query ➜ Context

Did the retrieval system find the correct documents? And crucially, are the most relevant chunks ranked at the top?

2. How Ragas Works (LLM-as-a-Judge)

Ragas uses an LLM (like GPT-4) to grade your RAG pipeline.

For example, to calculate Faithfulness, Ragas asks the judge LLM: "Extract all claims from the generated answer, and verify if each claim is present in the source context."

This produces a score from 0.0 to 1.0, giving you a quantitative metric to track over time.
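To make the judge call concrete, here is a minimal sketch of how such a prompt might be assembled. The wording is illustrative only, not Ragas's actual internal prompt:

```python
def build_faithfulness_prompt(answer: str, context: str) -> str:
    """Assemble a judge prompt in the spirit of the Faithfulness check.

    Illustrative wording only; Ragas's real prompt differs.
    """
    return (
        "Extract all claims from the generated answer, and verify "
        "if each claim is present in the source context.\n\n"
        f"Answer: {answer}\n"
        f"Context: {context}\n\n"
        "Respond with one 'supported' or 'unsupported' verdict per claim."
    )

prompt = build_faithfulness_prompt("The sky is green.", "The sky is blue.")
print(prompt)
```

The fraction of "supported" verdicts then becomes the 0.0-1.0 score.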


3. Implementation Guide

Let's evaluate a simple dataset.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# 1. Prepare your data
# ideally collected from your production logs
data = {
    'question': ['How do I reset my password?'],
    'answer': ['Go to settings and click reset.'],
    'contexts': [['Settings page has a reset button...', 'Login page info...']],
    'ground_truth': ['Navigate to Settings > Security > Reset Password.']
}

dataset = Dataset.from_dict(data)

# 2. Run Evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

# 3. View Scores
print(results)
# {'context_precision': 0.85, 'faithfulness': 0.92, ...}

Common Pitfalls in RAG Evaluation

  • Using Weak Judges: Avoid GPT-3.5 or small 8B models as judges; they frequently miss subtle hallucinations. Use a frontier model such as GPT-4o or Claude 3.5 Sonnet.
  • Reference-Free Eval: Ragas can evaluate without "ground truth" answers, but a human-verified gold dataset unlocks additional metrics such as Context Recall and makes the other scores more trustworthy.
  • Cherry-picking Questions: Only evaluating easy questions skews your metrics upward. Include adversarial questions ("What is the company's unreported revenue?") that should return "I don't know" to test for over-answering.
  • Ignoring Context Recall: Low Context Recall (the retriever is missing relevant docs) is often the primary cause of bad answers, but teams focus on Faithfulness. Always check retrieval quality first.
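The over-answering pitfall is easy to check programmatically. Below is a minimal sketch that flags adversarial questions the system answered instead of refusing; the refusal markers and the case format are assumptions, not part of Ragas:

```python
# Phrases we treat as a proper refusal (an assumed, app-specific list).
REFUSAL_MARKERS = ("i don't know", "not available", "cannot find")

def over_answer_rate(adversarial_cases: list[dict]) -> float:
    """Fraction of unanswerable questions that got a confident answer anyway."""
    answered = [
        case for case in adversarial_cases
        if not any(m in case["answer"].lower() for m in REFUSAL_MARKERS)
    ]
    return len(answered) / len(adversarial_cases)

cases = [
    {"question": "What is the unreported revenue?", "answer": "I don't know."},
    {"question": "Who is the secret owner?", "answer": "It is John Smith."},
]
print(over_answer_rate(cases))  # 0.5: one of two adversarial cases over-answered
```

A rate near 0.0 means the system declines questions the corpus cannot support.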

4. Understanding Each Metric in Depth

Faithfulness Score

Faithfulness measures whether every claim in the generated answer can be traced back to the retrieved context. Ragas implements this using a two-step LLM procedure:

  1. Claim Extraction: The judge LLM reads the generated answer and extracts a list of atomic claims (e.g., "The API rate limit is 60 requests/minute", "Rate limits reset every hour").
  2. Claim Verification: Each claim is individually checked against the retrieved context. The Faithfulness score = (verified claims) / (total claims).

A score of 1.0 means every statement in the answer is supported by the retrieved documents. A score of 0.5 means half the claims are hallucinated, which is a serious reliability problem in production.
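The arithmetic of the two-step procedure can be sketched in a few lines. The claim verifier here is a naive keyword stub standing in for the judge LLM, which is what actually performs this check in Ragas:

```python
def faithfulness_score(claims, context, verify) -> float:
    """Score = (verified claims) / (total claims)."""
    verdicts = [verify(claim, context) for claim in claims]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Stub verifier: a real pipeline would prompt a judge LLM instead.
def keyword_verify(claim: str, context: str) -> bool:
    return all(word in context.lower() for word in claim.lower().split())

context = "the sky is blue. rate limits reset every hour."
claims = ["the sky is blue", "the sky is green"]
print(faithfulness_score(claims, context, keyword_verify))  # 0.5
```

One unsupported claim out of two yields exactly the 0.5 "half hallucinated" case described above.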

Context Precision

Context Precision focuses on the ordering of retrieved chunks. If your retriever returns 10 chunks but the 3 most relevant ones are ranked 8th, 9th, and 10th, the LLM may never read them (due to "lost in the middle" LLM attention patterns). Context Precision rewards systems that rank the most relevant chunks first.

Low Context Precision usually indicates a weak embedding model or a retrieval algorithm that needs tuning. Consider switching from cosine similarity to a hybrid BM25 + embedding approach.
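To see why ordering matters, here is a simplified rank-aware precision calculation. It takes per-chunk relevance verdicts as given (in Ragas an LLM produces them) and averages precision@k at each relevant position, so burying relevant chunks lowers the score:

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over positions holding relevant chunks.

    Simplified sketch; Ragas's exact formula may differ in detail.
    """
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision among the top-k at this position
    return score / hits if hits else 0.0

print(context_precision([True, True, False]))   # 1.0: relevant chunks ranked first
print(context_precision([False, False, True]))  # ~0.33: relevant chunk buried last
```

Both retrievals found a relevant chunk, but only the first is rewarded for ranking it at the top.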

Context Recall

Context Recall requires a ground truth answer. It checks: "What proportion of the information needed to answer this question was actually retrieved?" A score of 0.7 means 70% of necessary information was retrieved—the other 30% was in the corpus but the retriever missed it.

This metric helps identify when your chunk size or number of retrieved chunks (top_k) is too small. If Context Recall is low, try increasing top_k from 3 to 5 or 10.
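The proportion being measured can be sketched as follows. The attribution check is a naive substring stub; Ragas delegates that judgment to an LLM, and the example claims are invented for illustration:

```python
def context_recall(ground_truth_claims, contexts, supported) -> float:
    """Fraction of ground-truth claims attributable to the retrieved context."""
    hits = sum(supported(claim, contexts) for claim in ground_truth_claims)
    return hits / len(ground_truth_claims)

# Stub attribution check standing in for the judge LLM.
def naive_supported(claim: str, contexts: list[str]) -> bool:
    return any(claim.lower() in ctx.lower() for ctx in contexts)

claims = ["reset is in settings", "requires email confirmation"]
ctxs = ["The reset is in Settings > Security."]
print(context_recall(claims, ctxs, naive_supported))  # 0.5
```

Here the retriever surfaced the settings chunk but missed the email-confirmation detail, giving the 50% recall the metric is designed to expose.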

Answer Relevance

Answer Relevance checks if the generated answer actually addresses the question asked. Ragas measures this by reverse-engineering: it asks the judge LLM to generate N questions that the answer could be responding to, then measures how similar those generated questions are to the original. High similarity = the answer is relevant.
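The reverse-engineering idea reduces to averaging similarities. The sketch below uses a toy Jaccard word-overlap similarity in place of the embedding cosine similarity Ragas uses, and the "generated questions" are hand-written stand-ins for what the judge LLM would produce:

```python
def answer_relevance(original_q: str, generated_qs: list[str], similarity) -> float:
    """Mean similarity between the original question and questions the
    answer could plausibly be responding to."""
    return sum(similarity(original_q, q) for q in generated_qs) / len(generated_qs)

# Toy word-overlap similarity standing in for embedding cosine similarity.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

score = answer_relevance(
    "How do I reset my password?",
    ["How do I reset my password?", "Where is the reset button?"],
    jaccard,
)
print(score)
```

An off-topic answer would induce generated questions with little overlap with the original, pulling the score down.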

5. System-Level Evaluation with Continuous Monitoring

One-time evaluation is not enough. Production RAG systems need continuous monitoring as:

  • New documents are added (corpus drift)
  • User query patterns change (distribution shift)
  • Model updates change LLM behavior (model drift)

Building a RAG Evaluation Pipeline

import json
from datetime import datetime
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

def run_evaluation_pipeline(rag_system, test_dataset_path: str) -> dict:
    """Run full Ragas evaluation and log results"""
    
    # Load your test questions
    with open(test_dataset_path) as f:
        test_cases = json.load(f)
    
    # Run your RAG system on each question
    results_data = []
    for case in test_cases:
        query = case["question"]
        ground_truth = case["ground_truth"]
        
        # Get RAG response
        response = rag_system.query(query)
        
        results_data.append({
            "question": query,
            "answer": response.answer,
            "contexts": response.retrieved_chunks,
            "ground_truth": ground_truth
        })
    
    dataset = Dataset.from_list(results_data)
    
    # Evaluate all metrics
    scores = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )
    
    # Log results with timestamp (the evaluate() result behaves like a
    # dict of metric scores; the exact shape varies across Ragas versions)
    evaluation_log = {
        "timestamp": datetime.now().isoformat(),
        "num_questions": len(test_cases),
        "scores": dict(scores),
        "passing": scores["faithfulness"] > 0.85 and scores["answer_relevancy"] > 0.80
    }
    
    return evaluation_log

# Run in CI/CD to catch regressions before deployment
results = run_evaluation_pipeline(my_rag, "test_suite.json")
if not results["passing"]:
    raise ValueError(f"RAG quality check failed: {results['scores']}")

6. Interpreting Your Scores

Score Range | Faithfulness | Context Precision | Action
0.90 - 1.00 | 🟢 Excellent | 🟢 Excellent | Deploy to production, monitor
0.75 - 0.90 | 🟡 Good | 🟡 Good | Acceptable, minor improvements
0.60 - 0.75 | 🟠 Needs work | 🟠 Needs work | Improve prompts and retrieval
Below 0.60 | 🔴 Unacceptable | 🔴 Unacceptable | Debug fundamentals

Frequently Asked Questions

How many test questions do I need?

For initial development, 50-100 questions across your key use cases is sufficient. For production monitoring, aim for 200-500 questions that cover edge cases, ambiguous queries, and adversarial inputs. Quality matters more than quantity—100 carefully crafted questions beat 1000 random ones.

Can I use Ragas without ground truth answers?

Yes. Ragas supports reference-free evaluation for Faithfulness and Answer Relevance, which don't require ground truth. Context Recall and Context Precision require ground truth. Start reference-free for quick feedback, then invest in building a gold dataset for comprehensive evaluation.

How do I build a good test dataset?

Use Ragas's built-in test generation: from ragas.testset import TestsetGenerator. Feed it your docs and it auto-generates diverse Q&A pairs. Then have domain experts review and correct the generated answers. This is faster than writing questions from scratch.

What's a good Faithfulness score for customer-facing apps?

For enterprise or high-stakes applications (legal, medical, finance), target 0.95+. For internal tools or knowledge bases, 0.85+ is acceptable. Never deploy a RAG system with Faithfulness below 0.75 in production—your users will regularly receive hallucinated information.

Next Steps

  • Create Your Gold Dataset: Write 50 question/answer pairs from your actual documents. This is the most valuable investment you can make in RAG quality.
  • Add Ragas to CI/CD: Run evaluations automatically before every deployment to catch quality regressions.
  • Track Metrics Over Time: Store scores in a database and visualize trends to see if quality improves or degrades as you make changes.
  • A/B Test Improvements: Use Ragas to compare different chunking strategies, embedding models, or retrieval approaches with quantitative data.
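For the "track metrics over time" step, a lightweight approach is appending each run to a local SQLite table. The schema and file name below are illustrative assumptions, not a Ragas feature:

```python
import sqlite3
from datetime import datetime, timezone

def log_scores(db_path: str, scores: dict) -> None:
    """Append one evaluation run to a local SQLite table (hypothetical schema)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS eval_runs "
            "(ts TEXT, metric TEXT, score REAL)"
        )
        ts = datetime.now(timezone.utc).isoformat()
        conn.executemany(
            "INSERT INTO eval_runs VALUES (?, ?, ?)",
            [(ts, metric, score) for metric, score in scores.items()],
        )

log_scores("eval_history.db", {"faithfulness": 0.92, "context_precision": 0.85})
```

From there, a simple query over `eval_runs` grouped by metric is enough to plot trends per deployment.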

Written by Vivek, AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
