The Data Flywheel
Dec 30, 2025 • 20 min read
You launch your AI product. Accuracy is 70%. Without feedback capture, it stays 70% forever — you'll manually review anecdotal complaints and occasionally tune your prompt, but nothing systematic improves. With a feedback loop, every user interaction becomes a training signal. Users rate answers. You identify failure patterns. You fine-tune. The model improves. More users trust it. They provide more feedback. This is the data flywheel — the compounding advantage that separates AI products that continuously improve from those that stagnate.
1. Explicit vs Implicit Feedback
Explicit feedback — the user deliberately tells you:
- 👍/👎 thumbs rating on responses
- 1-5 star rating widgets
- "Was this helpful?" Yes/No
- Free-text correction input ("This should say...")
- Annotation tasks in Argilla/LabelStudio
Implicit signals — inferred from user behavior:
- User copies code block → strong positive signal
- User re-phrases their question → answer was unclear
- Session ends immediately → answer was unhelpful
- User clicks cited source → wants verification
- User follows up → partial success (answered but incomplete)
2. Database Schema for Feedback
-- PostgreSQL schema for capturing feedback linked to specific runs
CREATE TABLE llm_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    session_id UUID,
    model VARCHAR(100) NOT NULL,       -- "gpt-4o", "llama-3.1-8b", etc.
    trace_id VARCHAR(255),             -- LangSmith trace ID for debugging
    -- Inputs
    system_prompt_hash VARCHAR(64),    -- Hash of system prompt (detect prompt changes)
    user_query TEXT NOT NULL,
    context_chunks JSONB,              -- Chunks retrieved for RAG
    -- Outputs
    response TEXT NOT NULL,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    cost_usd DECIMAL(10, 6),
    latency_ms INTEGER,
    -- Metadata
    feature VARCHAR(100),              -- "doc_qa", "code_gen", "chat", etc.
    model_version VARCHAR(50),         -- "v1.2" for fine-tuned models
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE feedback (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    run_id UUID NOT NULL REFERENCES llm_runs(id),
    user_id UUID REFERENCES users(id),
    -- Explicit feedback
    score INTEGER,                     -- 1 = positive, -1 = negative, NULL = no rating
    comment TEXT,                      -- User's free-text explanation
    correction TEXT,                   -- What the correct answer should be
    -- Implicit signals (log separately, aggregate later)
    implicit_signal VARCHAR(50),       -- 'copy_code', 'rephrase', 'session_end', 'follow_up'
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Useful indexes for analysis
CREATE INDEX idx_feedback_run_id ON feedback(run_id);
CREATE INDEX idx_feedback_score ON feedback(score) WHERE score IS NOT NULL;
CREATE INDEX idx_runs_feature ON llm_runs(feature, created_at DESC);

3. Capturing Feedback in Your API
# FastAPI example
import time
from uuid import UUID

import asyncpg
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    query: str
    session_id: UUID

class FeedbackRequest(BaseModel):
    run_id: UUID
    score: int                     # 1 or -1
    comment: str | None = None
    correction: str | None = None  # "The answer should have been..."

@app.post("/api/generate")
async def generate(req: GenerateRequest, db: asyncpg.Connection = Depends(get_db)):
    # get_db is your dependency that yields an asyncpg connection (not shown);
    # llm_generate and calculate_cost are your app's LLM wrapper and cost helper.
    start = time.monotonic()
    response, usage = await llm_generate(req.query)
    latency_ms = int((time.monotonic() - start) * 1000)
    # Log the run — returns ID for feedback linking
    run_id = await db.fetchval("""
        INSERT INTO llm_runs (session_id, model, user_query, response,
            prompt_tokens, completion_tokens, cost_usd, latency_ms, feature)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
        RETURNING id
    """, req.session_id, "gpt-4o", req.query, response,
        usage.prompt_tokens, usage.completion_tokens, calculate_cost(usage),
        latency_ms, "doc_qa")
    # Return run_id to frontend — frontend stores it for feedback submission
    return {"response": response, "run_id": str(run_id)}

@app.post("/api/feedback")
async def submit_feedback(req: FeedbackRequest, db: asyncpg.Connection = Depends(get_db)):
    await db.execute("""
        INSERT INTO feedback (run_id, score, comment, correction)
        VALUES ($1, $2, $3, $4)
    """, req.run_id, req.score, req.comment, req.correction)
    return {"status": "recorded"}

4. Mining Feedback for DPO Training Pairs
import asyncpg
import json

async def build_dpo_dataset(db, output_file: str = "dpo_pairs.jsonl"):
    """
    Build Direct Preference Optimization (DPO) training pairs from feedback.
    DPO format: {prompt, chosen_response, rejected_response}
    """
    # Find run pairs: same query, different sessions, one liked and one disliked.
    # Or: use the correction field as the "chosen" response.

    # Strategy 1: Correction-based pairs (highest quality signal)
    correction_pairs = await db.fetch("""
        SELECT
            r.user_query as prompt,
            f.correction as chosen,  -- Human-corrected answer
            r.response as rejected   -- Original (bad) AI response
        FROM feedback f
        JOIN llm_runs r ON f.run_id = r.id
        WHERE f.score = -1                 -- User thumbed down
          AND f.correction IS NOT NULL     -- And provided a correction
          AND length(f.correction) > 50    -- Correction is substantive
        ORDER BY f.created_at DESC
        LIMIT 10000
    """)

    # Strategy 2: Paired positive/negative (same question, different answers)
    paired_outcomes = await db.fetch("""
        SELECT
            a.user_query as prompt,
            b.response as chosen,
            a.response as rejected
        FROM llm_runs a
        JOIN feedback fa ON fa.run_id = a.id AND fa.score = -1
        JOIN llm_runs b ON b.user_query = a.user_query
        JOIN feedback fb ON fb.run_id = b.id AND fb.score = 1
        WHERE a.id != b.id
          AND a.feature = b.feature
        LIMIT 5000
    """)

    # Write DPO dataset as JSONL
    with open(output_file, "w") as f:
        for pair in list(correction_pairs) + list(paired_outcomes):
            f.write(json.dumps({
                "prompt": pair["prompt"],
                "chosen": pair["chosen"],
                "rejected": pair["rejected"],
            }) + "\n")
    print(f"Created {len(correction_pairs) + len(paired_outcomes)} DPO pairs")

5. Running DPO Fine-Tuning
pip install trl transformers datasets

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load your collected preference pairs
dataset = load_dataset("json", data_files="dpo_pairs.jsonl")["train"]

dpo_config = DPOConfig(
    output_dir="./llama3-dpo-finetuned",
    beta=0.1,            # DPO temperature — lower = stronger preference alignment
    learning_rate=5e-7,  # Very low LR for alignment (much lower than SFT)
    per_device_train_batch_size=4,
    num_train_epochs=3,
    max_length=2048,
    fp16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,       # None → trl auto-creates a frozen copy for the KL reference
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # renamed to processing_class in newer trl releases
)

trainer.train()
trainer.save_model("./llama3-dpo-finetuned")
# The fine-tuned model now prefers answers your users prefer!

Frequently Asked Questions
How many preference pairs do I need for DPO fine-tuning?
Start with 1,000 high-quality pairs (using human-corrected responses as "chosen"). You can see meaningful alignment improvements with as few as 500 pairs on a narrow domain. For general-purpose alignment like OpenAI's RLHF training, you need millions — but for a specialized use case (customer support bot, code assistant), 1,000-5,000 domain-specific pairs are highly effective.
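Since pair quality matters more than pair count, a quick sanity pass over the mined file pays off before any training run: exact duplicates and degenerate rows (chosen identical to rejected) dilute the preference signal. A sketch, assuming the dpo_pairs.jsonl format produced above:

```python
import json

def validate_dpo_pairs(path: str) -> list[dict]:
    """Load DPO pairs, dropping exact duplicates and degenerate rows."""
    seen, clean = set(), []
    with open(path) as f:
        for line in f:
            pair = json.loads(line)
            key = (pair["prompt"], pair["chosen"], pair["rejected"])
            if key in seen:
                continue  # exact duplicate adds no new signal
            if pair["chosen"].strip() == pair["rejected"].strip():
                continue  # degenerate: no preference to learn from
            seen.add(key)
            clean.append(pair)
    return clean
```

Run it on the mined file and compare the clean count against your target of 1,000+ before kicking off a training job.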
What's the difference between DPO and RLHF?
Reinforcement Learning from Human Feedback (RLHF) requires training a separate reward model, then using PPO to optimize the LLM against that reward: a complex, unstable three-stage pipeline. Direct Preference Optimization (DPO) is a more recent technique that achieves the same alignment goals by fine-tuning directly on preference pairs, with no reward model and no RL loop. DPO is simpler to implement, more stable to train, and achieves comparable results, which makes it the sensible default for most new alignment projects.
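"Fine-tuning directly on preference pairs" has a precise form. The DPO objective (Rafailov et al., 2023), for a prompt x with chosen response y_w and rejected response y_l, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here π_θ is the model being trained, π_ref is the frozen reference copy (the ref_model in the code above), σ is the sigmoid, and β is the same beta passed to DPOConfig: it scales how strongly the policy is pushed toward chosen over rejected relative to the reference model.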
Conclusion
The feedback loop is what separates AI products that improve from those that stagnate. The infrastructure is straightforward: a runs table, a feedback table, and a pipeline that periodically mines preference pairs from negative feedback with corrections. Run DPO fine-tuning when you accumulate 1,000+ pairs. Evaluate the new model on your golden test set. If it improves, deploy. Repeat. This is the data flywheel — and it's how today's leading AI products compound their performance advantage over time.
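That retrain-evaluate-deploy cycle can be condensed into a single orchestration step. This is only a sketch of the loop described above; every helper (count_new_pairs, run_dpo_finetune, evaluate_on_golden_set, deploy) is a hypothetical stand-in for your own pipeline:

```python
# One iteration of the data flywheel, with the pipeline stages injected as
# callables. All four helpers are hypothetical stand-ins for your own code.
MIN_PAIRS = 1000  # retrain once enough preference pairs have accumulated

def flywheel_step(count_new_pairs, run_dpo_finetune,
                  evaluate_on_golden_set, deploy) -> str:
    """Run one flywheel iteration and report what happened."""
    if count_new_pairs() < MIN_PAIRS:
        return "waiting"        # keep collecting feedback
    candidate = run_dpo_finetune()
    # Only ship the candidate if it beats the current model on the golden set
    if evaluate_on_golden_set(candidate) > evaluate_on_golden_set("current"):
        deploy(candidate)
        return "deployed"
    return "rejected"           # candidate did not beat the baseline
```

Scheduling this as a periodic job (nightly or weekly) is usually enough; the gate on the golden-set score is what keeps a bad batch of feedback from regressing production.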
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.