The Data Flywheel
Dec 30, 2025 • 20 min read
You launch your AI product. Accuracy is 70%. Without feedback capture, it stays 70% forever — you'll manually review anecdotal complaints and occasionally tune your prompt, but nothing systematic improves. With a feedback loop, every user interaction becomes a training signal. Users rate answers. You identify failure patterns. You fine-tune. The model improves. More users trust it. They provide more feedback. This is the data flywheel — the compounding advantage that separates AI products that continuously improve from those that stagnate.
1. Explicit vs Implicit Feedback
Explicit feedback — the user deliberately tells you:
- 👍/👎 thumbs rating on responses
- 1-5 star rating widgets
- "Was this helpful?" Yes/No
- Free-text correction input ("This should say...")
- Annotation tasks in Argilla/LabelStudio
Implicit signals — inferred from user behavior:
- User copies code block → strong positive signal
- User re-phrases their question → answer was unclear
- Session ends immediately → answer was unhelpful
- User clicks cited source → wants verification
- User follows up → partial success (answered but incomplete)
2. Database Schema for Feedback
-- PostgreSQL schema for capturing feedback linked to specific runs
CREATE TABLE llm_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    session_id UUID,
    model VARCHAR(100) NOT NULL,       -- "gpt-4o", "llama-3.1-8b", etc.
    trace_id VARCHAR(255),             -- LangSmith trace ID for debugging
    -- Inputs
    system_prompt_hash VARCHAR(64),    -- Hash of system prompt (detect prompt changes)
    user_query TEXT NOT NULL,
    context_chunks JSONB,              -- Chunks retrieved for RAG
    -- Outputs
    response TEXT NOT NULL,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    cost_usd DECIMAL(10, 6),
    latency_ms INTEGER,
    -- Metadata
    feature VARCHAR(100),              -- "doc_qa", "code_gen", "chat", etc.
    model_version VARCHAR(50),         -- "v1.2" for fine-tuned models
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE feedback (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    run_id UUID NOT NULL REFERENCES llm_runs(id),
    user_id UUID REFERENCES users(id),
    -- Explicit feedback
    score INTEGER,                     -- 1 = positive, -1 = negative, NULL = no rating
    comment TEXT,                      -- User's free-text explanation
    correction TEXT,                   -- What the correct answer should be
    -- Implicit signals (log separately, aggregate later)
    implicit_signal VARCHAR(50),       -- 'copy_code', 'rephrase', 'session_end', 'follow_up'
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Useful indexes for analysis
CREATE INDEX idx_feedback_run_id ON feedback(run_id);
CREATE INDEX idx_feedback_score ON feedback(score) WHERE score IS NOT NULL;
CREATE INDEX idx_runs_feature ON llm_runs(feature, created_at DESC);

3. Capturing Feedback in Your API
# FastAPI example
import time
from uuid import UUID

import asyncpg
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    query: str
    session_id: UUID

class FeedbackRequest(BaseModel):
    run_id: UUID
    score: int                     # 1 or -1
    comment: str | None = None
    correction: str | None = None  # "The answer should have been..."

@app.post("/api/generate")
async def generate(req: GenerateRequest, db: asyncpg.Connection = Depends(get_db)):
    # get_db is your dependency that yields an asyncpg connection (not shown);
    # llm_generate and calculate_cost are your app's LLM wrapper and cost helper.
    start = time.monotonic()
    response, usage = await llm_generate(req.query)
    latency_ms = int((time.monotonic() - start) * 1000)
    # Log the run — returns ID for feedback linking
    run_id = await db.fetchval("""
        INSERT INTO llm_runs (session_id, model, user_query, response,
            prompt_tokens, completion_tokens, cost_usd, latency_ms, feature)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
        RETURNING id
    """, req.session_id, "gpt-4o", req.query, response,
        usage.prompt_tokens, usage.completion_tokens, calculate_cost(usage),
        latency_ms, "doc_qa")
    # Return run_id to frontend — frontend stores it for feedback submission
    return {"response": response, "run_id": str(run_id)}

@app.post("/api/feedback")
async def submit_feedback(req: FeedbackRequest, db: asyncpg.Connection = Depends(get_db)):
    await db.execute("""
        INSERT INTO feedback (run_id, score, comment, correction)
        VALUES ($1, $2, $3, $4)
    """, req.run_id, req.score, req.comment, req.correction)
    return {"status": "recorded"}

4. Mining Feedback for DPO Training Pairs
import asyncpg
import json

async def build_dpo_dataset(db, output_file: str = "dpo_pairs.jsonl"):
    """
    Build Direct Preference Optimization (DPO) training pairs from feedback.
    DPO format: {prompt, chosen_response, rejected_response}
    """
    # Find run pairs: same query, different sessions, one liked and one disliked.
    # Or: use the correction field as the "chosen" response.

    # Strategy 1: Correction-based pairs (highest quality signal)
    correction_pairs = await db.fetch("""
        SELECT
            r.user_query as prompt,
            f.correction as chosen,  -- Human-corrected answer
            r.response as rejected   -- Original (bad) AI response
        FROM feedback f
        JOIN llm_runs r ON f.run_id = r.id
        WHERE f.score = -1                 -- User thumbed down
          AND f.correction IS NOT NULL     -- And provided a correction
          AND length(f.correction) > 50    -- Correction is substantive
        ORDER BY f.created_at DESC
        LIMIT 10000
    """)

    # Strategy 2: Paired positive/negative (same question, different answers)
    paired_outcomes = await db.fetch("""
        SELECT
            a.user_query as prompt,
            b.response as chosen,
            a.response as rejected
        FROM llm_runs a
        JOIN feedback fa ON fa.run_id = a.id AND fa.score = -1
        JOIN llm_runs b ON b.user_query = a.user_query
        JOIN feedback fb ON fb.run_id = b.id AND fb.score = 1
        WHERE a.id != b.id
          AND a.feature = b.feature
        LIMIT 5000
    """)

    # Write DPO dataset as JSONL
    with open(output_file, "w") as f:
        for pair in list(correction_pairs) + list(paired_outcomes):
            f.write(json.dumps({
                "prompt": pair["prompt"],
                "chosen": pair["chosen"],
                "rejected": pair["rejected"],
            }) + "\n")
    print(f"Created {len(correction_pairs) + len(paired_outcomes)} DPO pairs")

5. Running DPO Fine-Tuning
pip install trl transformers datasets

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load your collected preference pairs
dataset = load_dataset("json", data_files="dpo_pairs.jsonl")["train"]

dpo_config = DPOConfig(
    output_dir="./llama3-dpo-finetuned",
    beta=0.1,            # DPO temperature — lower = stronger preference alignment
    learning_rate=5e-7,  # Very low LR for alignment (much lower than SFT)
    per_device_train_batch_size=4,
    num_train_epochs=3,
    max_length=2048,
    fp16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,       # None → trl auto-creates a frozen copy for the KL reference
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # renamed to processing_class in newer trl releases
)

trainer.train()
trainer.save_model("./llama3-dpo-finetuned")
# The fine-tuned model now prefers answers your users prefer!

Frequently Asked Questions
How many preference pairs do I need for DPO fine-tuning?
Start with 1,000 high-quality pairs (using human-corrected responses as "chosen"). You can see meaningful alignment improvements with as few as 500 pairs on a narrow domain. For general-purpose alignment like OpenAI's RLHF training, you need millions — but for a specialized use case (customer support bot, code assistant), 1,000-5,000 domain-specific pairs are highly effective.
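Since pair quality matters more than pair count, a quick sanity pass over the mined file pays off before any training run: exact duplicates and degenerate rows (chosen identical to rejected) dilute the preference signal. A sketch, assuming the dpo_pairs.jsonl format produced above:

```python
import json

def validate_dpo_pairs(path: str) -> list[dict]:
    """Load DPO pairs, dropping exact duplicates and degenerate rows."""
    seen, clean = set(), []
    with open(path) as f:
        for line in f:
            pair = json.loads(line)
            key = (pair["prompt"], pair["chosen"], pair["rejected"])
            if key in seen:
                continue  # exact duplicate adds no new signal
            if pair["chosen"].strip() == pair["rejected"].strip():
                continue  # degenerate: no preference to learn from
            seen.add(key)
            clean.append(pair)
    return clean
```

Run it on the mined file and compare the clean count against your target of 1,000+ before kicking off a training job.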
What's the difference between DPO and RLHF?
Reinforcement Learning from Human Feedback (RLHF) requires training a separate reward model, then using PPO to optimize the LLM against that reward: a complex, unstable three-stage pipeline. Direct Preference Optimization (DPO) is a more recent technique that achieves the same alignment goals by fine-tuning directly on preference pairs, with no reward model and no RL loop. DPO is simpler to implement, more stable to train, and achieves comparable results, which makes it the sensible default for most new alignment projects.
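"Fine-tuning directly on preference pairs" has a precise form. The DPO objective (Rafailov et al., 2023), for a prompt x with chosen response y_w and rejected response y_l, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here π_θ is the model being trained, π_ref is the frozen reference copy (the ref_model in the code above), σ is the sigmoid, and β is the same beta passed to DPOConfig: it scales how strongly the policy is pushed toward chosen over rejected relative to the reference model.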
Conclusion
The feedback loop is what separates AI products that improve from those that stagnate. The infrastructure is straightforward: a runs table, a feedback table, and a pipeline that periodically mines preference pairs from negative feedback with corrections. Run DPO fine-tuning when you accumulate 1,000+ pairs. Evaluate the new model on your golden test set. If it improves, deploy. Repeat. This is the data flywheel — and it's how today's leading AI products compound their performance advantage over time.
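That retrain-evaluate-deploy cycle can be condensed into a single orchestration step. This is only a sketch of the loop described above; every helper (count_new_pairs, run_dpo_finetune, evaluate_on_golden_set, deploy) is a hypothetical stand-in for your own pipeline:

```python
# One iteration of the data flywheel, with the pipeline stages injected as
# callables. All four helpers are hypothetical stand-ins for your own code.
MIN_PAIRS = 1000  # retrain once enough preference pairs have accumulated

def flywheel_step(count_new_pairs, run_dpo_finetune,
                  evaluate_on_golden_set, deploy) -> str:
    """Run one flywheel iteration and report what happened."""
    if count_new_pairs() < MIN_PAIRS:
        return "waiting"        # keep collecting feedback
    candidate = run_dpo_finetune()
    # Only ship the candidate if it beats the current model on the golden set
    if evaluate_on_golden_set(candidate) > evaluate_on_golden_set("current"):
        deploy(candidate)
        return "deployed"
    return "rejected"           # candidate did not beat the baseline
```

Scheduling this as a periodic job (nightly or weekly) is usually enough; the gate on the golden-set score is what keeps a bad batch of feedback from regressing production.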
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.