Argilla: Human Feedback Platform
Dec 30, 2025 • 20 min read
Fine-tuning isn't just about dumping text into a model. The models that consistently outperform on real user tasks are aligned to human preferences — trained not just to predict text, but to predict the text humans actually prefer. Argilla is the open-source platform for collecting these preference signals at scale: showing annotators two model outputs side-by-side, collecting their preferences, and exporting the results directly to HuggingFace Hub for DPO training.
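To make the target concrete before any tooling: the end product of this pipeline is a list of preference pairs. Here is a sketch of one record, using the `prompt`/`chosen`/`rejected` field names that DPO trainers such as TRL conventionally expect (the example texts are invented):

```python
# A single DPO training example: one prompt, the preferred ("chosen")
# response, and the dispreferred ("rejected") response.
dpo_example = {
    "prompt": "Explain the CAP theorem in simple terms",
    "chosen": "The CAP theorem says a distributed system can guarantee at most "
              "two of: Consistency, Availability, and Partition tolerance...",
    "rejected": "CAP theorem is about databases. It has three letters.",
}

# DPO trainers consume lists of such records
assert set(dpo_example) == {"prompt", "chosen", "rejected"}
```

Everything below is about producing thousands of these records with trustworthy human judgments behind each one.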
1. Argilla vs. Alternative Labeling Tools
| Tool | Best For | LLM Integration |
|---|---|---|
| Argilla | LLM preference data, SFT datasets, NER, text classification | Native HuggingFace integration, FeedbackDataset for DPO |
| Label Studio | General-purpose: images, audio, text, time series | Manual export required, no native HF Hub sync |
| Prodigy | Expert annotation with active learning built-in | Commercial license, Python-native, good for NLP tasks |
| Scale AI | High-volume, managed crowdworker annotation | Very expensive, used by major frontier AI labs |
2. Setup: Argilla on HuggingFace Spaces
# Option 1: Deploy on HuggingFace Spaces (easiest)
# 1. Go to huggingface.co/new-space
# 2. Choose Docker template → search "Argilla"
# 3. Set space to Private, deploy
# → Argilla UI available at https://your-username-your-space-name.hf.space
# Option 2: Local Docker deployment
docker run -d --name argilla \
    -p 6900:6900 \
    -e "ARGILLA_HOME_PATH=/var/lib/argilla" \
    -v argilla_data:/var/lib/argilla \
    argilla/argilla-server:latest
# Note: the server expects a running Elasticsearch instance; for a
# self-contained local trial, the argilla/argilla-quickstart image bundles one.
# Access at http://localhost:6900
# Default credentials: username=argilla, password=1234 (change immediately!)
# Python SDK
pip install argilla
import argilla as rg
# Connect to your Argilla instance
rg.init(
    api_url="https://your-username-argilla.hf.space",
    api_key="your-api-key",  # Found in Argilla UI → Settings → API key
    workspace="my-team",
)

3. Creating a DPO Preference Dataset
import argilla as rg
# Define the annotation schema
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt", title="User Prompt", use_markdown=True),
        rg.TextField(name="response_a", title="Response A (Model v1)", use_markdown=True),
        rg.TextField(name="response_b", title="Response B (Model v2)", use_markdown=True),
    ],
    questions=[
        # Primary preference question
        rg.LabelQuestion(
            name="preference",
            title="Which response is better overall?",
            labels=["Response A", "Response B", "Both Good", "Both Bad"],
            required=True,
        ),
        # Optional quality dimensions
        rg.RatingQuestion(
            name="response_a_quality",
            title="Rate Response A (1=Poor, 5=Excellent)",
            values=[1, 2, 3, 4, 5],
            required=False,
        ),
        rg.RatingQuestion(
            name="response_b_quality",
            title="Rate Response B (1=Poor, 5=Excellent)",
            values=[1, 2, 3, 4, 5],
            required=False,
        ),
        # Ask annotators to explain their reasoning
        rg.TextQuestion(
            name="reasoning",
            title="Why did you prefer this response? (Optional)",
            required=False,
        ),
    ],
    guidelines="""
## Annotation Guidelines
- Prefer responses that are accurate and factually correct
- Prefer concise responses over verbose ones where brevity doesn't lose information
- Prefer responses that directly address the user's actual intent
- Mark "Both Bad" if both responses contain factual errors
- Your reasoning helps us understand quality patterns — please explain when time permits
""",
)
# Push to HuggingFace Hub for versioning; to open the dataset for annotators
# in the Argilla UI, also push it to your instance with dataset.push_to_argilla()
dataset.push_to_huggingface("my-org/llm-preference-data-v1", private=True)

4. Populating with Model Outputs
from openai import OpenAI
import argilla as rg
client = OpenAI()
# Sample prompts from your production logs (real user queries)
prompts = [
    "How do I center a div in CSS?",
    "Explain the CAP theorem in simple terms",
    "What's the difference between REST and GraphQL?",
    # ... your actual user queries
]
def get_two_responses(prompt: str, model_a: str = "gpt-4o", model_b: str = "gpt-4o-mini"):
    """Generate two responses from different models (or same model, different prompts)."""
    resp_a = client.chat.completions.create(
        model=model_a,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    resp_b = client.chat.completions.create(
        model=model_b,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return resp_a, resp_b
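One caveat before building records: annotators tend to favor whichever response appears first (position bias). A minimal sketch of one mitigation: randomly swap which model's output is shown as Response A, and record the assignment in metadata so the export step can map preferences back to models. The `randomize_sides` helper is illustrative, not part of Argilla:

```python
import random

def randomize_sides(resp_a: str, resp_b: str, model_a: str, model_b: str):
    """Randomly swap the two responses to counter position bias.

    Returns the (left, right) texts plus metadata recording which
    model's output ended up on which side.
    """
    if random.random() < 0.5:
        return resp_a, resp_b, {"model_a": model_a, "model_b": model_b}
    return resp_b, resp_a, {"model_a": model_b, "model_b": model_a}

left, right, meta = randomize_sides("text A", "text B", "gpt-4o", "gpt-4o-mini")
# meta tells you which underlying model produced the text shown on each side
assert {meta["model_a"], meta["model_b"]} == {"gpt-4o", "gpt-4o-mini"}
```

Store `meta` in the record's `metadata` dict so the de-randomization survives export.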
# Create annotation records
records = []
for prompt in prompts:
    resp_a, resp_b = get_two_responses(prompt)
    records.append(rg.FeedbackRecord(
        fields={
            "prompt": prompt,
            "response_a": resp_a,
            "response_b": resp_b,
        },
        metadata={"model_a": "gpt-4o", "model_b": "gpt-4o-mini"},
    ))
# Load existing dataset and add records
existing_dataset = rg.FeedbackDataset.from_huggingface("my-org/llm-preference-data-v1")
existing_dataset.add_records(records)
print(f"Added {len(records)} annotation tasks")

5. Exporting Annotated Data for DPO Training
# After annotators have labeled records, export for DPO training
dataset = rg.FeedbackDataset.from_huggingface("my-org/llm-preference-data-v1")
annotated_records = dataset.filter_by(response_status=["submitted"])
dpo_pairs = []
for record in annotated_records:
    # Get the majority preference across annotators (ties resolve arbitrarily);
    # each response value object exposes the chosen label via .value
    preferences = [resp.values["preference"].value for resp in record.responses if resp.values]
    if not preferences:
        continue
    majority_vote = max(set(preferences), key=preferences.count)
    if majority_vote == "Response A":
        chosen, rejected = record.fields["response_a"], record.fields["response_b"]
    elif majority_vote == "Response B":
        chosen, rejected = record.fields["response_b"], record.fields["response_a"]
    else:
        continue  # Skip "Both Good"/"Both Bad" — ambiguous signal
    dpo_pairs.append({
        "prompt": record.fields["prompt"],
        "chosen": chosen,
        "rejected": rejected,
    })
print(f"Exported {len(dpo_pairs)} DPO training pairs")
# Save as HuggingFace Dataset for TRL DPO training
from datasets import Dataset
hf_dataset = Dataset.from_list(dpo_pairs)
hf_dataset.push_to_hub("my-org/dpo-preference-pairs-v1", private=True)

6. Active Learning: Smart Sample Selection
# Don't annotate randomly — prioritize samples where model is most uncertain
# Use the divergence between model outputs as a proxy for uncertainty
import numpy as np
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
def annotation_priority_score(prompt: str, resp_a: str, resp_b: str) -> float:
    """
    Prioritize samples where responses are very different (high divergence).
    These are the most informative for the model to learn from.
    """
    emb_a = embedder.encode(resp_a)
    emb_b = embedder.encode(resp_b)
    # Cosine similarity — lower = more different = higher priority to annotate
    similarity = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return 1 - similarity  # Return divergence score (higher = more different)
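Since the score is just one minus cosine similarity, you can sanity-check the logic without loading an embedding model by substituting toy vectors. A pure-Python stand-in (the `divergence` helper is illustrative):

```python
import math

def divergence(vec_a, vec_b) -> float:
    """1 - cosine similarity: 0.0 for identical directions, 1.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1 - dot / (norm_a * norm_b)

assert divergence([1.0, 0.0], [1.0, 0.0]) == 0.0  # identical → lowest priority
assert divergence([1.0, 0.0], [0.0, 1.0]) == 1.0  # orthogonal → highest priority
```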
# Sort annotation tasks by priority before adding to Argilla
records_with_priority = [
    (record, annotation_priority_score(record.fields["prompt"],
                                       record.fields["response_a"],
                                       record.fields["response_b"]))
    for record in records
]
records_with_priority.sort(key=lambda x: x[1], reverse=True)  # highest divergence first
# Add only top 500 most informative samples to Argilla
top_records = [r for r, _ in records_with_priority[:500]]

Frequently Asked Questions
How many annotations do I need per record for reliable DPO data?
Published preference-modeling work from labs like Anthropic and Meta suggests that 3+ annotators per record significantly improves data quality over single-annotator labeling. With 3 annotators you can compute inter-annotator agreement and discard ambiguous records (those where annotators disagree). If budget is limited, a single expert annotator on high-quality domain-specific tasks usually beats three low-quality crowdworkers.
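The agreement filter described above can be as simple as requiring a qualified majority per record before a pair enters the training set. A stdlib sketch (the helper name and threshold are illustrative):

```python
from collections import Counter

def majority_with_agreement(votes: list[str], min_agreement: float = 2 / 3):
    """Return the majority label, or None if annotators disagree too much."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Two of three annotators agree: keep the record
assert majority_with_agreement(["Response A", "Response A", "Response B"]) == "Response A"
# Three-way split: discard as ambiguous
assert majority_with_agreement(["Response A", "Response B", "Both Bad"]) is None
```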
Can Argilla be used for tasks beyond preference data?
Yes — Argilla supports NER (Named Entity Recognition), text classification, question-answering verification, and span annotation. A common workflow: use Argilla to label customer support tickets for intent classification, export the labeled dataset, and fine-tune a small BERT-based classifier on it — eliminating the need for expensive LLM calls for every classification task.
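For the intent-classification workflow just described, the export step has the same shape as the DPO export: collapse annotator responses per record into a single (text, label) pair. A minimal sketch with record structure simplified to plain dicts (the data is invented):

```python
# Simplified stand-ins for annotated support-ticket records
annotated = [
    {"text": "Where is my refund?", "labels": ["billing", "billing", "billing"]},
    {"text": "App crashes on login", "labels": ["bug", "bug", "account"]},
]

training_pairs = []
for rec in annotated:
    # Keep the most common label per record
    label = max(set(rec["labels"]), key=rec["labels"].count)
    training_pairs.append({"text": rec["text"], "label": label})

# → [{'text': 'Where is my refund?', 'label': 'billing'},
#    {'text': 'App crashes on login', 'label': 'bug'}]
```

The resulting pairs drop straight into any standard text-classification fine-tuning loop.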
Conclusion
Argilla transforms the feedback collection process from ad-hoc to systematic. The FeedbackDataset format, direct HuggingFace Hub integration, and Python SDK make it the most accessible open-source platform for building the preference datasets needed for DPO alignment. For teams fine-tuning domain-specific LLMs, even 1,000 high-quality annotated preference pairs collected through Argilla can meaningfully shift model behavior toward your users' preferences.
Vivek · AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.