Argilla: Human Feedback Platform
Dec 30, 2025 • 20 min read
Fine-tuning isn't just about dumping text into a model. The models that consistently outperform on real user tasks are aligned to human preferences — trained not just to predict text, but to predict the text humans actually prefer. Argilla is the open-source platform for collecting these preference signals at scale: showing annotators two model outputs side-by-side, collecting their preferences, and exporting the results directly to HuggingFace Hub for DPO training.
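To make the target concrete before any tooling: the end product of this pipeline is a list of preference pairs. Here is a sketch of one record, using the `prompt`/`chosen`/`rejected` field names that DPO trainers such as TRL conventionally expect (the example texts are invented):

```python
# A single DPO training example: one prompt, the preferred ("chosen")
# response, and the dispreferred ("rejected") response.
dpo_example = {
    "prompt": "Explain the CAP theorem in simple terms",
    "chosen": "The CAP theorem says a distributed system can guarantee at most "
              "two of: Consistency, Availability, and Partition tolerance...",
    "rejected": "CAP theorem is about databases. It has three letters.",
}

# DPO trainers consume lists of such records
assert set(dpo_example) == {"prompt", "chosen", "rejected"}
```

Everything below is about producing thousands of these records with trustworthy human judgments behind each one.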
1. Argilla vs. Alternative Labeling Tools
| Tool | Best For | LLM Integration |
|---|---|---|
| Argilla | LLM preference data, SFT datasets, NER, text classification | Native HuggingFace integration, FeedbackDataset for DPO |
| Label Studio | General-purpose: images, audio, text, time series | Manual export required, no native HF Hub sync |
| Prodigy | Expert annotation with active learning built-in | Commercial license, Python-native, good for NLP tasks |
| Scale AI | High-volume, managed crowdworker annotation | Very expensive, used by major frontier AI labs |
2. Setup: Argilla on HuggingFace Spaces
# Option 1: Deploy on HuggingFace Spaces (easiest)
# 1. Go to huggingface.co/new-space
# 2. Choose Docker template → search "Argilla"
# 3. Set space to Private, deploy
# → Argilla UI available at https://your-username-your-space-name.hf.space
# Option 2: Local Docker deployment
docker run -d --name argilla \
    -p 6900:6900 \
    -e "ARGILLA_HOME_PATH=/var/lib/argilla" \
    -v argilla_data:/var/lib/argilla \
    argilla/argilla-server:latest
# Note: the server expects a running Elasticsearch instance; for a
# self-contained local trial, the argilla/argilla-quickstart image bundles one.
# Access at http://localhost:6900
# Default credentials: username=argilla, password=1234 (change immediately!)
# Python SDK
pip install argilla
import argilla as rg
# Connect to your Argilla instance
rg.init(
    api_url="https://your-username-argilla.hf.space",
    api_key="your-api-key",  # Found in Argilla UI → Settings → API key
    workspace="my-team",
)

3. Creating a DPO Preference Dataset
import argilla as rg
# Define the annotation schema
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt", title="User Prompt", use_markdown=True),
        rg.TextField(name="response_a", title="Response A (Model v1)", use_markdown=True),
        rg.TextField(name="response_b", title="Response B (Model v2)", use_markdown=True),
    ],
    questions=[
        # Primary preference question
        rg.LabelQuestion(
            name="preference",
            title="Which response is better overall?",
            labels=["Response A", "Response B", "Both Good", "Both Bad"],
            required=True,
        ),
        # Optional quality dimensions
        rg.RatingQuestion(
            name="response_a_quality",
            title="Rate Response A (1=Poor, 5=Excellent)",
            values=[1, 2, 3, 4, 5],
            required=False,
        ),
        rg.RatingQuestion(
            name="response_b_quality",
            title="Rate Response B (1=Poor, 5=Excellent)",
            values=[1, 2, 3, 4, 5],
            required=False,
        ),
        # Ask annotators to explain their reasoning
        rg.TextQuestion(
            name="reasoning",
            title="Why did you prefer this response? (Optional)",
            required=False,
        ),
    ],
    guidelines="""
## Annotation Guidelines
- Prefer responses that are accurate and factually correct
- Prefer concise responses over verbose ones where brevity doesn't lose information
- Prefer responses that directly address the user's actual intent
- Mark "Both Bad" if both responses contain factual errors
- Your reasoning helps us understand quality patterns — please explain when time permits
""",
)
# Push to HuggingFace Hub for versioning; to open the dataset for annotators
# in the Argilla UI, also push it to your instance with dataset.push_to_argilla()
dataset.push_to_huggingface("my-org/llm-preference-data-v1", private=True)

4. Populating with Model Outputs
from openai import OpenAI
import argilla as rg
client = OpenAI()
# Sample prompts from your production logs (real user queries)
prompts = [
    "How do I center a div in CSS?",
    "Explain the CAP theorem in simple terms",
    "What's the difference between REST and GraphQL?",
    # ... your actual user queries
]
def get_two_responses(prompt: str, model_a: str = "gpt-4o", model_b: str = "gpt-4o-mini"):
    """Generate two responses from different models (or same model, different prompts)."""
    resp_a = client.chat.completions.create(
        model=model_a,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    resp_b = client.chat.completions.create(
        model=model_b,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return resp_a, resp_b
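One caveat before building records: annotators tend to favor whichever response appears first (position bias). A minimal sketch of one mitigation: randomly swap which model's output is shown as Response A, and record the assignment in metadata so the export step can map preferences back to models. The `randomize_sides` helper is illustrative, not part of Argilla:

```python
import random

def randomize_sides(resp_a: str, resp_b: str, model_a: str, model_b: str):
    """Randomly swap the two responses to counter position bias.

    Returns the (left, right) texts plus metadata recording which
    model's output ended up on which side.
    """
    if random.random() < 0.5:
        return resp_a, resp_b, {"model_a": model_a, "model_b": model_b}
    return resp_b, resp_a, {"model_a": model_b, "model_b": model_a}

left, right, meta = randomize_sides("text A", "text B", "gpt-4o", "gpt-4o-mini")
# meta tells you which underlying model produced the text shown on each side
assert {meta["model_a"], meta["model_b"]} == {"gpt-4o", "gpt-4o-mini"}
```

Store `meta` in the record's `metadata` dict so the de-randomization survives export.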
# Create annotation records
records = []
for prompt in prompts:
    resp_a, resp_b = get_two_responses(prompt)
    records.append(rg.FeedbackRecord(
        fields={
            "prompt": prompt,
            "response_a": resp_a,
            "response_b": resp_b,
        },
        metadata={"model_a": "gpt-4o", "model_b": "gpt-4o-mini"},
    ))
# Load existing dataset and add records
existing_dataset = rg.FeedbackDataset.from_huggingface("my-org/llm-preference-data-v1")
existing_dataset.add_records(records)
print(f"Added {len(records)} annotation tasks")

5. Exporting Annotated Data for DPO Training
# After annotators have labeled records, export for DPO training
dataset = rg.FeedbackDataset.from_huggingface("my-org/llm-preference-data-v1")
annotated_records = dataset.filter_by(response_status=["submitted"])
dpo_pairs = []
for record in annotated_records:
    # Get the majority preference across annotators (ties resolve arbitrarily);
    # each response value object exposes the chosen label via .value
    preferences = [resp.values["preference"].value for resp in record.responses if resp.values]
    if not preferences:
        continue
    majority_vote = max(set(preferences), key=preferences.count)
    if majority_vote == "Response A":
        chosen, rejected = record.fields["response_a"], record.fields["response_b"]
    elif majority_vote == "Response B":
        chosen, rejected = record.fields["response_b"], record.fields["response_a"]
    else:
        continue  # Skip "Both Good"/"Both Bad" — ambiguous signal
    dpo_pairs.append({
        "prompt": record.fields["prompt"],
        "chosen": chosen,
        "rejected": rejected,
    })
print(f"Exported {len(dpo_pairs)} DPO training pairs")
# Save as HuggingFace Dataset for TRL DPO training
from datasets import Dataset
hf_dataset = Dataset.from_list(dpo_pairs)
hf_dataset.push_to_hub("my-org/dpo-preference-pairs-v1", private=True)

6. Active Learning: Smart Sample Selection
# Don't annotate randomly — prioritize samples where model is most uncertain
# Use the divergence between model outputs as a proxy for uncertainty
import numpy as np
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
def annotation_priority_score(prompt: str, resp_a: str, resp_b: str) -> float:
    """
    Prioritize samples where responses are very different (high divergence).
    These are the most informative for the model to learn from.
    """
    emb_a = embedder.encode(resp_a)
    emb_b = embedder.encode(resp_b)
    # Cosine similarity — lower = more different = higher priority to annotate
    similarity = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return 1 - similarity  # Return divergence score (higher = more different)
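Since the score is just one minus cosine similarity, you can sanity-check the logic without loading an embedding model by substituting toy vectors. A pure-Python stand-in (the `divergence` helper is illustrative):

```python
import math

def divergence(vec_a, vec_b) -> float:
    """1 - cosine similarity: 0.0 for identical directions, 1.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1 - dot / (norm_a * norm_b)

assert divergence([1.0, 0.0], [1.0, 0.0]) == 0.0  # identical → lowest priority
assert divergence([1.0, 0.0], [0.0, 1.0]) == 1.0  # orthogonal → highest priority
```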
# Sort annotation tasks by priority before adding to Argilla
records_with_priority = [
    (record, annotation_priority_score(record.fields["prompt"],
                                       record.fields["response_a"],
                                       record.fields["response_b"]))
    for record in records
]
records_with_priority.sort(key=lambda x: x[1], reverse=True)  # highest divergence first
# Add only top 500 most informative samples to Argilla
top_records = [r for r, _ in records_with_priority[:500]]

Frequently Asked Questions
How many annotations do I need per record for reliable DPO data?
Published preference-modeling work from labs like Anthropic and Meta suggests that 3+ annotators per record significantly improves data quality over single-annotator labeling. With 3 annotators you can compute inter-annotator agreement and discard ambiguous records (those where annotators disagree). If budget is limited, a single expert annotator on high-quality domain-specific tasks usually beats three low-quality crowdworkers.
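The agreement filter described above can be as simple as requiring a qualified majority per record before a pair enters the training set. A stdlib sketch (the helper name and threshold are illustrative):

```python
from collections import Counter

def majority_with_agreement(votes: list[str], min_agreement: float = 2 / 3):
    """Return the majority label, or None if annotators disagree too much."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Two of three annotators agree: keep the record
assert majority_with_agreement(["Response A", "Response A", "Response B"]) == "Response A"
# Three-way split: discard as ambiguous
assert majority_with_agreement(["Response A", "Response B", "Both Bad"]) is None
```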
Can Argilla be used for tasks beyond preference data?
Yes — Argilla supports NER (Named Entity Recognition), text classification, question-answering verification, and span annotation. A common workflow: use Argilla to label customer support tickets for intent classification, export the labeled dataset, and fine-tune a small BERT-based classifier on it — eliminating the need for expensive LLM calls for every classification task.
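For the intent-classification workflow just described, the export step has the same shape as the DPO export: collapse annotator responses per record into a single (text, label) pair. A minimal sketch with record structure simplified to plain dicts (the data is invented):

```python
# Simplified stand-ins for annotated support-ticket records
annotated = [
    {"text": "Where is my refund?", "labels": ["billing", "billing", "billing"]},
    {"text": "App crashes on login", "labels": ["bug", "bug", "account"]},
]

training_pairs = []
for rec in annotated:
    # Keep the most common label per record
    label = max(set(rec["labels"]), key=rec["labels"].count)
    training_pairs.append({"text": rec["text"], "label": label})

# → [{'text': 'Where is my refund?', 'label': 'billing'},
#    {'text': 'App crashes on login', 'label': 'bug'}]
```

The resulting pairs drop straight into any standard text-classification fine-tuning loop.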
Conclusion
Argilla transforms the feedback collection process from ad-hoc to systematic. The FeedbackDataset format, direct HuggingFace Hub integration, and Python SDK make it the most accessible open-source platform for building the preference datasets needed for DPO alignment. For teams fine-tuning domain-specific LLMs, even 1,000 high-quality annotated preference pairs collected through Argilla can meaningfully shift model behavior toward your users' preferences.
Vivek · AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.