
Synthetic Data: Infinite Tokens

Dec 30, 2025 • 20 min read

The world ran out of high-quality internet text. GPT-4 was already trained on a significant fraction of the public web, and future models can't simply scrape more. The solution the industry has converged on: use powerful frontier models to generate synthetic training data for smaller, specialized models. GPT-4 teaches Phi-3. Llama 3.1 405B teaches Llama 3.1 8B. This is how modern open-source models punch far above their parameter count.

1. Why Synthetic Data Works

Counterintuitively, a small model trained on high-quality synthetic data often outperforms the same model trained on large amounts of low-quality real data. The quality of the teacher signal matters more than volume. A recurring finding in Meta's Llama reports: it's not the data quantity but the data quality and distribution that determine fine-tuned model quality.

2. Self-Instruct: Bootstrapping from Scratch

The Self-Instruct paper (Wang et al., 2023) showed you can bootstrap an instruction-following dataset using a tiny seed set of hand-written examples:

from openai import OpenAI

client = OpenAI()

SEED_EXAMPLES = [
    {"instruction": "Summarize this article in 3 bullet points.", "input": "...", "output": "..."},
    {"instruction": "Write a Python function to sort a list.", "input": "", "output": "..."},
    # 5-10 high-quality human-written examples
]

def generate_new_instructions(seed_pool: list, n: int = 20) -> list:
    """Use GPT-4o to generate new diverse instruction examples."""
    import json
    import random

    # Sample a few seed examples as templates
    samples = random.sample(seed_pool, min(4, len(seed_pool)))
    formatted_seeds = "\n".join(
        f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput: {ex['output']}"
        for ex in samples
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are generating diverse training examples.
Here are example instruction-input-output triplets:
{formatted_seeds}

Generate {n} NEW diverse instruction-input-output triplets as a JSON object
with an "examples" key holding a list of triplets in the same format.
Make them varied in task type, difficulty, and domain. Never repeat existing instructions."""
        }],
        response_format={"type": "json_object"},
    )

    # Parse the model's JSON response into a list of example dicts
    return json.loads(response.choices[0].message.content)["examples"]

# Run iteratively: seed → generate → filter → add to pool → repeat
all_examples = SEED_EXAMPLES.copy()
for round_num in range(10):
    new_examples = generate_new_instructions(all_examples, n=50)
    all_examples.extend(new_examples)
    print(f"Round {round_num+1}: Total examples = {len(all_examples)}")
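The "filter" step in the loop's comment is doing real work: Self-Instruct only keeps a generated instruction if it is sufficiently different from everything already in the pool (the paper uses a ROUGE-L threshold of 0.7). A minimal stand-in using difflib's SequenceMatcher instead of ROUGE — an approximation of the paper's metric, not the exact implementation:

```python
from difflib import SequenceMatcher

def is_novel(candidate: str, pool: list[dict], threshold: float = 0.7) -> bool:
    """Keep a candidate instruction only if its similarity to every existing
    instruction stays below the threshold (rough stand-in for ROUGE-L < 0.7)."""
    for ex in pool:
        ratio = SequenceMatcher(None, candidate.lower(),
                                ex["instruction"].lower()).ratio()
        if ratio >= threshold:
            return False
    return True

# Filter newly generated examples before adding them to the pool
pool = [{"instruction": "Summarize this article in 3 bullet points."}]
new = [{"instruction": "Summarize this article in 3 bullet points."},
       {"instruction": "Translate this sentence into French."}]
kept = [ex for ex in new if is_novel(ex["instruction"], pool)]
```

Slot this in between `generate_new_instructions` and `all_examples.extend` so duplicates never enter the pool.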

3. Evol-Instruct: WizardLM's Complexity Amplifier

Evol-Instruct (used to create WizardLM datasets) doesn't just generate new examples — it takes existing simple examples and systematically makes them harder. This creates a difficulty curriculum for training:

EVOLUTION_STRATEGIES = [
    "add_constraints",        # Add specific edge cases to handle
    "deepen",                 # Add more reasoning steps required
    "concretize",             # Replace abstract descriptions with concrete examples
    "increase_reasoning",     # Require step-by-step explanation
    "complicate_input",       # Make the input scenario more complex
]

def evolve_instruction(original: str, strategy: str) -> str:
    prompts = {
        "add_constraints": f"""Take this instruction and make it harder by adding specific constraints:
Original: "{original}"
Add 2-3 specific constraints (error handling, edge cases, performance requirements).
Return ONLY the enhanced instruction.""",

        "deepen": f"""Take this instruction and require deeper understanding:
Original: "{original}"
Require the solution to also explain WHY each step works.
Return ONLY the enhanced instruction, starting it with "Explain step-by-step...".""",

        "concretize": f"""Take this instruction and replace abstract descriptions with concrete examples:
Original: "{original}"
Ground each vague requirement in a specific, concrete scenario.
Return ONLY the enhanced instruction.""",

        "increase_reasoning": f"""Enhance this instruction to require multi-step reasoning:
Original: "{original}"
Make it require reasoning across multiple domains or concepts.
Return ONLY the enhanced instruction.""",

        "complicate_input": f"""Take this instruction and make the input scenario more complex:
Original: "{original}"
Introduce messier, more realistic input data that the solution must handle.
Return ONLY the enhanced instruction.""",
    }
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompts[strategy]}]
    )
    return response.choices[0].message.content

# Example evolution chain:
step0 = "Write code to calculate factorial."
step1 = evolve_instruction(step0, "add_constraints")
# → "Write code to calculate factorial, handling integer overflow and negative inputs, using tail recursion."
step2 = evolve_instruction(step1, "deepen")
# → "Write and explain step-by-step code to calculate factorial, handling integer overflow..."
# → Each step is increasingly valuable training data
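In practice you run evolution over the whole dataset, picking a random strategy each round and keeping every intermediate version as its own example. A hypothetical driver — the depth, seeding, and keep-all-steps policy here are my choices, not part of the original WizardLM recipe:

```python
import random

def evolve_dataset(instructions, evolve_fn, strategies, depth=2, seed=0):
    """Apply `depth` rounds of randomly chosen evolution strategies to every
    instruction, keeping each intermediate version as a training example."""
    rng = random.Random(seed)
    evolved = list(instructions)       # depth-0 originals stay in the pool
    frontier = list(instructions)
    for _ in range(depth):
        # Evolve every instruction in the current frontier one step further
        frontier = [evolve_fn(inst, rng.choice(strategies)) for inst in frontier]
        evolved.extend(frontier)
    return evolved

# Plug in evolve_instruction from above:
# dataset = evolve_dataset(simple_instructions, evolve_instruction,
#                          list(EVOLUTION_STRATEGIES))
```

Keeping the intermediate steps gives the fine-tuned model a difficulty curriculum rather than only the hardest variants.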

4. Verification Loops: Preventing Model Collapse

Training on unverified synthetic data is dangerous. If GPT-4 generates a wrong answer and you train on it, you bake that error into your model; repeat this across generations of models trained on their own unfiltered outputs and quality degrades steadily — the failure mode known as "model collapse." The solution: always run a verification pass:

# Verification strategy depends on task type:

# For CODE tasks: execute and check output
def verify_code_example(instruction: str, generated_code: str) -> bool:
    """Run the generated code and verify it doesn't crash.
    Note: execute untrusted generated code in a sandbox, not on your host."""
    try:
        import subprocess
        result = subprocess.run(
            ["python", "-c", generated_code],
            capture_output=True, timeout=10
        )
        return result.returncode == 0  # Accept if code runs without errors
    except Exception:
        return False  # Reject broken examples

# For FACTUAL tasks: LLM-as-Judge verification
def verify_factual_example(question: str, answer: str) -> float:
    """Score the answer quality using a separate LLM."""
    import json
    response = client.chat.completions.create(
        model="gpt-4o",  # Use a different/better model as critic
        messages=[{
            "role": "user",
            "content": f"""Rate this answer on a scale of 1-10 for accuracy and completeness.
Question: {question}
Answer: {answer}

Output ONLY a JSON object: {{"score": <1-10>, "reason": "<brief>"}}"""
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["score"]

# For REASONING tasks: consistency check (ask multiple times, compare)
def verify_reasoning_consistency(question: str, n: int = 3) -> bool:
    answers = [generate_answer(question) for _ in range(n)]
    # If answers are highly consistent → likely correct
    # If answers vary widely → uncertain, reject
    return check_consistency(answers)
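`check_consistency` above is left as a stub. A minimal sketch that accepts an example only when a clear majority of sampled answers agree after light normalization — real pipelines usually compare extracted final answers (a number, a multiple-choice letter) rather than full response texts:

```python
from collections import Counter

def check_consistency(answers: list[str], min_agreement: float = 0.6) -> bool:
    """Accept only if a large enough fraction of sampled answers agree
    after normalization (lowercased, whitespace-stripped)."""
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return count / len(normalized) >= min_agreement

print(check_consistency(["42", "42 ", "41"]))  # 2/3 agree -> True
print(check_consistency(["42", "41", "40"]))   # no majority -> False
```

The `min_agreement` threshold is a knob to tune per task: stricter for math, looser for open-ended generation.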

5. Quality Filtering Pipeline

import json
from datasets import Dataset

def filter_dataset(raw_examples: list) -> list:
    """Apply quality filters to synthetic examples."""
    filtered = []
    
    for ex in raw_examples:
        # Filter 1: Length check — avoid very short or truncated responses
        if len(ex["output"].split()) < 20 or len(ex["instruction"].split()) < 5:
            continue
        
        # Filter 2: Deduplication — skip if too similar to existing examples
        if is_near_duplicate(ex["instruction"], [e["instruction"] for e in filtered]):
            continue
        
        # Filter 3: Quality score threshold
        score = verify_factual_example(ex["instruction"], ex["output"])
        if score < 7:  # Reject low-quality examples
            continue
        
        # Filter 4: Harmful content check
        if contains_harmful_content(ex["output"]):
            continue
        
        filtered.append({**ex, "quality_score": score})
    
    # Sort by quality score and take top 80%
    filtered.sort(key=lambda x: x["quality_score"], reverse=True)
    return filtered[:int(len(filtered) * 0.8)]
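`is_near_duplicate` (and `contains_harmful_content`, for which you would typically call a moderation API) is a stub in the pipeline above. Token-level Jaccard similarity is a cheap, scalable sketch for instruction dedup — the 0.8 threshold is a starting point to tune, not an established value:

```python
def is_near_duplicate(candidate: str, existing: list[str],
                      threshold: float = 0.8) -> bool:
    """Flag a candidate whose word-level Jaccard similarity to any existing
    instruction meets the threshold."""
    cand_tokens = set(candidate.lower().split())
    for other in existing:
        other_tokens = set(other.lower().split())
        union = cand_tokens | other_tokens
        if union and len(cand_tokens & other_tokens) / len(union) >= threshold:
            return True
    return False

print(is_near_duplicate("Sort a list in Python.",
                        ["Sort a list in Python."]))      # True
print(is_near_duplicate("Sort a list in Python.",
                        ["Explain HTTP status codes."]))  # False
```

For large pools, swap the pairwise loop for MinHash or embedding-based nearest-neighbor search.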

6. Uploading to Hugging Face for Fine-Tuning

from datasets import Dataset
from huggingface_hub import login

login(token="hf_...")  # HuggingFace API token

# Format for most fine-tuning frameworks (Axolotl, Unsloth, etc.)
formatted = [{
    "messages": [
        {"role": "user", "content": ex["instruction"] + "\n" + ex.get("input", "")},
        {"role": "assistant", "content": ex["output"]}
    ]
} for ex in filtered_examples]

dataset = Dataset.from_list(formatted)

# Hold out a small test split, then push both splits to one repo
splits = dataset.train_test_split(test_size=0.05)
splits.push_to_hub("your-username/my-synthetic-dataset", private=True)
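Some trainers expect a single `text` field instead of chat messages. A sketch that flattens the messages format with simple role tags — the `<|user|>` tags here are illustrative placeholders, not any specific model's format; use your target model's real chat template:

```python
def to_text(example: dict) -> dict:
    """Flatten a chat-format example into one training string.
    The role tags below are placeholders; swap in your model's chat template."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in example["messages"]]
    return {"text": "\n".join(parts)}

# Apply row-wise with datasets: dataset = dataset.map(to_text)
example = {"messages": [
    {"role": "user", "content": "Write a haiku about data."},
    {"role": "assistant", "content": "Tokens fall like rain"},
]}
flat = to_text(example)["text"]
```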

Frequently Asked Questions

Is it legal to use GPT-4 to generate training data for other models?

OpenAI's Terms of Use (as of 2024) prohibit using OpenAI model outputs to develop models that compete with OpenAI, and Anthropic's commercial terms carry a similar restriction for Claude. If your goal is a competitor to those products, generating its training data this way is disallowed. For commercial synthetic data generation, the safer route is an open-weight teacher: the Llama 3.1 license explicitly permits using its outputs to train and improve other models, and methods like Magpie are built around self-hosted open models. Always check the current terms of whichever provider you use.

How much synthetic data do I need?

Quality matters far more than quantity. Meta reports that Llama 3.1's instruction tuning drew on millions of examples, a large share of them synthetic data generated by the 405B model. But smaller specialized models often achieve excellent results with 10,000-50,000 high-quality examples for a narrow task domain. Start small, evaluate, and scale only if needed.

Conclusion

Synthetic data engineering is now a core competency for teams building specialized LLM applications. The Self-Instruct and Evol-Instruct patterns let you bootstrap large, diverse training sets from minimal human effort. The critical discipline is verification — never train on unvalidated synthetic data. Executed well, a verified synthetic dataset can turn a 7B parameter model into a powerful specialist that outperforms 70B general-purpose models on your specific task.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
