Synthetic Data: Infinite Tokens
Dec 30, 2025 • 20 min read
The world is running out of high-quality internet text. GPT-4 was already trained on a significant fraction of the public web, and future models can't simply scrape more. The solution the industry has converged on: use powerful frontier models to generate synthetic training data for smaller, specialized models. GPT-4 taught Phi-3; Meta distilled Llama 3.1 405B outputs into Llama 3.1 8B. This is how modern open-source models punch far above their parameter count.
1. Why Synthetic Data Works
Counterintuitively, a small model trained on high-quality synthetic data often outperforms the same model trained on far larger amounts of low-quality real data. The quality of the teacher signal matters more than volume. The key insight from Meta AI's LIMA paper (Zhou et al., 2023): it's not the data quantity but the data quality and distribution that determine fine-tuned model performance.
2. Self-Instruct: Bootstrapping from Scratch
The Self-Instruct paper (Wang et al., 2023) showed you can bootstrap an instruction-following dataset using a tiny seed set of hand-written examples:
from openai import OpenAI

client = OpenAI()

SEED_EXAMPLES = [
    {"instruction": "Summarize this article in 3 bullet points.", "input": "...", "output": "..."},
    {"instruction": "Write a Python function to sort a list.", "input": "", "output": "..."},
    # 5-10 high-quality human-written examples
]
def generate_new_instructions(seed_pool: list, n: int = 20) -> list:
    """Use GPT-4o to generate new diverse instruction examples."""
    import json
    import random

    # Sample a few seed examples as in-context templates
    samples = random.sample(seed_pool, min(4, len(seed_pool)))
    formatted_seeds = "\n".join(
        f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput: {ex['output']}"
        for ex in samples
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are generating diverse training examples.
Here are example instruction-input-output triplets:

{formatted_seeds}

Generate {n} NEW diverse instruction-input-output triplets.
Make them varied in task type, difficulty, and domain. Never repeat existing instructions.
Return a JSON object of the form {{"examples": [{{"instruction": ..., "input": ..., "output": ...}}]}}.""",
        }],
        response_format={"type": "json_object"},
    )
    # json_object mode guarantees syntactically valid JSON; parse out the list
    return json.loads(response.choices[0].message.content)["examples"]
# Run iteratively: seed → generate → filter → add to pool → repeat
all_examples = SEED_EXAMPLES.copy()
for round_num in range(10):
    new_examples = generate_new_instructions(all_examples, n=50)
    all_examples.extend(new_examples)
    print(f"Round {round_num+1}: Total examples = {len(all_examples)}")

3. Evol-Instruct: WizardLM's Complexity Amplifier
Evol-Instruct (used to create WizardLM datasets) doesn't just generate new examples — it takes existing simple examples and systematically makes them harder. This creates a difficulty curriculum for training:
EVOLUTION_STRATEGIES = [
    "add_constraints",      # Add specific edge cases to handle
    "deepen",               # Require more reasoning steps
    "concretize",           # Replace abstract descriptions with concrete examples
    "increase_reasoning",   # Require step-by-step explanation
    "complicate_input",     # Make the input scenario more complex
]
def evolve_instruction(original: str, strategy: str) -> str:
    prompts = {
        "add_constraints": f"""Take this instruction and make it harder by adding specific constraints:
Original: "{original}"
Add 2-3 specific constraints (error handling, edge cases, performance requirements).
Return ONLY the enhanced instruction.""",
        "deepen": f"""Take this instruction and require deeper understanding:
Original: "{original}"
Require the solution to also explain WHY each step works.
Return ONLY the enhanced instruction, starting it with "Explain step-by-step...".""",
        "increase_reasoning": f"""Enhance this instruction to require multi-step reasoning:
Original: "{original}"
Make it require reasoning across multiple domains or concepts.
Return ONLY the enhanced instruction.""",
        # "concretize" and "complicate_input" prompts omitted for brevity
    }
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompts[strategy]}],
    )
    return response.choices[0].message.content
# Example evolution chain:
step0 = "Write code to calculate factorial."
step1 = evolve_instruction(step0, "add_constraints")
# → "Write code to calculate factorial, handling integer overflow and negative inputs, using tail recursion."
step2 = evolve_instruction(step1, "deepen")
# → "Write and explain step-by-step code to calculate factorial, handling integer overflow..."
# → Each step is increasingly valuable training data

4. Verification Loops: Preventing Model Collapse
Training on unverified synthetic data is dangerous. If GPT-4 generates a wrong answer and you train on it, you bake that error into your model; repeated over successive generations of models trained on model output, this compounding degradation is known as "model collapse." The solution: always run a verification pass:
# Verification strategy depends on task type.

# For CODE tasks: execute and check output
import subprocess

def verify_code_example(instruction: str, generated_code: str) -> bool:
    """Run the generated code and verify it doesn't crash.

    NOTE: only execute untrusted model output inside a sandbox or container.
    """
    try:
        result = subprocess.run(
            ["python", "-c", generated_code],
            capture_output=True, timeout=10,
        )
        return result.returncode == 0  # Accept if code runs without errors
    except Exception:
        return False  # Reject broken examples (including timeouts)
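As a quick sanity check of execution-based verification, here is a minimal, self-contained sketch of the same idea. It uses `sys.executable` instead of a bare `"python"`, which is more portable on systems where the interpreter binary has a different name:

```python
import subprocess
import sys

def runs_cleanly(code: str, timeout: int = 10) -> bool:
    """Return True if the snippet exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung snippets are rejected too

print(runs_cleanly("print(sum(range(10)))"))  # → True  (valid snippet)
print(runs_cleanly("def broken(:"))           # → False (syntax error)
```

Note that exit-code checking only proves the code runs, not that it is correct; pairing each example with a few generated unit tests is the stronger version of this check.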
# For FACTUAL tasks: LLM-as-Judge verification
import json

def verify_factual_example(question: str, answer: str) -> float:
    """Score the answer quality using a separate LLM."""
    response = client.chat.completions.create(
        model="gpt-4o",  # Use a different/stronger model as critic than the generator
        messages=[{
            "role": "user",
            "content": f"""Rate this answer on a scale of 1-10 for accuracy and completeness.
Question: {question}
Answer: {answer}
Output ONLY a JSON object: {{"score": <1-10>, "reason": "<brief>"}}""",
        }],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result["score"]
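Since the judge replies with plain JSON, the accept/reject decision reduces to a parse and a threshold. A minimal offline sketch with canned judge replies standing in for real model output (no API call involved; the 7-point cutoff is an assumption):

```python
import json

ACCEPT_THRESHOLD = 7  # reject anything the judge scores below 7/10

def accept_verdict(raw_judge_reply: str, threshold: int = ACCEPT_THRESHOLD) -> bool:
    """Parse the judge's JSON verdict and apply the score threshold."""
    verdict = json.loads(raw_judge_reply)
    return verdict["score"] >= threshold

# Canned replies standing in for real judge output:
good = '{"score": 9, "reason": "accurate and complete"}'
bad  = '{"score": 4, "reason": "misses the main point"}'

print(accept_verdict(good))  # → True
print(accept_verdict(bad))   # → False
```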
# For REASONING tasks: consistency check (sample multiple times, compare)
from collections import Counter

def verify_reasoning_consistency(question: str, n: int = 3) -> bool:
    # generate_answer() is assumed to query the generator model for a final answer
    answers = [generate_answer(question) for _ in range(n)]
    # If the sampled answers agree → likely correct
    # If they vary widely → uncertain, reject
    _, count = Counter(answers).most_common(1)[0]
    return count / n >= 0.67  # require a 2-of-3 style majority

5. Quality Filtering Pipeline
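The pipeline below calls an `is_near_duplicate` helper without defining it. One minimal way to implement it is word-level Jaccard overlap (the 0.7 threshold is an assumption; the Self-Instruct paper uses ROUGE-L similarity for the same purpose):

```python
def is_near_duplicate(candidate: str, existing: list, threshold: float = 0.7) -> bool:
    """Flag a candidate instruction whose word-level Jaccard overlap with
    any existing instruction meets or exceeds the threshold."""
    cand_words = set(candidate.lower().split())
    for other in existing:
        other_words = set(other.lower().split())
        union = cand_words | other_words
        if union and len(cand_words & other_words) / len(union) >= threshold:
            return True
    return False

print(is_near_duplicate("Sort a list in Python", ["Sort a list in python"]))    # → True
print(is_near_duplicate("Sort a list in Python", ["Write a haiku about rain"])) # → False
```

For large pools, swap the O(n²) scan for MinHash/LSH or embedding similarity; the interface stays the same.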
import json

# is_near_duplicate() and contains_harmful_content() are left to you:
# e.g. n-gram/ROUGE overlap for dedup, a moderation API for safety.

def filter_dataset(raw_examples: list) -> list:
    """Apply quality filters to synthetic examples."""
    filtered = []
    for ex in raw_examples:
        # Filter 1: Length check — avoid very short or truncated responses
        if len(ex["output"].split()) < 20 or len(ex["instruction"].split()) < 5:
            continue
        # Filter 2: Deduplication — skip if too similar to an already-kept example
        if is_near_duplicate(ex["instruction"], [e["instruction"] for e in filtered]):
            continue
        # Filter 3: Quality score threshold (LLM-as-Judge from section 4)
        score = verify_factual_example(ex["instruction"], ex["output"])
        if score < 7:  # Reject low-quality examples
            continue
        # Filter 4: Harmful content check
        if contains_harmful_content(ex["output"]):
            continue
        filtered.append({**ex, "quality_score": score})
    # Sort by quality score and keep the top 80%
    filtered.sort(key=lambda x: x["quality_score"], reverse=True)
    return filtered[:int(len(filtered) * 0.8)]

6. Uploading to Hugging Face for Fine-Tuning
from datasets import Dataset
from huggingface_hub import login

login(token="hf_...")  # Hugging Face API token

# Format for most fine-tuning frameworks (Axolotl, Unsloth, etc.)
formatted = [{
    "messages": [
        {"role": "user", "content": (ex["instruction"] + "\n" + ex.get("input", "")).strip()},
        {"role": "assistant", "content": ex["output"]},
    ]
} for ex in filtered_examples]

dataset = Dataset.from_list(formatted)

# Split train/test, then push both splits together as one DatasetDict
splits = dataset.train_test_split(test_size=0.05)
splits.push_to_hub("your-username/my-synthetic-dataset", private=True)

Frequently Asked Questions
Is it legal to use GPT-4 to generate training data for other models?
OpenAI's Terms of Service (as of 2024) prohibit using OpenAI model outputs to develop models that compete with OpenAI. If your goal is a competitor to OpenAI's products, this is disallowed, and other commercial providers (including Anthropic) have similar restrictions, so check each provider's terms. Using open-weight teacher models with permissive licenses (Llama 3.1, Mistral) or existing open synthetic datasets and recipes (like Magpie or Alpaca) is the safer route for commercial synthetic data generation.
How much synthetic data do I need?
Quality matters far more than quantity. Meta's Llama-3.1-8B-Instruct was supervised fine-tuned on approximately 10 million synthetic examples. But smaller specialized models often achieve excellent results with 10,000-50,000 high-quality examples for a narrow task domain. Start small, evaluate, and scale only if needed.
Conclusion
Synthetic data engineering is now a core competency for teams building specialized LLM applications. The Self-Instruct and Evol-Instruct patterns let you bootstrap large, diverse training sets from minimal human effort. The critical discipline is verification — never train on unvalidated synthetic data. Executed well, a verified synthetic dataset can turn a 7B parameter model into a powerful specialist that outperforms 70B general-purpose models on your specific task.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.