Data Cleaning: Trashing the Trash
Dec 30, 2025 • 20 min read
Every major LLM breakthrough has been accompanied by gains in data quality, not just parameter count. The Llama 3 technical report credits careful data curation as a key factor in its performance, and Phi-4's "textbook quality data" narrative shows a 14B model matching much larger ones. The pattern is clear: better data beats more data. But raw web crawls contain 20-40% near-duplicate content, SEO spam, low-quality auto-generated text, and accidentally included benchmark test questions. Cleaning this contamination systematically matters as much as architecture choices.
1. Why Exact Deduplication Fails
The naive approach is to compute SHA256 hashes of every document and remove exact duplicates. This catches verbatim copies but misses the much more common near-duplicate problem: the same article with minor edits, syndicated news with slight variations, or the same product description across 50 e-commerce sites. Changing a single character changes the SHA256 entirely. For a web crawl of 10 billion documents, fully 30-40% might be near-duplicates that exact hashing completely misses.
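To see why, a two-line demonstration of the avalanche effect: a single-character edit produces a completely unrelated SHA-256 digest, so hash-based deduplication treats the two documents as distinct (the sample strings are illustrative):

```python
import hashlib

doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "The quick brown fox jumps over the lazy dog!"  # one character changed

hash_a = hashlib.sha256(doc_a.encode()).hexdigest()
hash_b = hashlib.sha256(doc_b.encode()).hexdigest()

# The documents are ~98% identical, but the digests share no structure at all
print(hash_a == hash_b)  # False
```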
2. MinHash LSH: Fuzzy Deduplication at Scale
pip install datasketch

from datasketch import MinHash, MinHashLSH

# MinHash converts a document to a compact "signature" (128 integers)
# Two similar documents → similar signatures → detected as duplicates
# Accuracy: similar docs with >90% word overlap → found with 95%+ recall

def create_minhash(text: str, num_perm: int = 128) -> MinHash:
    """Convert text to a MinHash signature using 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    # 5-character n-grams (robust to whitespace/punctuation edits)
    shingles = set(text[i:i+5] for i in range(len(text) - 4))
    for shingle in shingles:
        m.update(shingle.encode('utf8'))
    return m
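The MinHash idea itself can be sketched without the library: hash every shingle with k different seeds, keep the minimum per seed, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the shingle sets. A minimal from-scratch version (function names and the 64-permutation count are illustrative):

```python
import hashlib

def minhash_signature(shingles: set[str], num_perm: int = 64) -> list[int]:
    """For each of num_perm seeded hash functions, keep the minimum hash value."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"hello", "world", "foo", "bar"}
b = {"hello", "world", "foo", "baz"}
# True Jaccard = |a ∩ b| / |a ∪ b| = 3/5 = 0.6; the estimate should be close
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
print(estimate_jaccard(sig_a, sig_b))
```

More permutations tighten the estimate (standard error shrinks as 1/√k), which is why the library default of 128 trades some speed for accuracy.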
# Build LSH index
lsh = MinHashLSH(
    threshold=0.85,  # Jaccard similarity threshold: docs sharing 85%+ n-grams are duplicates
    num_perm=128,    # Number of hash permutations (higher = better recall, slower)
)
def deduplicate_corpus(documents: list[dict]) -> list[dict]:
    """Remove near-duplicate documents from a corpus."""
    unique_docs = []
    for i, doc in enumerate(documents):
        minhash = create_minhash(doc['text'])
        # Check if a similar document already exists in the index
        similar_docs = lsh.query(minhash)
        if not similar_docs:
            # No duplicates found — keep this document
            lsh.insert(f"doc_{i}", minhash)
            unique_docs.append(doc)
        else:
            # Duplicate found — skip (keep first occurrence)
            print(f"Duplicate of {similar_docs[0]}: {doc['url'][:60]}...")
    return unique_docs
# Parallel processing for large corpora (millions of docs): compute MinHash
# signatures in worker processes, then insert them into the LSH index
# sequentially (the index itself is not process-safe)
import concurrent.futures

def process_batch(batch):
    return [(i, create_minhash(doc['text'])) for i, doc in batch]

# Process billion-scale datasets with streaming:
# RedPajama, Dolma, and FineWeb all use MinHash deduplication
# Typical result: 30-45% reduction in corpus size with the same information density

3. Heuristic Quality Filters
import re
from collections import Counter
# Quality filters used by major datasets (RedPajama, Dolma, FineWeb, ROOTS):
def quality_score(text: str) -> dict[str, bool]:
    """Multi-dimensional quality assessment for a text document."""
    words = text.split()
    word_count = len(words)
    char_count = len(text)
    sentences = text.split('.')
    # 1. Minimum length filter
    length_ok = word_count >= 50 and char_count >= 250
    # 2. Stop word ratio (English natural language has many stop words)
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'is', 'are', 'was'}
    stop_ratio = sum(1 for w in words if w.lower() in stop_words) / max(word_count, 1)
    has_stop_words = stop_ratio > 0.05  # Fewer than 5% stop words = likely not natural language
    # 3. Average sentence length (too short = truncated/garbled; too long = code/CSV)
    avg_sent_len = word_count / max(len(sentences), 1)
    sentence_ok = 5 <= avg_sent_len <= 80
    # 4. Symbol ratio (too many { } # etc. = code/JSON, not natural language)
    symbol_chars = sum(1 for c in text if c in '#{}[]|<>\\')
    symbol_ratio = symbol_chars / max(char_count, 1)
    symbol_ok = symbol_ratio < 0.05
    # 5. Line length ratio (many short lines = menu/list; might be OK for some training tasks)
    lines = [l for l in text.split('\n') if l.strip()]
    avg_line_len = char_count / max(len(lines), 1)
    line_len_ok = avg_line_len >= 40  # Extremely short lines → table of contents, navigation
    # 6. Repetition detection (some low-quality web text repeats phrases)
    word_freq = Counter(words)
    most_common_ratio = word_freq.most_common(1)[0][1] / word_count if words else 0
    no_excessive_repetition = most_common_ratio < 0.1  # Any word >10% of text = spam
    # 7. Digit ratio (math textbooks score high; pure number tables are low quality)
    digit_ratio = sum(1 for c in text if c.isdigit()) / max(char_count, 1)
    digit_ok = digit_ratio < 0.3
    return {
        "length_ok": length_ok,
        "has_stop_words": has_stop_words,
        "sentence_ok": sentence_ok,
        "symbol_ok": symbol_ok,
        "line_len_ok": line_len_ok,
        "no_excessive_repetition": no_excessive_repetition,
        "digit_ok": digit_ok,
        "passes_all": all([length_ok, has_stop_words, sentence_ok, symbol_ok,
                           line_len_ok, no_excessive_repetition, digit_ok]),
    }
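To see one of these heuristics in isolation, here is a standalone check of the stop-word ratio on two made-up snippets, one natural sentence and one run of SEO keyword stuffing (both texts are illustrative):

```python
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'is', 'are', 'was'}

def stop_word_ratio(text: str) -> float:
    """Fraction of tokens that are common English stop words."""
    words = text.lower().split()
    return sum(w in STOP_WORDS for w in words) / max(len(words), 1)

prose = "The cat sat on the mat and the dog was in the garden."
keywords = "cheap flights hotels deals booking discount travel insurance"

print(stop_word_ratio(prose) > 0.05)     # True: natural language
print(stop_word_ratio(keywords) > 0.05)  # False: keyword stuffing, no function words
```

The same structure applies to each heuristic: a cheap scalar statistic plus a threshold, which is what makes these filters fast enough to run before any model-based scoring.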
# Typical filtering rates for Common Crawl:
# Length filter: removes ~20% of documents
# Stop word filter: removes ~5% (detects non-English, code-heavy)
# Sentence length: removes ~8%
# Symbol ratio: removes ~7%
# Total: ~30-35% of raw Common Crawl passes all quality filters

4. Benchmark Decontamination
# CRITICAL: If your training set contains benchmark questions, your model is "cheating"
# Models that memorize MMLU or HumanEval answers appear better than they are
# This is called "test set contamination" and invalidates evaluation
# Standard benchmark datasets to check for contamination:
BENCHMARKS = {
    "mmlu": "cais/mmlu",              # Multi-task language understanding
    "humaneval": "openai/humaneval",  # Python code generation
    "gsm8k": "gsm8k",                 # Grade school math
    "bigbench": "BIG-Bench",          # Diverse reasoning tasks
    "hellaswag": "Rowan/hellaswag",   # Common sense reasoning
}
from datasets import load_dataset
from datasketch import MinHash, MinHashLSH

def build_benchmark_index(benchmarks: list[str]) -> MinHashLSH:
    """Build a MinHash index of all benchmark test questions."""
    lsh = MinHashLSH(threshold=0.6, num_perm=128)  # Lower threshold for decontamination
    for benchmark_name in benchmarks:
        # Note: some benchmarks also require a config name,
        # e.g. load_dataset("cais/mmlu", "all") or load_dataset("gsm8k", "main")
        dataset = load_dataset(benchmark_name, split="test")
        for i, example in enumerate(dataset):
            # Extract the question text (field name varies by benchmark)
            question = example.get("question", example.get("prompt", ""))
            if question:
                minhash = create_minhash(question)
                lsh.insert(f"{benchmark_name}_{i}", minhash)
    return lsh
def decontaminate(training_docs: list[str], benchmark_lsh: MinHashLSH) -> list[str]:
    """Remove training documents that match benchmark test questions."""
    clean_docs = []
    contaminated = 0
    for doc in training_docs:
        minhash = create_minhash(doc)
        matches = benchmark_lsh.query(minhash)
        if matches:
            contaminated += 1
        else:
            clean_docs.append(doc)
    rate = contaminated / max(len(training_docs), 1)
    print(f"Removed {contaminated} contaminated documents ({rate:.1%})")
    return clean_docs

Frequently Asked Questions
What deduplication threshold should I use?
The threshold controls how similar documents must be (by Jaccard similarity) to count as duplicates. 0.9+ catches only near-identical documents with minor edits; 0.7-0.8 catches documents that share most of their content but have paragraphs added or removed. For LLM pre-training, 0.8 is a common default, and FineWeb, Dolma, and RedPajama all deduplicate in this range. Lower thresholds risk aggressive deduplication that removes genuinely distinct documents covering the same topic from different angles, which is valuable training signal.
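To build intuition for where a threshold lands, here is exact Jaccard similarity over 5-gram shingles on two kinds of "duplicate" (the sample sentences are illustrative):

```python
def shingle(text: str, n: int = 5) -> set[str]:
    """Character n-gram shingle set, as used by the MinHash signatures."""
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingle(a), shingle(b)
    return len(sa & sb) / len(sa | sb)

original   = "The committee approved the budget for the new library on Tuesday."
minor_edit = "The committee approved the budget for the new library on Monday."
rewrite    = "On Tuesday the committee voted to fund construction of a library."

print(round(jaccard(original, minor_edit), 2))  # near-identical: scores high
print(round(jaccard(original, rewrite), 2))     # same topic, distinct text: scores low
```

The minor edit lands well above a 0.8 threshold while the rewrite lands well below it, which is exactly the separation a deduplication threshold is meant to exploit.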
How do I handle multilingual corpora?
Apply language detection (using fastText's language classifier, which covers 176 languages at 95%+ accuracy) before deduplication. Then deduplicate within each language separately — cross-language deduplication via character n-grams is unreliable. Apply quality filters tuned for each language's stop word patterns. For multilingual training, use language-balanced sampling rather than raw frequency to prevent English dominance.
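The language-balanced sampling mentioned above is often done with temperature-style reweighting: sample each language in proportion to its document count raised to a power below 1, which up-weights low-resource languages relative to raw frequency. A minimal sketch (the function name, the alpha value of 0.3, and the counts are illustrative assumptions):

```python
def balanced_weights(doc_counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Sampling weights proportional to count**alpha, normalized to sum to 1.
    alpha=1 reproduces raw frequency; alpha<1 flattens the distribution."""
    raw = {lang: n ** alpha for lang, n in doc_counts.items()}
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}

counts = {"en": 1_000_000, "de": 100_000, "sw": 1_000}
weights = balanced_weights(counts)
# English drops from ~91% of raw documents to a much smaller sampling share,
# while Swahili rises far above its raw ~0.09% frequency
print(weights)
```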
Conclusion
Data cleaning for LLM training isn't a one-time step — it's a pipeline with deduplication, quality filtering, benchmark decontamination, and content safety filtering applied in sequence. MinHash LSH provides scalable fuzzy deduplication processing millions of documents per hour. Heuristic filters (stop word ratio, sentence length, symbol ratio) capture the bulk of low-quality content cheaply before expensive LLM-based quality scoring. Always run benchmark decontamination to ensure your evaluation metrics reflect genuine generalization rather than memorization.
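The stages described above compose naturally into a filter chain where each stage only sees the survivors of the previous one. A minimal skeleton, with trivial stand-in stages in place of the MinHash, quality, and decontamination implementations from this article:

```python
from typing import Callable

Stage = Callable[[list[str]], list[str]]

def run_pipeline(docs: list[str], stages: list[tuple[str, Stage]]) -> list[str]:
    """Apply cleaning stages in order, logging how much each one removes."""
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        print(f"{name}: {before} -> {len(docs)} documents")
    return docs

# Stand-in stages: exact dedup and a minimum-length quality filter
stages = [
    ("deduplicate", lambda docs: list(dict.fromkeys(docs))),
    ("quality_filter", lambda docs: [d for d in docs if len(d.split()) >= 5]),
]

docs = [
    "a long enough example document for the pipeline demo",
    "a long enough example document for the pipeline demo",  # exact duplicate
    "too short",
]
clean = run_pipeline(docs, stages)
print(len(clean))  # 1
```

Ordering matters in practice: deduplication first shrinks the corpus before the (cheaper per-document but more numerous) quality checks, and decontamination runs last so it sees only documents that would otherwise reach training.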
Vivek, AI Engineer