
Data Cleaning: Trashing the Trash

Dec 30, 2025 • 20 min read

Every major LLM breakthrough has been accompanied by improvements in data quality, not just more parameters. The Llama 3 technical report credits careful data curation as a key factor in its performance, and Phi-4's "textbook quality data" approach shows a 14B model matching much larger ones. The pattern is clear: better data beats more data. But raw web crawls contain 20-40% near-duplicate content, plus SEO spam, low-quality auto-generated text, and accidentally included benchmark test questions. Cleaning this contamination systematically matters as much as architecture choices.

1. Why Exact Deduplication Fails

The naive approach is to compute SHA256 hashes of every document and remove exact duplicates. This catches verbatim copies but misses the much more common near-duplicate problem: the same article with minor edits, syndicated news with slight variations, or the same product description across 50 e-commerce sites. Changing a single character changes the SHA256 entirely. For a web crawl of 10 billion documents, fully 30-40% might be near-duplicates that exact hashing completely misses.
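A quick sketch of the failure mode, using only the standard library. A single changed character produces a completely different SHA256 digest, while character-shingle overlap shows the two texts are almost identical:

```python
import hashlib

original = "The quick brown fox jumps over the lazy dog."
edited   = "The quick brown fox jumps over the lazy dog!"  # one character changed

h1 = hashlib.sha256(original.encode()).hexdigest()
h2 = hashlib.sha256(edited.encode()).hexdigest()
print(h1 == h2)  # False: exact hashing sees two unrelated documents

# Yet by 5-gram shingle overlap the two texts are nearly identical:
def shingles(text: str, n: int = 5) -> set[str]:
    return {text[i:i+n] for i in range(len(text) - n + 1)}

s1, s2 = shingles(original), shingles(edited)
jaccard = len(s1 & s2) / len(s1 | s2)
print(f"Jaccard similarity: {jaccard:.2f}")  # well above 0.9
```

This is exactly the gap MinHash LSH (next section) closes: it indexes documents by shingle similarity rather than by exact content.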

2. MinHash LSH: Fuzzy Deduplication at Scale

pip install datasketch

from datasketch import MinHash, MinHashLSH

# MinHash converts a document to a compact "signature" (128 integers)
# Two similar documents → similar signatures → detected as duplicates
# Accuracy: similar docs with >90% word overlap → found with 95%+ recall

def create_minhash(text: str, num_perm: int = 128) -> MinHash:
    """Convert text to MinHash signature using 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    # 5-character n-grams (robust to whitespace/punctuation edits)
    shingles = set(text[i:i+5] for i in range(len(text) - 4))
    for shingle in shingles:
        m.update(shingle.encode('utf8'))
    return m

# Build LSH index
lsh = MinHashLSH(
    threshold=0.85,     # Jaccard similarity threshold: docs sharing 85%+ n-grams are duplicates
    num_perm=128,        # Number of hash permutations (higher = better recall, slower)
)
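The `num_perm` trade-off can be quantified. Each signature position agrees between two documents with probability equal to their Jaccard similarity J, so the estimate is a mean of `num_perm` Bernoulli trials with standard error sqrt(J(1-J)/num_perm), at most 0.5/sqrt(num_perm):

```python
import math

# Worst-case (J = 0.5) standard error of the MinHash Jaccard estimate
for num_perm in (64, 128, 256, 512):
    se = 0.5 / math.sqrt(num_perm)
    print(f"num_perm={num_perm}: worst-case standard error ≈ {se:.3f}")
```

At 128 permutations the estimate is accurate to within a few percentage points, which is why 128 is the common default; quadrupling `num_perm` only halves the error.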

def deduplicate_corpus(documents: list[dict]) -> list[dict]:
    """Remove near-duplicate documents from a corpus."""
    unique_docs = []
    
    for i, doc in enumerate(documents):
        minhash = create_minhash(doc['text'])
        
        # Check if similar document already exists in index
        similar_docs = lsh.query(minhash)
        
        if not similar_docs:
            # No duplicates found — keep this document
            lsh.insert(f"doc_{i}", minhash)
            unique_docs.append(doc)
        else:
            # Duplicate found — skip (keep first occurrence)
            print(f"Duplicate of {similar_docs[0]}: {doc.get('url', '')[:60]}...")
    
    return unique_docs

# Parallel MinHash computation for large corpora (millions of docs)
import concurrent.futures

def process_batch(batch: list[tuple[int, dict]]) -> list[tuple[int, MinHash]]:
    """Compute MinHash signatures for a batch of (index, document) pairs."""
    return [(i, create_minhash(doc['text'])) for i, doc in batch]

# with concurrent.futures.ProcessPoolExecutor() as pool:
#     signatures = pool.map(process_batch, batches)  # batches: chunks of enumerate(documents)
# Only the single-threaded LSH insert/query step remains serial.

# Process billion-scale datasets with streaming:
# RedPajama, Dolma, and FineWeb all use MinHash deduplication
# Typical result: 30-45% reduction in corpus size while preserving nearly all unique content
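To make the mechanics concrete without the library, here is a toy, stdlib-only MinHash (this is not the `datasketch` API; salting a hash function stands in for true random permutations). The fraction of signature positions on which two documents agree estimates their Jaccard similarity:

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    # For each "permutation", keep the minimum hash value over all shingles;
    # a per-seed salt simulates num_perm independent hash functions.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig1: list[int], sig2: list[int]) -> float:
    # Fraction of matching positions ≈ true Jaccard similarity
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

a = "MinHash turns each document into a short numeric signature."
b = "MinHash turns every document into a short numeric signature."
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
true = len(shingles(a) & shingles(b)) / len(shingles(a) | shingles(b))
print(f"estimated {est:.2f} vs true {true:.2f}")
```

Two 128-integer signatures stand in for arbitrarily long documents, which is what makes billion-scale deduplication tractable: LSH then buckets signatures so only likely matches are ever compared.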

3. Heuristic Quality Filters

import re
from collections import Counter

# Quality filters used by major datasets (RedPajama, Dolma, FineWeb, ROOTS):

def quality_score(text: str) -> dict[str, bool]:
    """Multi-dimensional quality assessment for a text document."""
    words = text.split()
    word_count = len(words)
    char_count = len(text)
    sentences = text.split('.')
    
    # 1. Minimum length filter
    length_ok = word_count >= 50 and char_count >= 250
    
    # 2. Stop word ratio (English natural language has many stop words)
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'is', 'are', 'was'}
    stop_ratio = sum(1 for w in words if w.lower() in stop_words) / max(word_count, 1)
    has_stop_words = stop_ratio > 0.05  # Less than 5% stop words = likely not natural language
    
    # 3. Average sentence length (too short = truncated/garbled; too long = code/CSV)
    avg_sent_len = word_count / max(len(sentences), 1)
    sentence_ok = 5 <= avg_sent_len <= 80
    
    # 4. Symbol ratio (too many { } # = is code/JSON not natural language)
    symbol_chars = sum(1 for c in text if c in '#{}[]|<>\\')
    symbol_ratio = symbol_chars / max(char_count, 1)
    symbol_ok = symbol_ratio < 0.05
    
    # 5. Line length ratio (many short lines = menu/list; might be OK for some training tasks)
    lines = [l for l in text.split('\n') if l.strip()]
    avg_line_len = char_count / max(len(lines), 1)
    line_len_ok = avg_line_len >= 40  # Extremely short lines → table of contents, navigation
    
    # 6. Repetition detection (some low-quality web text repeats phrases)
    word_freq = Counter(words)
    most_common_ratio = word_freq.most_common(1)[0][1] / word_count if words else 0
    no_excessive_repetition = most_common_ratio < 0.1  # Any word >10% of text = spam
    
    # 7. Digit ratio (math textbooks high; pure number tables low quality)
    digit_ratio = sum(1 for c in text if c.isdigit()) / max(char_count, 1)
    digit_ok = digit_ratio < 0.3
    
    return {
        "length_ok": length_ok,
        "has_stop_words": has_stop_words,
        "sentence_ok": sentence_ok,
        "symbol_ok": symbol_ok,
        "line_len_ok": line_len_ok,
        "no_excessive_repetition": no_excessive_repetition,
        "digit_ok": digit_ok,
        "passes_all": all([length_ok, has_stop_words, sentence_ok, symbol_ok, 
                           line_len_ok, no_excessive_repetition, digit_ok]),
    }

# Typical filtering rates for Common Crawl:
# Length filter: removes ~20% of documents
# Stop word filter: removes ~5% (detects non-English, code-heavy)
# Sentence length: removes ~8%
# Symbol ratio: removes ~7%
# Total: ~30-35% of raw Common Crawl passes all quality filters
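Two of the filters above, run in isolation on an illustrative good/bad pair (the sample strings are mine, not from any dataset), show how cheaply spam gets caught:

```python
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'is', 'are', 'was'}

def stop_word_ratio(text: str) -> float:
    words = text.lower().split()
    return sum(w in STOP_WORDS for w in words) / max(len(words), 1)

def top_word_ratio(text: str) -> float:
    # Frequency of the single most common word, for the repetition filter
    words = text.lower().split()
    return Counter(words).most_common(1)[0][1] / max(len(words), 1)

prose = "The model was trained on a large corpus and evaluated on a held-out set."
spam  = "buy cheap watches buy cheap watches buy cheap watches buy cheap watches"

print(stop_word_ratio(prose))  # comfortably above the 0.05 floor
print(stop_word_ratio(spam))   # 0.0 -- fails the stop-word filter
print(top_word_ratio(spam))    # ~0.33 -- fails the <0.1 repetition filter
```

Because these are pure string operations, they run orders of magnitude faster than model-based quality scoring and are applied first in every major pipeline.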

4. Benchmark Decontamination

# CRITICAL: If your training set contains benchmark questions, your model is "cheating"
# Models that memorize MMLU or HumanEval answers appear better than they are
# This is called "test set contamination" and invalidates evaluation

# Standard benchmark datasets to check for contamination
# (Hugging Face Hub IDs shown where applicable; verify exact names/configs before use):
BENCHMARKS = {
    "mmlu": "cais/mmlu",              # Multi-task language understanding
    "humaneval": "openai/humaneval",   # Python code generation
    "gsm8k": "gsm8k",                 # Grade school math
    "bigbench": "BIG-Bench",          # Diverse reasoning tasks
    "hellaswag": "Rowan/hellaswag",   # Common sense reasoning
}

from datasets import load_dataset
from datasketch import MinHash, MinHashLSH

def build_benchmark_index(benchmarks: list[str]) -> MinHashLSH:
    """Build a MinHash index of all benchmark test questions."""
    lsh = MinHashLSH(threshold=0.6, num_perm=128)  # Lower threshold for decontam
    
    for benchmark_name in benchmarks:
        # Some benchmarks also require a config name, e.g. load_dataset("cais/mmlu", "all")
        dataset = load_dataset(benchmark_name, split="test")
        
        for i, example in enumerate(dataset):
            # Extract the question text (field name varies by benchmark)
            question = example.get("question", example.get("prompt", ""))
            if question:
                minhash = create_minhash(question)
                lsh.insert(f"{benchmark_name}_{i}", minhash)
    
    return lsh

def decontaminate(training_docs: list[str], benchmark_lsh: MinHashLSH) -> list[str]:
    """Remove training documents that match benchmark test questions."""
    clean_docs = []
    contaminated = 0
    
    for doc in training_docs:
        minhash = create_minhash(doc)
        matches = benchmark_lsh.query(minhash)
        
        if matches:
            contaminated += 1
        else:
            clean_docs.append(doc)
    
    print(f"Removed {contaminated} contaminated documents ({contaminated/len(training_docs):.1%})")
    return clean_docs

Frequently Asked Questions

What deduplication threshold should I use?

The threshold controls how similar documents must be to be considered duplicates (Jaccard similarity). 0.9+ catches only near-identical documents with minor edits. 0.7-0.8 catches documents sharing most content but with paragraphs added/removed. For LLM pre-training, 0.8 is the standard (used by FineWeb, Dolma, RedPajama). Lower thresholds risk aggressive deduplication that removes genuinely distinct documents covering the same topic from different angles, which is actually valuable training signal.

How do I handle multilingual corpora?

Apply language detection (using fastText's language classifier, which covers 176 languages at 95%+ accuracy) before deduplication. Then deduplicate within each language separately — cross-language deduplication via character n-grams is unreliable. Apply quality filters tuned for each language's stop word patterns. For multilingual training, use language-balanced sampling rather than raw frequency to prevent English dominance.
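Language-balanced sampling is commonly done with temperature scaling (the scheme popularized by multilingual models such as XLM-R): raise each language's raw frequency to a power alpha < 1 and renormalize, which upsamples low-resource languages. A minimal sketch with made-up document counts:

```python
def language_sampling_weights(doc_counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Temperature-based sampling: p_lang ∝ (raw frequency) ** alpha.

    alpha=1.0 reproduces raw frequencies; alpha→0 approaches uniform sampling.
    """
    total = sum(doc_counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in doc_counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# Illustrative counts, not real corpus statistics
counts = {"en": 9_000_000, "de": 800_000, "sw": 50_000}
weights = language_sampling_weights(counts)
for lang, w in weights.items():
    print(f"{lang}: raw {counts[lang] / sum(counts.values()):.1%} -> sampled {w:.1%}")
```

With alpha = 0.3, Swahili's share rises from well under 1% to double digits while English shrinks, preventing the high-resource language from dominating every batch.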

Conclusion

Data cleaning for LLM training isn't a one-time step — it's a pipeline with deduplication, quality filtering, benchmark decontamination, and content safety filtering applied in sequence. MinHash LSH provides scalable fuzzy deduplication processing millions of documents per hour. Heuristic filters (stop word ratio, sentence length, symbol ratio) capture the bulk of low-quality content cheaply before expensive LLM-based quality scoring. Always run benchmark decontamination to ensure your evaluation metrics reflect genuine generalization rather than memorization.


Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

Tags: GPT-4o, LangChain, Next.js, Vector DBs, RAG, Vercel AI SDK