opncrafter

Multimodal Data: Images + Text at Scale

Dec 30, 2025 • 20 min read

Training a model like Stable Diffusion requires billions of image-caption pairs. LLaVA (a vision-language model) was pretrained on roughly 595,000 image-text pairs filtered from CC3M. You cannot store a billion images as individual JPEG files on a normal filesystem — you'll exhaust filesystem inodes before you run out of disk space. The solution is a specialized dataset format called WebDataset: packing thousands of image-caption pairs into large tar files optimized for sequential streaming to GPU training processes.

1. The Problem with Naive Image Storage

  • Inode exhaustion: Linux filesystems typically support 64M-256M inodes. Storing 1 billion samples as individual files (image + caption + metadata each) exceeds this limit
  • Random I/O bottleneck: Training loops that read individual files perform random disk seeks — SSDs handle 100K IOPS, but sequential reads are 10-100x faster
  • Small file overhead: Each file has filesystem metadata (inode, directory entry). For millions of 100KB files, this overhead becomes significant
  • Network transfer inefficiency: Copying individual small files from S3 to GPU nodes is extremely slow due to HTTP request overhead per file
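The arithmetic behind these limits is easy to check. A back-of-the-envelope sketch, using the 256M inode ceiling from above and the 10,000-samples-per-shard figure used later in this post:

```python
# Why 1B individual samples breaks a filesystem, and how sharding fixes it.

num_pairs = 1_000_000_000          # image-caption pairs
files_per_pair = 3                 # .jpg + .txt + .json
max_inodes = 256_000_000           # optimistic inode ceiling for one filesystem

total_files = num_pairs * files_per_pair
print(f"Files needed: {total_files:,}")                    # 3,000,000,000
print(f"Over inode budget by: {total_files / max_inodes:.0f}x")  # ~12x

# Packing 10,000 samples into each tar shard instead:
samples_per_shard = 10_000
num_shards = num_pairs // samples_per_shard
print(f"Shards needed: {num_shards:,}")                    # 100,000 -- trivially within limits
```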

2. WebDataset Format: Tar Sharding

WebDataset packs thousands of image-caption pairs into sequential tar archives called "shards." Each shard is typically 1-2 GB:

# Shard structure — each sample shares the same basename
shard_00000.tar
├── 000000.jpg        # Image
├── 000000.txt        # Caption: "A golden retriever playing in a park"
├── 000000.json       # Metadata: {"width": 1024, "height": 768, "source_url": "..."}
├── 000001.jpg
├── 000001.txt
├── 000001.json
...
└── 009999.jpg        # 10,000 samples per shard

# Total dataset structure:
dataset/
├── shard_00000.tar   # 1.8 GB
├── shard_00001.tar   # 1.7 GB
...
└── shard_99999.tar   # 100,000 shards × ~10,000 samples = 1 billion pairs
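Since shards are ordinary tar archives, you can build a toy one with nothing but the Python standard library. A minimal sketch (the placeholder bytes stand in for real JPEG data) showing the shared-basename convention that groups a sample's files together:

```python
import io
import json
import os
import tarfile
import tempfile

def add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory byte string to the tar under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

samples = [
    (b"<jpeg bytes>", "A golden retriever playing in a park"),
    (b"<jpeg bytes>", "A cat sitting on a mat"),
]

path = os.path.join(tempfile.mkdtemp(), "shard_00000.tar")
with tarfile.open(path, "w") as tar:
    for i, (jpeg_bytes, caption) in enumerate(samples):
        base = f"{i:06d}"  # the shared basename groups files into one sample
        add_bytes(tar, base + ".jpg", jpeg_bytes)
        add_bytes(tar, base + ".txt", caption.encode("utf-8"))
        add_bytes(tar, base + ".json", json.dumps({"caption": caption}).encode("utf-8"))

with tarfile.open(path) as tar:
    print(tar.getnames())
    # ['000000.jpg', '000000.txt', '000000.json', '000001.jpg', ...]
```

WebDataset groups consecutive tar entries by basename, so a sample's files must be adjacent in the archive, as they are here.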

# Reading WebDataset in PyTorch
pip install webdataset

import webdataset as wds
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Shards on S3 are streamed through the AWS CLI via a "pipe:" URL;
# webdataset runs the command per shard and reads its stdout.
url = "pipe:aws s3 cp s3://my-bucket/dataset/shard_{00000..99999}.tar -"

dataset = (
    wds.WebDataset(url, shardshuffle=True)  # Shuffle shard order for training
    .shuffle(1000)                     # Shuffle samples within an in-memory buffer
    .decode("pil")                     # Decode JPEGs to PIL Images
    .to_tuple("jpg", "txt")            # Extract image + caption pairs
    .map_tuple(transform, str.lower)   # Transform image; lowercase caption
    .batched(32)                       # Batch size
)

3. Building a WebDataset with img2dataset

img2dataset (an open-source tool from the LAION community) is the standard tool for downloading image URLs and converting them into WebDataset shards:

pip install img2dataset

# Input: a CSV with URLs and captions
# url_list.csv:
# url,caption
# https://example.com/image1.jpg,"A cat sitting on a mat"
# https://example.com/image2.jpg,"A dog running on a beach"

# Note: in bash, a comment cannot follow a backslash line continuation,
# so the flag explanations live here instead:
#   --output_format=webdataset   -> outputs tar shards
#   --processes_count=16         -> parallel download workers
#   --thread_count=64            -> threads per worker
#   --image_size=512             -> resize all images to 512x512
#   --min_image_size=200         -> skip images smaller than 200px
#   --max_aspect_ratio=3.0       -> skip very panoramic images
#   --save_additional_columns    -> keep extra metadata columns
img2dataset \
    --url_list url_list.csv \
    --input_format=csv \
    --url_col=url \
    --caption_col=caption \
    --output_folder=./dataset \
    --output_format=webdataset \
    --processes_count=16 \
    --thread_count=64 \
    --image_size=512 \
    --resize_mode=center_crop \
    --min_image_size=200 \
    --max_aspect_ratio=3.0 \
    --number_sample_per_shard=10000 \
    --save_additional_columns='["similarity","aesthetic_score"]'
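Before kicking off a multi-hour download, it helps to sanity-check the input CSV. A small hypothetical helper (not part of img2dataset) that verifies the columns the command above expects:

```python
import csv
import os
import tempfile

def validate_url_list(path: str, required_cols=("url", "caption")) -> int:
    """Check the CSV header and URLs; return the number of data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = [c for c in required_cols if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        n = 0
        for row in reader:
            if not row["url"].startswith(("http://", "https://")):
                raise ValueError(f"row {n}: bad URL {row['url']!r}")
            n += 1
    return n

# Demo on the two example rows from above:
csv_path = os.path.join(tempfile.mkdtemp(), "url_list.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write('url,caption\n'
            'https://example.com/image1.jpg,"A cat sitting on a mat"\n'
            'https://example.com/image2.jpg,"A dog running on a beach"\n')

print(validate_url_list(csv_path))  # 2
```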

4. CLIP Score Filtering

Most scraped web data is low quality — images with captions like "IMG_0045.JPG" or stock photos with generic alt text. CLIP filtering removes samples where the image and caption don't actually correspond:

# The OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
import io

import clip
import torch
import webdataset as wds
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def compute_clip_score(image: Image.Image, caption: str) -> float:
    """Compute CLIP cosine similarity between image and text."""
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_input = clip.tokenize([caption], truncate=True).to(device)
    
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)
        
        # Normalize and compute cosine similarity
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        similarity = (image_features @ text_features.T).item()
    
    return similarity

CLIP_THRESHOLD = 0.28  # Discard pairs below this similarity score
# LAION-5B used 0.28 for English pairs (0.26 for multilingual) — higher = more conservative (fewer but better samples)

def filter_by_clip(sample):
    """Filter function for WebDataset pipeline."""
    image = Image.open(io.BytesIO(sample["jpg"])).convert("RGB")
    caption = sample["txt"].decode("utf-8")
    
    score = compute_clip_score(image, caption)
    sample["clip_score"] = score
    
    return score >= CLIP_THRESHOLD  # Return True = keep, False = discard

# Apply filtering to dataset
filtered_dataset = (
    wds.WebDataset("s3://bucket/raw_shards/shard_{00000..99999}.tar")
    .select(filter_by_clip)  # Filter step
    .to_tuple("jpg", "txt", "clip_score")
)

# Typically: 30-50% of scraped data is filtered out by CLIP scoring
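Scoring pairs one at a time underutilizes the GPU; production filtering is batched. A sketch of the same cosine-similarity threshold applied to precomputed embedding batches, with random vectors standing in for real CLIP features:

```python
import numpy as np

def filter_batch(image_feats: np.ndarray, text_feats: np.ndarray,
                 threshold: float = 0.28) -> np.ndarray:
    """Return a boolean keep-mask from row-wise cosine similarity."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)   # row-wise dot of unit vectors = cosine sim
    return sims >= threshold

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 512))
print(filter_batch(feats, feats))    # identical features: cosine sim 1.0, all kept
```

The same normalize-then-dot pattern runs on the GPU with torch tensors; batching thousands of pairs per forward pass is what makes the billion-scale filtering pass in the FAQ below feasible.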

5. LAION-5B: The Reference Dataset

LAION-5B is the largest publicly available multimodal dataset (5.85 billion image-text pairs). Understanding its structure helps when building custom datasets:

Subset                   Size          Filter Criteria
LAION-5B (full)          5.85B pairs   CLIP score > 0.28, image > 200px, aspect ratio < 3
LAION-2B-en              2.32B pairs   English captions only via language detection
LAION-Aesthetics v2      600M pairs    Aesthetic score > 4.5 (LAION aesthetic predictor)
LAION-Aesthetics v2.5+   120M pairs    Aesthetic score > 5.0 — highest quality subset

6. Aesthetic Scoring

# Aesthetic scoring with a CLIP-based classifier from the Hugging Face Hub.
# Note: the original LAION aesthetic predictor is a linear head on CLIP
# ViT-L/14 embeddings that outputs a 1-10 score; the model below instead
# returns the probability (0-1) that an image is "aesthetic".
from PIL import Image
from transformers import pipeline
import torch

aesthetic_scorer = pipeline(
    "image-classification",
    model="cafeai/cafe_aesthetic",  # CLIP-based aesthetic quality classifier
    device=0 if torch.cuda.is_available() else -1,
)

def get_aesthetic_score(image: Image.Image) -> float:
    """Return the probability that the image is labeled 'aesthetic'."""
    results = aesthetic_scorer(image)
    # e.g. [{"label": "aesthetic", "score": 0.85}, {"label": "not_aesthetic", "score": 0.15}]
    for result in results:
        if result["label"] == "aesthetic":
            return result["score"]
    return 0.0

# For fine-tuning Stable Diffusion: use LAION-Aesthetics v2.5+ (score > 5.0)
# This gives you 120M high-quality image-text pairs to train on
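In practice the CLIP and aesthetic filters are combined into a single keep/drop decision. A minimal sketch; the 0.5 aesthetic cutoff is an assumed probability threshold for a classifier-style scorer, not a LAION value:

```python
CLIP_MIN = 0.28       # image-text alignment threshold (LAION-5B)
AESTHETIC_MIN = 0.5   # assumed probability cutoff for a classifier-style scorer

def keep_sample(sample: dict) -> bool:
    """Keep a sample only if it passes both quality filters."""
    return (sample["clip_score"] >= CLIP_MIN
            and sample["aesthetic_score"] >= AESTHETIC_MIN)

samples = [
    {"clip_score": 0.31, "aesthetic_score": 0.82},  # aligned and pretty: keep
    {"clip_score": 0.31, "aesthetic_score": 0.20},  # aligned but low quality: drop
    {"clip_score": 0.12, "aesthetic_score": 0.90},  # pretty but mismatched caption: drop
]
kept = [s for s in samples if keep_sample(s)]
print(len(kept))  # 1
```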

Frequently Asked Questions

How much compute do I need to process 1 billion image-text pairs?

Downloading and packing 1M images with img2dataset takes ~30 minutes on 16 CPU cores with a 1 Gbps connection. CLIP scoring each image-text pair takes ~0.5 ms on an A100 — so 1 billion pairs ≈ 500,000 seconds ≈ 6 GPU-days. In practice, teams run the filtering pass in parallel across 16-32 spot GPU instances, reducing wall time to 12-24 hours.
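The estimate above is easy to verify. A quick sketch of the arithmetic, assuming ideal linear scaling across GPUs:

```python
pairs = 1_000_000_000
seconds_per_pair = 0.5e-3        # ~0.5 ms per CLIP forward pass per pair (A100)

total_seconds = pairs * seconds_per_pair
gpu_days = total_seconds / 86_400
print(f"{total_seconds:,.0f} s on one GPU = {gpu_days:.1f} GPU-days")  # 500,000 s = 5.8 GPU-days

# Ideal scaling across a small cluster (real runs add I/O and preemption overhead):
gpus = 16
wall_hours = total_seconds / gpus / 3600
print(f"{wall_hours:.1f} wall-clock hours on {gpus} GPUs")  # 8.7 hours
```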

Can I build a high-quality dataset much smaller than LAION?

Absolutely — and smaller curated datasets often outperform larger scraped ones. LLaVA-1.5 used only 665K instruction-following image samples. For domain-specific models (medical imaging, satellite imagery, product photos), 100K high-quality pairs with domain-specific captions will outperform 10M generic internet pairs. Focus on quality and domain relevance over raw size.

Conclusion

WebDataset format, img2dataset pipeline, and CLIP score filtering are the three core tools for building production-grade multimodal datasets. The key insight is that quality filtering (CLIP score + aesthetic scoring) typically discards 50-70% of raw scraped data — but the filtered dataset trains significantly better models. If you're building a domain-specific vision-language model, start with LAION-Aesthetics as a base and supplement with curated domain-specific data rather than starting from raw internet scrapes.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
