Multimodal Data: Images + Text at Scale
Dec 30, 2025 • 20 min read
Training a model like Stable Diffusion requires billions of image-caption pairs; LLaVA (a vision-language model) was pretrained on roughly 595K image-text pairs filtered from CC3M. You cannot store a billion individual JPEG files on a normal filesystem: you'll exhaust filesystem inodes long before you run out of disk space. The solution is a specialized dataset format called WebDataset: thousands of image-caption pairs packed into large tar files optimized for sequential streaming to GPU training processes.
1. The Problem with Naive Image Storage
- Inode exhaustion: Linux filesystems allocate a fixed pool of inodes at format time (ext4 defaults to roughly one inode per 16 KB of disk, i.e. tens to hundreds of millions on typical volumes). Storing 1 billion individual files can exceed this limit
- Random I/O bottleneck: Training loops that read individual files perform random disk seeks. Even SSDs that sustain ~100K random IOPS deliver 10-100x higher throughput on sequential reads
- Small file overhead: Each file has filesystem metadata (inode, directory entry). For millions of 100KB files, this overhead becomes significant
- Network transfer inefficiency: Copying individual small files from S3 to GPU nodes is extremely slow due to HTTP request overhead per file
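The constants below are illustrative assumptions, not benchmarks, but a quick back-of-envelope sketch shows the scale of the problem:

```python
# Back-of-envelope: cost of 1 billion individual files vs. tar shards.
# All numbers here are illustrative assumptions, not measurements.

n_files = 1_000_000_000
inode_bytes = 256                  # typical ext4 inode size
metadata_gb = n_files * inode_bytes / 1e9        # ~256 GB of inode metadata alone

seek_ms = 0.1                      # one random read per file on a fast SSD
random_hours = n_files * seek_ms / 1000 / 3600   # ~28 hours spent just seeking

samples_per_shard = 10_000
n_shards = n_files // samples_per_shard          # 100,000 sequential tar files instead

print(f"{metadata_gb:.0f} GB metadata, {random_hours:.1f} h of seeks, {n_shards} shards")
```

Packing samples into shards turns a billion random reads into a hundred thousand long sequential streams, which is exactly what disks and object stores are fast at.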
2. WebDataset Format: Tar Sharding
WebDataset packs thousands of image-caption pairs into sequential tar archives called "shards." Each shard is typically 1-2 GB:
# Shard structure: each sample shares the same basename
shard_00000.tar
├── 000000.jpg    # Image
├── 000000.txt    # Caption: "A golden retriever playing in a park"
├── 000000.json   # Metadata: {"width": 1024, "height": 768, "source_url": "..."}
├── 000001.jpg
├── 000001.txt
├── 000001.json
...
└── 009999.jpg    # 10,000 samples per shard
# Total dataset structure:
dataset/
├── shard_00000.tar   # 1.8 GB
├── shard_00001.tar   # 1.7 GB
...
└── shard_99999.tar   # 100,000 shards × ~10,000 images ≈ 1 billion pairs
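WebDataset ships a `ShardWriter` that produces these archives for you; to make the layout concrete, here is a minimal stdlib-only sketch of the same structure using Python's `tarfile` module (the file name and sample data are made up):

```python
import io
import json
import tarfile

def write_shard(path, samples):
    """Write (key, jpeg_bytes, caption, metadata) samples into one tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, jpeg_bytes, caption, metadata in samples:
            # Files sharing a basename ("000000.jpg", "000000.txt", ...)
            # are grouped back into one sample by WebDataset readers.
            for ext, payload in (
                ("jpg", jpeg_bytes),
                ("txt", caption.encode("utf-8")),
                ("json", json.dumps(metadata).encode("utf-8")),
            ):
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Usage: write one tiny shard with a single fake sample
write_shard("shard_00000.tar", [
    ("000000", b"\xff\xd8fake-jpeg-bytes", "A golden retriever", {"width": 1024}),
])
```

The only convention that matters is the shared basename per sample; everything else is plain POSIX tar.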
# Reading WebDataset in PyTorch
pip install webdataset
import webdataset as wds
from torchvision import transforms
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
])
# Note: webdataset does not open s3:// URLs directly; stream them through a
# shell pipe instead ("pipe:" runs the command and reads its stdout).
dataset = (
    wds.WebDataset("pipe:aws s3 cp s3://my-bucket/dataset/shard_{00000..99999}.tar -",
                   shardshuffle=100)      # Shuffle shard order (buffer of 100 shards)
    .shuffle(1000)                        # Shuffle samples within a 1000-sample buffer
    .decode("pil")                        # Decode JPEGs to PIL Images
    .to_tuple("jpg", "txt")               # Extract image + caption pairs
    .map_tuple(transform, str.lower)      # Apply transforms to each element
    .batched(32)                          # Collate into batches of 32
)
# Wrap in WebLoader (a thin DataLoader wrapper) for multi-worker loading;
# batch_size=None because the pipeline already batches:
loader = wds.WebLoader(dataset, batch_size=None, num_workers=4)

3. Building a WebDataset with img2dataset
img2dataset (an open-source tool from the LAION community) is the standard tool for downloading URLs and converting them into WebDataset shards:
pip install img2dataset
# Input: a CSV with URLs and captions
# url_list.csv:
# url,caption
# https://example.com/image1.jpg,"A cat sitting on a mat"
# https://example.com/image2.jpg,"A dog running on a beach"
# Key flags: webdataset output, 16 download processes with 64 threads each,
# center-crop resize to 512x512, skip images under 200px or with aspect ratio
# above 3, 10,000 samples per shard, extra metadata columns carried through.
# (Inline comments after "\" would break shell line continuation, so they
# live up here instead.)
img2dataset \
  --url_list url_list.csv \
  --input_format=csv \
  --url_col=url \
  --caption_col=caption \
  --output_folder=./dataset \
  --output_format=webdataset \
  --processes_count=16 \
  --thread_count=64 \
  --image_size=512 \
  --resize_mode=center_crop \
  --min_image_size=200 \
  --max_aspect_ratio=3.0 \
  --number_sample_per_shard=10000 \
  --save_additional_columns='[similarity,aesthetic_score]'

4. CLIP Score Filtering
Most scraped web data is low quality: images with captions like "IMG_0045.JPG" or stock photos with generic alt text. CLIP filtering removes samples where the image and caption don't actually correspond:
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image
import webdataset as wds
import io
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def compute_clip_score(image: Image.Image, caption: str) -> float:
"""Compute CLIP cosine similarity between image and text."""
image_input = preprocess(image).unsqueeze(0).to(device)
text_input = clip.tokenize([caption], truncate=True).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
text_features = model.encode_text(text_input)
# Normalize and compute cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).item()
return similarity
CLIP_THRESHOLD = 0.28 # Discard pairs below this similarity score
# LAION-5B used 0.28 for English pairs; higher = more conservative (fewer but better samples)
def filter_by_clip(sample):
"""Filter function for WebDataset pipeline."""
image = Image.open(io.BytesIO(sample["jpg"])).convert("RGB")
caption = sample["txt"].decode("utf-8")
score = compute_clip_score(image, caption)
sample["clip_score"] = score
return score >= CLIP_THRESHOLD # Return True = keep, False = discard
# Apply filtering to dataset
filtered_dataset = (
wds.WebDataset("s3://bucket/raw_shards/shard_{00000..99999}.tar")
.select(filter_by_clip) # Filter step
.to_tuple("jpg", "txt", "clip_score")
)
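The threshold trades dataset size against caption quality. A small sketch (the scores are made up for illustration) of how retention falls as the cutoff rises:

```python
def retention_at(scores, threshold):
    """Fraction of samples kept at a given CLIP-score cutoff."""
    kept = sum(1 for s in scores if s >= threshold)
    return kept / len(scores)

# Hypothetical CLIP scores from a scraped batch
scores = [0.12, 0.19, 0.25, 0.27, 0.29, 0.31, 0.33, 0.36, 0.41, 0.44]

for threshold in (0.24, 0.28, 0.32):
    print(f"threshold {threshold}: keep {retention_at(scores, threshold):.0%}")
```

Sweeping a held-out sample like this before committing to a threshold is cheap insurance against throwing away half your dataset unnecessarily.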
# Typically, 30-50% of scraped data is filtered out by CLIP scoring

5. LAION-5B: The Reference Dataset
LAION-5B is the largest publicly available multimodal dataset (5.85 billion image-text pairs). Understanding its structure helps when building custom datasets:
| Subset | Size | Filter Criteria |
|---|---|---|
| LAION-5B (full) | 5.85B pairs | CLIP score > 0.28 (English) or > 0.26 (other languages), image > 200px, aspect ratio < 3 |
| LAION-2B-en | 2.32B pairs | English captions only, via language detection |
| LAION-Aesthetics V2 4.5+ | 1.2B pairs | Aesthetic score > 4.5 (LAION aesthetic predictor) |
| LAION-Aesthetics V2 5+ | 600M pairs | Aesthetic score > 5.0; the subset used to fine-tune Stable Diffusion |
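The table's size, aspect-ratio, and CLIP-score criteria can be expressed as a single predicate. A sketch, with thresholds taken from the table (the function name is my own):

```python
def passes_laion_filters(width, height, clip_score,
                         min_side=200, max_aspect=3.0, min_clip=0.28):
    """Apply LAION-5B-style size, aspect-ratio, and CLIP-score filters."""
    if min(width, height) < min_side:
        return False
    if max(width, height) / min(width, height) > max_aspect:
        return False
    return clip_score > min_clip

# Examples
print(passes_laion_filters(1024, 768, 0.31))   # True: passes all three checks
print(passes_laion_filters(1024, 150, 0.31))   # False: shorter side under 200px
print(passes_laion_filters(2400, 600, 0.31))   # False: aspect ratio 4.0 > 3.0
print(passes_laion_filters(1024, 768, 0.20))   # False: CLIP score below 0.28
```

A predicate like this slots directly into a WebDataset `.select()` step, the same way `filter_by_clip` does above.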
6. Aesthetic Scoring
# Aesthetic scoring with a CLIP-based classifier from the HuggingFace Hub.
# Note: LAION's own aesthetic predictor is a small linear head on CLIP
# embeddings that outputs a 1-10 score; cafe_aesthetic is a community
# alternative that classifies "aesthetic" vs. "not_aesthetic".
from transformers import pipeline
from PIL import Image
import torch

aesthetic_scorer = pipeline(
    "image-classification",
    model="cafeai/cafe_aesthetic",  # CLIP-based aesthetic quality classifier
    device=0 if torch.cuda.is_available() else -1,
)
def get_aesthetic_score(image: Image.Image) -> float:
results = aesthetic_scorer(image)
# Returns [{"label": "aesthetic", "score": 0.85}, {"label": "not_aesthetic", "score": 0.15}]
return results[0]["score"] if results[0]["label"] == "aesthetic" else results[1]["score"]
# For fine-tuning Stable Diffusion, LAION-Aesthetics V2 5+ (score > 5.0) is a
# common starting point: ~600M higher-quality image-text pairs to train on

Frequently Asked Questions
How much compute do I need to process 1 billion image-text pairs?
Downloading and packing 1M images with img2dataset takes ~30 minutes on 16 CPU cores with a 1Gbps connection. CLIP scoring each image-text pair takes ~0.5ms on an A100, so 1 billion pairs ≈ 500,000 seconds ≈ 6 GPU-days. In practice, teams use spot GPU instances in parallel across 16-32 GPUs, reducing wall time to 12-24 hours for the filtering pass.
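The arithmetic behind that estimate, as a sketch (the 0.5 ms/pair figure and the GPU count are the assumptions stated in this answer):

```python
pairs = 1_000_000_000
sec_per_pair = 0.5e-3                      # assumed ~0.5 ms per pair on one A100

total_gpu_seconds = pairs * sec_per_pair   # 500,000 GPU-seconds
gpu_days = total_gpu_seconds / 86_400      # ~5.8 GPU-days

n_gpus = 32
wall_hours = total_gpu_seconds / n_gpus / 3_600  # ~4.3 h of pure compute;
# real runs land at 12-24 h once download, decode, and I/O overhead are included

print(f"{gpu_days:.1f} GPU-days, {wall_hours:.1f} h across {n_gpus} GPUs")
```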
Can I build a high-quality dataset much smaller than LAION?
Absolutely, and often smaller curated datasets outperform larger scraped ones. LLaVA-1.5 used only 665K instruction-following image samples. For domain-specific models (medical imaging, satellite imagery, product photos), 100K high-quality pairs with domain-specific captions will outperform 10M generic internet pairs. Focus on quality and domain relevance over raw size.
Conclusion
WebDataset format, the img2dataset pipeline, and CLIP score filtering are the three core tools for building production-grade multimodal datasets. The key insight is that quality filtering (CLIP score plus aesthetic scoring) typically discards 30-70% of raw scraped data, but the filtered dataset trains significantly better models. If you're building a domain-specific vision-language model, start with LAION-Aesthetics as a base and supplement with curated domain-specific data rather than starting from raw internet scrapes.
Vivek
AI Engineer. Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.