DVC: Git LFS on Steroids
Dec 30, 2025 • 20 min read
You version your code with Git. Your datasets live in a shared Dropbox folder with filenames like train_data_final_v3_FIXED_ACTUALLY_FINAL.csv. Every time someone runs training, they're not sure which version of the data produced the current production model. DVC (Data Version Control) solves this by treating datasets and models as first-class versioned artifacts — using Git for metadata pointers while storing actual data in S3, GCS, or Azure Blob Storage.
1. How DVC Works
DVC doesn't put your 10GB dataset into Git. Instead, it creates a small .dvc metadata file (a pointer, much like a Git LFS pointer file) that Git tracks, while the actual data lives in a remote storage backend:
# data/train.csv.dvc — what Git actually sees:
outs:
- md5: a1b2c3d4e5f6...   # Hash of the file content
  size: 10737418240      # 10 GB
  path: train.csv

# The actual train.csv is in:
# - Your local cache: .dvc/cache/a1/b2c3d4e5f6...
# - Remote storage: s3://my-bucket/dvcstore/a1/b2c3d4e5f6...
#
# When a colleague does 'dvc pull', DVC downloads from S3 and
# places the file at data/train.csv — the exact same version, every time

2. Installation and Setup
pip install "dvc[s3]" # For the S3 storage backend (quoted so the shell doesn't expand the brackets)
# Also available: dvc[gs] for GCS, dvc[azure] for Azure, dvc[ssh] for SSH
# Initialize DVC in an existing Git repo
cd my-ml-project
dvc init
git commit -m "Initialize DVC"
# Configure remote storage (S3)
dvc remote add -d myremote s3://my-ml-datasets/dvc-cache
# -d = set as default remote
# Or for local team sharing via SSH:
dvc remote add -d myremote ssh://user@shared-server/data/dvc-cache
# Verify configuration
cat .dvc/config
# [core]
#     remote = myremote
# ['remote "myremote"']
#     url = s3://my-ml-datasets/dvc-cache

3. Versioning Datasets
# Track a large dataset with DVC
dvc add data/train.csv
# Creates: data/train.csv.dvc
# Adds: data/train.csv to .gitignore (so Git ignores the actual file)
# Commit the pointer to Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset v1"
# Push data to S3 (one-time, or when data changes)
dvc push
# ✓ Transferred: data/train.csv → s3://my-ml-datasets/dvc-cache/a1/b2...
# --- Three months later, colleague clones the repo ---
git clone https://github.com/org/my-ml-project
cd my-ml-project
dvc pull
# ✓ Downloaded: data/train.csv (10GB) — exact same version used in original training
# When you update your dataset:
# (Modify data/train.csv)
dvc add data/train.csv # Recomputes hash, updates .dvc file
git commit -a -m "Dataset v2 — added 50k new samples"
dvc push # Uploads new version to S3
# Roll back to dataset v1:
git checkout HEAD~1 -- data/train.csv.dvc
dvc checkout # Restores the v1 data from the local cache or S3

4. Reproducible ML Pipelines (dvc.yaml)
DVC's pipeline feature defines ML stages as a DAG, so on each run it re-executes only the stages whose inputs have changed.
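The staleness rule behind this is simple: when a stage runs, DVC records a hash of every dependency (in dvc.lock), and on the next run a stage is re-executed only if some recorded hash no longer matches. A toy sketch of that check (hypothetical helper names, not DVC internals):

```python
import hashlib

def fingerprint(files: dict[str, bytes]) -> dict[str, str]:
    """Hash each dependency's content, as recorded after a run."""
    return {name: hashlib.md5(data).hexdigest() for name, data in files.items()}

def needs_rerun(recorded: dict[str, str], current: dict[str, bytes]) -> bool:
    """A stage is stale if any dep is new, missing, or changed."""
    return fingerprint(current) != recorded

# After the first run, hashes of raw.csv and preprocess.py are recorded.
lock = fingerprint({"data/raw.csv": b"v1", "preprocess.py": b"code"})
# Nothing changed: dvc repro would skip the stage.
assert not needs_rerun(lock, {"data/raw.csv": b"v1", "preprocess.py": b"code"})
# Editing the script invalidates the stage (and everything downstream).
assert needs_rerun(lock, {"data/raw.csv": b"v1", "preprocess.py": b"code v2"})
```

The dvc.yaml below wires real commands and files into this dependency graph.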
# dvc.yaml — define your ML pipeline
stages:
  preprocess:
    cmd: python preprocess.py --input data/raw.csv --output data/clean.csv
    deps:
      - data/raw.csv     # If this changes, re-run preprocess
      - preprocess.py    # If this changes, re-run preprocess
    outs:
      - data/clean.csv   # This stage produces clean.csv
  train:
    cmd: python train.py --data data/clean.csv --output model/
    deps:
      - data/clean.csv   # Depends on preprocess output
      - train.py
      - params.yaml      # Hyperparameters file
    outs:
      - model/best.pkl
  evaluate:
    cmd: python evaluate.py --model model/best.pkl --output metrics/eval.json
    deps:
      - model/best.pkl
      - evaluate.py
    metrics:
      - metrics/eval.json:  # Track metrics as a DVC artifact
          cache: false      # Keep the small metrics file in Git, not the DVC cache

# Run the full pipeline (only re-runs changed stages)
dvc repro
# DVC checks: has data/raw.csv changed? No → skip preprocess
# DVC checks: has train.py changed? Yes → re-run train + evaluate

# Compare metrics across runs
dvc params diff HEAD~1   # Show hyperparameter changes
dvc metrics diff HEAD~1  # Show metric changes (accuracy, F1, etc.)

5. Experiment Tracking with DVC
# params.yaml — your hyperparameters file (tracked by DVC)
model:
  learning_rate: 0.001
  batch_size: 32
  epochs: 100
  architecture: resnet50
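An experiment run with --set-param is, at its core, an override of one dotted key in this file before the pipeline executes. A sketch of that dotted-path update on a plain dict standing in for the parsed YAML (illustrative, not DVC's code):

```python
def set_param(params: dict, dotted_key: str, value) -> dict:
    """Apply a --set-param style override, e.g. the key
    'model.learning_rate' walks params['model']['learning_rate']."""
    node = params
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create missing levels
    node[leaf] = value
    return params

params = {"model": {"learning_rate": 0.001, "batch_size": 32}}
set_param(params, "model.learning_rate", 0.0001)
assert params["model"]["learning_rate"] == 0.0001
assert params["model"]["batch_size"] == 32  # other keys untouched
```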
# Run an experiment with different hyperparameters
dvc exp run --set-param model.learning_rate=0.0001
# List all experiments
dvc exp show
# ┌──────────────┬──────────────┬───────────────┬──────────┬───────┐
# │ Experiment │ Created │ learning_rate │ accuracy │ f1 │
# ├──────────────┼──────────────┼───────────────┼──────────┼───────┤
# │ workspace │ 2024-01-15 │ 0.001 │ 0.923 │ 0.918 │
# │ exp-a1b2c3 │ 2024-01-14 │ 0.0001 │ 0.915 │ 0.920 │
# │ exp-d4e5f6 │ 2024-01-13 │ 0.01 │ 0.891 │ 0.887 │
# └──────────────┴──────────────┴───────────────┴──────────┴───────┘
# Promote the best experiment to a Git branch
dvc exp branch exp-a1b2c3 feature/better-lr
git checkout feature/better-lr
git push

6. DVC vs Git LFS vs MLflow — When to Use Each
| Tool | Best For | Limitations |
|---|---|---|
| DVC | Data + pipeline versioning, experiment tracking, CI/CD for ML | CLI-first (no built-in UI; DVC Studio is a separate service), you must provision your own S3/GCS remote |
| Git LFS | Simple binary file versioning in existing Git workflows | Storage limits (GitHub charges for LFS bandwidth), no pipeline support |
| MLflow | Experiment metrics, model registry, artifact logging during training | Not designed for dataset versioning, no pipeline DAG support |
| HuggingFace Hub | Public dataset sharing, model hosting, community datasets | Less suited for internal or private dataset workflows |
Frequently Asked Questions
Can DVC work without a remote storage backend?
Yes — DVC works locally with just a local cache (.dvc/cache/). But without a remote, you can't share data with teammates or access it from other machines. For team use, even a simple NFS/SSH shared drive works as a DVC remote. S3 is recommended for production teams because it's durable, scalable, and cheap ($0.023/GB/month).
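That local cache is content-addressed: DVC hashes each file with md5 and stores it under the first two hex characters of the digest (the .dvc/cache/a1/b2c3... paths shown earlier). A minimal Python sketch of that layout (illustrative helpers, not DVC's actual implementation):

```python
import hashlib
from pathlib import Path

def cache_path(cache_dir: str, content: bytes) -> Path:
    """DVC-style content addressing: md5 the bytes, then shard
    the cache by the first two hex characters of the digest."""
    digest = hashlib.md5(content).hexdigest()
    return Path(cache_dir) / digest[:2] / digest[2:]

def store(cache_dir: str, content: bytes) -> Path:
    """Write a blob into the cache. Identical content always lands
    at the same path, so re-adding an unchanged file costs nothing."""
    path = cache_path(cache_dir, content)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path

# Same bytes, same address: this is why the cache deduplicates.
p1 = store("/tmp/toy-dvc-cache", b"label,value\ncat,0.1\n")
p2 = store("/tmp/toy-dvc-cache", b"label,value\ncat,0.1\n")
assert p1 == p2
```

Pushing to a remote is then just copying these content-addressed objects to the same layout in S3 or on the shared drive.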
How does DVC handle very large model checkpoints (100GB+)?
DVC handles arbitrarily large files, storing each one as a single content-addressed object; for S3, the underlying client uses multipart transfers for big uploads. For files too large to stage locally, dvc add --to-remote bypasses the local cache and uploads straight to remote storage. Transfers can be parallelized with the --jobs flag, e.g. dvc push --jobs 8.
Conclusion
DVC brings the same reproducibility guarantees to ML data that Git brings to code. When you can answer "which exact dataset version and hyperparameters produced the production model?" with a single git log command, debugging regressions and auditing model behavior becomes dramatically easier. The pipeline feature (dvc.yaml) is particularly valuable for teams: it documents the entire transformation chain from raw data to trained model, and automatically detects when a stage needs to be re-run based on dependency changes.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.