DVC: Git LFS on Steroids
Dec 30, 2025 • 20 min read
You version your code with Git. Your datasets live in a shared Dropbox folder with filenames like train_data_final_v3_FIXED_ACTUALLY_FINAL.csv. Every time someone runs training, they're not sure which version of the data produced the current production model. DVC (Data Version Control) solves this by treating datasets and models as first-class versioned artifacts — using Git for metadata pointers while storing actual data in S3, GCS, or Azure Blob Storage.
1. How DVC Works
DVC doesn't put your 10GB dataset into Git. Instead, it creates a small .dvc metadata file (a pointer, much like a Git LFS pointer file) that Git tracks, while the actual data lives in a remote storage backend:
# data/train.csv.dvc — what Git actually sees:
outs:
- md5: a1b2c3d4e5f6...   # Hash of the file content
  size: 10737418240      # 10 GB
  path: train.csv

# The actual train.csv is in:
# - Your local cache: .dvc/cache/a1/b2c3d4e5f6...
# - Remote storage: s3://my-bucket/dvcstore/a1/b2c3d4e5f6...
#
# When a colleague does 'dvc pull', DVC downloads from S3 and
# places the file at data/train.csv — the exact same version, every time

2. Installation and Setup
pip install "dvc[s3]" # For the S3 storage backend (quoted so the shell doesn't expand the brackets)
# Also available: dvc[gs] for GCS, dvc[azure] for Azure, dvc[ssh] for SSH
# Initialize DVC in an existing Git repo
cd my-ml-project
dvc init
git commit -m "Initialize DVC"
# Configure remote storage (S3)
dvc remote add -d myremote s3://my-ml-datasets/dvc-cache
# -d = set as default remote
# Or for local team sharing via SSH:
dvc remote add -d myremote ssh://user@shared-server/data/dvc-cache
# Verify configuration
cat .dvc/config
# [core]
#     remote = myremote
# ['remote "myremote"']
#     url = s3://my-ml-datasets/dvc-cache

3. Versioning Datasets
# Track a large dataset with DVC
dvc add data/train.csv
# Creates: data/train.csv.dvc
# Adds: data/train.csv to .gitignore (so Git ignores the actual file)
# Commit the pointer to Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset v1"
# Push data to S3 (one-time, or when data changes)
dvc push
# ✓ Transferred: data/train.csv → s3://my-ml-datasets/dvc-cache/a1/b2...
# --- Three months later, colleague clones the repo ---
git clone https://github.com/org/my-ml-project
cd my-ml-project
dvc pull
# ✓ Downloaded: data/train.csv (10GB) — exact same version used in original training
# When you update your dataset:
# (Modify data/train.csv)
dvc add data/train.csv # Recomputes hash, updates .dvc file
git commit -a -m "Dataset v2 — added 50k new samples"
dvc push # Uploads new version to S3
# Roll back to dataset v1:
git checkout HEAD~1 -- data/train.csv.dvc
dvc checkout # Restores the v1 data from the local cache or S3

4. Reproducible ML Pipelines (dvc.yaml)
DVC's pipeline feature defines ML stages as a DAG, so on each run it re-executes only the stages whose inputs have changed.
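The staleness rule behind this is simple: when a stage runs, DVC records a hash of every dependency (in dvc.lock), and on the next run a stage is re-executed only if some recorded hash no longer matches. A toy sketch of that check (hypothetical helper names, not DVC internals):

```python
import hashlib

def fingerprint(files: dict[str, bytes]) -> dict[str, str]:
    """Hash each dependency's content, as recorded after a run."""
    return {name: hashlib.md5(data).hexdigest() for name, data in files.items()}

def needs_rerun(recorded: dict[str, str], current: dict[str, bytes]) -> bool:
    """A stage is stale if any dep is new, missing, or changed."""
    return fingerprint(current) != recorded

# After the first run, hashes of raw.csv and preprocess.py are recorded.
lock = fingerprint({"data/raw.csv": b"v1", "preprocess.py": b"code"})
# Nothing changed: dvc repro would skip the stage.
assert not needs_rerun(lock, {"data/raw.csv": b"v1", "preprocess.py": b"code"})
# Editing the script invalidates the stage (and everything downstream).
assert needs_rerun(lock, {"data/raw.csv": b"v1", "preprocess.py": b"code v2"})
```

The dvc.yaml below wires real commands and files into this dependency graph.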
# dvc.yaml — define your ML pipeline
stages:
  preprocess:
    cmd: python preprocess.py --input data/raw.csv --output data/clean.csv
    deps:
      - data/raw.csv     # If this changes, re-run preprocess
      - preprocess.py    # If this changes, re-run preprocess
    outs:
      - data/clean.csv   # This stage produces clean.csv
  train:
    cmd: python train.py --data data/clean.csv --output model/
    deps:
      - data/clean.csv   # Depends on preprocess output
      - train.py
      - params.yaml      # Hyperparameters file
    outs:
      - model/best.pkl
  evaluate:
    cmd: python evaluate.py --model model/best.pkl --output metrics/eval.json
    deps:
      - model/best.pkl
      - evaluate.py
    metrics:
      - metrics/eval.json:  # Track metrics as a DVC artifact
          cache: false      # Keep the small metrics file in Git, not the DVC cache

# Run the full pipeline (only re-runs changed stages)
dvc repro
# DVC checks: has data/raw.csv changed? No → skip preprocess
# DVC checks: has train.py changed? Yes → re-run train + evaluate

# Compare metrics across runs
dvc params diff HEAD~1   # Show hyperparameter changes
dvc metrics diff HEAD~1  # Show metric changes (accuracy, F1, etc.)

5. Experiment Tracking with DVC
# params.yaml — your hyperparameters file (tracked by DVC)
model:
  learning_rate: 0.001
  batch_size: 32
  epochs: 100
  architecture: resnet50
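An experiment run with --set-param is, at its core, an override of one dotted key in this file before the pipeline executes. A sketch of that dotted-path update on a plain dict standing in for the parsed YAML (illustrative, not DVC's code):

```python
def set_param(params: dict, dotted_key: str, value) -> dict:
    """Apply a --set-param style override, e.g. the key
    'model.learning_rate' walks params['model']['learning_rate']."""
    node = params
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create missing levels
    node[leaf] = value
    return params

params = {"model": {"learning_rate": 0.001, "batch_size": 32}}
set_param(params, "model.learning_rate", 0.0001)
assert params["model"]["learning_rate"] == 0.0001
assert params["model"]["batch_size"] == 32  # other keys untouched
```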
# Run an experiment with different hyperparameters
dvc exp run --set-param model.learning_rate=0.0001
# List all experiments
dvc exp show
# ┌──────────────┬──────────────┬───────────────┬──────────┬───────┐
# │ Experiment │ Created │ learning_rate │ accuracy │ f1 │
# ├──────────────┼──────────────┼───────────────┼──────────┼───────┤
# │ workspace │ 2024-01-15 │ 0.001 │ 0.923 │ 0.918 │
# │ exp-a1b2c3 │ 2024-01-14 │ 0.0001 │ 0.915 │ 0.920 │
# │ exp-d4e5f6 │ 2024-01-13 │ 0.01 │ 0.891 │ 0.887 │
# └──────────────┴──────────────┴───────────────┴──────────┴───────┘
# Promote the best experiment to a Git branch
dvc exp branch exp-a1b2c3 feature/better-lr
git checkout feature/better-lr
git push

6. DVC vs Git LFS vs MLflow — When to Use Each
| Tool | Best For | Limitations |
|---|---|---|
| DVC | Data + pipeline versioning, experiment tracking, CI/CD for ML | CLI-first (no built-in UI; DVC Studio is a separate service), you must provision your own S3/GCS remote |
| Git LFS | Simple binary file versioning in existing Git workflows | Storage limits (GitHub charges for LFS bandwidth), no pipeline support |
| MLflow | Experiment metrics, model registry, artifact logging during training | Not designed for dataset versioning, no pipeline DAG support |
| HuggingFace Hub | Public dataset sharing, model hosting, community datasets | Less suited for internal or private dataset workflows |
Frequently Asked Questions
Can DVC work without a remote storage backend?
Yes — DVC works locally with just a local cache (.dvc/cache/). But without a remote, you can't share data with teammates or access it from other machines. For team use, even a simple NFS/SSH shared drive works as a DVC remote. S3 is recommended for production teams because it's durable, scalable, and cheap ($0.023/GB/month).
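That local cache is content-addressed: DVC hashes each file with md5 and stores it under the first two hex characters of the digest (the .dvc/cache/a1/b2c3... paths shown earlier). A minimal Python sketch of that layout (illustrative helpers, not DVC's actual implementation):

```python
import hashlib
from pathlib import Path

def cache_path(cache_dir: str, content: bytes) -> Path:
    """DVC-style content addressing: md5 the bytes, then shard
    the cache by the first two hex characters of the digest."""
    digest = hashlib.md5(content).hexdigest()
    return Path(cache_dir) / digest[:2] / digest[2:]

def store(cache_dir: str, content: bytes) -> Path:
    """Write a blob into the cache. Identical content always lands
    at the same path, so re-adding an unchanged file costs nothing."""
    path = cache_path(cache_dir, content)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path

# Same bytes, same address: this is why the cache deduplicates.
p1 = store("/tmp/toy-dvc-cache", b"label,value\ncat,0.1\n")
p2 = store("/tmp/toy-dvc-cache", b"label,value\ncat,0.1\n")
assert p1 == p2
```

Pushing to a remote is then just copying these content-addressed objects to the same layout in S3 or on the shared drive.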
How does DVC handle very large model checkpoints (100GB+)?
DVC handles arbitrarily large files, storing each one as a single content-addressed object; for S3, the underlying client uses multipart transfers for big uploads. For files too large to stage locally, dvc add --to-remote bypasses the local cache and uploads straight to remote storage. Transfers can be parallelized with the --jobs flag, e.g. dvc push --jobs 8.
Conclusion
DVC brings the same reproducibility guarantees to ML data that Git brings to code. When you can answer "which exact dataset version and hyperparameters produced the production model?" with a single git log command, debugging regressions and auditing model behavior becomes dramatically easier. The pipeline feature (dvc.yaml) is particularly valuable for teams: it documents the entire transformation chain from raw data to trained model, and automatically detects when a stage needs to be re-run based on dependency changes.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.