Stable Video 3D: 3D Reconstruction from Single Images
Dec 30, 2025 • 20 min read
Photogrammetry — capturing a 3D model by photographing an object from 50-200 angles — requires expensive equipment, controlled lighting, and hours of processing. Stable Video 3D (SV3D) replaces the photography step: it takes a single product photo and generates a consistent 360-degree orbital video that can then feed into photogrammetry or Gaussian Splatting pipelines. For e-commerce, game development, and product visualization, this workflow is transformative — turning any 2D photo library into a 3D asset catalog.
1. How SV3D Achieves Multi-View Consistency
Earlier single-image-to-3D models (like Zero123) had a critical flaw: they hallucinated the back of objects without maintaining 3D coherence. If you showed them a shoe's front, they might generate a back that looked like a completely different shoe — or even a different object. SV3D solves this by denoising all frames of the orbit jointly, with temporal attention letting every frame attend to the others, ensuring geometric consistency across the full rotation:
- Joint frame generation: All 21 output frames are denoised together, tied to one another through temporal attention layers
- Camera pose embedding: Explicit camera pose (elevation + azimuth) is injected as conditioning, giving the model precise spatial understanding
- Built on SVD: based on Stable Video Diffusion, inheriting its strong motion prior from large-scale video pretraining
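The camera-pose conditioning can be pictured with a short sketch: each frame's elevation and azimuth are turned into sinusoidal embeddings (the same scheme used for diffusion timesteps) and combined into a per-frame conditioning vector. The embedding dimension and the exact injection point here are illustrative assumptions, not SV3D's actual internals:

```python
import numpy as np

def sinusoidal_embed(angles_rad: np.ndarray, dim: int = 256) -> np.ndarray:
    """Sinusoidal embedding of angles, same scheme as diffusion timestep embeddings."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.outer(angles_rad, freqs)  # (n_frames, half)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

# One pose per output frame: full 360° azimuth sweep at a constant 15° elevation
azimuths = np.deg2rad(np.linspace(0, 360, 21, endpoint=False))
elevations = np.deg2rad(np.full(21, 15.0))

# Hypothetical conditioning vector: sum of the two angle embeddings,
# one row per frame, fed to the UNet alongside the timestep embedding
pose_emb = sinusoidal_embed(azimuths) + sinusoidal_embed(elevations)
print(pose_emb.shape)  # (21, 256)
```

Because each frame carries its own pose embedding, the model knows exactly which viewpoint it is generating rather than inferring it from frame order alone.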
2. SV3D_u vs SV3D_p: Choosing the Right Variant
SV3D_u (unconditioned):
- 21 frames of orbital video
- Fixed canonical arc: slight upward camera path
- No pose input required
- Best for: quick previews, e-commerce thumbnails
- Inference: ~30s on an A100
SV3D_p (pose-conditioned):
- 21 frames along a custom camera path
- Specify exact elevation + azimuth per frame
- Required for 3D reconstruction pipelines
- Best for: photogrammetry, NeRF/Gaussian Splatting
- Same ~30s inference, better downstream mesh quality
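A small helper makes SV3D_p's per-frame pose input easy to build. The {"elevation", "azimuth"} dict format mirrors the pipeline call shown later in this post; the optional sinusoidal elevation "wobble" (a dynamic orbit) is an illustrative extension:

```python
import numpy as np

def orbit_poses(n_frames: int = 21, elevation: float = 15.0,
                wobble: float = 0.0) -> list:
    """Per-frame pose dicts for SV3D_p: a full 360° azimuth sweep at a base
    elevation, with an optional sinusoidal elevation wobble (dynamic orbit)."""
    azimuths = np.linspace(0, 360, n_frames, endpoint=False)
    elevations = elevation + wobble * np.sin(np.deg2rad(azimuths))
    return [{"elevation": float(e), "azimuth": float(a)}
            for e, a in zip(elevations, azimuths)]

poses = orbit_poses(wobble=5.0)  # elevation oscillates between 10° and 20°
print(len(poses), poses[0])
```

Note that the azimuths stop short of 360° so the first and last frames don't duplicate the same viewpoint.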
3. Running SV3D with Diffusers
pip install diffusers transformers accelerate torch
from diffusers import StableVideo3DPipeline
from diffusers.utils import load_image, export_to_video
import torch
import numpy as np
# === SV3D_u: Simple Unconditioned ===
# NOTE: the pipeline class and checkpoint names assume SV3D support in your
# installed diffusers version; check the diffusers docs for the current API.
pipe_u = StableVideo3DPipeline.from_pretrained(
    "stabilityai/sv3d",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe_u.enable_model_cpu_offload()
# Load and resize input image (SV3D expects square 576x576)
from PIL import Image
image = load_image("./product_shoe.jpg").resize((576, 576))
# Generate orbital video (unconditioned)
frames = pipe_u(
    image=image,
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
    num_frames=21,  # 21 frames covering the full 360° orbit
).frames[0]
export_to_video(frames, "shoe_orbit.mp4", fps=7)
print(f"Orbital video: {len(frames)} frames")
# === SV3D_p: Pose-Conditioned ===
pipe_p = StableVideo3DPipeline.from_pretrained(
    "stabilityai/sv3d-p",
    torch_dtype=torch.float16,
)
pipe_p.enable_model_cpu_offload()
# Define a custom camera path: full 360° orbit at 15° elevation
azimuths = np.linspace(0, 360, 21, endpoint=False)  # 21 evenly spaced azimuths; stops short of 360° to avoid duplicating frame 0
elevations = np.full(21, 15.0)  # Constant 15° elevation
frames_posed = pipe_p(
    image=image,
    poses=[{"elevation": e, "azimuth": a} for e, a in zip(elevations, azimuths)],
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
).frames[0]
export_to_video(frames_posed, "shoe_orbit_posed.mp4", fps=7)
# This output is now suitable for feeding into Gaussian Splatting
4. Full 3D Pipeline: Image → Video → Mesh
# Complete pipeline: single image → trained 3DGS → mesh export
import subprocess
import os
import json

def image_to_3d_mesh(input_image_path: str, output_dir: str):
    """Full pipeline from single image to 3D mesh via SV3D + Gaussian Splatting."""
    os.makedirs(output_dir, exist_ok=True)

    # Step 1: Generate orbital video with SV3D_p
    from diffusers import StableVideo3DPipeline
    pipe = StableVideo3DPipeline.from_pretrained("stabilityai/sv3d-p", torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()
    image = Image.open(input_image_path).resize((576, 576))
    azimuths = np.linspace(0, 360, 21, endpoint=False)
    elevations = np.full(21, 10.0)  # Slight upward camera angle
    frames = pipe(
        image=image,
        poses=[{"elevation": e, "azimuth": a} for e, a in zip(elevations, azimuths)],
    ).frames[0]

    # Step 2: Save frames as individual images with metadata
    frames_dir = f"{output_dir}/frames"
    os.makedirs(frames_dir, exist_ok=True)
    poses_meta = []
    for i, (frame, azimuth, elevation) in enumerate(zip(frames, azimuths, elevations)):
        frame.save(f"{frames_dir}/frame_{i:04d}.png")
        # Record the camera pose alongside each frame for Gaussian Splatting
        # (simple JSON layout; adapt to whatever your splatting tool expects)
        poses_meta.append({"frame": f"frame_{i:04d}.png",
                           "azimuth": float(azimuth), "elevation": float(elevation)})
    with open(f"{frames_dir}/poses.json", "w") as f:
        json.dump(poses_meta, f, indent=2)

    # Step 3: Run Gaussian Splatting optimization
    # Uses the known camera poses from Step 2 (no COLMAP needed!)
    subprocess.run([
        "python", "gaussian-splatting/train.py",
        "-s", frames_dir,  # Source: our SV3D frames
        "--model_path", f"{output_dir}/gaussian_model",
        "--iterations", "10000",  # Fewer iterations since we have clean multi-view data
        "--white_background",  # Important for objects isolated from background
    ])

    # Step 4: Convert Gaussian Splats to mesh
    subprocess.run([
        "python", "gaussian-splatting/extract_mesh.py",
        "--model_path", f"{output_dir}/gaussian_model",
        "--output", f"{output_dir}/mesh.obj",
    ])
    print(f"3D mesh saved to {output_dir}/mesh.obj")
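Step 3 works without COLMAP because the camera poses are known up front. If your Gaussian Splatting tool expects NeRF-style input, the azimuth/elevation pairs can be converted to camera-to-world matrices and written as a transforms.json. The orbit radius, field of view, and file layout below are assumptions to adapt to your specific tool:

```python
import json
import numpy as np

def orbit_pose_to_c2w(azimuth_deg: float, elevation_deg: float,
                      radius: float = 2.0) -> np.ndarray:
    """4x4 camera-to-world matrix for a camera on a sphere, looking at the origin."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    cam = radius * np.array([np.cos(el) * np.cos(az),
                             np.cos(el) * np.sin(az),
                             np.sin(el)])
    forward = -cam / np.linalg.norm(cam)              # view direction: toward origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2] = right, up, -forward  # OpenGL-style axes
    c2w[:3, 3] = cam
    return c2w

# NeRF-synthetic-style transforms.json; camera_angle_x (~40° FOV) is an assumed value
frames_meta = [{"file_path": f"frames/frame_{i:04d}.png",
                "transform_matrix": orbit_pose_to_c2w(a, 10.0).tolist()}
               for i, a in enumerate(np.linspace(0, 360, 21, endpoint=False))]
with open("transforms.json", "w") as f:
    json.dump({"camera_angle_x": 0.69, "frames": frames_meta}, f, indent=2)
```

The resulting file can replace the COLMAP step for pipelines that read the Blender/NeRF-synthetic camera format.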
image_to_3d_mesh("product_shoe.jpg", "./output/shoe_3d/")
5. Faster Alternative: TripoSR
# TripoSR (Stability AI + Tripo AI): 3D mesh in ~0.5 seconds
# No video generation step — directly outputs .obj and .glb files
# Lower quality than SV3D+Splatting, but roughly 100x faster
# TripoSR is installed from its GitHub repo rather than PyPI:
#   git clone https://github.com/VAST-AI-Research/TripoSR
#   pip install trimesh rembg && pip install -r TripoSR/requirements.txt
from tsr.system import TSR
from tsr.utils import remove_background, resize_foreground
model = TSR.from_pretrained("stabilityai/TripoSR")
model.renderer.set_chunk_size(8192)
model.to("cuda")
# Load and preprocess
image = Image.open("product.jpg").convert("RGBA")
image = remove_background(image) # Remove background
image = resize_foreground(image, ratio=0.85) # Resize to fill frame
# Generate 3D (runs in ~0.5s on RTX 4090!)
with torch.no_grad():
    scene_codes = model([image], device="cuda")
# Export as .obj and .glb
meshes = model.extract_mesh(scene_codes, resolution=256)
meshes[0].export("output.obj")
meshes[0].export("output.glb") # Web-compatible format
# Compare: SV3D + Gaussian Splatting vs TripoSR
# SV3D pipeline: 5-10 min, high quality, good for hero assets
# TripoSR: 0.5 sec, medium quality, perfect for bulk/preview 3D
Frequently Asked Questions
What types of objects work best with SV3D?
SV3D works best with isolated, solid objects on clean backgrounds: product photography (shoes, bags, electronics), architectural models, collectibles, and sculptures. It struggles with: transparent or highly reflective objects (glass, mirrors), thin structures (wire, hair, fabric), complex articulated objects (humans, animals with limbs), and scenes with depth (landscapes, interiors). Remove the background before input — SV3D performs significantly better on isolated objects than cluttered images.
How does SV3D compare to capturing real photogrammetry?
Real photogrammetry with a camera rig and RealityCapture/Meshroom still produces higher-quality meshes for hero assets — more geometric detail, accurate textures, correct physical scale. SV3D excels for: bulk conversion of existing 2D product catalogs to 3D, rapid prototyping when you don't have the object physically, and generating 3D previews before committing to a physical photo session. The quality gap narrows significantly for objects under 15cm.
Conclusion
SV3D bridges the gap between 2D product photography and 3D asset creation. The pose-conditioned SV3D_p variant generates geometrically consistent multi-view images that Gaussian Splatting can reconstruct into full 3D models — turning a single product photo into a rotatable 3D asset for e-commerce, AR, or game engines. For speed over quality, TripoSR offers sub-second 3D generation from single images. Together, these tools make 3D content creation accessible to any team with a 2D photo library.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.