
Stable Video 3D: 3D Reconstruction from Single Images

Dec 30, 2025 • 20 min read

Photogrammetry — capturing a 3D model by photographing an object from 50-200 angles — requires expensive equipment, controlled lighting, and hours of processing. Stable Video 3D (SV3D) replaces the photography step: it takes a single product photo and generates a consistent 360-degree orbital video that can then feed into photogrammetry or Gaussian Splatting pipelines. For e-commerce, game development, and product visualization, this workflow is transformative — turning any 2D photo library into a 3D asset catalog.

1. How SV3D Achieves Multi-View Consistency

Earlier single-image-to-3D models (like Zero123) had a critical flaw: they hallucinated the back of objects without maintaining 3D coherence. If you showed them a shoe's front, they might generate a back that looked like a completely different shoe — or even a different object. SV3D solves this by generating all frames of the orbit jointly, with temporal attention letting each frame condition on the others, ensuring geometric consistency across the full rotation:

  • Frame-to-frame conditioning: All 21 output frames attend to one another via temporal attention, so views are denoised jointly rather than independently
  • Camera pose embedding: Explicit camera pose (elevation + azimuth) is injected as conditioning, giving the model precise spatial understanding
  • Built on SVD: Initialized from Stable Video Diffusion, inheriting the strong motion prior it learned from large-scale video pretraining
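The pose conditioning in the second bullet can be illustrated with a toy sinusoidal encoding. This is a hypothetical sketch in the spirit of diffusion-model timestep embeddings, not SV3D's actual conditioning layers; `embed_pose` and its dimensions are made up for illustration:

```python
import numpy as np

def embed_pose(elevation_deg: float, azimuth_deg: float, dim: int = 8) -> np.ndarray:
    """Toy sinusoidal embedding of a camera pose (elevation, azimuth).

    Each angle is expanded into sin/cos pairs at geometrically spaced
    frequencies — the same trick used for timestep embeddings in
    diffusion models — giving the network a smooth, unambiguous signal.
    """
    angles = np.radians([elevation_deg, azimuth_deg])
    freqs = 2.0 ** np.arange(dim // 4)         # [1, 2] for dim=8
    scaled = angles[:, None] * freqs[None, :]  # shape (2, dim//4)
    emb = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return emb.reshape(-1)                     # flat vector of length dim

# Nearby camera poses map to nearby embeddings; distant poses do not
front = embed_pose(15.0, 0.0)
side = embed_pose(15.0, 90.0)
```

In the real model the pose signal is injected as conditioning alongside the image; the sketch only shows why an explicit, smooth encoding gives the model precise spatial understanding.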

2. SV3D_u vs SV3D_p: Choosing the Right Variant

SV3D_u (Unconditioned)
  • 21 frames of orbital video
  • Fixed canonical arc: slight upward camera path
  • No pose input required
  • Best for: quick preview, e-commerce thumbnails
  • Faster inference (~30s on A100)
SV3D_p (Pose Conditioned)
  • 21 frames with custom camera path
  • Specify exact elevation + azimuth per frame
  • Required for 3D reconstruction pipelines
  • Best for: photogrammetry, NeRF/Gaussian Splatting
  • Same inference time, better mesh quality
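To make building custom camera paths for SV3D_p less error-prone, here is a small helper that emits a full-360° pose list. The `{"elevation", "azimuth"}` dict format matches the pose-conditioned call shown later in this post, and is an assumption about your pipeline's expected input:

```python
import numpy as np

def make_orbit_poses(num_frames: int = 21, elevation: float = 15.0,
                     elevation_amplitude: float = 0.0) -> list:
    """Build a full 360-degree orbit as a list of pose dicts.

    With elevation_amplitude > 0 the camera bobs sinusoidally up and
    down (a "dynamic orbit"), which often gives reconstruction
    pipelines better vertical coverage of the object.
    """
    azimuths = np.linspace(0, 360, num_frames, endpoint=False)
    elevations = elevation + elevation_amplitude * np.sin(np.radians(azimuths))
    return [{"elevation": float(e), "azimuth": float(a)}
            for e, a in zip(elevations, azimuths)]

poses = make_orbit_poses(21, elevation=15.0)  # static orbit at 15 degrees
```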

3. Running SV3D with Diffusers

pip install diffusers transformers accelerate torch

from diffusers import StableVideo3DPipeline
from diffusers.utils import load_image, export_to_video
import torch
import numpy as np

# === SV3D_u: Simple Unconditioned ===
pipe_u = StableVideo3DPipeline.from_pretrained(
    "stabilityai/sv3d",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe_u.enable_model_cpu_offload()

# Load and resize input image (SV3D expects square 576x576)
from PIL import Image
image = load_image("./product_shoe.jpg").resize((576, 576))

# Generate orbital video (unconditioned) 
frames = pipe_u(
    image=image,
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
    num_frames=21,             # 21 frames covering full 360 orbit
).frames[0]

export_to_video(frames, "shoe_orbit.mp4", fps=7)
print(f"Orbital video: {len(frames)} frames")

# === SV3D_p: Pose-Conditioned ===
pipe_p = StableVideo3DPipeline.from_pretrained(
    "stabilityai/sv3d-p",
    torch_dtype=torch.float16,
)
pipe_p.enable_model_cpu_offload()

# Define custom camera path for 360 orbit at 15° elevation
azimuths = np.linspace(0, 360, 21, endpoint=False)  # 0 to 360° in 21 steps
elevations = np.full(21, 15.0)                       # Constant 15° elevation

frames_posed = pipe_p(
    image=image,
    poses=[{"elevation": e, "azimuth": a} for e, a in zip(elevations, azimuths)],
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
).frames[0]

export_to_video(frames_posed, "shoe_orbit_posed.mp4", fps=7)
# This output is now suitable for feeding into Gaussian Splatting

4. Full 3D Pipeline: Image → Video → Mesh

# Complete pipeline: single image → trained 3DGS → mesh export
import subprocess
import os

def image_to_3d_mesh(input_image_path: str, output_dir: str):
    """Full pipeline from single image to 3D mesh via SV3D + Gaussian Splatting."""
    
    os.makedirs(output_dir, exist_ok=True)
    
    # Step 1: Generate orbital video with SV3D_p
    from diffusers import StableVideo3DPipeline
    pipe = StableVideo3DPipeline.from_pretrained("stabilityai/sv3d-p", torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()
    
    image = Image.open(input_image_path).resize((576, 576))
    azimuths = np.linspace(0, 360, 21, endpoint=False)
    elevations = np.full(21, 10.0)  # Slight upward camera angle
    
    frames = pipe(
        image=image,
        poses=[{"elevation": e, "azimuth": a} for e, a in zip(elevations, azimuths)],
    ).frames[0]
    
    # Step 2: Save frames as individual images with metadata
    frames_dir = f"{output_dir}/frames"
    os.makedirs(frames_dir, exist_ok=True)
    import json
    for i, (frame, azimuth, elevation) in enumerate(zip(frames, azimuths, elevations)):
        frame.save(f"{frames_dir}/frame_{i:04d}.png")
        # Save the camera pose alongside each frame for Gaussian Splatting
        with open(f"{frames_dir}/frame_{i:04d}.json", "w") as f:
            json.dump({"elevation": float(elevation), "azimuth": float(azimuth)}, f)
    
    # Step 3: Run Gaussian Splatting optimization using the known camera
    # poses from Step 2, so no COLMAP pass is needed. Note the poses must
    # first be converted to whatever format your trainer expects
    # (e.g., a NeRF-style transforms.json)
    subprocess.run([
        "python", "gaussian-splatting/train.py",
        "-s", frames_dir,         # Source: our SV3D frames
        "--model_path", f"{output_dir}/gaussian_model",
        "--iterations", "10000",  # Fewer iterations since we have clean multi-view data
        "--white_background",     # Important for objects isolated from background
    ])
    
    # Step 4: Convert Gaussian splats to a mesh. Vanilla 3DGS has no
    # built-in mesh exporter, so extract_mesh.py below is a placeholder
    # for your extraction tool of choice (e.g., SuGaR)
    subprocess.run([
        "python", "gaussian-splatting/extract_mesh.py",
        "--model_path", f"{output_dir}/gaussian_model",
        "--output", f"{output_dir}/mesh.obj",
    ])
    
    print(f"3D mesh saved to {output_dir}/mesh.obj")

image_to_3d_mesh("product_shoe.jpg", "./output/shoe_3d/")
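Converting an entire 2D catalog is then just a loop over this function. A sketch, assuming a flat directory of product photos; `plan_catalog_jobs` is my own helper, not part of any library:

```python
from pathlib import Path

def plan_catalog_jobs(catalog_dir: str, output_root: str,
                      extensions=(".jpg", ".jpeg", ".png")) -> list:
    """Pair every catalog image with the output directory it should get.

    Returns (input_image, output_dir) tuples; feed each pair to the
    image_to_3d_mesh() pipeline defined above.
    """
    jobs = []
    for path in sorted(Path(catalog_dir).iterdir()):
        if path.suffix.lower() in extensions:
            jobs.append((str(path), str(Path(output_root) / path.stem)))
    return jobs

# for image_path, out_dir in plan_catalog_jobs("./catalog", "./output"):
#     image_to_3d_mesh(image_path, out_dir)
```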

5. Faster Alternative: TripoSR

# TripoSR (Stability AI + Tripo AI): 3D mesh in 0.5 seconds!
# No video generation step — directly outputs .obj and .glb files
# Lower quality than SV3D+Splatting but 100x faster

# TripoSR itself is installed from source (it is not on PyPI):
#   git clone https://github.com/VAST-AI-Research/TripoSR
#   cd TripoSR && pip install -r requirements.txt
pip install trimesh rembg

import torch
from PIL import Image
from tsr.system import TSR
from tsr.utils import remove_background, resize_foreground

model = TSR.from_pretrained(
    "stabilityai/TripoSR",
    config_name="config.yaml",
    weight_name="model.ckpt",
)
model.renderer.set_chunk_size(8192)
model.to("cuda")

# Load and preprocess
image = Image.open("product.jpg").convert("RGBA")
image = remove_background(image)           # Remove background
image = resize_foreground(image, ratio=0.85)  # Resize to fill frame

# Generate 3D (runs in ~0.5s on RTX 4090!)
with torch.no_grad():
    scene_codes = model([image], device="cuda")

# Export as .obj and .glb
meshes = model.extract_mesh(scene_codes, resolution=256)
meshes[0].export("output.obj")
meshes[0].export("output.glb")  # Web-compatible format

# Compare: SV3D + Gaussian Splatting vs TripoSR
# SV3D pipeline: 5-10 min, high quality, good for hero assets
# TripoSR:       0.5 sec,  medium quality, perfect for bulk/preview 3D
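That trade-off can be captured in a trivial dispatcher; the tier names here are my own labels, not an official taxonomy:

```python
def choose_3d_pipeline(asset_tier: str) -> str:
    """Route an asset to a 3D pipeline by how much quality it needs.

    Hero assets justify the 5-10 minute SV3D + Gaussian Splatting
    pipeline; bulk and preview assets get sub-second TripoSR.
    """
    if asset_tier == "hero":
        return "sv3d+gaussian_splatting"
    if asset_tier in ("bulk", "preview"):
        return "triposr"
    raise ValueError(f"unknown asset tier: {asset_tier!r}")
```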

Frequently Asked Questions

What types of objects work best with SV3D?

SV3D works best with isolated, solid objects on clean backgrounds: product photography (shoes, bags, electronics), architectural models, collectibles, and sculptures. It struggles with: transparent or highly reflective objects (glass, mirrors), thin structures (wire, hair, fabric), complex articulated objects (humans, animals with limbs), and scenes with depth (landscapes, interiors). Remove the background before input — SV3D performs significantly better on isolated objects than cluttered images.
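A minimal, PIL-only way to follow that advice: assume the background has already been stripped (e.g., with rembg) so the input is an RGBA cutout, then center it on a white 576x576 canvas. `prepare_sv3d_input` is my own helper, a sketch rather than an official preprocessing step:

```python
from PIL import Image

def prepare_sv3d_input(cutout: Image.Image, size: int = 576,
                       fill_ratio: float = 0.85) -> Image.Image:
    """Center an RGBA cutout on a white square canvas sized for SV3D.

    The object is scaled so its longer side fills fill_ratio of the
    canvas, then composited over white, since SV3D expects an opaque
    square input with the object isolated.
    """
    cutout = cutout.convert("RGBA")
    bbox = cutout.getbbox() or (0, 0, cutout.width, cutout.height)
    obj = cutout.crop(bbox)
    scale = (size * fill_ratio) / max(obj.width, obj.height)
    obj = obj.resize((max(1, int(obj.width * scale)),
                      max(1, int(obj.height * scale))))
    canvas = Image.new("RGBA", (size, size), (255, 255, 255, 255))
    canvas.paste(obj, ((size - obj.width) // 2, (size - obj.height) // 2), obj)
    return canvas.convert("RGB")
```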

How does SV3D compare to capturing real photogrammetry?

Real photogrammetry with a camera rig and RealityCapture/Meshroom still produces higher-quality meshes for hero assets — more geometric detail, accurate textures, correct physical scale. SV3D excels for: bulk conversion of existing 2D product catalogs to 3D, rapid prototyping when you don't have the object physically, and generating 3D previews before committing to a physical photo session. The quality gap narrows significantly for objects under 15cm.

Conclusion

SV3D bridges the gap between 2D product photography and 3D asset creation. The pose-conditioned SV3D_p variant generates geometrically consistent multi-view images that Gaussian Splatting can reconstruct into full 3D models — turning a single product photo into a rotatable 3D asset for e-commerce, AR, or game engines. For speed over quality, TripoSR offers sub-second 3D generation from single images. Together, these tools make 3D content creation accessible to any team with a 2D photo library.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
