opncrafter

ControlNet: Taking the Wheel

Dec 30, 2025 • 20 min read

Pure text-to-image generation gives you no spatial control. Asking Stable Diffusion to "make the person stand on the left with their arm raised" produces images that contain the right elements but with entirely random composition. Professional artists and product teams need precise layout control — specific object positions, character poses, room layouts, and atmospheric moods — without retraining the model from scratch. ControlNet adds a parallel "steering" network trained to accept structural signals: Canny edges, depth maps, human skeleton poses, normal maps, and more.

1. The Zero Convolution Architecture

ControlNet's elegant design creates a trainable copy of Stable Diffusion's encoder blocks while keeping the original model frozen. The critical innovation is the "zero convolution" layers connecting the ControlNet copy to the main model:

  • Frozen base model: The original SD weights remain unchanged — ControlNet doesn't degrade the base model's knowledge
  • Trainable encoder copy: A full copy of the SD UNet encoder is trained alongside the control signal processing
  • Zero convolutions: 1×1 convolutions initialized to zero weight and bias. At training step 0, ControlNet contributes zero influence. As training progresses, the network learns how much and where to steer diffusion
  • Residual connection: ControlNet's outputs are added to the corresponding UNet decoder layers — it's additive, not replacing

This design means that ControlNet can be attached to any Stable Diffusion 1.5 checkpoint without compatibility issues, and that you can train a new ControlNet on new structural signals without modifying the base model.
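The zero-convolution idea fits in a few lines of PyTorch. This is an illustrative sketch, not the actual ControlNet source — the 320-channel feature map is a made-up stand-in for one SD encoder output:

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weights AND bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# At training step 0 the zero conv outputs all zeros, so the residual
# add leaves the frozen UNet's features exactly untouched:
features = torch.randn(1, 320, 64, 64)   # hypothetical encoder feature map
control = zero_conv(320)(features)
combined = features + control            # residual connection, not replacement
assert torch.equal(combined, features)   # zero influence before training
```

As gradients flow, the zero convs drift away from zero and the network learns how much, and at which layers, to steer the frozen UNet.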

2. Control Processors: What Can You Control?

| Processor | Input | Extracts | Best For |
| --- | --- | --- | --- |
| Canny Edge | Any image | Hard edge contours | Line art, logos, architecture, product outlines |
| Soft Edge (HED) | Any image | Soft/blurry edges | Natural objects, portraits (less rigid than Canny) |
| Depth Map | Any image | 3D depth via MiDaS/ZoeDepth | Interior design, scene layout preservation |
| OpenPose | Human photo/video | 18 skeleton keypoints | Character pose control, pose transfer |
| Normal Map | 3D render | Surface normals | Consistent lighting/materials on 3D surfaces |
| Scribble | Hand-drawn sketch | Rough outlines | Concept art from rough sketch input |
| IP-Adapter | Reference image | Style/subject features | Style transfer, consistent character appearance |
| Tile | Any image | Tile structure | Image upscaling with detail enhancement |

3. Python Implementation with Diffusers

pip install diffusers controlnet-aux torch Pillow

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from controlnet_aux import CannyDetector, MidasDetector, OpenposeDetector
from PIL import Image
import torch

# ========== CANNY EDGE EXAMPLE ==========

# 1. Load ControlNet (Canny)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

# 2. Load Pipeline with any SD 1.5 checkpoint
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    safety_checker=None,
)
# UniPC scheduler: faster convergence, better for ControlNet
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# 3. Extract Canny edges from reference image
image = Image.open("product_sketch.png").convert("RGB").resize((512, 512))
canny = CannyDetector()
control_image = canny(image, low_threshold=100, high_threshold=200)
# control_image: black background, white lines at edges

# 4. Generate conditioned on edges
output = pipe(
    prompt="high-quality product photo of a sports shoe, studio lighting, white background",
    negative_prompt="blurry, distorted, watermark, low quality, cartoon",
    image=control_image,
    num_inference_steps=30,            # More steps = better quality
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.8, # How strongly to follow edges (0.0-1.5)
    generator=torch.manual_seed(42),
).images[0]

output.save("product_controlled.png")

4. OpenPose: Character Pose Control

# Pose transfer: take the exact pose from a reference photo,
# apply it to an entirely different character/style

openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")

# Extract skeleton from reference photo
reference_photo = Image.open("model_pose.jpg").convert("RGB")
pose_image = openpose(reference_photo, include_hand=True, include_face=True)  # Include hands and face

controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16,
)
pipe_pose = StableDiffusionControlNetPipeline.from_pretrained(
    "Lykon/dreamshaper-8",  # Stylized community base model
    controlnet=controlnet_pose,
    torch_dtype=torch.float16,
)
pipe_pose.enable_model_cpu_offload()

# Generate character in extracted pose
output = pipe_pose(
    prompt="fantasy warrior with armor plating, epic cinematic lighting, detailed illustration",
    negative_prompt="modern clothes, casual, realistic photo",
    image=pose_image,
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=1.0,  # Full pose adherence
).images[0]
# Result: your fantasy warrior exactly replicates the model's pose

5. Multi-ControlNet: Stack Multiple Controls

# Combine Canny + Depth + Pose simultaneously for maximum control
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
]

pipe_multi = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,  # Pass a list
    torch_dtype=torch.float16,
)
pipe_multi.enable_model_cpu_offload()

# Prepare control images for each ControlNet from one shared reference image
reference_image = Image.open("reference_scene.jpg").convert("RGB").resize((512, 512))
canny_image = canny(reference_image)
depth_image = MidasDetector.from_pretrained("lllyasviel/Annotators")(reference_image)
pose_image = openpose(reference_image)

output = pipe_multi(
    prompt="...",
    image=[canny_image, depth_image, pose_image],   # One per ControlNet
    controlnet_conditioning_scale=[0.6, 0.4, 0.9],  # Different strength per ControlNet
    num_inference_steps=30,
).images[0]
# Canny controls outline structure, depth controls 3D layout, pose controls character

Frequently Asked Questions

What's the difference between controlnet_conditioning_scale 0.5 vs 1.5?

At 0.5, the control signal guides the generation softly — the model follows the structure loosely while keeping creative freedom to deviate. Good for artistic/stylized outputs where exact structure isn't critical. At 1.0, the control signal has strong influence — the model closely follows the edge map or pose skeleton. At 1.5+, the control is very strong and can create artifacts as the model struggles to follow both the text prompt and an overly constraining structural signal. For most use cases, 0.7-0.9 for Canny and 0.9-1.0 for OpenPose work well.

Does ControlNet work with SDXL models?

Yes — separate SDXL ControlNet models exist: diffusers/controlnet-canny-sdxl-1.0, diffusers/controlnet-depth-sdxl-1.0. Use StableDiffusionXLControlNetPipeline instead of StableDiffusionControlNetPipeline. SDXL ControlNet produces higher-resolution outputs (1024x1024 native) but requires significantly more VRAM (~16GB for SDXL ControlNet vs ~6GB for SD1.5 ControlNet). T2I-Adapter is a lighter alternative with similar control quality at lower VRAM requirements.

Conclusion

ControlNet transforms Stable Diffusion from a text-following model into a precise composition tool. Canny edge control preserves product outlines and architectural structures. OpenPose enables exact character pose replication for consistent character generation across a series. Depth maps maintain 3D layout when restyling interior photos. Stack multiple ControlNets for complex scenes requiring simultaneous structural and pose control. For production workflows, IP-Adapter combined with ControlNet provides both pose control and consistent character appearance — the combination used in most professional AI content pipelines.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
