ComfyUI: Programming with Nodes
Dec 30, 2025 • 22 min read
Automatic1111 is a user interface. ComfyUI is an engineering environment. Both use the same Stable Diffusion models, but the experience is fundamentally different: A1111 gives you sliders and a Generate button; ComfyUI gives you explicit data flow between every component of the pipeline. When something goes wrong in A1111, you adjust sliders and hope. In ComfyUI, you can inspect the tensor at any node, swap individual components, and wire up algorithms that A1111's interface can't express. If you want to understand what's actually happening inside the diffusion process, ComfyUI is the only tool that shows you.
1. The Core Concepts: What You're Actually Wiring
Every Stable Diffusion inference process involves these stages, which ComfyUI makes explicit:
- Checkpoint Loader: Loads the model weights (.safetensors) — gives you the U-Net, CLIP text encoder, and VAE decoder as separate outputs you can route independently
- CLIP Text Encode: Converts your prompt string into a conditioning tensor using the CLIP model — you can wire different prompts to positive/negative conditioning separately
- VAE Encode/Decode: Converts between pixel space (images) and latent space (compressed representation where diffusion happens)
- Empty Latent Image: Creates a noise tensor of your target resolution to start denoising from
- KSampler: The denoising loop — takes latent noise + conditioning + model, runs N steps of denoising, outputs a cleaner latent
Understanding this data flow is what separates beginner from advanced image generation — every "magic trick" (HiRes Fix, img2img, inpainting, ControlNet) is just novel wiring of these components.
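As a concrete check on the pixel/latent relationship above: the standard Stable Diffusion VAE downsamples each spatial dimension by 8 and uses 4 latent channels (true for both SD1.5 and SDXL), so an Empty Latent Image node at 1024x1024 actually allocates a 4x128x128 tensor. A small sketch (the helper name is ours, not a ComfyUI API):

```python
def latent_shape(width: int, height: int, channels: int = 4, downsample: int = 8):
    """Shape of the latent tensor a given pixel resolution maps to.

    Assumes the standard SD VAE: 8x spatial downsampling, 4 latent channels.
    Dimensions must divide evenly by the downsample factor, which is why
    ComfyUI resolution widgets snap to multiples of 8.
    """
    if width % downsample or height % downsample:
        raise ValueError(f"dimensions must be multiples of {downsample}")
    return (channels, height // downsample, width // downsample)

print(latent_shape(1024, 1024))  # SDXL native: (4, 128, 128)
print(latent_shape(512, 512))    # SD1.5 native: (4, 64, 64)
```

This is also why diffusion is cheap relative to pixel-space processing: the KSampler works on a tensor 48x smaller than the final image.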
2. HiRes Fix: The Full Workflow Deconstructed
# ComfyUI HiRes Fix Workflow in Python (via API)
# Equivalent to clicking "Hires Fix" in A1111, but you control every step
# The "Hires Fix" algorithm:
# 1. Generate at native resolution (512x512 for SD1.5, 1024x1024 for SDXL)
# → Good composition but limited detail at high resolutions
# 2. Decode latent → pixel image (temporarily in pixel space)
# 3. Upscale pixels 2x using ESRGAN (AI upscaler, not bilinear)
# 4. Re-encode upscaled pixels → latent space
# 5. Run KSampler again at LOW denoising (0.4-0.6)
# → Adds detail without destroying composition established in step 1
# ComfyUI JSON workflow structure (simplified):
workflow = {
    "1": {  # CheckpointLoaderSimple
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "sdxl_base_1.0.safetensors"},
    },
    "2": {  # CLIPTextEncode (Positive)
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "A beautiful landscape with mountains, highly detailed, 8k",
            "clip": ["1", 1],  # ["source_node_id", output_slot_index]
        },
    },
    "3": {  # CLIPTextEncode (Negative)
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "blur, low quality, watermark, text",
            "clip": ["1", 1],
        },
    },
    "4": {  # EmptyLatentImage (Step 1: native resolution)
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 1},
    },
    "5": {  # KSampler (First pass: composition)
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["3", 0],
            "latent_image": ["4", 0],
            "seed": 42,
            "steps": 30,
            "cfg": 7.0,
            "sampler_name": "dpmpp_2m",
            "scheduler": "karras",
            "denoise": 1.0,  # Full denoising from pure noise
        },
    },
    "6": {  # VAEDecode (Step 2: to pixel space)
        "class_type": "VAEDecode",
        "inputs": {"samples": ["5", 0], "vae": ["1", 2]},
    },
    "7": {  # ImageUpscaleWithModel (Step 3: ESRGAN upscale 2x)
        "class_type": "ImageUpscaleWithModel",
        "inputs": {"upscale_model": ["8", 0], "image": ["6", 0]},
    },
    "8": {  # UpscaleModelLoader
        "class_type": "UpscaleModelLoader",
        "inputs": {"model_name": "RealESRGAN_x2plus.pth"},
    },
    "9": {  # VAEEncode (Step 4: back to latent space)
        "class_type": "VAEEncode",
        "inputs": {"pixels": ["7", 0], "vae": ["1", 2]},
    },
    "10": {  # KSampler (Step 5: second pass for detail)
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["3", 0],
            "latent_image": ["9", 0],
            "seed": 42, "steps": 20, "cfg": 7.0,
            "sampler_name": "dpmpp_2m", "scheduler": "karras",
            "denoise": 0.5,  # CRITICAL: Only 50% denoising — keeps composition!
        },
    },
    "11": {  # Final VAEDecode
        "class_type": "VAEDecode",
        "inputs": {"samples": ["10", 0], "vae": ["1", 2]},
    },
    "12": {  # SaveImage
        "class_type": "SaveImage",
        "inputs": {"images": ["11", 0], "filename_prefix": "hires_fix_"},
    },
}

3. Calling ComfyUI via Python API
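A workflow like the one in section 2 is plain JSON, and every `["node_id", slot]` link is just a string reference — nothing checks it until the server tries to execute the graph. Before submitting over the API, a small client-side sanity check (our own helper, not part of ComfyUI; the link-detection heuristic assumes no node takes a literal two-element `[str, int]` list as input) can catch dangling references early:

```python
def find_dangling_links(workflow: dict) -> list:
    """Return (node_id, input_name, missing_target) for links to absent nodes.

    In ComfyUI's API format a link is a two-element list:
    ["source_node_id", output_slot_index]. Anything else is a literal value.
    """
    bad = []
    for node_id, node in workflow.items():
        for input_name, value in node.get("inputs", {}).items():
            is_link = (
                isinstance(value, list) and len(value) == 2
                and isinstance(value[0], str) and isinstance(value[1], int)
            )
            if is_link and value[0] not in workflow:
                bad.append((node_id, input_name, value[0]))
    return bad

# Example: node "2" links to a node "99" that does not exist
broken = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sdxl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "hi", "clip": ["99", 1]}},
}
print(find_dangling_links(broken))  # [('2', 'clip', '99')]
```

The server also validates the graph at POST /prompt time and reports node errors in its response; this is just a fast local pre-check before paying a network round trip.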
import requests
import json
import uuid
import time

COMFY_URL = "http://localhost:8188"

def generate_image(workflow: dict, prompt_text: str, timeout_s: float = 300.0) -> bytes:
    """Submit workflow to ComfyUI and return generated image bytes."""
    # Inject prompt into the workflow (find the text encode node)
    # In this workflow, node "2" is the positive CLIP text encode
    workflow["2"]["inputs"]["text"] = prompt_text

    # Generate unique client ID for this request
    client_id = str(uuid.uuid4())

    # Submit workflow
    response = requests.post(
        f"{COMFY_URL}/prompt",
        json={"prompt": workflow, "client_id": client_id},
    )
    response.raise_for_status()
    prompt_id = response.json()["prompt_id"]

    # Poll for completion (or use WebSocket for real-time progress),
    # with a deadline so a stuck job can't hang the caller forever
    deadline = time.monotonic() + timeout_s
    while True:
        history = requests.get(f"{COMFY_URL}/history/{prompt_id}").json()
        if prompt_id in history:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"workflow {prompt_id} did not finish in {timeout_s}s")
        time.sleep(0.5)

    # Get output image
    outputs = history[prompt_id]["outputs"]
    first_output_node = next(iter(outputs))
    image_info = outputs[first_output_node]["images"][0]

    # Download image bytes
    image_response = requests.get(
        f"{COMFY_URL}/view",
        params={"filename": image_info["filename"], "type": image_info["type"]},
    )
    return image_response.content

# Usage
with open("hires_workflow.json") as f:
    workflow = json.load(f)  # Exported from ComfyUI in API format (see FAQ below)

image_bytes = generate_image(workflow, "A futuristic city at sunset, cyberpunk aesthetic")
with open("output.png", "wb") as f:
    f.write(image_bytes)

4. IP-Adapter: Image Prompting for Character Consistency
# IP-Adapter injects reference image features into cross-attention layers
# Result: generated image inherits style/content from reference without LoRA training
# ComfyUI nodes needed (available via ComfyUI Manager):
# - IPAdapterModelLoader: loads ip-adapter weights
# - CLIPVisionLoader: loads CLIP visual encoder for image feature extraction
# - IPAdapter: applies the adapter to the model
workflow_ipadapter = {
    # ... standard checkpoint/KSampler nodes ...
    "20": {  # Load IP-Adapter model
        "class_type": "IPAdapterModelLoader",
        "inputs": {"ipadapter_file": "ip-adapter-plus-face_sdxl.bin"},
    },
    "21": {  # Load CLIP Vision encoder
        "class_type": "CLIPVisionLoader",
        "inputs": {"clip_name": "clip_vision_sdxl.safetensors"},
    },
    "22": {  # Load your reference image
        "class_type": "LoadImage",
        "inputs": {"image": "reference_character.png"},
    },
    "23": {  # Apply IP-Adapter to model
        "class_type": "IPAdapter",
        "inputs": {
            "model": ["1", 0],  # The base checkpoint model
            "ipadapter": ["20", 0],
            "image": ["22", 0],  # Your reference character image
            "clip_vision": ["21", 0],
            "weight": 0.7,  # 0.0=ignore ref, 1.0=strong adherence
            "weight_type": "linear",
            "combine_embeds": "concat",
        },
    },
    # Wire node 23's output model → KSampler instead of node 1
    # Every generation will now inherit visual features from reference_character.png
}

Frequently Asked Questions
What's the difference between "API format" and "regular" workflow export?
ComfyUI has two export modes. Regular workflow export (Ctrl+S) saves the graph for the ComfyUI UI — it includes node positions and display settings, and is human-readable but not directly usable for API calls. API format export (enable Dev Mode in the settings, which adds a "Save (API Format)" option) produces a minimal JSON containing just node configurations and connections, stripped of UI metadata — this is what you pass to the /prompt API endpoint. When building automated pipelines, always test in the ComfyUI UI first, then export in API format for your backend integration.
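One practical consequence: node IDs in an API-format export are arbitrary (they reflect the order nodes were created in the UI), so hard-coding "2" as the positive prompt node, as the client above does, is brittle. A more robust pattern — sketched here with our own helper name, and with the caveat that distinguishing positive from negative by occurrence order is itself an assumption about the export — is to locate nodes by class_type before patching:

```python
def set_text_by_class(workflow: dict, class_type: str, text: str, occurrence: int = 0):
    """Set the 'text' input on the nth node of a given class_type.

    With two CLIPTextEncode nodes, occurrence 0 vs 1 only distinguishes
    positive from negative if that matches creation order; tagging nodes
    with titles in the UI export is more reliable for production use.
    """
    matches = [nid for nid, node in sorted(workflow.items(), key=lambda kv: int(kv[0]))
               if node.get("class_type") == class_type]
    node_id = matches[occurrence]
    workflow[node_id]["inputs"]["text"] = text
    return node_id

wf = {
    "7": {"class_type": "CLIPTextEncode", "inputs": {"text": "old positive"}},
    "9": {"class_type": "CLIPTextEncode", "inputs": {"text": "old negative"}},
}
set_text_by_class(wf, "CLIPTextEncode", "a castle at dawn", occurrence=0)
print(wf["7"]["inputs"]["text"])  # a castle at dawn
```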
How do I handle memory issues when running multiple workflows concurrently?
ComfyUI queues workflows serially by default (one at a time), and the queue is unbounded — each queued job simply holds its workflow payload in server memory until it runs. For concurrent serving, run multiple ComfyUI instances, one per GPU, on separate ports (--port 8188, 8189, 8190, ..., with --cuda-device to pin each instance to a GPU) and load-balance across them. If you're VRAM-constrained on a single GPU, the --lowvram or --novram flags enable model offloading between generations, at the cost of 2-3x slower generation.
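The multi-instance setup described above still needs a dispatcher on the client side. A minimal round-robin sketch (instance URLs and the class name are placeholders) that the earlier generate_image client could be pointed at:

```python
import itertools

class ComfyPool:
    """Round-robin over several ComfyUI instances, one per GPU.

    Assumes instances were started with e.g.
        python main.py --port 8188 --cuda-device 0
    and 8189 / GPU 1, 8190 / GPU 2, and so on.
    """
    def __init__(self, urls):
        # itertools.cycle repeats the URL list indefinitely
        self._cycle = itertools.cycle(urls)

    def next_url(self) -> str:
        return next(self._cycle)

pool = ComfyPool([
    "http://localhost:8188",
    "http://localhost:8189",
    "http://localhost:8190",
])
print(pool.next_url())  # http://localhost:8188
print(pool.next_url())  # http://localhost:8189
```

Round-robin ignores per-instance load; a smarter dispatcher could poll each instance's GET /queue endpoint and route to the shortest queue instead.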
Conclusion
ComfyUI is the tool that separates image generation practitioners from engineers. By making every operation explicit — the VAE encode/decode cycle, the CLIP conditioning flow, the KSampler's denoising parameters — it provides complete control over the diffusion pipeline that no button-based UI can match. The API mode transforms ComfyUI from a creative tool into a production image generation backend, enabling SaaS products that use Stable Diffusion without exposing users to the underlying complexity. Master the HiRes Fix workflow, IP-Adapter for character consistency, and batch XYZ parameter testing, and you'll have the complete professional toolkit for generative image engineering.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.