Generative Media Workflows
Dec 30, 2025 • 22 min read
Building with text generation alone misses the majority of creative and commercial value in generative AI. The real opportunity is multimodal automation: pipelines that combine language models for scripting, image generation for visual assets, video models for motion, voice synthesis for narration, and video editing for assembly — all orchestrated programmatically. This guide covers how to build and deploy these generative media workflows, from ComfyUI node design to production GPU backends.
1. ComfyUI: Backend, Not Frontend
Using Midjourney through Discord makes you a consumer. Understanding how tensors flow through the diffusion U-Net makes you an engineer. ComfyUI is the tool that bridges these worlds: a node-based graph editor for building explicit Stable Diffusion pipelines where every component is visible and controllable.
- Nodes and edges: Each operation (checkpoint loading, CLIP encoding, VAE decoding, KSampler) is an explicit node. You wire them together, seeing exactly how data flows
- Reproducibility: A JSON workflow file precisely captures every parameter — share a workflow and anyone can reproduce your exact output
- API mode: Run your workflow via WebSocket or REST API from Python — the killer feature for building SaaS products
- Embedded metadata: ComfyUI embeds the full workflow JSON into generated PNG files — drag an image back into ComfyUI to instantly restore the workflow that created it
2. The Full AI Video Production Pipeline
- Script: GPT-4o generates a structured 30-second commercial script with scene descriptions, dialogue, and shot composition
- Character Design: SDXL + IP-Adapter generates consistent character images from reference photos across all scenes
- Pose Control: ControlNet OpenPose ensures characters appear in the correct positions per script stage directions
- Animation: Runway Gen-3, Kling, or Luma Dream Machine animates the static character images (image-to-video)
- Voice Generation: ElevenLabs or Cartesia synthesizes script dialogue in the character's voice
- Lip Sync: Synclabs or Wav2Lip syncs the audio to visible character mouths
- Assembly: FFmpeg concatenates scenes, mixes audio, adds background music from MusicGen
# Orchestrating a complete video generation pipeline
import asyncio
import json
import subprocess

import requests
from openai import OpenAI

client = OpenAI()

async def generate_promotional_video(product_name: str, target_audience: str):
    """End-to-end AI video production pipeline."""
    # Step 1: Script generation
    script_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate a 30-second promotional video script for {product_name} targeting {target_audience}.
Format as JSON with: scenes (list of {{description, dialogue, duration_s, shot_type}}).
Total scenes should sum to 30 seconds.""",
        }],
        response_format={"type": "json_object"},
    )
    script = json.loads(script_response.choices[0].message.content)

    # Step 2: Generate character/product images for each scene
    scene_images = []
    for scene in script["scenes"]:
        # Use the ComfyUI API to run our pre-built character generation workflow
        comfy_response = requests.post(
            "http://localhost:8188/prompt",
            json={
                "prompt": {
                    "3": {"inputs": {"text": scene["description"]}},  # CLIP Text Encode node
                    "4": {"inputs": {"seed": 42, "steps": 30, "cfg": 7.0}},  # KSampler
                }
            },
        )
        # Poll for completion and download the image (helper defined elsewhere)
        scene_images.append(download_comfy_result(comfy_response.json()["prompt_id"]))

    # Step 3: Voice synthesis
    audio_files = []
    for scene in script["scenes"]:
        if scene.get("dialogue"):
            tts_response = client.audio.speech.create(
                model="tts-1-hd",
                voice="nova",
                input=scene["dialogue"],
            )
            audio_path = f"audio_scene_{len(audio_files)}.mp3"
            with open(audio_path, "wb") as f:
                f.write(tts_response.content)
            audio_files.append(audio_path)

    # Step 4: Animate images to video (Runway Gen-3 API, helper defined elsewhere)
    video_clips = []
    for image_path in scene_images:
        clip_path = runway_image_to_video(image_path, duration=5)
        video_clips.append(clip_path)

    # Step 5: Assemble with FFmpeg's concat demuxer
    with open("concat_list.txt", "w") as f:
        for clip in video_clips:
            f.write(f"file '{clip}'\n")
    subprocess.run([
        "ffmpeg", "-f", "concat", "-safe", "0",
        "-i", "concat_list.txt",
        "-c:v", "copy",
        "final_video.mp4",
    ])
    return "final_video.mp4"

3. Production Backend: Modal and RunPod
# Modal.com: Serverless GPU inference — pay per second, scale to zero
# No GPU server to manage, no idle costs
import modal

app = modal.App("comfyui-backend")

# Docker image with ComfyUI and models pre-downloaded
image = (
    modal.Image.debian_slim()
    .apt_install("git", "wget")
    .pip_install("torch", "torchvision", "requests")
    .run_commands(
        "git clone https://github.com/comfyanonymous/ComfyUI /app/ComfyUI",
        "pip install -r /app/ComfyUI/requirements.txt",
        # Download model weights into the image (cached between deployments)
        "wget https://huggingface.co/.../sdxl-base.safetensors -P /app/ComfyUI/models/checkpoints/",
    )
)

@app.function(
    image=image,
    gpu="A10G",  # billed per second — scales to zero between requests
    timeout=300,
    volumes={"/cache": modal.Volume.from_name("model-cache")},
)
def run_comfyui_workflow(workflow_json: dict, prompt_overrides: dict) -> bytes:
    """Run a ComfyUI workflow and return the generated image as bytes."""
    import subprocess
    import time

    import requests

    # Inject user-provided prompt values into the workflow
    for node_id, overrides in prompt_overrides.items():
        if node_id in workflow_json:
            workflow_json[node_id]["inputs"].update(overrides)

    # Start the ComfyUI server and wait until it answers HTTP
    proc = subprocess.Popen(["python", "/app/ComfyUI/main.py", "--port", "8188", "--listen"])
    for _ in range(60):
        try:
            requests.get("http://localhost:8188/", timeout=1)
            break
        except requests.exceptions.ConnectionError:
            time.sleep(1)

    # Submit the workflow
    response = requests.post("http://localhost:8188/prompt", json={"prompt": workflow_json})
    prompt_id = response.json()["prompt_id"]

    # Poll for completion
    while True:
        history = requests.get(f"http://localhost:8188/history/{prompt_id}").json()
        if prompt_id in history:
            break
        time.sleep(1)

    # Return the image bytes
    output_node_id = list(history[prompt_id]["outputs"].keys())[0]
    filename = history[prompt_id]["outputs"][output_node_id]["images"][0]["filename"]
    image_data = requests.get(f"http://localhost:8188/view?filename={filename}").content
    proc.terminate()
    return image_data

# Deploy: modal deploy comfyui_backend.py
# Then call from your web app:
# image_bytes = run_comfyui_workflow.remote(workflow_json, {"6": {"inputs": {"text": user_prompt}}})

Frequently Asked Questions
What's the difference between Runway, Kling, and Luma for video generation?
Runway Gen-3: Best for cinematic quality, strong director-style prompt following ("dolly zoom", "rack focus"), $15/month minimum, API available. Kling: Excellent motion quality, competitive pricing, strong at character animation, available via Fal.ai API. Luma Dream Machine: Best at realistic physics and natural motion, great for product visualization, slightly cheaper per generation. For automated pipelines: all three have APIs — test each for your specific content type as quality varies significantly by subject matter (characters vs landscapes vs products).
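The advice above ("test each for your specific content type") can be encoded as a simple routing table in an automated pipeline. This is a hypothetical sketch: the provider metadata and the routing heuristic are our own summary of the comparison above, and none of the strings here are real API identifiers.

```python
# Hypothetical provider-routing sketch for image-to-video generation.
# Strengths summarized from hands-on comparison; adjust after your own tests.
PROVIDERS = {
    "runway": {"strength": "cinematic prompts, director-style control"},
    "kling": {"strength": "character animation (available via Fal.ai)"},
    "luma": {"strength": "realistic physics, product visualization"},
}

def pick_provider(content_type: str) -> str:
    """Route a scene to the model that tends to handle its subject best."""
    routing = {
        "character": "kling",
        "product": "luma",
        "cinematic": "runway",
    }
    # Default to Runway for unclassified content
    return routing.get(content_type, "runway")

chosen = pick_provider("character")
```

In a real pipeline you would key this off the `shot_type` field the script generator already emits, and periodically re-benchmark since model quality shifts quickly.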
How do I maintain character consistency across multiple generations?
IP-Adapter is the standard solution: provide a reference image of your character, and IP-Adapter injects the visual features into the cross-attention layers, conditioning every generation toward that appearance. For stronger consistency, combine IP-Adapter with a character LoRA. For commercial workflows: Midjourney's "Character Reference" (--cref) and Flux's character consistency features simplify this without requiring ComfyUI setup. Always save a "canonical" character reference image and use it as input for every scene.
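As a rough illustration, an IP-Adapter branch in a ComfyUI workflow JSON looks something like the fragment below. The exact `class_type` names, input slots, and preset strings depend on which IP-Adapter custom-node pack (and version) you have installed, so treat every name here as a placeholder to verify against your own exported workflow:

```json
{
  "10": {"class_type": "LoadImage",
         "inputs": {"image": "character_ref.png"}},
  "11": {"class_type": "IPAdapterUnifiedLoader",
         "inputs": {"model": ["1", 0], "preset": "PLUS (high strength)"}},
  "12": {"class_type": "IPAdapter",
         "inputs": {"model": ["11", 0], "ipadapter": ["11", 1],
                    "image": ["10", 0], "weight": 0.8}}
}
```

The `["node_id", output_index]` pairs are how ComfyUI workflow JSON wires one node's output into another's input; the `weight` value controls how strongly the reference image dominates the prompt.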
Conclusion
Generative media workflows are the frontier of applied AI engineering in 2025. The tools exist to automate the entire video production pipeline — from scripted concept to finished video — with minimal human intervention. ComfyUI provides the fine-grained control needed for reproducible image generation; video models (Runway, Kling) handle animation; voice synthesis and lip sync complete the production. Deploying on serverless GPU infrastructure (Modal, RunPod) keeps costs proportional to usage. The competitive advantage goes to engineers who can orchestrate these components reliably at production quality and scale.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.