opncrafter

Generative Media Workflows

Dec 30, 2025 • 22 min read

Building with text generation alone misses the majority of creative and commercial value in generative AI. The real opportunity is multimodal automation: pipelines that combine language models for scripting, image generation for visual assets, video models for motion, voice synthesis for narration, and video editing for assembly — all orchestrated programmatically. This guide covers how to build and deploy these generative media workflows, from ComfyUI node design to production GPU backends.

1. ComfyUI: Backend, Not Frontend

Using Midjourney through Discord makes you a consumer. Understanding how tensors flow through the diffusion U-Net makes you an engineer. ComfyUI is the tool that bridges these worlds: a node-based graph editor for building explicit Stable Diffusion pipelines where every component is visible and controllable.

  • Nodes and edges: Each operation (checkpoint loading, CLIP encoding, VAE decoding, KSampler) is an explicit node. You wire them together, seeing exactly how data flows
  • Reproducibility: A JSON workflow file precisely captures every parameter — share a workflow and anyone can reproduce your exact output
  • API mode: Run your workflow via WebSocket or REST API from Python — the killer feature for building SaaS products
  • Embedded metadata: ComfyUI embeds the full workflow JSON into generated PNG files — drag an image back into ComfyUI to instantly restore the workflow that created it
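The API mode mentioned above is a small HTTP surface: POST a workflow graph to the /prompt endpoint and ComfyUI queues it, returning a prompt_id. A minimal stdlib-only sketch (the endpoint and the {"prompt": ..., "client_id": ...} envelope are ComfyUI's; the helper names are mine):

```python
import json
import uuid
import urllib.request

def build_prompt_payload(workflow: dict, client_id: str) -> bytes:
    """Wrap a workflow graph in the JSON envelope ComfyUI's /prompt endpoint expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_workflow(workflow: dict, host: str = "http://127.0.0.1:8188") -> str:
    """Submit a workflow to a running ComfyUI server; returns the queued prompt_id."""
    req = urllib.request.Request(
        f"{host}/prompt",
        data=build_prompt_payload(workflow, str(uuid.uuid4())),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

The same client_id can then be used to open a WebSocket at /ws?clientId=... to stream progress events while the graph executes.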

2. The Full AI Video Production Pipeline

The "Holy Grail" Workflow
  1. Script: GPT-4o generates a structured 30-second commercial script with scene descriptions, dialogue, and shot composition
  2. Character Design: SDXL + IP-Adapter generates consistent character images from reference photos across all scenes
  3. Pose Control: ControlNet OpenPose ensures characters appear in the correct positions per script stage directions
  4. Animation: Runway Gen-3, Kling, or Luma Dream Machine animates the static character images (image-to-video)
  5. Voice Generation: ElevenLabs or Cartesia synthesizes script dialogue in the character's voice
  6. Lip Sync: Synclabs or Wav2Lip animates character mouths to match the generated audio
  7. Assembly: FFmpeg concatenates scenes, mixes audio, adds background music from MusicGen
The steps above can be driven from a single Python script; download_comfy_result and runway_image_to_video stand in for provider-specific glue code:

# Orchestrating a complete video generation pipeline
import json
import subprocess

import requests
from openai import OpenAI

client = OpenAI()

def generate_promotional_video(product_name: str, target_audience: str):
    """End-to-end AI video production pipeline."""
    
    # Step 1: Script generation
    script_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate a 30-second promotional video script for {product_name} targeting {target_audience}.
            Format as JSON with: scenes (list of {{description, dialogue, duration_s, shot_type}}).
            Total scenes should sum to 30 seconds."""
        }],
        response_format={"type": "json_object"},
    )
    script = json.loads(script_response.choices[0].message.content)
    
    # Step 2: Generate character/product images for each scene
    scene_images = []
    for scene in script["scenes"]:
        # Use ComfyUI API to run our pre-built character generation workflow
        comfy_response = requests.post(
            "http://localhost:8188/prompt",
            json={
                "prompt": {
                    "3": {"inputs": {"text": scene["description"]}},  # CLIP Text Encode node
                    "4": {"inputs": {"seed": 42, "steps": 30, "cfg": 7.0}},  # KSampler
                }
            }
        )
        # Poll for completion and download image...
        scene_images.append(download_comfy_result(comfy_response.json()["prompt_id"]))
    
    # Step 3: Voice synthesis
    audio_files = []
    for scene in script["scenes"]:
        if scene.get("dialogue"):
            tts_response = client.audio.speech.create(
                model="tts-1-hd",
                voice="nova",
                input=scene["dialogue"],
            )
            audio_path = f"audio_scene_{len(audio_files)}.mp3"
            with open(audio_path, "wb") as f:
                f.write(tts_response.content)
            audio_files.append(audio_path)
    
    # Step 4: Animate images to video (Runway Gen-3 API)
    video_clips = []
    for image_path in scene_images:
        # Call Runway ML Gen-3 API for image-to-video
        clip_path = runway_image_to_video(image_path, duration=5)
        video_clips.append(clip_path)
    
    # Step 5: Assemble with FFmpeg
    with open("concat_list.txt", "w") as f:
        for clip in video_clips:
            f.write(f"file '{clip}'\n")
    
    subprocess.run([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", "concat_list.txt",
        "-c:v", "copy",
        "final_video.mp4"
    ], check=True)
    
    return "final_video.mp4"
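The download_comfy_result helper used above is left as a stub. A minimal stdlib-only version polls ComfyUI's /history endpoint until the prompt appears, then fetches the first output image from /view (the endpoints and the filename/subfolder/type query parameters are ComfyUI's; the helper names are mine):

```python
import json
import time
import urllib.parse
import urllib.request

COMFY = "http://127.0.0.1:8188"

def view_url(image_info: dict, host: str = COMFY) -> str:
    """Build the /view URL for one image record from a ComfyUI history entry."""
    params = urllib.parse.urlencode({
        "filename": image_info["filename"],
        "subfolder": image_info.get("subfolder", ""),
        "type": image_info.get("type", "output"),
    })
    return f"{host}/view?{params}"

def download_comfy_result(prompt_id: str, out_path: str = "scene.png",
                          timeout_s: int = 120) -> str:
    """Poll /history until the prompt finishes, then download its first output image."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        with urllib.request.urlopen(f"{COMFY}/history/{prompt_id}") as resp:
            history = json.loads(resp.read())
        if prompt_id in history:
            outputs = history[prompt_id]["outputs"]
            image = next(o["images"][0] for o in outputs.values() if "images" in o)
            with urllib.request.urlopen(view_url(image)) as img, open(out_path, "wb") as f:
                f.write(img.read())
            return out_path
        time.sleep(1)
    raise TimeoutError(f"ComfyUI prompt {prompt_id} did not finish in {timeout_s}s")
```

A bounded deadline is worth keeping even in a sketch: a workflow that errors out server-side never appears in /history, and an unbounded poll would hang the whole pipeline.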

3. Production Backend: Modal and RunPod

# Modal.com: Serverless GPU inference — pay per second, scale to zero
# No GPU server to manage, no idle costs

import modal

app = modal.App("comfyui-backend")

# Docker image with ComfyUI and models pre-downloaded
image = (
    modal.Image.debian_slim()
    .apt_install("git", "wget")
    .pip_install("torch", "torchvision", "requests")
    .run_commands(
        "git clone https://github.com/comfyanonymous/ComfyUI /app/ComfyUI",
        "pip install -r /app/ComfyUI/requirements.txt",
        # Download model weights into the image (cached between deployments)
        "wget https://huggingface.co/.../sdxl-base.safetensors -P /app/ComfyUI/models/checkpoints/",
    )
)

@app.function(
    image=image,
    gpu="A10G",           # billed per second; scales to zero between requests
    timeout=300,
    volumes={"/cache": modal.Volume.from_name("model-cache")},
)
def run_comfyui_workflow(workflow_json: dict, prompt_overrides: dict) -> bytes:
    """Run a ComfyUI workflow and return the generated image as bytes."""
    import subprocess
    import json
    import time
    from pathlib import Path
    
    # Inject user-provided prompt into the workflow
    for node_id, overrides in prompt_overrides.items():
        if node_id in workflow_json:
            workflow_json[node_id]["inputs"].update(overrides)
    
    # Start ComfyUI server
    proc = subprocess.Popen(["python", "/app/ComfyUI/main.py", "--port", "8188", "--listen"])
    time.sleep(5)  # Wait for server startup
    
    # Submit workflow
    import requests
    response = requests.post("http://localhost:8188/prompt", json={"prompt": workflow_json})
    prompt_id = response.json()["prompt_id"]
    
    # Poll for completion
    while True:
        history = requests.get(f"http://localhost:8188/history/{prompt_id}").json()
        if prompt_id in history:
            break
        time.sleep(1)
    
    # Return image bytes
    output_node_id = list(history[prompt_id]["outputs"].keys())[0]
    filename = history[prompt_id]["outputs"][output_node_id]["images"][0]["filename"]
    image_data = requests.get(f"http://localhost:8188/view?filename={filename}").content
    
    proc.terminate()
    return image_data

# Deploy: modal deploy comfyui_backend.py
# Then call from your web app:
# bytes = run_comfyui_workflow.remote(workflow_json, {"6": {"inputs": {"text": user_prompt}}})

Frequently Asked Questions

What's the difference between Runway, Kling, and Luma for video generation?

  • Runway Gen-3: best cinematic quality; strong director-style prompt following ("dolly zoom", "rack focus"); $15/month minimum; API available
  • Kling: excellent motion quality; competitive pricing; strong at character animation; available via the Fal.ai API
  • Luma Dream Machine: best at realistic physics and natural motion; great for product visualization; slightly cheaper per generation

For automated pipelines, all three have APIs — test each for your specific content type, since quality varies significantly by subject matter (characters vs. landscapes vs. products).

How do I maintain character consistency across multiple generations?

IP-Adapter is the standard solution: provide a reference image of your character, and IP-Adapter injects its visual features into the cross-attention layers, conditioning every generation toward that appearance. For stronger consistency, combine IP-Adapter with a character LoRA. For commercial workflows: Midjourney's Character Reference (--cref) and Flux's character-consistency features simplify this without requiring ComfyUI setup. Always save a "canonical" character reference image and use it as input for every scene.
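In a ComfyUI pipeline this discipline is easy to enforce programmatically: pin the same canonical reference image into the IP-Adapter node of every per-scene workflow and vary only the prompt. A minimal sketch (the node ids "10" and "3" are placeholders for whatever ids your workflow actually uses):

```python
import copy

def scene_workflows(base_workflow: dict, scenes: list, reference_image: str,
                    ipadapter_id: str = "10", prompt_id: str = "3"):
    """Yield one workflow per scene: same character reference every time,
    only the scene prompt changes. Node ids are workflow-specific placeholders."""
    for scene in scenes:
        wf = copy.deepcopy(base_workflow)  # never mutate the shared base graph
        wf[ipadapter_id]["inputs"]["image"] = reference_image
        wf[prompt_id]["inputs"]["text"] = scene["description"]
        yield wf
```

Deep-copying the base graph per scene keeps the canonical workflow immutable, so a bug in one scene's overrides cannot leak into the next generation.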

Conclusion

Generative media workflows are the frontier of applied AI engineering in 2025. The tools exist to automate the entire video production pipeline — from scripted concept to finished video — with minimal human intervention. ComfyUI provides the fine-grained control needed for reproducible image generation; video models (Runway, Kling) handle animation; voice synthesis and lip sync complete the production. Deploying on serverless GPU infrastructure (Modal, RunPod) keeps costs proportional to usage. The competitive advantage goes to engineers who can orchestrate these components reliably at production quality and scale.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK