Multimodal UI: Image & Vision
Jan 3, 2026 · 22 min read
Text-to-Image is cool. Interactive image analysis is useful. Combining GPT-4o Vision with image generation creates a qualitative leap: users shift from "describe what you want" to "here, look at this", the natural way humans communicate. This guide builds the complete multimodal UI stack: image upload and analysis, screenshot-to-code generation, interactive DALL-E refinement, and video context handling.
1. The Vision Pattern: File Upload to LLM Analysis
Accepting image input requires a three-stage pipeline: browser File API → base64 encoding → vision model API. The key engineering challenge is keeping the UI responsive while processing potentially large files:
// ImageUpload.tsx - Client Component
'use client';
import { useState, useCallback } from 'react';
import { analyzeImageAction } from './actions';
export function ImageUploadAnalyzer() {
  const [preview, setPreview] = useState<string | null>(null);
  const [analysis, setAnalysis] = useState<string>('');
  const [isAnalyzing, setIsAnalyzing] = useState(false);

  const toBase64 = (file: File): Promise<string> =>
    new Promise((resolve, reject) => {
      const reader = new FileReader();
      reader.onload = () => {
        // Extract base64 data (remove data:image/...;base64, prefix)
        const result = reader.result as string;
        resolve(result.split(',')[1]);
      };
      reader.onerror = reject;
      reader.readAsDataURL(file);
    });

  const handleFileChange = async (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (!file) return;

    // Validate file type and size
    if (!file.type.startsWith('image/')) {
      alert('Please upload an image file');
      return;
    }
    if (file.size > 20 * 1024 * 1024) { // 20MB limit
      alert('File too large. Please use an image under 20MB');
      return;
    }

    // Optimistic UI: Show preview instantly before server roundtrip
    const objectUrl = URL.createObjectURL(file);
    setPreview(objectUrl);
    setIsAnalyzing(true);
    setAnalysis('');

    try {
      // Convert to base64 for server action
      const base64 = await toBase64(file);
      const mediaType = file.type as 'image/jpeg' | 'image/png' | 'image/gif' | 'image/webp';

      // Stream analysis from Server Action
      const stream = await analyzeImageAction(base64, mediaType);
      for await (const chunk of stream) {
        setAnalysis(prev => prev + chunk);
      }
    } finally {
      setIsAnalyzing(false);
      URL.revokeObjectURL(objectUrl); // Clean up object URL
    }
  };

  return (
    <div>
      <input
        type="file"
        accept="image/jpeg,image/png,image/gif,image/webp"
        onChange={handleFileChange}
        disabled={isAnalyzing}
      />
      {preview && <img src={preview} alt="Preview" style={{ maxWidth: '400px', borderRadius: '8px' }} />}
      {analysis && <div className="prose">{analysis}</div>}
    </div>
  );
}

// actions.ts - Server Action
'use server';
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
// Note: streaming an async generator across the Server Action boundary depends on
// framework support; with the Vercel AI SDK, createStreamableValue is the usual approach.
export async function* analyzeImageAction(
  base64: string,
  mediaType: 'image/jpeg' | 'image/png' | 'image/gif' | 'image/webp',
  prompt?: string
) {
  const userPrompt = prompt ||
    "Analyze this image in detail. Describe what you see, identify key elements, and provide any relevant insights or suggestions.";

  const stream = client.messages.stream({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: mediaType,
              data: base64,
            }
          },
          {
            type: "text",
            text: userPrompt,
          }
        ]
      }
    ]
  });

  // Stream text chunks to client
  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      yield event.delta.text;
    }
  }
}

2. Screenshot-to-Code (v0 Clone Pattern)
The "Holy Grail" of Generative UI: upload a screenshot mockup and receive working React code:
// Screenshot-to-code Server Action
export async function* screenshotToCode(base64: string) {
  const SYSTEM_PROMPT = `You are an expert React developer.
Your job is to convert UI screenshots into clean React component code.

Rules:
1. Use Tailwind CSS classes only (no inline styles, no custom CSS)
2. Use only Lucide React for icons
3. Make the component fully self-contained
4. Include realistic placeholder text (not "Lorem ipsum")
5. Make it pixel-perfect to the design
6. Include TypeScript type annotations
7. Export as a default export

Respond with ONLY the TypeScript/TSX code, no explanations.`;

  const stream = client.messages.stream({
    model: "claude-opus-4-5",
    max_tokens: 4096,
    system: SYSTEM_PROMPT,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: base64 }
        },
        {
          type: "text",
          text: "Convert this screenshot to a React component with Tailwind CSS."
        }
      ]
    }]
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      yield event.delta.text;
    }
  }
}
// Client component to display and copy generated code
export function CodeOutput({ code }: { code: string }) {
  // Strip markdown code fences from the model output
  const jsxCode = code.replace(/```tsx?\n?/g, '').replace(/```\n?/g, '');

  return (
    <div>
      <button onClick={() => navigator.clipboard.writeText(jsxCode)}>
        Copy Code
      </button>
      <pre><code>{jsxCode}</code></pre>
    </div>
  );
}

3. Interactive Image Generation and Refinement
// DALL-E 3 generation with conversational refinement
import OpenAI from 'openai';

const openai = new OpenAI();

// Step 1: Initial generation
async function generateImage(userDescription: string) {
  const response = await openai.images.generate({
    model: "dall-e-3",
    prompt: userDescription,
    n: 1,
    size: "1024x1024",
    quality: "hd", // "standard" or "hd"
    style: "vivid", // "vivid" = more dramatic, "natural" = more realistic
    response_format: "url"
  });
  return {
    url: response.data[0].url,
    revised_prompt: response.data[0].revised_prompt // DALL-E 3 shows you its interpretation
  };
}

// Step 2: Refinement via Vision→Text→Image loop
// DALL-E 3 doesn't support direct inpainting, so we use a describe+regenerate loop:
async function refineImage(currentImageUrl: string, instruction: string) {
  // First: Use vision to get precise description of current image
  const descriptionResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "user",
      content: [
        { type: "image_url", image_url: { url: currentImageUrl, detail: "high" } },
        { type: "text", text: "Describe this image in precise detail. Include colors, styles, composition, lighting, and all visual elements." }
      ]
    }],
    max_tokens: 500,
  });
  const currentDescription = descriptionResponse.choices[0].message.content!;

  // Construct guided refinement prompt
  const refinedPrompt = `${currentDescription}

Apply this change: ${instruction}
Keep everything else exactly the same.`;

  return openai.images.generate({
    model: "dall-e-3",
    prompt: refinedPrompt,
    n: 1,
    size: "1024x1024",
    quality: "hd",
  });
}

Frequently Asked Questions
How do I handle large images efficiently without hitting token limits?
Vision models accept images at different "detail" levels. OpenAI's GPT-4o takes detail: "low" (a fixed 85 tokens, good for general queries) or detail: "high" (tile-based pricing: 85 base tokens plus 170 per 512px tile, so a 1024x1024 image costs 765 tokens). Anthropic's Claude auto-resizes images to fit its context. For bulk image processing, resize images to 1024px on the longer side before upload: this maintains sufficient detail for most analysis while reducing token cost. For document or screenshot analysis where precise text matters, use detail: "high" and crop to the relevant section rather than uploading a full-screen screenshot.
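Both tips can be sketched as pure helpers. `fitWithin` and `estimateHighDetailTokens` are hypothetical names, not from a library; the token estimate follows OpenAI's documented tile formula for high-detail images (fit within 2048x2048, shrink the shortest side to 768, then 85 base tokens plus 170 per 512px tile). Treat the constants as a snapshot of current pricing that may change:

```typescript
// Target dimensions that cap the longer side at `max`, preserving aspect
// ratio. Feed the result to a <canvas> drawImage() call to do the resize.
function fitWithin(width: number, height: number, max = 1024) {
  const scale = Math.min(1, max / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Estimate GPT-4o token cost for a detail:"high" image.
function estimateHighDetailTokens(width: number, height: number): number {
  // 1. Scale down to fit within a 2048x2048 square (never upscale)
  let scale = Math.min(1, 2048 / Math.max(width, height));
  let w = width * scale;
  let h = height * scale;
  // 2. Scale so the shortest side is at most 768px
  scale = Math.min(1, 768 / Math.min(w, h));
  w *= scale;
  h *= scale;
  // 3. 85 base tokens + 170 per 512px tile
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}
```

For example, a 4032x3024 phone photo resizes to 1024x768 with `fitWithin`, and a 1024x1024 image at high detail works out to 765 tokens (4 tiles after scaling to 768x768), which matches OpenAI's own worked example.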
Can users paste images directly from clipboard?
Yes β handle the paste event and check e.clipboardData.items for image MIME types. Users can copy-paste screenshots directly into your text input area, dramatically reducing friction vs. the file picker dialog. This pattern is used by Claude.ai, ChatGPT, and most modern AI chat interfaces:
document.addEventListener('paste', (e) => {
  const items = Array.from(e.clipboardData?.items || []);
  const imageItem = items.find(i => i.type.startsWith('image/'));
  if (imageItem) {
    const file = imageItem.getAsFile(); // may return null
    if (file) handleImageFile(file);
  }
});

Conclusion
Multimodal UI breaks the text-only barrier that limits most AI applications. By accepting image input through the browser File API, converting to base64 for server processing, and streaming vision analysis back to the client, you give users the ability to "show" rather than "describe", dramatically reducing friction for complex queries. The screenshot-to-code pattern (vision → React generation) demonstrates the most powerful form of multimodal UI: AI that understands visual design and translates it directly into working code. Combine these patterns with interactive DALL-E refinement loops and you have the foundation of a genuine AI creative tool.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.