Multimodal UI: Image & Vision
Jan 3, 2026 • 22 min read
Text-to-Image is cool. But Interactive Image Editing is useful. Combined with Vision (the user uploads an image), we move from "Chatbot" to "Expert Consultant".
1. The Vision Pattern: "Look at this"
This requires a different kind of tool input. We can't just send text strings to the server: the image must travel as Base64 (a data URL). Blob URLs only resolve inside the browser that created them, so they are useful for local previews, not for the server round-trip.
Architecture:
- Client: `<input type="file" />` converts the image to Base64.
- Server Action: sends the image + prompt to GPT-4o.
- LLM: returns text (a recipe) or a Tool Call (a shopping list).
// Client Component (ImageUpload.jsx)
// toBase64 wraps FileReader in a Promise; the resulting data: URL can go straight to a vision API
const toBase64 = (file) =>
  new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result);
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });

const handleFileChange = async (e) => {
  const file = e.target.files[0];
  const base64 = await toBase64(file);
  // Optimistic UI: show a local preview instantly
  setMessages(prev => [...prev, <UserImage src={URL.createObjectURL(file)} />]);
  // Server Action does the analysis (sketched below)
  const result = await analyzeImage(base64);
};
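On the server, `analyzeImage` forwards the data URL to the vision model. A minimal sketch, assuming the official OpenAI Node SDK; the action name, file layout, and prompt are illustrative:

// Server Action (actions.js) — a sketch; analyzeImage is the hypothetical action the client calls
'use server';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function analyzeImage(dataUrl) {
  const res = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image. If it is food, suggest a recipe.' },
        // image_url accepts https:// URLs or data: URLs with inline Base64
        { type: 'image_url', image_url: { url: dataUrl } },
      ],
    }],
  });
  return res.choices[0].message.content;
}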
The "Holy Grail" of GenUI is uploading a screenshot and getting working React code back.
The Prompt Engineering
You must explicitly instruct the model on your design system.
"You are an expert React developer. Analyze this image. Reproduce it using Tailwind CSS. Do not use custom CSS. Use only Lucide React icons."3. Interactive Image Generation
3. Interactive Image Generation
Generating an image is easy (`dall-e-3`). Editing it is hard. We use a Conversation Loop to refine the image.
- Generation: DALL-E 3 creates 4 variants (the API returns one image per call, so fire four requests in parallel).
- Selection: User clicks Option 2. "I like this style."
- Refinement: User types "Make the cup green."
- Inpainting: The LLM calls a specialized inpainting tool with the mask of the cup (calculated via Segment Anything Model or a user brush — see the inpainting sketch after the code below).
// Server Action for Refinement
// (gpt4o.analyze / dalle3.generate are stand-ins for your SDK calls —
//  e.g. a GPT-4o chat completion and an images.generate request)
async function refineImage(currentUrl, instruction) {
  'use server';
  // 1. Vision model re-describes the current image so its style survives the edit
  const description = await gpt4o.analyze(currentUrl);
  // 2. Construct the new prompt: prior description + the requested change
  const newPrompt = `${description}. Change: ${instruction}`;
  // 3. Generate the refined image
  const newUrl = await dalle3.generate(newPrompt);
  return <ImageResult url={newUrl} />;
}
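The loop above regenerates the whole frame. For true inpainting (step 4 of the list), you need a masked edit: the original image plus a mask whose transparent pixels mark the region to repaint. A sketch, assuming DALL·E 2's `images.edit` endpoint (DALL·E 3 does not accept masks) and PNG buffers from the user's brush or SAM:

// Inpainting Server Action — a sketch; assumes PNG buffers for image and mask
'use server';
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI();

export async function inpaint(imagePng, maskPng, instruction) {
  const res = await openai.images.edit({
    model: 'dall-e-2', // the edits endpoint is DALL-E 2 only
    image: await toFile(imagePng, 'image.png'),
    mask: await toFile(maskPng, 'mask.png'), // transparent pixels = repaint here
    prompt: instruction,
  });
  return res.data[0].url;
}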
4. The Future: Video Input
Gemini 1.5 Pro allows for 1 hour of video context. Imagine uploading a lecture video and asking: "Create a quiz based on the second concept discussed."
The UI patterns remain the same: Upload → Analysis → Structured Output → Generative UI Component.
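A sketch of that flow, assuming the `@google/generative-ai` SDK and its Files API; `quizFromLecture` is a hypothetical name, and production code should poll until the uploaded video finishes processing:

// Video Q&A — a sketch; quizFromLecture is a hypothetical name
import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAIFileManager } from '@google/generative-ai/server';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY);

export async function quizFromLecture(videoPath) {
  // 1. Upload: long videos go through the Files API, not inline Base64
  const upload = await fileManager.uploadFile(videoPath, { mimeType: 'video/mp4' });
  // (poll fileManager.getFile(upload.file.name) until state === 'ACTIVE' before use)
  // 2. Analysis → Structured Output
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });
  const result = await model.generateContent([
    { fileData: { mimeType: upload.file.mimeType, fileUri: upload.file.uri } },
    { text: 'Create a quiz based on the second concept discussed. Return JSON.' },
  ]);
  // 3. The JSON feeds a Generative UI component (e.g. a <Quiz /> renderer)
  return result.response.text();
}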
Conclusion
Multimodal UI breaks the text barrier. By allowing users to "Show" instead of just "Tell", we reduce friction and increase the context available to the model.