Multimodal UI: Image & Vision
Jan 3, 2026 · 22 min read
Text-to-Image is cool. Interactive image analysis is useful. Combining GPT-4o Vision with image generation creates a qualitative leap: users shift from "describe what you want" to "here, look at this", the natural way humans communicate. This guide builds the complete multimodal UI stack: image upload and analysis, screenshot-to-code generation, interactive DALL-E refinement, and video context handling.
1. The Vision Pattern: File Upload to LLM Analysis
Accepting image input requires a three-stage pipeline: browser File API → base64 encoding → vision model API. The key engineering challenge is keeping the UI responsive while processing potentially large files:
// ImageUpload.tsx - Client Component
'use client';
import { useState, useCallback } from 'react';
import { analyzeImageAction } from './actions';
export function ImageUploadAnalyzer() {
  const [preview, setPreview] = useState<string | null>(null);
  const [analysis, setAnalysis] = useState<string>('');
  const [isAnalyzing, setIsAnalyzing] = useState(false);

  const toBase64 = (file: File): Promise<string> =>
    new Promise((resolve, reject) => {
      const reader = new FileReader();
      reader.onload = () => {
        // Extract base64 data (remove data:image/...;base64, prefix)
        const result = reader.result as string;
        resolve(result.split(',')[1]);
      };
      reader.onerror = reject;
      reader.readAsDataURL(file);
    });

  const handleFileChange = async (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (!file) return;

    // Validate file type and size
    if (!file.type.startsWith('image/')) {
      alert('Please upload an image file');
      return;
    }
    if (file.size > 20 * 1024 * 1024) { // 20MB limit
      alert('File too large. Please use an image under 20MB');
      return;
    }

    // Optimistic UI: Show preview instantly before server roundtrip
    const objectUrl = URL.createObjectURL(file);
    setPreview(objectUrl);
    setIsAnalyzing(true);
    setAnalysis('');

    try {
      // Convert to base64 for server action
      const base64 = await toBase64(file);
      const mediaType = file.type as 'image/jpeg' | 'image/png' | 'image/gif' | 'image/webp';

      // Stream analysis from Server Action
      const stream = await analyzeImageAction(base64, mediaType);
      for await (const chunk of stream) {
        setAnalysis(prev => prev + chunk);
      }
    } finally {
      setIsAnalyzing(false);
      URL.revokeObjectURL(objectUrl); // Clean up object URL
    }
  };

  return (
    <div>
      <input
        type="file"
        accept="image/jpeg,image/png,image/gif,image/webp"
        onChange={handleFileChange}
        disabled={isAnalyzing}
      />
      {preview && <img src={preview} alt="Preview" style={{ maxWidth: '400px', borderRadius: '8px' }} />}
      {analysis && <div className="prose">{analysis}</div>}
    </div>
  );
}

// actions.ts - Server Action
'use server';
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
// Note: streaming an async generator across the Server Action boundary depends on
// framework support; with the Vercel AI SDK, createStreamableValue is the usual approach.
export async function* analyzeImageAction(
  base64: string,
  mediaType: 'image/jpeg' | 'image/png' | 'image/gif' | 'image/webp',
  prompt?: string
) {
  const userPrompt = prompt ||
    "Analyze this image in detail. Describe what you see, identify key elements, and provide any relevant insights or suggestions.";

  const stream = client.messages.stream({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: mediaType,
              data: base64,
            }
          },
          {
            type: "text",
            text: userPrompt,
          }
        ]
      }
    ]
  });

  // Stream text chunks to client
  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      yield event.delta.text;
    }
  }
}

2. Screenshot-to-Code (v0 Clone Pattern)
The "Holy Grail" of Generative UI: upload a screenshot mockup and receive working React code:
// Screenshot-to-code Server Action
export async function* screenshotToCode(base64: string) {
  const SYSTEM_PROMPT = `You are an expert React developer.
Your job is to convert UI screenshots into clean React component code.

Rules:
1. Use Tailwind CSS classes only (no inline styles, no custom CSS)
2. Use only Lucide React for icons
3. Make the component fully self-contained
4. Include realistic placeholder text (not "Lorem ipsum")
5. Make it pixel-perfect to the design
6. Include TypeScript type annotations
7. Export as a default export

Respond with ONLY the TypeScript/TSX code, no explanations.`;

  const stream = client.messages.stream({
    model: "claude-opus-4-5",
    max_tokens: 4096,
    system: SYSTEM_PROMPT,
    messages: [{
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: base64 }
        },
        {
          type: "text",
          text: "Convert this screenshot to a React component with Tailwind CSS."
        }
      ]
    }]
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      yield event.delta.text;
    }
  }
}
// Client component to display and copy generated code
export function CodeOutput({ code }: { code: string }) {
  // Strip markdown code fences from the model output
  const jsxCode = code.replace(/```tsx?\n?/g, '').replace(/```\n?/g, '');

  return (
    <div>
      <button onClick={() => navigator.clipboard.writeText(jsxCode)}>
        Copy Code
      </button>
      <pre><code>{jsxCode}</code></pre>
    </div>
  );
}

3. Interactive Image Generation and Refinement
// DALL-E 3 generation with conversational refinement
import OpenAI from 'openai';

const openai = new OpenAI();

// Step 1: Initial generation
async function generateImage(userDescription: string) {
  const response = await openai.images.generate({
    model: "dall-e-3",
    prompt: userDescription,
    n: 1,
    size: "1024x1024",
    quality: "hd", // "standard" or "hd"
    style: "vivid", // "vivid" = more dramatic, "natural" = more realistic
    response_format: "url"
  });
  return {
    url: response.data[0].url,
    revised_prompt: response.data[0].revised_prompt // DALL-E 3 shows you its interpretation
  };
}

// Step 2: Refinement via Vision→Text→Image loop
// DALL-E 3 doesn't support direct inpainting, so we use a describe+regenerate loop:
async function refineImage(currentImageUrl: string, instruction: string) {
  // First: Use vision to get precise description of current image
  const descriptionResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "user",
      content: [
        { type: "image_url", image_url: { url: currentImageUrl, detail: "high" } },
        { type: "text", text: "Describe this image in precise detail. Include colors, styles, composition, lighting, and all visual elements." }
      ]
    }],
    max_tokens: 500,
  });
  const currentDescription = descriptionResponse.choices[0].message.content!;

  // Construct guided refinement prompt
  const refinedPrompt = `${currentDescription}

Apply this change: ${instruction}
Keep everything else exactly the same.`;

  return openai.images.generate({
    model: "dall-e-3",
    prompt: refinedPrompt,
    n: 1,
    size: "1024x1024",
    quality: "hd",
  });
}

Frequently Asked Questions
How do I handle large images efficiently without hitting token limits?
Vision models accept images at different "detail" levels. OpenAI's GPT-4o takes detail: "low" (a fixed 85 tokens, good for general queries) or detail: "high" (tile-based pricing: 85 base tokens plus 170 per 512px tile, so a 1024x1024 image costs 765 tokens). Anthropic's Claude auto-resizes images to fit its context. For bulk image processing, resize images to 1024px on the longer side before upload: this maintains sufficient detail for most analysis while reducing token cost. For document or screenshot analysis where precise text matters, use detail: "high" and crop to the relevant section rather than uploading a full-screen screenshot.
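Both tips can be sketched as pure helpers. `fitWithin` and `estimateHighDetailTokens` are hypothetical names, not from a library; the token estimate follows OpenAI's documented tile formula for high-detail images (fit within 2048x2048, shrink the shortest side to 768, then 85 base tokens plus 170 per 512px tile). Treat the constants as a snapshot of current pricing that may change:

```typescript
// Target dimensions that cap the longer side at `max`, preserving aspect
// ratio. Feed the result to a <canvas> drawImage() call to do the resize.
function fitWithin(width: number, height: number, max = 1024) {
  const scale = Math.min(1, max / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Estimate GPT-4o token cost for a detail:"high" image.
function estimateHighDetailTokens(width: number, height: number): number {
  // 1. Scale down to fit within a 2048x2048 square (never upscale)
  let scale = Math.min(1, 2048 / Math.max(width, height));
  let w = width * scale;
  let h = height * scale;
  // 2. Scale so the shortest side is at most 768px
  scale = Math.min(1, 768 / Math.min(w, h));
  w *= scale;
  h *= scale;
  // 3. 85 base tokens + 170 per 512px tile
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}
```

For example, a 4032x3024 phone photo resizes to 1024x768 with `fitWithin`, and a 1024x1024 image at high detail works out to 765 tokens (4 tiles after scaling to 768x768), which matches OpenAI's own worked example.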
Can users paste images directly from clipboard?
Yes β handle the paste event and check e.clipboardData.items for image MIME types. Users can copy-paste screenshots directly into your text input area, dramatically reducing friction vs. the file picker dialog. This pattern is used by Claude.ai, ChatGPT, and most modern AI chat interfaces:
document.addEventListener('paste', (e) => {
  const items = Array.from(e.clipboardData?.items || []);
  const imageItem = items.find(i => i.type.startsWith('image/'));
  if (imageItem) {
    const file = imageItem.getAsFile(); // may return null
    if (file) handleImageFile(file);
  }
});

Conclusion
Multimodal UI breaks the text-only barrier that limits most AI applications. By accepting image input through the browser File API, converting to base64 for server processing, and streaming vision analysis back to the client, you give users the ability to "show" rather than "describe", dramatically reducing friction for complex queries. The screenshot-to-code pattern (vision → React generation) demonstrates the most powerful form of multimodal UI: AI that understands visual design and translates it directly into working code. Combine these patterns with interactive DALL-E refinement loops and you have the foundation of a genuine AI creative tool.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.