opncrafter

Multimodal Agents: Vision & Audio

Dec 29, 2025 • 20 min read

The era of "text in, text out" is over. Multimodal agents can see screenshots, analyze charts, transcribe audio, extract data from PDFs, and reason across all of these modalities simultaneously. GPT-4o processes text, images, and audio natively — opening up a new category of AI applications that simply weren't possible with text-only models.

1. What Are Multimodal Models?

A multimodal model accepts multiple input types and reasons across all of them in a unified context. Unlike pipelines that chain separate vision, speech, and text models, a native multimodal model sees all modalities simultaneously and can reason about their relationships:

  • GPT-4o: Text + images + audio natively. Best for most tasks.
  • Claude 3.5 Sonnet: Text + images. Excellent for document analysis.
  • Gemini 1.5 Pro: Text + images + audio + video + 1M token context. Best for long video.
  • LLaVA / LLaVA-NeXT: Open-source vision models. Run locally via Ollama.
  • Whisper: OpenAI's dedicated audio-to-text model. Industry best for transcription.

2. Vision: Analyzing Images

Pass an image URL or base64-encoded image directly in the message content:

import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// Method 1: Pass image URL (for public images)
const urlResponse = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{
    role: "user",
    content: [
      { type: "text", text: "Describe what you see in this chart. What are the key trends?" },
      { type: "image_url", image_url: { url: "https://example.com/chart.png" } }
    ],
  }],
  max_tokens: 1000,
});

// Method 2: Base64 encode local images (for private/uploaded files)
const imageBuffer = fs.readFileSync('./screenshot.png');
const base64Image = imageBuffer.toString('base64');

const base64Response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{
    role: "user",
    content: [
      { type: "text", text: "What UI bugs do you see in this screenshot?" },
      { 
        type: "image_url", 
        image_url: { 
          url: `data:image/png;base64,${base64Image}`,
          detail: "high"  // "low" (cheaper) or "high" (more detail for fine-grained tasks)
        } 
      }
    ],
  }],
});

3. Use Case: Screen Agent (Computer Use)

An agent that can see your screen and take actions opens up an entirely new paradigm of automation. The pattern: capture screenshot → analyze → decide action → execute → repeat:

import pyautogui
import openai
import base64
import json
from PIL import Image
import io

def capture_screen() -> str:
    """Capture screen and return as base64"""
    screenshot = pyautogui.screenshot()
    buffer = io.BytesIO()
    screenshot.save(buffer, format='PNG')
    return base64.b64encode(buffer.getvalue()).decode()

def get_next_action(screenshot_b64: str, goal: str, history: list) -> dict:
    """Ask GPT-4o what to do next based on the current screen"""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a computer control agent. Goal: {goal}
Analyze the screen and return the next action as JSON:
{{"action": "click|type|scroll|done", "x": int, "y": int, "text": str, "reasoning": str}}"""},
            *history,
            {"role": "user", "content": [
                {"type": "text", "text": "Current screen state:"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}", "detail": "high"}}
            ]}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

def execute_action(action: dict):
    if action["action"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["action"] == "type":
        pyautogui.typewrite(action["text"], interval=0.05)
    elif action["action"] == "scroll":
        pyautogui.scroll(action.get("amount", 3))

4. Audio: Transcription with Whisper

OpenAI Whisper is the gold standard for speech-to-text. It supports 99 languages and handles accents, background noise, and technical vocabulary remarkably well:

from openai import OpenAI

client = OpenAI()

# Transcribe audio file
with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # required for timestamp output
        timestamp_granularities=["word"]  # word-level start/end times
    )

print(transcript.text)
# "The Q3 revenue was $2.4 million, up 18% year over year..."

# Word-level timestamps, useful for aligning subtitles or captions:
for word in transcript.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")

5. Document Analysis: PDFs and Charts

Convert PDFs to images (one page = one image) and pass each page to the vision model for analysis. This works better than PDF text extraction for charts, tables, and scanned documents:

from pdf2image import convert_from_path
import base64
import io
from openai import OpenAI

client = OpenAI()

def analyze_pdf(pdf_path: str, question: str) -> str:
    # Convert PDF pages to images
    pages = convert_from_path(pdf_path, dpi=150)
    
    # Build message with all pages as images
    content = [{"type": "text", "text": question}]
    
    for i, page in enumerate(pages[:10]):  # Limit to 10 pages
        buffer = io.BytesIO()
        page.save(buffer, format='PNG')
        img_b64 = base64.b64encode(buffer.getvalue()).decode()
        
        content.append({"type": "text", "text": f"Page {i+1}:"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    
    return response.choices[0].message.content

# Usage: extract data from a financial report
result = analyze_pdf("annual_report.pdf", "Extract all revenue figures by quarter as JSON")

6. Multimodal RAG: Indexing Visual Content

Standard RAG only retrieves text. Multimodal RAG stores both page images and extracted text, combining visual and semantic search:

# Strategy: Extract text from visuals using GPT-4o, then index both
import base64
import io
from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

def img_to_b64(img) -> str:
    """Encode a PIL page image as a data URL for the vision API"""
    buffer = io.BytesIO()
    img.save(buffer, format='PNG')
    return f"data:image/png;base64,{base64.b64encode(buffer.getvalue()).decode()}"

def index_pdf_multimodal(pdf_path: str, collection):
    pages = convert_from_path(pdf_path, dpi=150)
    
    for i, page_img in enumerate(pages):
        # Step 1: Extract text + structure from the page image
        extracted = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Extract all text, numbers, headings, and table data from this page. Preserve structure."},
                {"type": "image_url", "image_url": {"url": img_to_b64(page_img), "detail": "high"}}
            ]}]
        ).choices[0].message.content
        
        # Step 2: Store text in vector DB with page reference
        collection.add(
            documents=[extracted],
            metadatas=[{"page": i + 1, "source": pdf_path}],
            ids=[f"{pdf_path}_page_{i+1}"]
        )
        
        # Optionally: save page images for visual retrieval

7. Real-World Use Cases

  • Medical imaging assistant: Analyze X-rays or pathology slides alongside patient notes
  • Financial document processing: Extract figures from scanned invoices and earnings reports
  • Meeting intelligence: Transcribe + summarize + extract action items from recordings
  • Visual QA for e-commerce: Answer "Does this shirt have buttons?" from product photos
  • Accessibility tools: Generate alt-text for images at scale
  • Code from design: Generate React components from Figma screenshots

Frequently Asked Questions

How much does vision cost vs text-only?

Image tokens are calculated based on resolution. A 1024×1024 image in "high" detail mode costs approximately 765 tokens (~$0.004 at GPT-4o pricing). For high-volume image processing, use "low" detail mode (85 tokens per image) when fine geometric detail isn't needed.
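Those numbers follow from OpenAI's published tiling rules: low-detail mode costs a flat 85 tokens; high-detail mode downscales the image to fit 2048×2048, downscales again so the shortest side is at most 768 px, then bills 170 tokens per 512 px tile plus the 85-token base. A rough estimator (treat it as an approximation, not the billing source of truth):

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o vision token cost for one image."""
    if detail == "low":
        return 85  # flat rate regardless of resolution
    # Downscale to fit within 2048 x 2048
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Downscale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 170 tokens per 512 px tile, plus the 85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(image_tokens(1024, 1024, "high"))  # 765
print(image_tokens(1024, 1024, "low"))   # 85
```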

Can I run vision models locally?

Yes: LLaVA 1.6 (also released as LLaVA-NeXT) is available through Ollama. Run ollama pull llava and point your requests at localhost:11434. Quality is lower than GPT-4o for complex reasoning but competitive for basic image captioning and OCR tasks.

How do I handle large PDFs (100+ pages)?

Process in batches. For Q&A, extract embedded text with PyMuPDF for text-heavy pages and send only the pages containing charts, tables, or scans to the vision model. For full visual analysis, limit each request to 20-30 pages and stitch the answers together with a final summarization pass.

Conclusion

Multimodal AI transforms agents from text processors into systems that can perceive and reason about the full richness of the world — documents, interfaces, audio recordings, and visual data. The technical barrier is now low: GPT-4o's vision API requires only minor additions to standard text-based code. The creative barrier — imagining what becomes possible — is the more interesting challenge.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK