opncrafter

Browser Use: Agents that Surf the Web

Jan 1, 2026 • 20 min read

Traditional web automation (Selenium, Puppeteer) relies on brittle CSS selectors — #submit-btn-v2. The moment a developer renames that element, your automation breaks. Vision-based browser agents take screenshots, annotate interactive elements with bounding boxes, and use a multimodal LLM to reason about what to click based on visual appearance and semantic meaning. Button renamed from #submit-new to #cta-button? The agent still clicks it — it reads "Submit" on the button, not the HTML attribute.
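To make the contrast concrete, here is a toy sketch in plain Python, with dicts standing in for an annotated accessibility tree (the element data is illustrative): lookup by element id breaks on a rename, lookup by visible label does not.

```python
# Toy model: each element is {"id": ..., "text": ...} from an annotated page.
def find_by_id(elements, element_id):
    """Selector-style lookup — breaks when a developer renames the id."""
    return next((e for e in elements if e["id"] == element_id), None)

def find_by_label(elements, label):
    """Vision-style lookup — matches the visible button text instead."""
    return next((e for e in elements if e["text"].strip().lower() == label.lower()), None)

page_before = [{"id": "submit-new", "text": "Submit"}]
page_after = [{"id": "cta-button", "text": "Submit"}]  # id renamed in a redeploy

find_by_id(page_after, "submit-new")   # None — selector automation breaks
find_by_label(page_after, "Submit")    # still found — the visible label is unchanged
```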

1. The Vision Pipeline: How Agents See the Web

  1. Screenshot: Agent takes a screenshot of the current browser page state
  2. Accessibility Tree: Extracts a text representation of interactive elements (simplified DOM used by screen readers)
  3. Annotation: System draws numbered bounding boxes over every clickable element, input field, and link
  4. Vision LLM call: Sends annotated screenshot + accessibility tree + task to GPT-4o/Claude 3.5 Sonnet
  5. Reasoning: LLM outputs: "I see a search box labeled '14'. I'll type 'mechanical keyboard' into element 14."
  6. Action: Agent executes a structured tool call: click(element_id=14), then type("mechanical keyboard"), then press_key("Enter")
  7. Loop: Return to step 1 until task completed or max_steps reached
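The seven steps above boil down to an observe-reason-act loop. A minimal sketch — `observe`, `decide`, and `execute` are hypothetical stand-ins for the real pipeline stages, not browser-use APIs:

```python
# Minimal sketch of the observe -> reason -> act loop described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    name: str                        # "click", "type", "press_key", or "done"
    element_id: Optional[int] = None
    text: Optional[str] = None

def run_loop(observe, decide, execute, max_steps=20):
    """observe() -> page state; decide(state) -> Action; execute(action) -> None."""
    for step in range(max_steps):
        state = observe()            # steps 1-3: screenshot + tree + annotations
        action = decide(state)       # steps 4-5: vision LLM chooses an action
        if action.name == "done":    # task finished before hitting the limit
            return step + 1
        execute(action)              # step 6: run the structured tool call
    return max_steps                 # step 7: safety limit reached
```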

2. Setup and Basic Usage

pip install browser-use playwright langchain-openai
playwright install chromium  # Install browser binary

import asyncio
from langchain_openai import ChatOpenAI
from browser_use import Agent, Browser, BrowserConfig

# CRITICAL: You MUST use a vision-capable model
# gpt-4o = fast, good at most tasks
# claude-3-5-sonnet = better at complex reasoning, higher cost
llm = ChatOpenAI(model="gpt-4o", temperature=0)

async def run_browser_agent():
    # Simple task
    agent = Agent(
        task=(
            "Go to amazon.com. Search for 'mechanical keyboard'. "
            "Filter by 4+ stars. Find the cheapest option under $100. "
            "Click on it and report the full product name, price, and rating."
        ),
        llm=llm,
        use_vision=True,        # Enable screenshot-based reasoning
    )

    # max_steps is a safety limit that prevents infinite loops
    result = await agent.run(max_steps=20)
    print("Task Result:", result.final_result())
    print("Steps taken:", result.number_of_steps())
    print("Was successful:", result.is_done())

asyncio.run(run_browser_agent())

3. Persistent Sessions for Authenticated Workflows

import asyncio
from pathlib import Path

from browser_use import Agent, Browser, BrowserConfig
from langchain_openai import ChatOpenAI

# Challenge: Agents can't handle 2FA, CAPTCHA-gated logins, or SSO
# Solution: Use a persistent browser profile — log in once manually, save cookies

PROFILE_DIR = Path.home() / ".browser_use" / "my_profile"

browser = Browser(
    config=BrowserConfig(
        # headless=False lets you watch the agent work (useful for debugging)
        headless=False,

        # A persistent context saves cookies, localStorage, and sessionStorage
        # between runs: log in manually on the first run; later runs reuse the session
        user_data_dir=str(PROFILE_DIR),

        # Relaxes browser security features (e.g. same-origin checks);
        # sometimes needed for automation, but use with care
        disable_security=True,
    )
)
)

async def create_linear_issue():
    agent = Agent(
        task=(
            "Go to app.linear.app. I'm already logged in. "
            "Create a new issue titled 'Fix navigation bug' in the Engineering team. "
            "Set priority to High and assign it to me."
        ),
        llm=ChatOpenAI(model="gpt-4o"),
        browser=browser,
    )
    return await agent.run()

result = asyncio.run(create_linear_issue())

# WORKFLOW: 
# Step 1 (first time only, manual): Run with headless=False, log in to the website
# Step 2 (automated): Profile saved, subsequent runs skip login entirely
# Step 3: Works for most business apps (Linear, Notion, CRMs, etc.)
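One small guard worth adding before step 2: fail fast when the saved profile is missing, so the agent doesn't burn paid LLM steps against a login wall. A stdlib-only sketch — `profile_ready` is a hypothetical helper, not part of browser-use:

```python
from pathlib import Path

def profile_ready(profile_dir) -> bool:
    """True once the persistent Chromium profile dir exists and is non-empty."""
    d = Path(profile_dir)
    return d.is_dir() and any(d.iterdir())

# Usage before launching the agent:
# if not profile_ready(PROFILE_DIR):
#     raise SystemExit("Run once with headless=False and log in manually first.")
```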

4. Custom Actions and Extracted Data

import asyncio
from typing import List

from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

# Define what structured data to extract
class ProductInfo(BaseModel):
    name: str
    price: float
    rating: float
    review_count: int
    asin: str

class ProductList(BaseModel):
    products: List[ProductInfo]

# output_model tells the controller what structured result to return at task end
controller = Controller(output_model=ProductList)

# Register a custom action the agent can call
# (assumes `db` is an already-configured async database client, e.g. motor)
@controller.action('Save product data to database', param_model=ProductInfo)
async def save_product(params: ProductInfo):
    """Called by the agent when it wants to save a found product."""
    await db.products.insert_one(params.model_dump())
    return f"Saved {params.name} to database"

async def scrape_products() -> ProductList:
    agent = Agent(
        task=(
            "Search Amazon for 'wireless mouse'. "
            "For the top 5 results: extract name, price, rating, review count, and ASIN. "
            "Call save_product for each one."
        ),
        llm=ChatOpenAI(model="gpt-4o"),
        controller=controller,
    )
    history = await agent.run()
    # final_result() returns the structured output as a JSON string
    return ProductList.model_validate_json(history.final_result())

products = asyncio.run(scrape_products())
print(f"Extracted {len(products.products)} products")
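One practical wrinkle: prices, ratings, and review counts often arrive from the page as display text ("$29.99", "1,024"). A "before" validator can normalize them so validation doesn't fail. A sketch with placeholder sample values (the model name and ASIN below are illustrative, not real data):

```python
from pydantic import BaseModel, field_validator

class CleanProductInfo(BaseModel):
    name: str
    price: float
    rating: float
    review_count: int
    asin: str

    @field_validator("price", "rating", "review_count", mode="before")
    @classmethod
    def strip_display_chars(cls, v):
        # Remove currency symbols and thousands separators before coercion
        if isinstance(v, str):
            v = v.replace("$", "").replace(",", "").strip()
        return v

item = CleanProductInfo(name="Example Wireless Mouse", price="$29.99",
                        rating="4.5", review_count="1,024", asin="B000000000")
```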

5. Production Infrastructure

| Provider | Features | Cost | Best For |
| --- | --- | --- | --- |
| Local Playwright | Full control, no latency | Free | Dev/testing, small scale |
| Steel.dev | Sessions, proxy rotation, recording | $0.002/min | Production agents, scale |
| Browserless.io | Scalable, REST API | $0.002/min | High concurrency |
| Bright Data | Residential proxies, unblocking | $3–15/GB | Anti-bot bypass |
| Apify | Actor marketplace, scheduling | $0.004/CU | Workflow automation |
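Cloud providers typically expose a Chrome DevTools Protocol (CDP) WebSocket endpoint you attach to instead of launching a local Chromium. A minimal config sketch, assuming browser-use's BrowserConfig accepts a `cdp_url` — the URL below is a placeholder; each provider documents its own connection string:

```python
from browser_use import Browser, BrowserConfig

# Attach to a remote cloud browser over CDP instead of launching locally.
# Placeholder endpoint shown — substitute your provider's session URL.
remote_browser = Browser(
    config=BrowserConfig(
        cdp_url="wss://connect.example-provider.com?apiKey=YOUR_API_KEY",
    )
)
```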

Frequently Asked Questions

How do I handle CAPTCHAs and bot detection?

Cloudflare and similar WAFs detect Playwright automation through browser fingerprinting: automation flags in the JavaScript engine, missing browser APIs, suspicious timing patterns, and known datacenter IP ranges. Solutions in order of effectiveness: (1) Use playwright-extra with puppeteer-extra-plugin-stealth — patches 20+ browser automation fingerprints. (2) Use residential proxies (Bright Data, Oxylabs) to route traffic through real residential IPs. (3) For visual CAPTCHAs: integrate 2Captcha or CapMonster APIs — you send the CAPTCHA image, receive the solution in ~10 seconds. (4) For Cloudflare Turnstile: use undetected-chromedriver or Steel.dev's built-in unblocking features. Some sites are effectively impossible to automate reliably — redirect your energy to official APIs if they exist.
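The 2Captcha/CapMonster flow in point (3) is submit-then-poll. A sketch with the HTTP calls abstracted away — `submit` and `fetch` are injected callables standing in for the solver's API, not a real SDK:

```python
import time

def solve_captcha(submit, fetch, poll_interval=0.0, timeout=30.0):
    """submit() -> task_id; fetch(task_id) -> solution string, or None while pending."""
    task_id = submit()                       # upload the CAPTCHA image/sitekey
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        solution = fetch(task_id)            # poll for the human/ML solution
        if solution is not None:
            return solution                  # typically arrives in ~10 seconds
        time.sleep(poll_interval)
    raise TimeoutError("CAPTCHA solver did not return in time")
```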

How do I reduce the vision token cost ($0.02+ per step)?

Use DOM Distillation: for simple, predictable pages (login forms, data tables, navigation menus), skip the screenshot and send only the accessibility tree as text — this costs ~90% less. Only invoke vision when layout analysis is genuinely needed (detecting visual hierarchy, reading non-text elements like buttons identified by icon alone, handling unusual UI patterns). Configure use_vision=False in browser-use for text-heavy tasks and use_vision=True (or per-step switching) for visual tasks. Also: narrow your task scope to minimize steps — a 5-step task at $0.02/step is $0.10, while a 20-step task is $0.40. Clear, specific task descriptions reduce exploratory steps.
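The arithmetic above as a tiny helper — the per-step rates are the article's rough figures, not published provider pricing:

```python
# Rough step-cost model: ~$0.02 per vision step; DOM-distilled text-only
# steps cost ~90% less. Rates are illustrative, not a price list.
VISION_RATE = 0.02
TEXT_RATE = VISION_RATE * 0.10   # accessibility-tree-only step

def task_cost(vision_steps: int, text_steps: int = 0) -> float:
    """Estimated LLM cost in dollars for one agent run."""
    return round(vision_steps * VISION_RATE + text_steps * TEXT_RATE, 4)

task_cost(5)        # 5-step visual task  -> $0.10
task_cost(20)       # 20-step visual task -> $0.40
task_cost(0, 20)    # same 20 steps, text-only -> $0.04
```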

Conclusion

Vision-based browser agents are a strong approach for web automation that needs to stay resilient to UI changes. By combining Playwright's browser control, accessibility-tree parsing, and GPT-4o's visual reasoning, the browser-use library makes it possible to automate complex multi-step workflows across websites without hardcoding selectors. For production deployments, cloud browser infrastructure (Steel.dev, Browserless) provides the scalability, session management, and proxy rotation needed to run agents reliably at scale. The main costs to manage are vision token usage (use DOM distillation for simple pages) and bot detection (stealth mode plus residential proxies cover most cases).

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
