
Claude Computer Use: How Anthropic's Agent Operates Your Mac

For the last six years, UI automation meant writing brittle Selenium scripts, maintaining XPath selectors that broke every time a frontend developer changed a div class, and dealing with headless browser fingerprinting. Then Anthropic released the Computer Use API for Claude 3.5 Sonnet. Unlike standard API calls that return text or JSON, this API returns structured actions — coordinates to click, keys to press, strings to type — derived from screenshots of your desktop.

I've spent the last month building production automation pipelines with this API. It feels like watching a ghost operate your cursor. Here is my unfiltered technical deep dive into what it can actually do today, the architecture required to run it, the code to build it, and where the vision model falls completely flat.


The Architecture of a Computer Use Agent

If you are an engineer looking to implement this, the first thing you must understand is that there is no magic happening at the system level. Anthropic does not provide a daemon that hooks into your Mac's Accessibility APIs. Claude cannot inherently "see" your screen. You have to build the execution loop yourself in Python or Node.js.

The architecture consists of four distinct phases that run in a continuous while loop:

  1. The Perception Layer (Screenshot): A script takes a high-resolution screenshot of your active window and scales it down to fit within Claude's vision context limits.
  2. The Intelligence Layer (Inference): The image, along with the current system prompt ("Find the nearest WiFi cafe and save it to a text file"), is sent to the Anthropic API.
  3. The Decision Layer (JSON Response): Claude analyzes the image, identifies the interactive elements (buttons, inputs), and returns a structured JSON payload with precise X/Y coordinates to click, or specific strings to type.
  4. The Execution Layer (Action): Your local tool (usually PyAutoGUI or xdotool) interprets that JSON and physically moves the mouse cursor to those coordinates, clicks, or simulates keystrokes.
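Step 1's downscaling can be sketched as a pure helper. The 1568 px long-edge cap matches Anthropic's documented vision limit; the function name and default are my own, not part of any SDK:

```python
def fit_within(width, height, max_edge=1568):
    """Compute downscaled screenshot dimensions so the longest edge
    fits Anthropic's documented vision limit, preserving aspect ratio."""
    if max(width, height) <= max_edge:
        return width, height
    scale = max_edge / max(width, height)
    return round(width * scale), round(height * scale)
```

The resulting tuple feeds straight into Pillow's `Image.resize` before the screenshot is base64-encoded and sent.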

Implementing the Execution Loop in Python

To build this, you need the Anthropic SDK and PyAutoGUI. We must explicitly define the computer_use tool in our API schema. The tool schema requires specific properties tailored to screen coordinates.

import os
import base64
import pyautogui
import anthropic
from PIL import ImageGrab

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Define the specialized computer_use tool
tools = [
    {
        "type": "computer_use_20241022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1
    }
]

def capture_screen():
    """Takes a screenshot and encodes it natively for Claude Vision."""
    screenshot = ImageGrab.grab()
    screenshot.save("current_state.png")
    with open("current_state.png", "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def execute_action(action):
    """Translates Claude's JSON response into physical hardware movements."""
    action_type = action.get("action")
    
    if action_type == "mouse_move":
        x, y = action.get("coordinate")
        pyautogui.moveTo(x, y, duration=0.5)
        print(f"Moved mouse to ({x}, {y})")
        
    elif action_type == "left_click":
        pyautogui.click()
        print("Clicked left button")
        
    elif action_type == "type":
        text = action.get("text")
        pyautogui.write(text, interval=0.05)
        print(f"Typed: {text}")
        
    elif action_type == "key":
        # Claude emits xdotool-style names, including combos like "ctrl+s"
        keys = action.get("text").lower().split("+")
        if len(keys) > 1:
            pyautogui.hotkey(*keys)
        else:
            pyautogui.press(keys[0])
        print(f"Pressed key: {action.get('text')}")
        
    elif action_type == "screenshot":
        # No hardware action needed; the next loop iteration captures the screen
        pass

Notice how we explicitly pass display_width_px and display_height_px in the tool schema. Claude uses these dimensions to scale its internal coordinate system so the X/Y outputs match your physical monitor resolution. If you declare 1080p but your MacBook's Retina display captures at twice that pixel resolution, the clicks will be wildly inaccurate.
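One way to guard against this mismatch — sketched here under the assumption that you capture at physical pixel resolution while PyAutoGUI moves the cursor in logical points — is to rescale Claude's coordinates before handing them over:

```python
def scale_coordinate(x, y, capture_size, screen_size):
    """Map a coordinate from the screenshot's pixel space into the
    logical point space PyAutoGUI uses (these differ by 2x on Retina)."""
    cap_w, cap_h = capture_size
    scr_w, scr_h = screen_size
    return round(x * scr_w / cap_w), round(y * scr_h / cap_h)
```

In practice, capture_size comes from ImageGrab.grab().size and screen_size from pyautogui.size(); on a non-Retina display the two match and the function is a no-op.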

The Inference Loop

Now we wrap this in a loop. We pass the screenshot and the user's prompt to the API, wait for a tool_use response, execute the physical hardware action, take *another* screenshot to verify the UI changed, and pass it back.

def run_computer_agent(task_prompt):
    messages = [
        {"role": "user", "content": [{"type": "text", "text": task_prompt}]}
    ]
    
    while True:
        # Capture current screen state
        b64_image = capture_screen()
        
        # Append image to the message history so Claude sees the current state
        messages[-1]["content"].append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": b64_image
            }
        })

        # Call Claude 3.5 Sonnet
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"]
        )
        
        # Check if the model decided the task is complete
        if response.stop_reason != "tool_use":
            print("Task completed!")
            print("Final response:", response.content[0].text)
            break
            
        # Append Claude's output to history to maintain context
        messages.append({
            "role": "assistant",
            "content": response.content
        })

        # Parse and execute the requested tool actions, answering each
        # tool_use block with a tool_result carrying the matching id --
        # the API rejects the next request if any tool_use goes unanswered
        tool_results = []
        for content_block in response.content:
            if content_block.type == "tool_use" and content_block.name == "computer":
                print(f"Executing: {content_block.input}")
                
                # Run the physical keyboard/mouse movement
                execute_action(content_block.input)
                
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": content_block.id,
                    "content": [{"type": "text", "text": "Action executed."}]
                })
        
        # The screenshot captured at the top of the next iteration is
        # appended to this message, serving as Claude's visual feedback
        messages.append({"role": "user", "content": tool_results})

# Kick off the agent
run_computer_agent("Open Safari, navigate to weather.com, and check the weekend forecast for Seattle.")

The Latency and State Problem

When you watch this script run for the first time, your jaw will drop. Watching an AI physically open Safari, type "weather.com", hit Enter, wait for the page to load, click the search bar, type a city name, and read the resulting text is the closest thing to AGI I have seen locally.

However, when building a production system around this concept, you run into the greatest enemy of UI automation: Latency and Transient States.

The "Loading Spinner" Race Condition

Because of the continuous Screenshot ➔ Network Request ➔ Inference ➔ Action loop, a single step can take 4 to 8 seconds. If Claude clicks "Submit", the UI might show a loading spinner. If the Python script takes a screenshot immediately after the click, Claude sees the spinner. But by the time Claude returns its next JSON action ("Wait for it to load"), the UI has already finished loading. This desync between the API's mental model and the actual local screen state causes infinite loops where Claude keeps moving the mouse in circles trying to find a button that already loaded and shifted down the page.

To fix this, you must build robust polling into your execution layer. Instead of blindly taking a screenshot immediately after a click, I use Python to monitor network traffic (via proxy logs) or monitor pixel variance in a specific bounding box. The script only takes the next screenshot once the screen has remained static for 500 milliseconds.
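A minimal version of that settle check — assuming a `capture` callable that returns comparable frame data, such as raw screenshot bytes or a region hash — looks like this:

```python
import time

def wait_for_stable(capture, quiet_ms=500, timeout_s=10, poll_ms=100):
    """Poll capture() until its result has been unchanged for quiet_ms
    milliseconds, then return the stable frame. Falls back to the last
    frame seen if the screen never settles before timeout_s."""
    last = capture()
    stable_since = time.monotonic()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        time.sleep(poll_ms / 1000)
        frame = capture()
        if frame != last:
            # Screen changed (spinner, animation): reset the quiet timer
            last = frame
            stable_since = time.monotonic()
        elif (time.monotonic() - stable_since) * 1000 >= quiet_ms:
            return frame
    return last
```

Comparing full screenshots byte-for-byte is wasteful; in practice you would crop to the bounding box you care about, or hash a downscaled copy, before comparing.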

Where The Vision Model Fails Miserably

I don't want to sugarcoat this—it is still a beta feature. While it can handle standard CRUD React web apps perfectly fine, it struggles heavily with modern, complex UI architectures.

  • Infinite Scrolling & Virtualized Lists: It struggles to know when to stop scrolling down a feed in apps like Twitter, Jira, or LinkedIn. Because the DOM isn't available (it only sees pixels), it doesn't know if there are 10 items or 10,000 items left. It often loses context of what it was originally searching for as items slide off the top of the screenshot.
  • Hover States & Ghost UI: Modern apps hide tools behind hover states (e.g., hovering over a Slack message to see the "React" button). Claude cannot hover and see the result in one step. It has to output a mouse_move action, wait for the Python script to move the mouse, take a new screenshot showing the revealed button, and then issue a left_click action. This doubles the token cost for every single interaction.
  • Low Contrast Modals: Tiny dropdown carets, low-contrast "X" close buttons on modals, or highly customized WebGL canvas elements (like Figma or Excalidraw) frequently confuse the vision model. It will repeatedly try to click 5 pixels to the left of the actual close button.

Unit Economics: The Hidden Cost

Passing a 1080p representation of a screen to Claude 3.5 Sonnet consumes a massive amount of tokens. Each high-res screenshot is processed as thousands of tokens. In a standard 20-step automation flow (opening an app, logging in, navigating, scraping data), you are passing 20 distinct images in the context window.

Right now, a moderate automation task costs around $0.20 to $0.40 in API credits per run. If you use Prompt Caching appropriately (caching the system instructions and early screenshots), you can reduce this, but it is vastly more expensive than a traditional Puppeteer script running local DOM evaluation.
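You can sanity-check that figure with Anthropic's published image-token heuristic — roughly (width × height) / 750 tokens after downscaling to a 1568 px long edge — and the $3 per million input tokens list price for Claude 3.5 Sonnet. The helper names here are mine:

```python
def image_tokens(width, height, max_edge=1568):
    # Anthropic downscales so the longest edge is <= 1568 px, then charges
    # roughly (width * height) / 750 tokens per image (published heuristic)
    scale = min(1.0, max_edge / max(width, height))
    return int(width * scale * height * scale / 750)

def screenshot_cost_usd(steps, width=1920, height=1080, price_per_mtok=3.00):
    # Input-token cost of the screenshots alone, ignoring text,
    # output tokens, and any prompt-caching discount
    return steps * image_tokens(width, height) * price_per_mtok / 1_000_000
```

A 1080p screenshot lands around 1,800 tokens, so a 20-step flow burns roughly $0.11 on images alone before counting text context and output tokens — which is how runs end up in the $0.20 to $0.40 range.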

How I Recommend Using It Today

Right now, I only deploy Computer Use for specific, high-value tasks that traditional automation cannot touch:

  1. Anti-Bot Web Scraping: Sites like LinkedIn or Amazon aggressively block Puppeteer and Selenium fingerprints. Because Computer Use drives physical hardware (or a VNC Docker container) via Python's PyAutoGUI, it is invisible to standard DOM-based bot detection. It looks like a slow human clicking a mouse.
  2. Legacy Desktop ERPs: The primary enterprise use case is automating Windows 95-era internal ERP systems that lack REST APIs. Banks and healthcare providers have hundreds of these systems. Slapping a Computer Use vision agent over an RDP connection is infinitely faster than upgrading the mainframe.
  3. QA UI Testing without Selectors: Writing Playwright tests with CSS selectors (getByTestId('submit-btn')) is rigid. A Computer Use agent can be told "Log in and buy a shirt." If a designer changes the button from blue to green or changes the class name, the test doesn't break. The agent adapts visually just like a human tester would.

For anything else—standard data fetching, headless scraping, or API testing—traditional code is still cheaper and faster. But the writing is on the wall: within two years, pixel-based semantic UI automation will be the standard way we automate workflows across operating systems.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
