
When will browser agents do real work?

Thursday, November 13, 2025, 10:00, by InfoWorld
In January 2025, OpenAI released Operator, the first large-scale agent powered by a computer-use model to control its own browser. The demo was impressive: an AI moving the mouse, clicking buttons, and performing actions like a human would. It was received as a major step toward general-purpose autonomy.

But just eight months later, in August, OpenAI quietly discontinued Operator and rolled it into ChatGPT’s new Agent Mode. Instead of a single, vision-only system, ChatGPT Agents gained access to both a visual browser and a text-based browser. The shift reflected a hard-earned truth: computer-use models don’t yet work reliably enough in production.

Computer-use models perceive and act like humans do. They analyze the browser screen as an image and issue clicks or text inputs at coordinates, which is powerful in theory, but fragile in practice. Rendering differences, latency, and the difficulty of parsing complex layouts all contribute to unreliability. For agents operating at enterprise scale, even a 1% failure rate can be unacceptable.

Vision-based agents

Vision-based agents treat the browser as a visual canvas. They look at screenshots, interpret them using multimodal models, and output low-level actions like “click (210, 260)” or “type ‘Peter Pan’.” This mimics how a human would use a computer: reading visible text, locating buttons visually, and clicking where needed.

The upside is universality: the model doesn’t need structured data, just pixels. The downside is precision and performance: visual models are slower, require scrolling through the entire page, and struggle with subtle state changes between screenshots (“Is this button clickable yet?”).
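
To make the loop concrete, here is a minimal sketch of a vision-based agent in Python with Playwright. The ask_vision_model() helper is hypothetical, standing in for any multimodal model call that returns one low-level action at a time; everything else uses standard Playwright APIs.

from playwright.sync_api import sync_playwright

def ask_vision_model(screenshot_png: bytes, goal: str) -> dict:
    # Hypothetical: send the screenshot and goal to a multimodal model and
    # parse its reply into one action, e.g. {"action": "click", "x": 210, "y": 260}.
    raise NotImplementedError

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")
    for _ in range(10):  # cap the perception/action loop
        shot = page.screenshot()  # the model sees only pixels
        action = ask_vision_model(shot, goal="search for 'Peter Pan'")
        if action["action"] == "click":
            page.mouse.click(action["x"], action["y"])  # coordinate-based click
        elif action["action"] == "type":
            page.keyboard.type(action["text"])
        elif action["action"] == "done":
            break

Note that every step pays the cost of a fresh screenshot and a model round trip, which is exactly the latency and precision problem described above.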

DOM-based agents

DOM-based agents, by contrast, operate directly on the Document Object Model (DOM), the structured tree that defines every webpage. Instead of interpreting pixels, they reason over textual representations of the page: element tags, attributes, ARIA roles, and labels.

A modern preprocessing technique called accessibility snapshots, popularized by Microsoft’s Playwright MCP server, transforms the live DOM into a structured, readable text form that language models can understand better than pure HTML. For example, a fragment of Google’s home page might look like:

- navigation [ref=e3]:
  - link 'About' [ref=e4] -> https://about.google.com
  - link 'Store' [ref=e5] -> https://store.google.com
- search [ref=e32]:
  - combobox 'Search' [active]
  - button 'Search by voice' [ref=e47]


This structured view lets models choose specific elements to act upon (“click ref=e47”) rather than guessing coordinates. DOM-based control is faster and more deterministic, and both qualities are crucial for enterprise workflows that run thousands of browser sessions daily.
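
As a sketch of what that buys you in practice: with Playwright’s Python API (aria_snapshot() is available in recent Playwright releases, roughly 1.49 onward), an agent can fetch a text tree like the one above and then act through a role-based locator instead of coordinates. The page and element names below are illustrative.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://www.google.com")
    snapshot = page.locator("body").aria_snapshot()  # structured text for the model
    # A model reading the snapshot can answer with a role and an accessible
    # name rather than pixel coordinates:
    page.get_by_role("button", name="Search by voice").click()  # deterministic action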

Hybrid agents: The current state of browser automation

In practice, both methods have their strengths. Vision models handle dynamic, canvas-based UIs (like dashboards or image-heavy apps). DOM-based models excel at text-rich sites like forms or portals. The best systems today combine both: using DOM actions by default and falling back to vision when necessary.
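
A minimal sketch of that fallback policy, reusing the hypothetical ask_vision_model() helper from the earlier sketch: attempt the deterministic DOM locator first, and only reach for pixels when the locator times out.

from playwright.sync_api import Page, TimeoutError as PlaywrightTimeout

def hybrid_click(page: Page, role: str, name: str, goal: str) -> None:
    try:
        page.get_by_role(role, name=name).click(timeout=3000)  # DOM path first
    except PlaywrightTimeout:
        shot = page.screenshot()                    # vision fallback
        action = ask_vision_model(shot, goal=goal)  # hypothetical model call
        page.mouse.click(action["x"], action["y"])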

OpenAI’s decision to deprecate Operator led directly to the creation of the new ChatGPT Agent, which embodies this hybrid approach. Under the hood, it can use either a text browser or a visual browser, choosing the most effective one per step. This is far more reliable than Operator’s pure computer-use model.

Models like Claude 4 and opencua-72b-preview show that visual grounding and faster perception are improving monthly. Computer-use models will continue advancing as multimodal architectures evolve. Eventually, pure vision agents may reach the precision and speed needed for mainstream deployment.

But in 2025, production systems are still hybrid. The most reliable browser agents orchestrate multiple techniques: DOM reasoning for structured elements, vision fallback for non-standard layouts, and deterministic scripting for validation and replay. The frontier isn’t yet a single model; it’s the composition of models, selectors, and orchestration frameworks that together make agents truly usable.

The future of browser agents lies not in vision or structure alone, but in orchestrating both intelligently.

Learning by doing: The next step for browser agents

Hybrid systems solve reliability for today, but the next challenge is adaptability. How can a browser agent not just complete a task once, but actually learn from experience and improve over time?

Running a browser agent once successfully doesn’t mean it can repeat the task reliably. The next frontier is learning from exploration: transforming first-time behaviors into reusable automations.

A promising strategy, now seeing wider deployment, is to let agents explore workflows visually, then encode those paths into structured representations such as DOM selectors or code. Think of it as a two-stage process:

Exploration phase: The agent uses computer-use or vision models to discover the structure of a new web page and record successful navigation paths.

Execution phase: The agent compiles that knowledge into deterministic scripts, for example Playwright, Selenium, or CDP (Chrome DevTools Protocol) commands, to repeat the process with high reliability (a minimal sketch follows below).
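
The artifact of the execution phase might look like an ordinary Playwright script. Every URL and selector below is illustrative, standing in for values the agent recorded during exploration, but the replay itself is fully deterministic.

from playwright.sync_api import sync_playwright

def run_compiled_workflow() -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://example.com/login")                # recorded entry point
        page.get_by_label("Email").fill("agent@example.com")  # recorded form step
        page.get_by_role("button", name="Sign in").click()
        page.wait_for_url("**/dashboard")                     # validation step for replay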

With new large language models excelling at writing and editing code, these agents can generate and improve their own scripts, creating a cycle of self-optimization. Over time, the system becomes like a skilled worker: slower on the first task, but far faster on every repetition.
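
One way to close that loop, sketched under heavy assumptions: replay the compiled script, and when a step fails, hand the source and the error back to a code-editing model. repair_script() is hypothetical, a stand-in for any LLM call that rewrites the broken step.

def run_with_self_repair(script_source: str, max_attempts: int = 3) -> None:
    for _ in range(max_attempts):
        try:
            exec(compile(script_source, "<workflow>", "exec"))  # replay compiled script
            return
        except Exception as failure:
            # Hypothetical model call: rewrite the failing selector or step.
            script_source = repair_script(script_source, error=str(failure))
    raise RuntimeError("workflow could not be repaired")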

This hybrid, self-improving approach—combining vision, structure, and code synthesis—is what makes browser automation increasingly robust. It’s not just about teaching models to click; it’s about enabling them to learn how to automate.



New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
https://www.infoworld.com/article/4081396/when-will-browser-agents-do-real-work.html
