I once gave a “computer-use” agent a task that sounded embarrassingly simple: “File this $23.14 ride receipt and submit the expense report.”
It opened the right email. Downloaded the PDF. Navigated to the company portal. Then it did what every brittle automation system does when it leaves the happy path: clicked the wrong dropdown, typed into a non-focused field, lost the file picker, and spiraled into a loop of “scroll, click, scroll, click” until the episode timed out.
The model wasn’t “dumb.” The task wasn’t “hard.” The real problem was that I hadn’t built an environment—I had built a demo.
In 2024–2025 we watched computer-use agents go from research prototypes to widely available developer tools: Anthropic shipped a Computer Use tool (public beta starting Oct 2024), OpenAI shipped Operator (Jan 2025) and then folded it into a broader ChatGPT agent experience by Aug 2025, Google pushed Project Mariner to wider access at I/O 2025 and later introduced a Gemini “Computer Use” model, and Microsoft open-sourced Windows automation stacks like UFO/UFO²/UFO³.
Under the hood, the shared lesson is the same as robotics: policy is downstream of the world. If you want reliable, long-horizon computer automation, your differentiator isn’t “prompting”—it’s the RL environment: how you reset, how you observe, how you act, how you grade, how you vary the UI, and how you keep the agent safe while it learns.
A good RL environment for computer-use tasks is not a VNC wrapper around a VM. It’s an engineered system in which every surface is designed for training: reset, observation, action, grading, variation, and safety.
We’ll make this concrete by designing an RL environment around one task: Expense Report Autopilot. It’s a perfect “computer use” problem: multi-app, document handling, form fills, file pickers, intermittent popups, and a clean objective.
The agent’s job, end-to-end:

- Open the email and download the attached receipt PDF.
- Navigate to the company expense portal.
- Fill in the amount ($23.14), currency, and date.
- Attach the downloaded receipt.
- Submit the report.
What makes this tricky isn’t the typing—it’s the interaction surface: inconsistent UI widgets, focus bugs, modal dialogs, scroll containers, and state that lives across applications.
The standard framing is a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. The agent observes a state $s_t$, takes an action $a_t$, transitions to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and receives reward $r_t = R(s_t, a_t)$. The objective:

$$\max_\pi \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$
The environment design determines whether this objective is learnable in practice. If resets are flaky, rewards are ambiguous, and observations are unstable, you’ll spend your entire training budget learning to recover from simulator drift.
At minimum, your env should look like this (even if it’s implemented on top of real VMs):
```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class Observation:
    screenshot_png: bytes                # pixels (primary)
    cursor_xy: Tuple[int, int]
    active_app: str                      # optional but useful
    accessibility_tree: Optional[Dict]   # optional (UIA/DOM)
    ocr_text: Optional[str]              # optional


@dataclass
class StepInfo:
    episode_id: str
    step: int
    events: Dict[str, Any]   # clicks/keys/app events
    safety: Dict[str, Any]   # policy violations, blocked actions
    grader: Dict[str, Any]   # intermediate grading signals


class ComputerUseEnv:
    def reset(self, seed: int) -> Observation:
        """Reset VM + accounts + filesystem to a deterministic snapshot."""
        raise NotImplementedError

    def step(self, action: Dict[str, Any]) -> Tuple[Observation, float, bool, StepInfo]:
        """Execute one UI action and return (obs, reward, done, info)."""
        raise NotImplementedError
```

This looks simple, but each field implies deep engineering choices. Let’s walk through the core design surfaces.
In UI RL, reset is everything. If reset takes 2 minutes, you won’t get enough episodes. If reset is inconsistent, your training signal becomes noise.
A production-grade reset strategy usually combines a restorable golden snapshot (VM + accounts + filesystem) with per-episode fixtures seeded from the reset seed; a sketch follows below.
This is why companies like Mechanize explicitly market “RL environments” rather than “agents.” Their pitch is essentially: build digital workspaces that can be reset, graded, and scaled for training.
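Concretely, a production reset is usually “restore a golden snapshot, then seed per-episode state.” A minimal sketch, assuming hypothetical `vm_pool` and `fixtures` helpers (the names and snapshot API are illustrative, not a real library):

```python
import uuid


class SnapshotReset:
    """Sketch of a snapshot-based reset: restore a golden image, then seed episode state."""

    def __init__(self, vm_pool, fixtures):
        self.vm_pool = vm_pool    # warm pool of pre-booted VMs (hypothetical helper)
        self.fixtures = fixtures  # seeded emails, receipt PDFs, portal accounts

    def reset(self, seed: int):
        vm = self.vm_pool.acquire()                # reuse a warm VM instead of cold-booting
        vm.restore_snapshot("golden-expense-v3")   # deterministic OS + app + account state
        episode_id = str(uuid.uuid4())
        # Per-episode variation (receipt amount, email subject, UI theme) comes from the seed
        self.fixtures.install(vm, seed=seed, episode_id=episode_id)
        return vm, episode_id
```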
Most real computer-use APIs in 2024–2025 are built around screenshots + mouse/keyboard. Anthropic’s Computer Use tool is explicitly a loop where you provide screenshots and execute mouse/keyboard actions in your sandboxed environment. OpenAI’s computer-use guide is the same loop: the model emits UI actions, you execute them, and you return screenshots.
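In terms of the `ComputerUseEnv` interface above, that loop is only a few lines. A sketch, where `policy` stands in for whichever model client you’re driving (its `act` method is an assumption):

```python
def run_episode(env: ComputerUseEnv, policy, seed: int, max_steps: int = 120) -> float:
    """Screenshot-in, action-out: the env executes each action and returns a fresh observation."""
    obs = env.reset(seed)
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.act(obs.screenshot_png)      # model sees pixels, emits one UI action
        obs, reward, done, info = env.step(action)   # env executes it, re-screenshots, grades
        total_reward += reward
        if done:
            break
    return total_reward
```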
But pixels alone are expensive to learn from. If you can expose structure (DOM, accessibility labels, UIA trees), you can dramatically improve grounding and reward evaluation. The trade-off is realism: you don’t want to give the agent a magical oracle that won’t exist in production.
A common compromise is: pixels are the primary observation, but the env can optionally provide a structured tree for grounding and grading. Microsoft’s Windows automation stack (UFO and later UFO²) leaned hard into this by using Windows UI Automation (UIA) APIs for structured introspection and control rather than relying purely on coordinate clicks.
For an RL environment, this enables two big wins: grounding gets more reliable (actions can target elements by role and name instead of brittle pixel coordinates), and grading gets cheaper and more trustworthy (the reward function can read structured state instead of OCRing screenshots).
By late 2025, the community got loud about a specific failure mode: agents that look great on one UI skin collapse when the UI shifts by a few pixels. The cua-bench project (Dec 2025) is basically a thesis statement that visual diversity in the environment and data is what separates “demo agents” from “robust agents.”
So your observation design should include systematic variation knobs, sampled per episode from the reset seed; a sketch follows below.
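The specific knobs below are assumptions (pick whatever your OS and app stack can actually vary), but the shape is a seeded config that reset applies before the first observation:

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class UIVariation:
    theme: str                     # e.g. "light" | "dark" | "high_contrast"
    dpi_scale: float               # 1.0, 1.25, 1.5, ...
    window_size: Tuple[int, int]
    font_scale: float
    locale: str


def sample_variation(seed: int) -> UIVariation:
    """Deterministically derive this episode's UI skin from the reset seed."""
    rng = random.Random(seed)
    return UIVariation(
        theme=rng.choice(["light", "dark", "high_contrast"]),
        dpi_scale=rng.choice([1.0, 1.25, 1.5]),
        window_size=rng.choice([(1280, 800), (1366, 768), (1920, 1080)]),
        font_scale=rng.choice([1.0, 1.1, 1.25]),
        locale=rng.choice(["en-US", "en-GB"]),
    )
```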
A “computer-use action space” is deceptively big. Even if you only support clicks and typing, you still have to decide how clicks are targeted (raw coordinates vs. element references), how text input is chunked, how scrolling is expressed, and which keys and modifiers are allowed.
Here’s a minimal but surprisingly effective schema:
```json
{
  "type": "click" | "double_click" | "right_click" | "drag" | "scroll" | "type" | "key",
  "x": 742,
  "y": 311,
  "scroll_y": -580,
  "text": "23.14",
  "key": "ENTER",
  "modifiers": ["CTRL"]
}
```

This is close to what many “computer use” tools expose. It’s also where RL training gets ugly: coordinate clicking is noisy, and exploration is dangerous (a random click can delete, submit, or exfiltrate).
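Executing that schema inside the sandboxed VM is mostly glue code. A sketch using pyautogui (run it in the VM, never on your own machine; drag is omitted for brevity):

```python
import pyautogui  # executed inside the sandboxed VM


def execute_action(action: dict) -> None:
    """Map the JSON action schema above onto concrete mouse/keyboard events."""
    kind = action["type"]
    if kind == "click":
        pyautogui.click(action["x"], action["y"])
    elif kind == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif kind == "right_click":
        pyautogui.rightClick(action["x"], action["y"])
    elif kind == "scroll":
        pyautogui.scroll(action["scroll_y"], x=action.get("x"), y=action.get("y"))
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.02)
    elif kind == "key":
        combo = [m.lower() for m in action.get("modifiers", [])] + [action["key"].lower()]
        pyautogui.hotkey(*combo)
    else:
        raise ValueError(f"unhandled action type: {kind}")  # 'drag' left out for brevity
```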
The best-performing systems increasingly blend GUI actions with API actions when available. The ComputerRL paper (Aug 2025) formalized this as an “API-GUI paradigm”—use APIs for precise state transitions when possible, and fall back to GUI when not.
For our Expense Report env, that means deciding which steps the agent must drive through the GUI and which it can reach through an API.
The key rule: APIs can be used for grading even if you disallow them for acting. That keeps training realistic while making reward reliable.
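One way to encode that rule is a single execution gate, so the policy can’t quietly take an API shortcut the task is supposed to forbid. A sketch; `portal_api`, `gui_executor`, and the config flag are assumptions:

```python
class ActionRouter:
    """Route actions: GUI by default, API only if the env config explicitly allows it."""

    def __init__(self, gui_executor, portal_api, allow_api_actions: bool = False):
        self.gui_executor = gui_executor
        self.portal_api = portal_api              # still available to the *grader*
        self.allow_api_actions = allow_api_actions

    def execute(self, action: dict) -> dict:
        if action["type"] == "api_call":
            if not self.allow_api_actions:
                return {"blocked": True, "reason": "api_actions_disabled"}
            return self.portal_api.call(action["endpoint"], action.get("payload"))
        return self.gui_executor.run(action)      # clicks/keys/typing as usual
```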
In long-horizon UI tasks, sparse rewards alone are brutal. If the only reward is “expense submitted,” the agent has to stumble into the right 40–120 step trajectory by luck. That’s not training—that’s lottery.
The fix is a reward stack: a small per-step cost to discourage dithering, dense milestone rewards tied to verifiable checks, and a terminal reward (or penalty) based on whether the submission actually went through.
A practical grading function for Expense Report looks like:
```python
def grade_expense_portal(state) -> dict:
    """Returns verifiable checks without relying on fragile OCR."""
    return {
        "receipt_downloaded": state.fs.exists("/home/agent/Downloads/receipt.pdf"),
        "portal_open": state.browser.url.startswith("https://expenses.example"),
        "amount_ok": state.portal.form.amount == "23.14",
        "currency_ok": state.portal.form.currency == "USD",
        "date_ok": state.portal.form.date == "2025-12-18",
        "attachment_ok": state.portal.attachments.contains("receipt.pdf"),
        "submitted": state.portal.last_submission_status == "SUBMITTED",
    }


def reward(checks: dict, done: bool, step_cost: float = 0.01) -> float:
    r = -step_cost
    r += 0.05 if checks["receipt_downloaded"] else 0.0
    r += 0.10 if checks["amount_ok"] and checks["currency_ok"] else 0.0
    r += 0.10 if checks["attachment_ok"] else 0.0
    if done:
        r += 1.0 if checks["submitted"] else -0.2
    return r
```

“But isn’t that cheating?” Only if the agent can directly read these checks. In RL, it’s normal for the environment to compute reward from privileged state. The realism constraint is: the agent’s observation should still be what it gets in production (screenshots and whatever structured metadata you’ll actually have). The grader can be privileged; the policy doesn’t have to be.
If your env runs on real apps, agents will discover degenerate strategies: editing state through devtools, hitting internal APIs directly instead of driving the UI, or reaching a “submitted” state without ever uploading the file.
The environment must close these loopholes: locked-down user permissions, devtools disabled, egress restrictions, and “grader invariants” that validate the workflow actually occurred (e.g., event logs show file upload, not just an API call).
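A grader invariant can be as simple as cross-checking the event log before awarding terminal reward. A sketch, assuming your instrumentation records events roughly like these (the event names are assumptions):

```python
from typing import Dict, List


def submission_is_legitimate(events: List[Dict]) -> bool:
    """Require evidence that the upload and submit actually happened through the UI."""
    opened_file_dialog = any(e["type"] == "file_dialog_opened" for e in events)
    clicked_submit = any(
        e["type"] == "click" and e.get("target_name") == "Submit"
        for e in events
    )
    uploaded_via_browser = any(
        e["type"] == "http_request"
        and e.get("initiator") == "browser"
        and e.get("path", "").endswith("/attachments")
        for e in events
    )
    return opened_file_dialog and clicked_submit and uploaded_via_browser
```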
In 2024–2025, the ecosystem converged on a few anchor benchmarks:

| Benchmark | What it tests | Why it’s hard |
|---|---|---|
| OSWorld | Open-ended tasks on real OS apps (cross-app workflows) | Long horizon, diverse apps, real UI state |
| Windows Agent Arena | Windows-native tasks across real apps | OS-level complexity, evaluation at scale |
| WorkArena | Enterprise workflows on ServiceNow | Huge DOMs, dynamic enterprise UI patterns |
| WebArena / VisualWebArena | Long-horizon web tasks on realistic sites | Web unreliability, multi-page state, tool use |

UI-CUBE (Nov 2025) reports a sharp capability cliff: agents often score high on simple UI interactions, then collapse on complex workflows—exactly the failure mode you should design your environment to expose early.
But “success rate” alone hides what you care about in real deployments: reliability, cost, wall-clock time, and how often a human has to intervene. A better evaluation report breaks these out explicitly and treats robustness across UI variations as its own axis (see the sketch below).
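Concretely, that can be a small report per task family, with robustness broken out per UI variant. A sketch of the shape (field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvalReport:
    task_family: str
    episodes: int
    success_rate: float                      # strict, grader-verified completion
    median_steps: float
    median_wall_clock_s: float
    cost_per_episode_usd: float
    human_intervention_rate: float           # fraction of episodes needing a takeover
    success_by_ui_variant: Dict[str, float] = field(default_factory=dict)  # robustness axis
```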
That robustness axis is the key “Dec 2025 upgrade.” If your eval doesn’t include UI variation, you are measuring memorization.
If an agent can browse the web, it can read adversarial text. If it can copy/paste, it can exfiltrate. If it can click “Submit,” it can do irreversible things. Safety is not a prompt—it’s a set of environment constraints.
At minimum, run training in a sandbox: isolated VMs, locked-down user permissions, restricted network egress, and hard gates on irreversible actions.
This matches what major vendors recommend in their computer-use docs: run in isolated environments and treat the web as untrusted input.
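One concrete layer of that is an action-level guard that runs before anything is executed. A sketch; the allowlist and rules are illustrative, not a complete policy:

```python
from typing import Tuple

ALLOWED_URL_PREFIXES = (        # hypothetical egress allowlist for the expense task
    "https://expenses.example",
    "https://mail.example",
)

BLOCKED_KEY_COMBOS = {("CTRL", "DELETE")}   # illustrative: combos never auto-approved


def guard_action(action: dict, current_url: str) -> Tuple[bool, str]:
    """Return (allowed, reason); the env blocks and logs anything not allowed."""
    if not current_url.startswith(ALLOWED_URL_PREFIXES):
        return False, "navigation_outside_allowlist"
    if action["type"] == "key":
        combo = tuple(action.get("modifiers", [])) + (action["key"],)
        if combo in BLOCKED_KEY_COMBOS:
            return False, "blocked_key_combo"
    return True, "ok"
```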
Most teams converged on a three-phase pipeline:
Two notable 2025 ideas worth stealing:
The punchline: computer-use RL isn’t “just run PPO.” It’s systems engineering: rollout scale, reset speed, a grading pipeline, replay/selection, and safety constraints that let the agent explore without destroying your environment (or your company).
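The replay/selection piece, for example, is often just filtering rollouts by grader outcome before they reach the learner. A sketch, assuming each rollout record carries the checks from `grade_expense_portal`:

```python
from typing import Dict, List


def select_for_training(rollouts: List[Dict], min_checks_passed: int = 3) -> List[Dict]:
    """Keep verified successes plus near-misses that passed enough milestone checks."""
    kept = []
    for rollout in rollouts:
        checks = rollout["grader_checks"]          # e.g. the dict from grade_expense_portal
        passed = sum(bool(v) for v in checks.values())
        if checks.get("submitted") or passed >= min_checks_passed:
            kept.append(rollout)
    return kept
```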
Mechanize (a 2025 startup) is unusually explicit about the lever they’re pulling: they build and sell RL environments that simulate real work (starting with software engineering), and they position themselves as infrastructure for AI labs rather than a consumer agent product.
Anthropic’s Computer Use tool docs (updated through late 2025) show a clear model: you run a VM/sandbox, Claude requests screenshot + mouse/keyboard actions, and your app executes them. They version the tool and require beta headers (e.g., computer-use-2025-01-24 and later updates in 2025).
OpenAI released Operator as a research preview in early 2025, then deprecated it and ended access by Aug 31, 2025, integrating the capabilities into a broader agent experience. For developers, OpenAI’s docs describe a computer-use model (“computer-use-preview”) accessed through the Responses API, again in the same execute-and-screenshot loop.
Google expanded access to Project Mariner around I/O 2025 (including parallel task execution in cloud VMs) and later introduced a Gemini 2.5 “Computer Use” model (Oct 2025) aimed at powering UI-interacting agents.
Microsoft’s open-source UFO line is a strong signal: pure pixels are not enough for reliable Windows automation. UFO² emphasizes deep Windows integration (UIA APIs / COM), and UFO³ broadens to multi-device orchestration. That’s “environment + control plane” thinking.
If you want to train a computer-use agent in 2026, don’t start with the model. Start with one environment that you can scale: fast deterministic resets, verifiable grading, systematic UI variation, and safety constraints that make exploration survivable.
The uncomfortable truth: most “computer use agents” fail not because they can’t click—but because the world they train in is not designed to teach them.