
Designing an RL Environment for Computer-Use Agents: The Expense Report Problem

December 22, 2025

I once gave a “computer-use” agent a task that sounded embarrassingly simple: “File this $23.14 ride receipt and submit the expense report.”

It opened the right email. Downloaded the PDF. Navigated to the company portal. Then it did what every brittle automation system does when it leaves the happy path: clicked the wrong dropdown, typed into a non-focused field, lost the file picker, and spiraled into a loop of “scroll, click, scroll, click” until the episode timed out.

The model wasn’t “dumb.” The task wasn’t “hard.” The real problem was that I hadn’t built an environment—I had built a demo.

In 2024–2025 we watched computer-use agents go from research prototypes to widely available developer tools: Anthropic shipped a Computer Use tool (public beta starting Oct 2024); OpenAI shipped Operator (Jan 2025) and then folded it into a broader ChatGPT agent experience by Aug 2025; Google pushed Project Mariner to wider access at I/O 2025 and later introduced a Gemini “Computer Use” model; and Microsoft open-sourced Windows automation stacks like UFO/UFO²/UFO³.

Under the hood, the shared lesson is the same as robotics: policy is downstream of the world. If you want reliable, long-horizon computer automation, your differentiator isn’t “prompting”—it’s the RL environment: how you reset, how you observe, how you act, how you grade, how you vary the UI, and how you keep the agent safe while it learns.

The thesis

A good RL environment for computer-use tasks is not a VNC wrapper around a VM. It’s an engineered system that moves in one direction:

Policy is downstream of the environment: if you can’t reset, observe, act, and grade reliably, RL won’t save you.

  Task: expense report autopilot (end-to-end workflow)
  Reset: snapshots • clean accounts • stable seeds
  Observe: screenshots + (optional) accessibility / DOM
  Act: mouse • keyboard • scroll • (optional) API-GUI hybrid
  Grade: verifiable end state + stepwise progress signals
  Vary: themes • layout shifts • resolution • app versions
  Guard: sandbox • allowlists • irreversible-action gates • full traces
  Train + Evaluate: BC → RL (replay, entropy management) → robustness metrics

Figure 1 — A vertical view of the thesis: environment design precedes policy quality.

We’ll make this concrete by designing an RL environment around one task: Expense Report Autopilot. It’s a perfect “computer use” problem: multi-app, document handling, form fills, file pickers, intermittent popups, and a clean objective.

Case study task: Expense Report Autopilot

The agent’s job, end-to-end:

  1. Open an email inbox and find the latest receipt (PDF).
  2. Download it to a known folder.
  3. Open the expense portal in a browser.
  4. Create a new expense with the correct vendor, date, currency, category, and amount.
  5. Attach the PDF receipt.
  6. Submit, and verify the report shows as submitted.

What makes this tricky isn’t the typing—it’s the interaction surface: inconsistent UI widgets, focus bugs, modal dialogs, scroll containers, and state that lives across applications.
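
One way to pin the task down is a small task spec that the environment uses to seed state and the privileged grader checks against later. This is a sketch: the vendor and category values are illustrative, while the amount, date, filename, and download path match the grading example later in the post.

Python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpenseTaskSpec:
    """Ground truth seeded into the inbox and used by the grader."""
    vendor: str = "City Rides Inc."            # illustrative vendor name
    amount: str = "23.14"
    currency: str = "USD"
    date: str = "2025-12-18"
    category: str = "Ground Transportation"    # illustrative category
    receipt_filename: str = "receipt.pdf"
    download_dir: str = "/home/agent/Downloads"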

Formalizing “computer use” as an MDP

The standard framing is a Markov Decision Process (MDP): M = (S, A, P, R, \gamma). The agent observes a state s_t, takes an action a_t, transitions to s_{t+1}, and receives reward r_t. The objective:

\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]

The environment design determines whether this objective is learnable in practice. If resets are flaky, rewards are ambiguous, and observations are unstable, you’ll spend your entire training budget learning to recover from simulator drift.
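
For concreteness, the discounted return the policy maximizes is just a weighted sum of per-step rewards. A minimal sketch:

Python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Compute sum over t of gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. two step-cost penalties followed by a success bonus (gamma = 1 for clarity):
assert abs(discounted_return([-0.01, -0.01, 1.0], gamma=1.0) - 0.98) < 1e-9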

Environment API: what you actually need

At minimum, your env should look like this (even if it’s implemented on top of real VMs):

An RL environment for computer use runs one vertical loop: reset → observe → act → grade → learn.

  Episode reset: VM snapshot + seeded UI variation + clean state
  Observation bundle: screenshot • cursor • (optional) tree / OCR • active app
  Policy (agent): plans actions from pixels, learns from reward
  Action executor: click • type • scroll • drag • hotkeys (guarded)
  Real desktop apps: email → downloads → browser portal → file picker
  Privileged grader: verifies receipt attached, values correct, status submitted
  Reward + traces: step cost • progress rewards • event logs • screenshots
  Learning stack: demos → BC → online RL (replay, entropy management)

Figure 2 — A practical architecture that stays honest: the policy sees screens, the environment grades with verifiable checks.

Python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

@dataclass
class Observation:
    screenshot_png: bytes               # pixels (primary)
    cursor_xy: Tuple[int, int]
    active_app: str                     # optional but useful
    accessibility_tree: Optional[Dict]  # optional (UIA/DOM)
    ocr_text: Optional[str]             # optional

@dataclass
class StepInfo:
    episode_id: str
    step: int
    events: Dict[str, Any]              # clicks/keys/app events
    safety: Dict[str, Any]              # policy violations, blocked actions
    grader: Dict[str, Any]              # intermediate grading signals

class ComputerUseEnv:
    def reset(self, seed: int) -> Observation:
        """Reset VM + accounts + filesystem to a deterministic snapshot."""
        raise NotImplementedError

    def step(self, action: Dict[str, Any]) -> Tuple[Observation, float, bool, StepInfo]:
        """Execute one UI action and return (obs, reward, done, info)."""
        raise NotImplementedError

This looks simple, but each field implies deep engineering choices. Let’s walk through the core design surfaces.
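
Before digging in, here is the rollout loop a trainer or evaluator would run against this interface. It is a sketch: the random policy exists only to smoke-test the environment, and a real policy would plan from the screenshot.

Python
import random

def rollout(env: "ComputerUseEnv", policy, seed: int, max_steps: int = 200) -> float:
    """Run one episode and return the total reward."""
    obs = env.reset(seed=seed)
    total, done, step = 0.0, False, 0
    while not done and step < max_steps:
        action = policy(obs)                     # e.g. {"type": "click", "x": 742, "y": 311}
        obs, reward, done, info = env.step(action)
        total += reward
        step += 1
    return total

def random_policy(obs: "Observation") -> dict:
    """Placeholder policy: clicks random coordinates (smoke-testing only)."""
    return {"type": "click", "x": random.randint(0, 1279), "y": random.randint(0, 719)}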

1) Reset and determinism: the unsexy part that makes or breaks training

In UI RL, reset is everything. If reset takes 2 minutes, you won’t get enough episodes. If reset is inconsistent, your training signal becomes noise.

A production-grade reset strategy usually includes:

  1. VM or container snapshots that restore the full desktop to a known state in seconds, not minutes.
  2. Clean per-episode accounts (inbox, expense portal) so no state leaks between episodes.
  3. A scrubbed filesystem (e.g., an empty Downloads folder) plus a single episode seed that drives all UI variation.

This is why companies like Mechanize explicitly market “RL environments” rather than “agents.” Their pitch is essentially: build digital workspaces that can be reset, graded, and scaled for training.
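
As a sketch of what that looks like in code, a reset restores a snapshot and then scrubs per-episode state. The `vm`, `accounts`, and `fs` helpers below are hypothetical stand-ins for whatever hypervisor and fixture tooling you actually use.

Python
import random

def reset_episode(vm, accounts, fs, seed: int) -> random.Random:
    """Hypothetical reset: restore a snapshot, scrub per-episode state, seed variation."""
    vm.restore_snapshot("clean-desktop-v3")       # hypervisor-specific; name is illustrative
    accounts.reset("agent@example.com")           # fresh inbox + empty expense portal
    fs.clear("/home/agent/Downloads")             # no leftover receipts between episodes
    return random.Random(seed)                    # one seed drives all UI variation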

2) Observation design: pixels, structure, and “ground truth”

Most real computer-use APIs in 2024–2025 are built around screenshots + mouse/keyboard. Anthropic’s Computer Use tool is explicitly a loop where you provide screenshots and execute mouse/keyboard actions in your sandboxed environment. OpenAI’s computer-use guide is the same loop: the model emits UI actions, you execute them, and you return screenshots.

But pixels alone are expensive to learn from. If you can expose structure (DOM, accessibility labels, UIA trees), you can dramatically improve grounding and reward evaluation. The trade-off is realism: you don’t want to give the agent a magical oracle that won’t exist in production.

There are two common regimes: pixel-only (max realism, max pain), and pixels + accessibility / DOM (the practical middle ground).

A common compromise is: pixels are the primary observation, but the env can optionally provide a structured tree for grounding and grading. Microsoft’s Windows automation stack (UFO and later UFO²) leaned hard into this by using Windows UI Automation (UIA) APIs for structured introspection and control rather than relying purely on coordinate clicks.

For an RL environment, this enables two big wins:

Variation is not optional (Dec 2025 lesson)

By late 2025, the community got loud about a specific failure mode: agents that look great on one UI skin collapse when the UI shifts by a few pixels. The cua-bench project (Dec 2025) is basically a thesis statement that visual diversity in the environment and data is what separates “demo agents” from “robust agents.”

So your observation design should include systematic variation knobs:

  1. Themes and skins (light, dark, high-contrast).
  2. Layout shifts (window position, sidebar state, element reordering).
  3. Resolution and DPI scaling.
  4. App and portal versions.
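
A small, seeded sampler makes those knobs explicit and reproducible. This is a sketch; the specific knob values and version strings are illustrative.

Python
import random
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class UIVariation:
    theme: str
    resolution: Tuple[int, int]
    portal_version: str
    sidebar_collapsed: bool

def sample_variation(seed: int) -> UIVariation:
    """Deterministically derive one UI configuration per episode seed."""
    rng = random.Random(seed)
    return UIVariation(
        theme=rng.choice(["light", "dark", "high-contrast"]),
        resolution=rng.choice([(1280, 720), (1440, 900), (1920, 1080)]),
        portal_version=rng.choice(["2025.10", "2025.12"]),   # illustrative versions
        sidebar_collapsed=rng.random() < 0.3,
    )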

3) Action design: coordinates, semantics, and the API-GUI hybrid

A “computer-use action space” is deceptively big. Even if you only support clicks and typing, you still have to decide:

  1. Whether clicks target absolute coordinates or named elements, and at what screen resolution.
  2. How scrolling is expressed (pixels, wheel “clicks,” or container-relative).
  3. How text entry, special keys, and modifier combinations are encoded.
  4. How drags and multi-step gestures are represented.

Here’s a minimal but surprisingly effective schema:

Json
{
  "type": "click" | "double_click" | "right_click" | "drag" | "scroll" | "type" | "key",
  "x": 742,
  "y": 311,
  "scroll_y": -580,
  "text": "23.14",
  "key": "ENTER",
  "modifiers": ["CTRL"]
}

This is close to what many “computer use” tools expose. It’s also where RL training gets ugly: coordinate clicking is noisy, and exploration is dangerous (a random click can delete, submit, or exfiltrate).
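
As a sketch, an executor for this schema can dispatch to a desktop-automation library. The version below uses pyautogui-style calls as one common option, omits drag for brevity, and assumes the guards discussed in the safety section wrap it in practice.

Python
import pyautogui  # one common desktop-automation option; others work too

def execute(action: dict) -> None:
    """Dispatch one schema action to the desktop (drag omitted for brevity)."""
    kind = action["type"]
    if kind == "click":
        pyautogui.click(action["x"], action["y"])
    elif kind == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif kind == "right_click":
        pyautogui.rightClick(action["x"], action["y"])
    elif kind == "scroll":
        pyautogui.scroll(action["scroll_y"])            # positive scrolls up
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.02)  # small delay per character
    elif kind == "key":
        keys = [m.lower() for m in action.get("modifiers", [])] + [action["key"].lower()]
        if len(keys) > 1:
            pyautogui.hotkey(*keys)
        else:
            pyautogui.press(keys[0])
    else:
        raise ValueError(f"unsupported action type: {kind}")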

The 2025 pattern: hybridize actions

The best-performing systems increasingly blend GUI actions with API actions when available. The ComputerRL paper (Aug 2025) formalized this as an “API-GUI paradigm”—use APIs for precise state transitions when possible, and fall back to GUI when not.

For our Expense Report env, that means:

  1. The agent still acts through the GUI: clicking the portal form, typing values, driving the file picker.
  2. The environment can call the portal’s backend to read form values, attachments, and submission status for grading.
  3. If your production deployment will actually have APIs available, expose them as actions too; if not, keep them grader-only.

The key rule: APIs can be used for grading even if you disallow them for acting. That keeps training realistic while making reward reliable.

4) Reward design: verifiable outcomes, stepwise signals, and anti-cheating

In long-horizon UI tasks, sparse rewards alone are brutal. If the only reward is “expense submitted,” the agent has to stumble into the right 40–120 step trajectory by luck. That’s not training—it’s a lottery.

The fix is a reward stack:

  1. A small per-step cost to discourage aimless clicking.
  2. Milestone rewards tied to verifiable checks (receipt downloaded, fields correct, attachment present).
  3. A terminal reward for a verified submission, with a penalty for submitting something wrong.

A practical grading function for Expense Report looks like:

Python
def grade_expense_portal(state) -> dict:
    """Returns verifiable checks without relying on fragile OCR."""
    return {
        "receipt_downloaded": state.fs.exists("/home/agent/Downloads/receipt.pdf"),
        "portal_open": state.browser.url.startswith("https://expenses.example"),
        "amount_ok": state.portal.form.amount == "23.14",
        "currency_ok": state.portal.form.currency == "USD",
        "date_ok": state.portal.form.date == "2025-12-18",
        "attachment_ok": state.portal.attachments.contains("receipt.pdf"),
        "submitted": state.portal.last_submission_status == "SUBMITTED",
    }

def reward(checks: dict, done: bool, step_cost: float = 0.01) -> float:
    r = -step_cost
    r += 0.05 if checks["receipt_downloaded"] else 0.0
    r += 0.10 if checks["amount_ok"] and checks["currency_ok"] else 0.0
    r += 0.10 if checks["attachment_ok"] else 0.0
    if done:
        r += 1.0 if checks["submitted"] else -0.2
    return r

“But isn’t that cheating?” Only if the agent can directly read these checks. In RL, it’s normal for the environment to compute reward from privileged state. The realism constraint is: the agent’s observation should still be what it gets in production (screenshots and whatever structured metadata you’ll actually have). The grader can be privileged; the policy doesn’t have to be.

Anti-cheating matters more than you think

If your env runs on real apps, agents will discover degenerate strategies:

  1. Hitting the portal’s backend (or a browser console) directly instead of completing the UI workflow.
  2. Editing page state through devtools so the grader’s checks pass without the work being done.
  3. Exploiting leftover state from a previous episode, such as a receipt already sitting in Downloads.

The environment must close these loopholes: locked-down user permissions, devtools disabled, egress restrictions, and “grader invariants” that validate the workflow actually occurred (e.g., event logs show file upload, not just an API call).
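
A cheap way to enforce the last point is to check the event log for the UI steps that should have produced the end state. This is a sketch; the event names are illustrative, not a real logging schema.

Python
def workflow_invariants(events: list) -> bool:
    """The end state only counts if the UI workflow actually produced it.

    `events` is the env's ordered event log; the event names are illustrative.
    """
    names = [e["name"] for e in events]
    try:
        attach = names.index("file_picker_confirmed")   # receipt attached via the UI
        submit = names.index("portal_submit_clicked")   # submit button actually clicked
    except ValueError:
        return False                                    # a required UI step never happened
    return attach < submit                              # attached before submitting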

5) Benchmarks and evaluation: stop measuring only success rate

In 2024–2025, the ecosystem converged on a few anchor benchmarks:

Figure 3 — The reliability cliff: a common 2025 finding is that simple UI tasks look fine while complex workflows collapse, which is why eval must be more than accuracy; measuring robustness and workflow reliability is mandatory for deployment.

UI-CUBE (Nov 2025) reports a sharp capability cliff: agents often score high on simple UI interactions, then collapse on complex workflows—exactly the failure mode you should design your environment to expose early.

Benchmark | What it tests | Why it’s hard
OSWorld | Open-ended tasks on real OS apps (cross-app workflows) | Long horizon, diverse apps, real UI state
Windows Agent Arena | Windows-native tasks across real apps | OS-level complexity, evaluation at scale
WorkArena | Enterprise workflows on ServiceNow | Huge DOMs, dynamic enterprise UI patterns
WebArena / VisualWebArena | Long-horizon web tasks on realistic sites | Web unreliability, multi-page state, tool use

But “success rate” alone hides what you care about in real deployments: reliability, cost, time, and how often a human has to intervene. A better evaluation report includes:

  1. Success rate broken down by UI variation (theme, resolution, app version), not just an overall average.
  2. Steps, wall-clock time, and compute cost per completed task.
  3. Human intervention rate and the failure modes that triggered it.

That robustness axis is the key “Dec 2025 upgrade.” If your eval doesn’t include UI variation, you are measuring memorization.
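
In code, that report is just an aggregation over episode records tagged with their UI variation. A sketch, assuming each record is a dict carrying success, step count, and intervention flags:

Python
from collections import defaultdict

def eval_report(episodes: list) -> dict:
    """Aggregate per-variation success plus the cost/reliability numbers that matter.

    Each episode is assumed to look like:
    {"variation": "dark-1080p", "success": True, "steps": 63, "interventions": 0}
    """
    n = len(episodes)
    by_variation = defaultdict(list)
    for ep in episodes:
        by_variation[ep["variation"]].append(ep)
    return {
        "success_rate": sum(ep["success"] for ep in episodes) / n,
        "success_by_variation": {
            v: sum(ep["success"] for ep in eps) / len(eps)
            for v, eps in by_variation.items()
        },
        "mean_steps": sum(ep["steps"] for ep in episodes) / n,
        "intervention_rate": sum(ep["interventions"] > 0 for ep in episodes) / n,
    }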

6) Safety: prompt injection is an environment problem

If an agent can browse the web, it can read adversarial text. If it can copy/paste, it can exfiltrate. If it can click “Submit,” it can do irreversible things. Safety is not a prompt—it’s a set of environment constraints.

At minimum, run training in a sandbox with:

  1. Network egress restricted to an allowlist of the apps and domains the task actually needs.
  2. Throwaway accounts and credentials with no access to real data or money.
  3. Gates (confirmation or hard blocks) on irreversible actions like deleting, paying, or sending.
  4. Full traces: every screenshot, action, and event logged for audit and replay.

This matches what major vendors recommend in their computer-use docs: run in isolated environments and treat the web as untrusted input.
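
A guard layer can sit between the policy and the action executor, enforcing the allowlist and gating irreversible actions before anything touches the desktop. This is a sketch: the URL prefixes, keyword list, and the `focused_name` field on the accessibility tree are all illustrative assumptions.

Python
# Illustrative allowlist and keyword gate; real deployments need stricter rules.
ALLOWED_URL_PREFIXES = ("https://mail.example", "https://expenses.example")
IRREVERSIBLE_KEYWORDS = ("delete", "pay", "wire", "send money")

def guard_action(action: dict, obs: "Observation") -> tuple:
    """Return (allowed, reason); blocked actions get logged instead of executed."""
    # Block typing a URL outside the allowlist (illustrative egress check).
    if action["type"] == "type" and action.get("text", "").startswith("http"):
        if not action["text"].startswith(ALLOWED_URL_PREFIXES):
            return False, "url_not_allowlisted"
    # Gate clicks whose target element name looks irreversible ("focused_name"
    # is a hypothetical field; use your accessibility tree's actual schema).
    tree = obs.accessibility_tree or {}
    target = str(tree.get("focused_name", "")).lower()
    if action["type"] == "click" and any(k in target for k in IRREVERSIBLE_KEYWORDS):
        return False, "irreversible_action_gated"
    return True, "ok"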

7) Training recipe (2025 best practice): demos → RL with replay → entropy management

Most teams converged on a three-phase pipeline:

  1. Demonstrations: record expert trajectories (humans, scripted oracles, or hybrid)
  2. Behavior cloning (BC): cold-start the policy so it can “do something”
  3. Online RL: optimize with verifiable reward + constrained exploration

Two notable 2025 ideas worth stealing:

  1. Replay of successful trajectories: ARPO (May 2025) keeps known-good rollouts in a buffer and replays them during policy optimization, which densifies signal on long, sparse-reward tasks.
  2. Entropy management at scale: ComputerRL (Aug 2025) pairs its API-GUI paradigm with large-scale distributed training and explicit entropy management so exploration doesn’t collapse over long runs.

The punchline: computer-use RL isn’t “just run PPO.” It’s systems engineering: rollout scale, reset speed, a grading pipeline, replay/selection, and safety constraints that let the agent explore without destroying your environment (or your company).
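
The shape of that pipeline, with successful trajectories replayed into online updates, looks roughly like this. It is a sketch: `bc_update`, `rl_update`, and `policy.act` are placeholders for your own training stack, not a specific library API.

Python
import random

def train(env, policy, demos, seeds, rl_iters=1000, replay_ratio=0.25):
    """Sketch: demonstrations -> behavior cloning -> online RL with success replay."""
    # Phase 1-2: behavior cloning on recorded expert trajectories.
    for demo in demos:
        policy.bc_update(demo)                      # placeholder supervised update

    # Phase 3: online RL; replaying past successes densifies the sparse reward.
    replay_buffer = []
    for _ in range(rl_iters):
        if replay_buffer and random.random() < replay_ratio:
            traj = random.choice(replay_buffer)     # re-use a known-good trajectory
        else:
            traj = collect_episode(env, policy, seed=random.choice(seeds))
            if traj["success"]:
                replay_buffer.append(traj)
        policy.rl_update(traj)                      # placeholder policy-gradient update

def collect_episode(env, policy, seed, max_steps=200):
    """Placeholder rollout that records actions/rewards and whether the task verified."""
    obs, steps, info = env.reset(seed=seed), [], None
    for _ in range(max_steps):
        action = policy.act(obs)
        obs, reward, done, info = env.step(action)
        steps.append((action, reward))
        if done:
            break
    submitted = bool(info and info.grader.get("submitted", False))
    return {"steps": steps, "success": submitted}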

Industry snapshot (Dec 2025): who’s betting on what

Mechanize: sell the environment, not the agent

Mechanize (a 2025 startup) is unusually explicit about the lever they’re pulling: they build and sell RL environments that simulate real work (starting with software engineering), and they position themselves as infrastructure for AI labs rather than a consumer agent product.

Anthropic: “Computer Use” as a tool loop

Anthropic’s Computer Use tool docs (updated through late 2025) show a clear model: you run a VM/sandbox, Claude requests screenshot + mouse/keyboard actions, and your app executes them. They version the tool and require beta headers (e.g., computer-use-2025-01-24 and later updates in 2025).

OpenAI: Operator → ChatGPT agent + a computer-use model for developers

OpenAI released Operator as a research preview in early 2025, then deprecated it and ended access by Aug 31, 2025, integrating the capabilities into a broader agent experience. For developers, OpenAI’s docs describe a computer-use model (“computer-use-preview”) accessed through the Responses API, again in the same execute-and-screenshot loop.

Google DeepMind: Project Mariner + Gemini “Computer Use” model

Google expanded access to Project Mariner around I/O 2025 (including parallel task execution in cloud VMs) and later introduced a Gemini 2.5 “Computer Use” model (Oct 2025) aimed at powering UI-interacting agents.

Microsoft: structured OS hooks (UFO → UFO² → UFO³)

Microsoft’s open-source UFO line is a strong signal: pure pixels are not enough for reliable Windows automation. UFO² emphasizes deep Windows integration (UIA APIs / COM), and UFO³ broadens to multi-device orchestration. That’s “environment + control plane” thinking.

What to build next

If you want to train a computer-use agent in 2026, don’t start with the model. Start with one environment that you can scale:

  1. Pick a single workflow with a verifiable end state (expense report, ticket triage, invoice entry).
  2. Build deterministic resets and a grader that can’t be gamed.
  3. Add UI variation (theme/layout/resolution/app versions) early.
  4. Record expert demos and bootstrap with BC.
  5. Add online RL with replay + safety constraints.

The uncomfortable truth: most “computer use agents” fail not because they can’t click—but because the world they train in is not designed to teach them.

References (selected)

  1. Mechanize: mechanize.work
  2. TechCrunch on Mechanize (Apr 2025): techcrunch.com/…/launches-controversial-startup…
  3. Anthropic Computer Use announcement (Oct 2024): anthropic.com/news/3-5-models-and-computer-use
  4. Anthropic Computer Use tool docs (updated through 2025): platform.claude.com/docs/…/computer-use-tool
  5. OpenAI computer use guide: platform.openai.com/docs/guides/tools-computer-use
  6. Operator deprecation notice (Aug 2025): help.openai.com/…/operator
  7. MIT Technology Review on Operator (Jan 2025): technologyreview.com/…/openai-launches-operator…
  8. Project Mariner overview: deepmind.google/models/project-mariner
  9. Gemini 2.5 Computer Use model announcement (Oct 2025): blog.google/…/gemini-computer-use-model
  10. Microsoft UFO project: github.com/microsoft/UFO
  11. OSWorld benchmark: os-world.github.io
  12. Windows Agent Arena: microsoft.com/…/windows-agent-arena
  13. WorkArena benchmark: servicenow.github.io/WorkArena
  14. WebArena paper: webarena.dev/static/paper.pdf
  15. ARPO (May 2025): arXiv:2505.16282
  16. ComputerRL (Aug 2025): arXiv:2508.14040
  17. UI-CUBE (Nov 2025): arXiv:2511.17131
  18. cua-bench (Dec 2025): huggingface.co/blog/cua-ai/cua-bench