
Designing an RL Environment for Computer-Use Agents: The Expense Report Problem

December 22, 2025

I once gave a “computer-use” agent a task that sounded embarrassingly simple: “File this $23.14 ride receipt and submit the expense report.”

It opened the right email. Downloaded the PDF. Navigated to the company portal. Then it did what every brittle automation system does when it leaves the happy path: clicked the wrong dropdown, typed into a non-focused field, lost the file picker, and spiraled into a loop of “scroll, click, scroll, click” until the episode timed out.

The model wasn’t “dumb.” The task wasn’t “hard.” The real problem was that I hadn’t built an environment—I had built a demo.

In 2024–2025 we watched computer-use agents go from research prototypes to widely available developer tools: Anthropic shipped a Computer Use tool (public beta starting Oct 2024); OpenAI shipped Operator (Jan 2025) and then folded it into a broader ChatGPT agent experience by Aug 2025; Google pushed Project Mariner to wider access at I/O 2025 and later introduced a Gemini “Computer Use” model; and Microsoft open-sourced Windows automation stacks like UFO/UFO²/UFO³.

Under the hood, the shared lesson is the same as robotics: policy is downstream of the world. If you want reliable, long-horizon computer automation, your differentiator isn’t “prompting”—it’s the RL environment: how you reset, how you observe, how you act, how you grade, how you vary the UI, and how you keep the agent safe while it learns.

The thesis

A good RL environment for computer-use tasks is not a VNC wrapper around a VM. It’s an engineered system that moves in one direction:

Policy is downstream of the environment: if you can’t reset, observe, act, and grade reliably, RL won’t save you.

  Task: expense report autopilot (end-to-end workflow)
  Reset: snapshots • clean accounts • stable seeds
  Observe: screenshots + (optional) accessibility / DOM
  Act: mouse • keyboard • scroll • (optional) API-GUI hybrid
  Grade: verifiable end state + stepwise progress signals
  Vary: themes • layout shifts • resolution • app versions
  Guard: sandbox • allowlists • irreversible-action gates • full traces
  Train + Evaluate: BC → RL (replay, entropy management) → robustness metrics

Figure 1 — A vertical view of the thesis: environment design precedes policy quality.

We’ll make this concrete by designing an RL environment around one task: Expense Report Autopilot. It’s a perfect “computer use” problem: multi-app, document handling, form fills, file pickers, intermittent popups, and a clean objective.

Case study task: Expense Report Autopilot

The agent’s job, end-to-end:

  1. Open an email inbox and find the latest receipt (PDF).
  2. Download it to a known folder.
  3. Open the expense portal in a browser.
  4. Create a new expense with the correct vendor, date, currency, category, and amount.
  5. Attach the PDF receipt.
  6. Submit, and verify the report shows as submitted.

What makes this tricky isn’t the typing—it’s the interaction surface: inconsistent UI widgets, focus bugs, modal dialogs, scroll containers, and state that lives across applications.
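
One way to pin the task down is a small task spec that the environment uses to seed state and the privileged grader checks against later. This is a sketch: the vendor and category values are illustrative, while the amount, date, filename, and download path match the grading example later in the post.

Python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpenseTaskSpec:
    """Ground truth seeded into the inbox and used by the grader."""
    vendor: str = "City Rides Inc."            # illustrative vendor name
    amount: str = "23.14"
    currency: str = "USD"
    date: str = "2025-12-18"
    category: str = "Ground Transportation"    # illustrative category
    receipt_filename: str = "receipt.pdf"
    download_dir: str = "/home/agent/Downloads"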

Formalizing “computer use” as an MDP

The standard framing is a Markov Decision Process (MDP): M = (S, A, P, R, \gamma). The agent observes a state s_t, takes an action a_t, transitions to s_{t+1}, and receives reward r_t. The objective:

\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]

The environment design determines whether this objective is learnable in practice. If resets are flaky, rewards are ambiguous, and observations are unstable, you’ll spend your entire training budget learning to recover from simulator drift.
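
For concreteness, the discounted return the policy maximizes is just a weighted sum of per-step rewards. A minimal sketch:

Python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Compute sum over t of gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. two step-cost penalties followed by a success bonus (gamma = 1 for clarity):
assert abs(discounted_return([-0.01, -0.01, 1.0], gamma=1.0) - 0.98) < 1e-9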

Environment API: what you actually need

At minimum, your env should look like this (even if it’s implemented on top of real VMs):

An RL environment for computer use runs one vertical loop: reset → observe → act → grade → learn.

  Episode reset: VM snapshot + seeded UI variation + clean state
  Observation bundle: screenshot • cursor • (optional) tree / OCR • active app
  Policy (agent): plans actions from pixels, learns from reward
  Action executor: click • type • scroll • drag • hotkeys (guarded)
  Real desktop apps: email → downloads → browser portal → file picker
  Privileged grader: verifies receipt attached, values correct, status submitted
  Reward + traces: step cost • progress rewards • event logs • screenshots
  Learning stack: demos → BC → online RL (replay, entropy management)

Figure 2 — A practical architecture that stays honest: the policy sees screens, the environment grades with verifiable checks.

Python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

@dataclass
class Observation:
    screenshot_png: bytes               # pixels (primary)
    cursor_xy: Tuple[int, int]
    active_app: str                     # optional but useful
    accessibility_tree: Optional[Dict]  # optional (UIA/DOM)
    ocr_text: Optional[str]             # optional

@dataclass
class StepInfo:
    episode_id: str
    step: int
    events: Dict[str, Any]              # clicks/keys/app events
    safety: Dict[str, Any]              # policy violations, blocked actions
    grader: Dict[str, Any]              # intermediate grading signals

class ComputerUseEnv:
    def reset(self, seed: int) -> Observation:
        """Reset VM + accounts + filesystem to a deterministic snapshot."""
        raise NotImplementedError

    def step(self, action: Dict[str, Any]) -> Tuple[Observation, float, bool, StepInfo]:
        """Execute one UI action and return (obs, reward, done, info)."""
        raise NotImplementedError

This looks simple, but each field implies deep engineering choices. Let’s walk through the core design surfaces.
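
Before digging in, here is the rollout loop a trainer or evaluator would run against this interface. It is a sketch: the random policy exists only to smoke-test the environment, and a real policy would plan from the screenshot.

Python
import random

def rollout(env: "ComputerUseEnv", policy, seed: int, max_steps: int = 200) -> float:
    """Run one episode and return the total reward."""
    obs = env.reset(seed=seed)
    total, done, step = 0.0, False, 0
    while not done and step < max_steps:
        action = policy(obs)                     # e.g. {"type": "click", "x": 742, "y": 311}
        obs, reward, done, info = env.step(action)
        total += reward
        step += 1
    return total

def random_policy(obs: "Observation") -> dict:
    """Placeholder policy: clicks random coordinates (smoke-testing only)."""
    return {"type": "click", "x": random.randint(0, 1279), "y": random.randint(0, 719)}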

1) Reset and determinism: the unsexy part that makes or breaks training

In UI RL, reset is everything. If reset takes 2 minutes, you won’t get enough episodes. If reset is inconsistent, your training signal becomes noise.

A production-grade reset strategy usually includes:

  1. VM or container snapshots that restore the full desktop to a known state in seconds, not minutes.
  2. Clean per-episode accounts (inbox, expense portal) so no state leaks between episodes.
  3. A scrubbed filesystem (e.g., an empty Downloads folder) plus a single episode seed that drives all UI variation.

This is why companies like Mechanize explicitly market “RL environments” rather than “agents.” Their pitch is essentially: build digital workspaces that can be reset, graded, and scaled for training.
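
As a sketch of what that looks like in code, a reset restores a snapshot and then scrubs per-episode state. The `vm`, `accounts`, and `fs` helpers below are hypothetical stand-ins for whatever hypervisor and fixture tooling you actually use.

Python
import random

def reset_episode(vm, accounts, fs, seed: int) -> random.Random:
    """Hypothetical reset: restore a snapshot, scrub per-episode state, seed variation."""
    vm.restore_snapshot("clean-desktop-v3")       # hypervisor-specific; name is illustrative
    accounts.reset("agent@example.com")           # fresh inbox + empty expense portal
    fs.clear("/home/agent/Downloads")             # no leftover receipts between episodes
    return random.Random(seed)                    # one seed drives all UI variation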

2) Observation design: pixels, structure, and “ground truth”

Most real computer-use APIs in 2024–2025 are built around screenshots + mouse/keyboard. Anthropic’s Computer Use tool is explicitly a loop where you provide screenshots and execute mouse/keyboard actions in your sandboxed environment. OpenAI’s computer-use guide is the same loop: the model emits UI actions, you execute them, and you return screenshots.

But pixels alone are expensive to learn from. If you can expose structure (DOM, accessibility labels, UIA trees), you can dramatically improve grounding and reward evaluation. The trade-off is realism: you don’t want to give the agent a magical oracle that won’t exist in production.

There are two common regimes: pixel-only (max realism, max pain), and pixels + accessibility / DOM (the practical middle ground).

A common compromise is: pixels are the primary observation, but the env can optionally provide a structured tree for grounding and grading. Microsoft’s Windows automation stack (UFO and later UFO²) leaned hard into this by using Windows UI Automation (UIA) APIs for structured introspection and control rather than relying purely on coordinate clicks.

For an RL environment, this enables two big wins:

Variation is not optional (Dec 2025 lesson)

By late 2025, the community got loud about a specific failure mode: agents that look great on one UI skin collapse when the UI shifts by a few pixels. The cua-bench project (Dec 2025) is basically a thesis statement that visual diversity in the environment and data is what separates “demo agents” from “robust agents.”

So your observation design should include systematic variation knobs:

  1. Themes and skins (light, dark, high-contrast).
  2. Layout shifts (window position, sidebar state, element reordering).
  3. Resolution and DPI scaling.
  4. App and portal versions.
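
A small, seeded sampler makes those knobs explicit and reproducible. This is a sketch; the specific knob values and version strings are illustrative.

Python
import random
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class UIVariation:
    theme: str
    resolution: Tuple[int, int]
    portal_version: str
    sidebar_collapsed: bool

def sample_variation(seed: int) -> UIVariation:
    """Deterministically derive one UI configuration per episode seed."""
    rng = random.Random(seed)
    return UIVariation(
        theme=rng.choice(["light", "dark", "high-contrast"]),
        resolution=rng.choice([(1280, 720), (1440, 900), (1920, 1080)]),
        portal_version=rng.choice(["2025.10", "2025.12"]),   # illustrative versions
        sidebar_collapsed=rng.random() < 0.3,
    )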

3) Action design: coordinates, semantics, and the API-GUI hybrid

A “computer-use action space” is deceptively big. Even if you only support clicks and typing, you still have to decide:

  1. Whether clicks target absolute coordinates or named elements, and at what screen resolution.
  2. How scrolling is expressed (pixels, wheel “clicks,” or container-relative).
  3. How text entry, special keys, and modifier combinations are encoded.
  4. How drags and multi-step gestures are represented.

Here’s a minimal but surprisingly effective schema:

Json
{
  "type": "click" | "double_click" | "right_click" | "drag" | "scroll" | "type" | "key",
  "x": 742,
  "y": 311,
  "scroll_y": -580,
  "text": "23.14",
  "key": "ENTER",
  "modifiers": ["CTRL"]
}

This is close to what many “computer use” tools expose. It’s also where RL training gets ugly: coordinate clicking is noisy, and exploration is dangerous (a random click can delete, submit, or exfiltrate).
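
As a sketch, an executor for this schema can dispatch to a desktop-automation library. The version below uses pyautogui-style calls as one common option, omits drag for brevity, and assumes the guards discussed in the safety section wrap it in practice.

Python
import pyautogui  # one common desktop-automation option; others work too

def execute(action: dict) -> None:
    """Dispatch one schema action to the desktop (drag omitted for brevity)."""
    kind = action["type"]
    if kind == "click":
        pyautogui.click(action["x"], action["y"])
    elif kind == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif kind == "right_click":
        pyautogui.rightClick(action["x"], action["y"])
    elif kind == "scroll":
        pyautogui.scroll(action["scroll_y"])            # positive scrolls up
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.02)  # small delay per character
    elif kind == "key":
        keys = [m.lower() for m in action.get("modifiers", [])] + [action["key"].lower()]
        if len(keys) > 1:
            pyautogui.hotkey(*keys)
        else:
            pyautogui.press(keys[0])
    else:
        raise ValueError(f"unsupported action type: {kind}")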

The 2025 pattern: hybridize actions

The best-performing systems increasingly blend GUI actions with API actions when available. The ComputerRL paper (Aug 2025) formalized this as an “API-GUI paradigm”—use APIs for precise state transitions when possible, and fall back to GUI when not.

For our Expense Report env, that means:

  1. The agent still acts through the GUI: clicking the portal form, typing values, driving the file picker.
  2. The environment can call the portal’s backend to read form values, attachments, and submission status for grading.
  3. If your production deployment will actually have APIs available, expose them as actions too; if not, keep them grader-only.

The key rule: APIs can be used for grading even if you disallow them for acting. That keeps training realistic while making reward reliable.

4) Reward design: verifiable outcomes, stepwise signals, and anti-cheating

In long-horizon UI tasks, sparse rewards alone are brutal. If the only reward is “expense submitted,” the agent has to stumble into the right 40–120 step trajectory by luck. That’s not training—it’s a lottery.

The fix is a reward stack:

  1. A small per-step cost to discourage aimless clicking.
  2. Milestone rewards tied to verifiable checks (receipt downloaded, fields correct, attachment present).
  3. A terminal reward for a verified submission, with a penalty for submitting something wrong.

A practical grading function for Expense Report looks like:

Python
def grade_expense_portal(state) -> dict:
    """Returns verifiable checks without relying on fragile OCR."""
    return {
        "receipt_downloaded": state.fs.exists("/home/agent/Downloads/receipt.pdf"),
        "portal_open": state.browser.url.startswith("https://expenses.example"),
        "amount_ok": state.portal.form.amount == "23.14",
        "currency_ok": state.portal.form.currency == "USD",
        "date_ok": state.portal.form.date == "2025-12-18",
        "attachment_ok": state.portal.attachments.contains("receipt.pdf"),
        "submitted": state.portal.last_submission_status == "SUBMITTED",
    }

def reward(checks: dict, done: bool, step_cost: float = 0.01) -> float:
    r = -step_cost
    r += 0.05 if checks["receipt_downloaded"] else 0.0
    r += 0.10 if checks["amount_ok"] and checks["currency_ok"] else 0.0
    r += 0.10 if checks["attachment_ok"] else 0.0
    if done:
        r += 1.0 if checks["submitted"] else -0.2
    return r

“But isn’t that cheating?” Only if the agent can directly read these checks. In RL, it’s normal for the environment to compute reward from privileged state. The realism constraint is: the agent’s observation should still be what it gets in production (screenshots and whatever structured metadata you’ll actually have). The grader can be privileged; the policy doesn’t have to be.

Anti-cheating matters more than you think

If your env runs on real apps, agents will discover degenerate strategies:

  1. Hitting the portal’s backend (or a browser console) directly instead of completing the UI workflow.
  2. Editing page state through devtools so the grader’s checks pass without the work being done.
  3. Exploiting leftover state from a previous episode, such as a receipt already sitting in Downloads.

The environment must close these loopholes: locked-down user permissions, devtools disabled, egress restrictions, and “grader invariants” that validate the workflow actually occurred (e.g., event logs show file upload, not just an API call).
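
A cheap way to enforce the last point is to check the event log for the UI steps that should have produced the end state. This is a sketch; the event names are illustrative, not a real logging schema.

Python
def workflow_invariants(events: list) -> bool:
    """The end state only counts if the UI workflow actually produced it.

    `events` is the env's ordered event log; the event names are illustrative.
    """
    names = [e["name"] for e in events]
    try:
        attach = names.index("file_picker_confirmed")   # receipt attached via the UI
        submit = names.index("portal_submit_clicked")   # submit button actually clicked
    except ValueError:
        return False                                    # a required UI step never happened
    return attach < submit                              # attached before submitting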

5) Benchmarks and evaluation: stop measuring only success rate

In 2024–2025, the ecosystem converged on a few anchor benchmarks:

Figure 3 — The reliability cliff: a common 2025 finding is that simple UI tasks look fine while complex workflows collapse, which is why eval must be more than accuracy; measuring robustness and workflow reliability is mandatory for deployment.

UI-CUBE (Nov 2025) reports a sharp capability cliff: agents often score high on simple UI interactions, then collapse on complex workflows—exactly the failure mode you should design your environment to expose early.

Benchmark | What it tests | Why it’s hard
OSWorld | Open-ended tasks on real OS apps (cross-app workflows) | Long horizon, diverse apps, real UI state
Windows Agent Arena | Windows-native tasks across real apps | OS-level complexity, evaluation at scale
WorkArena | Enterprise workflows on ServiceNow | Huge DOMs, dynamic enterprise UI patterns
WebArena / VisualWebArena | Long-horizon web tasks on realistic sites | Web unreliability, multi-page state, tool use

But “success rate” alone hides what you care about in real deployments: reliability, cost, time, and how often a human has to intervene. A better evaluation report includes:

  1. Success rate broken down by UI variation (theme, resolution, app version), not just an overall average.
  2. Steps, wall-clock time, and compute cost per completed task.
  3. Human intervention rate and the failure modes that triggered it.

That robustness axis is the key “Dec 2025 upgrade.” If your eval doesn’t include UI variation, you are measuring memorization.
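
In code, that report is just an aggregation over episode records tagged with their UI variation. A sketch, assuming each record is a dict carrying success, step count, and intervention flags:

Python
from collections import defaultdict

def eval_report(episodes: list) -> dict:
    """Aggregate per-variation success plus the cost/reliability numbers that matter.

    Each episode is assumed to look like:
    {"variation": "dark-1080p", "success": True, "steps": 63, "interventions": 0}
    """
    n = len(episodes)
    by_variation = defaultdict(list)
    for ep in episodes:
        by_variation[ep["variation"]].append(ep)
    return {
        "success_rate": sum(ep["success"] for ep in episodes) / n,
        "success_by_variation": {
            v: sum(ep["success"] for ep in eps) / len(eps)
            for v, eps in by_variation.items()
        },
        "mean_steps": sum(ep["steps"] for ep in episodes) / n,
        "intervention_rate": sum(ep["interventions"] > 0 for ep in episodes) / n,
    }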

6) Safety: prompt injection is an environment problem

If an agent can browse the web, it can read adversarial text. If it can copy/paste, it can exfiltrate. If it can click “Submit,” it can do irreversible things. Safety is not a prompt—it’s a set of environment constraints.

At minimum, run training in a sandbox with:

  1. Network egress restricted to an allowlist of the apps and domains the task actually needs.
  2. Throwaway accounts and credentials with no access to real data or money.
  3. Gates (confirmation or hard blocks) on irreversible actions like deleting, paying, or sending.
  4. Full traces: every screenshot, action, and event logged for audit and replay.

This matches what major vendors recommend in their computer-use docs: run in isolated environments and treat the web as untrusted input.
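
A guard layer can sit between the policy and the action executor, enforcing the allowlist and gating irreversible actions before anything touches the desktop. This is a sketch: the URL prefixes, keyword list, and the `focused_name` field on the accessibility tree are all illustrative assumptions.

Python
# Illustrative allowlist and keyword gate; real deployments need stricter rules.
ALLOWED_URL_PREFIXES = ("https://mail.example", "https://expenses.example")
IRREVERSIBLE_KEYWORDS = ("delete", "pay", "wire", "send money")

def guard_action(action: dict, obs: "Observation") -> tuple:
    """Return (allowed, reason); blocked actions get logged instead of executed."""
    # Block typing a URL outside the allowlist (illustrative egress check).
    if action["type"] == "type" and action.get("text", "").startswith("http"):
        if not action["text"].startswith(ALLOWED_URL_PREFIXES):
            return False, "url_not_allowlisted"
    # Gate clicks whose target element name looks irreversible ("focused_name"
    # is a hypothetical field; use your accessibility tree's actual schema).
    tree = obs.accessibility_tree or {}
    target = str(tree.get("focused_name", "")).lower()
    if action["type"] == "click" and any(k in target for k in IRREVERSIBLE_KEYWORDS):
        return False, "irreversible_action_gated"
    return True, "ok"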

7) Training recipe (2025 best practice): demos → RL with replay → entropy management

Most teams converged on a three-phase pipeline:

  1. Demonstrations: record expert trajectories (humans, scripted oracles, or hybrid)
  2. Behavior cloning (BC): cold-start the policy so it can “do something”
  3. Online RL: optimize with verifiable reward + constrained exploration

Two notable 2025 ideas worth stealing:

  1. Replay of successful trajectories: ARPO (May 2025) keeps known-good rollouts in a buffer and replays them during policy optimization, which densifies signal on long, sparse-reward tasks.
  2. Entropy management at scale: ComputerRL (Aug 2025) pairs its API-GUI paradigm with large-scale distributed training and explicit entropy management so exploration doesn’t collapse over long runs.

The punchline: computer-use RL isn’t “just run PPO.” It’s systems engineering: rollout scale, reset speed, a grading pipeline, replay/selection, and safety constraints that let the agent explore without destroying your environment (or your company).
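
The shape of that pipeline, with successful trajectories replayed into online updates, looks roughly like this. It is a sketch: `bc_update`, `rl_update`, and `policy.act` are placeholders for your own training stack, not a specific library API.

Python
import random

def train(env, policy, demos, seeds, rl_iters=1000, replay_ratio=0.25):
    """Sketch: demonstrations -> behavior cloning -> online RL with success replay."""
    # Phase 1-2: behavior cloning on recorded expert trajectories.
    for demo in demos:
        policy.bc_update(demo)                      # placeholder supervised update

    # Phase 3: online RL; replaying past successes densifies the sparse reward.
    replay_buffer = []
    for _ in range(rl_iters):
        if replay_buffer and random.random() < replay_ratio:
            traj = random.choice(replay_buffer)     # re-use a known-good trajectory
        else:
            traj = collect_episode(env, policy, seed=random.choice(seeds))
            if traj["success"]:
                replay_buffer.append(traj)
        policy.rl_update(traj)                      # placeholder policy-gradient update

def collect_episode(env, policy, seed, max_steps=200):
    """Placeholder rollout that records actions/rewards and whether the task verified."""
    obs, steps, info = env.reset(seed=seed), [], None
    for _ in range(max_steps):
        action = policy.act(obs)
        obs, reward, done, info = env.step(action)
        steps.append((action, reward))
        if done:
            break
    submitted = bool(info and info.grader.get("submitted", False))
    return {"steps": steps, "success": submitted}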

Industry snapshot (Dec 2025): who’s betting on what

Mechanize: sell the environment, not the agent

Mechanize (a 2025 startup) is unusually explicit about the lever they’re pulling: they build and sell RL environments that simulate real work (starting with software engineering), and they position themselves as infrastructure for AI labs rather than a consumer agent product.

Anthropic: “Computer Use” as a tool loop

Anthropic’s Computer Use tool docs (updated through late 2025) show a clear model: you run a VM/sandbox, Claude requests screenshot + mouse/keyboard actions, and your app executes them. They version the tool and require beta headers (e.g., computer-use-2025-01-24 and later updates in 2025).

OpenAI: Operator → ChatGPT agent + a computer-use model for developers

OpenAI released Operator as a research preview in early 2025, then deprecated it and ended access by Aug 31, 2025, integrating the capabilities into a broader agent experience. For developers, OpenAI’s docs describe a computer-use model (“computer-use-preview”) accessed through the Responses API, again in the same execute-and-screenshot loop.

Google DeepMind: Project Mariner + Gemini “Computer Use” model

Google expanded access to Project Mariner around I/O 2025 (including parallel task execution in cloud VMs) and later introduced a Gemini 2.5 “Computer Use” model (Oct 2025) aimed at powering UI-interacting agents.

Microsoft: structured OS hooks (UFO → UFO² → UFO³)

Microsoft’s open-source UFO line is a strong signal: pure pixels are not enough for reliable Windows automation. UFO² emphasizes deep Windows integration (UIA APIs / COM), and UFO³ broadens to multi-device orchestration. That’s “environment + control plane” thinking.

What to build next

If you want to train a computer-use agent in 2026, don’t start with the model. Start with one environment that you can scale:

  1. Pick a single workflow with a verifiable end state (expense report, ticket triage, invoice entry).
  2. Build deterministic resets and a grader that can’t be gamed.
  3. Add UI variation (theme/layout/resolution/app versions) early.
  4. Record expert demos and bootstrap with BC.
  5. Add online RL with replay + safety constraints.

The uncomfortable truth: most “computer use agents” fail not because they can’t click—but because the world they train in is not designed to teach them.

References (selected)

  1. Mechanize: mechanize.work
  2. TechCrunch on Mechanize (Apr 2025): techcrunch.com/…/launches-controversial-startup…
  3. Anthropic Computer Use announcement (Oct 2024): anthropic.com/news/3-5-models-and-computer-use
  4. Anthropic Computer Use tool docs (updated through 2025): platform.claude.com/docs/…/computer-use-tool
  5. OpenAI computer use guide: platform.openai.com/docs/guides/tools-computer-use
  6. Operator deprecation notice (Aug 2025): help.openai.com/…/operator
  7. MIT Technology Review on Operator (Jan 2025): technologyreview.com/…/openai-launches-operator…
  8. Project Mariner overview: deepmind.google/models/project-mariner
  9. Gemini 2.5 Computer Use model announcement (Oct 2025): blog.google/…/gemini-computer-use-model
  10. Microsoft UFO project: github.com/microsoft/UFO
  11. OSWorld benchmark: os-world.github.io
  12. Windows Agent Arena: microsoft.com/…/windows-agent-arena
  13. WorkArena benchmark: servicenow.github.io/WorkArena
  14. WebArena paper: webarena.dev/static/paper.pdf
  15. ARPO (May 2025): arXiv:2505.16282
  16. ComputerRL (Aug 2025): arXiv:2508.14040
  17. UI-CUBE (Nov 2025): arXiv:2511.17131
  18. cua-bench (Dec 2025): huggingface.co/blog/cua-ai/cua-bench