PRO-CUA: Process-Reward Optimization for Computer Use Agents

Overview

Self-evolution with Step-level RL

Computer use agents can automate complex digital workflows, but their training is limited by costly live environment interaction and scarce high-quality supervision. Filtered behavior cloning suffers from imitation bottlenecks and lacks negative learning signals, while trajectory-level reinforcement learning faces sparse rewards, ambiguous credit assignment, and expensive long-horizon GUI interaction.

PRO-CUA addresses these constraints with a process-reward optimization framework. The current policy first collects states through live rollouts. For each collected state, the agent samples diverse candidate actions, a process reward model (PRM) gives step-level feedback, and Group Relative Policy Optimization (GRPO) updates the policy using group-relative advantages. This creates dense credit assignment without golden answers or offline expert trajectories.

Key Contributions

A scalable recipe for training CUAs

Tailored CUA infrastructure design

PRO-CUA decouples environment interaction and model training, avoiding the systems burden of simultaneously performing agent rollout, live GUI interaction, and policy optimization.

On-policy state collection

The framework removes reliance on offline expert demonstrations and trains the agent on states sampled from its own execution distribution, including the difficult states it actually encounters.

Dense and flexible credit assignment via PRMs

PRO-CUA replaces sparse trajectory-level rewards with PRM-graded step-level GRPO, providing fine-grained supervision without requiring golden answers from expert demonstrations.

Method

Two stages, one self-evolving loop

01

On-policy state collection

The current policy interacts with live web environments at an elevated sampling temperature. The resulting task instructions, screenshots, and action histories form a state dataset drawn from the agent's own behavior.
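The collection stage can be sketched as a simple loop that snapshots the state before every action. This is a minimal sketch, not the authors' infrastructure: the `State` record, the `reset`/`step`/`policy` callables, the `max_steps` budget, and the default `temperature` value are all hypothetical stand-ins, and the toy functions exist only so the example runs without a browser.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """One collected state: what the policy saw before acting."""
    instruction: str
    screenshot: bytes
    history: List[str] = field(default_factory=list)


def collect_states(
    instruction: str,
    reset: Callable[[], bytes],                     # hypothetical: returns first screenshot
    step: Callable[[str], Tuple[bytes, bool]],      # hypothetical: returns (screenshot, finished)
    policy: Callable[[State, float], str],          # hypothetical: returns an action string
    temperature: float = 1.0,                       # placeholder value, not from the source
    max_steps: int = 15,                            # placeholder step budget
) -> Tuple[List[State], bool]:
    screenshot, history, states, finished = reset(), [], [], False
    for _ in range(max_steps):
        state = State(instruction, screenshot, list(history))
        states.append(state)                        # record the state before acting
        action = policy(state, temperature)
        screenshot, finished = step(action)
        history.append(action)
        if finished:
            break
    return states, finished


# Toy stand-ins for a live web environment and a policy.
_screens = [b"home", b"results", b"detail"]
_cursor = {"i": 0}


def toy_reset() -> bytes:
    _cursor["i"] = 0
    return _screens[0]


def toy_step(action: str) -> Tuple[bytes, bool]:
    _cursor["i"] = min(_cursor["i"] + 1, len(_screens) - 1)
    return _screens[_cursor["i"]], action == "answer"


def toy_policy(state: State, temperature: float) -> str:
    return "answer" if len(state.history) == 2 else "click"


states, finished = collect_states("Find the price", toy_reset, toy_step, toy_policy)
```

Each recorded `State` holds exactly the three ingredients named above (task instruction, screenshot, action history), so later stages can resample actions from it without touching the live environment.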

02

Step-level rollout

For each collected state, the policy samples multiple candidate thought-action pairs. These candidates are newly generated and are not treated as references for imitation.

03

PRM grading

A multimodal process reward model evaluates whether each proposed action is visually grounded, non-redundant, and useful for task progress, returning a binary step-level reward.

04

GRPO policy update

Group-relative advantages reinforce locally better actions and reduce pressure to imitate a single expert reasoning trace, allowing valid alternative action paths.
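As a rough illustration of the group-relative advantage, the sketch below mean-centers the binary rewards of the candidates sampled at one state. The optional standard-deviation scaling and the `eps` constant follow the commonly used GRPO formulation and are assumptions here; the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float],
    normalize_std: bool = True,   # assumption: scale by the group's std, as in common GRPO variants
    eps: float = 1e-6,            # assumption: guards against division by zero
) -> List[float]:
    """Mean-center (and optionally std-scale) rewards within one sampled group."""
    if not rewards:
        return []
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


# Four candidate actions sampled at the same state, with binary PRM rewards.
advantages = group_relative_advantages([1, 0, 0, 1])

# A group where every candidate gets the same reward carries no learning signal.
flat = group_relative_advantages([1, 1, 1])
```

Candidates graded positive receive a positive advantage and the rest a negative one, while a group with identical rewards yields all-zero advantages.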

Baselines

Why compare against FBC and rule-based Step-RL?

We compare PRO-CUA against two key self-evolving baselines that represent common ways to turn live CUA rollouts into training data. Filtered Behavior Cloning (FBC) keeps only successful trajectories and trains the policy to imitate their observed thought-action sequences. This is simple and stable, but it discards failed trajectories, overfits to easy successes, and gives no negative learning signal for recovering from mistakes.

Rule-based Step-RL uses successful trajectories as golden references, samples candidate actions at each retained state, and rewards candidates that match the reference action in type, target, and text input. This gives denser feedback than FBC, but it still depends on successful rollouts and golden actions. It can penalize valid alternative actions and cannot naturally learn from failed states where no verified reference action is available.
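The exact-match reward used by this baseline can be sketched as follows. The `Action` fields, the use of an element identifier as the target, and the case-insensitive text comparison are assumptions made for illustration; the source only states that type, target, and text input must match the reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    type: str                      # e.g. "click", "type", "press_key"
    target: Optional[str]          # hypothetical element identifier
    text: Optional[str] = None     # typed text, if any


def _norm(text: Optional[str]) -> str:
    return (text or "").strip().lower()


def rule_based_reward(candidate: Action, reference: Action) -> int:
    """Return 1 only if the candidate matches the golden action on all three fields."""
    return int(
        candidate.type == reference.type
        and candidate.target == reference.target
        and _norm(candidate.text) == _norm(reference.text)
    )


# The golden action clicks the search button; pressing Enter in the search box
# would also submit the query, but it cannot match the reference.
reference = Action("click", "search_button")
exact = Action("click", "search_button")
alternative = Action("press_key", "search_box", "Enter")

exact_reward = rule_based_reward(exact, reference)
alternative_reward = rule_based_reward(alternative, reference)
```

The second candidate illustrates the limitation described above: a functionally valid alternative scores zero simply because it differs from the single reference action.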

PRM Grading Pipeline

How PRO-CUA grades candidate actions

Since candidate actions in the optimization stage are not executed in the live environment, PRO-CUA asks the PRM to evaluate the proposed action directly from the current multimodal context. The screenshot is annotated at the target coordinates so the PRM can judge whether the action is visually grounded and useful for task progress.

**Process Reward Model grading pipeline.** The PRM receives a multimodal step context comprising the task instruction, the agent's action history, the proposed current action, and an annotated screenshot. For readability, the figure shows a zoomed-in crop, while the actual PRM input contains the full web interface. Based on this augmented context, the PRM reasons about whether the proposed action functionally advances the task and outputs a binary step-level reward.
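The grading step can be sketched as assembling one context per candidate and mapping the PRM's verdict to 0 or 1. Everything named here is hypothetical: `StepContext`, `grade_candidates`, and the `annotate` and `prm_judge` callables stand in for the screenshot annotator and the multimodal PRM, whose real prompt and output formats are not given in the text. The toy judge checks only redundancy, which is far simpler than the actual PRM's reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepContext:
    """The four inputs the PRM is described as receiving."""
    instruction: str
    history: List[str]
    proposed_action: str
    annotated_screenshot: bytes


def grade_candidates(
    instruction: str,
    history: List[str],
    screenshot: bytes,
    candidates: List[str],
    annotate: Callable[[bytes, str], bytes],     # hypothetical: marks the target coordinates
    prm_judge: Callable[[StepContext], bool],    # hypothetical: wraps the multimodal PRM
) -> List[int]:
    rewards = []
    for action in candidates:
        ctx = StepContext(
            instruction=instruction,
            history=list(history),
            proposed_action=action,
            annotated_screenshot=annotate(screenshot, action),
        )
        rewards.append(1 if prm_judge(ctx) else 0)   # binary step-level reward
    return rewards


# Toy stand-ins: the annotator tags the bytes, the judge rejects repeated actions.
def toy_annotate(screenshot: bytes, action: str) -> bytes:
    return screenshot + b"|" + action.encode()


def toy_judge(ctx: StepContext) -> bool:
    return ctx.proposed_action not in ctx.history


rewards = grade_candidates(
    "Search for laptops",
    ["click(search_box)"],
    b"png-bytes",
    ["click(search_box)", "type(search_box, 'laptops')"],
    toy_annotate,
    toy_judge,
)
```

Note that no candidate is executed: grading relies only on the recorded state and the proposed action.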

Reliability of the PRM Reward Signal

PRM feedback provides useful step-level supervision

A central question for PRO-CUA is whether process reward model feedback is reliable enough to optimize the policy. To isolate the reward source, we run a controlled ablation on WebVoyager in which the rule-based and PRM-based variants are trained on the same subset of states, all drawn from successful trajectories. This setting favors the rule-based baseline because golden reference actions are available and failed trajectories are excluded.

Comparison of reward sources for step-level RL. Under the same successful-trajectory training subset, PRM-based rewards outperform rule-based rewards, suggesting that visually grounded PRM feedback provides useful supervision for policy optimization.
Method	Reward type	Success rate (%)
Base model	/	27.5
Step-RL	Rule-based	34.7
Step-RL	Qwen3-VL-4B PRM	36.6
Step-RL	GPT-5-mini PRM	36.8

Both PRM variants outperform the rule-based reward despite using the same successful-state subset. The result indicates that visually grounded functional evaluation can be a stronger step-level training signal than exact matching to a single demonstration action.

**Step-level rewards assigned during training.** We plot moving averages of binary PRM rewards. GPT-5-mini assigns more conservative rewards, while Qwen3-VL-4B is more lenient on average. Despite this calibration gap, both PRMs achieve similar downstream policy performance, suggesting that GRPO is robust to reward strictness differences through group normalization.

The moving-average reward curves show a clear calibration gap: GPT-5-mini is stricter, while Qwen3-VL-4B assigns positive rewards more often. PRO-CUA does not require the absolute reward scale to be perfectly calibrated, because GRPO computes mean-centered advantages within each sampled group. The PRM mainly needs to distinguish better and worse candidate actions locally; the training signal is then aggregated across many states, sampled actions, and updates.
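To make the calibration argument concrete, consider one group of G candidates with binary rewards, k of which the PRM grades as positive. The derivation below assumes plain mean-centering; any standard-deviation scaling would only multiply every advantage by the same positive factor. The symbols G, k, and the hatted A are notation introduced here for illustration.

```latex
\hat{A}_i \;=\; r_i - \frac{1}{G}\sum_{j=1}^{G} r_j \;=\; r_i - \frac{k}{G},
\qquad
\hat{A}_i =
\begin{cases}
1 - \dfrac{k}{G} \;>\; 0 & \text{if } r_i = 1,\\[6pt]
-\dfrac{k}{G} \;<\; 0 & \text{if } r_i = 0,
\end{cases}
\qquad 0 < k < G.
```

A stricter PRM tends to lower k and a more lenient one tends to raise it, but in either case accepted candidates are pushed up and rejected ones are pushed down. Only groups with k = 0 or k = G, where every advantage is zero, contribute no signal.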

Main Results

Strong gains on live web benchmarks

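PRO-CUA uses no external expert steps in the controlled self-evolving setting and improves over filtered behavior cloning on all three benchmarks at both model scales. It also outperforms rule-based Step-RL in every setting except OnlineMind2Web at 4B, where rule-based Step-RL is slightly ahead (29.9 vs. 28.8).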

Task success rates (%) on online web benchmarks
Training paradigm	Method	Ext. expert steps	WebVoyager	Mind2Web-Live	OnlineMind2Web
External expert / closed data	UI-TARS-1.5-7B	Closed data	30.3	18.1	14.6
	WebSTAR-7B	100K	47.0	17.0	17.0
	WebSTAR-32B	100K	53.5	20.4	23.8
	GUI-Libra-4B	81K	-	-	20.0
	GUI-Libra-8B	81K	-	-	19.3
Self-evolving 4B	Qwen3-VL-4B-Instruct	0	27.5	18.1	16.7
	FBC	0	29.7	26.4	23.7
	Rule-based Step-RL	0	34.7	27.8	29.9
	PRO-CUA	0	42.4	34.7	28.8
Self-evolving 8B	Qwen3-VL-8B-Instruct	0	25.6	20.8	12.2
	FBC	0	31.8	23.6	26.9
	Rule-based Step-RL	0	33.8	25.0	26.2
	PRO-CUA	0	43.2	30.6	28.2

+12.7 WebVoyager gain over FBC, 4B

+7.7 WebVoyager gain over rule-based Step-RL, 4B

0 external expert steps in controlled runs
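The headline gains can be recomputed directly from the WebVoyager column of the results table. The snippet below is only an arithmetic check; the dictionary layout and the `gain_over` helper are illustrative names, not part of the released code.

```python
# WebVoyager success rates (%) copied from the results table above.
webvoyager = {
    "4B": {"FBC": 29.7, "Rule-based Step-RL": 34.7, "PRO-CUA": 42.4},
    "8B": {"FBC": 31.8, "Rule-based Step-RL": 33.8, "PRO-CUA": 43.2},
}


def gain_over(scale: str, baseline: str) -> float:
    """Absolute gain of PRO-CUA over a baseline, in percentage points."""
    row = webvoyager[scale]
    return round(row["PRO-CUA"] - row[baseline], 1)


for scale in webvoyager:
    for baseline in ("FBC", "Rule-based Step-RL"):
        print(f"{scale} vs {baseline}: +{gain_over(scale, baseline)}")
```

The gains are absolute differences in success rate (percentage points), not relative improvements.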

Data Utilization

Learning from more of each rollout budget

PRO-CUA's process rewards allow the training loop to use both successful and failed finished trajectories. This differs from FBC and rule-based Step-RL, which rely on successful rollouts: FBC needs complete successful trajectories for imitation, while rule-based Step-RL needs golden reference actions at each retained state.

**Data utilization across training iterations.** PRO-CUA consistently yields more usable step-level training data than FBC and rule-based Step-RL because process rewards allow learning from both successful and failed finished trajectories, while the baselines rely on successful rollouts.

In practice, agents may spend many steps stuck on the same page, so we apply lightweight filtering and retain only states from finished trajectories. Finished trajectories are not necessarily successful: the final answer may still be wrong. Because PRM grading does not require a golden next action, these failed-but-finished trajectories can still produce useful step-level supervision, increasing the fraction of on-policy interaction that becomes usable training data.
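The difference in data utilization can be sketched by counting which steps each method is able to train on. The `Trajectory` record and the toy rollout batch are hypothetical and the step counts are made up for illustration, not measured results. Here "finished" is taken to mean that the agent produced a final answer, which is an interpretation of the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    num_steps: int
    finished: bool   # assumption: the agent produced a final answer
    success: bool    # the final answer was judged correct


def usable_steps(trajectories: List[Trajectory], require_success: bool) -> int:
    """Count steps that can become training data under a given filtering rule."""
    return sum(
        t.num_steps
        for t in trajectories
        if t.finished and (t.success or not require_success)
    )


# Illustrative rollout batch (made-up numbers).
batch = [
    Trajectory(num_steps=8, finished=True, success=True),     # usable by all methods
    Trajectory(num_steps=12, finished=True, success=False),   # usable only with PRM grading
    Trajectory(num_steps=15, finished=False, success=False),  # filtered out everywhere
]

baseline_steps = usable_steps(batch, require_success=True)    # FBC, rule-based Step-RL
prm_steps = usable_steps(batch, require_success=False)        # PRM-graded training
```

In this toy batch the PRM-graded setting keeps the steps of the failed-but-finished trajectory that the success-only baselines must discard.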

Citation

BibTeX

@article{he2026pro,
  title={PRO-CUA: Process-Reward Optimization for Computer Use Agents},
  author={He, Yifei and Yang, Rui and Bai, Hao and Zhang, Tong and Zhao, Han},
  journal={arXiv preprint arXiv:2605.29119},
  year={2026}
}