WebSTAR: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering

Yifei He*1, Pranit Chawla*2, Yaser Souri2, Subhojit Som2, Xia Song2
1 UIUC  2 Microsoft

The WebSTAR Dataset

WebSTAR (WebVoyager Step-Level Trajectories with Augmented Reasoning) is a large-scale dataset for training and evaluating computer use agents with step-level quality scores. It contains 13.3K trajectories with 100K total steps synthesized from OpenAI's Operator model. Unlike traditional trajectory-level filtering approaches, WebSTAR provides fine-grained scores and detailed reasoning for each action in an agent's trajectory, enabling more precise quality assessment and selective training on high-quality steps.

Abstract

Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions constituting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI's computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses the SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal process reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training, and we introduce two new datasets (WebSTAR, WebSCORE) along with a lightweight process reward model (StepRM) as practical tools to advance robust and efficient CUAs.

Data Synthesis Pipeline

Overview of our data synthesis pipeline. Our approach consists of three main stages:

  • (i) Trajectory collection: We begin by rolling out a teacher CUA in a Chromium environment, executing actions based on user instructions and capturing observations and screenshots.
  • (ii) Thought augmentation & step grading: For each step, the action and screenshot history are passed to a model that generates an intermediate thought to guide the action. A grading model receives the same input and evaluates the current step, assigning a score from 0 to 10.
  • (iii) Step-level filtering: We retain only high-scoring steps to ensure that the agent only learns from high-quality actions.
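Stages (ii) and (iii) can be sketched as a single pass over a rollout. This is a minimal illustration, not the paper's implementation: the `Step` fields, the `augment_fn`/`grade_fn` callables (standing in for the thought-augmentation and grading models), and the score threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: str    # path to the captured screenshot (hypothetical field)
    action: str        # action emitted by the teacher CUA
    thought: str = ""  # filled in by stage (ii): thought augmentation
    score: int = -1    # filled in by stage (ii): 0-10 grade from the grading model

def synthesize_training_steps(trajectory, augment_fn, grade_fn, threshold=8):
    """Stages (ii) and (iii): augment each step with a thought, grade it,
    and keep only high-scoring steps (threshold is illustrative)."""
    kept = []
    for step in trajectory:
        step.thought = augment_fn(step)  # generate intermediate reasoning
        step.score = grade_fn(step)      # grade the step on a 0-10 scale
        if step.score >= threshold:      # stage (iii): step-level filtering
            kept.append(step)
    return kept
```

In practice both callables would wrap model calls that see the full action and screenshot history, but the filtering logic itself reduces to this thresholding step.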

Results

WebVoyager Results (7B)
WebVoyager Results (32B)

The results in the WebVoyager figure and Mind2Web table highlight the clear advantages of step-level filtering for CUA SFT. In the WebVoyager benchmark, training on only the correct steps in correct trajectories consistently yields stronger performance across almost all domains compared to training on all steps within correct trajectories. Under this setting, our 7B step-level model achieves an average success rate of 47.0%, substantially outperforming the SoTA general-purpose CUA UI-TARS-1.5-7B (30.0%) despite using only SFT. This demonstrates that carefully filtered training data can rival and even surpass the performance of stronger baselines trained with more complex methods. The full numerical results and statistical significance test are presented in the appendix.

These findings demonstrate that step-level filtering is critical for effective offline training of CUAs. By removing incorrect or suboptimal intermediate actions while preserving correct steps from successful trajectories, the resulting training data better aligns with the step-level prediction objective. This leads to significant improvements in downstream performance, offering a simple yet powerful strategy for advancing the robustness and efficiency of CUA training.

Further Analysis

Scaling behavior of step-level vs. trajectory-level filtering.

Step-level score distribution from o4-mini.

Performance with data scaling. The figure shows the trend of average performance across the three benchmarks with increasing amounts of data under different model sizes. The results highlight that step-level filtering not only improves performance but also scales more effectively with additional data than trajectory-level filtering. In the low-data regime (around 25K examples), models trained on correct steps already achieve substantially higher success rates than those trained on full trajectories. As the number of examples increases, the advantage of step-level filtering becomes even more pronounced: both the 7B and 32B models continue to gain steadily, while trajectory-level filtering plateaus early, particularly for the 7B model.

Score distribution. The CDF illustrates the cumulative distribution of step-level scores. The distribution is computed only over the successful trajectories, as our final goal is to only retain the correct steps in the successful trajectories. Even within those successful trajectories, a majority of actions receive low scores: over half fall below 5. This confirms that trajectory-level filtering, which accepts whole rollouts as long as the final task succeeds, inevitably includes many flawed intermediate steps. Such noisy supervision can mislead agents during training. Step-level filtering directly addresses this issue by isolating the small fraction of high-quality actions (scores close to 10) while discarding the majority of low-quality ones. This reinforces our key insight that filtering at the step granularity is essential for constructing cleaner and more reliable training data for CUAs.
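The CDF statistic described above is straightforward to compute from the grades: for each score s on the 0-10 scale, take the fraction of steps graded at most s. A minimal sketch (the `score_cdf` helper is hypothetical, not from the paper's codebase):

```python
def score_cdf(scores):
    """Empirical CDF over integer step-level grades on the 0-10 scale:
    maps each score s to the fraction of steps with grade <= s."""
    n = len(scores)
    return {s: sum(1 for x in scores if x <= s) / n for s in range(11)}
```

Under this convention, the paper's observation that over half of the steps in successful trajectories score below 5 corresponds to `score_cdf(scores)[4] > 0.5`.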

BibTeX

@misc{he2026webstarscalabledatasynthesis,
      title={WebSTAR: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering},
      author={Yifei He and Pranit Chawla and Yaser Souri and Subhojit Som and Xia Song},
      year={2026},
      eprint={2512.10962},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.10962},
}