Researchers are tackling the difficult problem of creating truly autonomous web agents, capable of navigating and interacting with the internet like a human assistant. Hang Ding from Shanghai Jiao Tong University, Peidong Liu from Sichuan University, and Junqiao Wang et al. present a new framework, DynaWeb, which utilises model-based reinforcement learning to train these agents within a simulated web environment. This approach circumvents the inefficiencies and risks associated with direct interaction with the live web, offering a scalable and cost-effective solution. Demonstrating significant performance gains on benchmarks like WebArena and WebVoyager, DynaWeb establishes the potential of ‘learning by imagination’ and paves the way for more powerful and efficient online agentic reinforcement learning.
DynaWeb training with expert trajectory interleaving improves web agents
Scientists are developing DynaWeb for efficient online reinforcement learning. The paradigm of artificial intelligence is rapidly shifting toward proactive, agentic systems that can autonomously execute complex, long-horizon tasks in open-ended environments. Large language models (LLMs) have emerged as a powerful backbone for such agents, enabling rich reasoning, flexible action generation, and natural language interaction. In the web domain, LLM-based agents have demonstrated strong capabilities in navigating real websites and accomplishing user-specified goals through multi-step interaction, fueled by advances in prompting, structured reasoning, and action abstractions [Yao et al., 2023, Zhou et al., 2024a, He et al., 2024].
Despite its promise, the effectiveness of online RL for web agents is fundamentally constrained by the cost and risk of real-environment interaction. Agents may trigger irreversible actions such as unintended purchases, account modifications, or data submissions, while also facing non-deterministic page dynamics, transient failures, and external interference. These challenges severely limit the practicality of pure online RL, rendering large-scale policy optimization both costly and hazardous in real-world web environments [Zhou et al., 2024a, Qi et al., 2025]. A natural direction is to replace expensive and risky real-environment interaction with a learned, controllable surrogate that can faithfully approximate web dynamics.
To this end, recent work has begun to explore web world models, learned simulators of web environments. So far, however, their role has been largely auxiliary: existing approaches use world models to synthesize offline trajectories for supervised fine-tuning or imitation-style training, decoupling model-generated experience from on-policy optimization [Fang et al., 2025a, Pahuja et al., 2025]. The DynaWeb authors instead revisit classical model-based reinforcement learning through the lens of modern web agents, treating the world model as a controllable synthetic web environment that can replace or augment costly real interaction.
By training web agents on a mixture of real and imagined experience, DynaWeb enables scalable, on-policy reinforcement learning through imagination while preserving the benefits of interactive learning. Crucially, DynaWeb combines two complementary sources of training experience: imagined rollouts sampled from the world model and real expert trajectories drawn from the training data. The expert trajectories are entirely independent of the world model and correspond to ground-truth web interactions. This simple but effective interleaving strategy preserves the on-policy learning signal and enables efficient online reinforcement learning with significantly fewer real-environment interactions. Recent advances in web agents are largely driven by (multimodal) large language models (LLMs) serving as the core decision-making backbone [Dubey et al., 2024, Jia et al.].
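As a rough illustration, the interleaving of imagined and expert experience described above could be sketched as follows. The function name, the fixed expert ratio, and the batch size are all hypothetical choices, not details taken from the paper:

```python
import random

def build_training_batch(imagined_rollouts, expert_trajectories,
                         expert_ratio=0.25, batch_size=8):
    """Mix on-policy imagined rollouts with ground-truth expert
    trajectories into one training batch (illustrative sketch)."""
    # Reserve a fixed fraction of each batch for expert trajectories,
    # which are independent of the world model; the remainder comes
    # from imagined rollouts generated inside the learned simulator.
    n_expert = max(1, int(batch_size * expert_ratio))
    n_imagined = batch_size - n_expert
    batch = (random.sample(imagined_rollouts, n_imagined)
             + random.sample(expert_trajectories, n_expert))
    random.shuffle(batch)
    return batch
```

In a sketch like this, the expert fraction acts as the regulariser the authors describe: even a small share of ground-truth interactions anchors the policy against world-model error.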
On top of these models, reasoning and interaction frameworks such as ReAct [Yao et al., 2023], MCP [Anthropic, 2024], and Cognitive Kernel [Zhang et al., 2024a] enable structured multi-step web actions. Web agents are commonly evaluated on interactive benchmarks including WebArena, WebVoyager, and others. Complementary to direct end-to-end optimization, WebRL [Qi et al., 2025] emphasizes self-evolving curriculum design and result-supervised feedback to continually generate training tasks and improve agent robustness. A representative end-to-end online RL approach is WebAgent-R1 [Wei et al., 2025], which optimizes multi-turn web interaction policies using outcome-based rewards and scalable trajectory sampling (e.g., multi-group GRPO [Shao et al., 2024]). Other methods adapt RL objectives or combine RL with additional supervision to better shape reasoning and planning behaviors.
DynaWeb boosts web agent learning via simulation
Experiments reveal that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agents, establishing the viability of training through imagination. The team measured performance on the WebArena and WebVoyager benchmarks, demonstrating substantial gains in task completion rates. The core of DynaWeb lies in its web world model, which predicts how web page states evolve in response to agent actions. This model, parameterized by a large language model, operates directly in the observation space, generating naturalistic web page representations. Researchers decomposed the task of predicting the next web state into two subtasks: predicting state change descriptions and then applying those descriptions to alter the current state.
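The two-stage decomposition above (first describe the state change, then apply it) might look like the following minimal sketch. The `StubWorldModel` class, the prompt wording, and `predict_next_state` are hypothetical stand-ins for the paper's LLM-parameterized world model, which operates on accessibility-tree observations:

```python
class StubWorldModel:
    """Placeholder for the LLM-parameterized world model; the real
    model generates text conditioned on the full prompt."""
    def generate(self, prompt: str) -> str:
        if "Describe" in prompt:
            return "the search box is filled and a results list appears"
        return "[1] RootWebArea 'Search results'"

def predict_next_state(world_model, instruction, obs, action):
    # Stage 1: predict a natural-language description of the state change.
    delta = world_model.generate(
        f"Instruction: {instruction}\nObservation: {obs}\n"
        f"Action: {action}\nDescribe the resulting page change:")
    # Stage 2: apply the description to produce the next observation
    # (an updated accessibility-tree representation of the page).
    next_obs = world_model.generate(
        f"Current page: {obs}\nChange: {delta}\n"
        f"Apply the change and output the updated accessibility tree:")
    return delta, next_obs
```

Splitting prediction into a change description plus an edit step keeps each generation short and grounded in the current page, rather than asking the model to regenerate an entire page state from scratch.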
The world model was trained using data from the StanfordNLP/NNetNav dataset, employing a data cleaning pipeline to ensure data quality. The loss function, L(φ) = Σ_{(I, o_t, a_t, r, Δ)} −log p_φ(r, Δ | I, o_t, a_t), trains the model to predict both the reasoning trace r and the subsequent state change Δ, conditioned on the task instruction I, the current accessibility tree o_t, and the executed action a_t. Experiments show that the learned world model serves as a reusable simulator, generating multi-step imagined trajectories without live web interaction. During training, the agent policy interacts with this simulated environment, sampling actions and receiving predicted observations.
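The summed negative log-likelihood objective above can be written out directly. Here `logp_fn` is a hypothetical callable standing in for the model's log-probability log p_φ(r, Δ | I, o_t, a_t):

```python
import math

def world_model_loss(logp_fn, dataset):
    """L(phi) = sum over tuples (I, o_t, a_t, r, Delta) of
    -log p_phi(r, Delta | I, o_t, a_t)  (illustrative sketch)."""
    return sum(-logp_fn(r, delta, instr, obs, act)
               for (instr, obs, act, r, delta) in dataset)
```

In practice this would be a token-level cross-entropy computed by the LLM's training framework; the sketch only makes the shape of the objective concrete.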
The team obtained task-level completion rewards via model-based self-assessment, assigning a scalar reward r(τ, q) ∈ [0, 1] based on task completion. These imagined rollouts, combined with real expert trajectories from the training data, are then used for policy gradient optimization. The resulting framework can generate vast quantities of rollout trajectories for efficient online reinforcement learning, effectively allowing the agent to “dream” and learn from simulated experience. This work establishes a promising pathway towards more robust and adaptable general-purpose AI assistants capable of navigating the complexities of the web.
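A REINFORCE-style surrogate over these mixed rollouts might look like the sketch below. Both `reward_fn` (the model-based self-assessment producing r(τ, q) ∈ [0, 1]) and `logprob_fn` (the policy's per-action log-probability) are hypothetical stand-ins; the paper's actual optimizer may differ:

```python
def policy_gradient_loss(rollouts, reward_fn, logprob_fn):
    """Trajectory-level policy gradient surrogate over mixed
    imagined and expert rollouts (illustrative sketch)."""
    total = 0.0
    for tau, query in rollouts:
        # Scalar task-completion reward from model-based self-assessment.
        r = reward_fn(tau, query)
        # Weight the trajectory's summed action log-probs by its reward;
        # minimizing this loss increases the likelihood of rewarded rollouts.
        total += -r * sum(logprob_fn(obs, act) for (obs, act) in tau)
    return total / len(rollouts)
```

Because rewards are assigned at the trajectory level, every action in a successful rollout is reinforced equally, which is what makes cheap, large-scale imagined rollouts useful as a training signal.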
DynaWeb surpasses baseline agents through web simulation
This approach addresses the inefficiencies, costs, and risks of training agents on the open web, utilising a learned ‘web world model’ to predict web page representations based on agent actions. The research establishes that the framework’s success stems not simply from increased model capacity or prompting techniques, but from the explicit training of a world model that captures the dynamics of web interactions. The authors acknowledge that a significant performance gap remains between DynaWeb and ideal performance, indicating that even strong large language model priors are insufficient as standalone simulators for imagination-driven reinforcement learning; future research should focus on refining the world model to more accurately reflect real-world web dynamics. The findings highlight the importance of rollout length and the regularising effect of incorporating real expert data during training, suggesting these are key principles for effective imagination-driven learning. This work points towards world-model-centric learning as a promising direction for developing more capable and efficient web agents.
👉 More information
🗞 DynaWeb: Model-Based Reinforcement Learning of Web Agents
🧠 ArXiv: https://arxiv.org/abs/2601.22149
