Prophesying Reinforces Robot Action Policies, Achieving up to 30% Improvement in Few-Shot Adaptation and 17% Gains in Simulated Control

Vision-action policies represent a significant step towards creating robots that seamlessly understand and interact with the world, but current systems often struggle when faced with situations outside of their original training data. Jiahui Zhang, Ze Huang, and Chun Gu, from Fudan University, alongside Zipei Ma and Li Zhang, present a new approach that overcomes this limitation by combining learned world models with reinforcement learning. Their work introduces Prophet, a system pretrained on diverse robot data, which accurately predicts the outcomes of actions and allows for rapid adaptation to new robots, objects, and environments. This innovation, coupled with new algorithms for efficient policy reinforcement, delivers substantial performance gains on both simulated and real-world robotic tasks, representing a practical and data-efficient path towards more robust and adaptable robot control.

This research addresses the challenges of training robots to perform complex tasks, overcoming limitations in both data efficiency and the stability of optimisation processes. The team introduces a novel system that combines a learned world model with a reinforcement learning procedure specifically designed for action-based control, enabling robots to learn more effectively from limited data and adapt to new situations.

Prophet Learns and Guides Robotic Manipulation

Researchers developed ProphRL, a new training paradigm that couples a learned world model, named Prophet, with a vision-action policy for robotic manipulation. This approach improves data efficiency and optimisation stability, addressing shortcomings in traditional imitation learning and reinforcement learning methods. At each training step, the vision-action policy predicts a sequence of actions based on an initial image and instruction. Prophet then generates a video clip depicting the robot executing those actions, creating a closed-loop system where the predicted future frames refine the policy’s input.
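
The alternation described above can be summarised as a short loop between the policy and the world model. The sketch below is illustrative only: the class interfaces, method names, and horizon/chunk sizes are assumptions for exposition, not the paper's actual code.

```python
def closed_loop_rollout(policy, prophet, image, instruction, horizon=4, chunk_size=8):
    """Hypothetical sketch of one ProphRL rollout: the policy proposes an action
    chunk, Prophet 'prophesies' the resulting video, and the last predicted frame
    becomes the next observation. All interfaces here are assumed, not official."""
    observation = image
    trajectory = []
    for _ in range(horizon):
        # The policy predicts a short sequence of actions from the current frame and instruction.
        actions = policy.predict_actions(observation, instruction, num_actions=chunk_size)
        # Prophet generates a video clip of the robot executing those actions.
        predicted_frames = prophet.generate_video(observation, actions)
        trajectory.append((observation, actions, predicted_frames))
        # Close the loop: predicted future frames refine the policy's next input.
        observation = predicted_frames[-1]
    return trajectory
```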

This process allows for long-horizon planning, enabling the policy to execute complex manipulation tasks. Prophet itself is built upon a latent video diffusion pipeline, employing a video autoencoder to compress real video clips into compact latent representations. A diffusion model, utilising a DiT denoiser, learns to iteratively remove noise from these latent representations, reconstructing the original video. The denoiser is trained to accurately predict clean latent representations from noisy ones. Crucially, Prophet is conditioned on the predicted action, allowing it to generate videos that accurately reflect the robot’s response to specific commands.
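
In outline, a training step for such an action-conditioned latent video diffusion model might look as follows. This is a minimal PyTorch sketch assuming placeholder `video_encoder` and `dit_denoiser` modules and a simple interpolation-based noising schedule; the paper's exact objective and noise schedule may differ.

```python
import torch
import torch.nn.functional as F

def prophet_training_step(video_encoder, dit_denoiser, video_clip, actions):
    """Hedged sketch: encode a real clip into compact latents, corrupt them with
    noise at a random level, and train the DiT denoiser to recover the clean
    latents while conditioned on the commanded actions."""
    with torch.no_grad():
        z_clean = video_encoder.encode(video_clip)            # compact latent video
    t = torch.rand(z_clean.shape[0], device=z_clean.device)   # per-sample noise level in [0, 1]
    t_expanded = t.view(-1, *([1] * (z_clean.dim() - 1)))
    noise = torch.randn_like(z_clean)
    z_noisy = (1.0 - t_expanded) * z_clean + t_expanded * noise
    # The denoiser predicts the clean latents, conditioned on the noise level and actions.
    z_pred = dit_denoiser(z_noisy, t, actions)
    return F.mse_loss(z_pred, z_clean)
```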

To facilitate training across diverse datasets, the team standardised the representation of robot actions. Low-level control commands are represented as a seven-dimensional vector for each end-effector, encompassing translational and rotational changes, as well as gripper control. This representation is consistent across all datasets, even those with varying numbers of end-effectors, by padding trajectories with zero values as needed. The action is defined as a local change in position and orientation with respect to the previous end-effector frame, expressed as a rigid-body motion. Finally, the policy is optimised using Flow-action-GRPO (FA-GRPO), an adaptation of group-relative policy optimisation to VLA action generation, together with FlowScale, a technique that reweights denoising steps to stabilise gradients during training. The entire system is evaluated using an offline reward model that assesses the success of each trajectory.
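
A concrete illustration of this standardised action layout is sketched below. The per-effector ordering and the two-effector maximum are assumptions chosen for the example; only the seven-dimensions-per-end-effector structure and zero-padding follow the description above.

```python
import numpy as np

def standardize_action(effector_commands, max_end_effectors=2):
    """Pack per-end-effector commands into a shared 7-D-per-effector vector:
    three translational deltas, three rotational deltas, and one gripper value,
    each delta expressed relative to the previous end-effector frame.
    Robots with fewer arms are zero-padded so all datasets share one shape."""
    action = np.zeros(7 * max_end_effectors, dtype=np.float32)
    for i, (delta_pos, delta_rot, gripper) in enumerate(effector_commands):
        action[7 * i : 7 * i + 3] = delta_pos      # local translation change
        action[7 * i + 3 : 7 * i + 6] = delta_rot  # local rotation change
        action[7 * i + 6] = gripper                # gripper open/close command
    return action

# Example: a single-arm command, zero-padded to the two-effector layout.
single_arm = [(np.array([0.01, 0.0, -0.02]), np.array([0.0, 0.0, 0.05]), 1.0)]
print(standardize_action(single_arm))
```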

Prophet and FlowScale Stabilise Robot Learning

This work presents a new approach to training robots using vision-language-action (VLA) policies, addressing limitations of traditional imitation learning and the challenges of real-world reinforcement learning. Researchers developed Prophet, a large-scale, action-conditioned video world model pretrained on diverse robot data, which functions as a reusable simulator capable of adapting to new robots, objects, and environments. This simulator enables more effective reinforcement learning by providing a safe and efficient environment for policy optimisation. Building upon Prophet, the team introduced FlowScale, a technique that stabilises gradients during reinforcement learning within the world-model loop, and Flow-action-GRPO, a group-relative policy optimisation algorithm adapted to VLA actions.
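
The group-relative idea behind GRPO-style updates can be illustrated in a few lines. The snippet below is a generic sketch of computing within-group advantages from reward-model scores; the actual Flow-action-GRPO objective and FlowScale reweighting in the paper involve additional machinery not shown here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Generic sketch: for a group of imagined rollouts of the same task, scored
    by the offline reward model, normalise rewards within the group so the
    policy gradient favours rollouts that beat the group average."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts of the same instruction, two judged successful.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # roughly [ 1., -1.,  1., -1.]
```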

Experiments demonstrate that this combined system, termed ProphRL, achieves success gains of between 5 and 17 percent on established benchmarks and notably improves real-robot performance with gains of 24 to 30 percent across various VLA configurations. The authors acknowledge that the current system is computationally demanding, primarily due to the size of the Prophet world model. Future research directions focus on improving the efficiency of the model through architectural simplification, distillation techniques, feature caching, and specialised inference kernels, potentially enabling scaling to more complex tasks and longer horizons. These advancements promise to further enhance the capabilities of robots learning through vision and action.

👉 More information
🗞 Reinforcing Action Policies by Prophesying
🧠 ArXiv: https://arxiv.org/abs/2511.20633

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
