Cosmos Policy Achieves 98.5% Robot Control with Single-Stage Video Adaptation

Researchers are tackling the challenge of equipping robots to learn complex physical tasks directly from video, and a new study led by Moo Jin Kim, Yihuai Gao, and Tsung-Yi Lin of NVIDIA, with co-authors including Yunhao Ge and Grace Lam, presents a significant step forward. Their approach, dubbed Cosmos Policy, streamlines the adaptation of pre-trained video models into effective robot control systems, achieving state-of-the-art results on both simulated and real-world benchmarks, including a remarkable 98.5% average success rate on the LIBERO simulation benchmark, without complex architectural changes or multi-stage training. By learning to generate robot actions within the video model’s latent space, Cosmos Policy leverages existing spatiotemporal understanding to plan successful action trajectories and even refine its performance through experience, promising robots that learn and adapt with greater ease and efficiency.

This innovative approach surpasses strong diffusion policies trained from scratch, as well as existing video-based policies and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. The team’s success stems from leveraging the pretrained video model’s ability to model complex, high-dimensional, multimodal distributions, representing actions alongside other modalities within a unified framework. This work establishes a novel paradigm for robotic control, moving beyond multi-stage training processes and complex architectural additions.

By directly fine-tuning the video model to simultaneously generate actions, future states, and values, Cosmos Policy simplifies the learning process while capitalizing on the rich spatiotemporal priors inherent in video data. The researchers release code, models, and training data, facilitating further exploration and development within the robotics community. The study unveils a powerful synergy between video generation and robotic control, opening new avenues for creating more intelligent and adaptable robots. Cosmos Policy’s ability to plan action trajectories based on predicted future states represents a significant advancement in robot autonomy, potentially enabling robots to tackle complex manipulation tasks with greater efficiency and reliability. This breakthrough not only improves performance on established benchmarks but also demonstrates promising results in real-world scenarios, paving the way for more versatile and capable robotic systems in the future.
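The joint-generation idea can be sketched concretely. The snippet below illustrates one plausible way to pack an action chunk and a scalar value into latent-sized slots alongside video latents, so that a single diffusion model denoises actions, future states, and values as one sequence; the function name, shapes, and slot layout are hypothetical illustrations, not the authors' code:

```python
import numpy as np

def pack_joint_sequence(video_latents, action_chunk, value, latent_dim=16):
    # Embed each action (and the scalar value) into a zero-padded,
    # latent-sized slot so it can ride alongside the video latents.
    def to_slot(vec):
        slot = np.zeros(latent_dim)
        slot[: len(vec)] = vec
        return slot

    action_slots = np.stack([to_slot(a) for a in action_chunk])  # (chunk, latent_dim)
    value_slot = to_slot(np.array([value]))[None]                # (1, latent_dim)
    # One sequence: future-frame latents, then action slots, then the value slot.
    return np.concatenate([video_latents, action_slots, value_slot], axis=0)

# 8 future-frame latents, a 4-step chunk of 7-DoF actions, one scalar value:
seq = pack_joint_sequence(np.zeros((8, 16)), np.zeros((4, 7)), 0.9)
# seq has 8 + 4 + 1 = 13 rows of width 16
```

Once everything lives in one sequence, a single denoising pass can generate all three modalities together, which is what lets the same network serve as policy, world model, and value function.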

Latent Diffusion for Robot Policy Learning

Casting action generation as latent diffusion allows the model to capture complex action distributions with remarkable efficiency. The study leveraged best-of-N sampling to plan: generating candidate actions, imagining their resulting future states, ranking those states by predicted value, and executing the highest-value action to increase success rates. Researchers collected robot demonstration data on the target platform and used it to fine-tune the pretrained video model, establishing visuomotor control and planning capabilities. The team evaluated Cosmos Policy on both simulation and real-world benchmarks, achieving state-of-the-art performance on LIBERO (98.5% average success rate) and RoboCasa (67.1% average success rate).
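The best-of-N planning loop described above can be sketched in a few lines. The policy, world model, and value function below are toy stand-ins (in the actual system they are all roles played by the one fine-tuned video model), but the control flow matches the description: sample N candidates, imagine their futures, rank by value, and act on the best:

```python
import numpy as np

def best_of_n_plan(policy_sample, world_model, value_fn, obs, n=8):
    """Best-of-N planning: sample N candidate actions, imagine each
    resulting future state, score it with the value function, and
    return the highest-value candidate for execution."""
    candidates = [policy_sample(obs) for _ in range(n)]  # N action samples
    futures = [world_model(obs, a) for a in candidates]  # imagined next states
    values = np.array([value_fn(s) for s in futures])    # predicted values
    return candidates[int(np.argmax(values))]            # act on the best plan

# Toy stand-ins (hypothetical) just to exercise the control flow:
rng = np.random.default_rng(0)
policy_sample = lambda obs: rng.normal(size=7)           # 7-DoF action sample
world_model = lambda obs, a: obs + a                     # trivial dynamics
value_fn = lambda s: -np.sum(s**2)                       # prefer states near zero
best = best_of_n_plan(policy_sample, world_model, value_fn, np.ones(7))
```

Because the value ranking happens on imagined states rather than real rollouts, the robot pays the cost of a mistake only in simulation inside the model, not on the physical platform.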

Furthermore, Cosmos Policy outperformed strong diffusion policies trained from scratch, video-based policies, and state-of-the-art vision-language-action models in challenging bimanual manipulation tasks, attaining a 93.6% average success rate. Notably, when enhanced with model-based planning, Cosmos Policy completed real-world manipulation tasks at a rate 12.5 percent higher on average. As the foundation for this breakthrough, the researchers used the Cosmos-Predict2-2B-Video2World model, a latent video diffusion model built on the Wan2.1 spatiotemporal VAE tokenizer and trained with the EDM denoising score matching formulation. This single, unified architecture functions simultaneously as the policy, the world model, and the value function, representing a significant advancement in robotic control.
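The EDM denoising score matching objective mentioned here follows a standard recipe: corrupt clean latents at a log-normally sampled noise level, regress the denoiser's output back to the clean latents, and weight the squared error by the noise level. A minimal sketch, assuming the commonly used EDM defaults (P_mean, P_std, sigma_data) and a toy stand-in denoiser rather than the actual 2B-parameter video model:

```python
import numpy as np

def edm_loss(denoiser, latents, P_mean=-1.2, P_std=1.2, sigma_data=0.5, rng=None):
    # Denoising score matching in the EDM parameterization: sample a noise
    # level sigma from a log-normal distribution, corrupt the latents, and
    # regress the denoiser output back to the clean signal.
    if rng is None:
        rng = np.random.default_rng(0)
    b = latents.shape[0]
    sigma = np.exp(rng.normal(P_mean, P_std, size=(b, 1)))    # per-sample noise level
    noisy = latents + sigma * rng.normal(size=latents.shape)  # corrupted latents
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    denoised = denoiser(noisy, sigma)                         # predicted clean latents
    return float(np.mean(weight * (denoised - latents) ** 2))

# Toy shrinkage "denoiser" standing in for the video model:
toy_denoiser = lambda x, sigma: x / (1 + sigma**2)
loss = edm_loss(toy_denoiser, np.random.default_rng(1).normal(size=(4, 16)))
```

The weighting term keeps the loss well scaled across noise levels, which is what makes a single network trainable over the full range of corruption strengths.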

Cosmos Policy excels in robotic control benchmarks

The team measured success across four LIBERO task suites, demonstrating consistently high performance with rates of 98.1%, 100.0%, 98.2%, and 97.6%, respectively. Data shows Cosmos Policy also excels in the RoboCasa simulation, reaching an average success rate of 67.1% with only 50 training demonstrations per task, significantly fewer than the 300 demonstrations required by competing methods. The breakthrough delivers a simple, single-stage post-training process, eliminating the complex architectural modifications typically required for adapting video models to robotics. Specifically, on LIBERO, Cosmos Policy’s 98.5% average surpassed Diffusion Policy (72.4%), Dita (92.3%), π0 (94.2%), and even the advanced UniVLA (95.2%).

The study meticulously evaluated performance using 101 trials per method across all tasks, ensuring a fair comparison with a fixed set of initial states. Furthermore, analysis of real-world ALOHA robot evaluations revealed Cosmos Policy’s superior handling of high-precision manipulation, such as grasping a ziploc slider bag with millimeter tolerance, where competing methods like π0.5 and OpenVLA-OFT+ frequently failed. Qualitative observations show Cosmos Policy effectively addresses challenges involving high action multimodality and precise grasps, while other models struggled with tasks like “put candies in bowl” and “put candy in ziploc bag”. The team’s world model predictions, refined through fine-tuning on policy rollout data, accurately predicted states and enabled more effective planning, ultimately contributing to increased episode success.

Latent Diffusion Enables Single-Stage Robot Control

This method bypasses the need for complex architectural modifications or multi-stage training procedures commonly found in existing robotics applications of video generation techniques. The authors acknowledge that Cosmos Policy currently performs best in conditions similar to the training demonstrations, though it exhibits strong performance even in out-of-distribution scenarios, with π0.5 showing slightly better results in those specific tests. Ablation studies reveal that removing auxiliary losses during training leads to a 1.5 point drop in success rate, and training the model from scratch results in a 3.9 point decrease, highlighting the importance of the pretrained model and the joint learning objective. Future research could explore extending Cosmos Policy to more complex robotic platforms and tasks, and investigating methods to further enhance its generalization capabilities to unseen environments and scenarios.

👉 More information
🗞 Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
🧠 ArXiv: https://arxiv.org/abs/2601.16163

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Brain Tumor Segmentation Achieves Improved Accuracy Using 2020 Multi-Modal Data (January 27, 2026)

Dsfedmed Achieves Efficient Federated Medical Image Segmentation Via Mutual Distillation (January 27, 2026)

Bootstrap Approximation Achieves High Accuracy for Hermitian One-Matrix Eigenvalue Distributions (January 27, 2026)