Robot Brains Boosted by 44,000 Hours of Human Video Learning

Scientists are tackling the challenge of creating robots capable of performing a wide range of tasks in unpredictable environments. Shenyuan Gao, William Liang, and Kaiyuan Zheng, all from NVIDIA, alongside colleagues such as Ayaan Malik, Seonghyeon Ye, and Sihyun Yu, present DreamDojo, a new foundation world model designed to learn from and simulate complex human interactions. The research is significant because it leverages 44,000 hours of human video footage, the largest dataset of its kind, to train a world model that understands interactions and dexterous controls even with limited labelled data, ultimately advancing the development of general-purpose robotic systems capable of operating in real-world scenarios.

Large-scale egocentric video pre-training enables versatile robotic skill acquisition

Scientists have unveiled DreamDojo, a foundation world model poised to revolutionise the development of adaptable robots. This innovative system simulates outcomes in diverse environments, addressing a critical challenge in robotics: the need for models that accurately predict the consequences of actions, particularly for complex, dexterous tasks.
DreamDojo overcomes limitations imposed by scarce data and action labels by leveraging a novel approach to learning and simulation. The research introduces a system trained on 44,000 hours of egocentric human videos, representing the largest video dataset ever used for pre-training a world model. This extensive dataset, termed DreamDojo-HV, encompasses a remarkably diverse range of daily scenarios, incorporating approximately 96 times more skills and 2,000 times more scenes than existing public datasets for robot learning.

To circumvent the scarcity of labelled actions, researchers implemented continuous latent actions, functioning as unified proxy actions that enhance knowledge transfer from the unlabelled video data. DreamDojo demonstrates a strong grasp of physics and precise action controllability following post-training on limited robot-specific data.

Furthermore, a newly devised distillation pipeline accelerates DreamDojo to a real-time prediction speed of 10.81 FPS, while simultaneously improving the consistency of predicted sequences. This achievement unlocks several key applications, including live teleoperation, policy evaluation, and model-based planning, offering a pathway to more efficient and robust robotic systems.

Systematic evaluation on challenging out-of-distribution benchmarks confirms the significance of this method for simulating open-world, contact-rich tasks, establishing a foundation for general-purpose robot world models. The resulting model can autoregressively predict future frames at a resolution of 640 × 480, enabling sustained interaction for over one minute in real time without visual degradation. This breakthrough paves the way for extensive policy evaluation without the need for real-world deployment, accelerating the development of advanced robotic capabilities and opening new avenues for human-robot collaboration.

Construction of a large-scale egocentric video dataset and foundation world model training

A 44,000-hour egocentric human video dataset, termed DreamDojo-HV, forms the foundation of this work, representing the largest video corpus to date for pretraining world models. This dataset surpasses existing resources by several orders of magnitude, encompassing a diverse range of activities with approximately 96 times more skills and 2,000 times more scenes than the most extensive public datasets for robot learning.

The research addresses the scarcity of action labels by introducing continuous latent actions as unified proxy actions, enabling self-supervised extraction of semantically meaningful actions between video frames. DreamDojo, the resulting foundation world model, was trained to acquire a comprehensive understanding of physics and achieve plausible simulations across diverse environments.
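
The paper does not spell out the implementation here, but the underlying idea can be sketched: infer a continuous latent action from two consecutive frames and train it by reconstructing the second frame from the first, so no manual action labels are required. In the PyTorch sketch below, all class names, layer sizes, and the conditioning scheme are illustrative assumptions rather than the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Sketch of a self-supervised latent action model: a continuous latent
    action is inferred from a pair of consecutive frames, then used to predict
    the second frame from the first. Names, sizes, and the conditioning scheme
    are illustrative assumptions, not the authors' code."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Action encoder: sees both frames (6 channels stacked) and compresses
        # the change between them into a continuous latent action vector.
        self.action_encoder = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )
        # Frame predictor: maps the current frame plus the latent action
        # to a prediction of the next frame.
        self.frame_conv = nn.Conv2d(3, 64, 3, padding=1)
        self.action_proj = nn.Linear(latent_dim, 64)
        self.out_conv = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, frame_t, frame_t1):
        z = self.action_encoder(torch.cat([frame_t, frame_t1], dim=1))
        feat = F.relu(self.frame_conv(frame_t))
        feat = feat * (1 + self.action_proj(z)[:, :, None, None])  # condition on the action
        return self.out_conv(feat), z

# The latent action must carry whatever information turns frame_t into frame_t1,
# which is what makes it usable as a proxy action for unlabelled video.
model = LatentActionModel()
frame_t, frame_t1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
pred, z = model(frame_t, frame_t1)
loss = F.mse_loss(pred, frame_t1)   # self-supervised reconstruction objective
```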

Model architecture and training recipes were rigorously designed to facilitate this learning process, enabling fine-grained controllability over continuous robot actions. To achieve real-time prediction, a distillation pipeline was implemented following the Self Forcing paradigm, significantly reducing computational cost for downstream applications.

This distillation process also enhances long-horizon consistency by efficiently modeling a short temporal context, allowing the model to autoregressively predict future frames at a resolution of 640 × 480 at 10.81 FPS for an arbitrary horizon. The resulting model can therefore be interacted with for over one minute in real time without visual degradation, facilitating applications such as live teleoperation and model-based planning. This methodology enables zero-shot generalization to unseen objects and novel environments, marking a significant advancement in general-purpose robot world models.
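
As a rough illustration of what such a real-time, short-context autoregressive loop involves, the sketch below uses a hypothetical `predict_next_frame` interface and a hypothetical `get_action` callback; neither name comes from the paper. The reported 10.81 FPS corresponds to a per-frame budget of about 92 ms.

```python
import time
from collections import deque

def interactive_rollout(world_model, first_frame, get_action,
                        horizon_frames=650, context_len=8):
    """Illustrative autoregressive rollout loop; `world_model.predict_next_frame`
    and `get_action` are assumed interfaces, not the paper's API.
    At roughly 10.81 FPS, 650 frames is about one minute of interaction."""
    context = deque([first_frame], maxlen=context_len)  # short temporal context keeps cost bounded
    frames = []
    budget_ms = 1000.0 / 10.81                          # ~92 ms per frame for real-time use
    for _ in range(horizon_frames):
        t0 = time.perf_counter()
        action = get_action()                           # e.g. a live teleoperation command
        next_frame = world_model.predict_next_frame(list(context), action)
        context.append(next_frame)                      # the model conditions on its own predictions
        frames.append(next_frame)
        elapsed_ms = (time.perf_counter() - t0) * 1000.0
        if elapsed_ms > budget_ms:
            print(f"frame took {elapsed_ms:.1f} ms, over the real-time budget")
    return frames
```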

Latent action discovery and distillation enable real-time robotic control from extensive egocentric video data

Researchers developed DreamDojo, a foundation world model learning interactions and dexterous controls from 44,000 hours of egocentric human videos. This dataset represents the largest video collection to date used for pretraining world models, encompassing diverse daily scenarios with a wide range of objects and skills.

To overcome limited action labels, the study introduced continuous latent actions as unified proxy actions, improving knowledge transfer from unlabelled videos. Following post-training on small-scale robot data, DreamDojo demonstrated precise action controllability and a strong understanding of its environment.

A distillation pipeline accelerated the model to a real-time speed of 10.81 frames per second, while also enhancing context consistency. This pipeline utilizes a warmup stage regressing student predictions to match ODE solutions generated by the teacher, minimizing error between the two models. The distillation process further incorporates a second stage where the student context is constructed using its own generated latents, aligning the training distribution with inference conditions.

This alignment is achieved through a Kullback-Leibler divergence loss, guiding the student distribution toward the teacher’s distribution. To improve robustness against compounding errors, the student generates N’ frames, exceeding the teacher’s horizon N, further minimizing discrepancies between training and testing.
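
A schematic of these two stages, with hypothetical `predict`, `solve_ode`, and `score` interfaces standing in for the actual teacher and student models, might look like the sketch below; the exact losses and horizons in the paper may differ.

```python
import torch
import torch.nn.functional as F

def warmup_distillation_step(student, teacher, context, actions):
    """Stage 1 (sketch): regress the student's prediction onto the ODE solution
    produced by the multi-step teacher. `solve_ode` and `predict` are assumed
    interfaces, not the paper's API."""
    with torch.no_grad():
        ode_target = teacher.solve_ode(context, actions)   # teacher's ODE trajectory endpoint
    student_pred = student.predict(context, actions)
    return F.mse_loss(student_pred, ode_target)

def self_context_distillation_step(student, teacher, context, actions,
                                   teacher_horizon, extra_frames):
    """Stage 2 (sketch): build the student's context from its own generated
    latents so the training distribution matches inference, then push the
    student's distribution toward the teacher's with a KL-style
    distribution-matching loss. The student rolls out N' = N + extra frames,
    beyond the teacher's horizon N, to reduce compounding error.
    (`context` is a list of latents; a single `actions` placeholder is reused
    for every step to keep the sketch short.)"""
    generated = []
    for _ in range(teacher_horizon + extra_frames):         # N' > N frames
        latent = student.predict(context, actions)
        context = context[1:] + [latent.detach()]           # context built from the student's own outputs
        generated.append(latent)
    generated = torch.stack(generated, dim=1)
    # KL-divergence gradient estimated from score differences: move the student's
    # samples toward regions the teacher assigns higher probability.
    with torch.no_grad():
        grad_direction = student.score(generated, actions) - teacher.score(generated, actions)
    return (generated * grad_direction).mean()
```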

Extensive experiments were conducted using a 700 million parameter spatiotemporal Transformer, trained for 400,000 steps with a batch size of 256 and a latent action dimension of 32. The model was pretrained on a mixture of human and in-house robot datasets, including Unitree G1, Fourier GR-1, AgiBot, and YAM, with videos temporally downsampled by a random factor of 1, 2, 3, or 4. Both a 2 billion parameter and a 14 billion parameter model were pretrained for 140,000 steps with an effective batch size of 1024, utilizing 256 NVIDIA H100 GPUs and a learning rate of 1.6 × 10⁻⁴.
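
For reference, these reported settings can be collected into a single configuration; the field names below are illustrative, while the values are the ones stated above.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Reported DreamDojo training settings gathered in one place.
    Field names are illustrative; values are taken from the text above."""
    # 700M spatiotemporal Transformer used in the experiments.
    ablation_params: int = 700_000_000
    ablation_steps: int = 400_000
    ablation_batch_size: int = 256
    latent_action_dim: int = 32
    # Large-scale pretraining of the 2B and 14B models.
    model_params: tuple = (2_000_000_000, 14_000_000_000)
    pretrain_steps: int = 140_000
    effective_batch_size: int = 1024
    num_gpus: int = 256                                  # NVIDIA H100
    learning_rate: float = 1.6e-4
    # Video frames are temporally downsampled by a random factor.
    temporal_downsample_factors: tuple = (1, 2, 3, 4)
    # In-house robot datasets mixed with human video during pretraining.
    robot_datasets: tuple = ("Unitree G1", "Fourier GR-1", "AgiBot", "YAM")
```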

Generalisation, real-time performance and physics understanding in DreamDojo

DreamDojo, a foundation world model capable of simulating dexterous robotics tasks, has been developed and demonstrated to generalise to previously unseen scenarios. The model was pretrained using a large-scale dataset of human videos, encompassing a wide range of everyday interactions and totalling 44,000 hours of footage.

To improve knowledge transfer and action control, continuous latent actions were implemented as proxy actions throughout the training process. Further enhancements included a distillation pipeline designed to enable stable, long-horizon interactions at real-time speeds, achieving 10.81 frames per second.

Extensive evaluations confirmed the significance of DreamDojo: the model shows an improved understanding of physics and more accurate action following in out-of-distribution scenarios, its predictions correlate positively with real-world evaluations, and its real-time interactivity supports applications such as live teleoperation and policy steering. Despite these advancements, DreamDojo exhibits limitations when simulating uncommon actions, such as slapping or fast waving.

Success rates within the simulated environment also tend to be higher than those observed in real-world applications, suggesting an inability to accurately model nuanced failures. Future research will focus on expanding the range of actions the model can accurately simulate, potentially through policy rollouts, and on optimising inference speed through engineering improvements. Additionally, extending the model to support multi-view simulation and exploring more effective fine-tuning strategies to preserve pretrained knowledge are areas for further investigation.

👉 More information
🗞 DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
🧠 ArXiv: https://arxiv.org/abs/2602.06949

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
