EgoWM Achieves 25-DoF Humanoid Prediction with Action-Conditioned World Models

Scientists are tackling the challenge of creating video generation models capable of accurately predicting future events following specific actions. Anurag Bagchi, Zhipeng Bao, and Homanga Bharadhwaj, from Carnegie Mellon University, alongside Yu-Xiong Wang (University of Illinois Urbana-Champaign), Pavel Tokmakov (Toyota Research Institute), and Martial Hebert (Carnegie Mellon University), present a novel method called Egocentric World Model (EgoWM) that leverages existing video data to build action-conditioned ‘worlds’ for controllable future prediction. This research is significant because it avoids costly training from scratch, instead repurposing vast internet video resources and injecting motor commands via simple conditioning layers, enabling remarkably realistic and generalisable predictions across diverse robotic embodiments, from simple mobile robots to complex humanoids. EgoWM demonstrably improves the physical correctness of predicted scenarios, achieving up to 80 percent better structural consistency and six times faster inference, even allowing robots to ‘walk through’ paintings.

Scientists are presenting Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pre-trained video diffusion model into an action-conditioned world model, enabling controllable prediction of the future. Rather than training from scratch, they re-purpose the rich world priors of Internet-scale video models, injecting motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully, while preserving generalization and realism. Their approach scales naturally across embodiments and action spaces, from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric, joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation, requiring only modest fine-tuning. To evaluate physical correctness independently of appearance, researchers introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with physics.

Action-conditioned prediction via repurposed video diffusion models

Scientists pioneered the Egocentric World Model (EgoWM), a novel method transforming pre-trained video diffusion models into action-conditioned world models for controllable future prediction. Rather than training from scratch, the research team repurposed rich world priors from Internet-scale video, injecting motor commands via lightweight conditioning layers to faithfully follow actions while preserving realism and generalisation. This approach scales across embodiments and action spaces, extending from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric, joint-angle-driven dynamics is substantially more challenging. The study engineered a system where action inputs are encoded and added to the model’s time-conditioning pathway through learned scale-and-shift transformations, operating directly in the joint-angle space of the robot or humanoid.
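To make the conditioning mechanism concrete, the sketch below shows one way such a layer could look in PyTorch. It is an illustrative reconstruction of the description above, not code from the EgoWM release; the module name `ActionConditioner` and the chosen dimensions are assumptions.

```python
# Minimal sketch of the action-conditioning idea described above, assuming a
# backbone whose blocks are modulated by a time-conditioning vector. Names and
# dimensions are illustrative, not taken from the EgoWM codebase.
import torch
import torch.nn as nn

class ActionConditioner(nn.Module):
    """Encodes a joint-angle action vector and injects it into the diffusion
    model's time-conditioning pathway via learned scale-and-shift parameters."""

    def __init__(self, action_dim: int, cond_dim: int):
        super().__init__()
        # Lightweight MLP encoder for raw joint angles (3-DoF up to 25-DoF).
        self.encoder = nn.Sequential(
            nn.Linear(action_dim, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, 2 * cond_dim),  # -> (scale, shift)
        )

    def forward(self, t_emb: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # t_emb:  (B, cond_dim) timestep embedding from the frozen backbone
        # action: (B, action_dim) motor command in joint-angle space
        scale, shift = self.encoder(action).chunk(2, dim=-1)
        # Modulate rather than replace, so the pre-trained pathway is preserved.
        return t_emb * (1.0 + scale) + shift

# Example: a 25-DoF humanoid command modulating a 512-d conditioning vector.
cond = ActionConditioner(action_dim=25, cond_dim=512)
print(cond(torch.randn(4, 512), torch.randn(4, 25)).shape)  # torch.Size([4, 512])
```

Modulating the timestep embedding rather than replacing it keeps the frozen backbone’s weights, and hence its world priors, intact.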

This design remains architecture-agnostic, avoiding alterations to the original model’s layers and functioning across diverse backbones, including both UNet-based and DiT-based diffusion models. Experiments employed a diffusion transformer model, leveraging its ability to generate coherent rollouts for both navigation and manipulation tasks with only modest fine-tuning. The technique reveals a capability not previously demonstrated by open-source world models: applicability to high-dimensional humanoid navigation and manipulation. To rigorously evaluate physical correctness independent of visual appearance, researchers introduced the Structural Consistency Score (SCS). SCS measures whether stable scene elements evolve consistently with physical laws and constraints.
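The article does not spell out the sampling loop, but a generic action-conditioned rollout can be sketched as follows. Here `denoise_step`, `encode`, and `decode` are hypothetical stand-ins for the DiT sampler and for the encoder and decoder of the spatio-temporal VAE described later in this article.

```python
# Hypothetical autoregressive rollout: each motor command conditions the
# denoising of the next latent, which is fed back as context. This is a
# generic sketch, not the paper's published sampler.
import torch

@torch.no_grad()
def rollout(denoise_step, encode, decode, first_frame, actions, steps=30):
    context = encode(first_frame)            # observed frame -> latent context
    frames = []
    for a in actions:                        # one action per predicted step
        z = torch.randn_like(context)        # start each step from noise
        for t in reversed(range(steps)):     # standard reverse diffusion
            z = denoise_step(z, t, context, a)
        frames.append(decode(z))             # latent -> pixel-space frame
        context = z                          # condition the next prediction
    return frames

# Toy stand-ins so the sketch runs end-to-end (identity "VAE", toy denoiser).
denoise = lambda z, t, ctx, a: z - 0.01 * (z - ctx)
ident = lambda x: x
out = rollout(denoise, ident, ident, torch.randn(1, 4, 8, 8),
              actions=[torch.randn(25) for _ in range(3)])
print(len(out))  # 3 predicted frames
```

Feeding each predicted latent back as context is one simple way to keep multi-step rollouts coherent; EgoWM’s actual context scheme may differ.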

EgoWM predicts video futures with 80% SCS gain

Scientists have developed Egocentric World Model (EgoWM), a novel method for controllable future prediction from video, transforming any pre-trained video diffusion model into an action-conditioned world model. The research team repurposed existing Internet-scale video priors and injected motor commands via lightweight conditioning layers, enabling faithful action following while maintaining realism and strong generalization across diverse robotic embodiments. Experiments revealed that EgoWM achieves coherent rollouts for both navigation and manipulation tasks with only modest fine-tuning, demonstrating a significant advancement in predictive modelling. The team measured the Structural Consistency Score (SCS) to independently evaluate physical correctness, discovering EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models.

Data shows this improvement signifies a substantial leap in the accuracy of predicted physical interactions within the simulated environment. Furthermore, tests show EgoWM delivers up to six times lower inference latency, enabling real-time predictions and robust generalization to unseen environments, including navigation inside paintings, a challenging test of perceptual understanding. Measurements confirm the system successfully predicts egocentric, joint-angle-driven dynamics even for complex 25-DoF humanoids, a setting substantially more challenging than 3-DoF mobile-robot prediction. Results demonstrate the framework’s scalability across embodiments and action spaces, ranging from simple mobile robots to highly articulated humanoids.

Scientists recorded that the method operates in a low-dimensional latent space, mapping videos into latent tensors and decoding predicted trajectories using a spatio-temporal VAE. The breakthrough delivers a versatile approach to world modelling, adapting existing pre-trained models to a wide variety of tasks without requiring extensive data collection or custom model design. The study introduces the Structural Consistency Score (SCS), a metric which explicitly disentangles action-following accuracy from visual fidelity, providing a more nuanced evaluation of world model performance. Specifically, SCS assesses whether stable scene elements evolve consistently with predicted actions.
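The article does not give the exact SCS formula, but its stated intent, checking that stable scene elements stay consistent across a predicted rollout, can be illustrated with a toy version. Everything below (the hand-drawn stability mask, the per-pixel drift measure, the 1/(1+x) normalisation) is an assumption for illustration, not the published metric.

```python
# Toy illustration of a structural-consistency check: measure how much the
# "stable" region of a predicted rollout drifts between consecutive frames.
# This is an assumed stand-in for SCS, not the metric from the paper.
import torch

def structural_consistency_score(frames: torch.Tensor,
                                 stable_mask: torch.Tensor) -> float:
    """frames: (T, C, H, W) predicted rollout; stable_mask: (H, W) bool mask
    marking scene elements (walls, floors) that should not change appearance."""
    diffs = []
    for t in range(frames.shape[0] - 1):
        # Per-pixel change between consecutive frames, averaged over channels.
        d = (frames[t + 1] - frames[t]).abs().mean(dim=0)  # (H, W)
        diffs.append(d[stable_mask].mean())                # restrict to stable region
    # Map mean drift to a score in (0, 1]: zero drift -> perfect score of 1.
    return float(1.0 / (1.0 + torch.stack(diffs).mean()))

frames = torch.rand(8, 3, 64, 64)
mask = torch.zeros(64, 64, dtype=torch.bool)
mask[40:, :] = True  # pretend the lower band of the image is static floor
print(structural_consistency_score(frames, mask))
```

A full metric would also need to account for ego-motion before comparing stable regions, since in the egocentric setting the camera itself moves under the predicted actions.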

EgoWM boosts physically consistent future predictions

Scientists have developed Egocentric World Model (EgoWM), a novel method for creating action-conditioned world models from pre-trained video diffusion models. This approach repurposes existing video understanding capabilities, injecting motor commands via lightweight conditioning layers to enable controllable future prediction without requiring training from scratch. EgoWM successfully scales across various robotic embodiments, from simple 3-DoF mobile robots to complex 25-DoF humanoids, demonstrating robust performance in challenging, high-dimensional action spaces. The research introduces the Structural Consistency Score (SCS) to assess the physical correctness of predicted futures, independently of visual realism.

Experiments reveal that EgoWM significantly improves SCS, by up to 80 percent, compared to previous state-of-the-art navigation world models, while also achieving substantially lower inference latency and maintaining strong generalization, even when applied to unrealistic visual environments such as navigating within paintings. The authors acknowledge limitations in the inherent realism of predictions when applied to drastically different visual domains, despite maintaining coherent motion and control. Future research could explore expanding the scope of action conditioning and refining the SCS metric for even more precise evaluation of world model performance. This work represents a step towards scalable, controllable, and generalizable world models, bridging the gap between passive visual prediction and active, action-driven forecasting, and potentially enabling more sophisticated and adaptable robotic systems. The findings demonstrate the power of leveraging pre-trained models and lightweight conditioning techniques to build robust and efficient world representations.

👉 More information
🗞 Walk through Paintings: Egocentric World Models from Internet Priors
🧠 ArXiv: https://arxiv.org/abs/2601.15284

Rohail T.

As a quantum scientist, I explore the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
