Robots Learn to Walk and Manipulate Objects by Watching Humans Perform Tasks

Researchers are tackling the complex challenge of teaching humanoid robots to perform tasks that require both movement and manipulation in real-world settings. Modi Shi (The University of Hong Kong), Shijia Peng (Kinetix AI), Jin Chen (Shanghai Innovation Institute), and colleagues present EgoHumanoid, a novel framework that leverages readily available human demonstrations to train robots for loco-manipulation. This work is significant because it circumvents the need for extensive and costly robot teleoperation, instead co-training policies with abundant egocentric human data and a limited amount of robot experience. By introducing a systematic alignment pipeline that addresses differences in embodiment and viewpoint, the team demonstrates a substantial 51% performance improvement over robot-only training in unseen environments, paving the way for more adaptable and scalable humanoid robots.

The research addresses the challenge of data scarcity for humanoid loco-manipulation, which requires coordinating whole-body locomotion and dexterous manipulation.

Existing robot teleoperation methods are costly and limited to laboratory settings, prompting the exploration of human demonstrations as a scalable alternative. The core objective is to bridge the embodiment gap between humans and robots, accounting for differences in morphology, viewpoint, and motion dynamics.

The approach involves a systematic alignment pipeline encompassing hardware design and data processing. A portable system was developed for scalable human data collection, employing VR headsets, body trackers, and egocentric cameras. This system captures human demonstrations in varied scenarios without requiring robot hardware.

Complementary robot data is collected via VR-based teleoperation, providing embodiment-accurate supervision. The alignment pipeline consists of view alignment, which reduces visual discrepancies through depth-based reprojection and inpainting, and action alignment, which maps human motions into a unified, kinematically feasible action space for humanoid control.

These components facilitate co-training of a vision-language-action (VLA) policy on both data sources. Experiments on a Unitree G1 humanoid robot across four indoor and outdoor loco-manipulation tasks demonstrate a 20% average performance improvement from incorporating human data. The framework also enables generalisation to unseen environments, yielding a 51% performance gain.

Ablation studies scrutinise sub-task transfer, validate scaling effectiveness, and identify critical design choices within the alignment pipeline. The key contributions are the first demonstration of human-to-humanoid transfer for whole-body loco-manipulation, a principled embodiment alignment pipeline combining view and action alignment, and comprehensive real-world evaluation and analysis characterising effective behaviour transfer and the benefits of scaling human data. The researchers hope this study will encourage broader exploration of egocentric human data as a scalable pathway toward generalisable humanoid control; code and models will be released publicly.

EgoHumanoid data acquisition via virtual reality and motion capture

A portable system utilising virtual reality forms the foundation of the EgoHumanoid framework for co-training vision-language-action policies. This system facilitates the collection of both egocentric human demonstrations and robot data, bridging the embodiment gap between the two. Human data collection employs a PICO VR setup with five PICO Motion Trackers and a head-mounted ZED X Mini camera to record full-body motion and synchronised egocentric RGB images.

The system records 24 body keypoints and detailed hand poses with 26 keypoints per hand during human demonstrations in diverse indoor and outdoor settings. Robot data is gathered through VR-based teleoperation, where an operator uses handheld controllers to issue locomotion and wrist pose commands to the Unitree G1 humanoid robot.

These commands, derived from the controller-to-headset relative pose, control the robot’s navigation and manipulation actions. Crucially, the research intentionally omits wrist cameras, prioritising an egocentric-only setup in order to address the significant embodiment gaps and acknowledging the uncertain benefits of wrist views.

The data collection process demonstrates a pronounced efficiency advantage for human demonstrations, achieving approximately twice the speed of robot teleoperation. The collected data forms a combined dataset D, consisting of teleoperated robot demonstrations D_robot and egocentric human demonstrations D_human, each containing egocentric videos and synchronised whole-body actions.
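To make the data layout concrete, the sketch below models an episode roughly as described: synchronised egocentric frames and whole-body actions, with human recordings additionally carrying 24 body keypoints and 26 keypoints per hand. The class and field names are illustrative assumptions, not the authors' released data format.

```python
# Illustrative sketch of the combined dataset D = D_robot ∪ D_human.
# Field names and shapes are assumptions for exposition, not the actual format.
from dataclasses import dataclass, field
from typing import List, Literal, Optional
import numpy as np

@dataclass
class Frame:
    rgb: np.ndarray                               # egocentric RGB image, e.g. (H, W, 3)
    action: np.ndarray                            # synchronised whole-body action vector
    body_keypoints: Optional[np.ndarray] = None   # (24, 3), human recordings only
    hand_keypoints: Optional[np.ndarray] = None   # (2, 26, 3), human recordings only

@dataclass
class Episode:
    source: Literal["robot", "human"]             # teleoperated robot vs. egocentric human
    instruction: str                              # language instruction for the task
    frames: List[Frame] = field(default_factory=list)

def combine(robot_eps: List[Episode], human_eps: List[Episode]) -> List[Episode]:
    """D = D_robot ∪ D_human: a single pool used for co-training the VLA policy."""
    return robot_eps + human_eps
```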

This framework enables training a VLA model capable of loco-manipulation tasks across novel real-world environments, testing generalisation beyond the limitations of robot-centric data collection. The mid-sized (1.3m) Unitree G1 robot, with 29 degrees of freedom and Dex3 dexterous hands, serves as the hardware platform, allowing for whole-body coordination required for locomotion and manipulation.

Viewpoint and action alignment for cross-embodiment loco-manipulation learning

Researchers developed EgoHumanoid, a framework for co-training vision-language-action policies using human demonstrations and limited robot data for humanoid loco-manipulation. The system incorporates a systematic alignment pipeline addressing discrepancies in morphology and viewpoint between humans and robots.

A portable system was created for scalable human data collection, alongside protocols designed to improve data transferability. The view alignment component transforms human egocentric images to approximate robot camera viewpoints through a three-stage process involving MoGe-based 3D point map inference, scale-invariant depth map derivation, and latent diffusion-based inpainting.
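To picture how those three stages fit together, the sketch below covers only the geometric step: each pixel's inferred 3D point is transformed into an assumed robot camera pose, reprojected with assumed intrinsics, and the resulting holes are left for the inpainting model. The point-map inference (MoGe) and diffusion inpainting are stubbed out, and the pose offset and intrinsics are placeholder assumptions rather than the paper's values.

```python
import numpy as np

def reproject_to_robot_view(rgb, points, T_robot_from_human, K_robot):
    """Warp a human egocentric frame toward an approximate robot viewpoint.

    rgb:    (H, W, 3) uint8 image
    points: (H, W, 3) per-pixel 3D points in the human camera frame
            (e.g. from a MoGe-style point-map model; treated as given here)
    T_robot_from_human: (4, 4) assumed rigid transform between the two cameras
    K_robot: (3, 3) assumed robot camera intrinsics
    Returns the warped image and a hole mask to be filled by an inpainting model.
    """
    H, W, _ = rgb.shape
    pts = points.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_robot = (T_robot_from_human @ pts_h.T).T[:, :3]   # express points in robot frame

    z = pts_robot[:, 2]
    valid = z > 1e-3                                      # keep points in front of camera
    uvw = (K_robot @ pts_robot[valid].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    warped = np.zeros_like(rgb)
    hole_mask = np.ones((H, W), dtype=bool)               # True where nothing projected
    src = rgb.reshape(-1, 3)[valid][inb]
    warped[v[inb], u[inb]] = src                          # nearest-pixel splat; no z-buffer,
    hole_mask[v[inb], u[inb]] = False                     # so occlusions are approximate
    return warped, hole_mask                              # holes go to diffusion inpainting
```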

The view-alignment process generates complete RGB images, mitigating visual domain gaps caused by differing camera heights and perspectives. Action alignment maps human motions into a unified, kinematically feasible action space for humanoid control, parameterising upper-body actions as 6-DoF delta end-effector poses.

Human wrist poses are expressed in a pelvis-centric frame and smoothed using a Savitzky-Golay filter, with rotations filtered in the SO(3) tangent space to prevent ambiguities. Lower-body human demonstrations are converted into discrete velocity commands by applying Savitzky-Golay smoothing, estimating the instantaneous heading, and projecting displacements onto a local frame.
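As a rough illustration of the upper-body half of this alignment, the sketch below re-expresses wrist poses in a pelvis-centric frame, smooths positions and rotation vectors (an SO(3) tangent-space representation) with a Savitzky-Golay filter, and emits 6-DoF delta end-effector actions. The window length, polynomial order and frame conventions are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.spatial.transform import Rotation as R

def upper_body_deltas(wrist_pos_world, wrist_quat_world, pelvis_pose_world,
                      window=9, polyorder=3):
    """Map a human wrist trajectory to smoothed 6-DoF delta end-effector actions.

    wrist_pos_world:   (T, 3) wrist positions in the world frame
    wrist_quat_world:  (T, 4) wrist orientations as xyzw quaternions
    pelvis_pose_world: (T, 4, 4) pelvis poses; wrists are re-expressed in this frame
    """
    T_len = wrist_pos_world.shape[0]
    pos_p, rotvec_p = [], []
    for t in range(T_len):
        T_wp = np.linalg.inv(pelvis_pose_world[t])         # world -> pelvis-centric frame
        pos_p.append(T_wp[:3, :3] @ wrist_pos_world[t] + T_wp[:3, 3])
        R_w = R.from_quat(wrist_quat_world[t]).as_matrix()
        rotvec_p.append(R.from_matrix(T_wp[:3, :3] @ R_w).as_rotvec())

    # Savitzky-Golay smoothing; rotations are smoothed as rotation vectors, a
    # simplified stand-in for the tangent-space filtering described above.
    pos_p = savgol_filter(np.array(pos_p), window, polyorder, axis=0)
    rotvec_p = savgol_filter(np.array(rotvec_p), window, polyorder, axis=0)

    dpos = np.diff(pos_p, axis=0)                          # per-step position deltas
    rots = R.from_rotvec(rotvec_p)
    drot = (rots[1:] * rots[:-1].inv()).as_rotvec()        # per-step rotation deltas
    return np.concatenate([dpos, drot], axis=1)            # (T-1, 6) delta EE actions
```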

Yaw rate is computed from inter-frame heading changes, and continuous commands are downsampled to 20 Hz and quantised into discrete bins. A binary stand/squat primitive is derived by thresholding inter-frame changes in pelvis height, aligning with the robot teleoperation interface. Gripper actions are represented as a binary variable, with human grasping states inferred from finger-level curvature and downsampled to 20 Hz.
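The lower-body conversion can be sketched in the same spirit: smooth the pelvis trajectory, project displacements into the local heading frame, derive a yaw rate, downsample to 20 Hz, quantise into discrete bins, and threshold pelvis-height changes for the stand/squat primitive. The source frame rate, bin edges and thresholds below are placeholder assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

def lower_body_commands(pelvis_xy, pelvis_z, heading, src_hz=60, out_hz=20,
                        vel_bins=(-0.4, -0.1, 0.1, 0.4), squat_thresh=0.03):
    """Convert a human pelvis trajectory into discrete humanoid velocity commands.

    pelvis_xy: (T, 2) planar pelvis positions; pelvis_z: (T,) pelvis heights
    heading:   (T,) torso yaw in radians
    Bin edges, thresholds and rates are illustrative, not the paper's values.
    """
    xy = savgol_filter(pelvis_xy, 9, 3, axis=0)            # Savitzky-Golay smoothing
    dt = 1.0 / src_hz
    vel_world = np.diff(xy, axis=0) / dt                   # world-frame planar velocity

    h = heading[:-1]
    vx = np.cos(h) * vel_world[:, 0] + np.sin(h) * vel_world[:, 1]    # local frame
    vy = -np.sin(h) * vel_world[:, 0] + np.cos(h) * vel_world[:, 1]
    yaw_rate = np.diff(np.unwrap(heading)) / dt            # inter-frame heading changes

    step = src_hz // out_hz                                # downsample to 20 Hz
    vx, vy, yaw_rate = vx[::step], vy[::step], yaw_rate[::step]

    bins = np.asarray(vel_bins)
    vx_q, vy_q = np.digitize(vx, bins), np.digitize(vy, bins)   # discrete bins

    dz = np.diff(pelvis_z)[::step]
    stand_squat = (dz < -squat_thresh).astype(int)         # 1 = squat-down primitive
    return vx_q, vy_q, yaw_rate, stand_squat
```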

Extensive real-world experiments demonstrated that co-training with robot-free egocentric data significantly outperformed robot-only baselines, yielding a 51% performance gain in unseen environments. The policy, adapted from a state-of-the-art VLA model, receives egocentric RGB observations and language instructions, outputting actions in the unified action space. Multi-source data sampling balances the dataset, addressing the potential imbalance between the volume of human and robot data available for training.
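One common way to realise the multi-source balancing mentioned above is to weight samples inversely to the size of their source dataset, so human and robot episodes are drawn at a chosen ratio regardless of their raw counts. The use of PyTorch's WeightedRandomSampler and the 50/50 default ratio below are assumptions for illustration, not the authors' training recipe.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def balanced_loader(robot_ds, human_ds, batch_size=64, robot_fraction=0.5):
    """Co-training loader drawing robot and human samples at a fixed ratio."""
    combined = ConcatDataset([robot_ds, human_ds])
    n_robot, n_human = len(robot_ds), len(human_ds)

    # Per-sample weights: each source gets a fixed share of the sampling mass,
    # independent of how many episodes it actually contains.
    weights = torch.tensor([robot_fraction / n_robot] * n_robot +
                           [(1.0 - robot_fraction) / n_human] * n_human)

    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```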

Humanoid loco-manipulation via aligned egocentric demonstration transfer

EgoHumanoid represents a novel framework enabling the transfer of learning from human demonstrations to humanoid robots for loco-manipulation tasks. This system co-trains a vision-language-action policy utilising readily available egocentric human demonstrations alongside a limited quantity of robot-sourced data, allowing humanoids to perform complex tasks in varied real-world settings.

The core innovation lies in a systematic alignment pipeline designed to bridge the gap between human and robot capabilities, addressing differences in physical structure and observational perspective. This approach incorporates view alignment to minimise visual discrepancies arising from differing camera positions and action alignment to translate human movements into a feasible action space for humanoid control.

Experiments reveal a substantial 51% performance improvement in previously unseen environments when incorporating human demonstration data, exceeding the capabilities of systems trained solely on robot data. The research highlights the effectiveness of leveraging abundant human data for training, while acknowledging limitations regarding orientation ambiguity, scaling laws, and the complexity of whole-body control.

The authors identify future research directions focused on scaling the system with improved egocentric hardware and exploring more expressive control mechanisms. Acknowledged limitations include challenges with representing object orientation and fully understanding how performance scales with increasing amounts of human data. Despite these constraints, the EgoHumanoid framework establishes a promising pathway towards more generalisable and adaptable humanoid control systems, encouraging further investigation into the use of egocentric human data for robotics.

👉 More information
🗞 EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration
🧠 ArXiv: https://arxiv.org/abs/2602.10106
