Researchers are increasingly recognising human behaviour as a scalable source of data for advancing artificial intelligence, but effectively utilising this data for complex dexterous manipulation has remained a significant challenge. Ruijie Zheng from NVIDIA, Dantong Niu from NVIDIA and the University of California, Berkeley, and Yuqi Xie from NVIDIA, together with colleagues, address this problem by presenting EgoScale, a novel framework for transferring human skills to robots using a large-scale egocentric dataset. This work, a collaboration between NVIDIA, the University of California, Berkeley, and the University of Maryland, demonstrates a clear correlation between the scale of human data, comprising 20,854 hours of labelled video, and both validation loss and subsequent real-world robotic performance. The team’s findings establish large-scale human data as a predictable and reusable source of supervision, enabling substantial improvements in dexterous manipulation capabilities and one-shot task adaptation, representing a significant step towards more adaptable and intelligent robotic systems.
For decades, teaching robots to manipulate objects with human-like skill has proved remarkably difficult. Now, a new approach demonstrates that vast amounts of everyday human movement data can reliably improve robotic dexterity, offering a pathway towards more adaptable and helpful machines in our homes and workplaces. Scientists are increasingly turning to human behaviour as a scalable source of data for developing intelligent systems.
Effectively using this wealth of information for complex robotic manipulation has remained a significant challenge; this work shows that large-scale human data can support the fine-grained control needed for truly dexterous, high-degree-of-freedom movements. Researchers have introduced EgoScale, a framework designed to use large-scale egocentric human data: videos recorded from a person’s viewpoint.
This approach centres on training a Vision Language Action (VLA) model, a type of artificial intelligence that combines visual understanding with language processing and action prediction. Trained on 20,854 hours of action-labelled video, a dataset more than 20 times larger than previous efforts, the model revealed a clear relationship between the amount of human data and its ability to accurately predict actions.
Validation loss, a measure of how well the model generalises to unseen data, decreased in a predictable, log-linear fashion as the dataset grew. The team found a strong correlation between this validation loss and actual performance on a physical robot, confirming that large-scale human data can serve as a reliable source of supervision for robotic learning.
A two-stage transfer process was devised, beginning with extensive pretraining on human data, followed by a lightweight mid-training phase using aligned human and robot demonstrations. This method allows for strong long-horizon manipulation, where the robot performs a sequence of actions over an extended period, and enables one-shot task adaptation, meaning the robot can learn a new skill from just a single demonstration.
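The two-stage recipe can be sketched in miniature. Everything below is a schematic stand-in: the stub update function, batch counts, and learning rates are invented for illustration and do not reflect the authors’ actual training configuration.

```python
def update(policy, batch, lr):
    # Stand-in for one gradient step on the VLA action-prediction loss;
    # here it only records what data the policy has seen.
    policy["steps"] += 1
    policy["data_seen"].append(batch["source"])
    return policy

def train(policy, batches, lr):
    for batch in batches:
        policy = update(policy, batch, lr)
    return policy

policy = {"steps": 0, "data_seen": []}

# Stage 1: large-scale pretraining on egocentric human data only.
human_batches = [{"source": "human"} for _ in range(5)]
policy = train(policy, human_batches, lr=1e-4)

# Stage 2: lightweight mid-training, co-training on aligned human and
# robot demonstrations collected in similar environments.
paired_batches = [{"source": s} for s in ("human", "robot") * 2]
policy = train(policy, paired_batches, lr=1e-5)

print(policy["steps"])  # total gradient steps across both stages
```

The key design point is simply that the policy carries over from the human-only stage into the lighter mixed human-and-robot stage, rather than starting from scratch.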
The resulting policy achieved a 54% improvement in average success rate compared to a baseline without pretraining, utilising a 22-DoF dexterous robotic hand. The ability of the policy to transfer to robots with fewer degrees of freedom is particularly compelling. By providing a reusable, embodiment-agnostic motor prior, large-scale human motion data appears to offer a foundational understanding of manipulation that transcends specific robotic hardware. This suggests a future where human demonstrations dramatically amplify the effectiveness of robot learning, treating humans as another scalable embodiment in the learning process. The potential for robots to learn complex skills from observing people is becoming increasingly tangible.
Log-linear scaling of human data predicts robot manipulation performance
Initial validation loss measurements revealed a clear log-linear relationship between the scale of human data and performance, demonstrating that as the volume of human action data increased, prediction loss consistently decreased. The research leveraged 20,854 hours of egocentric human video, exceeding prior datasets by a factor of 20. This extensive dataset enabled the discovery of predictable scaling laws governing human-to-robot transfer for dexterous manipulation.
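A log-linear law of this kind can be illustrated with a quick fit of validation loss against the logarithm of dataset size. The loss values below are invented for illustration; only the 20,854-hour figure comes from the text.

```python
import numpy as np

# Hypothetical validation-loss measurements at increasing human-data scales
# (hours of video), following the log-linear trend described in the text.
hours = np.array([100.0, 500.0, 2_000.0, 8_000.0, 20_854.0])
val_loss = np.array([0.92, 0.81, 0.72, 0.63, 0.57])

# Fit L(D) = a + b * log(D); a log-linear law shows up as a tight linear
# fit with negative slope b.
b, a = np.polyfit(np.log(hours), val_loss, deg=1)

# Extrapolate to a hypothetical larger dataset.
predicted_loss_100k = a + b * np.log(100_000.0)
print(f"slope b = {b:.4f}, predicted loss at 100k hours = {predicted_loss_100k:.3f}")
```

When such a fit is tight, each multiplicative increase in data buys a roughly constant drop in loss, which is what makes the gains from further scaling predictable.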
Once established, this correlation allowed for extrapolation of performance gains with even larger datasets. The work details an effective two-stage transfer recipe: large-scale human pretraining followed by a lightweight mid-training phase utilising aligned human and robot data. During the mid-training stage, the policy benefited from co-training with paired human and robot demonstrations in similar environments.
This approach yielded a 54% average improvement in success rate on a 22-degree-of-freedom dexterous robotic hand, when compared to a baseline without pretraining. The benefits extended beyond high-DoF hands, with learned representations also transferring effectively to robots with fewer degrees of freedom, such as the Unitree G1 tri-finger hand. Policies pretrained on human data achieved over 30% absolute improvement in success rate across evaluated tasks, again compared to the no-pretraining baseline.
For instance, with only a single robot demonstration, the policy attained up to 88% average success on the shirt folding task, despite the mid-training data containing only folding behaviours. The research also highlights the importance of action supervision. The model was supervised using human wrist motion and retargeted high-DoF hand joint actions, encouraging the extraction of manipulation-relevant information.
By focusing on relative wrist motion, the system learned to prioritise physically grounded action representations. These findings establish large-scale human data as a scalable and predictable source of supervision for learning dexterous manipulation policies.
Sensor data processing and hand pose estimation for robotic manipulation
A 22-degree-of-freedom (DoF) robotic hand served as the primary platform for evaluating dexterous manipulation policies. Raw sensor streams from egocentric cameras were processed to create a unified action representation suitable for pretraining and robot execution. Each human demonstration included egocentric RGB observations, estimated camera motion, and human hand pose derived from existing perception pipelines.
Camera pose at time t was represented as a transformation T^t_{w←c} in the special Euclidean group SE(3), encoding the orientation and position of the camera frame relative to the world frame. Human hand pose was modelled using 21 keypoints, each defined as a rigid transform H^t_{c,i} in the camera frame, with the wrist corresponding to index 1. The wrist pose in the world frame, W^t_w, was then calculated by composing the camera pose with the wrist transform.
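A minimal numpy sketch of this composition, together with the camera-invariant relative wrist motion used for supervision, using 4x4 homogeneous transforms. The example poses are invented; the real pipeline’s frame conventions and perception outputs are assumptions.

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Camera poses in the world frame (from SLAM) and the wrist keypoint in the
# camera frame (from hand-pose estimation) at two timesteps; values invented.
T_wc = [se3(rot_z(0.0), [0.0, 0.0, 0.0]), se3(rot_z(0.1), [0.02, 0.0, 0.0])]
H_wrist = [se3(np.eye(3), [0.30, 0.00, 0.5]), se3(np.eye(3), [0.32, 0.01, 0.5])]

# World-frame wrist pose: W^t_w = T^t_{w<-c} @ H^t_{c,wrist}
W = [T @ H for T, H in zip(T_wc, H_wrist)]

# Relative wrist motion, invariant to global camera movement:
# dW^t = (W^0_w)^{-1} @ W^t_w
dW = np.linalg.inv(W[0]) @ W[1]
print(np.round(dW[:3, 3], 4))  # relative wrist translation
```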
To achieve invariance to global camera movement, arm motion was expressed as relative wrist motion, computed as ΔW^t = (W^0_w)^{-1} W^t_w, measuring each wrist pose against the reference wrist pose W^0_w. This formulation captures local arm motion in a physically meaningful way, shared across both human demonstrations and robot executions. For finger-level control, the 21 human hand keypoints were retargeted into the joint space of the Sharpa hand using an optimisation procedure that respected joint limits and kinematic constraints.
This ensured preservation of human finger articulation during pretraining while aligning with the robot’s control interface. The methodology was designed to transfer effectively to robots with fewer DoF, suggesting the learned human motion provides a reusable motor prior. The work leveraged a large-scale mixture of egocentric human activity datasets, totalling 20,854 hours of video, encompassing diverse environments and tasks.
These recordings, spanning 9,869 scenes, 6,015 tasks, and 43,237 objects, provided broad coverage of manipulation behaviours. Complementing this, 829 hours of data from the EgoDex dataset, collected with accurate wrist and hand tracking using Apple Vision Pro, were also incorporated to refine the action representations. Although estimates from standard SLAM and hand-pose estimation pipelines were noisy, the sheer scale and diversity of the data proved effective for learning transferable representations.
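The keypoint-to-joint retargeting described above can be caricatured as a small constrained least-squares problem. The linear fingertip model, dimensions, and joint limits below are illustrative stand-ins: the actual procedure optimises through the robot hand’s forward kinematics, which is nonlinear.

```python
import numpy as np

# Fit 22 joint angles q so a stand-in linear fingertip model J @ q matches
# target positions x_human derived from the 21 human keypoints, while
# projecting onto joint limits after each gradient step.
rng = np.random.default_rng(0)
n_joints, n_targets = 22, 15                 # 22-DoF hand; 5 fingers x 3 targets
J = rng.normal(size=(n_targets, n_joints))   # stand-in fingertip Jacobian
x_human = rng.normal(size=n_targets)         # retargeting targets
q_min, q_max = -1.0, 1.5                     # joint limits (radians)

q = np.zeros(n_joints)
step = 0.01
for _ in range(500):
    residual = J @ q - x_human
    q -= step * (J.T @ residual)    # gradient of 0.5 * ||J q - x_human||^2
    q = np.clip(q, q_min, q_max)    # project onto the joint-limit box

print(np.linalg.norm(J @ q - x_human))  # fitting error after projection
```

Projected gradient descent is only one way to realise “optimisation respecting joint limits”; any constrained solver would serve, and the real objective presumably also weights keypoints to preserve finger articulation as described.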
Extensive video data unlocks improved dexterity in robotic manipulation
For years, roboticists have strived to bridge the gap between the fluidity of human movement and the often clumsy actions of machines. A new framework called EgoScale suggests a path forward, not through more complex algorithms or sophisticated hardware, but through sheer volume of data. By training robots on over 20,000 hours of human video demonstrating dexterous manipulation, researchers have demonstrated a clear correlation between data scale and robotic performance.
This isn’t about making robots mimic human actions; it’s about imbuing them with a broader understanding of how objects interact and how hands move in the world. Scaling up data isn’t a simple task, and previous attempts have struggled to translate increased information into improved robotic control. Unlike earlier efforts, EgoScale employs a two-stage transfer learning approach, first pre-training on the extensive human dataset and then fine-tuning with limited robot-specific data.
Since this method allows for adaptation to robots with varying numbers of degrees of freedom, it suggests that the learned “motor prior” is genuinely reusable, rather than tied to a specific robotic anatomy. The reliance on egocentric video, a first-person perspective, introduces a potential bias, as the robot’s view of the world will always differ from that of the human demonstrator.
One-shot task adaptation, where a robot can learn a new skill from a single demonstration, is now within reach. The work raises important questions about the nature of robotic intelligence. Is simply scaling up imitation enough, or will true dexterity require robots to develop a deeper understanding of physics and object affordances? Beyond the immediate improvements in robotic manipulation, this research could have implications for areas like virtual reality and augmented reality, where realistic hand movements are essential for immersive experiences. Instead of endlessly refining algorithms, the future of robotics may lie in collecting and curating the right kind of data, effectively turning humans into teachers for a new generation of machines.
👉 More information
🗞 EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
🧠 ArXiv: https://arxiv.org/abs/2602.16710
