Researchers are tackling the persistent problem of limited data hindering the development of dexterous bimanual robotic manipulation. Juncheng Mu and Sizhe Yang from Shanghai AI Laboratory and The Chinese University of Hong Kong, and Yiming Bao from Tsinghua University, together with Hojin Bae, Tianming Wei and Linning Xu, present a novel framework, DexImit, designed to generate physically plausible robotic training data from readily available monocular human videos. This work is significant because it addresses the substantial ‘embodiment gap’ between human and robotic hands, enabling robots to learn complex manipulation skills, including tool use, long-horizon tasks and fine-grained movements, without requiring expensive and laborious real-world data collection. DexImit achieves this through an automated four-stage pipeline that reconstructs interactions, decomposes subtasks, synthesises trajectories and performs data augmentation, ultimately unlocking the potential of large-scale human video data for robot learning.
Researchers address data scarcity, which fundamentally limits the generalisation of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and time-consuming. The work presents DexImit, a system for generating a large and diverse dataset of bimanual manipulation tasks.
DexImit utilises a physics-based simulation environment and a novel task generation pipeline to create over 10,000 unique, fine-grained tasks. The generated tasks cover a broad range of manipulation skills, including object rearrangement, tool use, and assembly. A gallery showcases the breadth of manipulation tasks generated by DexImit. The specific contribution lies in providing a synthetic dataset to facilitate the training and evaluation of bimanual dexterous manipulation algorithms.
Leveraging human video data for scalable bimanual dexterous manipulation
Scientists increasingly recognise data scarcity as a critical challenge in dexterous manipulation. Robotic hands are highly articulated end-effectors capable of performing a wide range of real-world tasks, including contact-rich and interaction-intensive operations. However, collecting large-scale datasets for bimanual dexterous manipulation is significantly more challenging than for simple jaw-grippers due to the difficulty of teleoperation and the high cost of hardware.
Compared to real-world robotic data, human manipulation videos are readily available at a larger scale and cover a substantially broader range of task categories. Furthermore, the emergence of video generation models enables scalable generation of human manipulation videos from text prompts. Human videos inherently encode high-level task concepts while simultaneously capturing low-level manipulation actions, offering a promising avenue for scaling dexterous manipulation.
However, learning from human videos poses significant challenges. The most straightforward approach is to treat the human hand as a heterogeneous embodiment and use it directly for pretraining. Such methods suffer from a severe embodiment gap, as discrepancies in visual observations and action spaces substantially constrain cross-embodiment learning.
Another line of work reconstructs 3D hand-object keypoint flows or object trajectories from videos, then reproduces the demonstrations using motion planning or reinforcement learning. This paradigm effectively eliminates the embodiment gap, but most such methods rely on absolute depth information, while others require strict reconstruction accuracy to avoid reinforcement learning failures, fundamentally limiting scalability.
Moreover, existing methods struggle with challenging scenarios involving fast motions, occlusions, or complex interactions. To address these challenges, we propose a four-stage framework: depth-free reconstruction with near-metric scale; action-centric scheduling for long-horizon bimanual coordination; force-closure-based action generation for robust manipulation; and comprehensive data augmentation to cover complex real-world environments.
Through these designs, DexImit can operate without additional depth or camera information, offering a scalable solution to alleviate the scarcity of bimanual dexterous manipulation data in the real world. Specifically, our method first reconstructs hand-object trajectories from monocular human videos captured from arbitrary viewpoints, mapping them into a shared world coordinate system with near-metric scale.
We then perform video understanding and subtask decomposition, and introduce an Action-Centric Task Scheduling Algorithm to enable dynamic bimanual coordination. Finally, we synthesize grasps based on force-closure constraints and reconstructed hand poses, and reproduce the demonstrated interactions through motion planning.
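The Action-Centric Task Scheduling Algorithm itself is not spelled out in this excerpt. Purely as an illustration of dependency-aware bimanual coordination, a greedy two-arm scheduler might look like the following sketch (the task representation and all names are our assumptions, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    arm: str          # "left" or "right"
    duration: float   # seconds
    deps: list = field(default_factory=list)  # names of prerequisite subtasks

def schedule(subtasks):
    """Greedily assign start times so each arm runs one subtask at a time
    and every subtask starts only after all its dependencies finish."""
    finish = {}                               # subtask name -> finish time
    arm_free = {"left": 0.0, "right": 0.0}    # earliest free time per arm
    pending = list(subtasks)
    plan = []
    while pending:
        # pick the first subtask whose dependencies are all scheduled
        task = next(t for t in pending if all(d in finish for d in t.deps))
        start = max([arm_free[task.arm]] + [finish[d] for d in task.deps])
        finish[task.name] = start + task.duration
        arm_free[task.arm] = finish[task.name]
        plan.append((task.name, task.arm, start))
        pending.remove(task)
    return plan
```

Under this toy model, "grasp bottle" (left) and "grasp cup" (right) can run concurrently, while a "pour" subtask that depends on both starts only after the later of the two finishes.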
To support zero-shot deployment in complex real-world environments, we design a comprehensive data augmentation pipeline. It includes randomization of object pose and scale, as well as augmentation of camera pose and visual observations to ensure robust real-world generalisation. Building upon these designs, DexImit is capable of generating bimanual dexterous manipulation data across diverse scenarios, including tasks involving complex physical interaction, unleashing the potential of large-scale human manipulation videos for robot learning.
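As a concrete, if hypothetical, picture of what such randomisation can look like, the sketch below samples per-episode perturbations of object pose and scale plus camera pose and appearance; the parameter ranges are illustrative and not taken from the paper:

```python
import numpy as np

def sample_augmentation(rng: np.random.Generator) -> dict:
    """Sample one randomised variant of a demonstration.
    All ranges below are illustrative placeholders, not the paper's values."""
    return {
        # object pose: planar jitter plus a random yaw about the table normal
        "obj_xy_offset": rng.uniform(-0.05, 0.05, size=2),   # metres
        "obj_yaw": rng.uniform(-np.pi / 6, np.pi / 6),       # radians
        # object scale: uniform resize around the reconstructed mesh
        "obj_scale": rng.uniform(0.9, 1.1),
        # camera pose: small perturbation around the nominal viewpoint
        "cam_rot_deg": rng.uniform(-10.0, 10.0, size=3),
        "cam_trans": rng.uniform(-0.02, 0.02, size=3),       # metres
        # visual observation: lighting / colour jitter strength
        "brightness": rng.uniform(0.8, 1.2),
    }
```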
Extensive experimental results further demonstrate that policies trained on the resulting data generalise to real-world deployment in a zero-shot manner. In summary, our contributions are threefold: an automated data generation pipeline for bimanual dexterous manipulation that synthesises physically plausible data covering a broad range of tasks directly from videos; a comprehensive data augmentation system, spanning object pose and scale as well as camera pose and visual observation, that facilitates zero-shot deployment of policies on real robots without any real-world data; and extensive experiments demonstrating that DexImit can generate high-fidelity robot data for diverse tasks, including long-horizon, tool-use, and fine-grained manipulation, confirming its effectiveness in alleviating the long-standing data scarcity problem in dexterous manipulation.
A growing number of studies have focused on learning transferable dexterous manipulation skills from human videos with the emergence of large-scale human manipulation datasets and video generation models. A straightforward method is to treat the human hand as an end-effector and incorporate it directly into policy pretraining.
However, such methods face fundamental limitations due to substantial visual and action embodiment gaps. Other methods leverage human videos to train world models for planning; however, world models tailored specifically for manipulation remain relatively underexplored. A separate line of work reconstructs 3D trajectories directly from videos and uses the recovered motions to synthesize robot manipulation data, effectively bridging the embodiment gap.
Building on this paradigm, DexImit effectively mitigates compounding errors in data generation, enabling the synthesis of challenging long-horizon tasks involving complex physical interactions. To reconstruct hand-object interactions from videos without access to additional information, such as camera poses, intrinsic parameters, or depth, we must rely on 4D reconstruction from monocular observations.
VGGT and several recent methods perform feed-forward estimation of depth and camera parameters, enabling per-frame point cloud reconstruction, which can then be combined with hand-object segmentation and point cloud registration to achieve tracking. End-to-end tracking methods instead directly estimate foreground point cloud motion.
However, these approaches are often limited in accuracy, producing point clouds of insufficient quality for reliable manipulation. With recent advances in image-to-3D generation and hand pose estimation, single-image 3D mesh generation has become sufficiently accurate for downstream manipulation tasks.
As a result, hand-object interactions can be reconstructed based on the generated meshes. Finally, 6D pose estimation methods are used to estimate object poses at each video frame. Given the reconstructed hand-object poses, our goal is to generate robot manipulation data from these trajectories.
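Estimating per-frame 6D poses in the camera frame and mapping them into a shared world frame is, at its core, composition of homogeneous transforms: T_world_obj = T_world_cam @ T_cam_obj. A minimal sketch (helper names are ours):

```python
import numpy as np

def make_T(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 homogeneous transform from rotation R (3,3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def object_trajectory_in_world(T_world_cam, T_cam_obj_per_frame):
    """Map per-frame object poses from the camera frame into the shared world frame."""
    return [T_world_cam @ T_cam_obj for T_cam_obj in T_cam_obj_per_frame]
```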
Prior approaches typically rely on reinforcement learning or motion retargeting to reproduce hand-object interactions. However, these methods are sensitive to object scale variations and trajectory noise, suffer from substantial sim-to-real gaps, and remain inadequate for long-horizon bimanual dexterous manipulation.
In contrast, DexImit introduces a multi-arm long-horizon task scheduling algorithm and synthesizes structured, robust actions based on MANO-prompts and force-closure constraints, enabling reliable generation of physically plausible manipulation data. We propose DexImit, a framework that directly generates bimanual dexterous manipulation data from human videos.
DexImit employs a four-stage data generation pipeline: 4D trajectory reconstruction of hand-object interactions; subtask decomposition and bimanual scheduling; structured action generation to produce robot manipulation trajectories; and comprehensive data augmentation. The overall pipeline is illustrated in the accompanying figure.
This design enables DexImit to generate dexterous manipulation data at scale and supports zero-shot real-world deployment. The reconstruction of hand-object interactions involves the following steps: video processing and task understanding using a Vision-Language Model; frame-by-frame semantic segmentation to isolate objects involved in manipulation; object generation and hand pose estimation for each frame; 6D pose estimation for both the object and hand to obtain precise trajectories; and coordinate transformation from camera coordinates to a shared world frame. a) Video Processing: Given a video V = {I_i}_{i=0}^{K} with frame rate f, we sample it at a constant target frame rate f_t to obtain the target video V_t = {I_{⌊i·f/f_t⌋}}_{i=0}^{K_t}, where K_t = ⌊K·f_t/f⌋ − 1 is the last index of the sampled frames.
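This sampling rule can be sketched in a few lines (the function name is ours):

```python
import math

def resample_indices(num_frames: int, f: float, f_t: float) -> list[int]:
    """Source-frame indices of the resampled video: I_{floor(i*f/f_t)} for
    i = 0..K_t, with K_t = floor(K * f_t / f) - 1."""
    K = num_frames - 1                    # last index of the source video
    K_t = math.floor(K * f_t / f) - 1     # last index of the sampled video
    return [math.floor(i * f / f_t) for i in range(K_t + 1)]
```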
Next, we employ Qwen3-VL to perform video understanding on V_t. The model is tasked with identifying the set of objects S_o = {o_i | i = 0, 1, …, N_o} involved in the manipulation process. b) Segmentation: The data generation pipeline requires three types of masks: the object masks m_o = {m_{o_i}}_{i=0}^{N_o} for 3D generation and 6D pose estimation; the hand masks m_h = {m_{h_i} | i = 0, 1} for hand trajectory estimation, where h_0 denotes the left hand and h_1 the right hand; and the table mask m_t for determining the world coordinate system.
Specifically, we use Grounded SAM 2 to perform frame-by-frame segmentation to generate these three masks. c) Objects and Hands Reconstruction: To reconstruct hand-object interactions at near-metric scale, we first use the depth estimation method SpatialTracker v2 to estimate the unscaled depth D = {D_i}_{i=0}^{K_t} for each frame of the video. Since the input RGB video V_t does not contain any depth information, an initial scale estimation step is necessary.
Drawing inspiration from previous work, we note that the limited variance in human hand sizes provides a reliable prior for approximating metric scale. Building on this key insight, we use the human hand to estimate a scale factor for D. Taking the left hand h_0 as an example, with the first-frame hand mask m_{h_0}^0, we extract the first-frame hand point cloud P_{h_0}^0.
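The hand-size prior can be illustrated with a minimal sketch: divide an assumed canonical hand length by the hand's extent in the unscaled reconstruction. The constant and helper below are our assumptions, not values from the paper:

```python
import numpy as np

# Typical adult hand length (wrist to middle fingertip); an assumed prior.
CANONICAL_HAND_LENGTH_M = 0.18

def estimate_depth_scale(hand_points: np.ndarray) -> float:
    """Estimate a metric scale factor for an unscaled reconstruction from a
    hand point cloud of shape (N, 3), using the point cloud's largest
    diagonal extent as a crude proxy for hand length."""
    extent = np.linalg.norm(hand_points.max(axis=0) - hand_points.min(axis=0))
    return CANONICAL_HAND_LENGTH_M / extent
```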
Next, we use WiLoR to estimate the hand mesh M_{h_0}^0. While this provides the correct orientation of the hand in the camera coordinate system, it lacks position information. To jointly estimate both the hand pose and metric scale, we proceed with the following align-render-align steps: we first translate the centre of M_{h_0}^0 to align it with the centre of the hand point cloud P_{h_0}^0.
Reconstructing and Augmenting Four-Dimensional Hand-Object Interaction Data for Robotic Manipulation
DexImit, an automated framework, reconstructs four-dimensional hand-object interactions from monocular human manipulation videos without requiring additional information. The system facilitates the generation of large-scale robot data based on these videos, sourced either from the internet or video generation models.
This pipeline encompasses reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale accuracy, performing subtask decomposition and bimanual scheduling, and synthesizing robot trajectories consistent with demonstrated interactions. The research focuses on generating physically plausible data covering a broad range of tasks directly from videos, addressing the longstanding data scarcity problem in dexterous manipulation.
DexImit employs a comprehensive data augmentation pipeline, randomizing object pose and scale, alongside camera pose and visual observations, to ensure robust real-world generalisation. This augmentation system facilitates zero-shot deployment of policies on real robots without the need for any real-world data collection.
DexImit is capable of handling diverse manipulation tasks, including tool use such as cutting an apple, long-horizon tasks like making a beverage, and fine-grained manipulations like stacking cups. The framework’s design enables the synthesis of challenging long-horizon tasks involving complex physical interactions, mitigating compounding errors inherent in data generation.
The system reconstructs hand-object interactions by processing videos and employing a Vision-Language Model to understand the task, followed by semantic segmentation to isolate manipulated objects. Object and hand pose estimation are performed for each frame, with 6D pose estimation providing precise trajectories, and coordinate transformation aligning everything to a shared world frame. DexImit introduces a multi-arm, long-horizon task scheduling algorithm and synthesizes robust actions based on MANO-prompts and force-closure constraints, enabling reliable generation of physically plausible manipulation data at scale.
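The force-closure formulation itself is not detailed in this excerpt. A standard test asks whether the contact wrenches positively span the wrench space, i.e. whether the origin lies strictly inside the convex hull of the friction-cone edge wrenches. A minimal linear-programming sketch of that classic test (our own helper, assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import linprog

def is_force_closure(wrenches: np.ndarray) -> bool:
    """Check force closure for wrench vectors stacked as columns of a (d, m)
    array W. Force closure holds iff the origin is strictly inside conv(W):
    maximise delta subject to W @ lam = 0, sum(lam) = 1, lam_i >= delta."""
    d, m = wrenches.shape
    # variables x = [lam_0 .. lam_{m-1}, delta]; linprog minimises, so use -delta
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_eq = np.zeros((d + 1, m + 1))
    A_eq[:d, :m] = wrenches          # W @ lam = 0
    A_eq[d, :m] = 1.0                # sum(lam) = 1
    b_eq = np.zeros(d + 1)
    b_eq[d] = 1.0
    # lam_i >= delta  <=>  -lam_i + delta <= 0
    A_ub = np.hstack([-np.eye(m), np.ones((m, 1))])
    b_ub = np.zeros(m)
    bounds = [(0.0, 1.0)] * m + [(None, 1.0)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return bool(res.success and -res.fun > 1e-9)
```

For intuition: four planar wrenches pointing along ±x and ±y positively span the plane (force closure), whereas two along +x and +y do not.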
Automated data generation enables scalable bimanual dexterous manipulation learning
DexImit, an automated framework for bimanual dexterous manipulation, directly utilises readily available RGB videos from sources like the internet or video generation models to facilitate robot learning without requiring additional information such as depth or camera parameters. The system employs a four-stage pipeline beginning with the four-dimensional reconstruction of hand-object interactions, followed by subtask decomposition and bimanual scheduling.
Action generation is then based on force-closure constraints, culminating in comprehensive data augmentation to produce physically plausible data for a diverse range of manipulation tasks. Extensive experimentation demonstrates DexImit’s capacity to scale data generation and its effectiveness in challenging tasks involving complex hand-object interactions.
This addresses a longstanding limitation in the field of dexterous manipulation, namely the scarcity of suitable training data. The current implementation is limited in its ability to handle complex in-hand manipulation, and future work may focus on end-to-end data generation approaches to improve efficiency and accuracy. Adapting the method to accommodate deformable and articulated objects represents another potential avenue for development, necessitating more powerful three-dimensional generation models capable of simulating object deformation and articulated kinematics.
👉 More information
🗞 DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos
🧠 ArXiv: https://arxiv.org/abs/2602.10105
