Reconstructing accurate human motion from video remains a significant hurdle in fields like augmented reality, robotics, and digital content creation, particularly when subjects are partially obscured. Zhiyin Qian from ETH Zürich, together with Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, and Siyu Tang (also ETH Zürich), presents a novel solution in 'Masked Modeling for Human Motion Recovery Under Occlusions'. Their framework, MoRo, tackles occlusions through generative masked modeling, enabling robust and efficient end-to-end motion recovery from RGB video. Unlike existing methods that struggle with missing data or suffer from slow processing, MoRo delivers substantial gains in both accuracy and realism, reaches real-time inference at 70 FPS, and remains competitive in clear, non-occluded scenarios, a considerable step forward in human motion capture technology.
By employing masked modeling, the system naturally handles occlusions while enabling end-to-end inference, a significant improvement over slower, more complex approaches. This innovative approach incorporates a trajectory-aware motion prior trained on MoCap datasets, an image-conditioned pose prior trained on image-pose datasets to capture diverse per-frame poses, and a video-conditioned masked transformer that fuses these priors. The transformer is then finetuned on video-motion datasets, integrating visual cues with motion dynamics for robust inference, effectively learning from multiple sources to enhance performance.
This allows MoRo not only to reconstruct motion but also to predict missing segments obscured by occlusions, a capability lacking in many current systems. Remarkably, the system achieves performance on par with existing methods in non-occluded scenarios, proving its versatility and reliability across a range of conditions. Unlike optimization- or diffusion-based methods, which are often slow and sensitive to initial conditions, masked modeling enables efficient, end-to-end inference. By randomly masking sequence segments, the model learns to reconstruct missing parts, mirroring the way humans intuitively fill in gaps in perception, a crucial feature for handling occlusions and creating plausible motion sequences. This work opens exciting possibilities for more immersive AR/VR experiences, more responsive robotic systems, and more realistic digital content creation tools.
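To make the masking idea concrete, here is a minimal sketch of how random temporal segments of a motion sequence might be hidden before reconstruction. The tensor shapes, segment lengths, and zero placeholder are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mask_random_segments(motion, n_segments=2, max_len=15, rng=None):
    """Hide random temporal segments of a motion sequence.

    motion: (T, D) array of per-frame pose/translation parameters.
    Returns the masked copy and a boolean visibility mask (True = visible).
    """
    rng = rng or np.random.default_rng()
    T = motion.shape[0]
    visible = np.ones(T, dtype=bool)
    for _ in range(n_segments):
        seg_len = int(rng.integers(1, max_len + 1))
        start = int(rng.integers(0, max(T - seg_len, 1)))
        visible[start:start + seg_len] = False
    masked = motion.copy()
    masked[~visible] = 0.0  # placeholder for frames the model must infer
    return masked, visible

# Toy usage: 120 frames of a 135-D pose vector (illustrative dimensions).
sequence = np.random.randn(120, 135).astype(np.float32)
masked_sequence, visible = mask_random_segments(sequence)
```

During training, the model is penalised on exactly these hidden frames, which is what lets it fill in occluded spans at test time.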
Masked Motion Reconstruction from Monocular Video
First, a trajectory-aware motion prior is trained on MoCap datasets, capturing realistic motion dynamics. In parallel, an image-conditioned pose prior is trained on image-pose datasets, learning diverse per-frame poses to improve pose estimation accuracy. These priors are then fused within a video-conditioned masked transformer, which is finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference under occlusion. At inference, the system iteratively synthesizes motion by combining visual evidence with the learned priors, a process enabled by this cross-modality training strategy.
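The summary does not spell out the synthesis loop, but a schematic, confidence-scheduled masked decoding step in the spirit of generative masked modeling might look like the following. The `model(video_feats, motion, unknown) -> (pred, conf)` interface and the linear commit schedule are assumptions for illustration, not MoRo's actual implementation.

```python
import torch

@torch.no_grad()
def iterative_fill(model, video_feats, motion, unknown, n_iters=4):
    """Schematic confidence-guided masked synthesis over T frames.

    motion:  (T, D) current motion estimate (occluded frames hold placeholders).
    unknown: (T,) bool, True where the person is occluded / unobserved.
    Each round commits the most confident predictions for still-unknown
    frames and re-predicts the rest.
    """
    unknown = unknown.clone()
    for it in range(n_iters):
        if not unknown.any():
            break
        pred, conf = model(video_feats, motion, unknown)  # hypothetical interface
        n_commit = max(1, int(unknown.sum().item() / (n_iters - it)))
        conf = conf.masked_fill(~unknown, float("-inf"))  # only fill unknown frames
        idx = conf.topk(n_commit).indices
        motion[idx] = pred[idx]
        unknown[idx] = False
    return motion
```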
Crucially, the masked modeling technique allows MoRo to naturally handle occlusions by learning to reconstruct missing motion segments, unlike regression-based methods, which are fragile to missing observations. The research team implemented a masked transformer architecture, randomly masking segments of the motion sequence during training. This forces the model to predict missing data, effectively learning to infer motion even when parts of the body are obscured. Furthermore, the study pioneered a method for fusing motion and pose priors, allowing the system to leverage both kinematic constraints and visual information for more accurate and plausible reconstructions, a key factor in achieving superior performance in challenging occlusion scenarios.
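As a rough illustration of how motion, pose, and video cues could be fused inside a masked transformer, consider the sketch below. The layer sizes, the learned mask token, and the concatenation-based fusion are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Illustrative fusion of a motion prior, an image-conditioned pose
    prior, and per-frame video features (not MoRo's exact design)."""

    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.fuse = nn.Linear(3 * d_model, d_model)  # motion + pose + video
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 135)          # per-frame pose parameters

    def forward(self, motion_emb, pose_emb, video_emb, visible):
        # Replace embeddings of masked (occluded) frames with a learned token,
        # so the transformer must recover them from pose and video cues alone.
        motion_emb = torch.where(visible[..., None], motion_emb,
                                 self.mask_token.expand_as(motion_emb))
        x = self.fuse(torch.cat([motion_emb, pose_emb, video_emb], dim=-1))
        return self.head(self.encoder(x))
```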
MoRo recovers accurate motion despite occlusions
Experiments revealed that MoRo consistently surpasses existing state-of-the-art methods while maintaining comparable performance when no occlusions are present. By employing masked modeling, the system naturally handles occlusions, enabling end-to-end inference. Measurements confirm a 16% and 31% improvement in MPJPE (Mean Per-Joint Position Error) for visible and occluded body parts, respectively, compared to the best baseline, PromptHMR, on the EgoBody-Occ dataset. Data shows that MoRo achieves 58% better global joint reconstruction (GMPJPE) than RoHM, a previous leading method, by effectively integrating visual evidence with motion dynamics.
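For readers unfamiliar with the metric, MPJPE is simply the mean Euclidean distance between predicted and ground-truth joints. A minimal sketch follows; the millimetre scaling and the visible/occluded split are assumed conventions for illustration.

```python
import numpy as np

def mpjpe(pred, gt, mask=None):
    """Mean per-joint position error in millimetres.

    pred, gt: (T, J, 3) joint positions in metres.
    mask: optional (T, J) bool array selecting only visible (or only
    occluded) joints, as the two are reported separately on EgoBody-Occ.
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # (T, J) per-joint errors
    if mask is not None:
        err = err[mask]
    return 1000.0 * err.mean()

# GMPJPE is the same error computed on world-frame (global) joints,
# without per-frame alignment to the camera or root.
```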
Tests prove that the system produces remarkably plausible motion with minimal jitters and foot sliding, addressing a common issue in motion reconstruction. Specifically, MoRo recorded a jitter value of 2.15 and foot sliding of 4.60, significantly lower than many baselines. The team evaluated performance using both camera-space and world-grounded metrics, reporting per-frame metrics like PA-MPJPE, MPJPE, and PVE, alongside global metrics such as RTE, Jitter, and Foot-Sliding. Further quantitative evaluation on the RICH dataset showed MoRo delivering comparable results to baselines, while yielding more plausible motion with lower acceleration and jitter values. Ablation studies on EgoBody-Occ highlighted the effectiveness of the motion prior, with notable improvements in both motion realism and pose accuracy for occluded body parts when incorporated into the framework. The confidence-guided masking strategy further narrowed the train-test gap, enhancing the model’s robustness and demonstrating a clear path towards more reliable and accurate human motion reconstruction in real-world applications.
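Jitter and foot sliding are standard plausibility metrics. The sketch below shows one common way to compute them, as jerk magnitude and horizontal foot displacement during heuristic ground contacts; the contact threshold, frame rate, and scaling are assumptions, so the absolute numbers will not match the paper's reported values exactly.

```python
import numpy as np

def jitter(joints, fps=30.0):
    """Mean magnitude of jerk (3rd temporal derivative) over all joints.

    joints: (T, J, 3) world-space positions in metres. The scaling used for
    the reported numbers is not specified here, so treat the output as a
    relative smoothness score.
    """
    jerk = np.diff(joints, n=3, axis=0) * fps**3  # (T-3, J, 3)
    return np.linalg.norm(jerk, axis=-1).mean()

def foot_sliding(foot_joints, contact_height=0.05):
    """Mean horizontal foot displacement on frames flagged as in contact.

    foot_joints: (T, F, 3) ankle/toe positions with z up; a frame counts as
    a contact when the joint is below `contact_height` (a common heuristic,
    assumed here rather than taken from the paper).
    """
    disp = np.linalg.norm(np.diff(foot_joints[..., :2], axis=0), axis=-1)  # (T-1, F)
    contact = foot_joints[1:, :, 2] < contact_height
    return disp[contact].mean() if contact.any() else 0.0
```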
MoRo reconstructs motion despite body occlusions, achieving robust performance
This research addresses a key challenge in areas like augmented and virtual reality, robotics, and digital content creation, where reliable motion capture is crucial but often hampered by real-world occlusions. MoRo employs masked modeling, a technique borrowed from generative AI, to intelligently handle these occlusions and efficiently reconstruct movement in a consistent, global coordinate system. This allows MoRo to learn robust priors about human movement and pose, improving the accuracy and realism of the reconstructed motion. Acknowledging limitations, the authors note that the current implementation is best suited for videos captured with static cameras and known camera parameters. Future work will focus on extending MoRo to handle dynamic camera scenarios by incorporating techniques for modeling camera motion. This advancement offers a practical solution for a range of applications requiring accurate and efficient human motion reconstruction, particularly in challenging, real-world conditions.
👉 More information
🗞 Masked Modeling for Human Motion Recovery Under Occlusions
🧠 ArXiv: https://arxiv.org/abs/2601.16079
