MoRo Achieves Robust Human Motion Recovery under Occlusions from Monocular Videos

Reconstructing accurate human motion from video remains a significant hurdle in fields like augmented reality, robotics, and digital content creation, particularly when subjects are partially obscured. Zhiyin Qian from ETH Zürich, Siwei Zhang, Bharat Lal Bhatnagar, and Federica Bogo, alongside Siyu Tang from ETH Zürich, present a solution in their work, ‘Masked Modeling for Human Motion Recovery Under Occlusions’. Their framework, MoRo, tackles occlusions with generative masked modeling, enabling robust and efficient end-to-end motion recovery from RGB videos. Where existing methods either struggle with missing data or suffer from slow processing, MoRo reports substantial gains in both accuracy and motion realism under occlusion, reaches real-time inference at 70 FPS, and remains competitive in clear, non-occluded scenarios.

Masked modeling lets the system handle occlusions while keeping inference end-to-end, an improvement over slower, more complex approaches. The framework combines three components: a trajectory-aware motion prior trained on MoCap datasets, an image-conditioned pose prior trained on image-pose datasets to capture diverse per-frame poses, and a video-conditioned masked transformer that fuses these priors. The transformer is then finetuned on video-motion datasets so that it integrates visual cues with motion dynamics, drawing on multiple data sources for robust inference.

With this design, MoRo not only reconstructs visible motion but also predicts segments hidden by occlusions, a capability lacking in many current systems. In non-occluded scenarios it performs on par with existing methods, which suggests the approach does not trade away accuracy in easier conditions. Unlike optimization- or diffusion-based methods, which are often slow and sensitive to initial conditions, masked modeling enables efficient, end-to-end inference. Because sequence segments are randomly masked during training, the model learns to reconstruct missing parts, much as humans intuitively fill in gaps in perception. This is what allows it to handle occlusions and produce plausible motion sequences. The work opens possibilities for more immersive AR/VR experiences, more responsive robotic systems, and more realistic digital content creation tools.

Masked Motion Reconstruction from Monocular Video

First, a trajectory-aware motion prior is trained on MoCap datasets to capture realistic motion dynamics. In parallel, an image-conditioned pose prior is trained on image-pose datasets, learning diverse per-frame poses to improve pose estimation accuracy. The two priors are then fused within a video-conditioned masked transformer, which is finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference under occlusion. At inference, the system synthesizes motion iteratively, combining visual cues with the motion priors learned through this cross-modality training strategy.
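The iterative synthesis described above can be illustrated with a small sketch. The article does not specify the decoding schedule, so the confidence-ranked, linear schedule below is an assumption borrowed from common masked generative models, and `iterative_decode`, `toy_predict`, and `MASK` are hypothetical names. The toy predictor simply copies the nearest decoded token and stands in for the video-conditioned masked transformer.

```python
import math

MASK = None  # placeholder for a motion token that has not been decoded yet


def iterative_decode(tokens, predict, steps=4):
    """Fill MASK slots over several passes, most confident predictions first.

    predict(tokens) must return one (token, confidence) pair per position;
    it stands in for the masked transformer (an assumption of this sketch).
    """
    tokens = list(tokens)  # work on a copy so the input stays untouched
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Linear schedule (assumed): spread the remaining slots evenly over
        # the remaining passes, so the final pass fills everything left.
        n_commit = math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens):
    """Hypothetical predictor: copy the nearest already-decoded token."""
    known = [i for i, t in enumerate(tokens) if t is not MASK]
    out = []
    for i, t in enumerate(tokens):
        if t is not MASK:
            out.append((t, 1.0))
        elif not known:
            out.append((0, 0.0))  # nothing to copy from
        else:
            j = min(known, key=lambda k: abs(k - i))
            # Confidence decays with distance to the copied token.
            out.append((tokens[j], 1.0 / (1 + abs(j - i))))
    return out
```

Calling `iterative_decode([5, MASK, MASK, MASK, 9], toy_predict, steps=2)` fills the two slots next to known tokens in the first pass and the middle slot in the second, which mirrors how confident predictions can anchor the less certain ones.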

Crucially, masked modeling lets MoRo handle occlusions by learning to reconstruct missing motion segments, unlike regression-based methods, which are fragile to missing observations. The research team implemented a masked transformer architecture and randomly masked segments of the motion sequence during training. This forces the model to predict missing data, so it learns to infer motion even when parts of the body are obscured. The study also introduces a method for fusing motion and pose priors, allowing the system to leverage both kinematic constraints and visual information for more accurate and plausible reconstructions, a key factor in its performance in challenging occlusion scenarios.
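As a rough illustration of training-time masking, the sketch below hides random contiguous spans of a token sequence and returns the positions the model would be asked to reconstruct. It is a minimal sketch rather than the authors' implementation: the function name, the `MASK` id, and the `mask_ratio` and `max_segment` parameters are all assumptions.

```python
import random

MASK = -1  # hypothetical id standing in for a learned [MASK] token


def mask_random_segments(tokens, mask_ratio=0.5, max_segment=8, rng=None):
    """Hide random contiguous spans; return (masked copy, boolean mask)."""
    rng = rng or random.Random()
    n = len(tokens)
    # Number of positions to hide, clamped to the sequence length.
    target = min(n, max(0, int(round(mask_ratio * n))))
    is_masked = [False] * n
    hidden = 0
    while hidden < target:
        length = rng.randint(1, max_segment)  # span length, inclusive bounds
        start = rng.randrange(n)
        for i in range(start, min(start + length, n)):
            if hidden == target:
                break
            if not is_masked[i]:
                is_masked[i] = True
                hidden += 1
    masked = [MASK if m else t for t, m in zip(tokens, is_masked)]
    return masked, is_masked
```

During training, a reconstruction loss would typically be computed only at positions where `is_masked` is true, so the model is rewarded for inferring hidden motion from the visible context.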

MoRo recovers accurate motion despite occlusions

Experiments show that MoRo consistently surpasses existing state-of-the-art methods under occlusion while maintaining comparable performance when no occlusions are present. On the EgoBody-Occ dataset, it improves MPJPE (Mean Per-Joint Position Error) by 16% for visible body parts and 31% for occluded body parts relative to the best baseline, PromptHMR. MoRo also achieves 58% better global joint reconstruction (GMPJPE) than RoHM, a previous leading method, by effectively integrating visual evidence with motion dynamics.
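For readers unfamiliar with the metric, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions, and the percentages above are relative reductions in that error. The sketch below shows both computations. The helper names are hypothetical, and the numbers in the usage note are illustrative placeholders, not values from the paper.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: lists of frames, each frame a list of (x, y, z) joints
    in the same order and the same units.
    """
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # Euclidean distance for one joint
            count += 1
    return total / count


def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For example, with a baseline error of 100.0 and a new error of 84.0 (made-up values), `relative_improvement` returns 16.0. Restricting the joints passed to `mpjpe` to visible or occluded ones would give per-subset scores of the kind reported above.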

The results also indicate that the system produces plausible motion with minimal jitter and foot sliding, two common artifacts in motion reconstruction. Specifically, MoRo recorded a jitter value of 2.15 and foot sliding of 4.60, significantly lower than many baselines. The team evaluated performance using both camera-space and world-grounded metrics, reporting per-frame metrics such as PA-MPJPE, MPJPE, and PVE alongside global metrics such as RTE, Jitter, and Foot-Sliding. Further quantitative evaluation on the RICH dataset showed MoRo delivering results comparable to baselines while yielding more plausible motion, with lower acceleration and jitter values. Ablation studies on EgoBody-Occ highlighted the effectiveness of the motion prior: incorporating it notably improved both motion realism and pose accuracy for occluded body parts. The confidence-guided masking strategy further narrowed the train-test gap, improving the model's robustness.
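The article does not define how jitter and foot sliding are computed, and exact definitions and units vary between papers. One common formulation, sketched below as an assumption, measures jitter as the mean magnitude of jerk (the third time derivative of joint position) and foot sliding as the mean horizontal foot displacement between consecutive frames in which the foot is in contact with the ground. The function names, the `fps` default, and the choice of y as the vertical axis are all hypothetical.

```python
import math


def jitter(traj, fps=30.0):
    """Mean jerk magnitude of one joint trajectory.

    traj: list of (x, y, z) positions, at least 4 frames long.
    """
    dt = 1.0 / fps
    mags = []
    for a, b, c, d in zip(traj, traj[1:], traj[2:], traj[3:]):
        # Third-order finite difference: d - 3c + 3b - a, scaled by dt^3.
        jerk = [(d[k] - 3 * c[k] + 3 * b[k] - a[k]) / dt ** 3 for k in range(3)]
        mags.append(math.hypot(*jerk))
    return sum(mags) / len(mags)


def foot_sliding(foot_traj, in_contact):
    """Mean horizontal displacement across consecutive contact frames.

    Assumes y is the vertical axis, so sliding is measured in the x-z plane.
    """
    slides = []
    for t in range(1, len(foot_traj)):
        if in_contact[t - 1] and in_contact[t]:
            dx = foot_traj[t][0] - foot_traj[t - 1][0]
            dz = foot_traj[t][2] - foot_traj[t - 1][2]
            slides.append(math.hypot(dx, dz))
    return sum(slides) / len(slides) if slides else 0.0
```

A trajectory moving at constant velocity has zero jerk under this definition, and a foot that stays planted has zero sliding, so lower values on both metrics correspond to smoother, more physically plausible motion.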

MoRo reconstructs motion despite body occlusions

This research addresses a key challenge in areas like augmented and virtual reality, robotics, and digital content creation, where reliable motion capture is crucial but often hampered by real-world occlusions. MoRo employs masked modeling, a technique borrowed from generative AI, to handle these occlusions and efficiently reconstruct movement in a consistent, global coordinate system. Training in this way lets MoRo learn robust priors about human movement and pose, improving the accuracy and realism of the reconstructed motion. As a limitation, the authors note that the current implementation is best suited for videos captured with static cameras and known camera parameters. Future work will focus on extending MoRo to dynamic camera scenarios by incorporating techniques for modeling camera motion. The approach offers a practical option for applications that require accurate and efficient human motion reconstruction in challenging, real-world conditions.

👉 More information
🗞 Masked Modeling for Human Motion Recovery Under Occlusions
🧠 ArXiv: https://arxiv.org/abs/2601.16079

Rohail T.
