Researchers are tackling the challenge of creating more natural and adaptable human-robot interactions, focusing on how robots can better understand and execute complex physical tasks. Sirui Xu, Samuel Schulter (Amazon), and Morteza Ziyadi (Amazon), alongside Xialin He, Xiaohan Fei (Amazon), and Yu-Xiong Wang (University of Illinois Urbana-Champaign), present InterPrior, a novel framework designed to scale generative control for physics-based human-object interactions. This work is significant because it moves beyond explicitly programmed movements, instead leveraging learned ‘priors’ (underlying assumptions about balance, contact, and manipulation) to allow robots to coordinate their actions more effectively and generalise skills to new situations, ultimately paving the way for more intuitive and robust robotic systems.
Learning robust loco-manipulation through imitation and reinforcement learning on physically perturbed data
Scientists have developed InterPrior, a scalable framework capable of learning a unified generative controller for complex, physics-based human-object interactions. This work addresses the challenge of creating humanoids that can seamlessly compose and generalise loco-manipulation skills across varied environments while maintaining physically realistic whole-body coordination.
InterPrior distills expertise from large-scale imitation learning into a versatile, goal-conditioned policy that reconstructs motion from multimodal observations and high-level intentions. The distilled policy, however, initially lacks reliable generalisation capabilities due to the expansive configuration space inherent in human-object interactions.
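The distillation stage described above can be pictured as supervised regression: the goal-conditioned student imitates the full-reference expert's actions. The sketch below is an assumption-laden illustration, not the paper's exact objective; the DAgger-style MSE term and the latent-bounding penalty (the paper reports that bounding the latent space helps) are simplified stand-ins.

```python
import numpy as np

def distillation_loss(student_actions, expert_actions, latents, latent_bound=3.0):
    """Hedged sketch of online distillation (assumed form, not the paper's
    exact loss): the goal-conditioned student regresses onto the expert's
    actions via MSE, plus a simple penalty keeping latents inside a bounded
    region to mimic the bounded latent space the paper describes."""
    bc = np.mean((student_actions - expert_actions) ** 2)
    # Penalise only the portion of each latent that strays outside the bound.
    overflow = np.maximum(np.abs(latents) - latent_bound, 0.0)
    return float(bc + np.mean(overflow ** 2))
```

With perfectly matched actions and in-bound latents the loss is zero, so gradient steps push the student toward the expert while discouraging latent drift.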
To overcome this limitation, researchers applied data augmentation techniques involving physical perturbations, followed by reinforcement learning finetuning to enhance competence on previously unseen goals and starting conditions. These combined steps consolidate reconstructed latent skills into a valid manifold, resulting in a motion prior that extends beyond the original training data and can incorporate interactions with novel objects.
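The physical-perturbation augmentation mentioned above amounts to jittering simulation states so the policy encounters starting conditions absent from the demonstrations. The following is a minimal sketch under assumed state names (`qpos`, `qvel`, `obj_pos`) and noise scales; the paper's actual perturbation scheme is not specified here.

```python
import numpy as np

def perturb_state(state, rng, pos_noise=0.02, vel_noise=0.1):
    """Illustrative sketch (assumed, not the paper's implementation):
    produce a physically perturbed copy of a simulation state so the
    policy must recover from starting conditions beyond the data."""
    return {
        # Jitter joint positions and velocities.
        "qpos": state["qpos"] + rng.normal(0.0, pos_noise, state["qpos"].shape),
        "qvel": state["qvel"] + rng.normal(0.0, vel_noise, state["qvel"].shape),
        # Nudge the object's position so contacts must be re-established.
        "obj_pos": state["obj_pos"] + rng.normal(0.0, pos_noise, 3),
    }

rng = np.random.default_rng(0)
state = {"qpos": np.zeros(29), "qvel": np.zeros(29), "obj_pos": np.zeros(3)}
aug = perturb_state(state, rng)
```

Rolling the policy out from many such perturbed states, then finetuning on the successes, is what extends the motion prior beyond the original trajectories.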
This innovative approach moves beyond simply mimicking demonstrated behaviours, instead generating expressive trajectories and maintaining task success even with varied physical properties. InterPrior supports multiple goal formulations, including sparse targets and their combinations, enabling a broader range of skills than typical procedural routines.
The resulting controller demonstrates robust failure recovery, exemplified by the ability to re-grasp objects after unsuccessful attempts, and maintains stability under external disturbances. Furthermore, InterPrior facilitates user-interactive control and exhibits potential for deployment on real humanoid robots, as demonstrated through sim-to-sim evaluation and keyboard-based control.
By leveraging distillation to inherit broad skills from large-scale data and employing reinforcement learning as a local optimiser, this research establishes a crucial link between data reconstruction and robust, generalisable policy learning. This framework represents a significant step towards creating humanoids capable of intuitive and adaptable interaction with the physical world.
Distillation of imitation and reinforcement learning for robust policy refinement
A masked conditional variational policy forms the core of the InterPrior framework, distilling expertise from a full-reference imitation expert. This policy reconstructs motor control signals from sparse, multimodal goals and distilled information, effectively learning from large-scale human-object interaction demonstrations.
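The "masked" conditioning idea can be sketched simply: during training, goal modalities are randomly dropped so a single policy learns to act from sparse, partial goals. The modality names, shapes, and keep probability below are all assumptions for illustration.

```python
import numpy as np

MODALITIES = ("snapshot", "trajectory", "contact")  # assumed goal types

def mask_goals(goals, rng, keep_prob=0.5):
    """Illustrative sketch of masked goal conditioning (details assumed):
    randomly zero out whole goal modalities and expose the binary mask to
    the policy, so it learns to reconstruct motion from sparse goals."""
    mask = (rng.random(len(MODALITIES)) < keep_prob).astype(np.float64)
    # Ensure at least one modality survives so the goal is never empty.
    if mask.sum() == 0:
        mask[rng.integers(len(MODALITIES))] = 1.0
    masked = np.concatenate(
        [goals[m] * mask[i] for i, m in enumerate(MODALITIES)]
    )
    return masked, mask
```

Because the policy sees every subset of modalities during training, any combination of sparse targets can condition it at test time, which is what enables the multiple goal formulations discussed later.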
The system employs data augmentation techniques, introducing perturbations to enhance the robustness and generalisability of the learned policy beyond the initial training data. Following distillation, reinforcement learning finetuning consolidates latent skills into a valid interaction manifold, improving performance on both unseen goals and novel starting conditions.
This finetuning process optimises two key objectives simultaneously: maximising success rates on previously unencountered goals and initial states, and preserving the knowledge acquired during pretraining through regularisation. The researchers synthesised natural in-between motions using the pretrained base policy, and crucially, incorporated failure states to specifically train recovery behaviours such as re-approaching and re-grasping.
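The two finetuning objectives above can be written as one scalar loss: maximise task return while a regulariser keeps the finetuned latent distribution close to the pretrained prior so earlier skills are not forgotten. The closed-form Gaussian KL term and the trade-off weight `beta` below are standard choices assumed for illustration, not the paper's exact formulation.

```python
import numpy as np

def finetune_loss(task_return, latent_mu, prior_mu, latent_logvar,
                  prior_logvar, beta=0.1):
    """Hedged sketch of the dual finetuning objective: minimise negative
    task return plus a KL penalty anchoring the latent Gaussian to the
    pretrained prior (closed-form KL between diagonal Gaussians)."""
    var, prior_var = np.exp(latent_logvar), np.exp(prior_logvar)
    kl = 0.5 * np.sum(
        prior_logvar - latent_logvar
        + (var + (latent_mu - prior_mu) ** 2) / prior_var
        - 1.0
    )
    return -task_return + beta * kl  # lower is better
```

When the finetuned latents match the prior exactly, the KL term vanishes and only task success drives the update; as they drift, the penalty grows, which is the "preserving pretrained knowledge" effect described above.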
This approach transforms reconstructed latent skills into a stable and continuous manifold, enabling generalisation beyond the original training trajectories. InterPrior supports multiple goal formulations, including sparse targets and combinations thereof, within a single policy. The framework scales to large human-object interaction datasets, facilitating affordance-rich interactions beyond simple grasping tasks.
Furthermore, the system generates expressive trajectories rather than merely replicating demonstrations, and maintains task success even with varied physical properties. The resulting controller demonstrates robustness, enabling mid-trajectory command switching, successful re-grasps after failures, and stability under external perturbations; it was trained for the G1 humanoid robot and validated through sim-to-sim evaluation.
InterPrior demonstrates improved robotic manipulation through progressive architectural enhancements
Researchers developed InterPrior, a scalable framework achieving 90.0% success on in-distribution goal-conditioned tasks, alongside a position error of 13.6 and an observation error of 9.5. The study details performance across snapshot, trajectory, and contact tasks, as well as long-horizon multi-goal chains and object lifting under random human initialisation.
Initial experiments with a MaskedMimic baseline, utilising an InterMimic expert, yielded 64.2% success, 29.3 position error, 22.1 observation error, and a 12.6% failure rate. Progressively incorporating components of InterPrior, including an InterMimic+ expert, latent shaping loss, bounded latent and observation spaces, and reinforcement learning finetuning, demonstrably improved performance.
The addition of the latent shaping loss resulted in 74.9% success, 20.4 position error, 15.5 observation error, and a 10.6% failure rate, while bounding the latent and observation spaces increased success to 89.1%, reduced position error to 11.7, observation error to 8.9, and failure rate to 6.0%. Full-reference tracking experiments on thin-geometry interactions and with initialisation noise revealed InterPrior’s superior performance, achieving higher success rates than InterMimic.
While InterMimic attained lower position error through strict tracking, InterPrior intentionally deviated to realign contact, prioritising interaction completion. Qualitative results demonstrate InterPrior’s ability to sustain minute-long whole-body interactions with multiple objects, self-correcting drift and exhibiting robustness induced by reinforcement learning finetuning. Zero-shot generalisation to unseen objects and interactions was also observed, with InterPrior converging to feasible contact configurations even from imperfect data.
Distillation and reinforcement learning enable generalisable human-robot manipulation
InterPrior, a physics-based generative motion controller, successfully scales human-object interaction by combining large-scale imitation distillation with reinforcement learning finetuning. The framework learns a unified generative controller by first distilling expert demonstrations into a versatile, goal-conditioned policy capable of reconstructing motion from multimodal observations and high-level intent.
Subsequent reinforcement learning, applied alongside data augmentation, consolidates these skills into a generalisable motion prior extending beyond the initial training data. This approach yields a controller that maintains natural whole-body coordination while substantially improving robustness and competence in loco-manipulation tasks.
InterPrior effectively composes skills, transitions smoothly between actions, and recovers from failures across varied contact and dynamic conditions. The decoupled learning process broadens the range of achievable tasks, skills, and dynamics, and facilitates interactive control with potential applications to diverse humanoid embodiments.
Limitations acknowledged by the developers include challenges with extremely thin or elongated objects, and difficulties in multi-goal chaining where discrepancies in alignment can prioritise balance over precise goal achievement. Future research will focus on integrating perception, language-conditioned goals, and richer affordances to further advance InterPrior towards robust sim-to-real assistive manipulation and teleoperation.
👉 More information
🗞 InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
🧠 ArXiv: https://arxiv.org/abs/2602.06035
