Robots Learn to Grasp Objects Using Vision and Simulation

Researchers are tackling the complex challenge of enabling humanoid robots to reliably manipulate objects in real-world environments. Runpei Dong and Saurabh Gupta, together with Ziyan Li and Xialin He, all at the University of Illinois Urbana-Champaign, present a novel approach to visual loco-manipulation that bridges the gap between robust control and generalisable visual understanding. Their work introduces HERO, a system combining large vision models with simulated training to achieve accurate end-effector control, addressing the limitations of data-intensive imitation learning. By designing an accurate, residual-aware end-effector tracking policy that reduces tracking error by a factor of 3.2, and integrating it with open-vocabulary vision models, the team demonstrates successful object manipulation across diverse settings, including offices and coffee shops, with objects placed on surfaces ranging from 43cm to 92cm in height. This research signifies a substantial step towards more versatile and adaptable humanoid robots capable of seamless interaction with everyday objects.

This advance allows machines to interact with everyday environments, from offices to cafés, and to manipulate diverse objects with greater precision. The goal is to equip humanoid robots with the ability to reliably grasp novel objects in unstructured, everyday settings.

Achieving this capability demands precise control of the robot’s end-effector, the hand or tool at the end of its arm, alongside a strong understanding of the surrounding scene derived from visual data. Current methods often rely on extensive real-world training data, limiting their ability to generalise to new situations. This work introduces HERO, a new approach that merges the broad understanding of large vision models with the precision of simulated training to overcome these limitations.

Yet, controlling a humanoid presents unique challenges beyond reaching for an object. Unlike industrial robots operating in structured settings, humanoids must maintain balance while coordinating complex, whole-body movements, such as bending, squatting, and twisting, to access objects. The researchers therefore focused on building an accurate end-effector tracking policy: a system that guides the robot’s hand to the desired location with minimal error.

This policy combines established robotics techniques with machine learning to convert target locations into precise movements, utilising both inverse and forward kinematics. Still, accurate forward kinematics is difficult to achieve on low-cost humanoids. For this reason, a neural forward model was developed to provide an accurate estimate of the end-effector pose.
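The split between analytical kinematics and a learned correction can be sketched with a toy planar two-link arm. This is a deliberate simplification: the actual robot is a full humanoid, and the real forward model is a trained neural network rather than a hand-written residual, so everything below is illustrative only.

```python
import math

# Toy analytical forward kinematics for a planar 2-link arm
# (a stand-in for the humanoid's much higher-dimensional kinematic chain).
def analytical_fk(q1, q2, l1=0.30, l2=0.25):
    """Joint angles (rad) -> end-effector (x, y) in the base frame."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

# A learned forward model adds a data-driven correction on top of the
# analytical estimate, absorbing unmodelled effects such as link flex
# and joint backlash that low-cost hardware exhibits.
def neural_fk(q1, q2, residual_model):
    x, y = analytical_fk(q1, q2)
    dx, dy = residual_model(q1, q2)  # trained offline on measured poses
    return x + dx, y + dy

# Hypothetical residual standing in for the trained network.
toy_residual = lambda q1, q2: (0.004 * math.sin(q1), -0.003 * math.cos(q2))
```

With the arm fully extended (`q1 = q2 = 0`), the analytical model reports the hand at 0.55m along the x-axis, and the residual model nudges that estimate towards what the sensors actually measure.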

Beyond this, the system incorporates goal adjustment and replanning to refine the robot’s trajectory. Together, these measures reduced end-effector tracking error by a factor of 3.2. This accurate tracking forms the foundation of a modular system that leverages large vision models to interpret visual input and identify objects for manipulation.

Here, the system demonstrated its capabilities in diverse real-world settings, including offices and coffee shops, successfully manipulating everyday items such as mugs, apples, and toys positioned on surfaces between 43cm and 92cm in height. In these real-world tests, the system achieved an 83.8% average success rate at reaching and picking up novel objects in challenging scenarios. At a fundamental level, the advances presented open new avenues for training humanoids to interact with the objects around them, moving beyond pre-programmed tasks towards genuine environmental awareness and adaptability.

Significant reduction in humanoid robot end-effector tracking error via neural network modelling

Achieving an average end-effector tracking error of 2.44cm demonstrates a substantial improvement in humanoid robot control. This level of precision, measured in a motion capture (MOCAP) room, represents a key advancement toward reliable object manipulation. Previous state-of-the-art methods typically exhibited errors ranging from 8 to 13cm, meaning the new system reduces tracking inaccuracies by a factor of 3.2.

Accurate end-effector positioning is vital for successful interaction with objects in active environments. Beyond the MOCAP room, systematic modular and end-to-end tests in simulation further validated the system’s effectiveness. Once systematic errors were addressed through target adjustment, the tracker consistently achieved roughly 2.5cm error, a figure confirmed by real-world trials.

Neural forward kinematics models were trained to map joint states to an accurate end-effector pose relative to the base, while a neural odometry model provided an accurate estimate of the base pose relative to the stationary feet. These models were essential in overcoming limitations inherent in analytical forward kinematics and odometry on a low-cost humanoid robot. For grasping open-vocabulary object queries in novel environments, the system achieved an average success rate of 83.8%.
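The role of the two models can be illustrated by composing their outputs: the odometry model anchors the base in the feet frame, and the forward kinematics model places the hand relative to the base. A minimal 2-D sketch, with both learned estimates replaced by fixed toy values (the real system works with full 6-DoF poses):

```python
import math

# Compose two planar poses (x, y, theta): frame a->b followed by b->c.
def compose(pose_ab, pose_bc):
    xa, ya, ta = pose_ab
    xb, yb, tb = pose_bc
    x = xa + math.cos(ta) * xb - math.sin(ta) * yb
    y = ya + math.sin(ta) * xb + math.cos(ta) * yb
    return (x, y, ta + tb)

# Base pose relative to the stationary feet (toy odometry output)...
base_in_feet = (0.05, 0.00, math.pi / 2)
# ...and end-effector pose relative to the base (toy neural FK output).
ee_in_base = (0.40, 0.10, 0.0)

# End-effector pose in the feet-anchored frame: the quantity the
# tracking policy ultimately needs to drive towards its target.
ee_in_feet = compose(base_in_feet, ee_in_base)
```

Errors in either estimate propagate directly into the hand’s world-frame position, which is why both components needed learned corrections on the low-cost hardware.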

This performance was evaluated across more than 25 everyday objects situated within 10 diverse and cluttered scenes, all positioned on surfaces ranging from 43cm to 92cm in height. Utilising large pre-trained vision models like Grounding DINO 1.5 and SAM-3 for object detection and segmentation, the modular system successfully identified and grasped a wide variety of items.
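The modular perception flow can be sketched as: query the detector with open-vocabulary text, keep the best-scoring box, and use it to seed segmentation. The stubs below are hypothetical stand-ins, not the actual APIs of Grounding DINO 1.5 or SAM, and the dummy detections are invented for illustration.

```python
# Pick the highest-scoring detection matching a free-text query
# (stand-in for an open-vocabulary detector such as Grounding DINO).
def detect(query, detections):
    matches = [d for d in detections if d["label"] == query]
    return max(matches, key=lambda d: d["score"]) if matches else None

# Centre of an (x1, y1, x2, y2) box: a stand-in for the prompt that
# seeds the segmentation model on the detected object.
def box_centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Invented detector output for the query "mug".
dets = [
    {"label": "mug",   "score": 0.91, "box": (40, 60, 120, 160)},
    {"label": "apple", "score": 0.88, "box": (200, 80, 260, 140)},
    {"label": "mug",   "score": 0.55, "box": (300, 50, 340, 100)},
]
best = detect("mug", dets)
seed = box_centre(best["box"])
```

The modularity is the point: the detector, segmenter, and grasp planner can each be swapped for a stronger model without retraining the tracking policy.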

The system’s ability to retarget grasps generated by AnyGrasp to the Dex3 hand on the Unitree robot also contributed to its success. By combining accurate end-effector tracking with strong perception and grasp planning, the research presents a significant step forward in loco-manipulation capabilities for humanoid robots. As a result, the system can reliably manipulate objects in diverse real-world settings, including offices and coffee shops.
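The retargeting idea can be sketched geometrically: keep the two antipodal contacts of a parallel-jaw grasp, send the thumb to one, and spread the remaining fingertips around the other. This is purely illustrative 2-D geometry; it is not AnyGrasp’s output format or the actual Dex3 retargeting used in the paper.

```python
import math

def retarget(contact_a, contact_b, n_fingers=2, spread=0.015):
    """Map a parallel-jaw grasp (two contact points, metres) to
    fingertip targets: thumb at contact_a, n_fingers near contact_b."""
    (xa, ya), (xb, yb) = contact_a, contact_b
    # Unit vector perpendicular to the jaw axis, used to fan the fingers out.
    dx, dy = xb - xa, yb - ya
    norm = math.hypot(dx, dy)
    px, py = -dy / norm, dx / norm
    thumb = (xa, ya)
    # Symmetric offsets so the fingertips straddle the original contact.
    offsets = [spread * (i - (n_fingers - 1) / 2) for i in range(n_fingers)]
    fingers = [(xb + o * px, yb + o * py) for o in offsets]
    return thumb, fingers
```

For a 6cm-wide grasp along the x-axis, the two fingertips land 1.5cm apart around the second contact while the thumb opposes them, preserving the original closing direction.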

Neural network enhanced end-effector pose estimation for humanoid robot control

The research centres on a novel end-effector tracking policy for humanoid robots. Initially, the team employed inverse kinematics to translate residual end-effector targets into reference trajectories for the robot’s movements. This conversion allows for precise control by calculating the joint angles needed to reach a desired position.

Beyond simple trajectory generation, a neural forward model was integrated to predict the robot’s forward kinematics, providing a more accurate estimate of end-effector pose. Accurate control demanded more than kinematic calculations, however: the policy receives current and target joint angles, alongside end-effector positions, as inputs.
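The policy’s input can be pictured as one flat vector built from the quantities just named. The dimensions and layout below are assumptions for illustration, not the paper’s exact observation space.

```python
# Assemble the tracking policy's observation from current and target
# joint angles plus current and target end-effector positions.
def build_observation(q_current, q_target, ee_current, ee_target):
    """Concatenate joint states and end-effector positions into one
    flat observation vector for the policy network."""
    return list(q_current) + list(q_target) + list(ee_current) + list(ee_target)

obs = build_observation(
    q_current=[0.1, -0.2, 0.3],     # current joint angles (rad), toy 3-DoF
    q_target=[0.15, -0.1, 0.25],    # target joint angles (rad)
    ee_current=[0.40, 0.05, 0.90],  # estimated end-effector xyz (m)
    ee_target=[0.45, 0.00, 0.95],   # commanded end-effector xyz (m)
)
```

Feeding the policy both the current and target states lets it reason about the residual it must close, rather than memorising absolute configurations.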

Obtaining a precise estimate of the current end-effector position proved challenging: analytical forward kinematics and odometry were insufficient on the Unitree G1 humanoid robot. To address this, two neural networks were trained: one to map joint states to end-effector pose, and another to estimate the base pose relative to the robot’s feet. Still, systematic errors persisted, prompting the inclusion of goal adjustment.

By modifying the target position based on the current end-effector tracking error, the policy actively corrects for inaccuracies. With these components in place, the team tested performance in a motion capture room. Here, the full system achieved an average end-effector tracking error of 2.44cm, a substantial improvement over previous state-of-the-art methods reporting errors between 8 and 13cm.
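The goal-adjustment idea can be sketched with a toy one-dimensional tracker that always settles a fixed 3cm past its command: shifting the command against the measured error makes the hand land on the true goal. The bias, gain, and step count are illustrative assumptions, not values from the paper.

```python
# Toy tracker: reaches the command, plus a systematic 3cm bias,
# standing in for the residual errors of the real hardware.
def plant(command, bias=0.03):
    return command + bias

def goal_adjusted_reach(goal, steps=5, gain=1.0):
    """Iteratively shift the commanded target against the observed
    tracking error so the tracker converges on the true goal."""
    command = goal
    for _ in range(steps):
        reached = plant(command)
        error = reached - goal      # measured end-effector tracking error
        command -= gain * error     # adjust the goal to cancel the bias
    return plant(command)

final = goal_adjusted_reach(goal=0.50)  # settles on the true 0.50m goal
```

Without adjustment the toy tracker stops at 0.53m; with it, the adjusted command (0.47m) cancels the bias, mirroring how target adjustment removed the systematic component of the real system’s error.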

To build a modular system, open-vocabulary large vision models were used for visual perception. Grounding DINO 1.5 and SAM-3 were implemented to detect and segment target objects within the scene. AnyGrasp then generated suitable parallel-jaw grasps, which were retargeted for the Dex3 hand mounted on the Unitree robot.

For years, the challenge has been bridging the gap between robots performing tasks in controlled laboratory settings and operating reliably amongst the clutter and unpredictability of real-world environments.

Robots struggle with the ‘last inch’ of interaction: the fine motor control needed to reliably grasp and manipulate objects they haven’t seen dozens of times before. Yet, this research sidesteps the need for massive datasets of robotic demonstrations by cleverly combining simulated training with the power of large vision models. By focusing on accurate end-effector tracking (in effect, precisely controlling the robot’s ‘hand’), the team has created a system that can generalise to a wider range of objects and environments than previously possible.

Once a robot can reliably reach for and touch an object, a surprising number of tasks become achievable. Still, the reliance on simulation always introduces a degree of uncertainty when transferring to the physical world. Although tests in offices and coffee shops demonstrate success with everyday items, the range of objects and scenes remains limited.

Beyond this, the system’s performance with objects of unusual shapes or varying textures requires further investigation. At present, the robot appears to function well with items within a defined size and weight range, but extending this capability is vital. The potential extends beyond picking up a mug or an apple: consider the implications for assistive robotics, where a robot could reliably help people with everyday tasks in their homes.

Or imagine robots working alongside humans in warehouses or factories, handling a diverse array of components. Further development will likely focus on integrating this precise control with more sophisticated reasoning and planning capabilities, allowing robots to not just manipulate objects, but to understand why they are manipulating them and to adapt to unforeseen circumstances. The path towards truly flexible humanoid robots remains long, but advances like these are steadily shortening it.

👉 More information
🗞 Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
🧠 ArXiv: https://arxiv.org/abs/2602.16705

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
