Accurate 3D spatial perception underpins generalisable robotic manipulation, but achieving reliable, high-quality 3D geometry presents a significant hurdle. Sizhe Yang from Shanghai AI Laboratory and The Chinese University of Hong Kong, together with Linning Xu, Hao Li, and colleagues, addresses this challenge with a new model, Robo3R. This research introduces a feed-forward reconstruction model that predicts accurate, metric-scale scene geometry from RGB images and robot states in real time, offering a substantial improvement over existing depth sensors and reconstruction methods. By jointly inferring scale-invariant local geometry and relative camera poses, and unifying them within a learned global transformation, Robo3R delivers the precision necessary for complex manipulation tasks, demonstrably enhancing performance in areas such as imitation learning, grasp synthesis and sim-to-real transfer.
This innovation addresses critical limitations in robotic manipulation, where obtaining reliable 3D geometry has long been a challenge due to the noise and material sensitivity of conventional depth sensors.
Existing reconstruction models often lack the precision and metric consistency necessary for physical interaction, prompting the creation of this alternative sensing module. Robo3R jointly infers scale-invariant local geometry and relative camera poses, unifying them into a coherent scene representation within the robot’s coordinate frame via a learned global similarity transformation.
To meet the demanding precision requirements of robotic manipulation, the model employs a masked point head, generating sharp, fine-grained point clouds. A keypoint-based Perspective-n-Point formulation further refines camera extrinsics and global alignment, ensuring accurate spatial understanding. Trained on Robo3R-4M, a newly curated large-scale synthetic dataset comprising four million high-fidelity annotated frames, Robo3R consistently surpasses the performance of state-of-the-art reconstruction methods and traditional depth sensors.
This dataset’s diversity and photorealism facilitate successful transfer to real-world manipulation scenarios, demonstrating the model’s robustness and adaptability. Extensive validation through downstream tasks, including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, reveals consistent performance gains when utilising Robo3R.
Specifically, the proposed approach reaches 93.3% success, compared with 73.0% for a baseline reconstruction in a single forward pass and 60.0% with depth cameras, showcasing its superior performance. Furthermore, Robo3R demonstrates notable improvements in collision-free motion planning and grasp synthesis, reaching 52.1% success in each. These results suggest that Robo3R offers a promising alternative for enhancing spatial perception and enabling more robust, versatile robotic manipulation, particularly for the transparent, reflective, or small objects that typically challenge conventional sensors.
Real-time Metric Scene Reconstruction via Joint Geometric and Pose Inference
Robo3R, a feed-forward 3D reconstruction model, directly predicts accurate, metric-scale scene geometry from RGB images and robot states in real time. The work addresses limitations in existing reconstruction models and depth sensors regarding precision and metric consistency for robotic interaction. Robo3R jointly infers scale-invariant local geometry and relative camera poses, unifying these into a scene representation aligned with the robot’s coordinate frame via a learned global similarity transformation.
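The learned global similarity transformation maps scale-invariant predictions into the metric robot frame as X' = s·R·X + t. Robo3R learns this transform end to end; purely for intuition, the closed-form Umeyama estimator below shows how such a similarity can be recovered from point correspondences. Function names are illustrative, not the paper's interface:

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Closed-form similarity (s, R, t) aligning src -> dst so that
    dst ~ s * R @ src + t (Umeyama, 1991). src, dst: (N, 3) arrays."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)                       # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])                       # guard against reflections
    R = U @ D @ Vt
    var_src = (sc ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

def to_robot_frame(points, s, R, t):
    """Map scale-invariant points into the metric robot frame: s * R @ p + t."""
    return s * points @ R.T + t
```

In Robo3R the transform is predicted rather than fitted, but the geometry it must satisfy is the same: one scale, one rotation, and one translation shared by the whole scene.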
To achieve manipulation-level precision, the model employs a masked point head for generating sharp, fine-grained point clouds. This head decomposes dense point prediction into depth, normalized image coordinates, and a mask, mitigating over-smoothing and enhancing geometric detail through unprojection and masking.
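The decomposition above can be sketched in a few lines of NumPy. Tensor names and the 0.5 threshold are illustrative assumptions; the point is the structure: unproject depth along predicted rays, then let the mask discard unreliable pixels instead of smoothing over them:

```python
import numpy as np

def decode_masked_point_head(depth, ray_xy, mask_prob, thresh=0.5):
    """Assemble a local point cloud from three dense head outputs.

    depth:     (H, W)    per-pixel depth along the optical axis
    ray_xy:    (H, W, 2) normalized image coordinates (x/z, y/z)
    mask_prob: (H, W)    per-pixel validity probability

    Unprojection: P = depth * (x, y, 1). The mask then drops pixels
    (e.g. at depth discontinuities) that would otherwise turn into
    over-smoothed 'flying' points between surfaces.
    """
    H, W = depth.shape
    rays = np.concatenate([ray_xy, np.ones((H, W, 1))], axis=-1)  # (H, W, 3)
    points = depth[..., None] * rays                              # unproject
    return points[mask_prob > thresh]                             # (N, 3)
```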
A keypoint-based Perspective-n-Point formulation refines camera extrinsics and further improves global alignment of the reconstructed scene. The research leveraged Robo3R-4M, a curated large-scale synthetic dataset comprising four million high-fidelity annotated frames, to train and validate the model.
This dataset incorporates diverse assets, extensive randomization, and rich modalities to ensure robustness and generalizability. The model’s performance was evaluated across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, consistently demonstrating gains over state-of-the-art reconstruction methods and conventional depth sensors.
The extrinsic estimation module extracts robot keypoints and solves the Perspective-n-Point problem to refine the similarity transformation, ensuring accurate camera pose estimation. Robo3R incorporates an alternating-attention mechanism to facilitate efficient information propagation within and across frames, enhancing reconstruction fidelity. The study highlights the model’s superior robustness in handling transparent, reflective, and small objects, which typically pose challenges for depth cameras.
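The PnP step can be illustrated with the classical Direct Linear Transform. The paper does not specify its exact solver, and production code would typically call an off-the-shelf routine such as OpenCV's `solvePnP`; the noise-free sketch below assumes at least six keypoint correspondences given in normalized camera coordinates:

```python
import numpy as np

def pnp_dlt(obj_pts, img_pts):
    """Estimate extrinsics (R, t) from n >= 6 3D-2D correspondences via DLT.
    img_pts are normalized camera coordinates, i.e. intrinsics already
    removed: (u, v) = (X/Z, Y/Z) after X_cam = R @ X_obj + t."""
    obj_pts = np.asarray(obj_pts, dtype=float)
    A = []
    for (X, Y, Z), (u, v) in zip(obj_pts, img_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)                 # solution up to scale and sign
    depths = obj_pts @ P[2, :3] + P[2, 3]
    if depths.mean() < 0:                    # keypoints must lie in front
        P = -P
    U, S, Vt2 = np.linalg.svd(P[:, :3])
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt2))])
    R = U @ D @ Vt2                          # nearest proper rotation
    t = P[:, 3] / S.mean()                   # undo the arbitrary DLT scale
    return R, t
```

With the robot's forward kinematics supplying the 3D keypoint positions, the recovered (R, t) pins the reconstruction to the robot's coordinate frame.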
Robo3R achieves robust 3D reconstruction for challenging robotic manipulation tasks
Robo3R-4M comprises four million frames and features diverse assets, extensive randomization, rich modalities, and annotations. The research introduces Robo3R, a feed-forward 3D reconstruction model tailored for robotic manipulation, achieving high-fidelity depth estimation and precise camera parameter prediction.
Robo3R maintains a canonical coordinate system in real time while accurately predicting metric scale. Extensive qualitative and quantitative experiments validate Robo3R as a superior alternative to depth cameras for spatial perception in robotic manipulation. The work demonstrates that Robo3R produces higher-quality 3D representations and exhibits greater robustness to varying object materials and challenging scenarios.
This enhanced robustness extends to transparent, reflective, and tiny objects that typically hinder depth cameras. The accurate 3D geometry generated by Robo3R enables improved performance in downstream applications, including real-world imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.
Robo3R predicts a scale-invariant local 3D representation in the camera coordinate system via unprojection, deriving local point clouds from normalized image coordinates and depth. The model employs a masked point head to decode scale-invariant local geometry, alongside a relative pose head for registering points across multiple views.
A global similarity transformation maps these points into metric-scale 3D geometry within the canonical robot frame. The system processes RGB images and robot states through a transformer backbone utilising alternating global and frame-wise attention mechanisms.
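The alternating pattern can be sketched as follows: single-head attention, with learned projections, residuals, and normalization omitted for brevity; shapes and function names are illustrative assumptions rather than the paper's architecture:

```python
import numpy as np

def attend(x):
    """Single-head scaled dot-product self-attention over the rows of x.
    Learned query/key/value projections are omitted for brevity."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def alternating_attention(tokens, n_blocks=2):
    """tokens: (F, N, D) -- F frames, N image/state tokens per frame.
    Each block first mixes information within every frame independently
    (frame-wise attention), then over all F*N tokens jointly (global
    attention), propagating geometric cues across views."""
    F, N, D = tokens.shape
    x = tokens
    for _ in range(n_blocks):
        x = np.stack([attend(f) for f in x])               # frame-wise
        x = attend(x.reshape(F * N, D)).reshape(F, N, D)   # global
    return x
```

Interleaving the two scopes keeps per-frame detail cheap to compute while still letting every view condition on every other, which is what makes multi-view registration possible in a single forward pass.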
Real-time 3D scene reconstruction via joint geometric and pose inference
Robo3R, a feed-forward model for 3D reconstruction, delivers accurate, metric-scale scene geometry directly from RGB images and robot states in real time. The model jointly infers scale-invariant local geometry and relative camera poses, unifying these into a consistent scene representation aligned with the robot’s coordinate frame.
A masked point head generates sharp, fine-grained point clouds, while a keypoint-based Perspective-n-Point formulation refines camera positioning and global alignment. Trained on the Robo3R-4M dataset, containing four million annotated frames, the system consistently surpasses existing reconstruction methods and conventional depth sensors in performance.
Integration of robot state information during reconstruction improves point map estimation and absolute camera pose accuracy, exceeding the performance of alternative fusion methods. Demonstrations across imitation learning, sim-to-real transfer, grasp synthesis, and motion planning reveal consistent performance gains, indicating potential as a sensing module for robotic manipulation.
Currently, Robo3R is limited to pinhole cameras and a specific range of robot embodiments. Future development will focus on expanding its compatibility to include fisheye and panoramic cameras, as well as a broader variety of robot configurations, potentially through the generation of additional training data. This work establishes a cost-effective, accurate, and robust alternative to traditional depth sensors, capable of handling diverse materials and challenging conditions, thereby enhancing several robotic applications.
👉 More information
🗞 Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
🧠 ArXiv: https://arxiv.org/abs/2602.10101
