Spatial reasoning remains a significant challenge for vision-language models (VLMs), limiting their accuracy in real-world applications that demand precise measurement and understanding of physical environments. Siyi Chen, Mikaela Angelina Uy, and Chan Hee Song, along with colleagues, address this limitation with SpaceTools, a framework that teaches models to utilize a suite of visual tools for enhanced spatial understanding. The team demonstrates that through a two-phase training process called Double Interactive Reinforcement Learning (DIRL), models learn to coordinate tools such as depth estimators and pose estimators rather than relying on pre-defined tool sequences. This approach achieves state-of-the-art results on established spatial understanding benchmarks and enables reliable object manipulation with a robotic arm, a substantial advance over existing methods that paves the way for more capable and adaptable embodied artificial intelligence.
VLMs Enhance Robotic Spatial Reasoning
This research presents a system that pairs a vision-language model with specialized visual tools to perform complex spatial reasoning, particularly for robotic manipulation. Rather than relying solely on the model or on traditional computer vision, the system calls tools for object detection, depth estimation, and grasp planning to extract information that an image alone does not surface, then uses it to reason about spatial relationships, estimate grasp positions, and control a robot to grasp and place objects. Recognizing that existing models fall short on the precise spatial tasks robotics requires, the researchers designed DIRL to coordinate multiple tools through interactive exploration and feedback, allowing the model to autonomously discover effective tool-use patterns instead of following fixed tool pipelines or manual prompting. DIRL's first phase, teaching, combines demonstrations from a single-tool specialist, itself trained via interactive reinforcement learning, with traces from a system that uses all available tools.
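As a rough sketch of what that teaching-phase data assembly could look like, the snippet below mixes specialist demonstrations with all-tool traces into a single supervision set. The `Trace` record, the `build_teaching_set` helper, and the 50/50 mix ratio are illustrative assumptions, not details reported in the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:
    """One tool-use trajectory: the prompt, the sequence of tool calls
    with their outputs, and the final answer."""
    prompt: str
    steps: list   # [(tool_name, tool_args, tool_output), ...]
    answer: str
    source: str   # "specialist" or "all_tools"

def build_teaching_set(specialist_traces, all_tool_traces,
                       mix_ratio=0.5, seed=0):
    """Combine demonstrations from the single-tool specialist with traces
    from the all-tools system, in the spirit of DIRL's teaching phase.
    The mix_ratio is an assumed hyperparameter, not a published one."""
    rng = random.Random(seed)
    n = min(len(specialist_traces), len(all_tool_traces))
    k_spec = int(n * mix_ratio)
    teaching = (rng.sample(specialist_traces, k_spec)
                + rng.sample(all_tool_traces, n - k_spec))
    rng.shuffle(teaching)
    return teaching
```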
The second phase, exploration, refines multi-tool coordination through continued reinforcement learning. To meet the computational demands of interactive training, the team developed Toolshed, a platform that hosts computationally intensive computer vision tools as rapid on-demand services, enabling the VLM to communicate with external tools during both data collection and training. Because the learning loop sees real, stochastic tool outputs rather than idealized ones, DIRL encourages the model to reason about tool reliability and to discover better querying strategies. The result is a VLM equipped to use tools without pre-defined pipelines or extensive manual prompting.
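The paper's interface for Toolshed is not spelled out here, but a minimal sketch of the tool-as-a-service idea might look like the following: a stubbed depth estimator exposed behind a small HTTP endpoint that a training loop can query on demand. The route, payload format, and `estimate_depth` stub are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def estimate_depth(image_id: str) -> dict:
    """Stub standing in for a real monocular depth model; a Toolshed-like
    service would run a GPU-backed model here instead."""
    return {"image_id": image_id, "depth_map": "<tensor omitted>", "units": "m"}

class ToolHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect JSON like {"tool": "depth", "args": {"image_id": "frame_01"}}
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if body.get("tool") == "depth":
            payload = json.dumps(estimate_depth(**body.get("args", {}))).encode()
            self.send_response(200)
        else:
            payload = json.dumps({"error": "unknown tool"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ToolHandler).serve_forever()
```

A client would POST `{"tool": "depth", "args": {"image_id": "frame_01"}}` and receive JSON back; hosting each heavyweight model once and querying it over the network is one way to keep interactive rollouts fast.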
Experiments show that SpaceTools, the VLM trained with this method, achieves a significant performance gain on the RoboSpatial benchmark over standard fine-tuning and baseline reinforcement learning approaches, reaches state-of-the-art results on established spatial understanding benchmarks, and reliably controls a robotic arm in real-world manipulation. The broader finding is that VLMs can acquire complex spatial reasoning skills through learned coordination of tools such as depth estimators and segmentation models, without architectural changes or extensive fine-tuning data.
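To illustrate the kind of interaction loop that learned tool coordination requires, here is a hedged sketch in which the model alternates between emitting tool calls and reading their results until it commits to an answer. The `<tool>…</tool>` markup, the stub tools, and the `call_vlm` placeholder are inventions for illustration; SpaceTools' actual protocol may differ.

```python
import json
import re

# Stub tools standing in for real vision models; their outputs are the
# kind of (possibly noisy) results the model must learn to interpret.
TOOLS = {
    "depth": lambda image, point: {"depth_m": 0.42},
    "segment": lambda image, query: {"mask": "<omitted>"},
    "pose": lambda image, obj: {"xyz": [0.10, 0.20, 0.30]},
}

TOOL_CALL = re.compile(r"<tool>(\w+)\((.*?)\)</tool>", re.DOTALL)

def call_vlm(context: str) -> str:
    """Placeholder for the VLM. This stub issues one depth query, then
    answers; a real system would decode the reply from the model."""
    if "<result>" not in context:
        return '<tool>depth({"point": [120, 80]})</tool>'
    return "The mug is about 0.42 m from the camera."

def rollout(prompt: str, image=None, max_steps: int = 8) -> str:
    """Alternate model generation with tool execution until the model
    stops calling tools or the step budget runs out."""
    context = prompt
    for _ in range(max_steps):
        reply = call_vlm(context)
        match = TOOL_CALL.search(reply)
        if match is None:
            return reply  # no tool call -> treat reply as the final answer
        name, raw_args = match.group(1), match.group(2)
        result = TOOLS[name](image, **json.loads(raw_args or "{}"))
        context += f"{reply}\n<result>{json.dumps(result)}</result>\n"
    return call_vlm(context)  # force an answer once the budget is spent

print(rollout("How far away is the mug?"))
```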
Notably, the team found that training with a single powerful tool unexpectedly improves performance on other tasks, suggesting a capacity for skill transfer and out-of-domain generalization. The researchers acknowledge that overuse of tools and misinterpretation of nuanced tool outputs remain challenges for VLMs; future work will target these limitations to further refine tool integration and improve the reliability of spatial reasoning in complex environments.
👉 More information
🗞 SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
🧠 ArXiv: https://arxiv.org/abs/2512.04069
