Multimodal large language models currently excel at many vision-related tasks, but they consistently struggle to accurately interpret spatial relationships within scenes. Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, and Cihang Xie of the University of California, Santa Cruz, together with Ronald Clark, present SpatialThinker, a new approach that significantly improves a model’s ability to reason about three-dimensional space. SpatialThinker trains models with reinforcement learning to build a detailed understanding of a scene, focusing on objects and their spatial connections, and rewards accurate reasoning about those relationships. The work contributes a new, high-quality dataset for spatial visual question answering and yields a model that surpasses existing approaches, including GPT-4o, achieving a substantial improvement in spatial understanding with limited training data and marking a key step towards more human-like visual reasoning in artificial intelligence.
Spatial Reasoning via Reinforcement Learning
This research details advancements in improving spatial reasoning capabilities in multimodal large language models (MLLMs). The team employed reinforcement learning (RL) to fine-tune these models, enabling them to better understand and reason about spatial relationships in images and text. The core of this approach involves rewarding the model for accurate spatial predictions and penalizing errors, effectively guiding its learning process. A crucial component of this work is a carefully designed reward function that focuses on accurate object detection and localization, understanding depth, height, and orientation in 3D scenes, and correctly identifying spatial relationships between objects.
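To make the reward idea above concrete, here is a minimal sketch of a reward that scores a model's final answer together with its predicted object boxes. The function names, the IoU threshold, and the hallucination penalty are illustrative assumptions, not the authors' actual implementation.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def spatial_reward(prediction, ground_truth, iou_threshold=0.5):
    """Reward a correct answer and well-localized objects; penalize extras."""
    reward = 0.0

    # Reward a correct final answer to the spatial question.
    if prediction.get("answer") == ground_truth["answer"]:
        reward += 1.0

    # Reward localization: fraction of annotated objects whose predicted box
    # overlaps the ground-truth box above the (assumed) IoU threshold.
    gt_boxes = ground_truth["boxes"]
    pred_boxes = prediction.get("boxes", {})
    matched = sum(1 for name, box in gt_boxes.items()
                  if name in pred_boxes
                  and iou(pred_boxes[name], box) >= iou_threshold)
    if gt_boxes:
        reward += matched / len(gt_boxes)

    # Penalize objects the model localizes that are not in the annotation.
    reward -= 0.1 * len(set(pred_boxes) - set(gt_boxes))
    return reward
```

A reward of this shape gives the policy credit both for answering correctly and for attending to the right regions, which is the behavior the training setup described above is meant to encourage.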
To facilitate training, the researchers created STVQA-7K, a dataset specifically designed for spatial visual question answering. Experiments demonstrate that this approach significantly improves performance on challenging spatial reasoning benchmarks, including CVBench and 3DSRBench. Qualitative analysis reveals that the trained models, SpatialThinker-3B and SpatialThinker-7B, demonstrate an improved ability to identify fine-grained object distinctions and accurately understand 3D relationships. Importantly, these models achieve competitive or superior performance compared to other open-source MLLMs and even proprietary models like GPT-4o and Gemini. Furthermore, the spatial reasoning skills learned through RL transfer to more abstract domains, confirming the effectiveness of the reward function.
Spatial Reasoning with Scene Graph Reinforcement Learning
Researchers have developed SpatialThinker, a novel multimodal large language model (MLLM) designed to enhance 3D spatial understanding. This model pioneers a new approach to training by integrating structured scene graph grounding with multi-step reasoning, simulating human-like spatial perception. The method involves constructing question-focused scene subgraphs, capturing relevant objects, their relationships, and localized coordinates within an image, and then reasoning over these structured representations to arrive at an answer. To support this work, the scientists created STVQA-7K, a high-quality spatial visual question answering dataset grounded in scene graphs, and developed a scalable pipeline capable of generating up to 108,000 samples.
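The sketch below illustrates what a question-focused scene subgraph of this kind could look like in code. The dataclasses, field names, and the keyword-matching heuristic are assumptions made for clarity, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    box: tuple  # (x1, y1, x2, y2) in image coordinates


@dataclass
class Relation:
    subject: str
    predicate: str  # e.g. "left of", "behind", "on top of"
    target: str


@dataclass
class SceneGraph:
    objects: dict    # object name -> SceneObject
    relations: list  # list of Relation

    def question_subgraph(self, question):
        """Keep only objects mentioned in the question and the relations
        connecting them (a simple keyword-matching heuristic)."""
        mentioned = {name for name in self.objects if name in question.lower()}
        relations = [r for r in self.relations
                     if r.subject in mentioned and r.target in mentioned]
        return SceneGraph({n: self.objects[n] for n in mentioned}, relations)


# Example: a question-focused subgraph for a simple left/right query.
graph = SceneGraph(
    objects={
        "mug": SceneObject("mug", (40, 220, 120, 310)),
        "laptop": SceneObject("laptop", (200, 150, 520, 330)),
        "desk": SceneObject("desk", (0, 300, 640, 480)),
    },
    relations=[
        Relation("mug", "left of", "laptop"),
        Relation("laptop", "on top of", "desk"),
    ],
)
subgraph = graph.question_subgraph("Is the mug to the left of the laptop?")
# subgraph now contains only "mug", "laptop", and the "left of" relation.
```

Reasoning over such a pruned subgraph keeps the model focused on the objects and relations the question actually depends on, rather than on everything in the image.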
The training process uses online reinforcement learning guided by a multi-objective reward framework that prioritizes structured reasoning, regional focus, answer accuracy, and precise localization, combining format rewards, count penalties, accuracy rewards, and CIoU-based spatial rewards. Experiments demonstrate that SpatialThinker-7B, trained on only 7,000 samples, outperforms supervised fine-tuning by 6% and conventional reinforcement learning baselines by 3.2% across twelve spatial understanding, real-world, and generic VQA benchmarks. Notably, the model surpasses GPT-4o by 3.4% on average and Claude 3.5 Sonnet by 10.1%, achieving a 12.1% gain on the 3DSRBench benchmark. The team highlights that dense spatial rewards nearly doubled the benefit of reinforcement learning, demonstrating the effectiveness of visually-grounded perception.
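As an illustration of how such a multi-objective reward could be assembled, the hedged sketch below combines a format check, an accuracy term, a CIoU-based localization term, and a count penalty. The weights, the response-format regular expression, and the per-object aggregation are assumptions; the CIoU computation itself follows the standard Complete-IoU definition.

```python
import math
import re


def ciou(pred, gt, eps=1e-7):
    """Complete IoU (CIoU) between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # Center-distance term, normalized by the enclosing box's diagonal.
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2
            + (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4
    c2 = ((max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2
          + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2) + eps

    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1] + eps))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1] + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


def total_reward(response, pred_boxes, gt_boxes, pred_answer, gt_answer,
                 w_format=0.25, w_accuracy=1.0, w_spatial=0.5, count_weight=0.1):
    """Weighted sum of the reward components named in the text (weights assumed)."""
    # Format reward: the response keeps the expected reasoning/answer layout.
    r_format = 1.0 if re.search(r"<think>.*</think>.*<answer>.*</answer>",
                                response, re.S) else 0.0
    # Accuracy reward: the final answer matches the ground truth.
    r_accuracy = 1.0 if pred_answer == gt_answer else 0.0
    # Spatial reward: mean CIoU over annotated objects (missing predictions score 0).
    scores = [max(0.0, ciou(pred_boxes[name], box)) if name in pred_boxes else 0.0
              for name, box in gt_boxes.items()]
    r_spatial = sum(scores) / len(scores) if scores else 0.0
    # Count penalty: discourage boxes for objects that are not in the annotation.
    penalty = count_weight * max(0, len(pred_boxes) - len(gt_boxes))
    return (w_format * r_format + w_accuracy * r_accuracy
            + w_spatial * r_spatial - penalty)
```

Because CIoU also penalizes center offset and aspect-ratio mismatch, it provides a denser localization signal than plain IoU, which is the kind of rich feedback the results above attribute the gains to.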
Spatial Reasoning with Limited Data
Scientists have developed SpatialThinker, a new multimodal large language model (MLLM) that significantly advances 3D spatial understanding while using limited training data. The research team focused on enabling the model to perceive and reason about space in a manner analogous to human cognition, simulating human-like spatial perception by building scene graphs of the relevant objects and their relationships. Dense spatial rewards then guide the model through complex scenarios toward accurate answers. A key contribution of this work is STVQA-7K, a high-quality spatial visual question answering dataset generated through a scalable data synthesis pipeline capable of producing up to 108,000 samples.
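A data synthesis pipeline of this kind can be pictured as turning annotated scene-graph relations into question-answer pairs. The sketch below is an assumed, simplified version with hand-written templates and swapped-argument negatives; it is not the actual STVQA-7K pipeline, which is described in the paper.

```python
import random

# Hand-written templates for a few asymmetric spatial predicates (assumed).
TEMPLATES = {
    "left of": "Is the {subj} to the left of the {obj}?",
    "behind": "Is the {subj} behind the {obj}?",
    "on top of": "Is the {subj} on top of the {obj}?",
}


def synthesize_spatial_qa(relations, p_negative=0.5, seed=0):
    """Turn annotated (subject, predicate, object) relations into yes/no
    spatial QA pairs, with optional swapped-argument negatives."""
    rng = random.Random(seed)
    samples = []
    for subj, predicate, obj in relations:
        template = TEMPLATES.get(predicate)
        if template is None:
            continue  # no template for this predicate
        # Positive sample straight from the annotation.
        samples.append({"question": template.format(subj=subj, obj=obj),
                        "answer": "yes", "objects": [subj, obj]})
        # Negative sample by swapping the arguments (assumes the predicate
        # is asymmetric, which holds for the templates above).
        if rng.random() < p_negative:
            samples.append({"question": template.format(subj=obj, obj=subj),
                            "answer": "no", "objects": [subj, obj]})
    return samples


# Example with two annotated relations from one image's scene graph.
qa = synthesize_spatial_qa([("mug", "left of", "laptop"),
                            ("laptop", "on top of", "desk")])
```

Generating candidates this way and then filtering them for quality is one plausible route to the scale reported above (up to 108,000 samples), from which a curated 7,000-sample training set can be drawn.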
By training SpatialThinker-7B on this dataset of only 7,000 samples, researchers outperformed both supervised fine-tuning, achieving a 6% improvement, and conventional reinforcement learning baselines, with a 3.2% gain. The model surpasses the performance of GPT-4o, demonstrating an average improvement of 3.4% across twelve spatial understanding, real-world, and generic VQA benchmarks, and a substantial 12.1% gain on the 3DSRBench benchmark.
Notably, SpatialThinker-7B, trained with dense spatial rewards, achieved a 7.2% average improvement across all benchmarks, nearly doubling the 4% gain achieved with vanilla reinforcement learning using sparse rewards. This demonstrates that the model benefits significantly from richer learning signals that incentivize visually-grounded perception. The results confirm that properly-guided reinforcement learning surpasses static patterns learned from much larger datasets, validating the effectiveness of the approach in enabling robust 3D spatial understanding with minimal data.
Spatial Reasoning with Dense Rewards
SpatialThinker represents a significant advance in the field of multimodal large language models, specifically addressing limitations in spatial understanding. Researchers developed a system trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning, enabling more robust 3D spatial comprehension. The team achieved this by constructing a scene graph of relevant objects and spatial relationships, then guiding the model’s reasoning process with dense spatial rewards. Notably, SpatialThinker surpasses both proprietary and open-source models on spatial, real-world, and general visual question answering benchmarks, even when trained on considerably less data.
The use of dense spatial rewards nearly doubled the performance gains compared to standard reinforcement learning approaches, highlighting the value of rich supervisory signals for spatial reasoning. While the current implementation relies on explicit scene graphs, the authors acknowledge this as a potential area for future development. Further research directions include extending the reward framework to encompass spatiotemporal reasoning and applying it to real-world tasks such as web navigation, ultimately aiming for unified policies capable of handling diverse visual tasks.
👉 More information
🗞 SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
🧠 ArXiv: https://arxiv.org/abs/2511.07403
