Multi-Image Spatial Reasoning Benchmark Reveals Gaps in AI Understanding

A new benchmark, MMSI-Bench, assesses multi-image spatial reasoning in large multimodal models. Evaluations of 34 models reveal significant performance gaps, with the strongest open-source model achieving 30% accuracy and OpenAI’s o3 reaching 40%, compared to human scores of 97%. Error analysis identifies key failure modes in spatial understanding.

The ability to reason about spatial relationships from multiple visual inputs remains a significant challenge for artificial intelligence systems operating in real-world environments. Current evaluation benchmarks largely focus on single-image analysis and fail to adequately test a system’s capacity for multi-image spatial intelligence. To address this limitation, researchers from the Shanghai AI Laboratory and Beijing Normal University have developed MMSI-Bench, a new visual question answering (VQA) benchmark designed to rigorously assess this crucial capability. The work, detailed in a forthcoming publication, is the result of a collaborative effort led by Jingli Lin, Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, and Tai Wang, with contributions from Jiangmiao Pang. The team dedicated over 300 hours to constructing MMSI-Bench, which comprises 1,000 multiple-choice questions derived from a dataset of over 120,000 images, each question accompanied by carefully constructed incorrect answers and a detailed reasoning pathway.

New Benchmark Reveals Limitations in AI Spatial Reasoning Across Multiple Images

Recent advances in multimodal large language models (MLLMs) – artificial intelligence systems that process both text and images – necessitate rigorous evaluation of their spatial reasoning abilities, particularly when analysing multiple images simultaneously. Researchers have introduced MMSI-Bench, a new visual question answering (VQA) benchmark designed to specifically assess multi-image spatial intelligence, addressing a noted deficiency in existing evaluation metrics.

MMSI-Bench comprises 1,000 multiple-choice questions constructed from a dataset of over 120,000 images. Its creation involved over 300 hours of manual effort by six researchers specialising in 3D vision, ensuring question clarity and the development of plausible, yet incorrect, answer options (distractors).
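To make the structure concrete, a single benchmark item pairs multiple images with a question, a set of answer options including distractors, and an annotated reasoning pathway. The sketch below is purely illustrative; the field names and example content are assumptions, not the benchmark’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class MMSIItem:
    """One multiple-choice item; field names are illustrative, not the official schema."""
    image_paths: list[str]   # two or more related views of a scene
    question: str            # spatial question spanning the images
    choices: dict[str, str]  # option letter -> answer text, including plausible distractors
    answer: str              # correct option letter
    reasoning: str           # annotated step-by-step reasoning pathway

# Hypothetical example item
item = MMSIItem(
    image_paths=["scene_view1.jpg", "scene_view2.jpg"],
    question="After moving from the viewpoint in image 1 to the viewpoint in image 2, "
             "is the red chair to your left or to your right?",
    choices={"A": "Left", "B": "Right", "C": "Directly behind", "D": "Cannot be determined"},
    answer="A",
    reasoning="Step 1: Locate the red chair in both views. Step 2: Infer the camera motion. ...",
)
```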

Evaluations of 34 open-source and proprietary MLLMs reveal a substantial performance gap. The highest-performing open-source model achieved approximately 30% accuracy on MMSI-Bench, while OpenAI’s o3 model attained 40%. This contrasts sharply with human performance, which averaged 97%, indicating that a significant gap remains before current AI systems match human capabilities in this domain.
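Accuracy here is simply the fraction of questions answered correctly. A minimal scoring loop, assuming a hypothetical `model.predict` interface that returns an option letter, might look like this:

```python
def score(model, items):
    """Exact-match accuracy over multiple-choice items (the interface is an assumption)."""
    correct = 0
    for item in items:
        # Hypothetical API: the model sees the images, question, and options,
        # and returns a single option letter such as "B".
        predicted = model.predict(item.image_paths, item.question, item.choices)
        correct += int(predicted == item.answer)
    return correct / len(items)
```

On a 1,000-question benchmark, the reported human accuracy of 97% corresponds to roughly 970 correct answers, versus about 300 for a 30%-accuracy model; if each question offers four options (an assumption here), random guessing would already yield 25%.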

To facilitate targeted improvements, the researchers leveraged the detailed, step-by-step reasoning annotations accompanying each question to create an automated error analysis pipeline (a rough sketch of such a pipeline follows the list below). This analysis identified four primary failure modes:

  • Grounding errors: Incorrect identification of objects within the images.
  • Overlap-matching and scene-reconstruction errors: Difficulties matching overlapping content across images and reconstructing a coherent overall scene.
  • Situation-transformation reasoning errors: Failures to account for changes occurring within a scene across multiple images.
  • Spatial-logic errors: Deficiencies in applying fundamental principles of spatial reasoning.

This granular breakdown provides a clear pathway for refining MLLM architectures and training strategies.

MMSI-Bench exposes a considerable performance gap between current MLLMs and human capabilities in multi-image spatial intelligence. The benchmark’s design specifically targets this capability, a crucial aspect often overlooked by existing evaluation metrics, which predominantly focus on single-image analysis.

The benchmark serves not only as a rigorous evaluation tool but also as a catalyst for future research. By pinpointing specific areas of weakness, it directs efforts towards developing more robust and spatially aware MLLMs capable of navigating and understanding complex visual environments. The open-source release of the benchmark and its associated resources should further encourage collaboration and innovation.

Researchers are actively exploring methods to address these limitations, including incorporating more complex spatial reasoning tasks, developing more robust training datasets, and designing novel MLLM architectures that explicitly model spatial relationships. Related work encompasses visual question answering datasets, object detection algorithms, and scene graph generation techniques – all contributing to the broader goal of enabling machines to understand and reason about the visual world. Future research directions include utilising reinforcement learning to train MLLMs for complex spatial reasoning, incorporating prior environmental knowledge into models, and employing explainable AI techniques to understand decision-making processes regarding spatial relationships.

The availability of MMSI-Bench and its associated resources should accelerate progress in multi-image spatial intelligence, enabling the development of more capable and reliable MLLMs. By providing a challenging and comprehensive benchmark, MMSI-Bench will serve as a valuable tool for evaluating and comparing different approaches to spatial reasoning, ultimately leading to AI systems that can better understand and interact with the visual world.

👉 More information
🗞 MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
🧠 DOI: https://doi.org/10.48550/arXiv.2505.23764
