Accurate and robust simultaneous localization and mapping, or SLAM, remains a core challenge for robots operating in real-world environments, and current methods often struggle to maintain geometric consistency. Yuchen Wu, Jiahe Li, and Fabio Tosi, along with colleagues from the University of Bologna and Beihang University, now present FoundationSLAM, a new system that overcomes these limitations by integrating depth estimation with geometric reasoning. The team achieves this through a novel network that generates geometry-aware correspondences, enabling consistent depth and pose estimation, and through a refinement mechanism that focuses on reliable data, ultimately creating a closed feedback loop between matching and optimization. The approach delivers significantly improved trajectory accuracy and dense reconstruction quality while operating in real time, representing a substantial advance towards practical and reliable visual SLAM systems.
The core innovation lies in bridging dense optical flow estimation with geometric reasoning, guided by depth information, to address inconsistencies present in previous flow-based methods.
FoundationSLAM: Real-Time Tracking and Mapping
FoundationSLAM achieves state-of-the-art performance in both tracking and mapping while operating in real time on standard RGB input, demonstrating strong generalization and robustness across challenging benchmarks. The team designed a Hybrid Flow Network that generates geometry-aware correspondences, enabling consistent depth and pose estimation across multiple keyframes, a crucial component of the system's success. To enforce global consistency, the researchers propose a Bi-Consistent Bundle Adjustment Layer, which simultaneously optimizes keyframe poses and depths using multi-view constraints, resulting in a tightly coupled framework. Furthermore, a Reliability-Aware Refinement mechanism dynamically adapts the flow update process, distinguishing between reliable and uncertain regions and creating a closed feedback loop between matching and optimization.
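The paper's exact bundle adjustment formulation is not reproduced in this summary, but the core idea of jointly optimizing a keyframe's pose and per-pixel depth against flow correspondences can be illustrated as a reliability-weighted reprojection residual. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name, the pinhole camera model, and the use of a simple per-pixel weight map are all placeholders for whatever the Bi-Consistent Bundle Adjustment Layer actually does internally.

```python
import numpy as np

def reprojection_residuals(K, T_ij, depth, flow, weights):
    """Weighted reprojection residuals for one keyframe pair (illustrative sketch).

    K       : (3, 3) camera intrinsics
    T_ij    : (4, 4) relative pose taking points from frame i to frame j
    depth   : (H, W) per-pixel depth in frame i
    flow    : (H, W, 2) predicted flow from i to j (the dense correspondence)
    weights : (H, W) per-pixel reliability in [0, 1]
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Back-project pixels of frame i to 3-D using the depth, move them into frame j.
    pts_i = depth[..., None] * (pix @ np.linalg.inv(K).T)
    pts_j = pts_i @ T_ij[:3, :3].T + T_ij[:3, 3]

    # Pinhole projection into frame j.
    proj = pts_j @ K.T
    proj = proj[..., :2] / proj[..., 2:3]

    # Residual: geometric projection vs. flow-predicted correspondence,
    # down-weighted where the flow is deemed unreliable.
    target = np.stack([u, v], axis=-1).astype(np.float64) + flow
    return weights[..., None] * (proj - target)
```

A solver (e.g. Gauss-Newton) would drive these residuals toward zero over all keyframe pairs, which is what couples pose, depth, and flow into one optimization.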
This refinement process lets the system focus computational resources on the regions needing the most correction, enhancing overall performance. Experiments show that FoundationSLAM outperforms existing monocular dense SLAM systems, including recent methods such as SLAM3R, on standard benchmarks including EuRoC, TartanAir, and Tanks and Temples, as well as various RGB-D datasets. The results demonstrate superior trajectory accuracy and dense reconstruction quality, establishing a new performance baseline in the field. The system runs in real time at 18 frames per second and generalizes well across diverse scenarios, underscoring its practical applicability. FoundationSLAM represents a significant advancement in robotic perception and autonomous navigation, offering a robust and efficient solution for building detailed maps of unknown environments.
Hybrid Flow Network Enables Robust SLAM
The system's success stems from its tight integration of depth information with geometric reasoning, which addresses the inconsistencies found in previous flow-based methods. In particular, the Bi-Consistent Bundle Adjustment Layer significantly improves tracking and reconstruction accuracy in challenging scenes, where geometric consistency is often compromised.
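The post does not spell out how reliable and uncertain regions are distinguished, but a common way to score dense correspondences, shown here purely as a hedged illustration, is a forward-backward flow consistency check: a correspondence from frame i to frame j is trusted only if following the backward flow returns close to the starting pixel. The function name and the Gaussian weighting are assumptions for this sketch, not details from the paper.

```python
import numpy as np

def consistency_weights(flow_fwd, flow_bwd, sigma=1.0):
    """Per-pixel reliability from forward-backward flow consistency (sketch).

    flow_fwd : (H, W, 2) flow from frame i to frame j
    flow_bwd : (H, W, 2) flow from frame j back to frame i
    Returns weights in (0, 1]: near 1 where the flows agree, near 0 where not.
    """
    H, W = flow_fwd.shape[:2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Where each pixel of frame i lands in frame j (rounded to index the backward flow).
    uj = np.clip(np.round(u + flow_fwd[..., 0]).astype(int), 0, W - 1)
    vj = np.clip(np.round(v + flow_fwd[..., 1]).astype(int), 0, H - 1)

    # A consistent correspondence maps back to its start: fwd + bwd(fwd(x)) ≈ 0.
    err = flow_fwd + flow_bwd[vj, uj]
    return np.exp(-np.sum(err**2, axis=-1) / (2.0 * sigma**2))
```

Weights of this kind could then feed a weighted optimization, so that uncertain matches contribute little to pose and depth estimates, one plausible way to realize the closed loop between matching and optimization described above.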
👉 More information
🗞 FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
🧠 ArXiv: https://arxiv.org/abs/2512.25008
