MetricAnything Achieves Scalable Depth Estimation Using 20M Noisy Image-Depth Pairs

Researchers are tackling the difficult problem of scaling metric depth estimation, a crucial component for applications ranging from autonomous driving to robotics, in the face of noisy and varied 3D data. Baorui Ma, Jiahui Yang, Donglin Di, and colleagues at Li Auto Inc present MetricAnything, a novel pretraining framework designed to learn accurate metric depth from diverse 3D sources without requiring manual calibration or specialised architectures. The work is significant because it demonstrates, for the first time, a clear scaling trend in metric depth learning, achieving state-of-the-art performance on a range of downstream tasks including depth completion and monocular depth estimation. Furthermore, integrating the pretrained model enhances the spatial intelligence of multimodal large language models, suggesting a pathway towards scalable and efficient real-world 3D perception.

Sparse Metric Prompting for Depth Estimation

Scientists have unveiled MetricAnything, a novel pretraining framework designed to overcome challenges in metric depth estimation, a crucial component for enabling advanced perception in artificial intelligence systems. The research demonstrates, for the first time, a clear scaling trend in this field, mirroring the successes seen in 2D vision foundation models. This breakthrough addresses the difficulty of learning accurate metric depth from noisy, diverse 3D data originating from various sensors and cameras, which often suffers from inherent biases and ambiguities. Central to this work is the introduction of the Sparse Metric Prompt, created by randomly masking a depth map so that only a scattered subset of its metric depth values is kept as input alongside the image.
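The article does not spell out the exact masking procedure, so the following is only a minimal sketch of how such a sparse prompt could be generated from a dense depth map, assuming a simple uniform random mask with a configurable keep ratio (the function name and ratio are illustrative, not from the released code):

```python
import torch

def sparse_metric_prompt(depth: torch.Tensor, keep_ratio: float = 0.01) -> torch.Tensor:
    """Randomly mask a dense depth map, keeping a small fraction of valid
    metric depth values as the sparse prompt (assumed procedure).

    depth: (H, W) metric depth in metres; zeros mark invalid pixels.
    Returns an (H, W) map that is zero everywhere except the kept pixels.
    """
    valid = depth > 0                                # prompt only with valid measurements
    keep = torch.rand_like(depth) < keep_ratio       # uniform random selection of pixels
    return torch.where(valid & keep, depth, torch.zeros_like(depth))
```

Because the prompt carries nothing but scattered absolute depth values, it makes no assumptions about which sensor or camera produced them, which is the property the decoupling described below relies on.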

This innovative approach creates a universal interface that effectively decouples spatial reasoning from the specific characteristics of individual sensors and cameras. By employing this method, the team achieved robust metric depth learning from heterogeneous sources without relying on manual adjustments or task-specific architectures. The framework was trained on a substantial dataset of approximately 20 million image-depth pairs, encompassing reconstructed, captured, and rendered 3D data from over 10,000 different camera models. Experiments reveal that the pretrained model excels in prompt-driven tasks, including depth completion, super-resolution, and radar-camera fusion.
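To make concrete why one interface can serve several prompt-driven tasks, the sketch below (a hypothetical helper, not the released API) shows how projected sensor returns, whether a dense LiDAR sweep or a handful of radar points, can be expressed as the same kind of sparse prompt map fed to the model alongside the image:

```python
import torch

def scatter_prompt(points_uv: torch.Tensor, depths_m: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Build a sparse metric prompt from projected sensor returns (illustrative).

    points_uv: (N, 2) pixel coordinates (u, v) of projected 3D points.
    depths_m:  (N,) metric depths in metres for those points.
    The same routine covers LiDAR prompts for depth completion and the far
    sparser returns used in radar-camera fusion; only N changes.
    """
    prompt = torch.zeros(h, w)
    u = points_uv[:, 0].round().long().clamp(0, w - 1)
    v = points_uv[:, 1].round().long().clamp(0, h - 1)
    prompt[v, u] = depths_m
    return prompt

# Illustrative usage: ~200 radar returns versus ~20,000 LiDAR returns feed the
# identical "image + sparse prompt" interface.
radar_prompt = scatter_prompt(torch.rand(200, 2) * 640.0, torch.rand(200) * 50.0, 480, 640)
```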

Furthermore, a distilled, prompt-free student model achieves state-of-the-art performance across a range of applications, such as monocular depth estimation, camera intrinsics recovery, single- and multi-view metric 3D reconstruction, and even vision-language-action (VLA) planning. The researchers also found that integrating the pretrained ViT from MetricAnything as a visual encoder significantly enhances the spatial intelligence of multimodal large language models. These findings establish that metric depth estimation can indeed benefit from the same scaling laws driving modern foundation models, paving the way for scalable and efficient real-world metric perception. To facilitate further research and development, the team has open-sourced MetricAnything, making it publicly available at http://metric-anything.github.io/metric-anything-io/. This open-access approach is intended to foster collaboration and accelerate progress in the field of 3D perception and its applications in robotics, augmented reality, and autonomous systems.

Sparse Metric Pretraining for 3D Depth Estimation

Scientists introduced MetricAnything, a pretraining framework designed to learn metric depth from noisy, diverse 3D sources without requiring manual, camera-specific modelling or task-specific architectures. The team engineered the Sparse Metric Prompt, created by randomly masking depth maps, which functions as a universal interface decoupling spatial reasoning from sensor and camera biases. This approach enabled the researchers to train a model on approximately 20 million image-depth pairs, encompassing reconstructed, captured, and rendered 3D data from over 10,000 camera models. The study demonstrates, for the first time, a clear scaling trend in metric depth estimation, with performance improving consistently as training is scaled up.

Researchers collected multi-source 3D data, integrating reconstructed, captured, and rendered datasets to maximise diversity and scale. The work employed a data pipeline that processed these 20 million image-depth pairs, ensuring compatibility across various 3D data formats and sensor modalities. To address the challenges of heterogeneous sensor noise and camera biases, the team applied the Sparse Metric Prompt technique during pretraining. This involved randomly masking portions of the input depth maps, forcing the model to learn robust depth representations independent of specific sensor characteristics.

The approach enables the model to generalise effectively across different data sources and camera configurations. The study further developed a prompt-free distillation method to transfer knowledge from the large pretrained model to a smaller, more efficient student network. This distillation process trains the student to mimic the outputs of the prompted teacher while receiving only the image, so that no explicit prompt or guidance signal is needed at inference time. Experiments employed a Vision Transformer (ViT) architecture for both the teacher and student models, leveraging its ability to capture long-range dependencies in the input data.
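Reading the distillation description at face value suggests a straightforward teacher-student loop; the sketch below is one plausible form of a single distillation step, assuming the frozen teacher receives the image plus a sparse prompt while the student receives the image alone (the model and optimiser objects are placeholders, and the L1 objective is only a stand-in for the paper's actual losses):

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimiser, image, sparse_prompt):
    """One prompt-free distillation step (assumed setup, not the released code)."""
    with torch.no_grad():
        target_depth = teacher(image, sparse_prompt)   # prompted teacher prediction
    pred_depth = student(image)                        # prompt-free student prediction
    loss = F.l1_loss(pred_depth, target_depth)         # placeholder imitation objective
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```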

The distilled model achieved state-of-the-art results on monocular depth estimation, camera intrinsics recovery, and multi-view 3D reconstruction, demonstrating the effectiveness of the distillation process. Furthermore, scientists harnessed the pretrained ViT as a visual encoder to significantly boost the spatial intelligence of Multimodal Large Language Models. This integration allowed the language model to better understand and reason about 3D scenes, improving its performance on tasks requiring spatial awareness. The team conducted extensive ablation studies, systematically varying data scale, network architecture, and training objectives to identify the key factors driving performance gains. These experiments revealed that scaling up the training data was crucial for achieving the observed scaling trend in metric depth estimation, establishing a solid foundation for versatile, data-driven metric perception.
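For readers wanting to picture the MLLM integration mentioned above, the schematic below wires a depth-pretrained ViT in as the visual encoder of a language model; every module name here is hypothetical, and the frozen-encoder-plus-linear-projector design is an assumption rather than the paper's stated recipe:

```python
import torch
import torch.nn as nn

class SpatialMLLM(nn.Module):
    """Schematic only: use a depth-pretrained ViT as an MLLM's visual encoder."""

    def __init__(self, depth_vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.depth_vit = depth_vit.eval()
        for p in self.depth_vit.parameters():
            p.requires_grad_(False)                    # keep the spatial features intact (assumption)
        self.projector = nn.Linear(vit_dim, llm_dim)   # lightweight adapter into the LLM's token space
        self.llm = llm

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.depth_vit(image)                        # (B, N, vit_dim) patch tokens
        prefix = self.projector(vision_tokens)                       # (B, N, llm_dim)
        return self.llm(torch.cat([prefix, text_embeds], dim=1))     # prefix-conditioned decoding
```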

Metric Depth Estimation Scales with Noisy, Heterogeneous Training Data

Scientists have introduced MetricAnything, a new pretraining framework designed to learn metric depth from noisy, diverse 3D sources without requiring manual, camera-specific modelling or task-specific architectures. The team trained on approximately 20 million image-depth pairs, encompassing reconstructed, captured, and rendered 3D data from over 10,000 camera models, and demonstrated, for the first time, a clear scaling trend in metric depth estimation. Central to this work is the Sparse Metric Prompt, created by randomly masking depth maps, which functions as a universal interface decoupling spatial reasoning from sensor and camera biases. Experiments revealed that the pretrained model excels at prompt-driven tasks, including depth completion, super-resolution, and radar-camera fusion.

Furthermore, a distilled prompt-free student model achieved state-of-the-art results in monocular depth estimation, camera intrinsics recovery, single- and multi-view metric 3D reconstruction, and VLA planning. The results also show that the framework’s ViT, when used as a visual encoder, significantly boosts the spatial intelligence capabilities of multimodal large language models. In one example, the researchers measured a direct distance of 1.6 metres between a window and a table in a video frame, demonstrating the model’s ability to provide accurate spatial information. Overall, the results demonstrate that metric depth estimation can benefit from the same scaling laws driving modern foundation models, establishing a new pathway for scalable and efficient real-world metric perception.
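Measurements like that window-to-table distance follow from metric depth plus camera intrinsics: back-project each pixel to a 3D point and take the Euclidean distance between them. The snippet below is a generic illustration of that geometry, not the model's own measurement code, and the pixel coordinates, depths, and intrinsics are made up:

```python
import numpy as np

def pixel_distance_m(p1, d1, p2, d2, K):
    """Metric distance between two image points, given their depths in metres
    and a pinhole intrinsics matrix K (generic back-projection)."""
    K_inv = np.linalg.inv(K)
    x1 = d1 * (K_inv @ np.array([p1[0], p1[1], 1.0]))   # back-project pixel 1 to camera space
    x2 = d2 * (K_inv @ np.array([p2[0], p2[1], 1.0]))   # back-project pixel 2 to camera space
    return float(np.linalg.norm(x1 - x2))

# Illustrative values only; a real measurement would use the model's predicted
# depths and the recovered camera intrinsics.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
print(pixel_distance_m((100, 80), 2.1, (420, 300), 1.3, K))
```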

The team’s approach aggregates diverse 3D data into per-pixel metric depth maps, forming the aforementioned dataset of roughly 20 million image-depth pairs. Sparse Metric Prompts, generated through random masking of depth maps, provide a minimal interface that decouples spatial reasoning from sensor specifics, enabling metric depth learning from heterogeneous sources. Tests show that the pretrained model and its distilled student generalise robustly across multiple downstream tasks, revealing a clear scaling trend and establishing a solid foundation for versatile, data-driven metric perception. Measurements confirm that the framework’s performance extends to unseen sensors, scenarios, and even extreme environmental conditions, showcasing its robustness and adaptability. The work includes detailed ablation studies examining the impact of data scaling, network architecture, runtime, training objectives, prompt settings, and balance weights, providing a comprehensive understanding of the model’s behaviour. The researchers have open-sourced MetricAnything to support community research.
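The article does not detail the aggregation pipeline mentioned above, but a standard way to turn reconstructed, captured, or rendered 3D data into per-pixel metric depth is to project points into the image with the camera intrinsics and keep the nearest depth at each pixel; the sketch below shows that generic projection, under the assumption of a simple pinhole model:

```python
import numpy as np

def project_to_depth_map(points_cam: np.ndarray, K: np.ndarray, h: int, w: int) -> np.ndarray:
    """Render a per-pixel metric depth map from 3D points in the camera frame
    (generic pinhole z-buffering; a sketch, not the paper's pipeline).

    points_cam: (N, 3) points in camera coordinates, in metres, with z > 0 in front.
    K:          (3, 3) pinhole intrinsics matrix.
    """
    pts = points_cam[points_cam[:, 2] > 1e-6]           # drop points behind the camera
    uvw = (K @ pts.T).T                                  # homogeneous image coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w))                             # zero marks "no measurement"
    for ui, vi, zi in zip(u[inside], v[inside], pts[inside, 2]):
        if depth[vi, ui] == 0.0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi                           # keep the nearest surface
    return depth
```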

Sparse Metric Prompts Enhance 3D Depth Estimation

Researchers have developed a new pretraining framework called MetricAnything, designed to improve the accuracy and scalability of metric depth estimation from diverse 3D data. The framework addresses challenges posed by sensor noise, camera biases, and ambiguity in noisy 3D data, which have traditionally hindered progress in this field. By employing the Sparse Metric Prompt, a technique involving the random masking of depth maps, the team decoupled spatial reasoning from camera- and sensor-specific limitations, creating a universal interface for learning metric depth. The study demonstrates a clear scaling trend in metric depth estimation, achieved through pretraining a model on approximately 20 million image-depth pairs sourced from reconstructed, captured, and rendered 3D data across over 10,000 camera models.

This pretrained model exhibits strong performance in various downstream tasks, including depth completion, super-resolution, and radar-camera fusion. Furthermore, a distilled version of the model achieved state-of-the-art results in monocular depth estimation, camera intrinsics recovery, metric 3D reconstruction, and vision-language-action (VLA) planning. Integrating the pretrained model as a visual encoder also enhanced the spatial intelligence of multimodal large language models. The authors acknowledge that the performance of the framework depends on the quality and diversity of the training data. Future research directions include applying MetricAnything to even more diverse datasets and investigating its potential for robotics and autonomous navigation. These findings establish that metric depth estimation can benefit from the scaling laws observed in modern foundation models, paving the way for more scalable and efficient real-world metric perception systems.

👉 More information
🗞 MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
🧠 ArXiv: https://arxiv.org/abs/2601.22054

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
