Pixel-Perfect Depth Foundation Model Advances Geometry Prediction Using Pixel-Space Diffusion

Accurate geometric reconstruction from images underpins advancements in fields such as robotics and augmented reality, yet current methods often struggle with inaccuracies and a loss of detail. Gangwei Xu, alongside Haotong Lin from Zhejiang University and Hongcheng Luo, Haiyang Sun, Bing Wang, and Guang Chen from Xiaomi EV, present a new approach to visual geometry estimation designed to overcome these limitations. Their research introduces pixel-perfect models capable of generating high-quality, clean point clouds by employing generative modelling directly within the pixel space. This work is significant because it tackles the computational demands of this process through innovations like Semantics-Prompted DiT and Cascade DiT, ultimately delivering superior performance and clearer reconstructions compared to existing state-of-the-art techniques for both static images and video. The team further extends their approach to video with PPVD, ensuring temporal consistency through a novel Semantics-Consistent DiT and reference-guided token propagation.

Leveraging generative modelling in the pixel space, the research introduces Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, the work proposes two key designs. First, Semantics-Prompted DiT incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details. Second, a Cascade DiT architecture progressively increases the number of image tokens, improving both efficiency and accuracy. To extend PPD to video, the researchers introduce a new Semantics-Consistent DiT together with reference-guided token propagation, yielding the temporally consistent Pixel-Perfect Video Depth (PPVD).
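The semantics-prompted design can be pictured as injecting features from a frozen vision encoder into the diffusion transformer's blocks. The PyTorch sketch below shows one plausible form of this prompting; the module name `SemanticsPromptedBlock`, the additive injection, and the choice of encoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticsPromptedBlock(nn.Module):
    """One transformer block whose tokens are 'prompted' with semantic features
    from a frozen vision foundation model (e.g. a DINO-style encoder; hypothetical here)."""
    def __init__(self, dim: int, sem_dim: int, num_heads: int = 8):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)  # map encoder features to the token width
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, dim)     noisy depth tokens in pixel space
        # sem: (B, N, sem_dim) semantic features, resampled to the token grid by the caller
        x = x + self.sem_proj(sem)  # prompt the diffusion tokens with global semantics
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

The intuition, per the text above, is that the semantic features anchor the global scene layout so the pixel-space diffusion can spend its capacity on fine-grained detail.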

Pixel-Perfect Depth via Diffusion Transformers Achieved

Scientists have developed Pixel-Perfect Depth (PPD), a novel framework for monocular depth estimation utilising pixel-space diffusion transformers. The research addresses the persistent problem of 'flying pixels' and loss of fine detail in existing geometry foundation models, delivering significantly cleaner point clouds. Experiments reveal that the team achieved up to a 78% gain on the NYUv2 AbsRel metric through the implementation of Semantics-Prompted Diffusion Transformers (SP-DiT), substantially improving overall performance in depth prediction. This demonstrates a marked improvement in preserving both global semantic coherence and fine-grained visual details within high-resolution images.

To further enhance efficiency and accuracy, the researchers introduced the Cascade DiT (Cas-DiT) architecture. This design progressively increases the number of image tokens, beginning with larger patch sizes to model global structures and transitioning to smaller patch sizes for fine-grained detail generation. Measurements confirm that Cas-DiT not only reduces computational costs but also improves the overall quality of depth estimation, effectively balancing efficiency with precision. The work successfully bypasses the need for a Variational Autoencoder (VAE), a common source of detail loss in previous generative depth models, resulting in sharper geometric edges and more faithful reconstruction of structures.

Extending PPD to video, the team created Pixel-Perfect Video Depth (PPVD), incorporating a new Semantics-Consistent DiT (SC-DiT). SC-DiT extracts temporally consistent semantics from a multi-view geometry foundation model and employs reference-guided token propagation, maintaining temporal coherence with minimal computational overhead. Tests show that this approach overcomes limitations of previous video depth estimation models, which often lacked joint spatiotemporal propagation and failed to account for camera motion. The resulting PPVD delivers stable, high-quality depth predictions for arbitrarily long video sequences, eliminating the flickering often seen in prior work.

The models consistently outperform other generative monocular and video depth estimation models, producing demonstrably cleaner point clouds. The results demonstrate the approach's suitability for applications requiring high-precision geometric data, such as robotics, autonomous driving, and augmented reality rendering. Code for the research is publicly available, facilitating further development and application of this pixel-perfect geometry estimation technology.
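To make the cascade idea described above concrete, the sketch below patchifies an input at a large patch size for a cheap global stage, then re-patchifies at a smaller patch size and conditions the fine stage on the upsampled coarse tokens. The patch sizes (16 then 8), stage depths, and the additive conditioning are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def patchify(x: torch.Tensor, patch: int) -> torch.Tensor:
    """Split (B, C, H, W) into (B, N, C*patch*patch) non-overlapping patch tokens."""
    B, C, H, W = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

class Stage(nn.Module):
    """A small stack of transformer blocks operating at one token granularity."""
    def __init__(self, in_dim: int, dim: int, depth: int):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor = None) -> torch.Tensor:
        h = self.embed(tokens)
        if cond is not None:
            h = h + cond  # inject coarse-stage context into the fine stage
        return self.blocks(h)

x = torch.randn(1, 3, 256, 256)  # stand-in for the network input (image plus noise)

# Stage 1: large patches -> 16x16 = 256 tokens, cheap modelling of global structure.
coarse = Stage(3 * 16 * 16, 512, depth=4)(patchify(x, 16))  # (1, 256, 512)

# Stage 2: small patches -> 32x32 = 1024 tokens for fine detail, conditioned on the
# coarse tokens repeated onto the finer grid (each coarse token covers a 2x2 block).
up = coarse.reshape(1, 16, 16, 512)
up = up.repeat_interleave(2, dim=1).repeat_interleave(2, dim=2).reshape(1, 1024, 512)
fine = Stage(3 * 8 * 8, 512, depth=4)(patchify(x, 8), cond=up)  # (1, 1024, 512)
```

The efficiency argument is that attention cost grows quadratically with token count, so spending most of the depth at the coarse granularity and only refining at the fine one is cheaper than running every block on the full-resolution token grid.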

Pixel Space Generative Models for Depth Estimation

Researchers have developed Pixel-Perfect Depth (PPD) for monocular depth estimation and Pixel-Perfect Video Depth (PPVD) for video depth estimation, both novel approaches to generating high-quality point clouds. These models utilise generative modelling directly within the pixel space, a departure from previous methods reliant on latent-space diffusion with Variational Autoencoders, and successfully avoid introducing 'flying pixels', a common artifact in reconstructed 3D geometry. The work addresses the computational demands of pixel-space diffusion through the Semantics-Prompted DiT and Semantics-Consistent DiT architectures, which enhance accuracy and temporal coherence respectively. The resulting PPD and PPVD models demonstrate state-of-the-art performance in generative monocular and video depth estimation, achieving cleaner point clouds than existing techniques.

The authors acknowledge that the computational complexity of pixel-space diffusion remains a challenge, despite the efficiency gains from the Cascade DiT architecture. Future research could explore further optimisation of these processes and investigate the application of these models to more complex and dynamic scenes, potentially broadening their utility in robotics and augmented reality applications.
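As a way to see what "generative modelling directly within the pixel space" means in practice, here is a minimal sampler sketch that denoises a full-resolution depth map with no VAE encode/decode step. The Euler-style update, the step count, and the `model(depth, image, t)` interface are all assumptions for illustration; the actual sampler used by PPD is not specified in this summary.

```python
import torch

@torch.no_grad()
def sample_pixel_space_depth(model, image: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Denoise a depth map at full image resolution, conditioned on `image` (B, 3, H, W).

    `model` is a hypothetical DiT-style denoiser predicting the update direction
    at noise level `t`; no latent encoder or decoder is involved anywhere.
    """
    B, _, H, W = image.shape
    depth = torch.randn(B, 1, H, W, device=image.device)  # start from pure noise in pixel space
    for i in reversed(range(1, steps + 1)):
        t = torch.full((B,), i / steps, device=image.device)
        v = model(depth, image, t)   # predicted velocity/noise estimate at this step
        depth = depth - v / steps    # simple Euler integration step along the schedule
    return depth                     # (B, 1, H, W): pixel-aligned depth, no VAE decoding
```

Because the output never passes through a learned decoder, detail loss of the kind the text attributes to VAEs simply has no place to occur, at the cost of running the diffusion at full resolution.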

👉 More information
🗞 Pixel-Perfect Visual Geometry Estimation
🧠 ArXiv: https://arxiv.org/abs/2601.05246

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Dissipative Continuous Time Crystals Achieve High-Precision Microwave Sensing with Rapid Frequency Switching
January 12, 2026

Hexagonal Boron Nitride Sensing Enables 16-Fold Faster Spin Relaxation Measurements
January 12, 2026

Machine Learning Achieves Coherence and Entanglement Estimation with Minimal Resources for Unknown States
January 12, 2026