Advances in Video Diffusion Unlock Realistic Transparent Object Perception at 0.17 Seconds per Frame

Perceiving the depth and surface characteristics of transparent objects presents a significant challenge for computer vision systems, as traditional methods struggle with the way light bends, reflects, and passes through these materials. Shaocong Xu, Songlin Wei, Qizhe Wei, and colleagues address this problem by demonstrating that existing video generation technology already captures the physics of transparency. The team developed a large synthetic video dataset, TransPhy3D, and repurposed a powerful video diffusion model to accurately estimate both depth and surface normals for transparent and reflective objects, achieving state-of-the-art performance on multiple benchmarks. This approach, named DKT, not only improves the accuracy and stability of depth estimation in challenging scenarios but also raises robotic grasping success rates, suggesting that generative video models hold considerable promise for advancing real-world perception and manipulation.

Diffusion Models for High Fidelity 3D Generation

Recent research demonstrates significant progress in 3D reconstruction, scene understanding, and generative AI, with diffusion models emerging as a dominant technique. Several projects focus on creating 3D models from images and videos, including Octfusion, which uses diffusion to generate detailed 3D shapes, and Hi3dgen, which generates high-quality 3D geometry from images. Other advancements address the challenge of bridging simulated and real 3D data, such as Stable-sim2real, and establishing foundational models for dense 3D prediction, like Lotus. Improvements to existing 3D representations are also being explored, with Gaussianshader enhancing the realism of Gaussian Splatting through shading functions, and Gaussreg enabling efficient 3D point cloud alignment.

Researchers are also refining surface normal consistency in videos using diffusion techniques, as seen in Normalcrafter and Stablenormal. Furthermore, advancements in neural rendering, including Neuralangelo, Instant-NGP, and NeRF, continue to improve rendering speed and quality. Alongside 3D reconstruction, robotics and simulation are increasingly intertwined, with projects like Maniskill3 providing high-performance simulation environments for embodied AI. Efficient robot motion planning is addressed by curobo, while Anygrasp focuses on robust grasp perception. A key trend is the application of diffusion models to diverse tasks, alongside efficient fine-tuning of large language models using techniques like LoRA, and the development of large multimodal models such as Qwen2.5-vl.
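The LoRA technique mentioned above adapts a large pretrained model by learning only a low-rank update to each frozen weight matrix. As a minimal illustration (the dimensions, rank, and scaling here are hypothetical, not the paper's configuration), the effective weight becomes W + (alpha / r) * B @ A, where only A and B are trained:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-initialized, so the adapter starts inert

def lora_forward(x):
    # Base path plus scaled low-rank adapter path
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0, the adapted model exactly reproduces the pretrained one
assert np.allclose(lora_forward(x), W @ x)
```

Because B starts at zero, fine-tuning begins from the pretrained model's behavior and only r * (d_in + d_out) parameters per layer need gradients, which is what makes this kind of adaptation lightweight.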

Transparent and Reflective Depth Perception Breakthrough

Scientists have overcome longstanding challenges in computer vision and robotics by achieving a breakthrough in depth perception for transparent and reflective objects. Addressing the difficulties posed by refraction, reflection, and transmission, the team constructed TransPhy3D, a synthetic video corpus of 11,000 sequences rendered using physically based ray tracing. This dataset, comprising 1.32 million frames, specifically focuses on transparent objects, expanding existing resources designed for single-frame depth estimation. Experiments revealed that a model, termed DKT, achieves state-of-the-art performance in zero-shot testing on both synthetic and real-world benchmarks, including ClearPose, DREDS, and TransPhy3D-Test.
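A quick sanity check on the reported dataset figures shows the average clip length they imply (an average derived from the two totals, not a number stated in the article):

```python
# TransPhy3D totals as reported: 11,000 sequences, 1.32 million frames
sequences = 11_000
frames = 1_320_000

# Implied average number of frames per rendered sequence
frames_per_sequence = frames // sequences  # 120
```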

The results demonstrate significant improvements in accuracy and temporal consistency compared to existing image and video baselines, such as Depth-Anything-v2 and DepthCrafter. A compact 1.3 billion parameter version of DKT operates at 0.17 seconds per frame when processing 832×480 resolution video. Integrating DKT into a robotic grasping system boosted success rates across translucent, reflective, and diffuse surfaces, surpassing the performance of prior depth estimators. These findings suggest that generative video models possess an inherent understanding of transparency, enabling their repurposing for robust and temporally coherent perception in challenging real-world manipulation tasks.
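Benchmark comparisons of depth estimators like those above are conventionally reported with absolute relative error and the δ < 1.25 threshold accuracy. The sketch below implements these two standard metrics on toy data; it is an illustration of the common definitions, not the paper's evaluation code:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics: absolute relative error
    and the fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return abs_rel, delta1

# Toy example: four ground-truth depths (meters) and predictions
gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 3.0, 8.8])
abs_rel, delta1 = depth_metrics(pred, gt)
# abs_rel = 0.1125; delta1 = 0.75 (the 3.0 vs 4.0 pixel misses the threshold)
```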

Transparent Object Depth and Normal Estimation

This research presents significant advances in video depth and normal estimation, particularly for transparent and highly reflective objects, which pose difficulties for conventional perception systems. The team developed TransPhy3D, a novel synthetic video dataset containing over eleven thousand sequences of transparent and reflective scenes, built from a diverse collection of assets and physically based rendering techniques. This dataset addresses a critical need for training data in this area, enabling the creation of more robust algorithms. Building upon this dataset, researchers introduced DKT, a depth and normal estimation model efficiently finetuned from a large video diffusion model using a lightweight adaptation strategy.

Extensive evaluation across both synthetic and real-world benchmarks demonstrates that DKT achieves state-of-the-art performance in video depth and normal estimation, consistently outperforming existing methods. Integrating DKT into a robotic grasping system improved success rates when manipulating translucent, reflective, and diffuse surfaces. Because the model is trained on synthetic data, future research will focus on bridging the gap between synthetic and real-world domains, potentially through domain adaptation techniques or the incorporation of additional sensory information. These findings suggest that generative video priors can be effectively repurposed to enhance perception capabilities for challenging real-world manipulation tasks.

👉 More information
🗞 Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
🧠 ArXiv: https://arxiv.org/abs/2512.23705

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
