Understanding how light interacts with surfaces is crucial for computer vision, yet current methods struggle to separate the complex interplay of material, illumination, and viewing angle in images. Kang Du, Yirui Guan from Tencent, and Zeyu Wang from The Hong Kong University of Science and Technology, along with their colleagues, address this challenge with a new approach to intrinsic image decomposition, a technique that disentangles these visual components. Their work introduces the Intrinsic Decomposition Transformer, a system that processes multiple images simultaneously to create consistent and realistic representations of reflectance, shading, and specular highlights, all in a single step. This physically grounded method not only improves the clarity and accuracy of decomposed images, but also significantly enhances consistency across different viewpoints, representing a substantial advance in the field of visual understanding and 3D scene reconstruction.
Albedo, Shading, and Neural Radiance Fields
The field of computer vision increasingly focuses on intrinsic image decomposition, the process of separating an image into its constituent parts: albedo (surface color/reflectance) and shading (illumination). This is crucial for applications like relighting, material editing, and scene understanding, and is closely linked to inverse rendering, which estimates a scene’s 3D geometry, materials, and lighting from 2D images. Recent progress utilizes neural radiance fields (NeRF) and Gaussian Splatting to represent scenes and perform inverse rendering. Several innovative methods are advancing this area, including IDArb, which handles an arbitrary number of input views and illuminations, and Sail, which employs a latent diffusion model for self-supervised albedo estimation.
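For readers new to the terminology, the classical Lambertian intrinsic image model (a common textbook formulation, not one tied to any single method cited above) factors each observed pixel into a per-pixel product of reflectance and shading:

$$ I(x) = A(x) \odot S(x) $$

where \(I\) is the observed image, \(A\) the albedo, \(S\) the shading, and \(\odot\) denotes elementwise multiplication. Specular highlights violate this purely multiplicative model, which is one motivation for the separate specular term used in the work discussed below.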
GS-ID decomposes illumination on Gaussian Splatting using a diffusion prior and parametric light-source optimization. Other notable techniques include NeRF and PixelNeRF, which represent scenes as neural radiance fields, and NeRV, which models neural reflectance and visibility fields for relighting and view synthesis. Further developments include NeRFactor, which factors shape and reflectance, and PhySG, which uses spherical Gaussians for physics-based material editing and relighting. Researchers also explore VGGT-SLAM and VGGT, which focus on dense RGB SLAM and visual geometry grounded transformers, respectively.
TensoIR, Intrinsic Image Diffusion, GS-IR, and GaussianShader contribute to inverse rendering and intrinsic image estimation using techniques such as tensorial representations, diffusion models, and 3D Gaussian Splatting. Key datasets driving this research include Hypersim, a photorealistic synthetic dataset for holistic indoor scene understanding, and a foundational ground-truth dataset for evaluating intrinsic image algorithms. Core technologies underpinning these advances are neural radiance fields, Gaussian Splatting, diffusion models, transformers, and differentiable ray tracing, which together enable continuous volumetric scene representations, faster rendering, probabilistic generative modeling, and end-to-end training of inverse rendering pipelines. Ultimately, this research aims to recover the materials and lighting of a scene accurately from images, enabling applications in virtual reality, augmented reality, and realistic image editing. The field combines deep learning with classical computer vision techniques to achieve increasingly accurate and realistic results.
Single-Step Multi-View Image Decomposition via Transformers
Scientists developed the Intrinsic Decomposition Transformer (IDT), a novel framework for decomposing multi-view images into diffuse reflectance, diffuse shading, and specular shading in a single step. This method avoids iterative generative sampling, a limitation of previous approaches, by using transformer-based attention to reason jointly across multiple input images and ensure view consistency. The team engineered a physically grounded image formation model that explicitly separates each image into diffuse reflectance, diffuse shading, and specular shading components, enabling interpretable and controllable decomposition of material and illumination effects. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components than existing methods. This improved accuracy and reliability supports applications that require precise material editing, relighting, and 3D reconstruction. The technique faithfully reconstructs original appearances and supports relighting by altering illumination while preserving consistent material properties across viewpoints.
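To make the diffuse/specular split concrete, the following is a minimal NumPy sketch of the kind of per-pixel recomposition and relighting this formation model allows. The function names, array layout, and the assumption of linear RGB are ours for illustration, not details taken from the paper.

```python
import numpy as np

def recompose(albedo, diffuse_shading, specular_shading):
    """Recombine intrinsic components into an image.

    Follows an additive diffuse + specular split: the diffuse term is
    albedo modulated by diffuse shading, and the specular term is added
    on top. Arrays are HxWx3 in linear RGB (an assumption of this
    sketch, not a statement about the paper's exact color space).
    """
    return albedo * diffuse_shading + specular_shading

def relight(albedo, new_diffuse_shading, new_specular_shading):
    """Relight a surface: keep the material (albedo) fixed and swap in
    shading produced under a different illumination."""
    return recompose(albedo, new_diffuse_shading, new_specular_shading)

if __name__ == "__main__":
    h, w = 4, 4
    albedo = np.full((h, w, 3), 0.6)                       # flat gray material
    shading = np.linspace(0.2, 1.0, w).reshape(1, w, 1) * np.ones((h, w, 3))
    specular = np.zeros((h, w, 3))
    specular[1, 2] = 0.3                                   # a single highlight

    img = recompose(albedo, shading, specular)             # original appearance
    relit = relight(albedo, shading[:, ::-1], specular)    # light from the other side
    print(img.shape, relit.shape)
```

The point of the example is the editability the decomposition buys: because material and illumination live in separate maps, swapping the shading changes the lighting without touching the albedo.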
Consistent Material and Lighting Across Viewpoints
The IDT separates images into their fundamental material and lighting components across multiple viewpoints, using transformer-based attention to analyze several images jointly and produce view-consistent intrinsic factors in a single processing step. By adopting a physically grounded image formation model that splits each image into diffuse reflectance, diffuse shading, and specular shading, the framework separates Lambertian from non-Lambertian light transport and disentangles material properties from illumination and viewing conditions. Beyond per-image quality, the method delivers substantial improvements in multi-view consistency, a critical factor for applications such as 3D reconstruction and scene understanding: because the transformer attends to all input views in one feed-forward pass, it avoids the iterative refinement of prior approaches while producing cleaner reflectance estimates, more coherent shading, and better-isolated specular reflections in both synthetic and real-world indoor scenes. This improved cross-view consistency marks a significant advance in multi-view intrinsic image decomposition and establishes a simple yet effective approach to scalable multi-view analysis, offering a foundation for more robust, physically grounded visual understanding across computer vision applications.
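To illustrate what joint cross-view reasoning can look like in practice, the sketch below concatenates patch tokens from several views and runs standard self-attention over them, so information flows between viewpoints in a single feed-forward pass. This is a generic PyTorch illustration of the idea; the module name, dimensions, and block design are assumptions, not the IDT architecture.

```python
import torch
import torch.nn as nn

class CrossViewAttentionBlock(nn.Module):
    """Illustrative transformer block that attends jointly across views.

    Tokens from all views are concatenated along the sequence axis, so
    self-attention can exchange information between viewpoints in one
    feed-forward pass. A sketch of joint multi-view attention in
    general, not the IDT architecture itself.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_views * tokens_per_view, dim)
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

if __name__ == "__main__":
    batch, views, tokens_per_view, dim = 1, 4, 196, 256
    # Hypothetical patch tokens extracted from 4 views of the same scene.
    feats = torch.randn(batch, views * tokens_per_view, dim)
    out = CrossViewAttentionBlock(dim)(feats)
    print(out.shape)  # torch.Size([1, 784, 256])
```

In a full pipeline, separate prediction heads would map the refined tokens back to per-view reflectance, diffuse-shading, and specular maps; the block above only illustrates the joint-attention step that makes those outputs consistent across views.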
👉 More information
🗞 IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition
🧠 ArXiv: https://arxiv.org/abs/2512.23667
