One-Shot Refiner Achieves High-Fidelity Novel View Synthesis from Sparse Images

Researchers are tackling the challenge of creating realistic novel views from limited images, a crucial step towards immersive 3D experiences. Yitong Dong of Zhejiang University and Hangzhou VIVO Information Technology Co., Ltd., together with Qi Zhang and Minchao Jiang of Hangzhou VIVO Information Technology Co., Ltd. and Xidian University, et al., present a new framework that significantly boosts the fidelity of feed-forward 3D Gaussian Splatting methods, which are often hampered by low-resolution inputs and inconsistencies in generated details. Their ‘One-Shot Refiner’ employs a Dual-Domain Detail Perception Module and a feature-guided diffusion network to preserve high-frequency details and ensure structural consistency across views, a substantial advance in generating high-quality 3D scenes from sparse imagery. Its unified training strategy promises to unlock more realistic and detailed novel view synthesis than previously possible.

Resolving 3DGS limitations with dual-domain perception offers significant gains

Scientists have unveiled a framework for high-fidelity novel view synthesis (NVS) from sparse images, directly addressing limitations of current 3D Gaussian Splatting (3DGS) methods that utilise Vision Transformer (ViT) backbones. While ViT-based pipelines provide robust geometric priors, their computational cost restricts them to low-resolution inputs, and existing generative enhancement techniques often lack 3D awareness, producing structural inconsistencies across views, particularly in regions that are not directly observed. The core innovation is a two-pronged conditioning strategy: a dedicated guidance branch relays explicit geometric priors from the 3D backbone, anchoring the generative process to the scene’s true structure, while the input view serves as a reference condition, guaranteeing that the final output preserves fine-grained information.
The result is an end-to-end framework for high-quality novel view synthesis from unposed sparse inputs. The team also developed a Feature-Guided One-Step Diffusion architecture, a feature-guided diffusion network designed to preserve high-frequency details during restoration, and proposed an integrated training framework for end-to-end optimisation of the ViT reconstruction backbone and the diffusion-based image enhancement module. Experiments demonstrate that the method consistently maintains superior generation quality across multiple datasets, marking a significant advancement in the field.
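To make the two-pronged conditioning concrete, the sketch below shows one plausible way such a conditioner could be wired up in PyTorch. The module names, channel sizes, and fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-pronged conditioning described above, assuming a
# denoiser that accepts an extra conditioning feature map. Names, shapes, and
# the fusion scheme are illustrative placeholders, not the paper's code.
import torch
import torch.nn as nn


class TwoProngedConditioner(nn.Module):
    """Fuses (1) geometric priors rendered from the 3D backbone and
    (2) an encoded reference input view into one conditioning tensor."""

    def __init__(self, geo_channels=64, ref_channels=3, cond_channels=128):
        super().__init__()
        # Guidance branch: projects rendered geometry features (depth,
        # normals, splatted latents, ...) into the conditioning space.
        self.guidance_branch = nn.Sequential(
            nn.Conv2d(geo_channels, cond_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(cond_channels, cond_channels, 3, padding=1),
        )
        # Reference branch: encodes the input view to retain fine detail.
        self.reference_branch = nn.Sequential(
            nn.Conv2d(ref_channels, cond_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(cond_channels, cond_channels, 3, padding=1),
        )
        # Simple learned fusion of the two prongs.
        self.fuse = nn.Conv2d(2 * cond_channels, cond_channels, 1)

    def forward(self, geo_feats, ref_view):
        g = self.guidance_branch(geo_feats)   # anchors to scene structure
        r = self.reference_branch(ref_view)   # preserves fine-grained detail
        return self.fuse(torch.cat([g, r], dim=1))


if __name__ == "__main__":
    cond = TwoProngedConditioner()
    geo = torch.randn(1, 64, 128, 128)   # rendered geometric priors
    ref = torch.randn(1, 3, 128, 128)    # reference input view
    print(cond(geo, ref).shape)          # torch.Size([1, 128, 128, 128])
```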

This network does not rely on iterative sampling; instead, a one-step Stable Diffusion module performs rapid, high-fidelity target view synthesis. Crucially, the approach employs a unified training strategy that jointly optimises the ViT-based geometric backbone and the diffusion-based refinement module, streamlining the entire process. In experiments, unposed input images were fed into the system, 3D Gaussians were reconstructed within a canonical space, and the diffusion module was then used to generate novel views. The team implemented the two-pronged conditioning strategy by relaying explicit geometric priors from the 3D backbone to anchor the generative process to the scene’s true structure.
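As a rough illustration of this pipeline, the following sketch traces the flow from unposed inputs to a refined target view. The reconstructor, renderer, conditioner, and refiner are generic stand-ins assumed for this example and do not reflect the paper's actual interfaces.

```python
# Hedged sketch of the one-step refinement flow: a coarse novel view rendered
# from the reconstructed Gaussians is restored in a single forward pass of a
# refiner network, rather than by iterative diffusion sampling.
import torch


def synthesize_novel_view(reconstructor, renderer, refiner, conditioner,
                          input_views, target_camera):
    # 1. Reconstruct 3D Gaussians in a canonical space from unposed inputs.
    gaussians = reconstructor(input_views)

    # 2. Render a coarse target view plus geometric guidance maps
    #    (e.g. depth / feature renders) for the target camera.
    coarse_rgb, geo_feats = renderer(gaussians, target_camera)

    # 3. Build the two-pronged condition: geometry guidance + reference view.
    condition = conditioner(geo_feats, input_views[:, 0])

    # 4. One-step restoration: a single forward pass of the diffusion-based
    #    refiner maps the coarse render to a high-fidelity target view.
    with torch.no_grad():
        refined_rgb = refiner(coarse_rgb, condition)
    return refined_rgb


if __name__ == "__main__":
    # Dummy stand-ins just to exercise the control flow.
    B, V, C, H, W = 1, 2, 3, 128, 128
    views = torch.randn(B, V, C, H, W)
    demo = synthesize_novel_view(
        reconstructor=lambda v: {"xyz": torch.randn(B, 1000, 3)},
        renderer=lambda g, cam: (torch.randn(B, C, H, W),
                                 torch.randn(B, 64, H, W)),
        refiner=lambda img, cond: img,  # identity placeholder
        conditioner=lambda geo, ref: torch.cat([geo, ref], 1),
        input_views=views, target_camera=None)
    print(demo.shape)  # torch.Size([1, 3, 128, 128])
```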

Alongside the geometric guidance, the input view served as a reference condition, ensuring preservation of fine-grained information in the final output. This conditioning method yields geometrically consistent results across different viewpoints. The research also details the addition of extra features to the Gaussians, specifically designed to store high-frequency details, enhancing the overall visual fidelity of the generated images. The integrated training framework achieves end-to-end optimisation, allowing the ViT reconstruction backbone and the diffusion-based image enhancement module to work synergistically, yielding an end-to-end framework for high-quality NVS from unposed sparse inputs.
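The sketch below illustrates, under assumed names and loss weights, how Gaussians augmented with extra high-frequency feature channels and a joint objective over the coarse render and the refined image might look. It is a simplified reading of the paragraph above, not the authors' code.

```python
# Illustrative sketch of two ideas from the paragraph above: (a) Gaussians
# carrying extra per-primitive feature channels intended to store
# high-frequency detail, and (b) a joint loss so the reconstruction backbone
# and the diffusion refiner are optimised end to end. All names and weights
# here are assumptions for illustration only.
import torch
import torch.nn.functional as F
from dataclasses import dataclass


@dataclass
class AugmentedGaussians:
    xyz: torch.Tensor        # (N, 3)  positions in canonical space
    opacity: torch.Tensor    # (N, 1)
    scale: torch.Tensor      # (N, 3)
    rotation: torch.Tensor   # (N, 4)  quaternions
    rgb: torch.Tensor        # (N, 3)  base colour
    hf_feats: torch.Tensor   # (N, D)  extra channels for high-frequency detail


def joint_loss(coarse_render, refined_image, target, lam=0.5):
    """End-to-end objective: supervise both the backbone's coarse render and
    the refiner's output so gradients reach both modules."""
    recon = F.l1_loss(coarse_render, target)    # trains the 3DGS/ViT backbone
    refine = F.l1_loss(refined_image, target)   # trains the one-step refiner
    return recon + lam * refine


if __name__ == "__main__":
    g = AugmentedGaussians(
        xyz=torch.randn(1000, 3), opacity=torch.rand(1000, 1),
        scale=torch.rand(1000, 3), rotation=torch.randn(1000, 4),
        rgb=torch.rand(1000, 3), hf_feats=torch.randn(1000, 16))
    tgt = torch.rand(1, 3, 64, 64)
    print(joint_loss(torch.rand_like(tgt), torch.rand_like(tgt), tgt))
```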

👉 More information
🗞 One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion
🧠 ArXiv: https://arxiv.org/abs/2601.14161

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Superconductivity in Sc MTe Achieves 50% Enhancement Via Applied Pressure

January 26, 2026
Flexllm Achieves 12.68 Wikitext-2 PPL with Novel LLM Accelerator Design

January 26, 2026
Dtp Framework Achieves Higher Vision-Language Action Success Rates by Pruning Tokens

January 26, 2026