ScenDi Achieves Realistic 3D Urban Scene Generation with 2D Detail Enhancement

Researchers are tackling the complex problem of generating photorealistic and controllable urban scenes, a significant hurdle in fields like autonomous driving and virtual reality. Hanlei Guo, Jiahao Shao, and Xinya Chen, all from Zhejiang University, alongside Xiyang Tan, Sheng Miao (Zhejiang University), and Yujun Shen (Ant Group), present ScenDi, a novel approach that uniquely combines the strengths of both 3D and 2D generation techniques. Their work overcomes the limitations of existing methods, which often struggle with either realistic detail or precise camera control, by first creating a coarse 3D scene and then refining it with a 2D video diffusion model. This cascading process allows ScenDi to generate detailed, controllable urban environments, as demonstrated through compelling results on the Waymo and KITTI-360 datasets, paving the way for more immersive and realistic simulations.

3D Scene Generation via Diffusion Integration

Scientists have unveiled ScenDi, a novel method for generating realistic 3D urban scenes by seamlessly integrating both 3D and 2D diffusion models. The research addresses a critical limitation in current 3D scene generation techniques, which often struggle with either a lack of detailed appearance or limited camera control. ScenDi overcomes these challenges by first training a 3D latent diffusion model to generate 3D Gaussians, effectively creating a coarse 3D scene at a relatively low resolution. This initial 3D generation process can be further refined and controlled through optional inputs such as 3D bounding boxes, road maps, or even text prompts, allowing for targeted scene creation.
The team achieved a breakthrough by then training a 2D video diffusion model to enhance the appearance details of the scene, conditioned on the rendered images from the initial 3D Gaussians. By leveraging this coarse 3D scene as a guide, ScenDi generates high-fidelity scenes that accurately adhere to specified camera trajectories, a crucial feature for applications like autonomous driving simulation. Experiments conducted on the challenging Waymo and KITTI-360 datasets demonstrate the effectiveness of this cascaded 3D-to-2D approach, proving its ability to generate detailed and controllable urban environments. This innovative framework allows for high-fidelity urban scene generation without sacrificing the flexibility of camera control, outperforming methods that generate appearance solely in 2D.
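The paper itself is only summarized here, but the cascade it describes can be pictured in a few lines. The following is a minimal, hypothetical PyTorch sketch of the 3D-to-2D hand-off; the function names (sample_coarse_gaussians, render_gaussians, refine_with_video_diffusion) and all shapes are placeholders standing in for the authors' models, not their actual API.

```python
# Minimal sketch of a 3D-to-2D generation cascade (hypothetical names, not the
# authors' code): a coarse 3D stage produces Gaussian parameters, its renderings
# are then passed as conditioning to a 2D video refinement stage.
import torch

def sample_coarse_gaussians(num_gaussians: int = 4096) -> dict:
    """Stand-in for the 3D latent diffusion stage: returns coarse 3DGS params."""
    return {
        "means": torch.randn(num_gaussians, 3),        # xyz centers
        "scales": torch.rand(num_gaussians, 3) * 0.1,  # per-axis extent
        "colors": torch.rand(num_gaussians, 3),        # RGB
        "opacity": torch.rand(num_gaussians, 1),
    }

def render_gaussians(gaussians: dict, camera_poses: torch.Tensor) -> torch.Tensor:
    """Stand-in for a 3DGS rasterizer: one low-res RGB frame per camera pose."""
    num_frames = camera_poses.shape[0]
    return torch.rand(num_frames, 3, 180, 320)  # coarse, low-resolution renders

def refine_with_video_diffusion(coarse_frames: torch.Tensor) -> torch.Tensor:
    """Stand-in for the 2D video diffusion stage, conditioned on coarse renders."""
    # Upsampling here is only a placeholder for detail synthesis guided by the
    # coarse frames.
    return torch.nn.functional.interpolate(coarse_frames, scale_factor=4, mode="bilinear")

# A camera trajectory (e.g. a forward-driving path) decides where frames are
# rendered, which is how the cascade keeps camera control while the 2D stage
# adds appearance detail.
trajectory = torch.eye(4).unsqueeze(0).repeat(16, 1, 1)   # 16 dummy 4x4 poses
scene = sample_coarse_gaussians()
coarse = render_gaussians(scene, trajectory)
refined = refine_with_video_diffusion(coarse)
print(coarse.shape, refined.shape)  # (16, 3, 180, 320) -> (16, 3, 720, 1280)
```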

This work establishes a new paradigm in 3D scene generation, moving beyond purely 3D or 2D approaches to harness the complementary strengths of both modalities. Researchers developed a novel Voxel-to-3DGS VQ-VAE, learned from 2D supervision, which enables the sampling of 3D Gaussian Splatting scenes that render multi-view consistent images. However, initial renderings from the 3D latent diffusion model often lacked high-frequency details and struggled to capture distant regions, prompting the team to condition a 2D video diffusion model on the rendered RGB images. This conditioning process leverages the 2D model’s ability to refine details and synthesize regions beyond the initial 3D representation, resulting in significantly improved visual quality and realism.
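How the rendered frames condition the 2D model is not spelled out in this summary; channel-wise concatenation with the noisy input is one common mechanism, sketched below as an assumption. ConditionedDenoiser and its layer sizes are illustrative, and the timestep embedding is omitted for brevity.

```python
# Hypothetical sketch of conditioning a per-frame denoiser on coarse 3DGS renders
# via channel concatenation; the paper's actual conditioning mechanism may differ.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Input: noisy frame (3 ch) + rendered coarse frame (3 ch) = 6 channels.
        self.net = nn.Sequential(
            nn.Conv2d(6, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 3, 3, padding=1),   # predicted noise
        )

    def forward(self, noisy: torch.Tensor, coarse_render: torch.Tensor) -> torch.Tensor:
        # The coarse render anchors geometry and viewpoint; the network only has
        # to add high-frequency appearance detail and fill in distant regions.
        return self.net(torch.cat([noisy, coarse_render], dim=1))

denoiser = ConditionedDenoiser()
noisy = torch.randn(2, 3, 128, 256)        # two noisy frames
render = torch.rand(2, 3, 128, 256)        # matching coarse 3DGS renders
print(denoiser(noisy, render).shape)       # torch.Size([2, 3, 128, 256])
```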

The study reveals that ScenDi’s cascaded design not only enhances visual fidelity but also improves training efficiency and loop consistency compared to methods that rely solely on 2D generation. By establishing geometric and coarse appearance priors in 3D, and then refining details in 2D, ScenDi offers a robust and scalable solution for creating complex urban environments. The research opens exciting possibilities for applications in gaming, virtual reality, and, most notably, the development of more realistic and reliable simulations for autonomous vehicles, paving the way for safer and more efficient transportation systems.

3D Gaussian Splatting with Diffusion Control Offers Impressive Results

Scientists pioneered ScenDi, a novel method for generating realistic 3D urban scenes by integrating both 3D and 2D diffusion models. The research team addressed limitations in existing techniques, where 3D-only methods struggle with appearance details and 2D-only approaches compromise camera control, a crucial aspect for applications like driving simulation. Initially, they trained a 3D latent diffusion model to generate 3D Gaussians, effectively creating coarse 3D scenes at a relatively low resolution, enabling efficient rendering of initial images. This 3D Gaussian Splatting (3DGS) generation process was designed to be optionally conditioned by inputs such as 3D bounding boxes, road maps, or text prompts, providing flexible control over the generated environment.

The study then developed a 2D video diffusion model to enhance the appearance details of the scenes, conditioning it on the rendered images produced by the 3D latent diffusion model, a key innovation in their cascaded approach. Researchers harnessed the power of this 2D model to refine details and synthesize regions beyond the initial 3D rendering range, effectively bridging the gap between coarse 3D geometry and high-fidelity visual realism. This method achieves high-fidelity urban scene generation without sacrificing camera controllability, outperforming approaches that generate appearance purely in 2D space. To facilitate 3D scene creation, the team engineered a Voxel-to-3DGS VQ-VAE, learned from 2D supervision, and integrated it with a 3D diffusion model, leveraging an off-the-shelf depth estimator to construct input voxel grids. This innovative pipeline enables sampling of multi-view consistent images from the generated 3D Gaussians, providing a solid foundation for subsequent 2D refinement. Experiments conducted on the challenging Waymo and KITTI-360 datasets demonstrate the effectiveness of ScenDi, confirming its ability to generate high-quality urban scenes while maintaining accurate camera trajectories, a significant advancement in the field of 3D scene generation.

ScenDi Successfully Generates Detailed 3D Urban Environments

Scientists have developed ScenDi, a novel method for generating realistic 3D urban scenes by effectively integrating both 3D and 2D diffusion models. The research addresses the limitations of existing techniques, which often suffer from either a loss of detail or restricted camera control. Experiments demonstrate that ScenDi successfully generates desired scenes based on specified inputs while maintaining accurate camera trajectories. The team trained a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution, and this process can be optionally conditioned by inputs such as 3D bounding boxes, road maps, or text prompts.

Results demonstrate the effectiveness of a cascaded approach, beginning with coarse 3D scene generation and refining it with a 2D video diffusion model. The 3D latent diffusion model generates 3D Gaussians, and rendered images from this process serve as guidance for the 2D video model, enhancing appearance details and synthesizing regions beyond the initial 3D volume. Specifically, the study leverages a VQ-VAE to map voxel grids, derived from an off-the-shelf depth estimator, directly to 3D Gaussians, bypassing the need for extensive pre-processing of datasets. Measurements confirm that this approach allows for the sampling of 3D Gaussian Splatting scenes that render multi-view consistent images.
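The Voxel-to-3DGS VQ-VAE is not described layer by layer here, so the sketch below only illustrates the general shape of such a model: encode the colored voxel grid, quantize against a codebook, and decode per-voxel Gaussian parameters. The layer sizes, codebook size, and the 14-dimensional Gaussian parameterization are assumptions, not the authors' architecture.

```python
# Hypothetical sketch of a Voxel-to-3DGS VQ-VAE: encode a colored voxel grid,
# quantize to a discrete codebook, and decode per-voxel 3D Gaussian parameters.
import torch
import torch.nn as nn

class VoxelTo3DGSVQVAE(nn.Module):
    def __init__(self, codebook_size: int = 512, dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(4, dim, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(dim, dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder head predicts 14 Gaussian params per latent voxel:
        # 3 mean offset + 3 scale + 4 rotation quaternion + 1 opacity + 3 color.
        self.decoder = nn.Conv3d(dim, 14, 1)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        z = self.encoder(voxels)                          # (B, dim, D/4, D/4, D/4)
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, z.shape[1])
        # Nearest-codebook quantization (straight-through estimator omitted).
        dists = torch.cdist(flat, self.codebook.weight)
        q = self.codebook(dists.argmin(dim=1)).reshape(*z.shape[0:1], *z.shape[2:], -1)
        q = q.permute(0, 4, 1, 2, 3)
        gauss = self.decoder(q)                           # per-voxel Gaussian params
        return gauss.permute(0, 2, 3, 4, 1).reshape(voxels.shape[0], -1, 14)

model = VoxelTo3DGSVQVAE()
voxels = torch.rand(1, 4, 64, 64, 64)      # colored voxel grid (RGB + occupancy)
print(model(voxels).shape)                 # torch.Size([1, 4096, 14])
```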

The breakthrough delivers high-fidelity urban scenes while preserving accurate camera control, as validated on the challenging Waymo and KITTI-360 datasets. The team trained a 2D video diffusion model conditioned on the rendered RGB images from the 3D latent diffusion model, effectively refining details and extending the scene beyond the predefined 3D range. Tests prove that the 3D-to-2D diffusion cascades allow for generating high-fidelity urban scenes, and the method avoids the over-saturation issues often seen in other approaches. The work successfully integrates diverse control signals for flexible scene generation while maintaining an explicit 3D Gaussian Splatting backbone.
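The refinement stage can be pictured as an ordinary reverse-diffusion loop in which every step sees the coarse renders taken along the requested camera trajectory, which is why trajectory control survives the 2D stage. The sketch below uses a generic DDPM schedule and a placeholder noise predictor; neither is the paper's actual sampler.

```python
# Minimal sketch of sampling from the 2D refinement stage: a standard DDPM-style
# reverse loop in which every step is conditioned on the coarse 3DGS renders.
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def eps_model(x_t, t, coarse_frames):
    """Placeholder noise predictor; a real model would be the conditioned video
    diffusion network, with the coarse renders concatenated or cross-attended."""
    return 0.1 * (x_t - coarse_frames)

coarse_frames = torch.rand(8, 3, 128, 256)       # renders along the camera trajectory
x = torch.randn_like(coarse_frames)              # start from pure noise
for t in reversed(range(T)):
    eps = eps_model(x, t, coarse_frames)
    # Standard DDPM posterior mean; noise is re-added for every step but the last.
    x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
print(x.shape)   # refined frames, same count and order as the trajectory: (8, 3, 128, 256)
```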

Furthermore, the research details the architecture of the 3D latent diffusion model, which consists of a 3D VQ-VAE and a latent-space diffusion model. The 3D VQ-VAE reconstructs scenes in a feed-forward manner, taking a colored voxel grid as input and outputting a set of 3D Gaussian primitives. Measurements show that this approach achieves improved training efficiency and loop consistency compared to methods relying on semantic voxel grids or 2D renderings for appearance synthesis. The study’s method directly generates coarse 3D Gaussians and refines them with a 2D model, offering a significant advancement in urban scene generation technology.
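Putting the two pieces together, the 3D stage can be read as a latent-space denoiser over the VQ-VAE's voxel latents followed by a single decode to Gaussian primitives. The sketch below is only structural: latent_denoiser, gaussian_decoder, and the crude iterative update standing in for a real diffusion sampler are all placeholders.

```python
# Hypothetical sketch of the two-part 3D stage: a denoiser operating in the
# VQ-VAE's latent voxel space, whose samples are decoded into Gaussian params.
import torch
import torch.nn as nn

latent_shape = (1, 64, 16, 16, 16)        # (B, C, D, H, W) latent voxel grid

latent_denoiser = nn.Sequential(          # stand-in for a 3D U-Net over latents
    nn.Conv3d(64, 64, 3, padding=1), nn.SiLU(),
    nn.Conv3d(64, 64, 3, padding=1),
)
gaussian_decoder = nn.Conv3d(64, 14, 1)   # stand-in for the VQ-VAE decoder head

# Sampling: iteratively denoise a random latent grid, then decode once to 3DGS.
z = torch.randn(latent_shape)
for _ in range(20):                        # toy fixed-point update in place of
    z = z - 0.05 * latent_denoiser(z)      # a proper diffusion sampler
gaussians = gaussian_decoder(z)            # (1, 14, 16, 16, 16) per-voxel params
print(gaussians.shape)
```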

Scene Editing, Limitations, and Future Directions

Scientists have developed ScenDi, a new method for generating realistic 3D urban scenes by integrating both 3D and 2D generative techniques. The approach initially employs a 3D latent diffusion model to create 3D Gaussians, allowing for rendering of images at a relatively low resolution, and this process can be guided by inputs like bounding boxes, road maps, or text prompts. Subsequently, a 2D video diffusion model refines the appearance details, conditioned on the rendered images from the 3D Gaussians, resulting in detailed scenes with accurate camera trajectories. Researchers demonstrated the effectiveness of ScenDi on the Waymo and KITTI-360 datasets, showcasing its ability to generate diverse scenes with varying vehicle layouts and road structures.

The method also exhibits generative capabilities, successfully inpainting parts of existing scenes and altering their features, for example, changing the shape of a house, while maintaining overall scene coherence. However, the authors acknowledge that the quality of the 2D video diffusion stage is dependent on the initial 3D generation, and unsatisfactory 3D outputs can lead to artifacts. This work highlights a promising direction for complex scene generation by combining the strengths of both 3D and 2D diffusion models, fully unlocking their potential. Future research could focus on scaling up training data and model size to further enhance the visual quality of the generated 3D scenes, addressing the current limitation of reliance on the initial 3D generation quality. The integration of these diffusion paradigms offers a novel approach to creating detailed and controllable urban environments.
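One common way to realize the inpainting behavior described above is RePaint-style masking: re-impose the original latents outside the edit region after every sampling step so that only the masked part is regenerated. The sketch below assumes this mechanism; the paper may implement editing differently.

```python
# Hypothetical sketch of mask-based scene editing in the latent voxel space:
# latents outside the edit mask are kept from the original scene, while masked
# regions are re-sampled, so the rest of the scene stays coherent.
import torch

original_latents = torch.randn(1, 64, 16, 16, 16)    # latents of an existing scene
edit_mask = torch.zeros(1, 1, 16, 16, 16)
edit_mask[..., 6:10, 6:10, 6:10] = 1.0                # region to change (e.g. a house)

def toy_sampler_step(z):
    """Placeholder for one reverse-diffusion step over the latent grid."""
    return z - 0.05 * z

z = torch.randn_like(original_latents)
for _ in range(20):
    z = toy_sampler_step(z)
    # Re-impose the known latents outside the mask after every step, so the
    # untouched parts of the scene stay exactly as they were.
    z = edit_mask * z + (1 - edit_mask) * original_latents
print(z.shape)   # edited latent grid, ready to decode back to 3D Gaussians
```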

👉 More information
🗞 ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
🧠 ArXiv: https://arxiv.org/abs/2601.15221

Rohail T.

As a quantum scientist, I explore the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
