Researchers are tackling the high cost of traditional cinematic video production by synthesising realistic footage computationally. Kaiyi Huang, Yukun Huang, and Yu Li of The University of Hong Kong, together with Jianhong Bai of Zhejiang University, Xintao Wang of the Kling Team at Kuaishou Technology, and Zinan Lin of Microsoft Research, present CineScene, a framework for cinematic video generation with decoupled scene context. The work introduces an implicit 3D-aware scene representation that enables camera-controlled video synthesis with consistent scenes and dynamic subjects, reducing the need for expensive set construction. Leveraging a context conditioning mechanism and a new scene-decoupled dataset built in Unreal Engine 5, CineScene achieves state-of-the-art performance in generating high-quality, scene-consistent cinematic videos, even under substantial camera movement.
This research introduces a method for synthesising high-quality videos featuring dynamic subjects within a static environment, all while adhering to a user-defined camera trajectory.
CineScene leverages implicit 3D-aware scene representation, offering a flexible alternative to constructing physical sets and enabling diverse visual narratives without extensive on-set resources. The core innovation lies in a novel context conditioning mechanism that injects 3D-aware features implicitly into a pretrained text-to-video generation model.
By encoding scene images into visual representations using VGGT, the framework injects spatial priors through additional context concatenation, facilitating camera-controlled video synthesis with consistent scenes and dynamic subjects. This approach allows for the disentanglement of static scenes from dynamic content, jointly modelling a decoupled 3D structure and a dynamic subject for improved realism.
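The context conditioning mechanism described above can be pictured as follows: scene images are encoded into 3D-aware feature tokens, projected into the backbone's token space, and concatenated to the video tokens the pretrained model attends over. This is a minimal illustrative sketch, not the paper's implementation; all function names and shapes are hypothetical stand-ins (the actual encoder is VGGT).

```python
# Minimal sketch of context conditioning: scene images are encoded into
# 3D-aware feature tokens and concatenated to the video tokens that the
# pretrained text-to-video backbone attends over. All names and shapes
# are illustrative; the paper uses VGGT as the scene encoder.

def encode_scene(images, feat_dim=4):
    """Stand-in for a VGGT-style encoder: one feature token per image."""
    return [[float(sum(img)) % 7.0] * feat_dim for img in images]

def project(tokens):
    """Stand-in for the learned projection into the backbone's token space."""
    return [[v * 0.5 for v in tok] for tok in tokens]

def condition_on_scene(video_tokens, scene_images):
    """Concatenate projected scene-context tokens to the video tokens."""
    context = project(encode_scene(scene_images))
    return video_tokens + context  # backbone attends over the joint sequence

video_tokens = [[0.0] * 4 for _ in range(6)]  # 6 toy latent video tokens
scene_images = [[1, 2, 3], [4, 5, 6]]         # 2 toy "images"
joint = condition_on_scene(video_tokens, scene_images)
print(len(joint))  # 6 video tokens + 2 context tokens = 8
```

The key design point is that the scene context rides along as extra tokens rather than as an explicit 3D reconstruction, so the backbone can preserve scene structure while still generating dynamic content.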
To enhance model robustness, researchers implemented a random-shuffling strategy for input scene images during training, preventing reliance on fixed image ordering and promoting a robust correspondence between generated content and scene context. Addressing the lack of suitable training data, a scene-decoupled dataset was constructed using Unreal Engine 5, containing paired videos with and without dynamic subjects, panoramic images of the static scene, and corresponding camera trajectories.
Experiments demonstrate that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, successfully handling large camera movements and exhibiting generalisation across diverse environments. The framework overcomes limitations of previous methods, which were either restricted to static scenes or constrained to small view variations, generating scene-consistent videos with new dynamic subjects even under significant viewpoint changes.
Encoding Static Scenes and Injecting Context into Text-to-Video Generation
CineScene builds on VGGT encoding to synthesise high-quality videos with dynamic subjects in a static environment. It addresses limitations of existing cinematic video generation approaches by decoupling scene context and employing an implicit 3D-aware scene representation.
Initially, static scene images are encoded into visual features using VGGT, creating an implicit 3D scene representation that captures spatial layout and camera information. These encoded features are then projected and injected into a pretrained text-to-video generation model as additional context tokens, enabling the preservation of scene structure during the generation of dynamic content.
To improve alignment between scene images and their 3D encoding, a random-shuffling strategy is implemented during training, preventing reliance on fixed image order and encouraging robust correspondence between generated content and scene context. This approach allows the model to disentangle static scenes from dynamic content, jointly modelling a decoupled 3D structure and a dynamic subject.
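The random-shuffling strategy amounts to permuting the scene context images at each training step so the model cannot latch onto a fixed positional prior. A minimal sketch, with a hypothetical training-loop fragment:

```python
import random

# Sketch of the random-shuffling strategy: at each training step the scene
# context images are fed in a fresh random order, forcing the model to rely
# on the 3D-aware content of the images rather than their position in the
# sequence. The surrounding training loop is hypothetical.

def shuffled_context(scene_images, rng):
    order = list(range(len(scene_images)))
    rng.shuffle(order)                      # new permutation every step
    return [scene_images[i] for i in order]

rng = random.Random(0)
images = ["img_a", "img_b", "img_c", "img_d"]
step1 = shuffled_context(images, rng)
step2 = shuffled_context(images, rng)
# Same set of images each step, but the ordering varies across steps
print(sorted(step1) == sorted(images))  # True
```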
To facilitate training, a scene-decoupled dataset was constructed using Unreal Engine 5, comprising videos with and without dynamic subjects, panoramic images representing static scenes, and corresponding camera trajectories. This dataset provides essential supervision for training scene-consistent generative models.
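One sample in such a scene-decoupled dataset pairs a clip containing the dynamic subject with the same trajectory rendered over the static scene alone, plus panoramic stills and the camera poses. The field names below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative shape of one sample in a scene-decoupled dataset: a clip
# with the dynamic subject, the same trajectory without it, panoramic views
# of the static scene, and the per-frame camera trajectory. All field names
# and path formats are hypothetical.

@dataclass
class SceneDecoupledSample:
    video_with_subject: str             # clip containing the dynamic subject
    video_scene_only: str               # same trajectory, static scene only
    panorama_images: List[str]          # panoramic stills of the static scene
    camera_trajectory: List[List[float]] = field(default_factory=list)

sample = SceneDecoupledSample(
    video_with_subject="clips/0001_subject.mp4",
    video_scene_only="clips/0001_scene.mp4",
    panorama_images=["pano/0001_00.png", "pano/0001_01.png"],
    camera_trajectory=[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],  # e.g. xyz + Euler angles
)
print(len(sample.panorama_images))  # 2
```

Pairing the with-subject and scene-only renders is what supplies direct supervision for disentangling static background from dynamic foreground.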
The study demonstrates state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and exhibiting generalisation across diverse environments, achieving improvements over methods limited to static scenes or small view variations. CineScene successfully generates scene-consistent videos featuring new dynamic subjects, even with substantial viewpoint changes.
Quantitative assessment of generated video fidelity and consistency
CineScene achieves a matched-pixel score of 4617.51, demonstrating strong scene consistency in generated videos. Alongside this, the framework attains a CLIP-V score of 0.8633, indicating strong visual consistency across frames. Further quantitative analysis reports a PSNR of 14.5094 and an SSIM of 0.4133, measuring video quality and structural similarity.
The research demonstrates rotational error of 2.6825 degrees and translational error of 5.1460 units, quantifying the accuracy of camera pose estimation within the generated sequences. Camera motion consistency, measured by CamMC, reached 6.8819, confirming stable and coherent camera movements throughout the videos.
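The reported pose errors are typically computed as the geodesic angle between estimated and ground-truth rotations and the Euclidean distance between camera positions. A sketch of the standard formulation (the paper's exact protocol may differ):

```python
import math

# Standard camera pose error metrics: rotational error as the geodesic
# angle between two rotation matrices, translational error as Euclidean
# distance. This mirrors common practice in camera-control evaluation;
# the paper's exact evaluation protocol is not shown here.

def rotation_error_deg(R_est, R_gt):
    # trace(R_est^T @ R_gt), with 3x3 matrices as nested lists
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def translation_error(t_est, t_gt):
    return math.dist(t_est, t_gt)

I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Rz = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90-degree rotation about z
print(rotation_error_deg(I, Rz))               # 90.0
print(translation_error([0, 0, 0], [3, 4, 0])) # 5.0
```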
Textual alignment, assessed using CLIP-T, achieved a score of 0.3212, validating the model’s ability to integrate textual descriptions with visual outputs. Evaluation on the VBench metric resulted in a score of 0.8053, providing a comprehensive measure of overall video quality. These results were obtained using a test set of 300 samples from the Scene-Decoupled Video Dataset, alongside 50 out-of-domain samples from DiT360.
The model was trained for 10,000 steps with a batch size of 16, a learning rate of 5 × 10⁻⁵, and a timestep shift of 15, taking 20 scene context images, the target camera pose, and the target prompt as inputs. A random-shuffling strategy applied to the input scene images during training mitigated position-aware priors in the video generation model, improving alignment between pixel-level context and the implicit 3D scene representation. This approach surpasses loss-guided methods by decoupling static backgrounds from dynamic foregrounds, enabling vivid motion generation without the constraints imposed by static reconstruction losses.
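The training setup stated above, collected as a plain configuration dict for reference. Values come directly from the text; the key names are illustrative:

```python
# Training hyperparameters as reported in the text; key names are
# illustrative, not taken from the paper's codebase.
train_config = {
    "steps": 10_000,
    "batch_size": 16,
    "learning_rate": 5e-5,
    "timestep_shift": 15,
    "num_scene_context_images": 20,
    "inputs": ["scene_context_images", "target_camera_pose", "target_prompt"],
}
print(train_config["learning_rate"])  # 5e-05
```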
Decoupled Scene Encoding and Spatial Priors for Cinematic Video Synthesis
CineScene, a new framework for cinematic video generation, successfully creates scene-consistent videos with dynamic subjects and controlled camera movements. The system decouples scene context from subject motion, allowing for the synthesis of high-quality videos without requiring explicit three-dimensional scene modelling.
This is achieved by encoding static scene images into visual representations and injecting these as spatial priors into a pre-trained text-to-video generation model. The research demonstrates state-of-the-art performance in generating cinematic videos, effectively handling substantial camera movements and generalising across diverse environments.
A novel random-shuffling strategy applied to input scene images during training further enhances the model’s robustness. To facilitate this work, a new scene-decoupled video dataset was constructed using Unreal Engine 5, comprising videos with and without dynamic subjects, alongside corresponding camera trajectories and static scene imagery.
The authors acknowledge a limitation in the reliance on a pre-trained text-to-video model, which may introduce biases or constraints on the generated content. Future research could explore extending the framework to incorporate more complex scene interactions and subject behaviours, potentially leveraging larger and more varied datasets to improve generalisation and realism.
👉 More information
🗞 CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
🧠 ArXiv: https://arxiv.org/abs/2602.06959
