AnyView Achieves Dynamic View Synthesis from 2D, 3D and 4D Data Sources

Scientists are tackling the challenge of creating realistic videos of dynamic scenes from entirely new viewpoints. Basile Van Hoorick, Dian Chen, and Shun Iwase, all from Toyota Research Institute, together with Pavel Tokmakov, Muhammad Zubair Irshad, Igor Vasiljevic, and colleagues, present AnyView, a novel video generation framework designed to synthesise convincing footage even in complex, rapidly changing environments. This research is significant because current generative video models often falter when maintaining consistency across multiple views and over time, particularly in real-world scenarios. AnyView overcomes this limitation by learning a generalisable spatiotemporal representation from diverse data sources. Furthermore, the team introduces AnyViewBench, a demanding new benchmark that highlights AnyView’s superior performance in extreme dynamic view synthesis, where existing methods demonstrably struggle to deliver plausible results.

This breakthrough addresses a critical challenge in generative video, where existing models often struggle to produce realistic depictions of scenes viewed from drastically different angles or undergoing rapid motion. The research team achieved this by developing a diffusion-based system capable of synthesising videos from any chosen perspective, conditioned on a single input video, without relying on explicit scene reconstruction or computationally expensive optimisation techniques. AnyView leverages a generalist spatiotemporal implicit representation, trained on diverse data sources spanning monocular (2D), multi-view static (3D), and multi-view dynamic (4D) footage, to produce zero-shot novel videos from arbitrary camera locations and trajectories.
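As a rough illustration of this interface, the sketch below is hypothetical code (the names, channel layout, and tensor shapes are chosen for illustration and are not taken from the paper): it shows how a single input video and per-pixel ray maps for the source and target cameras could be stacked into one conditioning volume for a diffusion backbone.

```python
# Hypothetical sketch of the conditioning interface (names are illustrative,
# not from the paper): the generator sees the source video, per-pixel ray
# maps for both source and target cameras, and denoises the target video.
from dataclasses import dataclass
import numpy as np

@dataclass
class DynamicViewRequest:
    source_video: np.ndarray      # (T, H, W, 3) observed monocular clip
    source_rays: np.ndarray       # (T, H, W, 6) per-pixel ray origins + directions
    target_rays: np.ndarray       # (T, H, W, 6) rays for the novel trajectory

def conditioning_tensor(req: DynamicViewRequest) -> np.ndarray:
    """Stack RGB and ray-space channels into one conditioning volume.

    A diffusion backbone would consume this alongside noisy target latents;
    here we only show how the inputs line up channel-wise."""
    return np.concatenate(
        [req.source_video / 255.0, req.source_rays, req.target_rays], axis=-1
    )  # (T, H, W, 3 + 6 + 6)

T, H, W = 8, 64, 96
req = DynamicViewRequest(
    source_video=np.zeros((T, H, W, 3), dtype=np.float32),
    source_rays=np.zeros((T, H, W, 6), dtype=np.float32),
    target_rays=np.zeros((T, H, W, 6), dtype=np.float32),
)
print(conditioning_tensor(req).shape)  # (8, 64, 96, 15)
```

In this framing, camera information travels as dense per-pixel channels rather than a single pose vector, which is what allows arbitrary trajectories, and in principle arbitrary camera models, to share one conditioning pathway.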

The core innovation lies in AnyView’s ability to generate plausible and consistent videos even with limited visual information and extreme camera movements, a feat that currently challenges state-of-the-art methods. Researchers trained the model by combining large-scale internet data with multi-view geometry and camera control, utilising twelve multi-domain 3D and 4D datasets to enhance its understanding of scene dynamics and object permanence. This approach allows AnyView to implicitly learn scene completions, respecting geometry, physics, and object behaviour, even when there is minimal overlap between the input and target viewpoints. Experiments demonstrate that AnyView surpasses existing baselines, which often fail to extrapolate beyond the input view or require significant viewpoint overlap to maintain performance.
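The training mixture itself can be pictured with a simple weighted sampler. The sketch below is purely illustrative: the dataset names, modality split, and weights are placeholders, not the twelve datasets used in the paper.

```python
# Illustrative (not the authors' code): a weighted sampler that mixes
# monocular 2D clips, multi-view static 3D captures, and multi-view
# dynamic 4D recordings into one training stream. Names and weights
# below are placeholders.
import random

DATA_SOURCES = {
    "monocular_2d": {"weight": 0.5, "datasets": ["internet_video"]},
    "static_3d":    {"weight": 0.2, "datasets": ["object_scans", "indoor_scenes"]},
    "dynamic_4d":   {"weight": 0.3, "datasets": ["driving_rig", "robot_rig"]},
}

def sample_training_example(rng: random.Random) -> dict:
    """Pick a modality by weight, then a dataset within it.

    2D clips supply appearance and motion priors; 3D/4D sources supply
    multi-view geometry and camera supervision."""
    modalities = list(DATA_SOURCES)
    weights = [DATA_SOURCES[m]["weight"] for m in modalities]
    modality = rng.choices(modalities, weights=weights, k=1)[0]
    dataset = rng.choice(DATA_SOURCES[modality]["datasets"])
    return {"modality": modality, "dataset": dataset}

rng = random.Random(0)
print([sample_training_example(rng)["modality"] for _ in range(5)])
```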
To rigorously evaluate AnyView’s capabilities, the team also introduced AnyViewBench, a challenging new benchmark specifically designed for extreme dynamic view synthesis. This benchmark features diverse real-world scenarios, including driving, robotics, and human activity, with varying camera rigs and motion patterns, providing a standardised platform for assessing performance under demanding conditions. Results on AnyViewBench reveal that most existing methods drastically degrade in performance when faced with significant viewpoint changes, while AnyView consistently produces realistic, plausible, and spatiotemporally consistent videos from any viewpoint. This advancement opens doors for applications in robotics, world modelling, simulation, telepresence, VR/AR, and autonomous driving, where realistic and stable 4D video representations are crucial.
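A benchmark of this kind reduces to a loop over scenario categories with a generation call and a scoring call per clip. The harness below is a hypothetical sketch: only the scenario categories come from the article, while the per-frame metric and function signatures are placeholders.

```python
# Hypothetical evaluation harness (not the released benchmark code): loop
# over AnyViewBench-style scenario categories and score generated videos
# against held-out target views with a placeholder per-frame metric.
import numpy as np

SCENARIOS = ["driving", "robotics", "human_activity"]   # categories named in the article

def per_frame_score(pred: np.ndarray, ref: np.ndarray) -> float:
    """Placeholder metric (mean absolute error); a real benchmark would
    likely favour perceptual and temporal-consistency measures instead."""
    return float(np.abs(pred - ref).mean())

def evaluate(generate_fn, load_clip_fn) -> dict:
    """generate_fn(input_video, target_rays) -> video;
    load_clip_fn(scenario) -> (input_video, target_rays, target_video).
    Both are user-supplied stubs."""
    results = {}
    for scenario in SCENARIOS:
        input_video, target_rays, target_video = load_clip_fn(scenario)
        pred = generate_fn(input_video, target_rays)
        results[scenario] = per_frame_score(pred, target_video)
    return results

# Tiny smoke test with constant stubs.
dummy = lambda *_: np.zeros((8, 64, 96, 3))
loader = lambda s: (dummy(), np.zeros((8, 64, 96, 6)), dummy())
print(evaluate(dummy, loader))
```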

Furthermore, the study establishes a new paradigm for dynamic view synthesis, moving away from reliance on explicit 3D reconstructions and costly optimisation techniques. By employing a diffusion framework with minimal inductive biases, AnyView learns to synthesise unobserved content implicitly, guided by large-scale training data and dense ray-space conditioning. This allows the model to support any camera model, including non-pinhole cameras, and to generate videos with high fidelity and semantic consistency. The research demonstrates that a powerful prior over shapes, semantics, materials, and motion can be learned from limited information, enabling the creation of realistic and viewpoint-invariant video representations, a capability mirroring human visual perception and inference.
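The article does not spell out the exact encoding, but dense ray-space conditioning is commonly realised as a per-pixel map of ray origins and directions. The sketch below computes such a map for a pinhole camera purely as an example, since any camera model that yields one ray per pixel could be encoded the same way.

```python
# A minimal sketch of dense ray-space conditioning, assuming per-pixel ray
# maps; a pinhole camera is used only as an example camera model.
import numpy as np

def pinhole_ray_map(K: np.ndarray, cam_to_world: np.ndarray,
                    height: int, width: int) -> np.ndarray:
    """Return an (H, W, 6) map of ray origins and unit directions in world space."""
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)
    # Back-project through the intrinsics to get camera-frame directions.
    dirs_cam = pix @ np.linalg.inv(K).T                          # (H, W, 3)
    # Rotate into the world frame and normalise.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)        # (H, W, 6)

K = np.array([[100.0, 0, 48], [0, 100.0, 32], [0, 0, 1]])
pose = np.eye(4)
print(pinhole_ray_map(K, pose, 64, 96).shape)  # (64, 96, 6)
```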

Diffusion-Based AnyView Enables Realistic Dynamic View Synthesis

Scientists developed AnyView, a diffusion-based video generation framework designed to tackle dynamic view synthesis with minimal pre-defined assumptions about scene geometry. The research team trained a generalist spatiotemporal implicit representation on multiple data sources, including monocular 2D, multi-view static 3D, and multi-view dynamic 4D datasets, enabling zero-shot novel video creation from any camera location or trajectory. This approach circumvents the need for explicit scene reconstruction or computationally expensive test-time optimisation, a significant departure from existing methods, and the study presents it as pioneering consistent extreme monocular dynamic view synthesis in a single end-to-end model.

Experiments employed dense ray-space conditioning to provide camera parameters, supporting various camera models and allowing the network to learn unobserved content implicitly, guided by large-scale datasets. Researchers found that current baselines often fail to extrapolate beyond the input view, while AnyView preserves scene geometry, appearance, and dynamics even with drastically different target poses and incomplete visual observations. To rigorously evaluate AnyView, the team proposed AnyViewBench, a new benchmark specifically designed for extreme dynamic view synthesis in diverse real-world scenarios. This benchmark challenges existing methods, which typically degrade in performance when significant viewpoint changes occur, as they require substantial overlap between input and target views.
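Where that conditioning enters generation can be pictured with a generic denoising loop. The sketch below is a heavily simplified, hypothetical DDIM-style sampler with a dummy noise predictor, shown only to indicate how the ray-map conditioning volume accompanies the noisy target video at every step; the paper's actual diffusion formulation is not given here.

```python
# Hypothetical sampling loop: a toy deterministic DDIM-style refinement
# conditioned on the ray-space volume. The noise predictor is a stub.
import numpy as np

def predict_noise(x_t: np.ndarray, alpha_bar: float, cond: np.ndarray) -> np.ndarray:
    """Dummy stand-in for the learned denoiser; a real model would be a
    spatiotemporal network consuming x_t, the noise level, and the
    conditioning volume (source frames plus source/target ray maps)."""
    _ = (alpha_bar, cond)          # unused by this toy predictor
    return 0.1 * x_t

def ddim_sample(cond: np.ndarray, shape: tuple, steps: int = 10, seed: int = 0) -> np.ndarray:
    """Deterministic DDIM-style refinement from pure noise to a video tensor."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                 # start from Gaussian noise
    alpha_bars = np.linspace(0.01, 0.999, steps)   # toy noise schedule (low -> high signal)
    for i in range(steps):
        a_t = alpha_bars[i]
        a_next = alpha_bars[i + 1] if i + 1 < steps else 1.0
        eps = predict_noise(x, a_t, cond)
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)        # predicted clean video
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps    # deterministic update
    return x

cond = np.zeros((8, 64, 96, 15))                   # e.g. RGB + source/target ray maps
video = ddim_sample(cond, shape=(8, 64, 96, 3))
print(video.shape)                                  # (8, 64, 96, 3)
```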

In contrast, AnyView consistently generates realistic, plausible, and spatiotemporally consistent videos from any viewpoint, demonstrating a substantial advance in the field. The system tackles the problem of generating realistic 4D video representations by prioritising temporal stability and self-consistency over exact ground-truth correspondence. Furthermore, the work addresses the challenge of generating videos from arbitrary camera perspectives while the scene is in motion, acknowledging the inherently under-constrained nature of the task. The approach produces reasonable scene completions from a single input video, respecting scene geometry, physics, and object permanence, even with limited overlap between views. This robustness to shifting camera poses is crucial for applications such as robotics, world models, and autonomous driving.

👉 More information
🗞 AnyView: Synthesizing Any Novel View in Dynamic Scenes
🧠 ArXiv: https://arxiv.org/abs/2601.16982

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
