The creation of realistic and coherent depictions of human movement remains a significant challenge in computer vision. Chengfeng Zhao from HKUST, Jiazhi Shu from SCUT, and Yubo Zhao from HKUST, alongside colleagues, address this problem by demonstrating the intrinsic link between generating 3D human motions and corresponding 2D videos. Their research introduces CoMoVi, a co-generative framework that produces motion and video simultaneously within a unified denoising process, leveraging the strengths of pre-trained video diffusion models. The work is particularly noteworthy for introducing a new 2D human motion representation and a large-scale dataset, the CoMoVi Dataset, designed to support research into complex and diverse human movements. Through extensive experimentation, the team demonstrates CoMoVi’s effectiveness in advancing the state of the art in both 3D motion and video generation.
Background
Scientists demonstrate a significant advancement in the co-generation of 3D human motion and realistic video sequences, revealing an intrinsic coupling between these two processes. The research establishes that 3D motions provide crucial structural priors for video plausibility and consistency, while pre-trained video diffusion models (VDMs) offer strong generalization capabilities for motion generation, necessitating a unified approach. This work introduces CoMoVi, a co-generative framework that synchronously generates both 3D human motions and videos within a single denoising loop, effectively bridging the gap between these traditionally separate domains. The team achieved this breakthrough by first developing an effective 2D human motion representation capable of inheriting the powerful priors of pre-trained VDMs.
This innovative representation compresses 3D motion information into pixel space, leveraging the temporal coherence and denoising capabilities already present in advanced video models. Subsequently, researchers designed a dual-branch diffusion model, extending the Wan2.2-I2V-5B architecture, to couple the generation processes with mutual feature interaction and 3D-2D cross-attention mechanisms. This allows for a continuous exchange of information, enhancing both the realism of the generated videos and the naturalness of the corresponding 3D motions. Further innovation comes in the form of the CoMoVi Dataset, a large-scale collection of approximately 50,000 high-resolution real-world human videos.
Each video is meticulously annotated with both text descriptions and motion labels, encompassing a diverse range of challenging human actions. Extensive experiments conducted on datasets including Motion-X++ and the VBench benchmark demonstrate the effectiveness of CoMoVi in both 3D human motion and video generation tasks, consistently outperforming state-of-the-art text-to-motion and image-to-video models. Experiments show that CoMoVi’s co-generative approach overcomes the limitations of existing cascaded methods, which typically address motion and video generation as separate problems. By integrating these processes, the research unlocks enhanced generalization for motion generation and improved structural guidance for video creation. This work opens new possibilities for applications in character animation, virtual and augmented reality, gaming, and any field requiring realistic and coherent human movement within dynamic visual environments.
Co-generation of 3D Motion and Video
This work addresses the inherent coupling between 3D human motion generation and 2D video creation through a co-generative framework named CoMoVi. The researchers engineered a system that generates 3D motions and videos simultaneously within a unified denoising loop, leveraging the strengths of each modality. A key innovation is a 2D human motion representation designed to directly inherit the powerful priors established by pre-trained video diffusion models (VDMs). This representation compresses 3D motion information into pixel space, allowing it to integrate seamlessly with the video generation process.
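Conceptually, the unified denoising loop advances both latents with a shared timestep so that the 3D-motion branch and the video branch stay synchronised. The sketch below is a minimal illustration of that idea; the `model` and `scheduler` interfaces (diffusers-style) and the latent shapes are assumptions, not the released code.

```python
import torch

@torch.no_grad()
def co_denoise(model, scheduler, motion_latent, video_latent, text_emb, num_steps=50):
    """Denoise the motion-video latent and the RGB-video latent with a shared timestep."""
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # One forward pass predicts noise for both branches, allowing feature exchange inside.
        motion_noise, video_noise = model(motion_latent, video_latent, t, text_emb)
        motion_latent = scheduler.step(motion_noise, t, motion_latent).prev_sample
        video_latent = scheduler.step(video_noise, t, video_latent).prev_sample
    return motion_latent, video_latent
```

Because both branches are updated inside the same loop, neither output is conditioned on a frozen, already-finished version of the other, which is the key difference from cascaded pipelines.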
To achieve synchronous generation, the team designed a dual-branch diffusion model, extending the Wan2.2-I2V-5B architecture. This model couples the denoising processes for both 2D motion videos and RGB videos, employing mutual feature interaction to ensure consistent and coherent outputs. Furthermore, 3D-2D cross-attention modules were strategically inserted between diffusion blocks, enabling the generation of 3D human motion directly from features fused by 2D motion and RGB video latents. This cross-attention mechanism propagates the structural understanding of pre-trained VDMs to enhance 3D motion generation.
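A 3D-2D cross-attention module of the kind described here can be pictured as 3D motion tokens querying the fused 2D motion and RGB video features. The following is a hedged sketch; the dimensions, the fusion-by-concatenation choice, and the module names are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class MotionCrossAttention(nn.Module):
    """3D motion tokens attend to fused 2D motion-video and RGB-video features."""
    def __init__(self, dim_3d=512, dim_2d=1024, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim_3d)
        self.proj_2d = nn.Linear(dim_2d, dim_3d)   # project fused 2D features to the 3D token width
        self.attn = nn.MultiheadAttention(dim_3d, num_heads, batch_first=True)

    def forward(self, motion_tokens, motion_video_feat, rgb_video_feat):
        # Fuse the two 2D branches, here simply by concatenation along the token axis.
        fused = torch.cat([motion_video_feat, rgb_video_feat], dim=1)
        context = self.proj_2d(fused)
        q = self.norm(motion_tokens)
        out, _ = self.attn(q, context, context)     # 3D tokens query the 2D context
        return motion_tokens + out                  # residual update of the 3D motion tokens
```

Inserting such a block between diffusion blocks lets the structural knowledge learned by the pre-trained VDM flow into the 3D motion branch at every depth of the network.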
Crucially, the study curates the CoMoVi Dataset, a large-scale resource comprising approximately 50,000 high-resolution real-world human videos, each paired with text descriptions and motion labels that span a diverse range of challenging actions. This dataset serves as the foundation for training and evaluating the CoMoVi framework, allowing for robust assessment of its capabilities. Extensive experiments were conducted on the Motion-X++ dataset, the VBench benchmark, and the newly created CoMoVi Dataset, evaluating performance in both motion and video generation. The results show that CoMoVi consistently outperforms state-of-the-art text-to-motion and image-to-video models, achieving generalizable 3D human motion and realistic video generation concurrently. This methodological advancement unlocks improved control and fidelity in applications such as character animation and virtual reality.
Results
Scientists have achieved a breakthrough in the co-generation of 3D human motion and 2D video, demonstrating an intrinsic coupling between the two processes. The research team developed CoMoVi, a co-generative framework built on a dual-branch video diffusion model (VDM) that synchronously generates 3D motions and videos within a single denoising loop. Experiments revealed that encoding 3D motion into the same space as pre-trained VDMs enables the concurrent generation of realistic, plausible human movement and the corresponding video sequences. This approach bypasses the limitations of previous cascaded pipelines, which propagate errors between stages and neglect the inherent relationship between 3D motion and 2D frames.
The core of this work lies in the creation of a novel 2D human motion representation, designed to inherit the powerful prior knowledge embedded within pre-trained VDMs. This representation encodes a 3D parametric body model into pixel space, preserving crucial 3D information while incorporating body part segmentation semantics. Researchers compressed vertex normals and body part semantics of 3D SMPL meshes into RGB images, allowing the system to effectively leverage existing video generation capabilities. The team curated the CoMoVi Dataset, a large-scale resource containing approximately 50,000 high-resolution real-world human videos, each meticulously annotated with both text and motion labels, covering a diverse range of challenging human actions.
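One plausible way to realise such a colour encoding is to remap unit vertex normals into colour range and modulate them with a per-part factor before rasterising the SMPL mesh into each frame. The mapping below is an assumption for illustration; the paper's exact channel layout may differ.

```python
import numpy as np

def encode_vertex_colors(vertex_normals, part_ids, num_parts=24):
    """vertex_normals: (V, 3) unit normals; part_ids: (V,) SMPL body-part indices.

    Returns per-vertex RGB colours combining surface orientation and part semantics.
    """
    normals_rgb = 0.5 * (vertex_normals + 1.0)             # map [-1, 1] -> [0, 1]
    part_scale = (part_ids[:, None] + 1.0) / num_parts     # part semantics as a brightness factor
    colors = np.clip(normals_rgb * part_scale, 0.0, 1.0)   # fuse normal direction and part label
    return (colors * 255).astype(np.uint8)                 # per-vertex RGB for mesh rendering
```

Rendering the coloured mesh frame by frame yields the 2D motion video that lives in the same pixel space as the VDM's training data, which is what lets the pre-trained priors transfer.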
Comprehensive tests conducted on the Motion-X++ dataset, the VBench benchmark, and the newly created CoMoVi Dataset validate the effectiveness of the method in both motion and video generation. Quantitative results show that CoMoVi outperforms state-of-the-art text-to-motion and image-to-video models. The system generates generalizable 3D human motion and realistic human videos concurrently, marking a significant advancement in the field. Specifically, the framework uses a dual-branch diffusion model based on Wan2.2-I2V-5B, incorporating mutual feature interaction and 3D-2D cross-attention modules to achieve this synchronicity.
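The mutual feature interaction mentioned here can be pictured as a lightweight bidirectional exchange between the two branches at each diffusion block. The sketch below shows one such mechanism (learned projections plus residual addition) as an illustrative assumption, not the paper's exact design.

```python
import torch.nn as nn

class MutualFeatureInteraction(nn.Module):
    """Exchange features between the motion-video branch and the RGB-video branch."""
    def __init__(self, dim=1024):
        super().__init__()
        self.motion_to_rgb = nn.Linear(dim, dim)
        self.rgb_to_motion = nn.Linear(dim, dim)

    def forward(self, motion_feat, rgb_feat):
        # Each branch receives a residual contribution projected from the other branch.
        motion_out = motion_feat + self.rgb_to_motion(rgb_feat)
        rgb_out = rgb_feat + self.motion_to_rgb(motion_feat)
        return motion_out, rgb_out
```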
Further technical accomplishments include the development of a colour encoding strategy that compresses body surface normals and part semantics into RGB channels. This allows for the direct encoding of 3D information into a format compatible with existing 2D video diffusion models. The team estimates initial 3D human motion from a starting image using CameraHMR, rendering the 3D SMPL mesh as a 2D motion representation. This representation, alongside the initial image and a text description, is then fed into the dual-branch diffusion model to generate a complete sequence of 3D motions and corresponding 2D video frames.
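Putting the pieces together, inference can be summarised as the sketch below. `pose_estimator` (standing in for CameraHMR), `renderer`, and `co_model` are injected placeholders whose interfaces are assumptions, not real APIs.

```python
def generate_motion_and_video(start_image, text_prompt,
                              pose_estimator, renderer, co_model, num_frames=49):
    """Run the described pipeline with injected components.

    pose_estimator: maps an image to initial SMPL parameters (e.g. CameraHMR).
    renderer:       rasterises an SMPL mesh into the 2D motion representation.
    co_model:       the dual-branch diffusion model producing motions and frames together.
    """
    smpl_init = pose_estimator(start_image)     # 1. estimate initial 3D body from the first frame
    motion_frame_0 = renderer(smpl_init)        # 2. render the SMPL mesh as a 2D motion frame
    # 3. condition on the start image, first motion frame, and text, then co-denoise.
    return co_model(image=start_image, motion_init=motion_frame_0,
                    prompt=text_prompt, num_frames=num_frames)
```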
👉 More information
🗞 CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
🧠 ArXiv: https://arxiv.org/abs/2601.10632
