Predicting future events from video is a significant challenge for artificial intelligence, yet it holds immense potential for applications beyond entertainment, particularly in areas that require clear demonstration of complex procedures. Junhao Cheng and Jing Liao from City University of Hong Kong, together with Liang Hou and Xin Tao from Kuaishou Technology, propose a new approach that moves beyond textual answers: predicting what happens next in a video by generating a video response. Their system, named VANS, combines visual understanding with video generation capabilities and is driven by a novel method for aligning the two processes. By crafting a dedicated dataset and demonstrating state-of-the-art performance on benchmark tasks, the team unlocks a more intuitive and effective way for machines not just to tell us what will happen next, but to show us, paving the way for improved procedural learning and creative exploration.
Multimodal Video Understanding and Generation Research
Recent research advances video understanding and generation through multimodal models that integrate diverse information sources. These models analyze video content to extract meaning, create new videos from prompts, and combine text, images, video, and audio for improved performance. Key areas include instruction following, where models generate videos demonstrating tasks based on textual instructions, and reinforcement learning, used to refine video reasoning and generation capabilities. These advancements leverage large language models and vision transformers, relying on large-scale datasets for training and evaluation.
Techniques like reward-guided optimization and Group Relative Policy Optimization (GRPO) are employed to enhance video quality and coherence. Several models and projects are driving this progress, including COIN, a large dataset for instructional video analysis, and CogVideoX, a text-to-video diffusion model. DanceGRPO unlocks the potential of GRPO for visual generation, while DeepSeekMath pushes the boundaries of mathematical reasoning in language models. EventFormer predicts action-centric video events using a hierarchical attention transformer, and FVD (Fréchet Video Distance) provides a metric for evaluating video generation quality.
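For readers unfamiliar with GRPO, its core step is to score a group of sampled outputs and normalize each reward against the group's mean and standard deviation, rather than learning a separate value function. The snippet below is a minimal sketch of that group-relative advantage; the numbers and helper name are illustrative assumptions, not code from any of the papers above.

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO
# (Group Relative Policy Optimization), as introduced by DeepSeekMath and
# adapted to visual generation by works like DanceGRPO. The reward values
# below are illustrative only.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate generations scored by some reward model.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```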
GameFactory creates interactive videos for new games, and Gemini, a family of multimodal models from Google, demonstrates strong capabilities across text, images, audio, and video. HaploOmni unifies video understanding and generation within a single transformer, and Koala-36M improves consistency between video content and fine-grained conditions. MindOmni leverages reward-guided optimization to enhance reasoning and generation, and Omni-Video democratizes unified video processing. ShowHowTo generates step-by-step visual instructions, and Stitch-a-Recipe creates videos from multi-step descriptions. UniVideo unifies video understanding, generation, and editing, and Video-GPT generates videos using diffusion models.
Video-RTS improves video reasoning through reinforcement learning and test-time scaling, and Wan offers advanced large-scale video generation. Current research trends emphasize scaling up models and datasets, integrating multiple modalities, and focusing on instruction following. Developing robust evaluation metrics for video generation remains a significant challenge. These advancements promise to unlock new possibilities in areas such as robotics, virtual reality, and content creation.
VANS Dataset Creation for Video Prediction
Researchers have pioneered a new approach to next-event prediction, moving beyond textual answers to dynamic video demonstrations through a system called VANS. To facilitate this, they constructed VANS-Data-100K, a dedicated dataset comprising 100,000 procedural and predictive video samples, each containing an input video, a question, and corresponding multi-modal answers. The dataset curation began with collecting raw video data from sources like COIN and YouCook2 for procedural tasks and general-scene videos for predictive scenarios, ensuring a diverse range of visual demonstrations. These raw videos underwent segmentation, utilizing timestamps for procedural content and automated detection for predictive content, with short segments filtered to ensure completeness.
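As a rough illustration of the data described above, the sketch below shows what a single VANS-Data-100K-style sample and the short-segment filter might look like; the field names, the duration threshold, and the helper function are assumptions for illustration, not the released dataset's actual schema or curation code.

```python
# Hypothetical sample layout and short-segment filter for a VNEP dataset.
from dataclasses import dataclass

@dataclass
class VNEPSample:
    input_video: str      # path to the observed clip
    question: str         # e.g. "What is the next step?"
    answer_caption: str   # textual description of the next event
    answer_video: str     # path to the ground-truth next-event clip
    task_type: str        # "procedural" (COIN/YouCook2) or "predictive"

def keep_segment(start_s: float, end_s: float, min_duration_s: float = 2.0) -> bool:
    """Drop segments too short to show a complete event (threshold assumed)."""
    return (end_s - start_s) >= min_duration_s

segments = [(0.0, 1.2), (1.2, 6.5), (6.5, 14.0)]
kept = [seg for seg in segments if keep_segment(*seg)]
print(kept)  # [(1.2, 6.5), (6.5, 14.0)]
```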
To ensure data quality, the team employed Gemini-2.5-Flash as an automated filter, selecting clips that best aligned with captions for procedural data and generating detailed captions to identify high-quality clips for predictive data. Following clip selection, Gemini-2.5-Flash generated question-answer pairs, simulating diverse queries focused on logical next steps and “what-if” scenarios, alongside reasoning and ground-truth answers, all subject to logical verification. VANS integrates a Vision-Language Model (VLM) and a Video Diffusion Model (VDM), where the VLM processes the input question and video to generate a textual caption predicting the next event, serving as a guide for the VDM.
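The handoff between the two models can be pictured roughly as follows; the objects and method names below are placeholders standing in for the VLM and VDM components, not the actual VANS implementation.

```python
# Hedged sketch of the VLM-to-VDM pipeline described above. `vlm`,
# `vdm`, and `input_video` are hypothetical objects with placeholder
# methods; only the flow of information mirrors the article's description.
def predict_next_event_video(vlm, vdm, question, input_video, num_context_frames=4):
    # 1) Semantic reasoning: the VLM predicts what happens next as text.
    caption = vlm.generate_caption(question=question, video=input_video)

    # 2) Visual grounding: sample frames so the VDM can keep appearance
    #    (people, objects, scene) consistent with the observed clip.
    context_frames = input_video.sample_frames(num_context_frames)

    # 3) Generation: the VDM renders the predicted event, conditioned on
    #    both the caption and the low-level visual cues.
    return vdm.generate(text=caption, reference_frames=context_frames)
```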
The VDM is conditioned on both the generated caption and visual cues extracted from input frames, enabling fine-grained visual correspondence during novel scene generation. Recognizing that the VLM and VDM were initially optimized independently, the researchers implemented Joint-GRPO, a reinforcement learning algorithm, to coordinate the two models. This algorithm aligns the VLM and VDM, ensuring they function as a cohesive unit for next-event prediction, addressing the gap between semantic understanding and visual representation, and improving both accuracy and visual fidelity.
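One way to picture this coordination signal is as a blended reward that scores the VLM's caption and the VDM's video together, so neither model can improve its own term at the expense of the other. The weighting and scoring inputs below are assumptions for illustration, not the paper's exact reward design.

```python
# Illustrative joint reward: blend semantic accuracy of the caption with
# visual fidelity of the generated video. Weights are assumed, not from
# the paper.
def joint_reward(caption_score: float, video_score: float,
                 w_semantic: float = 0.5, w_visual: float = 0.5) -> float:
    """Blend semantic accuracy and visual fidelity into one scalar reward."""
    return w_semantic * caption_score + w_visual * video_score

# Example: a caption that matches the ground truth well but a video that
# drifts from the input scene yields only a middling reward, pushing both
# models to improve together rather than in isolation.
print(joint_reward(caption_score=0.9, video_score=0.4))  # 0.65
```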
Video Prediction via Vision-Language-Video Alignment
Scientists have pioneered Video-Next-Event Prediction (VNEP), a new approach to next-event reasoning that moves beyond textual answers to dynamic video demonstrations. This work introduces VANS, a system that aligns a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) using reinforcement learning, enabling the generation of video responses to questions about future events. The core of VANS is a Joint-GRPO strategy, which optimizes both models simultaneously, driving the VLM to produce captions that are easily visualized and guiding the VDM to generate videos faithful to those captions and the initial visual context. To facilitate this research, the team constructed VANS-Data-100K, a dedicated dataset comprising 100,000 video-question-answer triplets for training and evaluating models on the VNEP task.
This dataset is composed of 30,000 procedural samples and 70,000 predictive samples, sourced from datasets like COIN, YouCook2, and general-scene videos. The curation pipeline involves automated quality filtering using Gemini-2.5-Flash, ensuring high-quality video segments and semantically representative question-answer pairs. Experiments demonstrate that VANS achieves state-of-the-art performance in both event prediction accuracy and the quality of the generated videos. The VANS architecture conditions the VDM on both the VLM-generated caption and low-level visual cues extracted from sampled input frames.
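The quality-filtering step can be sketched as a simple threshold over an alignment score returned by a judge model; `score_caption_alignment` below stands in for a call to Gemini-2.5-Flash, and both it and the threshold are assumptions rather than the actual curation code.

```python
# Illustrative caption-alignment filter: keep only clips whose content
# matches their caption above an assumed threshold.
def filter_clips(clips, score_caption_alignment, threshold=0.8):
    """Return clips whose caption-to-content alignment passes the threshold."""
    return [clip for clip in clips
            if score_caption_alignment(clip["video"], clip["caption"]) >= threshold]

# Usage sketch with a dummy scorer standing in for the judge model.
dummy_scorer = lambda video, caption: 0.9 if "step" in caption else 0.5
clips = [{"video": "a.mp4", "caption": "whisk the eggs (step 2)"},
         {"video": "b.mp4", "caption": "unrelated b-roll"}]
print(filter_clips(clips, dummy_scorer))  # keeps only a.mp4
```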
This design allows for fine-grained visual correspondence while generating novel scenes. The team employed a Joint-GRPO strategy to address the challenge of optimizing the VLM and VDM in isolation, ensuring that the VLM’s descriptions lead to visually plausible videos and that the VDM effectively coordinates textual and visual information. Results confirm that this approach delivers consistent performance on both semantic accuracy and visual fidelity.
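One plausible reading of this coordination, assuming the two-stage structure stated in the section below, is sketched here. It reuses `joint_reward` and `group_relative_advantages` from the earlier sketches; all model objects, methods, and scoring helpers (`semantic_score`, `visual_score`) are hypothetical placeholders, not the paper's training code.

```python
# Hypothetical two-stage Joint-GRPO-style schedule: first tune the VLM so
# its captions are easy to visualize (frozen VDM in the loop), then adapt
# the VDM to follow those captions while staying faithful to the input.
def joint_grpo(vlm, vdm, batches, stage1_steps, stage2_steps, group_size=4):
    # Stage 1: caption tuning. Reward blends caption accuracy with the
    # fidelity of the video the frozen VDM renders from that caption.
    for _, batch in zip(range(stage1_steps), batches):
        for example in batch:
            captions = [vlm.sample_caption(example) for _ in range(group_size)]
            rewards = [joint_reward(semantic_score(c, example),
                                    visual_score(vdm.generate(c, example), example))
                       for c in captions]
            vlm.grpo_update(captions, group_relative_advantages(rewards))

    # Stage 2: video tuning. The (now fixed) VLM supplies the caption, and
    # the VDM is rewarded for videos consistent with caption and context.
    for _, batch in zip(range(stage2_steps), batches):
        for example in batch:
            caption = vlm.sample_caption(example)
            videos = [vdm.generate(caption, example) for _ in range(group_size)]
            rewards = [visual_score(v, example) for v in videos]
            vdm.grpo_update(videos, group_relative_advantages(rewards))
```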
Video Generation for Predictive Event Demonstration
This work introduces a new challenge in artificial intelligence, Video-Next-Event Prediction (VNEP), which extends the established task of next-event prediction by requiring systems to demonstrate answers through dynamic video generation rather than textual descriptions. Researchers addressed the complexities of this task by developing VANS, a system that integrates a Vision-Language Model (VLM) with a Video Diffusion Model (VDM). The core of VANS is a two-stage reinforcement learning strategy, Joint-GRPO, which coordinates these models to produce accurate and visually consistent video demonstrations. To facilitate training and evaluation, the team also constructed the VANS-Data-100K dataset, on which VANS achieves state-of-the-art performance in both event prediction accuracy and generated video quality.
👉 More information
🗞 Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
🧠 ArXiv: https://arxiv.org/abs/2511.16669
