LongLive Enables Real-time Interactive Long Video Generation with Frame-Level Autoregressive Models

Generating realistic, lengthy videos remains a significant challenge for artificial intelligence, but a new framework called LongLive offers a substantial leap forward in both quality and efficiency. Shuai Yang, Wei Huang, and Ruihang Chu, along with their colleagues, present a system that overcomes limitations of existing methods by generating videos frame by frame, allowing real-time interaction and control. Their approach refreshes cached information whenever a new prompt arrives, ensuring smooth transitions and consistent narratives, and it enables training on much longer videos than previously possible. As a result, LongLive can generate minute-long videos quickly, sustain fast frame rates, and support high-resolution output and quantized inference with minimal quality loss, opening up possibilities for dynamic content creation and interactive experiences.

Long Video Generation via Efficient Fine-tuning

Scientists have developed LongLive, a new method for generating long-form videos, up to four minutes in length, with improved consistency and responsiveness to instructions. Rather than training from scratch, LongLive efficiently fine-tunes a pre-trained model, keeping computational demands low. The core innovation lies in LoRA, a technique that trains only a small subset of the model’s parameters, combined with a novel KV re-caching mechanism that preserves information across extended video sequences, allowing the model to maintain consistency and adhere to prompts throughout the entire video. To overcome the challenges of training on long videos, the team implemented a streaming long tuning strategy that aligns training and inference, maintaining consistency and preventing quality degradation.
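To make the fine-tuning idea concrete, the sketch below shows the basic LoRA mechanism in PyTorch: the pre-trained weight is frozen and only a small low-rank update is trained. The class name, rank, and scaling values are illustrative choices, not the authors' actual configuration.

```python
# Minimal LoRA sketch in PyTorch (illustrative; not the authors' implementation).
# The frozen pretrained weight is augmented with a trainable low-rank update B @ A,
# so only a small fraction of parameters receives gradients during fine-tuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Pretrained path plus the low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
y = layer(torch.randn(2, 1024))              # forward pass works as a drop-in Linear
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # only the LoRA factors are updated
```

In a setup like this, only the two low-rank factors receive gradients, which is what keeps the number of trainable parameters, and hence the fine-tuning cost, small.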

This approach allows the model to learn long-range dependencies effectively, improving fidelity and enabling efficient inference. To further boost speed, the researchers combined short window attention with a frame-level attention sink, significantly accelerating processing while preserving long-range consistency and performance. Evaluations on the VBench benchmark demonstrate competitive performance, and a user study with thirty participants confirmed improved video quality across several dimensions, including overall quality, motion, instruction following, and visual appeal. The method successfully generates videos up to 240 seconds long, demonstrating its capability for creating extended narratives. These findings highlight the potential of LongLive for applications requiring high-quality, long-form video content.
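The short-window-plus-sink idea can be illustrated with a toy attention mask: each position attends to a short causal window of recent tokens plus a handful of early "sink" tokens that remain visible throughout. The window and sink sizes below are arbitrary values for illustration, not the paper's settings.

```python
# Toy attention mask combining a short causal window with a frame-level "sink"
# (illustrative sketch; window/sink sizes here are arbitrary, not the paper's values).
import torch

def window_sink_mask(num_tokens: int, window: int, sink: int) -> torch.Tensor:
    """True = attention allowed. Each query attends to the last `window` tokens
    (causally) plus the first `sink` tokens, which stay visible for the whole sequence."""
    q = torch.arange(num_tokens).unsqueeze(1)   # query positions
    k = torch.arange(num_tokens).unsqueeze(0)   # key positions
    causal = k <= q
    in_window = (q - k) < window
    in_sink = k < sink
    return causal & (in_window | in_sink)

mask = window_sink_mask(num_tokens=12, window=4, sink=2)
print(mask.int())
# Row i shows which past positions token i can attend to:
# a short local window plus the always-visible sink tokens at the start.
```

The sink tokens act as a compact anchor to the beginning of the video, which is why a short local window can still preserve long-range consistency.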

Real-time Long Video Generation with Dynamic Control

LongLive represents a significant advancement in long video generation, addressing key challenges in both efficiency and visual quality. Researchers developed a frame-level autoregressive framework capable of generating minute-long videos, a substantial increase over previous methods, and achieved this with a comparatively small model size. The system utilizes a novel KV-recache technique to maintain visual consistency during interactive prompt changes, allowing for dynamic content creation where users can guide the narrative in real time. A key innovation lies in the combination of short window attention with a frame-level attention sink, which preserves long-range consistency while accelerating the generation process.
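As a rough illustration of frame-level autoregressive generation, the sketch below produces one frame at a time while growing a cache of past context. The tiny model and its interface are hypothetical stand-ins, not LongLive's architecture.

```python
# Schematic frame-by-frame autoregressive generation with a growing cache
# (illustrative sketch; the model and interface here are hypothetical).
import torch

class TinyFrameModel(torch.nn.Module):
    """Stand-in for a causal video generator: maps (prompt, cached past, previous frame) to the next frame."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.step = torch.nn.Linear(2 * dim, dim)

    def forward(self, prompt_emb, cache, prev_frame):
        context = cache.mean(dim=0) if cache is not None else torch.zeros_like(prompt_emb)
        return torch.tanh(self.step(torch.cat([prompt_emb + context, prev_frame], dim=-1)))

def generate(model, prompt_emb, num_frames=8, dim=64):
    cache, frames = None, []
    frame = torch.zeros(dim)
    for _ in range(num_frames):
        frame = model(prompt_emb, cache, frame)                     # one frame per step
        entry = frame.detach().unsqueeze(0)
        cache = entry if cache is None else torch.cat([cache, entry])  # grow the KV-like cache
        frames.append(frame)
    return torch.stack(frames)

model = TinyFrameModel()
video = generate(model, prompt_emb=torch.randn(64))
print(video.shape)  # (num_frames, dim): each frame conditioned on the cached past
```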

The team demonstrated that training directly on long videos is essential not only for achieving high-quality outputs but also for enabling efficient inference. During testing, LongLive achieved 20.7 frames per second on a single GPU and successfully generated videos up to 240 seconds in length, while maintaining high fidelity and temporal coherence. Furthermore, the system supports INT8 quantization, reducing model size with minimal impact on performance. The authors acknowledge that the quality-efficiency trade-off remains a consideration, as larger attention windows improve consistency but increase computational demands. However, the frame-sink mechanism effectively mitigates this issue, achieving near-optimal consistency with a reduced memory footprint. Future work will likely focus on further optimizing this balance and exploring the potential for even longer video generation.
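For the INT8 point, the following minimal sketch applies PyTorch's built-in dynamic quantization to a dummy network; LongLive's actual quantization recipe may differ, and the model here is only a stand-in.

```python
# Minimal INT8 dynamic quantization sketch using PyTorch's built-in API
# (applied to a dummy model; not LongLive's actual quantization recipe).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()

# Replace Linear layers with INT8-weight versions; activations are quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
with torch.no_grad():
    print("fp32 output norm:", model(x).norm().item())
    print("int8 output norm:", quantized(x).norm().item())  # close, with a much smaller weight footprint
```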

Real-time Long Video Generation with KV Recache

Scientists have presented LongLive, a new framework for generating long videos in real-time, achieving minute-long generation with a 1.3-billion-parameter model in just 32 GPU-days. This breakthrough addresses key challenges in long video generation, specifically balancing efficiency and quality, and enabling interactive control over the generated content. The system sustains 20.7 frames per second on a single H100 GPU, and supports videos up to 240 seconds in length on the same hardware.

A central innovation is the KV recache mechanism, which addresses inconsistencies that arise when switching prompts during video generation. By recomputing the cached information from the already-generated frames and the new prompt, the system erases residual information from the previous prompt while maintaining visual continuity, ensuring smooth transitions and accurate adherence to new prompts without visual discontinuities or semantic misalignment. The team integrated this recaching operation into the training loop, further aligning training and inference behavior. To further improve long-term consistency, the researchers developed a streaming long tuning strategy.
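A schematic of the recache step might look like the following: when the prompt changes, the cache built under the old prompt is discarded and rebuilt by re-encoding the frames generated so far under the new prompt. The function names and tensor shapes are invented for illustration.

```python
# Schematic of KV re-caching at a prompt switch (illustrative; names and shapes are hypothetical).
import torch

def encode_kv(frames, prompt_emb):
    """Stand-in for re-running the causal model over past frames conditioned on a prompt,
    returning fresh key/value pairs for the cache."""
    return [(f + prompt_emb, f * 0.5 + prompt_emb) for f in frames]

def switch_prompt(generated_frames, old_cache, new_prompt_emb):
    # Discard the cache built under the old prompt and recompute it from the frames
    # generated so far plus the *new* prompt: visual continuity is kept (same frames),
    # while residual semantics of the old prompt are erased.
    return encode_kv(generated_frames, new_prompt_emb)

frames = [torch.randn(8) for _ in range(4)]                   # frames generated under prompt A
old_cache = encode_kv(frames, torch.zeros(8))                 # cache conditioned on prompt A
new_cache = switch_prompt(frames, old_cache, torch.ones(8))   # re-cache for prompt B
print(len(new_cache), "cache entries rebuilt under the new prompt")
```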

This streaming long tuning strategy trains the model on extended sequences by repeatedly conditioning on its own predictions, exposing it to the types of errors that accumulate during long video generation. By supervising each short segment of the extended sequence, the team mitigated content drift and improved fidelity. The team also demonstrated that LongLive supports INT8-quantized inference with minimal quality loss, enhancing its practical applicability.
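A toy version of such a streaming tuning loop is sketched below: the model rolls forward on its own outputs, and each short segment is supervised before streaming continues. The tiny recurrent model, targets, and loss are placeholders, not the authors' training setup.

```python
# Sketch of streaming long tuning (illustrative; model, targets, and loss are placeholders).
# The model is rolled out on its own predictions, and each short segment is supervised,
# so errors that build up over long horizons also appear during training.
import torch
import torch.nn as nn

dim, seg_len, num_segments = 32, 4, 6
model = nn.GRUCell(dim, dim)                          # stand-in for the causal video generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

reference = torch.randn(num_segments, seg_len, dim)   # stand-in supervision targets
state = torch.zeros(1, dim)
frame = torch.zeros(1, dim)

for seg in range(num_segments):
    preds = []
    for t in range(seg_len):
        state = model(frame, state)
        frame = state                                  # condition the next step on the model's own output
        preds.append(frame)
    loss = nn.functional.mse_loss(torch.stack(preds, dim=1), reference[seg].unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()                                    # supervise this short segment, then keep streaming
    optimizer.step()
    state, frame = state.detach(), frame.detach()      # carry the rollout forward without backprop through history
    print(f"segment {seg}: loss {loss.item():.4f}")
```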

👉 More information
🗞 LongLive: Real-time Interactive Long Video Generation
🧠 ArXiv: https://arxiv.org/abs/2509.22622

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
