LoL Advances Hour-Long Video Generation, Resolving Sink-Collapse with RoPE Jitter

Scientists are tackling the persistent problem of coherence in long-form video generation, where current autoregressive models frequently falter due to error accumulation. Justin Cui from UCLA and ByteDance Seed, alongside Jie Wu and Ming Li, with contributions from Tao Yang, Xiaojie Li, and Rui Wang from ByteDance Seed, have identified a critical flaw, termed ‘sink-collapse’, that causes generated videos to repeatedly reset to a static frame. Their research demonstrates that this collapse stems from a conflict within the model’s architecture, specifically between Rotary Position Embedding (RoPE) and multi-head attention. Crucially, they present a novel, training-free technique employing ‘RoPE jitter’ to disrupt problematic patterns and maintain video quality over extended durations. This work represents a significant leap forward, achieving, to the best of their knowledge, the first demonstration of real-time, streaming video generation lasting up to 12 hours with minimal quality loss.

RoPE conflicts cause video ‘sink-collapse’ and visual artifacts

To combat sink-collapse, the study unveils a lightweight, training-free approach: multi-head RoPE jitter. By disrupting the synchronized focus on sink frames, the method successfully alleviates the repetitive behaviour without compromising the overall quality of the generated video. Experiments confirm that this jitter effectively suppresses sink-collapse, allowing for sustained coherence over extended durations. The research establishes a novel understanding of the interplay between RoPE, multi-head attention, and the stability of autoregressive video generation. As a compelling illustration of this robustness, they generated continuous videos lasting up to 12 hours, representing a significant leap beyond previous achievements in streaming video generation.
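To make this concrete, the sketch below (Python with PyTorch) shows one plausible way to realise multi-head RoPE jitter: each attention head receives a slightly shifted rotary base frequency, so the heads' phases no longer line up on the sink frames at the same positions. The function names, the per-head shift schedule, and the jitter_scale value are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles: angle[t, k] = t / base**(2k / dim)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)              # (T, dim // 2)

def jittered_rope_angles(positions, dim, num_heads, base=10000.0, jitter_scale=0.02):
    """Give each head a slightly shifted rotary base so head phases no longer
    align on the sink frames at the same positions (the per-head shift schedule
    and jitter_scale are illustrative assumptions, not the paper's values)."""
    per_head = []
    for h in range(num_heads):
        head_base = base * (1.0 + jitter_scale * h)               # deterministic per-head shift
        per_head.append(rope_angles(positions, dim, head_base))
    return torch.stack(per_head, dim=0)                           # (H, T, dim // 2)

# Example: 12 heads, 64-dim rotary sub-space, 300 latent frame positions
angles = jittered_rope_angles(torch.arange(300), dim=64, num_heads=12)
cos, sin = angles.cos(), angles.sin()   # applied per head when rotating queries/keys
```

Because a shift like this touches only the rotary angles, it can be dropped in at inference time without any retraining, which matches the training-free character of the method described above.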
The team integrated streaming RoPE generation, noise sampling, and a 3D causal VAE decoder at inference, enabling continuous video creation with sustained quality, all while utilising models of just 1.3 billion parameters and a KV cache. This advancement opens exciting possibilities for applications requiring extended, dynamically generated video content, such as immersive simulations, long-form storytelling, and real-time virtual environments. Furthermore, the analysis revealed that sink-collapse events coincide with local maxima in phase alignment around sink frames, indicating a multi-dimensional origin for the problem. The team’s investigation showed that, unlike in bidirectional models, sink-collapse in autoregressive settings doesn’t exhibit a simple periodic behaviour, necessitating a more nuanced solution. By shifting the frequencies of attention heads, the researchers effectively disrupted the conditions leading to sink-collapse, paving the way for indefinitely long, coherent video streams. This approach promises to unlock new creative avenues and practical applications for generative video technology.
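A rough picture of how these inference-time pieces could fit together is sketched below. The model and decoder interfaces (init_kv_cache, denoise_chunk, decode, latent_dim) are hypothetical placeholders invented for illustration, not the authors' actual API, but the structure (a rolling KV cache, fresh noise sampled per chunk, continuously advancing RoPE positions, and chunk-wise causal VAE decoding) follows the description above.

```python
import torch

@torch.no_grad()
def stream_generate(model, vae_decoder, prompt_emb, num_chunks, frames_per_chunk=3):
    """Hypothetical streaming loop: a rolling KV cache, fresh noise per chunk,
    continuously advancing RoPE positions, and chunk-wise causal 3D VAE
    decoding so frames can be shown while later chunks are still generated.
    The `model` and `vae_decoder` interfaces are placeholders, not a real API."""
    kv_cache = model.init_kv_cache()                 # assumed helper
    position = 0
    for _ in range(num_chunks):
        noise = torch.randn(1, frames_per_chunk, model.latent_dim)
        latents, kv_cache = model.denoise_chunk(     # assumed helper
            noise, prompt_emb, kv_cache=kv_cache, rope_start=position
        )
        position += frames_per_chunk
        yield vae_decoder.decode(latents)            # pixel frames for this chunk only
```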

Sink-collapse analysis in autoregressive video generation reveals critical origins

Researchers meticulously examined attention sink frame usage in existing autoregressive video generation models, specifically LongLive and Self-Forcing++, discovering that despite their differing training paradigms, both consistently exhibited sink-collapse, prompting a detailed investigation into its origins. The study pioneered an analysis of temporal dynamics, revealing that sink-collapse isn’t driven by a single periodicity, unlike observations in bidirectional models like RIFLEx; instead, collapse points correlate with local maxima when summing phase alignment around sink frames across all temporal dimensions. To address this, the team engineered a lightweight, training-free approach: multi-head RoPE jitter. Experiments on both models demonstrate the efficacy of the method, delivering substantial improvements in long-term coherence and successfully alleviating sink-collapse while preserving overall generation quality. The approach enables robust, extended video generation, showcasing a significant advancement in autoregressive modelling and opening new avenues for creating truly long-form visual content.
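The flavour of that analysis can be approximated in a few lines of NumPy: for each candidate frame, sum over all temporal RoPE frequencies the cosine of the phase difference to the sink frames, then look for local maxima of the resulting score. This is a simplified reconstruction based on the article's description; the dimension, base, and sink positions are illustrative choices, not values from the paper.

```python
import numpy as np

def phase_alignment_score(t, sink_positions, dim=64, base=10000.0):
    """Sum, over all temporal RoPE frequencies, the cosine of the phase
    difference between frame t and the sink frames (simplified metric)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return sum(np.cos(inv_freq * (t - t_sink)).sum() for t_sink in sink_positions)

# Score every candidate frame and locate local maxima, the positions the
# article associates with collapse under this simplified model.
scores = np.array([phase_alignment_score(t, sink_positions=[0, 1]) for t in range(400)])
peaks = np.where((scores[1:-1] > scores[:-2]) & (scores[1:-1] > scores[2:]))[0] + 1
```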

Sink-collapse identified and resolved in video generation

Experiments revealed that both LongLive and Self-Forcing++ exhibited sink-collapse at identical latent frame indices (132 and 201), irrespective of input noise or prompts. The team measured intra-head phase concentration, revealing that sink-collapse events coincide with local maxima in normalized L2 distance to sink frames, indicating a strong correlation between phase alignment and collapse points. Measurements further confirm that shifting the base frequencies of different attention heads effectively reduces the homogenization between heads, preventing the repetitive patterns. In addition, the implementation of streaming RoPE generation and noise sampling, combined with a 3D causal VAE decoder, enabled sustained, high-quality continuous video generation using models of only 1.3 billion parameters and a KV cache. This breakthrough delivers a solution to a significant challenge in autoregressive video generation, extending video length from minutes to indefinite streams without quality degradation, a substantial advancement for applications ranging from cinematic content creation to long-duration simulations. The work systematically analyses sink-collapse, reveals its origins in autoregressive long video generation, and offers a novel approach to extend streaming generation indefinitely.
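As a rough illustration of how per-head concentration on sink frames might be monitored during generation, the snippet below computes the fraction of each head's attention mass that lands on the earliest (sink) latent frames; a synchronized spike across all heads is the signature the article associates with collapse. This attention-mass proxy, the tensor shapes, and the 0.5 threshold are assumptions for illustration, not the paper's normalized-L2 phase-concentration metric.

```python
import torch

def sink_attention_fraction(attn_weights, num_sink_frames=1):
    """Fraction of each head's attention mass that falls on the earliest
    (sink) latent frames. attn_weights: (heads, query_frames, key_frames),
    with rows summing to 1. Shapes and threshold below are illustrative."""
    return attn_weights[..., :num_sink_frames].sum(dim=-1)      # (heads, query_frames)

# Random attention maps used purely to show the shapes involved
attn = torch.softmax(torch.randn(12, 8, 256), dim=-1)           # 12 heads
frac = sink_attention_fraction(attn)                            # (12, 8)
synchronized = (frac > 0.5).all(dim=0)   # steps where every head locks onto the sink
```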

RoPE Jitter Fixes Video Generation Collapse

To resolve this, the team proposed a training-free method employing multi-head RoPE jitter, which disrupts homogenization between attention heads and effectively mitigates long-horizon collapse. Extensive experimentation demonstrated that this approach successfully alleviates sink-collapse without compromising generation quality. The authors acknowledge that their evaluation relies on models of a specific size (1.3B parameters) and KV cache size (0.2), which may influence scalability to even larger video lengths or resolutions. Future research could explore the application of this jitter technique to other model architectures and investigate its effectiveness with varying model sizes and computational resources. This advancement is significant as it enables the creation of substantially longer and more coherent videos than previously possible, opening avenues for applications in continuous media, simulations, and creative content generation.

👉 More information
🗞 LoL: Longer than Longer, Scaling Video Generation to Hour
🧠 ArXiv: https://arxiv.org/abs/2601.16914

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
