HALO Distills Hybrid Linear Attention for Long Contexts with Under 0.01% of the Pre-Training Data

Researchers are tackling the challenge of efficiently processing extremely long sequences of data, a crucial capability for modern artificial intelligence applications. Yingfa Chen, Zhen Leng Thai, and Zihan Zhou of Tsinghua University and OpenBMB, together with Zhu Zhang, Xingyu Shen, Shuo Wang, and colleagues, present a new approach to hybrid attention mechanisms that combines the strengths of softmax attention and recurrent neural networks. Their work addresses two significant limitations in the field: the enormous computational cost of training such models from scratch, and the poor long-context performance of existing hybrid models. By introducing HALO, a distillation pipeline, and HypeNet, a novel hybrid architecture, they demonstrate a method for converting pre-trained models with just 2.3 billion tokens (a tiny fraction of the original pre-training data), achieving comparable performance alongside substantial gains in long-context efficiency.

Distilling Qwen3 into HypeNet via HALO offers promising efficiency gains

Scientists have developed a new pipeline, HALO (Hybrid Attention via Layer Optimization), to efficiently distill Transformer models into RNN-attention hybrid models, addressing a significant challenge in long-context modeling. The research team tackled the prohibitive cost of pre-training hybrid architectures, which combine softmax attention blocks and recurrent neural networks, from scratch by transferring knowledge from existing pre-trained models. This approach allows for the creation of hybrid models with performance comparable to their original Transformer counterparts, while simultaneously enhancing long-context performance and efficiency. The core of their work lies in converting the Qwen3 series into a new hybrid model, HypeNet, using just 2.3B tokens, less than 0.01% of the original pre-training data.

The study reveals a novel cross-architecture distillation procedure that selectively converts attention layers, ensuring optimal long-context performance, a critical area where hybrid models offer substantial inference speedups over traditional Transformer-based models. Researchers identified that existing transfer methods often require tens to hundreds of billions of tokens, remaining inaccessible to many academic teams, and frequently result in hybrid models with diminished long-context capabilities. To overcome these limitations, the team introduced HyPE (Hybrid Position Encoding), a position encoding scheme designed for strong length generalization within hybrid architectures. This scheme combines RoPE and NoPE, alongside an attention scaling mechanism, to improve performance across varying context lengths.
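As a rough illustration of the idea behind HyPE, the sketch below applies rotary encoding (RoPE) to some heads and no position encoding (NoPE) to others, combined with a log-length attention scale for length generalization. The function names and the specific scaling rule are assumptions chosen for illustration, not the paper's exact formulation.

```python
import math

def rope_rotate(vec, pos, theta=10000.0):
    """Rotary position encoding: rotate consecutive pairs of a vector
    of even length by a position-dependent angle."""
    out = []
    for i in range(0, len(vec), 2):
        angle = pos / (theta ** (i / len(vec)))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def hype_encode(q, pos, head_uses_rope, seq_len, train_len=2048):
    """HyPE-style encoding sketch: RoPE on some heads, NoPE (identity)
    on others, plus a log-length attention scale (an assumed variant of
    the paper's attention scaling mechanism)."""
    scale = max(1.0, math.log(seq_len) / math.log(train_len))
    enc = rope_rotate(q, pos) if head_uses_rope else list(q)
    return [x * scale for x in enc]
```

Since rotation preserves vector norms, the RoPE heads encode relative position without changing magnitudes, while the NoPE heads remain position-agnostic; the scale factor grows only logarithmically past the training length, which is one common way to keep attention logits well-behaved on longer sequences.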

Experiments demonstrate that HypeNet achieves a superior performance-throughput tradeoff compared to the Qwen3 series, as illustrated in Figure 1, which showcases improved efficiency and performance at a 128K context length with BFloat16 precision. The conversion process, facilitated by HALO, not only reduces the training data requirement dramatically but also produces a model that maintains comparable short-context performance while excelling at long-context tasks. Furthermore, the research establishes that HypeNet reduces time per output token by factors of 2.4× and 3.0× relative to Qwen3 at a 1M context length, where the original Qwen3 models encounter GPU memory limitations.

This work presents a series of architectural improvements, validated through ablation studies on models exceeding 1B parameters, culminating in HypeNet, a novel hybrid architecture. The team’s contributions include a distillation procedure requiring fewer than 3B tokens, the HyPE position encoding scheme, and a comprehensive set of architectural enhancements, collectively offering a pathway to more accessible and efficient long-context modeling. Table 1 highlights the advantages of HALO over existing attention-to-hybrid distillation methods, showcasing its significantly reduced token requirement of 2.3B compared to alternatives ranging from 7B to 400B tokens.

Transformer-to-RNN-attention hybrid distillation using HALO improves long-context efficiency

Scientists developed HALO, a novel cross-architecture distillation procedure, to convert pre-trained Transformer models into RNN-attention hybrid models, addressing the limitations of existing methods that require substantial training data, over 10 billion tokens, and suffer from poor long-context performance. The research team implemented HALO by first selecting attention layers to remain unconverted, a crucial step to preserve long-context capabilities, and then distilling the remaining layers into RNN blocks via parameter transfer and knowledge distillation. This process required only 2.3 billion tokens, representing less than 0.01% of the original pre-training data used for the Qwen3 series, significantly reducing the computational burden for academic researchers. To further enhance long-context performance, the study pioneered Hybrid Position Encoding (HyPE), a novel position encoding scheme specifically designed for hybrid architectures.
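The knowledge-distillation step can be sketched with a standard temperature-scaled KL objective between the teacher's and student's token distributions. This is a generic formulation for illustration; the paper's exact loss and parameter-transfer details may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Token-level KL(teacher || student) with temperature scaling,
    a standard knowledge-distillation objective. The T^2 factor keeps
    gradient magnitudes comparable across temperatures."""
    p = softmax([x / temperature for x in teacher_logits])
    q = softmax([x / temperature for x in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2
```

The loss is zero when the student matches the teacher exactly and strictly positive otherwise, so minimizing it pushes the distilled RNN blocks to reproduce the original attention layers' predictive distributions.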

Researchers engineered HyPE to exhibit strong length generalization, enabling the models to effectively process sequences beyond those encountered during training. Experiments employed a series of ablation studies on models exceeding 1 billion parameters to validate the efficacy of HyPE and other architectural improvements. The team meticulously evaluated the impact of each modification on both short-context and long-context tasks, ensuring a comprehensive understanding of their contributions. The conversion of the Qwen3 series into HypeNet involved a precise application of the HALO pipeline and the integration of HyPE and architectural refinements.

Performance was assessed by comparing HypeNet to the original Qwen3 Transformer models, focusing on both accuracy and efficiency. Measurements included throughput, quantified in tokens per second, and memory usage, reported in gigabytes, at a context length of 128K with BFloat16 precision. Results demonstrated that HypeNet achieved comparable performance to Qwen3 while exhibiting superior long-context capabilities and improved efficiency, as illustrated in Figure 1. Furthermore, the team measured the time per output token across varying context lengths, up to 1 million tokens, revealing that the Qwen3 model exhausted GPU memory at this length, whereas HypeNet maintained functionality. These measurements show a 2.4× to 3.0× speedup over the original Qwen3 models at extended context lengths, highlighting the potential of HALO and HypeNet to unlock more efficient long-context modeling.
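A time-per-output-token measurement of this kind can be taken with a simple decode-loop benchmark like the one below; `generate_step` is a hypothetical stand-in for a model's single-token decode call, not an API from the paper.

```python
import time

def time_per_output_token(generate_step, n_tokens=32):
    """Average wall-clock latency per decoded token for a callable
    `generate_step` that produces one token per invocation."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_step()
    return (time.perf_counter() - start) / n_tokens
```

In practice one would warm up the model first and average over several runs, since the first decode steps often include compilation or cache-allocation overhead.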

HypeNet distillation boosts long-context efficiency by selectively converting attention layers

Scientists have developed a new pipeline, HALO (Hybrid Attention via Layer Optimization), for efficiently distilling models into RNN-attention hybrid architectures. The research team successfully converted the Qwen3 series into a hybrid model, HypeNet, achieving performance comparable to the original models while demonstrating superior long-context performance and efficiency. This conversion was accomplished using only 2.3 billion tokens, less than 0.01% of the original pre-training data and far below what previous methods require. The breakthrough delivers a significant reduction in the computational cost of long-context modeling.

HypeNet incorporates a novel position encoding scheme, HyPE (Hybrid Position Encoding), which combines RoPE and NoPE with an attention scaling mechanism. Measurements confirm that HyPE achieves superior length generalization, a critical factor for handling extended sequences. The team also implemented a series of architectural improvements, validated through ablation experiments on models exceeding 1 billion parameters, further enhancing the performance-throughput tradeoff. These improvements collectively contribute to HypeNet’s ability to process longer contexts more effectively.

The team measured the efficiency of HALO against existing attention-to-hybrid distillation methods, demonstrating a substantial reduction in training token requirements. Results demonstrate that HALO requires fewer than 3 billion tokens, significantly less than the 7 billion to 400 billion tokens demanded by methods like Mamba-in-the-Llama, SMART, and Jet-Nemotron. Tests indicate that the novel cross-architecture distillation procedure improves model efficiency in long-context scenarios. Furthermore, scientists developed an efficient attention layer selection method to determine which attention layers to retain unconverted, ensuring optimal long-context performance. The combination of HALO, HyPE, and architectural improvements resulted in HypeNet models with a demonstrably better performance-throughput tradeoff, as illustrated in Figure 1 of the study. The research establishes a new benchmark for hybrid model creation, offering a pathway to more accessible and efficient long-context modeling solutions.
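A minimal sketch of the layer-selection idea: score each attention layer by some importance metric and keep only the top fraction unconverted, distilling the rest into RNN blocks. The importance scores and the `keep_ratio` parameter here are hypothetical; the paper's actual selection criterion is not detailed in this summary.

```python
def select_layers_to_keep(importance, keep_ratio=0.25):
    """Return the indices of the attention layers to leave unconverted,
    chosen as the top `keep_ratio` fraction by a (hypothetical)
    per-layer importance score."""
    k = max(1, round(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])  # ascending layer order for readability
```

For example, with per-layer scores `[0.1, 0.9, 0.3, 0.7]` and `keep_ratio=0.5`, layers 1 and 3 would be retained as softmax attention and the rest converted.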

HALO and HypeNet enhance long-context efficiency with a sparse set of retained attention layers

Scientists have developed a new pipeline, HALO (Hybrid Attention via Layer Optimization), for efficiently converting pre-trained Transformer models into RNN-attention hybrid architectures. The process requires significantly less training data (under 3 billion tokens) than previous methods, which demanded over 10 billion. The researchers also introduced HypeNet, a novel hybrid architecture incorporating a new position encoding scheme, HyPE, designed to improve performance on long-context tasks. Applying HALO and HypeNet to the Qwen3 series produced hybrid models that maintain performance comparable to the originals while demonstrating superior long-context performance and efficiency.

Specifically, HypeNet achieved up to a 3.0× decoding speedup and a 3.4× prefilling speedup at a 512K context length, and remained functional up to a 1M context length, where the original Qwen3-1.7B model ran out of GPU memory. This work offers a cost-effective approach to building and validating hybrid architectures for long-context language models, potentially enabling applications like long-horizon reasoning and agentic behaviours. The authors acknowledge that their conversion process, trained on the FineWeb-Edu corpus, may diminish instruction-following and alignment behaviours present in the original pre-trained models, a common limitation of existing distillation methods. Furthermore, the current conversion protocol is specifically designed for Transformer-based architectures, and its applicability to other model types requires further investigation. Future research could focus on efficiently recovering the capabilities of the base models after conversion, and on adapting the protocol to a wider range of architectures, although the majority of current large language models are Transformer-based.

👉 More information
🗞 Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
🧠 ArXiv: https://arxiv.org/abs/2601.22156

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
