Scientists are tackling a critical challenge in artificial intelligence: the limitations of scaling large language models (LLMs). Chen Chen and Lai Wei, both from ByteDance Seed, alongside their colleagues, demonstrate that simply making models wider or extending their context length is yielding diminishing returns, while increasing depth, though theoretically more powerful, has proven difficult to achieve reliably. Their research revisits the Post-LayerNorm (Post-LN) method, previously abandoned due to instability, and introduces ‘Keel’, a novel architecture incorporating a Highway-style connection to address gradient vanishing in deep networks. This innovation allows for stable training at depths exceeding 1000 layers, consistently outperforming Pre-LN and suggesting that Post-LN, when combined with Keel’s connection, offers a surprisingly simple yet effective pathway towards building truly deeply scalable LLMs, potentially even infinite-depth models.
Researchers at ByteDance have unveiled Keel, a novel Transformer architecture that enables stable training at extreme depths, exceeding 1000 layers, and unlocks superior expressivity compared to current methods. The team achieved this by revisiting the Post-LayerNorm (Post-LN) formulation, previously abandoned due to instability at scale, and identifying the root cause of its failure: the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. This work presents a fundamental shift in LLM architecture, offering a pathway beyond the diminishing returns of conventional scaling techniques.
The study reveals that the central problem with Post-LN stems from how residual and transformed activations are mixed before normalization, leading to unstable gradient signals. To resolve this, the researchers replaced the standard residual path with a Highway-style connection in their Keel architecture. This modification preserves gradient flow, preventing signal vanishing from the top layers to the bottom and allowing for stable training at unprecedented depths. Unlike previous attempts to revive Post-LN, Keel requires no specialized initialization or complex optimization tricks, streamlining the training process and making deep LLMs more accessible.
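The difference between the two update rules can be sketched in a few lines. The following is a hypothetical, heavily simplified scalar illustration of the contrast described above, not the paper's implementation: `post_ln_residual`, `post_ln_highway`, and the gate interface are names we introduce here, and the sublayer `F` stands in for attention or feed-forward blocks.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of scalars to zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_residual(x, F):
    """ResNet-style Post-LN: residual and transformed activations
    are summed, then normalized together."""
    return layer_norm([xi + fi for xi, fi in zip(x, F(x))])

def post_ln_highway(x, F, gate):
    """Highway-style Post-LN (Keel-like, per our reading): a gate
    g in (0, 1) blends the carried input with the transformed
    signal before normalization."""
    g = gate(x)  # a single scalar gate in this toy sketch
    return layer_norm([(1 - g) * xi + g * fi for xi, fi in zip(x, F(x))])

# Toy usage with a fixed sublayer and a constant gate of 0.5
F = lambda x: [0.5 * v for v in x]
out = post_ln_highway([1.0, 2.0, 3.0], F, gate=lambda x: 0.5)
```

The key structural point is that the Highway-style update keeps an explicit carry term `(1 - g) * x`, so part of the input always passes through the mix before normalization, rather than being summed indiscriminately with the transform.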
Experiments show Keel consistently improves perplexity and depth-scaling characteristics over Pre-LN, the currently dominant approach. This breakthrough is substantiated by empirical results demonstrating Keel’s robust training at depths exceeding 1000 layers. The research establishes that Keel maintains smooth convergence even with aggressive learning rates, specifically 4.5×10⁻³, while Pre-LN exhibits severe instability under the same conditions. Furthermore, Keel consistently outperforms Pre-LN across all depths, ranging from 64 to 1024 layers, as illustrated in Figure 1(c). The team’s analysis, grounded in formal gradient dynamics, proves that the Highway-style connection provides provable control of gradient magnitudes, enabling signals to propagate through depth without vanishing.
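Why gradient control matters at these depths can be seen with a toy scalar model (our own illustration, not the paper's formal analysis): if each layer scales the backward signal by a factor even slightly below 1, the product across a thousand layers decays exponentially, whereas a connection that keeps the per-layer factor pinned at 1 lets the signal survive intact.

```python
def gradient_through_depth(per_layer_factor, depth):
    """Multiply a backward signal by a fixed per-layer factor
    `depth` times, returning the surviving gradient magnitude."""
    g = 1.0
    for _ in range(depth):
        g *= per_layer_factor
    return g

# A factor of 0.99 per layer looks harmless, but over 1024 layers
# the gradient shrinks by roughly five orders of magnitude.
vanishing = gradient_through_depth(0.99, 1024)  # lossy, ResNet-style mixing
preserved = gradient_through_depth(1.0, 1024)   # carry-preserving, Highway-style
```

This is the qualitative regime the gradient-dynamics analysis addresses: the claimed benefit of the Highway-style connection is precisely that it keeps the effective per-layer factor under control rather than letting small losses compound.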
The impact of Keel extends beyond mere scalability; the study unveils significant gains in model expressiveness across various capabilities. Performance benchmarks demonstrate a +6.6% improvement in Multilingual Understanding, a +4.4% increase in General Knowledge & Commonsense, and a remarkable +16.5% boost in Math & Code, showcasing Keel’s ability to enhance performance in specialized domains. This research opens the possibility for future infinite-depth architectures, potentially unlocking qualitatively new behaviors in LLMs and establishing a simple, effective foundation for building deeply scalable models.
Keel Transformer Stabilizes 1000-Layer Deep Networks
Scientists are confronting limitations in large language model (LLM) scaling, observing diminishing returns from widening models and extending context length. Researchers in this work revisited the Post-LayerNorm (Post-LN) formulation, previously abandoned due to instability at scale, and identified the ResNet-style residual pathway as the primary cause of gradient vanishing in deep networks. To address this, the study pioneered Keel, a Post-LN Transformer that replaces the conventional residual path with a Highway-style connection, preserving gradient flow and preventing signal vanishing from top to bottom layers. The team engineered Keel to enable stable training at depths exceeding 1000 layers, a feat previously unattainable with standard architectures.
Experiments employed a rigorous training regime with a learning rate of 4.5×10⁻³ to demonstrate Keel’s robustness and convergence speed, as illustrated in Figure 1(a), which shows Keel maintaining smooth convergence while Pre-LN exhibits severe instability. This approach achieves stable optimization of ultra-deep networks without requiring specialized initialization or complex optimization tricks, unlike prior methods. Furthermore, the study assessed Keel’s expressiveness across multiple capability domains (Multilingual Understanding, General Knowledge & Commonsense, and Math & Code), revealing consistent improvements over Pre-LN, most notably a +16.5% gain in Math & Code performance (Figure 1(b)). The researchers measured average benchmark scores across these domains to quantify Keel’s enhanced capabilities. To demonstrate depth scaling, the team trained models with layer counts ranging from 64 to 1024 and observed that Keel consistently outperformed Pre-LN, achieving a 60.9% average benchmark score at 1024 layers compared to Pre-LN’s lower score (Figure 1(c)). This work indicates that Keel’s architectural improvements provide a simple and effective foundation for building deeply scalable LLMs, potentially paving the way for infinite-depth architectures.
Keel overcomes gradient vanishing in deep LLMs
Scientists have developed Keel, a novel Post-LayerNorm (Post-LN) architecture that enables stable training of large language models (LLMs) at depths exceeding 1000 layers. The research addresses the limitations of current LLM scaling, where widening models and extending context length yield diminishing returns, by focusing on depth scaling as a more promising path forward. Experiments revealed that the central failure mode of Post-LN stems from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks and hinders effective training. The team measured gradient dynamics and formally demonstrated that the ResNet-style residual path, not normalization itself, is the primary source of gradient vanishing.
To overcome this, Keel replaces the traditional residual path with a Highway-style connection, preserving gradient flow and preventing signal vanishing from top to bottom layers. Tests prove that this modification stabilizes Post-LN at scale without requiring specialized initialization or complex optimization tricks, a significant breakthrough in LLM architecture. Data shows Keel maintains smooth convergence at aggressive learning rates, unlike Pre-LN which exhibits severe instability under the same conditions. Results demonstrate that Keel consistently outperforms Pre-LN across all depths, ranging from 64 to 1024 layers.
Specifically, the model achieves a +16.5% performance increase in Math & Code capability domains compared to Pre-LN baselines. Measurements confirm that Keel’s architectural improvements enable stable optimization of ultra-deep networks with enhanced learning efficiency and model expressiveness. The breakthrough delivers a simple and effective foundation for building deeply scalable LLMs, potentially unlocking infinite-depth possibilities. Scientists recorded that Keel’s Highway-style gated connection dynamically balances carry and transform signals, regulating both forward and backward information flow. This allows for provable control of gradient magnitudes, enabling signals to propagate through depth without vanishing, a critical achievement for training extremely deep networks. The study establishes a practical framework for the next generation of LLM scaling, effectively addressing the training stability issues associated with traditional deep architectures and opening new avenues for expressivity per parameter.
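The gated carry/transform balance described above follows the classic Highway-network pattern, y = T(x)·H(x) + (1 − T(x))·x. As a hedged sketch of that pattern (our reading of the description, not Keel's released code; the scalar weights and the tanh transform are our assumptions), the key property is that the backward pass always retains a (1 − T) carry term:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_step(x, w_h, w_t, b_t):
    """Scalar highway unit: transform H(x) = tanh(w_h * x) and
    gate T(x) = sigmoid(w_t * x + b_t) blend into
    y = T * H(x) + (1 - T) * x."""
    t = sigmoid(w_t * x + b_t)
    h = math.tanh(w_h * x)
    return t * h + (1 - t) * x

def carry_fraction(x, w_t, b_t):
    """Fraction of the input carried through untransformed,
    i.e. the gate's (1 - T) term that also appears in the gradient."""
    return 1 - sigmoid(w_t * x + b_t)

# With a strongly negative gate bias the unit starts close to the
# identity map, a common trick for keeping early training stable.
y = highway_step(0.5, w_h=1.0, w_t=1.0, b_t=-4.0)
```

Because the gate regulates both forward mixing and the backward carry term, the gradient reaching the input is bounded away from zero whenever the carry fraction is, which is the intuition behind the "provable control of gradient magnitudes" claim.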
Keel unlocks stable training in ultra-deep LLMs
Scientists have demonstrated that increasing the depth of large language models (LLMs) offers a promising avenue for improving expressivity, a characteristic currently hampered by training instability at extreme depths. Researchers revisited the Post-LayerNorm (Post-LN) formulation, previously superseded by Pre-LN due to scaling issues, and identified the ResNet-style residual pathway as the primary cause of gradient vanishing in deep networks. To address this, they introduced Keel, a Post-LN Transformer incorporating a Highway-style connection to preserve gradient flow and enable stable training at depths exceeding 1000 layers. Keel consistently outperforms Pre-LN baselines in perplexity and depth-scaling, maintaining dominance even after fine-tuning on challenging reasoning benchmarks like BBH, MMLU-Pro, and CMMLU.
This architectural improvement translates directly to downstream tasks, allowing models to adapt to complex instructions without significant performance loss. The authors acknowledge that training instability isn’t solely driven by depth, and wider models may require additional stabilization mechanisms. Future work will investigate stability under width scaling and explore the effectiveness of Keel in low-data regimes, as substantial training data is currently needed for optimal performance. These findings suggest that depth, facilitated by innovations like Keel, represents a viable path towards building deeply scalable LLMs and potentially achieving infinite-depth architectures.
👉 More information
🗞 Post-LayerNorm Is Back: Stable, Expressive, and Deep
🧠 ArXiv: https://arxiv.org/abs/2601.19895
