The quest to understand how large language models learn and generalise remains a central challenge in artificial intelligence, despite the observed improvements with increased computing power. Chiwun Yang from Sun Yat-sen University and colleagues now present a unified theoretical framework that illuminates the learning process within transformer networks, moving beyond purely empirical observations. The researchers model transformer learning as a continuous system, allowing them to rigorously analyse how models improve during training on realistic data and to predict the relationship between computational resources and final performance. This work establishes how excess risk, a measure of learning error, changes as resources scale, revealing a distinct transition between rapid initial improvement and a slower, power-law decay, and it ultimately provides isolated scaling laws for model size, training time and dataset size.
The research investigates how the dynamics of optimisation connect to kernel behaviour. Departing from previous analyses built on simplified models, the team rigorously examines stochastic gradient descent (SGD) training for multi-layer transformers applied to sequence-to-sequence data, closely replicating real-world conditions. The analysis characterises the convergence of generalisation error towards the irreducible risk as computational resources scale with data volume, with particular focus on the optimisation process itself. The researchers establish a theoretical upper bound on excess risk and identify a distinct phase transition in performance: excess risk initially decreases exponentially with computational cost, but transitions to a power-law decay once a specific resource threshold is reached.
Scaling Laws for Machine Learning Training
Scientists have developed a theoretical understanding of how machine learning model performance changes as resources like data, model size, and computing power are varied. These scaling laws reveal that training performance isn’t a simple linear function of increased resources, but instead exhibits different regimes in which one resource becomes the limiting factor: a model can be compute-starved, data-limited, or model-limited. A constant, denoted ξ, captures the inherent difficulty of the learning task and governs how quickly performance improves with more resources, while the Lambert W function appears in the closed-form expressions relating these resources. The team presents theorems describing how generalization error scales with data size, model size, and compute under different conditions.
A central result divides the scaling trend into two stages: an initial compute-starved phase where error decreases exponentially with increasing compute, and a subsequent data-limited phase where error scales according to a more complex formula involving the Lambert W function. Further theorems outline how to optimize performance by adjusting data, model size, and compute, demonstrating that increasing the limiting resource yields the greatest improvement. These scaling laws provide guidance on allocating limited resources to achieve the best possible performance, predicting how performance will improve with increased resources, and identifying bottlenecks in the training process. Knowing which regime a model is in allows researchers to focus on the most impactful resource, and the analysis highlights the importance of balancing data, model size, and compute for efficient machine learning systems.
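The two-regime trend described above can be sketched numerically. The functional form below is only an illustrative stand-in, not the paper's actual bound: the threshold `C0`, rate `a`, and exponent `xi` are hypothetical parameters, and the Lambert W factor is included solely to echo the shape of the data-limited regime.

```python
import numpy as np
from scipy.special import lambertw


def excess_risk(compute, C0=1e3, a=1e-2, xi=0.5):
    """Illustrative two-regime excess-risk curve (hypothetical form).

    Compute-starved regime (compute < C0): exponential decay in compute.
    Data-limited regime (compute >= C0): power-law decay with a slowly
    varying Lambert-W correction, constructed to be continuous at C0.
    """
    c = np.asarray(compute, dtype=float)
    exp_phase = np.exp(-a * c)
    power_phase = (
        np.exp(-a * C0)
        * (C0 / c) ** xi
        * np.real(lambertw(np.e)) / np.real(lambertw(np.e * c / C0))
    )
    return np.where(c < C0, exp_phase, power_phase)
```

Plotting this curve on log-log axes shows the steep early drop flattening into a straight power-law slope past the threshold, which is the qualitative signature of the phase transition the theorems describe.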
Transformer Learning, Scaling Laws, and Optimisation Dynamics
Scientists have established a rigorous mathematical framework to understand how the performance of transformer-based language models improves with increased computational resources, moving beyond purely empirical observations. The work formalizes the learning process as an ordinary differential equation, then approximates it using kernel methods, allowing for a detailed analysis of stochastic gradient descent training. The analysis shows that generalization error converges to an irreducible minimum as computational resources scale with data, particularly during the optimization phase, and the team defines a distinct phase transition governing this process. This unified framework delivers isolated scaling laws for model size, training time, and dataset size, demonstrating how each variable independently governs the upper bounds of generalization performance.
The researchers established a theoretical upper bound on excess risk, characterized by this phase transition, and confirmed the stability of the training process through careful mathematical analysis. The bounds show that a model’s performance is strongly linked to the width of its layers and the radius of its weight updates, with larger layer widths and smaller update radii contributing to faster convergence. Further analysis establishes that the approximation error, which measures the model’s ability to represent the target function, is bounded by the inverse of the model size, given an arbitrary dataset size and infinite training time. The team’s findings provide a foundation for optimizing the design and training of future language models, offering quantifiable upper bounds on generalization error and highlighting the interplay between model size, dataset size, and training time.
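The inverse relationship between model size and approximation error can be illustrated with a toy experiment. This is a hedged sketch, not the paper's transformer setting: the random ReLU feature model, the widths, and the target function sin(x) are all assumptions chosen only to show representation error shrinking as width grows.

```python
import numpy as np


def approx_error(width, rng):
    """Train-set MSE of a width-`width` random-feature model fitting sin(x).

    A toy stand-in for "approximation error shrinks with model size";
    the feature family and target are illustrative assumptions.
    """
    x = np.linspace(0, 2 * np.pi, 200)
    y = np.sin(x)
    # Random ReLU features: each column is max(0, w_j * x + b_j).
    w = rng.normal(size=width)
    b = rng.uniform(-2 * np.pi, 2 * np.pi, size=width)
    phi = np.maximum(0.0, np.outer(x, w) + b)  # (200, width) feature matrix
    coef, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return float(np.mean((phi @ coef - y) ** 2))


rng = np.random.default_rng(0)
errors = {m: approx_error(m, rng) for m in (8, 64, 512)}
```

The wider models drive the fitting error sharply downward, mirroring the qualitative claim that a larger model can represent the target function more accurately.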
Scaling Laws and Generalization in Transformers
This research establishes a comprehensive theoretical framework for understanding the scaling laws observed in large language models, specifically examining the relationship between computational resources and model performance. By modelling the training process of transformer architectures as a continuous dynamical system, scientists demonstrate how generalization error converges as computing power and data scale, identifying two distinct phases of improvement: an initial exponential decay followed by a power-law decay. The team rigorously characterizes excess risk, establishing an upper bound influenced by both computational cost and data characteristics. Furthermore, this work clarifies the independent roles of model size, training time, and dataset size in determining performance limits, offering insights into how each factor contributes to overall model capability. The analysis reveals that simply increasing model size does not guarantee continued improvement: once the model grows far larger than the complexity of the data warrants, returns diminish. While the findings confirm the general trend of improved performance with increased resources, the authors acknowledge limitations related to dataset noise and model capacity, suggesting areas for future research to optimize resource allocation for large language model development.
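The regime-dependence described above suggests a simple decision rule for resource allocation. The reference scales below are entirely hypothetical; this is a toy classifier echoing the compute-starved / data-limited / model-limited distinction, not a formula from the paper.

```python
def limiting_resource(compute, tokens, params,
                      compute_ref=1e21, token_ref=1e12, param_ref=1e11):
    """Toy bottleneck classifier (all reference scales are hypothetical).

    Whichever resource sits furthest below its reference scale is the
    one whose increase should yield the largest improvement.
    """
    ratios = {
        "compute-starved": compute / compute_ref,
        "data-limited": tokens / token_ref,
        "model-limited": params / param_ref,
    }
    return min(ratios, key=ratios.get)
```

For instance, a run with abundant data and parameters but little compute would be classified as compute-starved, signalling that extra training FLOPs, not more data or a bigger model, would help most.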
👉 More information
🗞 Unifying Learning Dynamics and Generalization in Transformers Scaling Law
🧠 ArXiv: https://arxiv.org/abs/2512.22088
