Researchers are increasingly focused on improving the adaptability of large language models (LLMs) rather than solely optimising base performance. Tessa Han (Broad Institute, Schmidt Center), Sebastian Bordt (University of Tübingen, Tübingen AI Center), and Hanlin Zhang and Sham Kakade (Harvard University) demonstrate that weight decay, a key regularisation hyperparameter during pretraining, significantly enhances this adaptability, or ‘plasticity’. Their study reveals that models trained with higher weight decay values exhibit greater performance gains when fine-tuned for specific downstream tasks. This finding is significant because it exposes a counterintuitive trade-off, in which a model that performs worse after pretraining can ultimately outperform others following fine-tuning, and it underscores the need to consider metrics beyond simple cross-entropy loss when optimising LLM hyperparameters.
This work shifts the focus of language model development from minimising initial training loss to maximising a model’s ability to learn new tasks. It introduces the concept of ‘plasticity’, the ease with which a pretrained language model adapts during fine-tuning, and demonstrates a surprising link between weight decay and this adaptability. Through systematic experimentation, the study shows that increasing weight decay during pretraining can improve a model’s performance on downstream tasks even when it initially performs worse by standard metrics, challenging the conventional wisdom that treats pretraining loss as the sole indicator of model quality.

The research team investigated weight decay’s influence across two language model families, Llama-2 and OLMo-2, scaling up to 4 billion parameters. Pretraining was conducted under both compute-optimal and overtrained conditions, and fine-tuning was evaluated on six diverse Chain-of-Thought (CoT) tasks. Performance was assessed using a comprehensive suite of metrics covering both the correctness and the quality of generated solutions, offering a more holistic evaluation than cross-entropy loss alone.

The study further shows that larger weight decay values encourage linearly separable representations within the model, regularise attention matrices, and reduce overfitting on the training data, thereby sustaining plasticity. The findings suggest that standard hyperparameter choices may need re-evaluation to better account for downstream adaptability, potentially unlocking significant performance gains in large language models. This work establishes the importance of evaluating beyond cross-entropy loss when optimising hyperparameters and illuminates the complex role a single hyperparameter plays in shaping model behaviour.

A systematic investigation of weight decay during language model pretraining forms the core of this work, motivated by findings in vision models suggesting that regularisation enhances plasticity. The research focuses on the AdamW optimizer, a widely used algorithm for training large language models, and its weight decay hyperparameter, denoted λ. At every optimizer step t ≥ 1, AdamW performs a decoupled update: it first adjusts the model parameters using gradient information, then applies a separate weight decay step that penalises large weights (a minimal sketch of this update appears below).

To assess model plasticity, the study employed a fine-tuning paradigm in which pretrained models were adapted to various downstream tasks. Llama-2 and OLMo-2 models were pretrained on the FineWeb-Edu and OLMo-Mix-1124 datasets respectively, with model sizes ranging from 0.5 billion to 4 billion parameters. Experiments covered two training regimes, a 20 TPP (tokens per parameter) Chinchilla-optimal ratio and a 140 TPP overtrained ratio, yielding five models: Llama-2-0.5B-20x, Llama-2-1B-20x, Llama-2-4B-20x, OLMo-2-1B-20x, and OLMo-2-1B-140x.
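The following is a minimal NumPy sketch of the decoupled AdamW update described above, shown for illustration only; the function name `adamw_step` and the toy quadratic objective are assumptions, not code from the paper, and in practice the decay is simply set via the optimizer’s weight decay argument.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW step: an Adam gradient update followed by decoupled weight decay."""
    # Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 1: update parameters from gradient information.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # Step 2: decoupled weight decay, penalising large weights independently of the gradient.
    theta = theta - lr * weight_decay * theta
    return theta, m, v

# Toy usage for steps t >= 1, minimising a quadratic as a stand-in for a training loss.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    grad = 2 * theta  # gradient of ||theta||^2
    theta, m, v = adamw_step(theta, grad, m, v, t, weight_decay=1.0)
```

Larger values of `weight_decay` (λ) strengthen the second step, which is the knob the study sweeps during pretraining.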
At 20 TPP, Llama-2-0.5B-20x, Llama-2-1B-20x, and OLMo-2-1B-20x achieved their best pretraining validation loss with a weight decay of 0.5, while Llama-2-4B-20x reached its lowest loss at a weight decay of 1.0. Conversely, at 140 TPP, the OLMo-2-1B-140x model achieved its lowest validation loss with the default weight decay of 0.1, a departure from the trend observed at 20 TPP. Further analysis revealed that models pretrained with higher weight decay values exhibit improved plasticity, as shown by gains in downstream task performance. Fine-tuning experiments on six CoT tasks showed that models pretrained with weight decay above the standard default of 0.1 consistently outperformed their counterparts. Specifically, in the 20 TPP regime, an optimal weight decay of 1.0 was identified for Llama-2-0.5B-20x, Llama-2-1B-20x, Llama-2-4B-20x, and OLMo-2-1B-20x.
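The sketch below illustrates the shape of this evaluation protocol, under toy assumptions: copies of the same model are trained under different weight decay values, then each copy is fine-tuned on a second task and judged by its downstream result. The tiny two-layer network and random regression data are hypothetical stand-ins for the paper’s pretraining corpora and CoT tasks; only the `torch.optim.AdamW` call is a real API, and the specific step counts and learning rate are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Synthetic stand-ins for a pretraining corpus and a downstream fine-tuning task.
pretrain_x, pretrain_y = torch.randn(512, 32), torch.randn(512, 8)
finetune_x, finetune_y = torch.randn(128, 32), torch.randn(128, 8)

def train(x, y, model, weight_decay, steps, lr=1e-2):
    """Train with AdamW; decoupled weight decay is set via the weight_decay argument."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

results = {}
for wd in (0.1, 0.5, 1.0):  # the default value plus the larger values the study explores
    torch.manual_seed(42)   # identical initialisation so only the pretraining weight decay differs
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
    pretrain_loss = train(pretrain_x, pretrain_y, model, weight_decay=wd, steps=200)
    # Fine-tuning reuses the "pretrained" weights; plasticity is judged by the downstream result.
    finetune_loss = train(finetune_x, finetune_y, model, weight_decay=0.1, steps=50)
    results[wd] = (pretrain_loss, finetune_loss)
```

The point of the protocol, as in the paper, is that the run with the best pretraining loss need not be the run with the best post-fine-tuning performance.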
Scientists have long understood that simply scaling up large language models is not enough to guarantee improved performance on specific tasks. The relentless pursuit of bigger models, while yielding impressive results, often overlooks adaptability. This research shifts the focus from achieving the lowest possible initial loss during pretraining to understanding how easily a model can be refined for real-world applications. The finding that increased weight decay encourages ‘plasticity’ is particularly compelling, demonstrating its ability to shape the model’s internal representations, promoting linearity and simplifying attention mechanisms. Because a model trained with higher weight decay may initially perform worse, a more careful evaluation strategy is required, and future work must explore how these principles translate across different model architectures and training regimes. Ultimately, the challenge lies in developing a more holistic understanding of pretraining, one that prioritises not just initial performance but also the potential for future adaptation and responsible deployment.
👉 More information
🗞 Weight Decay Improves Language Model Plasticity
🧠 ArXiv: https://arxiv.org/abs/2602.11137
