Researchers Improve Language Model Training by Evolving Data Mixtures, Achieving Gains of Nearly 12 Per Cent

Researchers are increasingly focused on data mixing, establishing optimal ratios of data from diverse sources, as a critical element in training large language models. Mayee F. Chen of the Allen Institute for AI and Stanford University, Tyler Murray and David Heineman of the Allen Institute for AI, and colleagues at the University of Washington and Stanford University introduce Olmix, a framework designed to tackle key challenges encountered during practical language model development. The work addresses the poorly understood configuration space of data mixing methods through a comprehensive empirical study, identifying the most impactful design choices. It also handles evolving datasets, a common scenario in real-world development, by introducing ‘mixture reuse’, a technique that significantly reduces computational cost, achieving performance comparable to full recomputation with 74% less compute and an 11.6% improvement on downstream tasks over training without data mixing.

Scientists have developed Olmix, a framework designed to optimise data mixing in the face of two challenges: a lack of understanding of how to configure mixing methods, and the difficulty of efficiently updating mixtures as datasets evolve over the course of language model (LM) development. The research begins by acknowledging the poorly defined configuration space of current data mixing techniques, which often lack justification for their design choices and fail to account for practical constraints such as limited data availability.

Through a comprehensive empirical study, the researchers identified seven key design choices needed to implement an offline mixing schema, a three-step process that trains small proxy models, fits a regression model to their results, and proposes a final mixture. A central question was the minimum number of proxy runs, the ‘swarm size’, required to learn an effective mix; the study found that the required number of initial training runs scales linearly with the number of data domains. The choice of regression model, in turn, depends on the size of this initial training set, with a log-linear model adapted from previous work achieving the best overall downstream performance, particularly across varying swarm sizes.

To reflect the practical limitation of finite data availability, the researchers incorporated data constraints into the mixture optimisation problem. These constraints controlled sample repetition, preventing the performance degradation caused by excessive data reuse, and demonstrably shaped the proposed mix towards a more balanced and effective data distribution.
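
To make these steps concrete, the following is a minimal sketch of such an offline schema in Python, using synthetic placeholder data. It assumes one simple log-linear parameterisation, in which the logarithm of a proxy run's loss is modelled as a linear function of its mixture weights, and a crude Dirichlet candidate search with a per-domain repetition cap; the exact regression form, constraint handling, and search procedure in the paper may differ, and all names and numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: `d` data domains and a swarm of small proxy runs, each
# trained on a different mixture (rows sum to 1). The losses below are random
# placeholders standing in for the proxy models' observed evaluation losses.
d = 8                                    # number of data domains
swarm_size = 4 * d                       # swarm size scaling ~linearly with d (factor is illustrative)
mixtures = rng.dirichlet(np.ones(d), size=swarm_size)
losses = rng.uniform(2.0, 3.0, size=swarm_size)

# Fit a log-linear regression: log(loss) ~ b0 + sum_j w_j * p_j.
X = np.hstack([np.ones((swarm_size, 1)), mixtures])
coef, *_ = np.linalg.lstsq(X, np.log(losses), rcond=None)

def predicted_loss(p):
    """Predicted loss of a run trained on mixture p, under the fitted model."""
    return np.exp(coef[0] + coef[1:] @ p)

# Propose a final mixture: search candidate mixtures, keeping only those that
# respect a per-domain repetition cap, i.e. a domain's share of the target
# token budget may not exceed `max_epochs` passes over its available tokens.
available_tokens = rng.integers(5, 50, size=d) * 1e9   # tokens per domain (made up)
target_budget = 100e9                                  # total training tokens
max_epochs = 4
weight_cap = np.minimum(1.0, max_epochs * available_tokens / target_budget)

candidates = rng.dirichlet(np.ones(d), size=20_000)
feasible = candidates[(candidates <= weight_cap).all(axis=1)]
best = feasible[np.argmin([predicted_loss(p) for p in feasible])]
print("proposed mixture:", np.round(best, 3))
```

In a real pipeline the observed losses would come from the proxy training runs themselves, and the proposal step would typically optimise the fitted model directly rather than filtering randomly sampled candidates.
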
Beyond this initial configuration, Olmix tackles the challenge of evolving domain sets during LM development, a scenario largely ignored by existing work, in which data domains are frequently added, removed, or revised. The team introduced ‘mixture reuse’, a mechanism that leverages information from past mixtures, keeping the existing ratios of unchanged domains and recomputing ratios only for the domains affected by an update. The work lays out a performance-cost spectrum of recomputation strategies. FullMixtureReuse freezes the weights of unaffected domains and recomputes only those impacted by the update; when the optimal ratios change minimally and the coupling between reused and recomputed domains is low, it achieves performance comparable to full recomputation at a significantly reduced computational cost. PartialMixtureReuse offers a further refinement, selectively recomputing mixes for some unaffected domains to reduce coupling effects and narrow the remaining gap to full recomputation, albeit at a slight increase in cost. Tested over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matched the performance of fully recomputing the mixture after each update while requiring 74% less compute, and delivered an 11.6% improvement on downstream tasks compared to training without mixing.

When the ingredients of your training data change mid-recipe, how do you avoid starting from scratch? For years, developers have been forced to fully recompute data mixtures whenever datasets are updated, a computationally expensive and time-consuming process. Olmix offers a solution by intelligently reusing previously computed ratios and focusing only on the areas affected by the changes. The reported gains, a substantial reduction in compute while maintaining performance, are significant, particularly as language models become ever more data-hungry.

However, the real-world complexity of data drift is not fully captured by the controlled updates used in this study. Datasets evolve in unpredictable ways, and the effectiveness of mixture reuse may diminish when faced with more radical shifts in data distribution. The optimal balance between reusing old ratios and recomputing new ones also likely varies with the specific domains and the nature of the updates, and future work should explore adaptive strategies that adjust this balance automatically.

Ultimately, Olmix is not just about faster training; it is about building more robust and adaptable language models capable of learning continuously from a changing world. Empirical evaluation across 64 domains and 100 billion tokens demonstrated the effectiveness of combining the optimised configuration, OlmixBase, with the reuse mechanisms, establishing Olmix as a practical solution for evolving language model development.
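
To illustrate the reuse idea described above, here is a minimal sketch of a FullMixtureReuse-style update in Python. The function names, the renormalisation scheme, and the stubbed sub-optimiser are assumptions made for illustration, not the paper's implementation; in Olmix the recomputation over affected domains would itself run the offline proxy-and-regression pipeline.

```python
def full_mixture_reuse(old_mix, new_domains, affected, optimise_subset):
    """Minimal FullMixtureReuse-style update (all names are illustrative).

    old_mix         : dict mapping domain -> weight in the previous mixture
    new_domains     : list of domains in the updated domain set
    affected        : set of domains added or revised by the update
    optimise_subset : callable that runs the mixing pipeline (proxy runs +
                      regression) over only the affected domains and returns
                      weights summing to 1 over that subset
    """
    unchanged = [dom for dom in new_domains if dom in old_mix and dom not in affected]
    changed = [dom for dom in new_domains if dom not in unchanged]

    # Freeze the previous ratios of the unchanged domains.
    proposal = {dom: old_mix[dom] for dom in unchanged}
    frozen_mass = sum(proposal.values())

    if changed:
        # Re-optimise only the affected slice of the mixture; avoiding proxy
        # runs for unchanged domains is where the compute savings come from.
        sub_weights = optimise_subset(changed)
        proposal.update({dom: (1.0 - frozen_mass) * w
                         for dom, w in zip(changed, sub_weights)})
    else:
        # Pure removals: renormalise the frozen weights so they sum to 1.
        proposal = {dom: w / frozen_mass for dom, w in proposal.items()}
    return proposal


# Hypothetical usage: "code" was revised and "math" was added, while "web" and
# "books" are untouched and keep their old ratios; code and math split the
# probability mass that is not frozen (here 0.2, giving 0.1 each).
old_mix = {"web": 0.6, "books": 0.2, "code": 0.2}
print(full_mixture_reuse(
    old_mix,
    new_domains=["web", "books", "code", "math"],
    affected={"code", "math"},
    optimise_subset=lambda doms: [1.0 / len(doms)] * len(doms),  # stub optimiser
))
```

A PartialMixtureReuse-style variant, as described in the paper, would additionally move a few unaffected domains into the recomputed set when coupling between domains is suspected to be high; in this sketch that simply amounts to adding those domains to `affected`.
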

👉 More information
🗞 Olmix: A Framework for Data Mixing Throughout LM Development
🧠 ArXiv: https://arxiv.org/abs/2602.12237

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
