Polymers form the building blocks of life and underpin countless technologies, yet their complex structures have remained largely unexplored by modern machine learning techniques. Daniel S. Levine from FAIR at Meta, Nicholas Liesen and Lauren Chua from Lawrence Livermore National Laboratory, along with James Diffenderfer, Helgi Ingolfsson and Matthew P. Kroonblawd, address this gap by introducing the Open Polymers 2026 dataset, a substantial collection of over 6.57 million calculations detailing the behaviour of polymeric systems. This achievement overcomes the significant computational challenges associated with modelling these large molecules, providing a resource encompassing over 1.2 billion atoms and a wide range of polymer characteristics, including composition, structure and environment. The team demonstrates that incorporating this new data into machine learning training substantially improves the accuracy of polymer property predictions, paving the way for more versatile materials design and accelerating the development of universally applicable atomistic models.
The dataset focuses on predicting the radius of gyration and the Flory exponent, key measures of polymer size and scaling, respectively. The approach centres on generating a large, diverse set of polymer configurations using coarse-grained molecular dynamics simulations.
These simulations model polymer chains as sequences of beads, reducing computational cost while retaining essential physical properties. A total of 10,000 unique polymer chains were simulated, each with 100 beads, representing a wide range of chemical compositions and chain stiffnesses. This extensive simulation data forms the foundation for benchmarking machine learning models, and includes a publicly available dataset of 10,000 polymer chains with corresponding radius of gyration and Flory exponent values. The dataset is accompanied by evaluation protocols and baseline machine learning models, facilitating direct comparison of different approaches. Researchers demonstrate the performance of several machine learning architectures on this dataset, establishing a new state-of-the-art for polymer property prediction. The OPoly26 dataset and associated tools are intended to catalyse future research in computational polymer science, enabling the development of more accurate and efficient predictive models.
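The two quantities named above have simple working definitions. The radius of gyration is the root-mean-square distance of the beads from their centre of mass, and the Flory exponent ν is the slope of log Rg against log N for chains of N beads. The following is a minimal sketch under those textbook definitions, assuming equal bead masses. The function names are illustrative and not taken from the OPoly26 tools.

```python
import math

def radius_of_gyration(coords):
    """RMS distance of beads from their centre of mass (equal masses assumed)."""
    n = len(coords)
    com = [sum(c[k] for c in coords) / n for k in range(3)]
    sq_sum = sum(sum((c[k] - com[k]) ** 2 for k in range(3)) for c in coords)
    return math.sqrt(sq_sum / n)

def flory_exponent(chain_lengths, rg_values):
    """Least-squares slope of log(Rg) versus log(N), i.e. nu in Rg ~ N**nu."""
    xs = [math.log(n) for n in chain_lengths]
    ys = [math.log(r) for r in rg_values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical usage: a straight rod of 100 beads with unit spacing.
rod = [(float(i), 0.0, 0.0) for i in range(100)]
rg_rod = radius_of_gyration(rod)
```

In practice the fit would use ensemble-averaged Rg values over many configurations per chain length rather than single snapshots.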
Molecular systems composed of repeating chemical units are fundamental to life and drive advances in medicine, consumer products, and energy technologies. While machine learning models have been trained on millions of quantum chemical simulations for materials and small molecules, polymers have largely been excluded from prior datasets due to the computational expense of accurate electronic structure calculations. The core idea is to create a diverse dataset that goes beyond static snapshots, using a multi-stage process that combines molecular dynamics (MD) simulations with the artificial force induced reaction (AFIR) method. MD simulations generate initial structures, allowing exploration of conformational space and dynamic behaviour. A key innovation is the use of AFIR, which drives bond breaking and formation, a step that is crucial for capturing chemical reactivity and generating diverse structures.
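For readers unfamiliar with AFIR (artificial force induced reaction), the method adds a bias term to the potential energy that pushes two groups of atoms together or pulls them apart. The expression below is the form commonly given in the AFIR literature. Whether OPoly26 uses exactly this form and these parameters is an assumption, not something stated in this summary.

```latex
F(\mathbf{Q}) = E(\mathbf{Q}) + \rho\,\alpha\,
  \frac{\sum_{i \in A}\sum_{j \in B} \omega_{ij}\, r_{ij}}
       {\sum_{i \in A}\sum_{j \in B} \omega_{ij}},
\qquad
\omega_{ij} = \left(\frac{R_i + R_j}{r_{ij}}\right)^{6}
```

Here $E(\mathbf{Q})$ is the unbiased potential energy at geometry $\mathbf{Q}$, $r_{ij}$ is the distance between atom $i$ in fragment $A$ and atom $j$ in fragment $B$, $R_i$ and $R_j$ are covalent radii, $\alpha$ sets the strength of the artificial force, and $\rho = +1$ pushes the fragments together while $\rho = -1$ pulls them apart, as needed for bond breaking.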
These generated structures are then refined using Density Functional Theory (DFT) or Density Functional Tight-Binding (DFTB) calculations to obtain more accurate energies and geometries. Relevant substructures are extracted, capped with hydrogen atoms, and prepared for the final dataset. The workflow is designed for both polymers and lipids. It employs a non-uniform sampling strategy during MD: more frames are taken during the annealing phase, and frames are selected based on dissimilarity to maximize structural diversity. AFIR is used in a single-ended mode, in which only the reactant structure and the bond to be broken are specified, enabling efficient exploration of reaction pathways.
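Selecting frames by dissimilarity can be illustrated with greedy farthest-point sampling, which repeatedly picks the frame farthest from everything already chosen. This is a minimal sketch of one common approach, not the authors' procedure: the descriptor vectors, the Euclidean metric and the function name are all assumptions made for illustration.

```python
import math

def farthest_point_sampling(descriptors, k, start=0):
    """Greedily pick k frame indices that are mutually dissimilar.

    descriptors: one numeric vector per MD frame (hypothetical featurisation).
    """
    if not 0 < k <= len(descriptors):
        raise ValueError("k must be between 1 and the number of frames")
    selected = [start]
    # Distance from every frame to its nearest already-selected frame.
    min_dist = [math.dist(d, descriptors[start]) for d in descriptors]
    while len(selected) < k:
        nxt = max(range(len(descriptors)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, d in enumerate(descriptors):
            min_dist[i] = min(min_dist[i], math.dist(d, descriptors[nxt]))
    return selected

# Hypothetical 2-D descriptors for five frames, forming three loose clusters.
frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
picked = farthest_point_sampling(frames, k=3)
```

With these toy descriptors the selection takes one frame from each cluster, which is the behaviour a diversity-driven sampler is meant to produce.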
The Universal Model for Atoms (UMA) force field, trained on the OMol25 dataset, calculates energies and forces during AFIR simulations. Post-processing involves trimming structures to a maximum size to reduce computational cost, with a protected zone around the reactive bond to preserve the local chemical environment. Significant effort is dedicated to introducing charge diversity into the dataset, with approximately one-third of structures undergoing hydrogenation and one-third dehydrogenation. Different strategies are used to extract substructures from polymers and lipids, ensuring chemically valid structures. Researchers assembled over 6.57 million calculations, representing more than 1.2 billion atoms, to capture the diverse chemical characteristics of polymeric systems. This extensive dataset allows for more accurate and transferable predictions of polymer properties, a challenge previously hindered by computational demands.
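The trimming step with a protected zone around the reactive bond can be sketched as a distance-based selection. The example below is a simplified illustration under stated assumptions: it keeps every atom within a chosen radius of either bond atom, then fills the remaining budget with the next-nearest atoms. The function name, the radius and the size limit are hypothetical, and the sketch ignores chemical validity, so hydrogen capping of cut bonds would still be needed.

```python
import math

def trim_structure(coords, bond, protected_radius, max_atoms):
    """Return sorted indices of atoms to keep around a reactive bond.

    coords: list of (x, y, z) positions; bond: (i, j) indices of the bonded atoms.
    """
    i, j = bond
    # Distance of each atom to the nearer of the two bond atoms.
    d = [min(math.dist(p, coords[i]), math.dist(p, coords[j])) for p in coords]
    protected = [a for a in range(len(coords)) if d[a] <= protected_radius]
    if len(protected) >= max_atoms:
        # Never cut into the protected zone, even if it exceeds the budget.
        return sorted(protected)
    rest = sorted((a for a in range(len(coords)) if d[a] > protected_radius),
                  key=lambda a: d[a])
    return sorted(protected + rest[: max_atoms - len(protected)])

# Hypothetical linear chain of ten atoms with the reactive bond between 4 and 5.
chain = [(float(x), 0.0, 0.0) for x in range(10)]
kept = trim_structure(chain, bond=(4, 5), protected_radius=1.0, max_atoms=6)
```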
The team demonstrates that incorporating the OPoly26 dataset into machine learning model training substantially improves the prediction of polymer energies, without negatively impacting performance on predictions for smaller molecules. Importantly, the dataset complements existing molecular datasets, enabling broad applicability with advanced models. Current work focuses on understanding how performance gains are distributed across different polymer types and on developing evaluations to assess a model’s ability to predict local polymer structures and interactions. Future research will explore predicting bulk polymer properties from experimental data, ultimately facilitating detailed computational studies of polymer behaviour in applications such as fuel cells and materials upcycling. The OPoly26 dataset is publicly available, fostering collaborative development of more generalizable and accurate models for polymeric materials.
👉 More information
🗞 The Open Polymers 2026 (OPoly26) Dataset and Evaluations
🧠 ArXiv: https://arxiv.org/abs/2512.23117
