Building reliable machine learning models of materials behaviour currently faces a significant hurdle: existing atomic trajectory data is inconsistent and fragmented. Ali Ramlaoui, Martin Siron, Inel Djafar, and colleagues at Entalpic address this challenge with LeMat-Traj, a unified dataset containing over 120 million atomic configurations. This curated collection, built from large materials repositories, standardizes data formats and ensures consistent quality across commonly used computational methods, thereby lowering the barriers to developing accurate and transferable machine learning interatomic potentials. LeMat-Traj encompasses both stable and high-energy states, and models trained on it predict material behaviour with lower error, which could accelerate materials discovery. The team also provides LeMaterial-Fetcher, an open-source library designed to support the ongoing expansion and maintenance of large-scale materials datasets, so that such resources remain useful to the wider research community.
This large-scale resource aggregates data from prominent repositories including the Materials Project, Alexandria, and OQMD, significantly lowering the barrier to training accurate and transferable machine learning interatomic potentials. LeMat-Traj standardizes data representation and harmonizes results from calculations performed with widely used Density Functional Theory (DFT) functionals, ensuring consistency across diverse sources. The team also developed LeMaterial-Fetcher, a modular and extensible open-source library, to automate the process of fetching, transforming, validating, and harmonizing data from various sources, creating a reproducible framework for materials science datasets.
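The fetch, transform, validate, and harmonize stages described above can be pictured as a chain of small functions. The following is only a minimal sketch under stated assumptions: it is not the LeMaterial-Fetcher API, and the `Frame` schema, the record keys, and the choice to "harmonize" by grouping frames per DFT functional are all hypothetical, introduced here for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """One atomic configuration (hypothetical schema, not the dataset's own)."""
    source: str
    functional: str
    energy_ev: float
    forces: tuple  # one (fx, fy, fz) tuple per atom, in eV/Å


def fetch(raw_records):
    """Stand-in for downloading records from a repository."""
    yield from raw_records


def transform(records):
    """Map source-specific records onto the shared Frame schema."""
    for rec in records:
        yield Frame(
            source=rec["source"],
            functional=rec["functional"].upper(),  # normalise the label's case
            energy_ev=float(rec["energy"]),
            forces=tuple(tuple(float(c) for c in f) for f in rec["forces"]),
        )


def validate(frames):
    """Keep only frames with a finite energy and well-formed, finite forces."""
    for frame in frames:
        finite_energy = math.isfinite(frame.energy_ev)
        well_formed = len(frame.forces) > 0 and all(
            len(f) == 3 and all(math.isfinite(c) for c in f) for f in frame.forces
        )
        if finite_energy and well_formed:
            yield frame


def harmonize(frames):
    """Group frames by functional so each subset is internally consistent."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame.functional].append(frame)
    return dict(groups)


def run_pipeline(raw_records):
    return harmonize(validate(transform(fetch(raw_records))))


# Toy records (made up for illustration; the second has an invalid energy).
raw = [
    {"source": "A", "functional": "pbe", "energy": -3.2, "forces": [[0.0, 0.0, 0.1]]},
    {"source": "B", "functional": "PBE", "energy": float("nan"), "forces": [[0.0, 0.0, 0.0]]},
    {"source": "C", "functional": "r2scan", "energy": -1.0, "forces": [[0.1, 0.0, 0.0]]},
]
subsets = run_pipeline(raw)
print({name: len(frames) for name, frames in subsets.items()})
```

Writing each stage as a generator keeps the steps independent, so a new source would in principle only need its own `fetch` and `transform` while reusing the validation and harmonization logic.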
Experiments show that fine-tuning a MACE model on LeMat-Traj reduces force prediction errors on relaxation tasks by over 36% and improves performance on the Matbench Discovery stability benchmark by 10%, evidence that the dataset improves the accuracy of materials modeling. LeMat-Traj provides dense coverage of near-equilibrium, low-force states, a previously underrepresented regime that is crucial for accurate geometry optimization. The analysis also shows that LeMat-Traj samples the potential energy surface broadly along relaxation pathways, capturing both high-energy structures and states near equilibrium, which makes it a valuable resource for interatomic potential development, multi-fidelity learning, and self-supervised learning.
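To make a figure such as "a 36% reduction in force error" concrete, the sketch below computes a force mean absolute error and the relative reduction between two models. The per-component MAE is one common definition and is an assumption here, not necessarily the exact metric used in the paper; the function names and the toy force values are illustrative and are not results from the study.

```python
def force_mae(predicted, reference):
    """Mean absolute error over all Cartesian force components (eV/Å)."""
    diffs = [
        abs(p - r)
        for pred_atom, ref_atom in zip(predicted, reference)
        for p, r in zip(pred_atom, ref_atom)
    ]
    return sum(diffs) / len(diffs)


def relative_reduction(baseline_error, new_error):
    """Fraction by which the error dropped relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Toy two-atom example: reference DFT forces and two sets of predictions.
reference = [[0.0, 0.0, 0.10], [0.0, 0.0, -0.10]]
baseline = [[0.02, 0.0, 0.10], [0.0, 0.0, -0.14]]
fine_tuned = [[0.01, 0.0, 0.10], [0.0, 0.0, -0.12]]

mae_before = force_mae(baseline, reference)
mae_after = force_mae(fine_tuned, reference)
reduction = relative_reduction(mae_before, mae_after)
print(f"MAE before: {mae_before:.4f}, after: {mae_after:.4f}, reduction: {reduction:.0%}")
```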
LeMat-Traj combines data from the Materials Project, MPtrj, OQMD, and Alexandria into a diverse resource for training machine learning potentials, with a particular focus on the low-force regime, which matters for accurate relaxations and structural predictions. Data from each source is filtered and processed to ensure quality and consistency. The research also shows that different data generation strategies (molecular dynamics and active learning on one hand, geometry optimization on the other) capture distinct but complementary regions of the potential energy surface, so a single data source is often insufficient for a truly general-purpose potential.
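One way to picture the "low-force regime" is to sort configurations by the largest force acting on any atom. The sketch below does this for toy data; the 0.1 eV/Å threshold and the function names are assumptions chosen for illustration, not values or code taken from the paper.

```python
import math


def max_force_norm(forces):
    """Largest per-atom force magnitude in a configuration (eV/Å)."""
    return max(math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces)


def split_by_force_regime(frames, threshold=0.1):
    """Split configurations into low-force and high-force lists.

    `threshold` is a hypothetical cutoff in eV/Å.
    """
    low, high = [], []
    for forces in frames:
        (low if max_force_norm(forces) < threshold else high).append(forces)
    return low, high


# Toy configurations, each given as a list of per-atom force vectors.
frames = [
    [(0.0, 0.0, 0.01)],                    # nearly relaxed
    [(0.3, 0.4, 0.0)],                     # far from equilibrium
    [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0)],   # nearly relaxed, two atoms
]
low_force, high_force = split_by_force_regime(frames)
print(len(low_force), "low-force frames;", len(high_force), "high-force frames")
```

Under this picture, frames from the end of a geometry optimization fall into the low-force list, while molecular dynamics or active-learning samples would more often land in the high-force list, which is why the two kinds of data complement each other.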
Models trained on LeMat-Traj, especially when combined with high-force datasets, achieve superior performance in predicting energies, forces, and stresses. Principal Component Analysis reveals that LeMat-Traj and MPtrj have similar potential energy surface landscapes, with LeMat-Traj offering higher resolution due to its larger size. Models consistently perform best on their in-distribution test data, reinforcing the importance of diverse training data.
👉 More information
🗞 LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
🧠 ArXiv: https://arxiv.org/abs/2508.20875
