Researchers unlock machine learning potential with 120 million atomic configurations for materials discovery

The creation of reliable machine learning models for understanding materials behaviour currently faces a significant hurdle: existing atomic trajectory data is inconsistent and fragmented across sources. Ali Ramlaoui, Martin Siron, Inel Djafar, and colleagues at Entalpic address this challenge with LeMat-Traj, a comprehensive and unified dataset containing over 120 million atomic configurations. This curated collection, built from large materials repositories, standardizes data formats and ensures high quality across commonly used computational methods, thereby lowering the barriers to developing accurate and transferable machine learning interatomic potentials. LeMat-Traj, which encompasses both stable and high-energy states, demonstrably improves the performance of machine learning models, reducing errors in predicting material behaviour and paving the way for accelerated materials discovery. The team also provides LeMaterial-Fetcher, an open-source library designed to facilitate the ongoing expansion and maintenance of large-scale materials datasets, ensuring the dataset's long-term utility for the wider research community.

This large-scale resource aggregates data from prominent repositories including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation and harmonizes results from calculations performed with widely used Density Functional Theory (DFT) functionals, ensuring consistency across diverse sources. The accompanying LeMaterial-Fetcher, a modular and extensible open-source library, automates fetching, transforming, validating, and harmonizing data from each source, creating a reproducible framework for building and maintaining materials science datasets.
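The fetch–transform–validate–harmonize pipeline can be sketched in miniature. Everything here is illustrative: the `Configuration` schema, the stub `fetch`, and the validation rules are invented for this sketch and are not LeMaterial-Fetcher's actual API.

```python
from dataclasses import dataclass

# Hypothetical unified record: one atomic configuration from a source repository.
@dataclass
class Configuration:
    source: str            # e.g. "materials_project", "oqmd", "alexandria"
    functional: str        # DFT functional label, normalized to upper case
    energy_eV: float       # total energy
    n_atoms: int

def fetch(source):
    # Stand-in for a per-repository fetcher; a real fetcher would call the
    # repository's API and yield raw records in its native schema.
    yield {"source": source, "functional": "pbe", "energy": -8.4, "n_atoms": 2}

def transform(raw):
    # Map a raw record into the unified schema.
    return Configuration(raw["source"], raw["functional"].upper(),
                         raw["energy"], raw["n_atoms"])

def validate(cfg):
    # Basic sanity checks: positive atom count, physically plausible energy.
    return cfg.n_atoms > 0 and abs(cfg.energy_eV) < 1e6

def harmonize(sources):
    # Fetch -> transform -> validate for every source, yielding one
    # consistently formatted stream of configurations.
    for src in sources:
        for raw in fetch(src):
            cfg = transform(raw)
            if validate(cfg):
                yield cfg

unified = list(harmonize(["materials_project", "oqmd", "alexandria"]))
```

The design point this sketch captures is modularity: each repository only needs its own `fetch` and `transform`, while validation and the unified schema stay shared.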

Experiments demonstrate that fine-tuning a MACE model with LeMat-Traj reduces force prediction errors on relaxation tasks by over 36%, and improves performance on the Matbench Discovery stability benchmark by 10%, highlighting the dataset's ability to enhance the accuracy of materials modeling. LeMat-Traj provides uniquely dense coverage of near-equilibrium, low-force states, a previously underrepresented regime that is crucial for accurate geometry optimization. Analysis shows that LeMat-Traj comprehensively samples the potential energy surface along relaxation pathways, capturing both high-energy structures and states near equilibrium, making it a valuable resource for interatomic potential development, multi-fidelity learning, and self-supervised learning.
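The 36% figure refers to errors in predicted atomic forces. As a hedged illustration of the underlying metric (the function and toy numbers below are invented, not taken from the paper's evaluation code), force accuracy is typically reported as the mean absolute error over all force components:

```python
import numpy as np

def force_mae(pred_forces, ref_forces):
    """Mean absolute error over all force components (eV/Å), the standard
    accuracy metric for machine learning interatomic potentials."""
    pred = np.asarray(pred_forces, dtype=float)
    ref = np.asarray(ref_forces, dtype=float)
    return float(np.mean(np.abs(pred - ref)))

# Toy example: DFT reference forces vs. model predictions for a 2-atom cell.
ref = [[0.00, 0.00, 0.10], [0.00, 0.00, -0.10]]
pred = [[0.01, 0.00, 0.08], [0.00, -0.01, -0.09]]
mae = force_mae(pred, ref)
```

A "36% reduction" then simply means the fine-tuned model's MAE on relaxation-task forces is about two thirds of the baseline's.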

This dataset combines data from the Materials Project, MPtrj, OQMD, and Alexandria, providing a comprehensive and diverse resource for training machine learning potentials, with a particular focus on the low-force regime that matters for accurate relaxations and structural predictions. Rigorous filtering and processing of data from each source ensures quality and consistency. The research also demonstrates that different data generation strategies, such as molecular dynamics and active learning versus geometry optimization, capture distinct but complementary regions of the potential energy surface, so a single data source is often insufficient for a truly general-purpose potential.
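One way to see the low-force focus concretely is to partition configurations by their largest per-atom force. This is a hedged sketch only: the 0.05 eV/Å cutoff and the dictionary layout are illustrative choices, not the paper's actual filtering criteria.

```python
import numpy as np

def max_force(forces):
    # Largest per-atom force norm in a configuration (eV/Å).
    f = np.asarray(forces, dtype=float)
    return float(np.linalg.norm(f, axis=1).max())

def split_by_force(configs, threshold=0.05):
    # Partition configurations into low-force (near-equilibrium) and
    # high-force subsets. 0.05 eV/Å is an illustrative cutoff.
    low, high = [], []
    for c in configs:
        (low if max_force(c["forces"]) <= threshold else high).append(c)
    return low, high

configs = [
    {"id": "relaxed",   "forces": [[0.0, 0.0, 0.01], [0.0, 0.0, -0.01]]},
    {"id": "perturbed", "forces": [[0.3, 0.0, 0.0], [-0.3, 0.0, 0.0]]},
]
low, high = split_by_force(configs)
```

Relaxation trajectories end in the `low` bucket, which is exactly the regime molecular-dynamics-heavy datasets tend to undersample.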

Models trained on LeMat-Traj, especially when combined with high-force datasets, achieve superior performance in predicting energies, forces, and stresses. Principal Component Analysis reveals that LeMat-Traj and MPtrj have similar potential energy surface landscapes, with LeMat-Traj offering higher resolution due to its larger size. Models consistently perform best on their in-distribution test data, reinforcing the importance of diverse training data.
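A PCA comparison of this kind reduces high-dimensional per-structure descriptors to a few components so two datasets' coverage can be overlaid. The sketch below implements plain PCA via SVD on random stand-in descriptors; the descriptor vectors and dataset labels are invented for illustration, not the paper's actual featurization.

```python
import numpy as np

def pca_project(X, n_components=2):
    # Center the descriptor matrix and project onto its top principal
    # components via SVD (equivalent to PCA on the covariance matrix).
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy stand-ins for per-structure descriptor vectors from two datasets;
# a real analysis would use composition or local-environment descriptors.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(100, 5))   # e.g. a "LeMat-Traj"-like sample
B = rng.normal(0.1, 1.0, size=(100, 5))   # e.g. an "MPtrj"-like sample
Z = pca_project(np.vstack([A, B]))
```

Plotting the two halves of `Z` against each other would show overlapping clouds when the datasets sample similar regions of descriptor space, with the larger dataset filling that region more densely.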

👉 More information
🗞 LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
🧠 ArXiv: https://arxiv.org/abs/2508.20875
Dr. Donovan

Dr. Donovan is a futurist and technology writer covering the quantum revolution. Where classical computers manipulate bits that are either on or off, quantum machines exploit superposition and entanglement to process information in ways that classical physics cannot. Dr. Donovan tracks the full quantum landscape: fault-tolerant computing, photonic and superconducting architectures, post-quantum cryptography, and the geopolitical race between nations and corporations to achieve quantum advantage. The decisions being made now, in research labs and government offices around the world, will determine who controls the most powerful computers ever built.
