Machine Learning Gains Power from New Bonding Database of 13,000 Materials

Scientists increasingly recognise the crucial role of chemical bonding in determining materials properties, yet its systematic integration into machine learning workflows remains a significant challenge. Aakash Ashok Naik, Nidal Dhamrait, and Katharina Ueltzen, working across the Department of Materials Chemistry at the Federal Institute for Materials Research and Testing, Berlin, and the Institute of Condensed Matter Theory and Optics at Friedrich Schiller University Jena, have addressed this gap by extending their previously established Chemical Bonding Database for Solid-State Materials to encompass approximately 13,000 materials. This expanded database, developed with contributions from Christina Ertural, Philipp Benner, Gian-Marco Rignanese (Institute of Condensed Matter and Nanosciences (IMCN), UCLouvain), and Janine George, facilitates the derivation of novel chemical bonding descriptors and a rigorous assessment of their impact on machine learning model performance. Their research demonstrates that incorporating these descriptors substantially improves the prediction of elastic, vibrational, and thermodynamic properties and, importantly, enables the discovery of interpretable relationships between bonding characteristics and key material behaviours.

The chemical bond itself, however, remains largely unrepresented in machine learning models of materials. This research aimed to develop and implement bonding descriptors to improve the predictive power of machine learning for thermal conductivity. Researchers employed a combination of density functional theory calculations, symbolic regression, and machine learning techniques to quantify bonding characteristics. Specifically, they calculated phonon lifetimes and Grüneisen parameters for a dataset of 117 materials, then used symbolic regression to identify key relationships between these parameters and material properties. The resulting bonding descriptors were incorporated into machine learning models to predict thermal conductivity, and they significantly improved prediction accuracy compared with models based solely on compositional and structural features. The extended database is used to derive a new set of quantum-chemical bonding descriptors, and statistical significance tests are used to assess systematically how their inclusion influences the performance of machine-learning models that otherwise rely solely on structure- and composition-derived features. Machine learning algorithms are widely used in data-driven materials discovery, both in forward and inverse design approaches. In forward design, where the goal is to predict material properties from structure or composition, the performance of machine learning algorithms depends on how well the materials are represented by a set of features or descriptors. Numerous studies have demonstrated the utility of these descriptors for building machine learning models for screening materials for applications such as catalysis, ferroelectrics, and thermoelectrics. The concept of a chemical bond, while not a quantum-mechanical observable, has proven helpful in rationalising the behaviour of both organic and inorganic materials.
To characterise bonding within solid-state materials, several theoretical frameworks have been developed, including wavefunction and population analysis, real-space electron density analysis, and energy partitioning methods. These frameworks routinely inform the understanding or tuning of various material properties, and the quantities obtained through such bonding analysis can serve as valuable descriptors for data-driven material discovery. To date, descriptors derived from readily available geometric information have often been used to approximate bonding in materials. A large-scale comparison of the predictive power of geometric descriptors and quantum-chemical bonding descriptors in machine learning of properties of solid-state materials was lacking, prompting the current study. Recent developments in quantum-chemical bonding analysis workflows have enabled the high-throughput computation of quantum-chemical bonding descriptors derived from ab initio calculations. As part of this study, the database provides the following bonding indicators: ICOOP (integrated crystal orbital overlap population), which measures the number of electrons participating in a bond; ICOHP (integrated crystal orbital Hamilton population), which quantifies covalent bond strength; and ICOBI (integrated crystal orbital bond index), which indicates bond order. Mulliken and Löwdin atomic charges, projected densities of states (PDOS), and Madelung energies are also available. With this large database, the predictive value of descriptors derived from these bonding indicators was evaluated for data-driven materials discovery. Since these bonding indicators have not been comprehensively assessed in data-driven materials science, the study focused on statistical descriptors derived from COHPs, ICOHPs, and atomic charges, extracted using LobsterPy, a Python package for generating summaries of bonding characteristics and transforming data into machine-learning-ready formats. Because these descriptors quantify interatomic interactions, they are closely linked to vibrational properties, which are governed by the interatomic force constants.
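To make the idea of statistical descriptors derived from ICOHPs concrete, here is a minimal sketch in plain Python. It does not use LobsterPy or reproduce its output; the function name, the dictionary keys, and the example values are all hypothetical and chosen only for illustration.

```python
import math
import statistics


def icohp_statistics(icohps):
    """Summarise the ICOHP values of one structure as scalar descriptors.

    `icohps` is a list of ICOHP values in eV, one per bond. By the usual
    convention, more negative values indicate stronger covalent bonding.
    The key names below are illustrative, not LobsterPy's own.
    """
    if not icohps:
        raise ValueError("at least one ICOHP value is required")
    return {
        "icohp_sum": math.fsum(icohps),
        "icohp_mean": statistics.fmean(icohps),
        "icohp_min": min(icohps),  # most negative value: strongest bond
        "icohp_max": max(icohps),  # least negative value: weakest bond
        "icohp_std": statistics.pstdev(icohps),  # spread of bond strengths
    }


# Hypothetical ICOHP values (eV) for the bonds of a single structure.
example_icohps = [-4.0, -3.5, -1.0, -0.5]
for name, value in icohp_statistics(example_icohps).items():
    print(f"{name}: {value:.3f}")
```

Each structure, whatever its number of bonds, is reduced to a fixed-length set of numbers, which is what allows such descriptors to sit alongside composition- and structure-derived features in a single feature table.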
Accordingly, the target material properties considered included the maximum bond-projected force constant, the last peak of the phonon density of states (DOS), thermodynamic data (such as heat capacity, vibrational entropy, Helmholtz free energy, and internal energy), mean squared thermal displacements, elasticity data (bulk and shear modulus), and lattice thermal conductivity. The rationale for selecting these targets is that the bond-projected force constant measures bond stiffness, the last peak of the phonon DOS is indicative of the strongest bond, and thermodynamic properties, bulk/shear modulus, mean squared displacements, and lattice thermal conductivity are commonly correlated with chemical bonding. These bonding descriptors are orders of magnitude less expensive to compute than target properties such as phonons, elastic moduli, or thermal conductivities using standard density functional theory (DFT) simulations. The evaluation addressed three primary questions: (a) Are such quantum-chemical descriptors relevant for predicting these material properties? (b) Can such bonding descriptors be replaced by descriptors derived from compositional and structural data? (c) Do quantum-chemical bonding descriptors contain complementary information that enhances predictive accuracy beyond simple compositional or structural descriptors? The study began by testing the relevance of the quantum-chemical descriptors for learning these properties, then analysed the correlations between bonding descriptors and structural or compositional descriptors to assess whether the former offer any complementary information. The impact of including these descriptors on the predictive performance of machine learning models, specifically Random Forest and MODNet, was then assessed, with significance tests conducted on trained models to determine whether any observed improvement was statistically significant.
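The text above does not say which significance test was used. As an illustration only, the sketch below applies an exact paired sign-flip permutation test to per-fold errors of two models, one trained without and one with bonding descriptors. The function name and the error values are hypothetical stand-ins, not results from the study.

```python
from itertools import product


def paired_sign_flip_pvalue(errors_baseline, errors_with_bonding):
    """One-sided exact sign-flip test on paired per-fold errors.

    Null hypothesis: adding bonding descriptors does not change the error,
    so each paired difference is equally likely to be positive or negative.
    Returns the fraction of sign assignments whose summed difference is at
    least as large as the observed one (comparing sums is equivalent to
    comparing means, since the number of folds is fixed).
    """
    if len(errors_baseline) != len(errors_with_bonding):
        raise ValueError("both models need one error value per fold")
    diffs = [b - w for b, w in zip(errors_baseline, errors_with_bonding)]
    if not diffs:
        raise ValueError("at least one fold is required")
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = sum(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


# Hypothetical mean absolute errors from five cross-validation folds.
mae_structure_only = [0.50, 0.52, 0.48, 0.51, 0.49]
mae_with_bonding = [0.45, 0.46, 0.44, 0.47, 0.43]
print(paired_sign_flip_pvalue(mae_structure_only, mae_with_bonding))
```

With five folds the test enumerates 32 sign assignments, so the smallest p-value it can return is 1/32. Errors from overlapping cross-validation folds are also not fully independent, which is one reason a practical study may prefer a test designed for resampled estimates.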
Descriptor importance was extracted from the trained models using explainable artificial intelligence (XAI) techniques, specifically Shapley additive explanations (SHAP) and permutation feature importance (PFI), to identify the most influential descriptors. Where including bonding descriptors improved predictive performance, the symbolic regression method SISSO was applied to explore whether simple, intuitive expressions relating these descriptors to the property could be found. The descriptors were evaluated using multiple methods and across a variety of target material properties, with discussion focused on a small number of representative examples that best capture the study’s main conclusions. Full methodological details are provided in Section 3 of the paper, and the complete set of results for all methods and targets is available on the repository’s GitHub pages. To avoid overfitting, an initial descriptor selection was performed. The descriptors ranked in this step relate to bond strengths (bwdf sum, Icohp sum), effective coordination numbers (EIN ICOHP), geometry-based local environment descriptors, and element-based properties, such as atomic weight and covalent radius. A high ranking of these descriptors was observed only for the maximum bond-projected force constant (max pfc), the last phonon DOS peak (last ph peak), the average total/Peierls lattice thermal conductivity (log klat 300/log kp 300), the bulk/shear modulus (log k vrh/log g vrh), and the mean squared displacement (log msd). The statistical bonding descriptors ranked relatively low for thermodynamic properties, including Helmholtz energy (H 25, H 305, H 705), vibrational entropy (S 25, S 305, S 705), internal energy (U 25, U 305, U 705), and heat capacity (Cv 25, Cv 305, Cv 705), with the numbers denoting temperatures of 25 K, 305 K, and 705 K. Statistical assessment revealed significant improvements in machine learning model performance when these descriptors were incorporated alongside traditional structure- and composition-derived features.
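Permutation feature importance, one of the two XAI techniques named above, can be sketched in a few lines of plain Python: shuffle one descriptor column at a time and record how much the model's error grows. The toy model, data, and function names here are hypothetical; a real workflow would use a trained model and a library implementation.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_feature_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when a single feature column is shuffled.

    `predict` maps one feature row (a list of floats) to a prediction.
    A larger value means the model relies more on that feature.
    """
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [list(row[:j]) + [v] + list(row[j + 1:])
                      for row, v in zip(X, column)]
            permuted = mean_absolute_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: feature 0 stands in for a bonding descriptor that drives the
# target; feature 1 is an irrelevant descriptor the model ignores.
data_rng = random.Random(42)
X = [[data_rng.uniform(-5.0, 0.0), data_rng.uniform(0.0, 1.0)] for _ in range(50)]
y = [-2.0 * row[0] for row in X]


def toy_model(row):
    # Stand-in for a trained model; it depends only on feature 0.
    return -2.0 * row[0]


scores = permutation_feature_importance(toy_model, X, y)
print(scores)
```

Shuffling the feature the model depends on raises the error, while shuffling the ignored feature leaves it unchanged. scikit-learn provides the same technique as `sklearn.inspection.permutation_importance` for fitted estimators.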
Specifically, models predicting material properties demonstrated enhanced accuracy through the inclusion of bonding information. The study focused on descriptors extracted from Crystal Orbital Hamilton Populations (COHPs), integrated COHPs (ICOHPs), and atomic charges, utilising the LobsterPy package for automated data processing. Analysis of the maximum bond-projected force constant, a measure of bond stiffness, showed improved prediction accuracy when bonding descriptors were used. Furthermore, the last peak of the phonon density of states, a benchmark property used in Matbench for evaluating machine learning models, was predicted with greater precision. This peak, indicative of the strongest bond within a material, benefited from the inclusion of bonding information in the models. Evaluations of thermodynamic properties, including heat capacity, vibrational entropy, Helmholtz free energy, and internal energy, also showed positive correlations with the new descriptors. Elasticity data, specifically bulk and shear modulus, were predicted with increased accuracy, as were mean squared thermal displacements. Notably, lattice thermal conductivity, a property closely linked to interatomic interactions, also benefited from the inclusion of quantum-chemical bonding descriptors in the machine learning models. These results demonstrate the value of incorporating bonding information into data-driven materials discovery workflows. The relentless pursuit of materials discovery demands increasingly sophisticated predictive tools. For too long, machine learning in this field has relied on describing materials simply by what they are made of, overlooking the crucial question of how those atoms are connected. This work represents a step towards rectifying that imbalance, demonstrating the power of explicitly incorporating chemical bonding information into materials modelling. 
The creation of an extended chemical bonding database, coupled with the development of associated descriptors, offers a pathway beyond composition-based predictions. Improvements in model accuracy across a range of properties, from elasticity to thermal conductivity, are noteworthy, but the real potential lies in the ability to uncover underlying physical relationships. The use of symbolic regression to derive intuitive expressions for these properties suggests a future where machine learning doesn’t just predict that a material will behave a certain way, but also helps us understand why. However, this is not a complete solution. The database, while substantial, remains limited in scope, and the descriptors themselves are rooted in specific bonding models. Extending this work to encompass a wider range of materials and bonding environments will be crucial. Furthermore, the true test will be applying these descriptors to genuinely novel materials, pushing beyond the bounds of existing knowledge. The next phase will likely see integration with more advanced machine learning architectures and a focus on uncertainty quantification, acknowledging the inherent limitations of any predictive model. Ultimately, this approach promises to move materials science closer to a truly predictive, rather than purely empirical, discipline.

👉 More information
🗞 A critical assessment of bonding descriptors for predicting materials properties
🧠 ArXiv: https://arxiv.org/abs/2602.12109

Rohail T.
