The article discusses a machine learning force field for bio-macromolecular modeling built on quantum chemistry-calculated interaction energy datasets. The authors, ZhenXuan Fan and Sheng D. Chao, recalculated intermolecular interaction energies at the SAPT2 level of theory and then used the CLIFF machine learning scheme to construct a general-purpose force field for biomolecular dynamics simulations. The results show that the CLIFF scheme can reproduce a diverse range of dimeric interaction energy patterns with only a small training set, with errors well below the desired chemical accuracy of 1 kcal/mol.
Introduction to Machine Learning Force Field for Bio-Macromolecular Modeling
ZhenXuan Fan and Sheng D. Chao, from the Institute of Applied Mechanics and the Center for Quantum Science and Engineering at National Taiwan University, have developed a machine learning force field for bio-macromolecular modeling based on quantum chemistry-calculated interaction energy datasets. The researchers used symmetry-adapted perturbation theory (SAPT) to calculate intermolecular interaction energies. SAPT has been widely and successfully used in recent studies to model biomolecular segments and motifs.
Importance of Accurate Energy Data
Accurate energy data for noncovalent interactions are essential for constructing force fields for molecular dynamics simulations of bio-macromolecular systems. Two practical issues arise in constructing a reliable force field. The first is choosing a suitable quantum chemistry level of theory for calculating interaction energies. The second is choosing a suitable continuous energy function to model the quantum chemical energy data.
Use of SAPT Level of Theory
The researchers previously calculated intermolecular interaction energies at the SAPT0 level of theory and systematically organized them into the ab initio SOFG31 homodimer and SOFG31 heterodimer datasets. In this work, they recalculated these interaction energies at the more advanced SAPT2 level of theory with a wider series of basis sets. The purpose was to determine which SAPT level of theory is adequate for reaching chemical accuracy relative to the CCSD(T)/CBS benchmark.
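The comparison described above can be sketched as a mean absolute error (MAE) check against the benchmark. This is a minimal illustration, not the authors' workflow: the function names, the level labels, and all energy values are hypothetical placeholders, and only the 1 kcal/mol threshold comes from the text.

```python
from statistics import mean

CHEMICAL_ACCURACY = 1.0  # kcal/mol, the threshold quoted in the text


def mean_absolute_error(predicted, reference):
    """Mean absolute deviation between two equal-length lists of energies."""
    if len(predicted) != len(reference):
        raise ValueError("energy lists must have the same length")
    return mean(abs(p - r) for p, r in zip(predicted, reference))


def rank_levels(candidates, benchmark):
    """Return (level, MAE) pairs sorted from the smallest to the largest MAE."""
    scored = [
        (name, mean_absolute_error(energies, benchmark))
        for name, energies in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Placeholder numbers for illustration only; they are NOT values from the paper.
benchmark = [-1.0, -3.0, -6.0]  # stand-in for benchmark interaction energies
candidates = {
    "level_A": [-1.5, -4.5, -7.0],    # hypothetical cheaper level of theory
    "level_B": [-1.25, -3.25, -5.5],  # hypothetical more advanced level
}

for name, mae in rank_levels(candidates, benchmark):
    verdict = "within" if mae < CHEMICAL_ACCURACY else "outside"
    print(f"{name}: MAE = {mae:.3f} kcal/mol ({verdict} chemical accuracy)")
```

In practice, each candidate list would hold the interaction energies of the same dimers computed at one level of theory and basis set, so that the ranking reflects method accuracy rather than differences in the systems compared.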
Application of Machine Learning Technique
To utilize these energy datasets, the researchers employed a well-developed machine learning technique, the CLIFF scheme, to construct a general-purpose force field for biomolecular dynamics simulations. They used the SOFG31 dataset and the SOFG31 heterodimer dataset as the training and test sets, respectively. The results demonstrated that the CLIFF scheme can reproduce a diverse range of dimeric interaction energy patterns with only a small training set. The overall errors for each SAPT energy component, as well as for the SAPT total energy, are all well below the desired chemical accuracy of 1 kcal/mol.
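A per-component error check of this kind can be sketched as follows. The sketch assumes the standard SAPT decomposition of the total interaction energy into electrostatics, exchange, induction, and dispersion; the record layout, function names, and numbers are hypothetical and are not taken from the paper or from the CLIFF code.

```python
CHEMICAL_ACCURACY = 1.0  # kcal/mol, the threshold quoted in the text

# Standard SAPT energy components; the total is their sum.
COMPONENTS = ("electrostatics", "exchange", "induction", "dispersion")


def total_energy(record):
    """SAPT total interaction energy as the sum of its four components."""
    return sum(record[name] for name in COMPONENTS)


def component_mae(predicted, reference):
    """Mean absolute error per component and for the total, over all dimers."""
    n = len(reference)
    errors = {
        name: sum(abs(p[name] - r[name]) for p, r in zip(predicted, reference)) / n
        for name in COMPONENTS
    }
    errors["total"] = sum(
        abs(total_energy(p) - total_energy(r)) for p, r in zip(predicted, reference)
    ) / n
    return errors


# Placeholder test-set records (kcal/mol) for illustration only.
reference = [
    {"electrostatics": -4.0, "exchange": 5.0, "induction": -1.0, "dispersion": -2.0},
    {"electrostatics": -8.0, "exchange": 9.0, "induction": -2.0, "dispersion": -3.0},
]
predicted = [  # stand-in for force field predictions on the same dimers
    {"electrostatics": -4.5, "exchange": 5.25, "induction": -1.0, "dispersion": -2.25},
    {"electrostatics": -7.5, "exchange": 9.25, "induction": -2.0, "dispersion": -2.75},
]

errors = component_mae(predicted, reference)
for name, value in errors.items():
    status = "ok" if value < CHEMICAL_ACCURACY else "too large"
    print(f"{name:>14}: {value:.3f} kcal/mol ({status})")
```

Reporting the total alongside the components matters because component errors can partly cancel or add up, so a small error in each component does not by itself guarantee a small error in the total.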
Quantum Chemistry-Calculated Energy Data
In the past decade, there has been steady progress in using quantum chemistry-calculated energy data to build potential energy surfaces (PESs) for force field (FF) construction. It is now routine to employ highly correlated ab initio methods such as second-order Møller-Plesset perturbation theory (MP2) to obtain accurate energy data for small molecular dimers with fewer than about 50 atoms.
The SOFG31 Dataset
In their previous studies, the researchers calculated the bonding structures and interaction energies for 31 homodimers of small organic functional groups, dubbed the SOFG31 dataset, using the MP2, CCSD(T), and SAPT0 (the simplest SAPT) levels of theory. The SOFG31 dataset consists of 31 monomers across 8 common classes including 6 alkanes, 6 alkenes, 4 alkynes, 4 alcohols, 4 aldehydes, 3 ketones, 3 carboxylic acids, and 3 amides. Based on the SOFG31 dataset, they also performed a parallel series of calculations to obtain the bonding structures and interaction energies for heterodimers selected from combinations of the SOFG31 monomers. This dataset is henceforth named the SOFG31 heterodimer dataset.
The Role of Data Analysis Techniques in Force Field Modeling
The second issue in force field modeling is how to model the ab initio data with a suitable force function. This is where data analysis techniques become useful in molecular modeling. Fitting a force field to wide and diverse potential energy data, including both covalent and noncovalent interaction energies, usually involves a complicated procedure and relies on specialized nonlinear regression techniques.
The article, titled "A Machine Learning Force Field for Bio-Macromolecular Modeling Based on Quantum Chemistry-Calculated Interaction Energy Datasets," was published in the journal Bioengineering on January 3, 2024. The authors, Z. Fan and Sheng D. Chao, discuss the development of a machine learning force field for bio-macromolecular modeling based on datasets calculated through quantum chemistry.
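As a toy illustration of nonlinear regression on energy data, the sketch below fits a Lennard-Jones 12-6 pair potential to a synthetic distance-energy curve. This is a deliberately simple stand-in, not the CLIFF scheme and not the functional form used by the authors. The data are generated from assumed parameters (0.25 kcal/mol and 3.4 Å, chosen only for the demonstration), and all function names are hypothetical. The fit scans the nonlinear parameter sigma on a progressively refined grid and solves for the linear parameter epsilon in closed form at each grid point.

```python
def lj_shape(r, sigma):
    """Lennard-Jones 12-6 shape 4[(sigma/r)^12 - (sigma/r)^6], without epsilon."""
    x = (sigma / r) ** 6
    return 4.0 * (x * x - x)


def best_epsilon(distances, energies, sigma):
    """Closed-form least-squares well depth for a fixed sigma."""
    g = [lj_shape(r, sigma) for r in distances]
    return sum(e * gi for e, gi in zip(energies, g)) / sum(gi * gi for gi in g)


def sum_squared_error(distances, energies, sigma):
    """Residual sum of squares at a given sigma, using the best epsilon for it."""
    eps = best_epsilon(distances, energies, sigma)
    return sum(
        (e - eps * lj_shape(r, sigma)) ** 2 for r, e in zip(distances, energies)
    )


def fit_lennard_jones(distances, energies, sigma_lo, sigma_hi, points=51, rounds=4):
    """Fit (epsilon, sigma) by scanning sigma on a grid and refining around the best."""
    best_sigma = sigma_lo
    for _ in range(rounds):
        step = (sigma_hi - sigma_lo) / (points - 1)
        grid = [sigma_lo + i * step for i in range(points)]
        best_sigma = min(
            grid, key=lambda s: sum_squared_error(distances, energies, s)
        )
        # Shrink the search window to one grid step on either side of the best point.
        sigma_lo, sigma_hi = best_sigma - step, best_sigma + step
    return best_epsilon(distances, energies, best_sigma), best_sigma


# Synthetic, noise-free data from assumed parameters (illustration only).
TRUE_EPSILON, TRUE_SIGMA = 0.25, 3.4  # kcal/mol and angstrom, hypothetical
distances = [3.4, 3.6, 3.8, 4.0, 4.5, 5.0, 6.0]
energies = [TRUE_EPSILON * lj_shape(r, TRUE_SIGMA) for r in distances]

fitted_epsilon, fitted_sigma = fit_lennard_jones(distances, energies, 3.0, 4.0)
print(f"epsilon = {fitted_epsilon:.4f} kcal/mol, sigma = {fitted_sigma:.4f} angstrom")
```

Real force field fitting involves many more parameters, noisy reference data, and several coupled energy terms, which is why general-purpose nonlinear optimizers or machine learning models are typically used instead of a one-dimensional scan like this one.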
