The quest to predict material properties from molecular structure currently limits the pace of scientific innovation, as existing methods struggle with the vastness of possible chemical compounds. Alexius Wadell, Anoushka Bhutani, and colleagues from the University of Michigan and the University of Illinois at Urbana-Champaign address this challenge by developing MIST, a new family of molecular foundation models. Trained at an unprecedented scale on comprehensive molecular data, these models predict over 400 structure-property relationships with state-of-the-art accuracy across diverse fields including physiology and electrochemistry. Importantly, MIST not only solves practical problems like electrolyte screening and isotope prediction, but also reveals underlying scientific concepts through mechanistic interpretability, suggesting the models learn generalizable knowledge. The team has also developed methods that reduce the computational cost of building such models by an order of magnitude, a major advance in accelerating materials discovery and design.
Electrolyte Chemistry and Machine Learning Methods
This research encompasses a diverse collection of sources spanning chemistry, materials science, computer science, and statistics. The materials primarily focus on electrolytes, solvents, and battery materials, with frequent references to specific compounds like LiTFSI, LiPF6, and EC/DMC. A significant portion of the work details neural networks, optimization algorithms, and efficient computation techniques, alongside materials informatics and databases of materials properties. The collection includes practical implementations through Julia packages, such as those for statistical functions and diagnostic tools.
Adherence to standards for quantities and units, exemplified by BS EN 80000-13:2008, underscores the importance of accurate measurement. A substantial number of entries are preprints available on arXiv, indicating a focus on cutting-edge research. Accessible explanations of techniques are also provided through blog posts and online resources.
SMILES Tokenization and Large Molecular Models
Scientists have developed Molecular Insight SMILES Transformers, or MIST, a family of molecular foundation models designed to comprehensively explore chemical space. The research team engineered models with up to an order of magnitude more parameters and data than previous efforts, enabling a significantly broader understanding of molecular properties. Central to this work is Smirk, a novel tokenization scheme that captures detailed molecular information, including nuclear, electronic, and geometric features, providing a richer representation of molecular structure than existing methods. To demonstrate MIST’s capabilities, scientists fine-tuned the models to predict over 400 structure-property relationships, achieving performance matching or exceeding state-of-the-art benchmarks across diverse fields including physiology, electrochemistry, and quantum chemistry.
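The exact Smirk algorithm is detailed in the paper; as a rough illustration of what tokenizing SMILES at this granularity involves, the hypothetical sketch below (not the authors' implementation) splits a SMILES string into tokens and decomposes bracket atoms so that isotope (nuclear), chirality (geometric), and hydrogen-count or charge (electronic) annotations each become their own token rather than one opaque symbol:

```python
import re

# Illustrative only: decompose bracket atoms like [13CH3] into separate
# isotope / element / chirality / hydrogen / charge tokens.
BRACKET = re.compile(r"\[([0-9]*)([A-Za-z][a-z]?)(@{0,2})(H[0-9]*)?([+-][0-9]*)?\]")
TWO_LETTER = ("Cl", "Br")  # common two-letter organic-subset elements

def tokenize(smiles: str) -> list[str]:
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i] == "[":
            j = smiles.index("]", i)
            m = BRACKET.fullmatch(smiles[i : j + 1])
            if m:
                tokens.extend(t for t in m.groups() if t)
            else:
                tokens.append(smiles[i : j + 1])  # fall back to the whole atom
            i = j + 1
        elif smiles[i : i + 2] in TWO_LETTER:
            tokens.append(smiles[i : i + 2])
            i += 2
        else:
            tokens.append(smiles[i])  # single-letter atoms, bonds, rings, branches
            i += 1
    return tokens

print(tokenize("[13CH3]O[C@H](C)Cl"))
# → ['13', 'C', 'H3', 'O', 'C', '@', 'H', '(', 'C', ')', 'Cl']
```

A character-level tokenizer would split "Cl" into "C" and "l"; a word-level one would treat "[13CH3]" as a single unseen token. Keeping each chemical annotation as its own token is what lets a model share what it learns about, say, charge states across otherwise unrelated atoms.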
The study pioneered the application of these models to solve real-world problems, including multiobjective electrolyte solvent screening, olfactory perception mapping, isotope half-life prediction, and the prediction of properties for both binary and multi-component mixtures. Researchers further investigated the learned representations within MIST, revealing identifiable patterns and trends not explicitly present in the training data, suggesting the models learn generalizable scientific concepts during training. To address the computational demands of training such large models, the team formulated hyperparameter-penalized Bayesian neural scaling laws, reducing the computational cost of model development by an order of magnitude. The research team open-sourced all code, model weights, and training recipes, facilitating further exploration of chemical space and accelerating materials discovery, design, and optimization.
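The paper's hyperparameter-penalized Bayesian formulation is considerably richer than anything shown here, but the core idea of a neural scaling law is simple: fit a power law to the losses of a few cheap pilot runs and extrapolate before committing to an expensive run. A minimal sketch, with made-up numbers:

```python
import math

def fit_power_law(compute, loss):
    """Ordinary least squares on log(loss) = log(a) - b*log(compute),
    recovering the scaling-law form loss ≈ a * compute**(-b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
            / sum((x - xbar) ** 2 for x in xs)
    a = math.exp(ybar - slope * xbar)
    return a, -slope

# Synthetic pilot runs that follow loss = 10 * C**-0.3 exactly
compute = [1e15, 1e16, 1e17, 1e18]   # hypothetical FLOP budgets
loss = [10.0 * c ** -0.3 for c in compute]
a, b = fit_power_law(compute, loss)  # recovers a ≈ 10, b ≈ 0.3
```

The fitted curve then predicts the loss of a full-scale run before it is launched; the paper's contribution is to make such fits robust to hyperparameter choices, which is what enables the reported order-of-magnitude compute savings.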
Molecular Foundation Models Surpass Benchmarks
Scientists have developed MIST, a family of molecular foundation models ranging in size from a few million to 1.8 billion parameters, trained on datasets of up to 6 billion molecules. These models utilize a novel tokenization algorithm, Smirk, which captures nuclear, electronic, and geometric features, enabling a richer representation of molecular structure than existing methods. The team pretrained two primary encoders, MIST-28M (28 million parameters trained on 245 million molecules) and MIST-1.8B (1.8 billion parameters trained on 2 billion molecules), using the masked language modeling objective on synthetically accessible organic molecules. The research demonstrates that fine-tuned MIST models achieve state-of-the-art performance across numerous chemical machine learning benchmarks. These models were successfully fine-tuned on over 400 molecular and formulation property prediction tasks, requiring labelled datasets as small as 200 examples. Notably, the team developed and leveraged hyperparameter-penalized neural scaling laws, reducing the computational cost of model development by an order of magnitude and saving over 10 petaflop-days of compute.
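The masked language modeling objective mentioned above follows a standard recipe: hide a fraction of the input tokens and train the model to reconstruct them. A minimal sketch of the BERT-style masking step (illustrative, not necessarily MIST's exact masking scheme or vocabulary):

```python
import random

MASK_TOKEN = "[MASK]"
VOCAB = ["C", "O", "N", "(", ")", "=", "1"]  # toy vocabulary for illustration

def mask_tokens(tokens, p=0.15, seed=0):
    """Select each position with probability p. A selected token becomes
    [MASK] (80%), a random vocabulary token (10%), or stays unchanged (10%).
    `labels` holds the original token at selected positions and None
    elsewhere, so the loss is computed only where reconstruction is required."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_TOKEN
            elif roll < 0.9:
                inputs[i] = rng.choice(VOCAB)
    return inputs, labels
```

Because no property labels are needed, this objective can consume billions of unlabeled molecules; the chemistry-specific labels enter only later, during the small fine-tuning runs.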
The team showcased MIST’s capabilities through high-throughput screening for electrolyte design, a critical process for energy storage technologies. Using fine-tuned MIST-28M models, they built a pipeline to identify novel electrolyte solvent molecules with large electrochemical stability and wide operating liquid ranges. Electrochemical stability was predicted by fine-tuning MIST-28M on the QM9 dataset, while thermal stability was assessed with a characteristic-temperature dataset covering melting and boiling points. The models also demonstrate predictive power across diverse chemical spaces, including olfactory perception: they accurately categorize a wide range of scents, such as anise, apple, and cedar, demonstrating the versatility of the foundation model approach. These achievements represent a significant step towards accelerating materials discovery, design, and optimization using foundation models.
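Screening on two objectives at once (stability window and liquid range) is a multiobjective problem, so no single "best" molecule exists; a common approach, sketched below with hypothetical predicted values (the paper's actual pipeline may rank candidates differently), is to keep the Pareto front of non-dominated candidates:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective
    (higher is better here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored):
    """scored: list of (molecule, objectives) pairs, e.g. objectives =
    (predicted stability window in V, predicted liquid range in K).
    Returns the non-dominated candidates."""
    return [(mol, obj) for mol, obj in scored
            if not any(dominates(other, obj) for _, other in scored)]

candidates = [
    ("solvent_A", (5.0, 110.0)),  # hypothetical model predictions
    ("solvent_B", (4.2, 150.0)),
    ("solvent_C", (3.9, 100.0)),  # dominated by both A and B
]
best = pareto_front(candidates)   # keeps solvent_A and solvent_B
```

Cheap fine-tuned predictors make this kind of exhaustive filter feasible over millions of candidate molecules, with only the surviving front passed on to expensive simulation or synthesis.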
Molecular Property Prediction via Foundation Models
The development of MIST represents a significant advance in the application of foundation models to materials science, demonstrating the potential to accurately predict a wide range of molecular properties from structural information. Researchers successfully created a family of molecular foundation models, substantially larger and trained on more data than previous efforts, achieving state-of-the-art performance across diverse benchmarks including those related to physiology and electrochemistry. These models were effectively fine-tuned to predict over 400 structure-property relationships, and successfully applied to practical problems such as identifying optimal electrolyte solvents, mapping olfactory perception, and predicting isotope half-lives. Importantly, probing the MIST models revealed that they learn generalizable scientific concepts, identifying patterns not explicitly present in the training data, suggesting the models are not simply memorizing data but developing an understanding of underlying principles. The team also developed new Bayesian scaling laws and methods to reduce the computational cost of model development, making this approach more accessible.
👉 More information
🗞 Foundation Models for Discovery and Exploration in Chemical Space
🧠 ArXiv: https://arxiv.org/abs/2510.18900
