Researchers are tackling the challenge of discovering new materials with specific properties using artificial intelligence. Anand Babu, Rogério Almeida Gouvêa, and Pierre Vandergheynst, together with Gian-Marco Rignanese and colleagues from the Université Catholique de Louvain and EPFL, present MEIDNet, a multimodal generative AI framework for inverse materials design. The work is significant because it accelerates materials discovery by efficiently exploring the vast chemical-structural landscape and identifying compounds that meet predefined criteria. MEIDNet achieves this through a combination of contrastive and cross-modal learning, demonstrating a roughly 60-fold increase in learning efficiency and successfully generating promising low-bandgap perovskite structures with a 13.6% stable, unique, and novel (SUN) rate, validated by first-principles calculations.
MEIDNet learns materials structure and properties
By combining generative inverse design with multimodal learning, the approach accelerates the exploration of chemical-structural space and facilitates the discovery of materials that satisfy predefined property targets. By fusing three modalities through cross-modal learning, MEIDNet achieves strong latent-space alignment, with a cosine similarity of ≈ 0.96. The search for materials with desired properties is crucial for many applications, including energy storage, electronics, optoelectronics, and biomedical devices, yet conventional trial-and-error approaches are resource-intensive and limited in scope.
AI-enabled computational inverse design provides an efficient way to find candidates that satisfy predefined functional targets. It exploits learned structure-property relationships to efficiently navigate complex chemical and structural landscapes. This approach significantly accelerates discovery cycles, guiding experimental efforts toward more targeted explorations. Generative AI models, including variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models, have demonstrated promising performance in the design and discovery of new materials. However, most frameworks rely on a single modality, which limits their effectiveness in fully capturing the complex interplay among multiple property dimensions.
To address these limitations, multimodal machine learning (ML) has gained traction. By incorporating diverse sources of information, such as structural, electronic, mechanical, and thermodynamic properties, it facilitates the creation of a robust chemical latent space through shared learning. Recent efforts have broadened the reach of multimodal frameworks by incorporating techniques such as contrastive learning, cross-modal attention mechanisms, and constraint-driven materials exploration. For instance, the multimodal foundation model (MultiMat) integrates crystal structures, density of states (DOS), charge densities, and textual descriptions to uncover materials through latent-space fusion.
The composition-structure bimodal network (COSNet) improves the prediction accuracy for experimentally measured composition and structural data. SCIGEN integrates geometric lattice constraints into a diffusion-based crystal generator to discover stable motif-guided quantum materials and validates a large subset via prescreening and DFT. Similarly, denoising diffusion techniques coupled with cross-modal contrastive learning have enabled the guided discovery of chemical compositions and crystal structures from textual prompts. These approaches, however, require extensive training and tighter alignment across modalities. Multimodality is clearly the future of materials science, yet only a handful of studies have addressed this challenge.
They use the Perovskite-5 dataset (~19k five-atom ABX3 structures) as a benchmark for their multimodal model. It is simple, widely used, and includes technologically relevant classes of materials, such as photovoltaics, ferroelectrics, and high-κ dielectrics. The cosine similarity and the L2 distance between modality embeddings are ~0.96 and ~0.24, respectively, indicating strong alignment. As a demonstration, 140 perovskite crystal structures are generated targeting thermodynamically stable materials in the low-bandgap range (0.8 to 1.5 eV). The predicted bandgaps for the generated crystals closely match those determined with the crystal graph convolutional neural network (CGCNN) for single-property prediction (MAE ≈ 0.02 eV).
However, subsequent ab initio calculations yield somewhat lower values than predicted. Out of the 140 generated perovskites, 19 are found to be stable, unique, and novel, giving a calculated SUN rate of ~13.6% without any further filtering, which is state of the art for multimodal models. The crystal encoder transforms 3D crystal structures into an embedded latent representation using an EGNN for the structural encodings. The EGNN is implemented via message passing, feature aggregation/update, and coordinate update, and is equivariant to translations, rotations/reflections, and node permutations:

(i) m_ij = φ_e(h_i^(l), h_j^(l), ‖x_i^(l) − x_j^(l)‖², a_ij), with m_i = Σ_{j≠i} m_ij;
(ii) h_i^(l+1) = φ_h(h_i^(l), m_i);
(iii) x_i^(l+1) = x_i^(l) + C · Σ_{j≠i} (x_i^(l) − x_j^(l)) · φ_x(m_ij);
(iv) z_s = (1/N) Σ_{i=1}^{N} h_i^(L).
Here x_i^(l) ∈ R^n are the node coordinates, h_i^(l) ∈ R^d the node features, and a_ij the edge attributes; m_ij is the message (edge embedding) sent from node j to node i; φ_e, φ_x, and φ_h are multilayer perceptrons (MLPs); and C is a normalization constant. Eq. (iv) is the global pooling operation, where h_i^(L) are the node features after the final EGNN layer L, N is the number of atoms in the unit cell, and z_s is the resulting structural latent embedding. The reconstruction fidelity of the autoencoder increases with the latent-space dimensionality, but so does the computational cost. To assess this trade-off, they consider three latent dimensions, namely 64, 128, and 256, using the same number of training epochs on the Perovskite-5 dataset as a benchmark.
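The update rules (i)-(iv) can be sketched in plain NumPy. This is a minimal single-layer illustration with random weights on a fully connected atom graph; the edge attributes a_ij are omitted and the tiny one-layer networks merely stand in for the learned φ_e, φ_h, φ_x, so it is a conceptual sketch rather than the authors' implementation:

```python
import numpy as np

def mlp(w, b):
    """One-layer net with tanh, a stand-in for the learned MLPs."""
    return lambda z: np.tanh(z @ w + b)

def egnn_layer(h, x, phi_e, phi_h, phi_x, C=1.0):
    """One EGNN layer implementing Eqs. (i)-(iv); a_ij omitted for brevity."""
    n = len(h)
    h_new = np.empty_like(h)
    x_new = np.empty_like(x)
    for i in range(n):
        messages = []
        shift = np.zeros_like(x[i])
        for j in range(n):
            if j == i:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)                      # ||x_i - x_j||^2
            m_ij = phi_e(np.concatenate([h[i], h[j], [d2]]))     # Eq. (i)
            messages.append(m_ij)
            shift += (x[i] - x[j]) * phi_x(m_ij)                 # Eq. (iii) summand
        m_i = np.sum(messages, axis=0)                           # aggregate messages
        h_new[i] = phi_h(np.concatenate([h[i], m_i]))            # Eq. (ii)
        x_new[i] = x[i] + C * shift                              # Eq. (iii)
    z_s = h_new.mean(axis=0)                                     # Eq. (iv): mean pool
    return h_new, x_new, z_s

# Demo: 5 atoms, 4-d node features, 3-d coordinates, random weights.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
x = rng.normal(size=(5, 3))
phi_e = mlp(rng.normal(size=(9, 4)) * 0.1, np.zeros(4))   # in: h_i, h_j, d2
phi_h = mlp(rng.normal(size=(8, 4)) * 0.1, np.zeros(4))   # in: h_i, m_i
phi_x = mlp(rng.normal(size=(4, 1)) * 0.1, np.zeros(1))   # scalar coordinate weight
h1, x1, z_s = egnn_layer(h, x, phi_e, phi_h, phi_x)
```

Because the coordinate update only uses relative positions and distances, translating all atoms leaves the features and the pooled embedding z_s unchanged, which is the equivariance property the text describes.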
The 128‐dimensional latent space emerges as the sweet spot, yielding a higher structure‐matching (SM) rate at roughly the same computational cost. Therefore, they use a 128‐dimensional equivariant crystal autoencoder for multimodal alignment. To validate the generalizability of their EGNN encoder, they also determine the reconstruction fidelity on two other diverse datasets: MP-20 and Carbon-24. Thanks to the EGNN integrated autoencoder architecture, MEIDNet outperforms the unimodal Fourier transformed crystal properties (FTCP) and crystal diffusion variational autoencoder (CDVAE) for all three datasets in the SM rate (Table 1).
Table 1 compares the reconstruction performance of MEIDNet, FTCP, and CDVAE for the Perovskite-5, MP-20, and Carbon-24 datasets. In early fusion, modality-specific features are merged at the input feature level, and a shared network learns a joint representation. In late fusion, each modality is encoded separately, and the resulting embeddings are combined at the alignment stage. Contrastive learning unifies structural and property encodings in a joint latent space, which facilitates interactions between distinct modalities and optimizes the alignment of their information. The alignment between modalities is achieved via contrastive training with the InfoNCE loss

L_InfoNCE = −(1/B) Σ_{k=1}^{B} log [ exp(sim(z_s^(k), z_p^(k))/τ) / Σ_{l=1}^{B} exp(sim(z_s^(k), z_p^(l))/τ) ],

where B is the batch size, and the indices (k) and (l) denote samples within the mini-batch. z_s^(k) and z_p^(k) are the aligned structural and property embeddings for the k-th crystal, and sim(u, v) = uᵀv/(‖u‖‖v‖) denotes cosine similarity. τ is the temperature hyperparameter, and the denominator sums over all B samples in the batch to normalize the probability.
They implement an inverse design pipeline for target-led navigation in the aligned latent space. An iterative optimization is performed until the predicted properties converge to the targeted values for the generated crystal structure. Thus, MEIDNet offers a more compact latent space than diffusion models, facilitating interpretability and navigation, and it scales to numerous modalities toward a unified latent representation.
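The iterative target-led optimization can be sketched as gradient descent on the property-prediction error in latent space. The names `predict` and `grad`, and the linear property head used here, are hypothetical stand-ins for illustration; the paper's actual property head and optimizer are not reproduced:

```python
import numpy as np

def optimize_latent(predict, grad, z0, target, lr=0.2, steps=5000, tol=1e-12):
    """Nudge a latent vector until predicted properties reach the target
    (gradient descent on 0.5 * ||predict(z) - target||^2)."""
    z = z0.copy()
    for _ in range(steps):
        err = predict(z) - target
        if np.sum(err ** 2) < tol:       # converged to the targeted properties
            break
        z -= lr * grad(z, err)
    return z

# Toy linear property head p(z) = W z (an assumption, for illustration only):
# two target properties (e.g. bandgap, formation enthalpy) from an 8-d latent.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8)) / np.sqrt(8)
predict = lambda z: W @ z
grad = lambda z, err: W.T @ err          # gradient of 0.5 * ||W z - target||^2
target = np.array([1.2, -0.4])
z_star = optimize_latent(predict, grad, rng.normal(size=8), target)
```

The optimized latent point `z_star` would then be passed to the decoder to produce a candidate crystal structure whose predicted properties match the target.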
MEIDNet framework and multimodal materials representation learning
Experiments employed three diverse datasets, Perovskite-5, MP-20, and Carbon-24, to rigorously test the framework's performance across varying material classes. The study pioneered a curriculum learning strategy, yielding approximately 60-fold greater learning efficiency than conventional training methods. The researchers used the Perovskite-5 dataset, comprising approximately 19,000 five-atom ABX3 structures, as a benchmark for multimodal model performance. Analysis of modality embeddings revealed L2 distances of approximately 0.24, further confirming strong alignment between the learned representations.
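Curriculum learning generally means ordering training data from easy to hard and releasing progressively harder tranches. The sketch below shows this generic idea; the difficulty measure and schedule are assumptions, since the paper's actual curriculum recipe is not reproduced here:

```python
import numpy as np

def curriculum_schedule(samples, difficulty, n_stages=3):
    """Sort samples by a difficulty score and release progressively harder
    tranches; each stage trains on all data seen so far."""
    order = np.argsort(difficulty)                 # easiest first
    tranches = np.array_split(order, n_stages)
    seen, schedule = [], []
    for tranche in tranches:
        seen.extend(tranche.tolist())
        schedule.append([samples[i] for i in seen])
    return schedule

samples = ["a", "b", "c", "d", "e", "f"]
difficulty = np.array([3, 1, 2, 6, 5, 4])          # e.g. structural complexity
schedule = curriculum_schedule(samples, difficulty)
```

The payoff of such a schedule is faster convergence: the model first fits the simple structures, then reuses that representation when the harder ones arrive.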
To demonstrate the potential of MEIDNet, the team generated 140 perovskite crystal structures targeting thermodynamically stable materials with low bandgaps, ranging from 0.8 to 1.5 eV. Predicted bandgaps from the generated crystals closely matched those obtained using a crystal graph convolutional neural network (CGCNN) for single-property prediction, with a mean absolute error of approximately 0.02 eV. Of these structures, 19 proved stable, unique, and novel, a SUN rate of ~13.6% that represents a state-of-the-art achievement for multimodal materials design models. The approach enables efficient exploration of chemical-structural space and facilitates the discovery of materials satisfying predefined property targets, demonstrating both scalability and adaptability for universal learning across diverse modalities.
MEIDNet achieves efficient perovskite structure generation and prediction
The crystal encoder transforms 3D crystal structures into embedded latent representations, employing the EGNN for structural encodings via message passing, feature aggregation, and coordinate updates. Results confirm the framework's equivariance to translations, rotations, reflections, and node permutations, crucial for accurate structural representation. A 128-dimensional latent space was identified as optimal, yielding a higher structure-matching (SM) rate at a comparable computational cost, as detailed in supplementary Figure S1. Tests show that MEIDNet outperforms the unimodal Fourier transformed crystal properties (FTCP) and crystal diffusion variational autoencoder (CDVAE) across the Perovskite-5, MP-20, and Carbon-24 datasets in terms of SM rate, as shown in Table 1.
Specifically, MEIDNet achieved SM rates of 99.85% for Perovskite-5, 72.35% for MP-20, and 66.4% for Carbon-24, exceeding the performance of both FTCP and CDVAE. The property encoder, implemented as a multilayer perceptron (MLP), projects the scalar material properties, bandgap and formation enthalpy, into a 128-dimensional embedding, ensuring dimensionality consistency for alignment via contrastive learning. Contrastive learning, utilising the InfoNCE loss function, unifies structural and property encodings, optimising information alignment between modalities. The alignment is quantified by cosine similarity, with the temperature hyperparameter τ controlling the sensitivity of the contrastive loss. An iterative optimisation pipeline was implemented for target-led navigation in the aligned latent space, converging predicted properties to targeted values for generated crystal structures. This offers a more compact latent space than diffusion models and facilitates interpretability.
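The property encoder's role, projecting two scalars into the shared 128-d space, can be sketched as a small MLP. Only the 2-in/128-out dimensionality follows the text; the hidden width, depth, and activation here are illustrative guesses:

```python
import numpy as np

def property_encoder(props, W1, b1, W2, b2):
    """Two-layer MLP mapping [bandgap, formation enthalpy] to a 128-d
    property embedding z_p for contrastive alignment with z_s."""
    hidden = np.maximum(0.0, props @ W1 + b1)   # ReLU hidden layer
    return hidden @ W2 + b2                     # 128-d property embedding

# Random stand-in weights; in practice these are learned jointly with the
# crystal encoder under the InfoNCE objective.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 128)) * 0.1, np.zeros(128)
z_p = property_encoder(np.array([[1.1, -0.8]]), W1, b1, W2, b2)  # eV, eV/atom
```

Matching the 128-d structural embedding's dimensionality is what lets cosine similarity compare the two modalities directly during contrastive training.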
MEIDNet efficiently designs stable perovskite structures
This work establishes a promising strategy integrating multimodal generative models with dynamic instability remediation and validation techniques, effectively delivering physically grounded materials for experimental investigation. The authors acknowledge the current focus on electronic and thermodynamic properties, but highlight the model’s adaptability to other properties and its potential for creating a universal latent representation across diverse modalities. Future work could extend this framework to explore a wider range of chemical systems and material properties, accelerating computational materials discovery.
👉 More information
🗞 MEIDNet: Multimodal generative AI framework for inverse materials design
🧠 ArXiv: https://arxiv.org/abs/2601.22009
