Scientists are increasingly utilising graph generative models to accelerate materials discovery, yet a fundamental limitation remains largely unaddressed: the structure size at which these models begin to produce unreliable outputs. Can Polat, Erchin Serpedin, Mustafa Kurban, and Hasan Kurban, from Texas A&M University, Ankara University, and Hamad Bin Khalifa University, characterise this ‘extrapolation frontier’ by introducing RADII, a benchmark comprising 75,000 nanoparticle structures. The work is significant because it systematically measures the scaling limits of these models, revealing substantial variation in performance and establishing output scale as a crucial evaluation metric for geometric generative models. Their analysis of five state-of-the-art architectures demonstrates that prediction accuracy degrades predictably with increasing size, offering a pathway to quantitatively forecast model reliability and improve nanomaterial design.
Every generative model possesses an extrapolation frontier, a critical structure size beyond which its outputs become unreliable, and this research establishes a method for quantifying that limit.
The work addresses a significant gap in materials science by providing the first systematic measurement of this frontier, crucial for the design of nanomaterials with predictable properties. RADII employs radius as a continuous scaling parameter, allowing researchers to trace generation quality from within the training distribution to regimes where models are extrapolating beyond learned data.
This approach utilises a leakage-free data split, ensuring a clear separation between in-distribution interpolation and out-of-distribution extrapolation, enabling precise identification of each model’s scaling ceiling. The benchmark incorporates frontier-specific diagnostics, including per-radius error profiles to pinpoint scaling limits, surface-interior decomposition to determine the origin of failures, and cross-metric failure sequencing to reveal which structural aspects degrade first.
Benchmarking five state-of-the-art architectures revealed that all models exhibit a degradation of approximately 13% in global positional error beyond their training radii. However, the fidelity of local bond arrangements diverges considerably, ranging from near-perfect preservation to a more than twofold collapse.
Notably, no two architectures share the same failure sequence, demonstrating that the extrapolation frontier is a multi-dimensional surface shaped by the underlying model family. Well-behaved models adhere to a power-law scaling exponent of approximately 1/3, allowing for accurate prediction of out-of-distribution error based on in-distribution performance.
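The power-law prediction can be made concrete with a short fit. The sketch below uses synthetic per-radius errors rather than the paper's measurements: it assumes an error of the form ε(r) = a·r^(1/3), fits log-error against log-radius on in-distribution radii only, and extrapolates the fit to larger radii. The 1.8 nm training cutoff, the coefficient, and the noise level are all illustrative assumptions.

```python
import numpy as np

# Synthetic per-radius errors following the reported ~1/3 power law,
# eps(r) = a * r**(1/3), with mild noise; a and the radii are illustrative.
rng = np.random.default_rng(0)
radii = np.linspace(0.6, 3.0, 25)
true_a, true_b = 0.05, 1 / 3
errors = true_a * radii**true_b * rng.normal(1.0, 0.01, radii.size)

in_dist = radii <= 1.8  # assumed training cutoff

# Fit log eps = log a + b log r on in-distribution radii only.
b_fit, log_a_fit = np.polyfit(np.log(radii[in_dist]), np.log(errors[in_dist]), 1)

# Extrapolate to out-of-distribution radii and compare against held-out errors.
pred = np.exp(log_a_fit) * radii[~in_dist]**b_fit
rel_gap = np.abs(pred - errors[~in_dist]) / errors[~in_dist]
print(f"fitted exponent ~= {b_fit:.3f}, worst OOD relative miss {rel_gap.max():.1%}")
```

The recovered exponent sits close to 1/3, illustrating how an in-distribution fit can forecast out-of-distribution error for a well-behaved model.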
These findings establish output scale as a primary evaluation criterion for geometric generative models, moving beyond traditional assessments focused on fixed output sizes. The dataset and associated code are publicly available, facilitating further research and development in this critical area of materials science.
Radius extrapolation via systematic nanoparticle size variation
A radius-resolved benchmark named RADII systematically maps the extrapolation frontier of generative models for crystalline materials using 75,000 nanoparticle structures. These structures contain between 55 and 11,298 atoms and span a radius range enabling continuous scaling from in-distribution to out-of-distribution regimes.
The research employed a leakage-free data split to cleanly separate orientation interpolation, representing in-distribution data, from radius extrapolation, defining out-of-distribution data. RADII utilises radius as a continuous scaling knob, linking primitive unit cells to nanoparticles across 25 size configurations ranging from 0.6 to 3.0 nm.
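A leakage-free split of this kind can be sketched in a few lines. The snippet below builds a toy catalogue keyed by (material, radius, orientation); the two materials, the 1.8 nm training cutoff, and the four orientations are illustrative assumptions, not the benchmark's actual configuration. The key property is that out-of-distribution test radii never appear in training, while in-distribution testing uses a held-out orientation at training radii.

```python
# Hypothetical catalogue keyed by (material, radius_nm, orientation_id);
# materials and orientation count are illustrative, not the benchmark's own.
radii = [round(0.6 + 0.1 * i, 1) for i in range(25)]  # 25 configurations, 0.6-3.0 nm
orientations = range(4)
structures = [(m, r, o) for m in ["Au", "TiO2"] for r in radii for o in orientations]

train_radius_cap = 1.8  # assumed training cutoff

train = [s for s in structures if s[1] <= train_radius_cap and s[2] != 0]
id_test = [s for s in structures if s[1] <= train_radius_cap and s[2] == 0]  # orientation interpolation
ood_test = [s for s in structures if s[1] > train_radius_cap]                # radius extrapolation

# Leakage check: no structure appears in more than one split.
assert not set(train) & set(id_test) and not set(train) & set(ood_test)
```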
This approach allows for precise identification of each model’s scaling ceiling and facilitates the tracing of per-radius error profiles for every model-material pair. Failures are then decomposed into contributions originating from the surface versus the interior of the generated structures. The study quantifies extrapolation gap severity across complementary metrics, providing a comprehensive assessment of structural fidelity.
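The surface-interior decomposition can be illustrated with a small helper. The sketch below is a simplification of whatever geometric criterion the benchmark actually uses: it labels the outermost shell of atoms (with an assumed shell thickness) as surface and averages per-atom positional error separately over the two regions.

```python
import numpy as np

def surface_interior_error(positions, per_atom_error, shell=0.3):
    """Split a nanoparticle's per-atom error into surface vs interior averages.

    positions: (N, 3) Cartesian coordinates; per_atom_error: (N,) displacement
    magnitudes; shell: assumed surface-shell thickness in the same length units.
    """
    center = positions.mean(axis=0)
    dist = np.linalg.norm(positions - center, axis=1)
    surface = dist >= dist.max() - shell  # outermost shell counts as surface
    return per_atom_error[surface].mean(), per_atom_error[~surface].mean()

# Toy check: if errors grow with distance from the centre, the surface
# shell should carry the larger average error.
rng = np.random.default_rng(1)
pos = rng.normal(size=(500, 3))
err = np.linalg.norm(pos - pos.mean(axis=0), axis=1) * 0.01
surf_err, bulk_err = surface_interior_error(pos, err, shell=0.5)
assert surf_err > bulk_err
```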
Benchmarking five state-of-the-art generative architectures, researchers assessed global positional error and local bond fidelity, revealing significant divergence in the latter across models, ranging from near-zero degradation to a more than twofold collapse beyond training radii. Furthermore, the work demonstrates that no two architectures share the same failure sequence, highlighting the multi-dimensional nature of the extrapolation frontier, shaped by both model family and material symmetry.
Analysis revealed that well-behaved models adhere to a power-law scaling exponent, allowing for accurate prediction of out-of-distribution error based on in-distribution fit. This establishes output scale as a crucial evaluation axis for geometric generative models and demonstrates that current scaling limits are predictable rather than random. The dataset is publicly available to facilitate reproducibility and further development in the field.
Radius-dependent performance limits of nanoparticle generative models
RADII, a radius-resolved benchmark comprising 75,000 nanoparticle structures ranging from 55 to 11,298 atoms, introduces a method for tracing generation quality from in-distribution to out-of-distribution regimes. This work treats radius as a continuous scaling knob, enabling leakage-free splits for analysis.
Per-radius error profiles pinpoint each architecture’s scaling ceiling, revealing the point at which performance begins to degrade. Surface-interior decomposition tests determine whether failures originate at the boundaries or within the bulk of the generated structures, providing insight into the nature of the errors.
Benchmarking five state-of-the-art architectures revealed that all models exhibit degradation in global positional error beyond their training radii. However, local bond fidelity diverges significantly across architectures, ranging from near-zero degradation to a more than twofold collapse. No two architectures share the same failure sequence, demonstrating that the extrapolation frontier is a multi-dimensional characteristic shaped by the model family.
Well-behaved models adhere to a power-law scaling exponent, and in-distribution fits accurately predict out-of-distribution error, making their frontiers quantitatively forecastable. The research establishes output scale as a primary evaluation axis for geometric generative models. The dataset facilitates frontier-specific diagnostics, allowing for detailed analysis of model performance at different scales.
These findings are particularly relevant because physics-based alternatives struggle to cover the size ranges that generative models are increasingly tasked with targeting. Kohn-Sham DFT, while accurate, scales as O(N³), limiting routine simulations to smaller structures. RADII targets geometric scaling behaviour, for which physics-based generation is unnecessary; it would in any case be infeasible at benchmark scale.
Deterministic symmetry-preserving construction provides the scalable ground truth needed to map extrapolation frontiers. Power-law scaling relationships, with an identified exponent of approximately 1/3, connect to established principles in neural scaling laws and finite-size scaling in statistical physics. This connection allows for a deeper understanding of the relationship between structure size and generation error.
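A minimal stand-in for such a deterministic construction is to replicate a lattice and keep every site within the target radius. The simple-cubic lattice and 0.4 nm spacing below are illustrative choices, not the benchmark's materials; real crystals would replicate a multi-atom unit cell, but the volume scaling is the same. The atom count grows roughly as r³, which is what lets radius act as a continuous scaling knob.

```python
import numpy as np

def carve_nanoparticle(radius, lattice_const=0.4):
    """Keep all simple-cubic lattice sites within `radius` of the origin (nm)."""
    n = int(np.ceil(radius / lattice_const))
    axis = np.arange(-n, n + 1) * lattice_const
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    # Small tolerance so sites landing exactly on the radius survive float rounding.
    return grid[np.linalg.norm(grid, axis=1) <= radius + 1e-9]

# Doubling the radius multiplies the atom count by roughly eight.
counts = {r: len(carve_nanoparticle(r)) for r in (0.6, 1.2, 2.4)}
```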
Extrapolation frontiers define limits of generative crystal modelling
Scientists have established output scale as a primary evaluation criterion for geometric generative models of crystalline materials. A new benchmark, termed RADII, systematically measures the limits of these models by assessing performance across a range of nanoparticle sizes, from 55 to 11,298 atoms.
This radius-resolved approach reveals a critical structure size, termed the extrapolation frontier, beyond which model outputs become unreliable, a phenomenon previously unquantified. Benchmarking five current architectures revealed that all models exhibit degradation in global positional error when extrapolating beyond their training radii.
However, the decline in local bond fidelity varied considerably between architectures, ranging from minimal to substantial collapse. Importantly, each architecture demonstrated a unique failure sequence, indicating that the extrapolation frontier is a complex, multi-dimensional characteristic shaped by the specific model family.
Well-performing models adhered to a power-law scaling exponent, allowing for accurate prediction of out-of-distribution error from in-distribution performance. The authors acknowledge limitations, including the use of idealised reference structures and a constrained computational budget. The current benchmark covers a specific size range of nanoparticles and ten selected materials, not the full breadth of inorganic chemistry.
Future research will expand the benchmark to include data derived from density functional theory calculations, explore unconditioned evaluation methods, and investigate size-conditioned training strategies to extend the reliable operating range of these generative models. Further work will also assess the applicability of the observed scaling laws to other material types, such as proteins and amorphous solids.
👉 More information
🗞 How Far Can You Grow? Characterizing the Extrapolation Frontier of Graph Generative Models for Materials Science
🧠 arXiv: https://arxiv.org/abs/2602.09309
