The ability of artificial intelligence to tackle complex, open-ended problems remains a significant challenge, particularly when moving beyond curated datasets and into the realm of active scientific research. To address this, Minhui Zhu, Minyang Tian, Xiaocheng Yang, and colleagues at various institutions introduce CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), a new benchmark that probes the reasoning capabilities of large language models using unpublished, research-level physics problems spanning diverse fields from condensed matter physics to astrophysics, contributed by over fifty active researchers. The team demonstrates that current state-of-the-art models, even those equipped with coding tools, struggle with these full-scale research challenges, achieving only modest accuracy. CritPt offers a realistic and standardized evaluation, highlighting a substantial gap between current AI capabilities and the demands of genuine scientific inquiry, and provides a valuable tool for guiding the development of more scientifically grounded artificial intelligence.
Topological Quantum Materials and Exotic Phases
This research area encompasses a diverse range of topics spanning physics, materials science, computer science, and astronomy. A dominant theme is condensed matter physics and materials science: researchers investigate topological materials such as topological insulators and topological crystalline insulators, alongside two-dimensional materials such as graphene and tungsten diselenide. They explore quantum materials exhibiting exotic phases, including quantum spin liquids and Majorana metals, with attention to emergent quantum phenomena and fractionalization, and study strongly correlated electron systems in which electron interactions play a crucial role. Investigations extend to europium-based materials and the analysis of collective charge excitations using techniques such as resonant inelastic x-ray scattering and electron energy loss spectroscopy.
Understanding the effects of disorder and localization in materials, and characterizing their magnetic properties and spin excitations, are also key areas of focus. This work also connects to astrophysics and cosmology, particularly in the search for dark matter, employing gravitational wave detectors and interferometers to detect signals from axions, dark photons, and scalar field dark matter. Theoretical studies delve into models like the Kitaev model, used to understand quantum spin liquids, and explore general quantum phenomena and excitons in materials. This research draws on advanced experimental techniques including various forms of spectroscopy, neutron diffraction, and interferometry.

To ground the benchmark in this frontier work, the CritPt team collaborated with over 50 active physics researchers to create 71 complex challenges simulating full-scale research projects, alongside 190 modular checkpoints offering more granular insight into specific subtasks. All problems are newly created and grounded in the contributors' ongoing research, ensuring relevance to current frontier physics and avoiding reliance on textbook exercises. Strict design criteria prevent inflated performance due to memorization or data contamination: every problem is unpublished and hand-curated.
Problems demand solutions beyond simple recall: answers take machine-verifiable forms, such as arrays of floating-point numbers and complex symbolic expressions, that are resistant to shortcut guessing. Each problem is constructed to be self-contained and well-defined, with answers verifiable without external resources, and the collection spans diverse areas of modern physics, including condensed matter physics, astrophysics, and nuclear physics. To overcome the resource intensity of traditional expert-based grading, the team developed a physics-informed, scalable auto-grading pipeline that assesses LLM responses efficiently and accurately, moving beyond simple final-answer checking to evaluate reasoning steps.
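The paper's grader itself is not reproduced here, but the two answer formats named above suggest what such a pipeline must handle at minimum. The following is a minimal, hypothetical sketch in Python: the function names, the relative tolerance, and the simplify-to-zero equivalence test are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import sympy as sp

def grade_float_array(predicted, reference, rtol=1e-3):
    """Numeric checkpoint: accept if every element matches the
    reference array within a relative tolerance (value assumed)."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return predicted.shape == reference.shape and bool(
        np.allclose(predicted, reference, rtol=rtol)
    )

def grade_symbolic(predicted, reference, symbol_names):
    """Symbolic checkpoint: parse both expressions and accept if
    their difference simplifies to zero."""
    syms = {name: sp.Symbol(name) for name in symbol_names}
    diff = sp.sympify(predicted, locals=syms) - sp.sympify(reference, locals=syms)
    return sp.simplify(diff) == 0

# The two machine-verifiable formats described above:
print(grade_float_array([0.4999, 2.001], [0.5, 2.0]))       # True
print(grade_symbolic("sin(x)**2", "1 - cos(x)**2", ["x"]))  # True
```

Symbolic equivalence is undecidable in general, so a simplify-to-zero check is only a heuristic; a production grader would likely combine it with numerical sampling of both expressions.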
Initial evaluations of several leading LLMs, including GPT-5 and Claude Opus 4, reveal limited performance on these complex challenges: the best average accuracy achieved by base models is only around 4.0%, rising to approximately 10% when the models are equipped with coding tools. These results demonstrate a significant disconnect between current LLM capabilities and the demands of realistic physics research. CritPt provides a standardized and challenging framework for assessing LLM performance in this critical area, offering valuable guidance for the development of scientifically grounded AI tools and highlighting where further advances are needed before such tools can meaningfully assist physicists with frontier research.
CritPt Benchmark Evaluates Physics Reasoning Capabilities
The development of the CritPt benchmark represents a significant advance in evaluating the reasoning capabilities of large language models within the complex domain of physics research. CritPt comprises 71 research-scale challenges and 190 modular checkpoints, specifically designed to mirror the demands of authentic, open-ended physics investigations across a broad range of specializations. These challenges were not drawn from textbooks but were newly authored by over 50 active physics researchers, grounded in their current research and rigorously vetted for accuracy and clarity. The benchmark employs an automated grading system capable of handling the specialized output formats common in physics, ensuring reliable evaluation of model responses.
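To make the challenge/checkpoint distinction concrete, here is one hypothetical way such records could be represented; the schema below, including names like `physics_area` and `answer_type`, is invented for illustration and does not reflect the benchmark's released data format.

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """A modular subtask with a single machine-verifiable answer."""
    prompt: str
    answer_type: str          # assumed labels: "float_array" or "symbolic"
    reference_answer: object  # ground truth consumed by the auto-grader

@dataclass
class Challenge:
    """A full research-scale problem decomposed into checkpoints."""
    physics_area: str              # e.g. "condensed matter", "astrophysics"
    statement: str                 # self-contained, well-defined problem text
    checkpoints: list[Checkpoint] = field(default_factory=list)

def checkpoint_accuracy(grades: list[bool]) -> float:
    """Granular score: fraction of a challenge's checkpoints solved."""
    return sum(grades) / len(grades) if grades else 0.0
```

Under this sketch, scoring the 71 full challenges corresponds to grading end-to-end answers, while the 190 checkpoints support the more granular `checkpoint_accuracy` view of where a model's reasoning breaks down.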
Results demonstrate a considerable gap between current state-of-the-art language models and the requirements of genuine physics research. While models achieve limited success on isolated tasks, their accuracy on full research-scale challenges remains low, with the best-performing model reaching only around 10% accuracy even when equipped with coding tools. CritPt provides a valuable resource for guiding the development of AI tools tailored to assist physicists, offering actionable feedback for developers and a standardized platform for measuring progress in this challenging area.
👉 More information
🗞 Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
🧠 ArXiv: https://arxiv.org/abs/2509.26574
