AI’s Quantum Knowledge Tested: Models Fail 77% of Core Concept Questions

Researchers are increasingly assessing the capacity of large language models (LLMs) to reason about complex scientific topics, yet a systematic evaluation of their conceptual understanding remains largely absent. Afane, Laufer and Wei, from Fordham University, alongside Mao, Farooq and Wang et al., address this deficiency with Quantum-Audit, a novel benchmark comprising 2,700 questions designed to probe LLM reasoning on quantum computing. This work is significant because it moves beyond evaluating code synthesis to directly measure conceptual grasp, utilising both expert-authored and LLM-generated questions, including those with deliberately false premises. Their findings reveal that while top-performing models, such as Claude Opus 4.5, can surpass average human expert scores, performance diminishes on challenging topics and, crucially, models often fail to identify and correct flawed assumptions, highlighting a critical limitation in their reasoning abilities.

Evaluating Reasoning in Large Language Models for Quantum Computing

Researchers have developed a comprehensive benchmark, Quantum-Audit, to rigorously evaluate the reasoning capabilities of large language models (LLMs) in the complex field of quantum computing. This new assessment addresses a critical gap in understanding how well these AI systems grasp specialised quantum concepts, moving beyond evaluations of code generation and circuit design.

Quantum-Audit comprises 2,700 questions spanning core quantum computing topics, designed to challenge LLMs with varying levels of difficulty and reasoning demands. The benchmark incorporates expert-written questions, those extracted from research papers and validated by experts, open-ended prompts, and crucially, questions containing false premises to test critical thinking skills.

Human participants achieved scores ranging from 23% to 86%, with quantum computing experts averaging 74% accuracy on the assessment. Top-performing models, including Claude Opus 4.5, surpassed this expert average, reaching 84% accuracy overall. However, a notable 12-point accuracy drop was observed when these same models tackled questions originally authored by human experts rather than those extracted from research papers by LLMs, suggesting a potential bias towards the style of LLM-generated content.

Performance diminished further when evaluating understanding of advanced topics, with accuracy on security-related questions falling to 73%. Furthermore, the study revealed a significant weakness in the models’ ability to identify and correct false information. Accuracy on questions deliberately containing incorrect assumptions fell below 66%, indicating a tendency to accept and reinforce flawed reasoning.

This comprehensive evaluation, encompassing 26 leading LLMs and a multilingual subset in Spanish and French, establishes a new standard for assessing the reliability of AI assistance in quantum computing education and research. The work highlights both the promise and the limitations of current LLMs, paving the way for improved models capable of nuanced understanding and accurate reasoning in this rapidly evolving field.

Construction and validation of the quantum computing benchmark Quantum-Audit

A 2,700-question benchmark, termed Quantum-Audit, was constructed to systematically measure the conceptual understanding of large language models in quantum computing. The benchmark comprises 1,000 questions authored by quantum computing experts and a further 1,000 questions automatically generated by LLMs from research papers, subsequently validated by human experts.

An additional 700 questions were included to assess more nuanced reasoning skills, specifically 350 open-ended questions demanding detailed explanations and 350 questions deliberately containing false premises to test critical reasoning abilities. Human participants achieved scores ranging from 23% to 86%, with experts averaging 74% accuracy on the benchmark questions.
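To make the composition concrete, the minimal Python sketch below lays out the four question pools; only the counts come from the paper, while the category labels and the record fields are illustrative assumptions, not the benchmark’s actual schema.

```python
from dataclasses import dataclass

# Question counts as reported for Quantum-Audit; the category labels and the
# Question record below are illustrative assumptions, not the paper's schema.
COMPOSITION = {
    "expert_written": 1000,   # authored by quantum computing experts
    "llm_extracted": 1000,    # generated by LLMs from research papers, expert-validated
    "open_ended": 350,        # require detailed free-form explanations
    "false_premise": 350,     # contain a deliberately incorrect assumption
}

@dataclass
class Question:
    qid: str
    category: str         # one of COMPOSITION's keys (assumed labels)
    topic: str            # e.g. "security", "error_correction" (assumed values)
    language: str = "en"  # "es" / "fr" for the multilingual subset
    text: str = ""

assert sum(COMPOSITION.values()) == 2700  # matches the reported benchmark size
```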

The study evaluated 26 models from leading organizations, comparing their performance against 43 quantum computing experts and practitioners to establish baseline human capabilities. Performance was assessed across seven core quantum computing topics: quantum algorithms, error correction, security protocols, distributed computing, quantum machine learning, gates and circuits, and foundational concepts.

Multi-dimensional analysis was performed, incorporating the open-ended and false-premise questions to provide a comprehensive evaluation of model reasoning and error detection. A multilingual subset of the benchmark, including questions in Spanish and French, was also used to assess cross-lingual performance and identify potential language-specific biases.
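As a rough illustration of how such a multi-dimensional breakdown can be computed, the sketch below groups graded answers by topic or by language; the record format and helper name are assumptions for illustration, not the authors’ evaluation code.

```python
from collections import defaultdict

def accuracy_breakdown(results, key):
    """Mean accuracy grouped by an arbitrary key (e.g. 'topic' or 'language').

    `results` is assumed to be a list of dicts such as
    {"topic": "security", "language": "en", "correct": True} --
    an illustrative format, not the benchmark's actual output schema.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

# Example usage with made-up grades:
sample = [
    {"topic": "security", "language": "en", "correct": False},
    {"topic": "security", "language": "fr", "correct": True},
    {"topic": "foundations", "language": "en", "correct": True},
]
print(accuracy_breakdown(sample, "topic"))     # per-topic accuracy
print(accuracy_breakdown(sample, "language"))  # per-language accuracy
```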

The methodology innovatively combined expert-authored questions with LLM-generated questions, ensuring a broad coverage of quantum computing concepts and challenging the models with both established knowledge and current research. The inclusion of false-premise questions represented a novel approach to evaluating critical reasoning, probing the models’ ability to identify and correct inaccuracies rather than simply recalling facts. This rigorous evaluation framework enabled the identification of systematic failures in advanced topics, such as quantum security, despite strong performance on foundational concepts, and highlighted significant performance degradation in multilingual settings.
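To make the false-premise idea concrete, here is a hypothetical example of such a question together with a naive check for whether a model’s answer pushes back on the flawed assumption; the question text, keyword heuristic and function name are all illustrative and do not reflect the paper’s actual items or grading procedure.

```python
# Hypothetical false-premise item: the premise is wrong (Grover's algorithm
# gives a quadratic, not exponential, speedup for unstructured search).
QUESTION = (
    "Since Grover's algorithm provides an exponential speedup for unstructured "
    "search, how many iterations does it need for a database of N items?"
)

# A naive keyword heuristic for whether an answer challenges the premise.
# Real grading would need expert or rubric-based judging, not string matching.
PUSHBACK_MARKERS = ("quadratic", "not exponential", "premise is incorrect",
                    "misconception")

def challenges_premise(answer: str) -> bool:
    a = answer.lower()
    return any(marker in a for marker in PUSHBACK_MARKERS)

good = "The premise is incorrect: Grover's speedup is quadratic, about sqrt(N) iterations."
bad = "Thanks to the exponential speedup, only log(N) iterations are needed."
print(challenges_premise(good))  # True  -> model corrected the assumption
print(challenges_premise(bad))   # False -> model reinforced the false premise
```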

Large language model performance on the Quantum-Audit benchmark

Claude Opus 4.5 achieved 84% accuracy on the 2,700-question Quantum-Audit benchmark, while GPT-5.2 Pro reached 83.75% and Claude Sonnet 4.5 attained 83.30%. The benchmark comprises 1,000 expert-written questions, 1,000 questions generated by large language models and validated by experts, and an additional 700 questions designed to assess critical reasoning skills.

Human experts averaged 74% on the benchmark, demonstrating that top-performing models surpassed human performance levels. A notable performance disparity emerged between question types, with models consistently scoring lower on expert-written questions. Claude Opus 4.5 and GPT-5.2 Pro achieved 78.40% and 77.70% respectively on expert-written questions, compared to 89.60% and 89.80% on LLM-extracted questions.
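The short calculation below, using only the two models’ reported scores, shows where the roughly 12-point figure comes from.

```python
# Reported accuracies (%) on the two question sources for the top two models.
scores = {
    "Claude Opus 4.5": {"expert_written": 78.40, "llm_extracted": 89.60},
    "GPT-5.2 Pro":     {"expert_written": 77.70, "llm_extracted": 89.80},
}

# Gap between LLM-extracted and expert-written accuracy, per model.
gaps = {m: round(s["llm_extracted"] - s["expert_written"], 2) for m, s in scores.items()}
print(gaps)                                      # {'Claude Opus 4.5': 11.2, 'GPT-5.2 Pro': 12.1}
print(round(sum(gaps.values()) / len(gaps), 2))  # 11.65 -> roughly a 12-point average drop
```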

This 12-point average accuracy drop suggests that expert-authored questions require a deeper level of reasoning or test concepts less prevalent in typical training data. Performance further declined when evaluating topic-specific knowledge, particularly in the area of quantum security. On basic quantum computing concepts, leading models exceeded 90% accuracy, indicating a strong grasp of foundational principles.

However, accuracy on quantum algorithms decreased to between 80% and 82%, and dropped to approximately 74% on security questions. These security questions assessed understanding of advanced topics such as phase mismatch attacks, crosstalk-based attacks, QubitHammer techniques, and quantum backdoor insertion methods, representing rapidly evolving research areas with limited literature. The benchmark also assessed critical reasoning, revealing that model accuracy fell below 66% on questions containing false premises, indicating a frequent tendency to accept and reinforce incorrect assumptions.

Large language model proficiency with quantum computing concepts and critical reasoning

Researchers have developed Quantum-Audit, a comprehensive benchmark comprising 2,700 questions designed to assess the understanding of core quantum computing concepts by large language models. The benchmark includes questions originating from expert knowledge, research papers validated by experts, open-ended prompts, and questions containing false premises intended to test critical reasoning abilities.

Evaluation of 26 models revealed that while top-performing models, such as Claude Opus 4.5, can exceed average human expert scores, achieving 84% accuracy overall, performance diminishes on more complex topics and question formats. Specifically, models demonstrated a notable decline in accuracy when addressing quantum security questions, falling to 73%, and struggled with questions containing false premises, achieving less than 66% accuracy in identifying and correcting erroneous assumptions.

Expert-written questions proved more challenging for models than those generated using other language models, resulting in a 10 to 15 percentage point difference in scores. Multilingual testing, encompassing Spanish and French, indicated that leading models maintained reasonable performance, though smaller models exhibited reduced accuracy.

These findings highlight both the advancements and limitations of current large language models in specialised domains like quantum computing, and underscore the need for continued rigorous evaluation as the field progresses. The authors acknowledge that accuracy, while a direct measure of knowledge, could be complemented by metrics assessing calibration and semantic similarity.

Future research will focus on expanding the benchmark’s multilingual coverage to include more languages and diverse source materials, providing a more comprehensive assessment of cross-lingual performance. This work establishes a valuable resource for evaluating language models in quantum computing, revealing a tendency to accept false premises and demonstrating a performance gap between basic and advanced concepts.

👉 More information
🗞 Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
🧠 ArXiv: https://arxiv.org/abs/2602.10092

Rohail T.

As a quantum scientist, I explore the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
