The increasing power of large language models extends to code generation, but their true capabilities in this specialised area remain largely unexplored. Xiaoyu Guo, Minggu Wang, and Jianjun Zhao, all from Kyushu University, address this gap by introducing QuanBench, a new benchmark designed to rigorously evaluate these models on quantum code generation. QuanBench comprises 44 diverse programming tasks, spanning essential areas of quantum computation such as algorithms and circuit construction, and assesses both functional correctness and semantic accuracy. Their evaluation of several recent language models reveals a significant limitation, with current systems achieving less than 40% accuracy and frequently producing code with semantic errors, establishing a crucial baseline for future advancements in this rapidly evolving field.
QuanBench Evaluates Quantum Code Generation
The researchers developed QuanBench, a new benchmark that rigorously assesses the ability of large language models (LLMs) to generate quantum code, addressing a critical gap in evaluating performance beyond standard programming tasks. The benchmark focuses specifically on the Qiskit framework and covers a diverse set of quantum algorithms, including Deutsch-Jozsa, Grover’s search, and the Variational Quantum Eigensolver. QuanBench tests LLMs’ ability to produce correct, reliable, and efficient Qiskit code, evaluating functional correctness, code reliability across varying inputs, and optimization of quantum resource usage. The team evaluated prominent LLMs including Claude, CodeGemma, DeepSeek-Coder, GPT-4, Qwen2.5-Coder, and StarCoder, models refined with techniques such as instruction tuning and reinforcement learning from human feedback to improve instruction following and reasoning. Temperature scaling was used to control the randomness of the generated code. This work addresses a significant gap in evaluating LLMs for specialized domains like quantum computing, offering a standardized benchmark for researchers and developers and contributing to the advancement of quantum software development.
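To make the task format concrete, the snippet below is a minimal sketch of the kind of short Qiskit program such a benchmark task might ask for, here a two-qubit Grover search; it is illustrative only and not drawn from the benchmark itself.

```python
# Illustrative sketch (not an actual QuanBench task): the kind of short Qiskit
# program the benchmark asks for -- here, a 2-qubit Grover search for |11>.
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

grover = QuantumCircuit(2)
grover.h([0, 1])            # uniform superposition over the four basis states

# Oracle: flip the phase of the marked state |11>
grover.cz(0, 1)

# Diffusion operator: inversion about the mean
grover.h([0, 1])
grover.x([0, 1])
grover.cz(0, 1)
grover.x([0, 1])
grover.h([0, 1])

# On 2 qubits, a single Grover iteration finds the marked state with certainty
probs = Statevector.from_instruction(grover).probabilities_dict()
print(probs)  # expect the probability mass concentrated on '11'
```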
LLMs Struggle with Quantum Code Generation
Even the best-performing models achieve overall accuracy below 40%, indicating substantial room for improvement in semantic correctness. The study assessed model outputs using both functional correctness and quantum semantic equivalence, revealing that while some models occasionally generate correct quantum code, most solve only a subset of the 44 tasks. Analysis of common failure patterns reveals issues such as outdated API usage, errors in circuit construction, and incorrect algorithm logic, highlighting key limitations in current LLM capabilities. The researchers evaluated nine recent LLMs, including general-purpose and code-specialized models, establishing a systematic method for evaluating LLMs on quantum programming tasks.
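The outdated-API failure mode is easy to picture: code written against pre-1.0 Qiskit no longer runs on current releases. The sketch below is an assumed illustration of this pattern, not an example taken from the paper.

```python
# Assumed illustration of the outdated-API failure mode (not from the paper).
# Pre-1.0 style that current Qiskit no longer accepts:
#
#   from qiskit import Aer, execute   # removed in Qiskit 1.0
#   counts = execute(qc, Aer.get_backend("qasm_simulator"), shots=1024).result().get_counts()
#
# A working equivalent on current releases uses the separately packaged simulator:
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

sim = AerSimulator()
counts = sim.run(transpile(qc, sim), shots=1024).result().get_counts()
print(counts)  # roughly even counts for '00' and '11'
```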
Quantum Code Generation Benchmarking Reveals Limitations
This work introduces QuanBench, a new benchmark designed to rigorously evaluate the ability of large language models to generate code for quantum algorithms. The benchmark comprises 44 programming tasks based on the widely used Qiskit framework, covering areas such as Grover’s search, the quantum Fourier transform, and state preparation. Each task is assessed using both functional correctness and semantic equivalence to canonical solutions, providing a comprehensive measure of performance. Evaluation of several state-of-the-art language models reveals limited capabilities in generating correct quantum code, with overall accuracy below 40% and significant semantic errors observed.
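One plausible way to implement the semantic-equivalence check described here, sketched below as an assumption rather than the authors' exact harness, is to compare the unitary of a generated circuit against the canonical solution up to global phase using Qiskit's Operator class.

```python
# A plausible sketch of the semantic-equivalence check (an assumption, not
# necessarily the authors' exact harness): two circuits are accepted as
# equivalent if they implement the same unitary up to a global phase.
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator

def semantically_equivalent(candidate: QuantumCircuit, reference: QuantumCircuit) -> bool:
    """True if both circuits realize the same unitary up to global phase."""
    return Operator(candidate).equiv(Operator(reference))

# Canonical solution: a literal SWAP gate
reference = QuantumCircuit(2)
reference.swap(0, 1)

# Model output that looks different but is semantically identical:
# SWAP decomposed into three CNOTs
candidate = QuantumCircuit(2)
candidate.cx(0, 1)
candidate.cx(1, 0)
candidate.cx(0, 1)

print(semantically_equivalent(candidate, reference))  # True
```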
Analysis of the results identifies common failure modes, including the use of outdated APIs, structural inconsistencies in generated code, and misinterpretations of algorithmic logic. While models like DeepSeek R1 and Gemini 2.5 show some emerging strengths, a substantial performance gap remains, particularly for tasks demanding deeper quantum reasoning. The authors acknowledge that current models struggle with complex quantum tasks and suggest several avenues for future research, including expanding the benchmark to encompass additional quantum frameworks and exploring techniques like prompt engineering and fine-tuning. The QuanBench benchmark itself is publicly available to facilitate further investigation and progress in this field.
👉 More information
🗞 QuanBench: Benchmarking Quantum Code Generation with Large Language Models
🧠 ArXiv: https://arxiv.org/abs/2510.16779
