Scientists are tackling a major bottleneck in hybrid quantum-classical computation: the classical processing required by Sample-based Quantum Diagonalization (SQD) algorithms. Jun Doi, Tomonori Shirakawa, and Yukio Kawashima, alongside colleagues at IBM Quantum and RIKEN's Center for Computational Science, have developed a significantly faster implementation of Selected Basis Diagonalization (SBD), a crucial step within SQD, by running it natively on modern GPUs. Their approach uses the Thrust library to restructure key computational elements for efficient parallel processing, achieving speedups of up to 40× over CPU execution. This substantially reduces the runtime of SQD iterations and makes larger, more demanding quantum simulations practical.
GPU Acceleration of Selected Basis Diagonalization Improves Performance
The study presents a GPU-native SBD implementation that efficiently exploits modern GPU architectures, offering a simple, portable, and high-performance foundation for accelerating SQD-based workflows. Experiments demonstrate that this approach substantially reduces the total runtime of SQD iterations, particularly for large-scale calculations involving configuration spaces of 10⁸ to 10¹⁰ determinants. By building on the Thrust library instead of directive-based approaches, the researchers gained finer control over memory access, data-parallel execution, and persistent device-resident computation. This optimization sustains high performance for both structured and sparse Hamiltonians, broadening the applicability of the method.
This research establishes a unified framework supporting both half-bitstring and full-bitstring representations, enhancing the flexibility of the classical diagonalization backend for SQD. The team restructured excitation evaluation and flattened configuration data structures to make this possible, paving the way for support of arbitrary spin symmetries and more flexible operator forms in future iterations. As GPU-based HPC systems become increasingly prevalent, the work opens possibilities for scaling hybrid quantum-classical algorithms to chemically meaningful system sizes. The implementation is matrix-free: it avoids forming the Hamiltonian matrix explicitly in order to conserve memory, which is critical for the immense configuration spaces encountered in SQD. The Hamiltonian application is optimized by visiting only the sparse excitation neighborhood of each configuration, using the Slater-Condon rules to reduce computational cost.
The research team focused on optimising the classical diagonalization step, which currently dominates the runtime of SQD calculations, particularly for systems requiring high chemical accuracy with configuration spaces of 10⁸ to 10¹⁰ determinants. The Thrust library was used to construct a fully GPU-native SBD backend, giving precise control over memory access and data-parallel execution on modern GPU architectures. The study restructured the key SBD components (configuration processing, excitation generation, and matrix-vector operations) around fine-grained, data-parallel primitives and flattened, GPU-friendly data layouts.
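To make the role of the Slater-Condon rules concrete, here is a minimal Python sketch (not the authors' Thrust code). It assumes each determinant is encoded as an integer whose set bits mark occupied spin orbitals, and that both determinants hold the same number of electrons. The function names `excitation_degree` and `may_couple` are hypothetical and chosen for illustration only.

```python
def excitation_degree(det_a: int, det_b: int) -> int:
    """Return how many electrons must move to turn det_a into det_b.

    Assumption: both determinants hold the same number of electrons, so the
    orbitals whose occupation differs come in (vacated, filled) pairs.
    """
    differing_orbitals = bin(det_a ^ det_b).count("1")
    return differing_orbitals // 2


def may_couple(det_a: int, det_b: int) -> bool:
    """Slater-Condon rules for a Hamiltonian with at most two-body terms:
    the matrix element can be nonzero only for 0, 1 or 2 excitations."""
    return excitation_degree(det_a, det_b) <= 2


if __name__ == "__main__":
    reference = 0b000111  # three electrons in the three lowest orbitals
    for other in (0b000111, 0b001011, 0b011001, 0b111000):
        print(f"{other:06b}",
              excitation_degree(reference, other),
              may_couple(reference, other))
```

Because any pair differing by three or more excitations contributes nothing, each configuration has only a small neighborhood worth visiting, which is what keeps a matrix-free Hamiltonian application affordable.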
This innovative approach facilitates efficient exploitation of GPU parallelism, moving away from directive-based methods and towards a fully GPU-resident computation strategy. Researchers engineered the system to support both half-bitstring and full-bitstring representations within a unified framework, anticipating future flexibility in Hamiltonian evaluation strategies and operator forms. This design choice allows for seamless extension to accommodate diverse systems and symmetries. Experiments demonstrated that the Thrust-based SBD achieves up to a 40× speedup over CPU execution, significantly reducing the total runtime of SQD iterations.
The team used GPU-native parallel algorithms to optimise performance-critical components, including the evaluation of Hamiltonian matrix-vector products, whose cost scales directly with the size of the reduced basis. The method achieves its gains by minimising data transfer between the CPU and GPU and by maximising thread concurrency on the GPU. The result is a simple, portable, and high-performance foundation for accelerating SQD-based quantum-classical workflows, paving the way for scaling hybrid algorithms to chemically meaningful system sizes. The data layouts and excitation iterators were deliberately structured to facilitate future extensions, so that the SBD implementation remains adaptable to evolving SQD requirements and increasingly complex Hamiltonian models. The research focuses on accelerating the computationally intensive classical diagonalization step within SQD, which currently dominates the overall runtime, particularly for accurate ground-state calculations on configuration spaces of 10⁸ to 10¹⁰ determinants. Experiments showed that the new Thrust-based SBD achieves a speedup of up to 40× over CPU execution, substantially reducing the total time required for SQD iterations. The team obtained these gains by restructuring key components of SBD, including configuration processing, excitation generation, and matrix-vector operations, around fine-grained data-parallel primitives and GPU-friendly data layouts.
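The matrix-free matrix-vector product described here can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: determinants are integers stored in one flat list, `toy_element` is a made-up symmetric matrix element (not a real Slater-Condon evaluation) that is nonzero only on the diagonal and for single excitations, and the outer loop stands in for what a GPU version would run as one independent work item per configuration.

```python
def toy_element(det_i: int, det_j: int) -> float:
    """Hypothetical matrix element, for illustration only."""
    differing = bin(det_i ^ det_j).count("1")
    if differing == 0:
        # Diagonal: sum of the indices of the occupied orbitals.
        return float(sum(p for p in range(det_i.bit_length())
                         if (det_i >> p) & 1))
    if differing == 2:
        return -0.1  # single excitation
    return 0.0       # doubles and higher are zero in this toy model


def single_excitations(det: int, n_orb: int):
    """Yield every determinant reached by moving one electron."""
    occupied = [p for p in range(n_orb) if (det >> p) & 1]
    empty = [q for q in range(n_orb) if not (det >> q) & 1]
    for p in occupied:
        for q in empty:
            yield det ^ (1 << p) ^ (1 << q)


def matvec(basis: list, x: list, n_orb: int) -> list:
    """Compute y = H x without ever storing H."""
    index = {det: k for k, det in enumerate(basis)}  # determinant -> row
    y = [0.0] * len(basis)
    for k, det in enumerate(basis):  # independent per k: data-parallel
        acc = toy_element(det, det) * x[k]
        for other in single_excitations(det, n_orb):
            j = index.get(other)     # skip neighbours outside the basis
            if j is not None:
                acc += toy_element(det, other) * x[j]
        y[k] = acc
    return y


if __name__ == "__main__":
    n_orb = 4
    basis = [d for d in range(1 << n_orb) if bin(d).count("1") == 2]
    print(matvec(basis, [1.0] * len(basis), n_orb))
```

Each output entry depends only on its own configuration and a read-only input vector, so the rows can be computed concurrently with no write conflicts; memory use grows with the basis size, not with its square.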
Results demonstrate that by leveraging modern GPU architectures, the implementation efficiently exploits parallelism and minimizes computational bottlenecks. Specifically, the work details a shift from directive-based approaches to a fully GPU-native SBD backend, providing precise control over memory access and data-parallel execution. This allows for persistent computation directly on the GPU, avoiding costly data transfers. Measurements confirm that the GPU-optimized SBD supports both half-bitstring and full-bitstring representations within a unified framework, increasing its versatility for diverse Hamiltonian evaluation strategies.
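The difference between the two representations can be illustrated with a short Python sketch. It rests on an assumption the article does not spell out: that "half-bitstring" means storing spin-up and spin-down occupation strings separately and taking the basis as their Cartesian product, while "full-bitstring" means storing each determinant as one concatenated string. All names below are hypothetical.

```python
def full_from_halves(alpha: int, beta: int, n_orb: int) -> int:
    """Concatenate an alpha and a beta half-bitstring into one full bitstring."""
    return (alpha << n_orb) | beta


def expand_half_basis(alphas: list, betas: list, n_orb: int) -> list:
    """Full-bitstring list equivalent to the product of two half-bitstring lists."""
    return [full_from_halves(a, b, n_orb) for a in alphas for b in betas]


def flat_index(ia: int, ib: int, n_beta: int) -> int:
    """Row-major position of the pair (ia, ib) in the flattened product basis."""
    return ia * n_beta + ib


if __name__ == "__main__":
    n_orb = 2
    alphas = [0b01, 0b10]
    betas = [0b01, 0b10, 0b11]
    full = expand_half_basis(alphas, betas, n_orb)
    k = flat_index(1, 2, len(betas))
    print(k, f"{full[k]:04b}")
```

Under this reading, the half-bitstring form needs only the two short lists plus index arithmetic, whereas the full-bitstring form can describe a basis that is not a complete product, at the cost of storing every determinant explicitly.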
The work delivers an efficient classical backend for next-generation SQD workflows, capable of handling both structured and sparse Hamiltonians. Tests show that the implementation sustains high performance across a broad range of systems, paving the way for scaling hybrid quantum-classical algorithms to chemically meaningful system sizes. The flattened configuration data structures and restructured excitation evaluation are critical to these performance improvements. The research addresses the need for GPU-specialized diagonalization capabilities as GPU-based HPC systems become increasingly prevalent, and it establishes a simple, portable, and high-performance foundation for accelerating SQD-based quantum-classical workflows in quantum chemistry simulations and related fields.
GPU Acceleration Significantly Boosts Sample-based Diagonalization Performance
Scientists have developed a GPU-accelerated implementation of Selected Basis Diagonalization (SBD) using the Thrust library, significantly enhancing the performance of Sample-based Quantum Diagonalization (SQD) workflows. By restructuring core components, including configuration processing and matrix-vector operations, around data-parallel processing and efficient data layouts, the researchers made effective use of modern GPU architectures. Experiments revealed up to a 40-fold speedup compared to CPU execution, substantially reducing the total runtime of SQD iterations. This result demonstrates that GPU-native parallel processing offers a simple, portable, and high-performance solution for accelerating the classical computations in SQD.
The implementation maintains compatibility with both half-bitstring and full-bitstring representations, and a modular kernel organization promotes maintainability and portability across evolving GPU architectures. Evaluations on the Miyabi-G GH200 cluster showed a 35- to 39-fold per-node speedup across 1 to 16 nodes, enabling chemically relevant SQD workloads, such as the Fe4S4 problem, to complete within the memory and runtime constraints of current GPU nodes. The authors acknowledge a limitation: the performance gains are most pronounced when the SQD iteration time is dominated by Davidson steps, rather than by single-kernel latency. Future work could explore extending these techniques to other quantum-classical algorithms and further optimising performance on emerging GPU hardware.
👉 More information
🗞 GPU-Accelerated Selected Basis Diagonalization with Thrust for SQD-based Algorithms
🧠 ArXiv: https://arxiv.org/abs/2601.16637
