Scientists are increasingly focused on optimising quantum circuit simulations, a computationally intensive task crucial for advancing quantum computing. Ruimin Shi, Gabin Schieffer, and Pei-Hung Lin of KTH Royal Institute of Technology and Lawrence Livermore National Laboratory, together with Maya Gokhale, Andreas Herten, Ivy Peng, and colleagues, demonstrate a vector-length agnostic (VLA) design for high-performance simulations on ARM processors. This research is significant because it addresses the challenge of portability across diverse vector architectures, achieving speedups of up to 4.5x on the Fujitsu A64FX processor and providing valuable insights into optimising future VLA designs for quantum computation. Their implementation within Google’s Qsim and evaluation on NVIDIA Grace and AWS Graviton3 processors showcase the potential of flexible vectorisation for accelerating state-vector simulations.
This work addresses a critical limitation in high-performance computing: the lack of portability across different vector architectures.
Traditional vectorisation techniques require code modifications when moving to hardware with different vector lengths, hindering application maintenance and scalability. The newly proposed VLA design uses a fixed instruction set to support flexible vector lengths, ranging from 128 to 2048 bits on ARM processors and potentially up to 16,384 bits on RISC-V systems.
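The general VLA pattern can be sketched with a few SVE intrinsics (an illustrative example, not code from the paper): the kernel queries the lane count at run time and masks the loop tail with a predicate, so one binary runs unchanged whether the hardware implements 128-bit or 2048-bit vectors.

```cpp
#include <arm_sve.h>
#include <cstdint>

// Scale an FP32 array without hard-coding the number of vector lanes: the
// lane count comes from svcntw() at run time and the final partial vector
// is handled by the predicate, so the same source serves any SVE width.
void scale_fp32(float *data, uint64_t n, float factor) {
    const uint64_t vl = svcntw();                 // 32-bit lanes per vector on this CPU
    const svfloat32_t f = svdup_n_f32(factor);
    for (uint64_t i = 0; i < n; i += vl) {
        svbool_t pg = svwhilelt_b32_u64(i, n);    // active lanes: i .. min(i+vl, n)-1
        svfloat32_t v = svld1_f32(pg, data + i);  // predicated load, no tail overrun
        v = svmul_f32_x(pg, v, f);
        svst1_f32(pg, data + i, v);               // inactive lanes are not written
    }
}
```

Compiled once with SVE enabled (for example, -march=armv8-a+sve), the same object code adapts to whatever vector width the processor provides.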
This eliminates the need for code rewriting when targeting different hardware platforms. The core of this breakthrough lies in a single-source implementation of quantum simulations within Google’s Qsim, coupled with targeted optimisation techniques. These include VLEN-adaptive memory layout adjustment, load buffering, fine-grained loop control, and gate fusion-based arithmetic intensity adaptation.
By carefully managing memory access and computational intensity, the researchers overcame inherent challenges associated with VLA, such as the trade-off between scalar and predicated vectorisation. These results highlight the potential of VLA to unlock performance benefits in demanding computational workloads. Furthermore, the team defined new metrics and performance monitoring unit (PMU) events to quantify vectorisation activities, providing valuable insights for future VLA designs and compiler optimisations. This work not only advances the field of quantum simulation but also establishes a pathway towards more portable and efficient high-performance computing applications.
Quantum Circuit Simulation and Vectorisation Efficiency on ARM Processors
Google’s Qsim, a Schrödinger-style full state-vector simulator highly tuned for single-precision (FP32) arithmetic, served as the foundation for this work. A key methodological innovation was the development of new metrics and performance monitoring unit (PMU) events to quantify vectorisation activities.
These metrics enabled detailed analysis of how effectively the VLA design utilised the vector processing capabilities of each processor. The implementation incorporated VLEN-adaptive layout adjustment, a technique to optimise data arrangement based on the vector length supported by the target hardware.
Load buffering was also employed to prefetch data and reduce memory access latency, while fine-grained loop control was implemented to maximise parallelism. Further optimisation involved gate fusion-based arithmetic intensity adaptation, which combined multiple quantum gate operations into single, more computationally intensive operations.
This approach reduced loop overhead and improved data locality. Scalability was demonstrated with simulations extending to 288 threads on a Jupiter supercomputer node, highlighting the potential for further performance gains through increased parallelism.
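The gate-fusion idea can be illustrated with a minimal sketch (a deliberate simplification; Qsim’s actual fuser is more general and combines gates into larger multi-qubit matrices): two consecutive single-qubit gates acting on the same qubit collapse into one 2x2 unitary, so the state vector is swept once instead of twice.

```cpp
// Minimal sketch of gate fusion for two consecutive single-qubit gates on
// the same qubit (illustrative only, not Qsim's fuser). Applying the fused
// matrix B*A sweeps the state vector once instead of twice, raising the
// arithmetic intensity of each pass over memory.
#include <array>
#include <complex>

using c32  = std::complex<float>;     // FP32 amplitudes, as in Qsim
using Gate = std::array<c32, 4>;      // row-major 2x2 unitary {m00, m01, m10, m11}

// Fuse gate A (applied first) with gate B (applied second): fused = B * A.
Gate fuse(const Gate& B, const Gate& A) {
    return {B[0] * A[0] + B[1] * A[2],  B[0] * A[1] + B[1] * A[3],
            B[2] * A[0] + B[3] * A[2],  B[2] * A[1] + B[3] * A[3]};
}
```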
Performance enhancement via scalable vector extensions and optimised memory access patterns
A 4.5x speedup in quantum state-vector simulations was achieved on the A64FX processor utilising a new vector-length agnostic design. This work presents a single-source implementation that also delivered a 2.5x speedup on the NVIDIA Grace processor and a 1.5x speedup on the AWS Graviton3 processor. Performance gains were evaluated across five quantum circuits of up to 36 qubits, demonstrating substantial acceleration of this critical computational workload.
The research identified inefficiencies in current compiler support for vector-length agnostic auto-vectorisation within quantum simulations. Analysis of the generated assembly revealed that strided vector memory loads, which result in interleaved memory access, are poorly supported by the hardware and limited the performance gains from auto-vectorisation.
Consequently, SVE intrinsics were employed to directly exploit the vector units and overcome these limitations. A generic vector-length agnostic design was proposed for quantum state-vector simulations, incorporating VLEN-adaptive memory access, buffering, and fine-grained loop control. Gate fusion was implemented as a system-specific optimisation, adapting arithmetic intensity to balance performance on each target platform according to the roofline model.
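As one illustration of this intrinsics-based approach (a sketch under an assumed interleaved real/imaginary amplitude layout, not the paper’s exact kernel), SVE structure loads can split the interleaved stream into separate real and imaginary vectors in a single predicated load, avoiding the strided access pattern noted above.

```cpp
// Sketch: multiply n complex FP32 amplitudes (stored re0, im0, re1, im1, ...)
// by a complex scalar (ar + i*ai) using SVE structure loads/stores, so the
// real and imaginary lanes are de-interleaved by svld2 instead of being
// gathered with strided loads. Illustrative only.
#include <arm_sve.h>
#include <cstdint>

void scale_complex(float *amps, uint64_t n, float ar, float ai) {
    const uint64_t vl = svcntw();                  // complex amplitudes per iteration
    const svfloat32_t var = svdup_n_f32(ar);
    const svfloat32_t vai = svdup_n_f32(ai);
    for (uint64_t i = 0; i < n; i += vl) {
        svbool_t pg = svwhilelt_b32_u64(i, n);     // one predicate lane per (re, im) pair
        svfloat32x2_t z = svld2_f32(pg, amps + 2 * i);   // de-interleave re/im
        svfloat32_t re = svget2_f32(z, 0);
        svfloat32_t im = svget2_f32(z, 1);
        // (re + i*im) * (ar + i*ai)
        svfloat32_t new_re = svmls_f32_x(pg, svmul_f32_x(pg, re, var), im, vai);
        svfloat32_t new_im = svmla_f32_x(pg, svmul_f32_x(pg, re, vai), im, var);
        svst2_f32(pg, amps + 2 * i, svcreate2_f32(new_re, new_im));
    }
}
```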
Scalability was demonstrated with simulations reaching up to 288 threads on a Jupiter supercomputer node. Lightweight profiling and PMU event measurement were leveraged to quantify vectorisation activities. Average active vector length, instruction reduction ratio, and memory backend stalls were measured to provide insights for future vector-length agnostic designs. These metrics facilitated a deeper understanding of the performance characteristics and bottlenecks within the simulations.
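As a rough illustration of how such derived metrics might be computed (the counter names below are hypothetical placeholders, not the paper’s PMU event definitions):

```cpp
// Hypothetical post-processing of raw counter values into the two derived
// metrics mentioned above; the inputs are placeholders, not real PMU events.
#include <cstdint>

// Average number of active lanes per retired SVE instruction.
double avg_active_vector_length(uint64_t active_predicate_elements,
                                uint64_t sve_instructions_retired) {
    if (sve_instructions_retired == 0) return 0.0;
    return static_cast<double>(active_predicate_elements) /
           static_cast<double>(sve_instructions_retired);
}

// Factor by which the vectorised build reduces retired instructions
// relative to a scalar baseline (e.g. 4.0 means a 4x reduction).
double instruction_reduction_ratio(uint64_t scalar_baseline_instructions,
                                   uint64_t vectorised_instructions) {
    if (vectorised_instructions == 0) return 0.0;
    return static_cast<double>(scalar_baseline_instructions) /
           static_cast<double>(vectorised_instructions);
}
```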
Quantum simulation performance scaling across diverse ARM vector architectures
Vector-length agnostic architectures, such as ARM SVE and the RISC-V vector extension, are gaining prominence in processor design. This work investigates the potential for high-performance portability within these architectures by applying a vector-length agnostic design to quantum state-vector simulations, a computationally demanding task.
A new implementation was developed within Google’s Qsim simulator, incorporating optimisation techniques including VLEN-adaptive memory access, buffering, and fine-grained loop control to maximise performance across different vector lengths. New metrics and performance monitoring unit events were defined to better understand vectorisation activity and inform future VLA designs. The authors acknowledge that current compiler support for auto-vectorisation remains a limitation, influencing the overall performance achieved.
Future research should focus on enhancing compiler capabilities to fully exploit the potential of VLA architectures. Further investigation into the scalability of these techniques to larger quantum circuits and exploration of their application to other computational workloads are also warranted. These results establish a clear path toward achieving high-performance portability in emerging vector architectures for quantum simulation and potentially other demanding applications.
👉 More information
🗞 High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors
🧠 ArXiv: https://arxiv.org/abs/2602.09604
