Multi-GPU Quantum Circuit Simulation, Enabled by MPI, Benchmarks HPC System Performance

Classical simulation of quantum circuits demands substantial computational resources, yet it remains essential for algorithm development and hardware validation. W. Michael Brown from NVIDIA, Anurag Ramesh from Purdue University, and Thomas Lubinski from Quantum Circuits Inc., alongside their colleagues, address a critical bottleneck in multi-GPU simulations: the speed of communication between GPUs. Their work introduces a standardized benchmarking approach using MPI within the QED-C Application-Oriented Benchmarks, allowing for rigorous evaluation of interconnect performance. The team demonstrates that recent advances in interconnect technology, exemplified by the Grace Blackwell NVL72 architecture, have yielded a more than sixteen-fold improvement in time to solution, exceeding the gains achieved through improvements in GPU architecture alone and paving the way for more complex and realistic quantum algorithm simulations.

Statevector Simulation Speed and Scalability

Researchers continue to improve the performance of classical quantum computer simulation, particularly statevector simulation, which remains essential for algorithm development and testing given current hardware limitations. Performance is assessed by simulation speed, the number of qubits that can be handled, and resource utilization, with growing attention to energy efficiency. Key projects include HamLib, Qandle, SV-Sim, and Grafeyn, which focus on techniques like gate-matrix caching, circuit splitting, and parallel sparse simulation to accelerate simulations. Standardized benchmarks allow fair comparison of different approaches and hardware platforms.

Simulation predominantly runs on CPU and GPU-based systems and increasingly on multi-node clusters. NVIDIA and AMD are major players, with successive GPU architectures and systems such as NERSC’s Perlmutter providing powerful resources. High-speed interconnects such as NVLink, together with communication layers such as UCX and SHMEM, are critical for scaling simulations. Statevector simulation represents the quantum state of n qubits as a vector of 2^n complex amplitudes, so its memory and compute cost scale exponentially with qubit count. Optimizations include sparse matrix representations that reduce memory footprint, circuit splitting, gate-matrix caching, lazy evaluation, parallelization, and careful attention to memory access patterns.

Algorithms like Trotter and Suzuki decompositions and quantum phase estimation are frequently employed, and programming models like CUDA and PGAS facilitate GPU and distributed computing. The field is moving towards exascale computing and already leverages supercomputers like Perlmutter for quantum simulation. Reducing energy consumption is a growing priority, and distributed simulation is essential for scaling to larger qubit numbers. Hybrid approaches, combining statevector simulation with techniques like tensor networks, offer potential performance gains. Optimizing qubit ordering can also improve simulation efficiency. The overarching goal is to push the boundaries of quantum computer simulation, addressing challenges related to scalability, efficiency, and energy consumption.

MPI Benchmarking for Scalable Quantum Simulations

Scientists have developed a methodology for benchmarking quantum algorithms by integrating the Message Passing Interface (MPI) into the QED-C Application-Oriented Benchmarks. This allows for scalable simulations on high-performance computing systems, addressing the challenge of simulating increasingly complex quantum systems with limited resources. Recognizing that multi-GPU simulations are often limited by communication bottlenecks, the team focused on optimized parallelization to expand simulation scale beyond the memory capacity of a single GPU. This implementation-agnostic framework supports performance assessment across various quantum programming frameworks, including Qiskit, PennyLane, Cirq, and CUDA-Q, making it broadly applicable to both near-term simulation and the development of fault-tolerant quantum systems.
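
The communication bottleneck can be illustrated with a common partitioning scheme for distributed statevector simulators. The article does not describe the team's actual data layout, so the sketch below is an assumption: the statevector is split evenly across a power-of-two number of ranks, the highest index bits select the rank, and a gate on one of those "global" qubits forces a pairwise exchange between ranks. All function names are hypothetical.

```python
def _global_bits(n_ranks):
    """Number of index bits that select the rank (n_ranks must be 2**k)."""
    if n_ranks < 1 or n_ranks & (n_ranks - 1):
        raise ValueError("n_ranks must be a power of two")
    return n_ranks.bit_length() - 1

def owner_and_local_index(global_index, n_qubits, n_ranks):
    """Map an amplitude index to (rank, index within that rank's slice)."""
    local_bits = n_qubits - _global_bits(n_ranks)
    rank = global_index >> local_bits
    local_index = global_index & ((1 << local_bits) - 1)
    return rank, local_index

def exchange_partner(rank, target, n_qubits, n_ranks):
    """Return the rank to exchange with for a gate on `target`.

    Returns None when the target qubit is local, meaning both amplitudes
    of every pair live on the same rank and no communication is needed.
    """
    local_bits = n_qubits - _global_bits(n_ranks)
    if target < local_bits:
        return None
    # Flipping the rank bit that corresponds to the target qubit gives
    # the rank holding the other half of each amplitude pair.
    return rank ^ (1 << (target - local_bits))
```

In an MPI program, each pair of partner ranks would then swap data with a paired send and receive (for example MPI_Sendrecv) before applying the gate locally. With 33 qubits on 8 ranks under this layout, qubits 0 to 29 are local and qubits 30 to 32 require inter-GPU traffic.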

Experiments used statevector simulation, a versatile method whose memory and computational requirements scale exponentially with qubit count. The team demonstrated that while improvements in GPU architecture have yielded speedups, advances in interconnect performance have had a significantly larger impact, delivering substantial improvements in time to solution for multi-GPU simulations. This highlights the critical role of interconnect technology in overcoming communication bottlenecks and enabling the simulation of increasingly complex quantum systems.
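
The exponential memory scaling is easy to quantify. The sketch below assumes 16 bytes per amplitude (double-precision complex) by default and ignores any workspace or communication buffers a real simulator needs; the 80 GiB per-GPU capacity used in the usage note is a hypothetical figure, not one taken from the study.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Bytes needed to store 2**n_qubits complex amplitudes."""
    return (1 << n_qubits) * bytes_per_amplitude

def min_gpus(n_qubits, gpu_memory_gib, bytes_per_amplitude=16):
    """Smallest power-of-two GPU count whose combined memory holds the state.

    Counts only the statevector itself; real simulators need extra
    headroom, so treat the result as a lower bound.
    """
    needed = statevector_bytes(n_qubits, bytes_per_amplitude)
    capacity = gpu_memory_gib * 2**30
    gpus = 1
    while gpus * capacity < needed:
        gpus *= 2
    return gpus
```

For example, `statevector_bytes(33)` is 2^37 bytes, or 128 GiB, so under these assumptions a 33-qubit double-precision state cannot fit on a single hypothetical 80 GiB GPU, and `min_gpus(33, 80)` returns 2. Each additional qubit doubles the requirement.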

Interconnect Performance Drives Quantum Algorithm Speedups

Researchers have achieved substantial performance gains in simulating quantum algorithms by focusing on the interconnect between processing units. The work demonstrates that improvements in interconnect performance have yielded a more than sixteen-fold improvement in time to solution for multi-GPU simulations, exceeding the speedups achieved through advancements in GPU architecture alone. The team measured the peak bidirectional bandwidth of the various interconnects that are crucial for multi-GPU communication.

Results show that NVLink 5 delivers 1800 GB/s, significantly surpassing older technologies like PCIe 4.0 and even NVLink 3. The study details performance across several interconnects, including MI350X Infinity Fabric and ConnectX 7, providing a comprehensive analysis of data transfer capabilities. To validate these interconnect improvements, researchers employed the quantum phase estimation (QPE) benchmark and a 33-qubit transverse-field Ising model. The QED-C implementation of QPE allows for precise fidelity comparisons, while the Ising model provides a more complex strong-scaling test. These benchmarks were used to record key performance metrics, including average execution times, circuit depths, and fidelity between measured and ideal phase distributions, demonstrating the practical benefits of enhanced interconnectivity for quantum computing workloads.
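
The article does not state which fidelity measure is used to compare measured and ideal phase distributions. One common choice for comparing two measurement distributions is the Hellinger fidelity, sketched below as an assumption rather than as the benchmark's confirmed definition. The function names are hypothetical.

```python
import math

def normalize(counts):
    """Turn a dict of bitstring counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}

def hellinger_fidelity(p, q):
    """Squared Bhattacharyya coefficient of two distributions.

    Equals 1.0 for identical distributions and 0.0 when they share
    no outcomes.
    """
    keys = set(p) | set(q)
    overlap = sum(math.sqrt(p.get(k, 0.0) * q.get(k, 0.0)) for k in keys)
    return overlap ** 2
```

A typical use would be `hellinger_fidelity(normalize(measured_counts), ideal_distribution)`, where `measured_counts` maps bitstrings to shot counts and `ideal_distribution` maps bitstrings to exact probabilities.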

Faster Quantum Simulation With Improved Hardware

Researchers have demonstrated substantial performance gains in simulating quantum circuits through improvements in both hardware and communication protocols. Over the past four years, single-GPU quantum circuit simulation has improved more than fourfold with successive generations of NVIDIA GPUs. Crucially, the integration of Message Passing Interface (MPI) into benchmarking tools has enabled simulations with significantly higher qubit counts and reduced solution times on high-performance computing systems. The team’s results reveal that advances in interconnect technology have had a particularly large impact, delivering over sixteen times faster performance in multi-GPU simulations when comparing current systems to those of three years ago.

Enabling Remote Direct Memory Access (RDMA) configurations is now considered best practice, especially where network latency is a concern. Furthermore, the use of Multi-Node NVLink (MNNVL) represents a significant advancement for state-vector simulation, and users should enable this setting for optimal performance. While gate fusion techniques effectively utilize GPU resources, a gap remains between peak main memory bandwidth and inter-GPU interconnect speeds. Researchers anticipate that future optimizations for concurrent communications will be challenging due to increased memory contention, but systems with fast, coherent interconnects, such as Genesis, offer potential for sophisticated algorithms.
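
The gap between memory bandwidth and interconnect speed can be explored with a rough back-of-envelope model. The sketch below is heavily idealized: it assumes a local gate reads and writes the whole local slice once at peak memory bandwidth, that a gate on a distributed qubit sends and receives half the slice concurrently, and that each direction gets half of the quoted bidirectional link bandwidth. It ignores latency, overlap, and gate fusion. The function names are hypothetical, and the bandwidth figures are inputs supplied by the reader, not results from the study.

```python
def local_gate_seconds(local_bytes, memory_gb_per_s):
    """Idealized time to read and write the local slice once."""
    return 2 * local_bytes / (memory_gb_per_s * 1e9)

def global_gate_seconds(local_bytes, link_bidir_gb_per_s):
    """Idealized time to swap half the local slice with a partner GPU.

    Half the slice is sent while half is received, with each direction
    assumed to get half of the bidirectional bandwidth.
    """
    bytes_each_way = local_bytes / 2
    per_direction_bytes_per_s = (link_bidir_gb_per_s / 2) * 1e9
    return bytes_each_way / per_direction_bytes_per_s

def communication_to_compute_ratio(memory_gb_per_s, link_bidir_gb_per_s):
    """How many local sweeps fit in the time of one idealized exchange."""
    one_byte = 1.0  # the slice size cancels out, so any size works
    return (global_gate_seconds(one_byte, link_bidir_gb_per_s)
            / local_gate_seconds(one_byte, memory_gb_per_s))
```

Plugging in the 1800 GB/s NVLink 5 figure quoted above together with a memory bandwidth of the reader's choosing gives a feel for how many local gate sweeps one exchange costs, which is the kind of imbalance that gate fusion and faster interconnects aim to reduce.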

👉 More information
🗞 Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance
🧠 ArXiv: https://arxiv.org/abs/2511.14664

Rohail T.