High-order stencil computations underpin many critical scientific simulations, yet achieving optimal performance on modern multicore processors remains a significant challenge. Yinuo Wang, Tianqi Mao, and Lin Gan, alongside their colleagues, address this issue by exploring the potential of matrix units within CPUs to accelerate these complex calculations. Their work introduces MMStencil, a novel approach that combines algorithmic and memory optimizations with a new multi-threading paradigm, designed to overcome the limitations of traditional data access and sharing. The team demonstrates that MMStencil consistently achieves high hardware utilization and, crucially, outperforms state-of-the-art GPU libraries on demanding tasks, including reverse-time migration applications where it yields a substantial 1.8x speedup compared to a highly optimized GPU implementation. This advance promises to unlock significant performance gains for a wide range of high-performance computing applications.

High-Order Stencil Computations Challenge Acceleration Methods

Stencil computations are fundamental to many areas of high-performance computing, underpinning simulations in fields like weather forecasting, fluid dynamics, and crucially, earth modeling for seismic analysis. These computations involve updating values on a grid based on the values of neighboring cells, and are often limited by how efficiently data can be moved and processed. Three-dimensional, high-order stencils, which demand calculations across many neighboring cells for increased accuracy, present a particularly significant challenge, requiring substantial computational power and memory bandwidth. Existing approaches to accelerate these computations, particularly those leveraging specialized matrix acceleration units, have largely focused on two-dimensional problems or have struggled to deliver substantial gains in three dimensions.
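To make the idea of updating a grid point from its neighbors concrete, the following is a minimal sketch of a three-dimensional star stencil in plain Python. The function names `star_stencil_3d` and `points_per_update`, the coefficient values, and the choice to copy boundary cells unchanged are all assumptions made for illustration; they are not taken from MMStencil.

```python
def star_stencil_3d(grid, coeffs):
    """Apply a 3D star stencil of radius r = len(coeffs) - 1 to interior points.

    coeffs[0] weights the centre point; coeffs[k] weights the six points at
    distance k along the x, y and z axes. Cells closer than r to an edge are
    copied unchanged (an illustrative boundary choice).
    """
    r = len(coeffs) - 1
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(r, nz - r):
        for y in range(r, ny - r):
            for x in range(r, nx - r):
                acc = coeffs[0] * grid[z][y][x]
                for k in range(1, r + 1):
                    acc += coeffs[k] * (
                        grid[z][y][x - k] + grid[z][y][x + k]
                        + grid[z][y - k][x] + grid[z][y + k][x]
                        + grid[z - k][y][x] + grid[z + k][y][x]
                    )
                out[z][y][x] = acc
    return out


def points_per_update(radius):
    """Number of grid points read for one update of a 3D star stencil."""
    return 6 * radius + 1


# Hypothetical example: a 5x5x5 grid of ones and a radius-2 stencil.
grid = [[[1.0] * 5 for _ in range(5)] for _ in range(5)]
result = star_stencil_3d(grid, [1.0, 0.5, 0.25])
```

Each update of a radius-4 star stencil reads 25 grid points under this definition, which illustrates why higher orders put more pressure on both arithmetic units and the memory system.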

Prior work has explored transforming stencil calculations into matrix operations, but these methods often fall short in practice, particularly for complex, three-dimensional scenarios. Reproducibility studies have shown that these approaches don’t consistently outperform traditional methods, and often fail to demonstrate significant speedups in real-world applications. Achieving true application-level speedups requires addressing integration challenges within larger HPC workflows. To address these limitations, researchers have turned to new-generation multicore processors featuring advanced matrix units and integrated, high-bandwidth memory.
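The general idea of recasting a stencil as a matrix operation can be sketched as follows: a one-dimensional stencil applied along one axis of a tile is equivalent to multiplying the tile by a banded coefficient matrix. This is a generic textbook-style illustration in plain Python, not MMStencil's actual formulation; the helper names `band_matrix`, `matmul` and `direct_stencil` and the sample data are hypothetical.

```python
def band_matrix(coeffs, n):
    """Build an n x (n + 2r) banded matrix for a symmetric 1D stencil.

    coeffs[0] is the centre weight and coeffs[k] the weight at distance k.
    """
    r = len(coeffs) - 1
    rows = []
    for i in range(n):
        row = [0.0] * (n + 2 * r)
        for k in range(-r, r + 1):
            row[i + r + k] = coeffs[abs(k)]
        rows.append(row)
    return rows


def matmul(a, b):
    """Plain dense matrix product, standing in for a hardware matrix unit."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def direct_stencil(tile, coeffs, n):
    """Reference: apply the same 1D stencil down the rows of the tile."""
    r = len(coeffs) - 1
    m = len(tile[0])
    return [
        [sum(coeffs[abs(k)] * tile[i + r + k][j] for k in range(-r, r + 1))
         for j in range(m)]
        for i in range(n)
    ]


# Hypothetical data: second-difference stencil on a 6x3 tile (4 outputs, halo 1).
coeffs = [-2.0, 1.0]
tile = [[float(i * i + j) for j in range(3)] for i in range(6)]
via_matrix = matmul(band_matrix(coeffs, 4), tile)
via_loops = direct_stencil(tile, coeffs, 4)
```

The banded matrix is mostly zeros, so a naive mapping spends much of the matrix unit's work on padding; this is one plausible reason such transformations can fall short in practice.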

These processors offer both the computational power and memory bandwidth necessary for efficient three-dimensional, high-order stencil computations, with a multi-level cache hierarchy and a low-latency matrix computation mechanism. The team presents MMStencil, a novel matrix-based stencil solution designed to optimize three-dimensional, high-order stencils and real-world applications by harnessing the capabilities of this new processor architecture. By carefully tailoring the computation to the processor’s strengths, MMStencil achieves significant performance improvements, outperforming state-of-the-art libraries on leading graphics processing units by up to 2.1 times, and enabling a 1.8x speedup in reverse time migration, a crucial technique in seismic analysis.

CPUs Rival GPUs for Stencil Calculations

Researchers have developed a new method for accelerating complex scientific simulations, particularly those involving three-dimensional data and high-order stencils, achieving significant performance gains over existing approaches. These stencils are fundamental to many areas of scientific computing, including seismic imaging and materials science. Traditionally, graphics processing units (GPUs) have been the dominant force in accelerating these calculations, but this work demonstrates the potential of modern, multicore central processing units (CPUs) to rival, and even surpass, GPU performance. The team focused on optimizing stencil computations, which often suffer from inefficient use of hardware resources and limitations in memory access.

Their innovation, named MMStencil, leverages the specialized matrix processing capabilities now integrated into advanced CPUs, alongside a suite of algorithmic and architectural optimizations. By carefully tailoring algorithms to the matrix unit, and addressing challenges related to data sharing and memory bandwidth, MMStencil sustains high levels of hardware utilization across a variety of stencil shapes and dimensions. A key breakthrough lies in MMStencil’s ability to efficiently handle the large datasets and complex calculations inherent in real-world applications. High-order stencils, which improve accuracy by considering a wider range of data points, demand substantial memory and computational power.

The researchers demonstrate that reducing the amount of data needed (high-order stencils achieve the same accuracy with fewer grid points) and optimizing data flow within the processor significantly reduces computational costs. The results show a substantial performance improvement over state-of-the-art GPU implementations, with speedups of up to 2.1x in benchmark tests. More importantly, this translates directly into real-world applications, such as reverse time migration (RTM). In RTM simulations, MMStencil achieved a 1.8x speedup compared to a highly optimized GPU implementation, and a speedup of up to 3.5x when combined with parallel optimizations. This advancement promises to accelerate scientific discovery in fields reliant on complex simulations, offering a compelling alternative to traditional GPU-based approaches.

Optimised Stencil Computations Surpass GPU Performance

MMStencil presents a comprehensive optimisation framework for three-dimensional, high-order stencil computations on modern multicore CPUs equipped with matrix units. The research demonstrates that by carefully tuning microarchitectural elements, transforming memory layouts, optimising multi-threaded scheduling, and implementing NUMA-aware communication, significant performance gains are achievable. Results show MMStencil outperforms state-of-the-art libraries on A100 GPUs by up to 2.1x, and delivers a 1.8x speedup in real-world reverse time migration applications. The study highlights that while compiler-generated code performs well on simpler stencil patterns, more complex three-dimensional kernels require targeted optimisation. A key finding is that the high compute throughput of matrix units shifts the performance bottleneck from computation to memory bandwidth, suggesting future research should focus on memory efficiency and prefetching strategies. Ultimately, the work demonstrates that CPUs, when equipped with matrix accelerators and on-package memory, can rival the performance of GPUs in stencil computations, offering potential for similar gains across a wider range of HPC codes on next-generation platforms.
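The shift from a compute bottleneck to a memory-bandwidth bottleneck can be illustrated with a simple roofline-style estimate. The sketch below is an assumption-laden back-of-the-envelope model: it counts one multiply per stencil point plus the additions needed to sum them, assumes each grid value is read once and written once, and uses made-up peak and bandwidth figures that are not measurements from the study.

```python
def star_stencil_intensity(radius, bytes_per_value=4):
    """Arithmetic intensity (flop/byte) of a 3D star stencil, ideal caching.

    Assumes one multiply per point and points - 1 additions, and that each
    output point costs one read and one write of main-memory traffic.
    """
    points = 6 * radius + 1
    flops = 2 * points - 1
    traffic_bytes = 2 * bytes_per_value
    return flops / traffic_bytes


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: limited by compute peak or by bandwidth x intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def limiting_resource(peak_gflops, bandwidth_gbs, intensity):
    """Report which side of the roofline caps performance."""
    return "compute" if peak_gflops <= bandwidth_gbs * intensity else "memory"


# Hypothetical machine figures, chosen only to show the crossover.
intensity = star_stencil_intensity(radius=4)           # single precision
without_matrix_unit = limiting_resource(2000.0, 1000.0, intensity)
with_matrix_unit = limiting_resource(10000.0, 1000.0, intensity)
```

With these illustrative numbers, raising the compute peak while bandwidth stays fixed moves the same kernel from the compute-limited to the memory-limited side of the model, which is the qualitative effect described above.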

👉 More information
🗞 MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit
🧠 DOI: https://doi.org/10.48550/arXiv.2507.11067


