Matrix Accelerators, also known as Tensor Cores and Matrix Cores, are crucial for high-performance computing and machine learning due to their high performance and low power consumption. However, the lack of public documentation about their attributes poses challenges for programmers, especially when porting codes across GPUs. To address this, a collection of tests based on the IEEE floating-point standard has been developed to identify feature differences that affect computed results. This methodology, which can be applied to future GPUs, aims to help programmers port code across matrix accelerators more reliably.
What are Matrix Accelerators and Why are They Important?
Matrix Accelerators, known as Tensor Cores on NVIDIA GPUs and Matrix Cores on AMD GPUs, are of growing interest in high-performance computing (HPC) and machine learning (ML) because of their high performance at low power consumption. These accelerators are indispensable for achieving today’s performance levels in ML. For instance, large language models such as ChatGPT would not have been possible without Tensor Cores, and ML training would take at least ten times longer without their acceleration.
Matrix Accelerators have also caught the attention of HPC designers, who see their four-times speedup with 80% less energy consumption as a real avenue toward much faster and more energy-efficient codes. However, the literature mainly describes matrix accelerators in terms of their usage rather than their numerical properties. With the growing number of these units in upcoming GPUs, this situation poses a serious impediment to those wanting to use them.
What are the Challenges in Using Matrix Accelerators?
Unfortunately, very little is publicly documented about the attributes of Matrix Accelerators that can change the answers computed by identical code. Examples of such features are the number of extra precision bits carried internally, the order in which additions are accumulated, and whether subnormal numbers are handled predictably during computation. Without knowing how these features differ between two matrix accelerators, it can be impossible to reliably port code across the GPUs that contain them.
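To see why these features matter, consider the two host-side effects below (a minimal illustrative sketch, not taken from the paper): reassociating a sum changes its floating-point value, and a unit that flushes subnormals to zero silently alters tiny operands. An accelerator whose accumulation order or subnormal policy is undocumented can therefore legitimately return either value.

```cuda
#include <cstdio>
#include <cmath>

int main() {
    // Accumulation order: the same three fp32 values summed in two orders.
    float big = 1.0e8f, small = 1.0f;
    float sum_a = (big + small) + (-big);   // small is absorbed into big: 0
    float sum_b = (big + (-big)) + small;   // cancellation happens first:  1
    printf("order A = %g, order B = %g\n", sum_a, sum_b);

    // Subnormal handling: the smallest positive fp32 subnormal, 2^-149.
    // A unit that flushes subnormals to zero would return 0 for tiny * 1.0f,
    // whereas IEEE-conformant handling preserves it.
    float tiny = ldexpf(1.0f, -149);
    printf("tiny * 1.0f = %g\n", tiny * 1.0f);
    return 0;
}
```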
In the realm of GPU-based accelerators, programmers want to run codes developed for NVIDIA GPUs on the AMD GPUs that are now becoming available. However, documentation of the numerical behavior of these GPUs is seriously lacking. The unanswered questions concern not only particular behaviors, such as precision loss for a specific operator, but also basic features such as the rounding modes supported, the details of fused multiply-add (FMA), the number of extra precision bits held internally, and the granularity of their block fused multiply-add.
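As one example of why FMA details matter, the fragment below (a sketch, not drawn from the paper) chooses operands for which a fused multiply-add and an unfused multiply-then-add round differently; compile with contraction disabled (e.g., -ffp-contract=off) so that the unfused version is not silently fused by the compiler.

```cuda
#include <cstdio>
#include <cmath>

int main() {
    float a = 1.0f + 0x1.0p-12f;      // 1 + 2^-12; a*a = 1 + 2^-11 + 2^-24 exactly
    float c = -(1.0f + 0x1.0p-11f);   // -(1 + 2^-11)

    float prod = a * a;               // product rounded to fp32 first: 1 + 2^-11
    float unfused = prod + c;         // = 0
    float fused = fmaf(a, a, c);      // single rounding of the exact result: 2^-24

    printf("unfused = %g, fused = %g\n", unfused, fused);
    return 0;
}
```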
How Can We Address These Challenges?
In response to this challenge, this paper offers a collection of tests that are based on a precise understanding of the IEEE floating-point standard as well as previously discovered formal results about the impact of floating-point features on numerical behavior. By running these tests on a large number of widely used and recent GPUs, we show that our tests can unearth feature differences that affect computed results.
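To give a flavor of how such a feature test can discriminate behaviors (this is a sketch of the general idea, with hypothetical helper names, not the paper's actual harness, and it assumes the host-callable fp16 conversion intrinsics from cuda_fp16.h): compute a short dot product of fp16 inputs on the accelerator, then compare the returned value against host references that each model one candidate behavior; whichever reference the device matches identifies the feature. The same pattern extends to probing rounding modes or subnormal flushing by swapping in references that model those behaviors.

```cuda
#include <cuda_fp16.h>

// Reference 1: each fp16*fp16 product is kept exactly (it fits in fp32)
// and accumulated in fp32.
static float ref_products_kept_in_fp32(const __half *a, const __half *b, int n) {
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += __half2float(a[k]) * __half2float(b[k]);
    return acc;
}

// Reference 2: each product is rounded back to fp16 before being accumulated.
static float ref_products_rounded_to_fp16(const __half *a, const __half *b, int n) {
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += __half2float(__float2half(__half2float(a[k]) * __half2float(b[k])));
    return acc;
}

// Usage: run the same inputs through the accelerator under test and report
// which of the two references its result matches.
```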
We exhibit these differences across five floating-point formats, four standard rounding modes, and four additional feature combinations, including those relating to rounding and the preservation of extra precision bits. This extensive testing demonstrates the versatility of our tests in picking up salient differences that can affect numerical behavior across this space.
What are the Implications of These Findings?
As further proof of the discriminative power of our approach, we design a simple matrix-multiplication test whose matrix entries are chosen using insights gathered from our feature tests. Executing this very simple test on five platforms produced three different answers. No prior work shows that a test this simple can behave so differently across platforms, which underscores the need to understand Matrix Accelerator features carefully before porting code across them.
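Below is a minimal sketch of the kind of single-tile probe that can expose such differences, written against NVIDIA's WMMA API for fp16 inputs with an fp32 accumulator; the matrix entries (1 and 2^-11) are illustrative choices, not the paper's actual test values. The exact dot product 1 + 2^-11 is not representable in fp16 near 1, so the reported D[0][0] reveals whether the unit keeps products with extra precision (1.000488...) or rounds them back to fp16 before accumulating (1.0). A corresponding version written with AMD's rocWMMA could probe Matrix Cores the same way.

```cuda
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes D = A*B + 0 for a single 16x16x16 tile (requires sm_70+).
__global__ void probe(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}

int main() {
    half ha[256], hb[256];
    float hd[256];
    for (int i = 0; i < 256; ++i) {
        ha[i] = __float2half(0.0f);
        hb[i] = __float2half(0.0f);
    }

    // Row 0 of A = [1, 2^-11, 0, ...]; column 0 of B = [1, 1, 0, ...].
    ha[0] = __float2half(1.0f);
    ha[1] = __float2half(0.00048828125f);   // 2^-11, exactly representable in fp16
    hb[0] = __float2half(1.0f);
    hb[1] = __float2half(1.0f);

    half *da, *db; float *dd;
    cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dd, sizeof(hd));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);

    probe<<<1, 32>>>(da, db, dd);           // one full warp drives the tile
    cudaMemcpy(hd, dd, sizeof(hd), cudaMemcpyDeviceToHost);

    // 1.0004882812... => products kept with extra precision before accumulation
    // 1.0             => products rounded to fp16 before accumulation
    printf("D[0][0] = %.10f\n", hd[0]);

    cudaFree(da); cudaFree(db); cudaFree(dd);
    return 0;
}
```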
This work is designed to help programmers port code across matrix accelerators more reliably, based on the commonality of features that our tests help confirm. A set of straightforward tests that can quickly pick up salient feature differences between GPUs is highly desirable, but such tests did not previously exist. Our primary contribution in this paper is a rigorous methodology that has enabled us to create such discriminating tests for NVIDIA and AMD GPUs; the methodology is generalizable and applicable to future GPUs.
Conclusion
We are in an era of rising computing hardware heterogeneity where many new CPU and GPU components are introduced in rapid succession and are fueling performance advances in HPC and ML from drug discovery to climate simulations and beyond. While no scientist aims to achieve higher performance at the expense of correctness, ensuring correctness has become a serious challenge given the sheer number of hardware units and the rapidity of their adoption. This paper provides a rigorous methodology for creating discriminatory tests for NVIDIA and AMD GPUs, which is generalizable and applicable to future GPUs.
Publication details: “FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators”
Publication Date: 2024-02-29
Authors: Xinyi Li, Ang Li, Bo Fang, Katarzyna Świrydowicz, et al.
Source: arXiv (Cornell University)
DOI: https://doi.org/10.48550/arxiv.2403.00232
