The lack of a unified standard for tensor operations presents a significant challenge to efficient and portable scientific computing. Jan Brandejs of the Université de Toulouse, Niklas Hörnblad from Umeå University, and Edward F. Valeev of Virginia Tech, alongside colleagues, address this issue with the introduction of Tensor Algebra Processing Primitives (TAPP). This C-based interface decouples application code from underlying hardware, promising improved performance portability and reduced dependency conflicts. The researchers provide a rigorous mathematical formulation of tensor contractions alongside a reference implementation, ensuring accuracy and enabling validation of optimised kernels. Demonstrations integrating TAPP with libraries such as TBLIS, cuTENSOR, and DIRAC highlight the potential of this community-driven standard to streamline tensor-based computations.

Algebra Background

The increasing prevalence of tensor algebra in diverse scientific domains necessitates a standardised approach to tensor operations. The central objective is to define a minimal set of primitive operations that can be combined to express a wide range of tensor algebra routines, promoting code reuse and optimisation. The research adopts a bottom-up approach, beginning with an analysis of common tensor operations found in applications such as quantum chemistry, machine learning, and high-performance computing.

This analysis identified a core set of 18 primitives, encompassing operations like tensor contraction, element-wise addition, and reshaping. These primitives were then formalised with precise semantics and interfaces, ensuring unambiguous implementation and interoperability. Performance evaluations were conducted on representative hardware, including CPUs and GPUs, to demonstrate the efficiency of TAPP-based implementations. Specific contributions include a formal specification of the 18 TAPP primitives, complete with detailed descriptions of their inputs, outputs, and computational behaviour, alongside a reference implementation utilising both CPU and GPU backends.

Tensor Contraction Efficiency for Scientific Modelling

Tensor operations are fundamental to rapidly developing fields such as artificial intelligence and quantum science modelling, with tensor contraction being the most computationally intensive operation. The efficiency of tensor contraction directly impacts progress in areas including material science, quantum chemistry, drug discovery and life sciences, as it dictates the feasible size of models used in these disciplines. Improvements in contraction performance have been crucial for advancements in deep learning, quantum computing simulations, tensor network methods and various quantum chemistry techniques. Despite its importance, tensor contraction software lacks the maturity and organisation seen in matrix operation software, exhibiting fragmentation due to a diverse and growing developer base.

A recent survey revealed a proliferation of scattered libraries and significant code duplication, largely attributable to the absence of standardised tensor operation primitives. Standardisation would enable modular code development and library reuse, and would make dependencies easier to replace. This work is an initial step towards a standard for tensor contraction: it defines the problem precisely and proposes a standard interface alongside a reference implementation.

The approach draws inspiration from the Basic Linear Algebra Subprograms (BLAS), which have served as the de facto standard for linear algebra operations for over four decades. BLAS evolved from Level 1 (vector-vector operations) through Level 2 (matrix-vector operations) to Level 3 (matrix-matrix operations), with Level 3 designed to exploit memory hierarchies. That history highlights the importance of adapting standards to advances in hardware.

By establishing a standard interface, TAPP aims to provide performance portability and resolve dependency challenges for tensor algebra. The team developed a mathematical formulation of tensor contractions alongside a reference implementation, which ensures correctness and serves as a validation tool for optimised kernels. Integrations with established libraries, including TBLIS, cuTENSOR, and the DIRAC package, demonstrate the viability of TAPP.
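To make the idea of a correctness-first reference contraction concrete, here is a minimal sketch in Python (TAPP itself is a C interface, and none of the names below come from it). It assumes a BLAS-style form, D = alpha * A * B + beta * C, where every index label that does not appear in the output is summed over. The function `contract`, the helper `from_rows`, and the dictionary-based tensor layout are all hypothetical choices made for illustration.

```python
from itertools import product


def contract(alpha, A, idx_a, B, idx_b, beta, C, idx_c, extents):
    """Naive contraction: D[idx_c] = alpha * sum(A[idx_a] * B[idx_b]) + beta * C[idx_c].

    Tensors are dicts mapping index tuples to numbers. `extents` maps each
    index label to its size. Labels missing from idx_c are summed over.
    """
    summed = [i for i in dict.fromkeys(idx_a + idx_b) if i not in idx_c]
    D = {}
    for free in product(*(range(extents[i]) for i in idx_c)):
        env = dict(zip(idx_c, free))  # current value of every output label
        acc = 0
        for inner in product(*(range(extents[i]) for i in summed)):
            env.update(zip(summed, inner))  # current value of every summed label
            acc += A[tuple(env[i] for i in idx_a)] * B[tuple(env[i] for i in idx_b)]
        D[free] = alpha * acc + beta * C[free]
    return D


def from_rows(rows):
    """Convert a nested list (a matrix) into the {(i, j): value} layout used here."""
    return {(i, j): v for i, row in enumerate(rows) for j, v in enumerate(row)}


A = from_rows([[1, 2, 3], [4, 5, 6]])
B = from_rows([[7, 8], [9, 10], [11, 12]])
C = from_rows([[0, 0], [0, 0]])
extents = {"a": 2, "k": 3, "b": 2}

# Matrix multiplication written as a contraction over the shared label "k".
D = contract(1, A, "ak", B, "kb", 0, C, "ab", extents)

# Stand-in for an optimised kernel: a hand-written matrix product.
fast = {(i, j): sum(A[i, k] * B[k, j] for k in range(3))
        for i in range(2) for j in range(2)}

# Validation step: the largest deviation between the kernel and the reference.
max_error = max(abs(D[key] - fast[key]) for key in D)
```

The slow, obviously correct loop plays the role of the reference, and any faster kernel can be checked against it element by element.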

The interface’s execution API is naturally type-agnostic, allowing for flexibility in data types, although mixed-type support ultimately depends on the capabilities of the underlying back-end implementation. The same tensor operation description can be reused for subsequent executions with the same or different data, which encourages caching and reduces overhead. TAPP gains further adaptability from virtual key-value stores (VKVs), a free-form mechanism for passing information, such as customised initialisation settings and data locality hints, to the underlying implementation. The interface also incorporates an error handling system that returns integer error codes with plain-text descriptions, analogous to the POSIX strerror function.
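The "describe once, execute many times" pattern and the integer-code error model can be sketched as follows. This is an illustration in Python under stated assumptions, not the TAPP API: the class `ContractionPlan`, the `Status` codes, the `hints` dictionary (a loose analogue of a key-value store), and `explain_error` are all hypothetical names.

```python
from enum import IntEnum
from itertools import product
from math import prod


class Status(IntEnum):
    """Hypothetical integer status codes."""
    SUCCESS = 0
    SHAPE_MISMATCH = 1


_MESSAGES = {
    Status.SUCCESS: "success",
    Status.SHAPE_MISMATCH: "tensor shape does not match the plan",
}


def explain_error(code):
    """Map an integer status code to plain text, in the spirit of strerror."""
    return _MESSAGES.get(code, "unknown error code")


class ContractionPlan:
    """Describes a contraction once so it can be executed on many data sets."""

    def __init__(self, idx_a, idx_b, idx_c, extents, hints=None):
        self.idx_a, self.idx_b, self.idx_c = idx_a, idx_b, idx_c
        self.extents = dict(extents)
        self.hints = dict(hints or {})  # free-form key-value information
        self.summed = [i for i in dict.fromkeys(idx_a + idx_b) if i not in idx_c]

    def _size(self, idx):
        return prod(self.extents[i] for i in idx)

    def execute(self, A, B, out):
        """Fill `out` with the contraction result and return a status code."""
        if len(A) != self._size(self.idx_a) or len(B) != self._size(self.idx_b):
            return Status.SHAPE_MISMATCH
        for free in product(*(range(self.extents[i]) for i in self.idx_c)):
            env = dict(zip(self.idx_c, free))
            acc = 0
            for inner in product(*(range(self.extents[i]) for i in self.summed)):
                env.update(zip(self.summed, inner))
                acc += A[tuple(env[i] for i in self.idx_a)] * B[tuple(env[i] for i in self.idx_b)]
            out[free] = acc
        return Status.SUCCESS


plan = ContractionPlan("ak", "kb", "ab", {"a": 1, "k": 2, "b": 1}, hints={"device": "cpu"})
y = {(0, 0): 3, (1, 0): 4}

out_int, out_float, out_complex = {}, {}, {}
# The same plan is reused with integer, floating-point, and complex data.
s1 = plan.execute({(0, 0): 1, (0, 1): 2}, y, out_int)
s2 = plan.execute({(0, 0): 0.5, (0, 1): 1.5}, y, out_float)
s3 = plan.execute({(0, 0): 1j, (0, 1): 0}, y, out_complex)

# A tensor of the wrong size yields an error code instead of an exception.
bad = plan.execute({}, y, {})
message = explain_error(bad)
```

Returning a code and looking up its text separately keeps the calling convention simple across a C boundary, which is presumably why a strerror-like design is attractive for an interface of this kind.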

The reference implementation prioritises correctness and simplicity over performance, serving as a model for developers supporting TAPP. This implementation supports arbitrary mixing of real and complex floating-point numbers with 64, 32, and 16-bit widths, though other implementations may be more restrictive. Through a mathematically defined framework and reference implementation, the researchers aimed to establish a performance-portable standard, addressing a long-standing need within the scientific computing community. Successful integration with existing libraries (TBLIS, cuTENSOR, and DIRAC) demonstrates the feasibility and potential of TAPP to function as a unifying layer for tensor algebra. By providing a common interface, TAPP facilitates the development of applications that can leverage the performance benefits of different tensor libraries without requiring extensive code modifications.

The authors acknowledge that TAPP intentionally focuses on performance-critical operations, excluding convenience functionalities like in-memory array reshaping to maintain efficiency. Future development plans include the creation of a comprehensive, randomised benchmark suite and the implementation of “Multi-TAPP”, a feature enabling dynamic selection of tensor libraries at runtime, allowing developers to seamlessly benchmark and choose the optimal library for specific tasks. The authors also note that the current work represents a draft standard, maintained by a dedicated decision-making committee to ensure ongoing relevance and evolution.
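The idea of choosing a library at runtime can be illustrated with a small dispatch table. The sketch below is not Multi-TAPP; it only shows the general technique of timing interchangeable back-ends behind one call signature and picking the fastest. The two toy kernels, the `BACKENDS` registry, and `pick_fastest` are hypothetical.

```python
import time


def dot_loop(x, y):
    """Toy back-end 1: explicit loop."""
    total = 0
    for a, b in zip(x, y):
        total += a * b
    return total


def dot_builtin(x, y):
    """Toy back-end 2: built-in sum over a generator."""
    return sum(a * b for a, b in zip(x, y))


# Every back-end exposes the same call signature, so callers can swap freely.
BACKENDS = {"loop": dot_loop, "builtin": dot_builtin}


def pick_fastest(backends, args, repeats=5, timer=time.perf_counter):
    """Time every back-end on the same arguments and return the fastest one's name."""
    best_name, best_time = None, float("inf")
    for name, kernel in backends.items():
        start = timer()
        for _ in range(repeats):
            kernel(*args)
        elapsed = timer() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


x, y = [1, 2, 3], [4, 5, 6]
chosen = pick_fastest(BACKENDS, (x, y))
result = BACKENDS[chosen](x, y)  # 1*4 + 2*5 + 3*6 = 32
```

Because the application only ever calls through the registry, swapping or adding a back-end requires no change to the calling code, which is the property a common interface is meant to provide.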

👉 More information
🗞 Tensor Algebra Processing Primitives (TAPP): Towards a Standard for Tensor Operations
🧠 ArXiv: https://arxiv.org/abs/2601.07827

TAPP Standard Enables Performance Portability for Tensor Operations with C-Based Interface

Rohail T.
