Scientists are tackling the challenge of efficiently computing numerous small singular value decompositions (SVDs), a fundamental operation powering techniques like principal component analysis and low-rank approximation. Ahmad Abdelfattah from the University of Tennessee, Knoxville, and Massimiliano Fasi from the University of Leeds, alongside their colleagues, present a new GPU-oriented batch SVD solver that significantly advances performance in this critical area. Their research matters because current GPU solutions lag behind CPU capabilities for ‘batch SVD’ (solving many small SVD problems simultaneously), and this novel approach, which leverages the one-sided Jacobi algorithm and a series of careful optimisations, delivers unmatched speed and robustness across diverse problem types and hardware. Numerical experiments demonstrate substantial performance gains over both vendor-supplied and open-source alternatives, promising to accelerate a wide range of data-intensive applications.
GPU Accelerated Batch SVD via Jacobi Algorithm
While efficient CPU-based solutions already exist, achieving comparable performance on GPUs has remained a significant challenge until now. This approach avoids the computationally expensive bi-diagonalisation phase common in other SVD algorithms, offering a simpler and more readily parallelisable solution. This allows for significant acceleration, particularly when dealing with large batches of small SVD problems. Furthermore, the study establishes that the new solver is not only faster but also numerically stable, maintaining accuracy across a wide spectrum of matrix characteristics. The solver supports both real and complex matrices in both 32-bit and 64-bit floating-point arithmetic, providing versatility for various scientific and engineering applications. This advancement is particularly relevant for data science, machine learning, and image processing, where SVD is a fundamental building block for dimensionality reduction, data compression, and feature extraction. The publicly available implementation through the open-source MAGMA library ensures broad accessibility and facilitates further research and development in this critical area of numerical computation.
GPU Accelerated Batch SVD via Jacobi Optimisation
These optimisations focused on maximising the utilisation of GPU resources and minimising data-transfer overhead, critical factors for achieving speedups. The system delivers a robust solution across a wide range of problem characteristics, ensuring reliable results regardless of input data. Scientists harnessed the relationship between the SVD and the Hermitian eigenvalue problem, specifically the equations AᴴA = VΣ²Vᴴ and AAᴴ = UΣ²Uᴴ, to form the algorithmic foundation of their design. This connection allowed them to reduce the SVD computation to solving a Hermitian eigenvalue problem, streamlining the process and enabling efficient parallelisation.
The technique reveals that solving for the eigenvectors of these Gram matrices provides the right and left singular vectors of the original matrix A, with the eigenvalues corresponding to the squares of the singular values. Furthermore, the research pioneered a method for recovering the left singular vectors (U) from the relation U = AVΣ⁻¹, once the right singular vectors (V) and singular values (Σ) are known. This innovative solver achieves unmatched performance in batch SVD computations, offering a substantial advancement in numerical linear algebra for GPU platforms.
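The Gram-matrix route described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the mathematical idea, not the paper's implementation: diagonalise AᴴA to obtain V and Σ², take square roots of the eigenvalues for Σ, then recover U from U = AVΣ⁻¹.

```python
import numpy as np

# Illustrative sketch only: a small random complex matrix stands in
# for one member of a batch.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))

# Eigendecomposition of the right Gram matrix A^H A = V Σ² V^H.
eigvals, V = np.linalg.eigh(A.conj().T @ A)

# Sort eigenpairs by descending eigenvalue; singular values are the
# square roots of the eigenvalues of the Gram matrix.
order = np.argsort(eigvals)[::-1]
sigma = np.sqrt(eigvals[order])
V = V[:, order]

# Recover the left singular vectors from U = A V Σ⁻¹.
U = (A @ V) / sigma

# Check: U has orthonormal columns and U Σ V^H reconstructs A.
assert np.allclose(U.conj().T @ U, np.eye(4), atol=1e-8)
assert np.allclose((U * sigma) @ V.conj().T, A, atol=1e-8)
```

Note that forming AᴴA explicitly squares the condition number of A, which is exactly why the one-sided Jacobi solver described next applies its rotations to A implicitly instead of building the Gram matrix.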
GPU SVD Solver Achieves Peak Performance
Results demonstrate that the new implementation consistently outperforms both vendor solutions and existing open-source solvers on NVIDIA and other systems. Specifically, the solver’s performance was benchmarked against established methods, revealing significant speedups in processing batch SVD problems. The core of this achievement lies in the exploitation of parallel execution inherent in the Jacobi method, where off-diagonal elements corresponding to disjoint pairs can be annihilated concurrently. Experiments confirm global convergence for row- and column-cyclic orderings, in which each off-diagonal element of the matrix is systematically annihilated.
Hansen’s work on equivalence classes of cyclic orderings was leveraged, ensuring that any strategy within the same class also guarantees global convergence. The researchers implemented a round-robin ordering, as depicted in Figure 0.1, which consists of seven iterations, each containing four disjoint pairs of block-columns for an 8×8 matrix. This ordering, along with others like ring and odd-even orderings, contributes to the solver’s efficiency and stability. Data shows the one-sided Jacobi SVD algorithm operates on the right Gram matrix AᴴA, avoiding explicit computation and instead applying Jacobi rotations implicitly.
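A round-robin ordering of this kind can be generated with the classic circle method from tournament scheduling. The sketch below (an illustration of the pairing structure, not code from the paper) produces, for eight block-columns, the seven rounds of four disjoint pairs mentioned above, covering every pair exactly once per sweep:

```python
def round_robin_rounds(n):
    """Rounds of a round-robin ordering for n block-columns (n even).

    Each round pairs all n columns into n/2 disjoint pairs, so the
    pairs in a round can be processed concurrently; over the n - 1
    rounds every unordered pair (i, j) appears exactly once.
    """
    cols = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(cols[i], cols[n - 1 - i]) for i in range(n // 2)])
        # Circle method: fix the first element, rotate the rest.
        cols = [cols[0]] + [cols[-1]] + cols[1:-1]
    return rounds

rounds = round_robin_rounds(8)
assert len(rounds) == 7 and all(len(r) == 4 for r in rounds)
# Every unordered pair of block-columns appears exactly once: C(8, 2) = 28.
assert len({frozenset(p) for r in rounds for p in r}) == 28
```

Because the four pairs within each round share no columns, the corresponding rotations touch disjoint data and can be applied in parallel, which is the property the GPU kernels exploit.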
Algorithm 2 details the unblocked one-sided Jacobi SVD algorithm, iteratively refining the matrix until convergence is achieved. The algorithm computes singular values and normalizes left singular vectors, sorting the singular values in descending order to ensure accuracy and stability. Measurements confirm that each step of the Jacobi algorithm reduces the off-diagonal norm, guaranteeing convergence to a matrix with orthogonal columns, provided the initial conditions are met. The breakthrough delivers a simpler implementation for parallel architectures, requiring partitioning across columns only and eliminating the need for a QR factorization pre-processing step.
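To make the iteration concrete, here is a minimal real-valued sketch of the unblocked one-sided Jacobi idea. It is a textbook-style reconstruction under simplifying assumptions (real matrices, simple cyclic-by-row ordering), not the paper's Algorithm 2: columns of A are rotated pairwise until mutually orthogonal, the rotations accumulate into V, and the column norms and normalised columns yield Σ and U.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Minimal one-sided Jacobi SVD sketch for real matrices."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Implicit 2x2 block of the Gram matrix A^T A.
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                if abs(gamma) > tol * np.sqrt(alpha * beta):
                    converged = False
                    # Rotation angle that zeroes the (p, q) Gram entry.
                    zeta = (beta - alpha) / (2.0 * gamma)
                    t = (1.0 if zeta == 0 else
                         np.sign(zeta) / (abs(zeta) + np.hypot(1.0, zeta)))
                    c = 1.0 / np.sqrt(1.0 + t * t)
                    s = c * t
                    R = np.array([[c, s], [-s, c]])
                    A[:, [p, q]] = A[:, [p, q]] @ R  # rotate columns of A
                    V[:, [p, q]] = V[:, [p, q]] @ R  # accumulate into V
        if converged:
            break
    # Singular values are the column norms; normalising gives U.
    sigma = np.linalg.norm(A, axis=0)
    order = np.argsort(sigma)[::-1]  # descending, as in the solver
    return A[:, order] / sigma[order], sigma[order], V[:, order]

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
U, s, V = one_sided_jacobi_svd(A)
assert np.allclose((U * s) @ V.T, A, atol=1e-8)
```

The sketch makes the structural advantage visible: the state is partitioned by columns only, there is no bidiagonalisation or QR pre-processing step, and each rotation touches exactly two columns, which maps cleanly onto parallel hardware.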
GPU Accelerated Batch SVD via Jacobi
The resulting solver supports standard LAPACK data types and computes singular values alongside both left and right singular vectors. The authors highlight that Jacobi-type methods can be effective for small to medium-sized problems, and minimizing data movement through kernel fusion and register reuse is crucial for accelerator performance. Acknowledging limitations, the authors note the solver’s performance is most pronounced for smaller and medium-sized matrices. Future research could extend these design principles to other iterative factorization algorithms, potentially broadening the impact of these findings. This work not only delivers a competitive batch SVD solver but also establishes valuable design principles for batch linear algebra on accelerators, contributing to advancements in high-performance computing.
👉 More information
🗞 An Efficient Batch Solver for the Singular Value Decomposition on GPUs
🧠 ArXiv: https://arxiv.org/abs/2601.17979
