Intel GPUs Enhance Machine Learning Performance with Fully-Fused MLPs Implementation

The article discusses the role of MultiLayer Perceptrons (MLPs) in Machine Learning (ML) and Artificial Intelligence (AI), and their implementation on Intel GPUs. MLPs, characterized by their fully connected layers, are universal approximators, capable of approximating any continuous function. The article also explains the implementation of fully-fused MLPs on Intel GPUs, which involves fusing layers into a single kernel to keep relevant data in faster memories.

The SYCL implementation for Intel GPUs focuses on MLPs with arbitrary depth and fixed layer width. The fully-fused MLPs approach significantly increases the arithmetic intensity and performance, outperforming the Intel Extension for PyTorch (IPEX) and the CUDA PyTorch version on Nvidia’s H100 GPU.

What are Fully-fused MultiLayer Perceptrons and their Role in Machine Learning?

MultiLayer Perceptrons (MLPs) are a crucial component in the landscape of Machine Learning (ML) and Artificial Intelligence (AI). They serve as the primary Neural Network architecture for several ML applications. MLPs are characterized by their fully connected layers, where every neuron in a layer is connected to every neuron in the preceding and succeeding layers. A key property of MLPs is that each neuron's output is independent of the outputs of its neighbors in the same layer, which makes the architecture well suited to fully-fused implementations.
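
The structure is easy to make concrete. The sketch below (plain C++, not the paper's code; the ReLU activation, equal layer widths, and row-major weight layout are illustrative assumptions) shows that every output neuron of a dense layer reads all of that layer's inputs, but never the outputs of its neighbors in the same layer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One dense layer: y = relu(W * x + b), with W stored row-major (width x width).
std::vector<float> dense_layer(const std::vector<float>& W,
                               const std::vector<float>& b,
                               const std::vector<float>& x) {
  const std::size_t width = b.size();
  std::vector<float> y(width);
  for (std::size_t i = 0; i < width; ++i) {   // output neuron i ...
    float sum = b[i];
    for (std::size_t k = 0; k < width; ++k)   // ... reads every input k,
      sum += W[i * width + k] * x[k];
    y[i] = std::max(sum, 0.0f);               // but never a neighbor's output y[j]
  }
  return y;
}

// An MLP chains such layers: the output of layer l is the input of layer l+1.
std::vector<float> mlp_forward(const std::vector<std::vector<float>>& weights,
                               const std::vector<std::vector<float>>& biases,
                               std::vector<float> x) {
  for (std::size_t l = 0; l < weights.size(); ++l)
    x = dense_layer(weights[l], biases[l], x);
  return x;
}
```

Because the intermediate vector is needed only by the next layer, it is exactly the data a fused kernel can keep in registers or shared memory instead of writing it back to global memory.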

MLPs are particularly interesting because they are universal approximators, meaning they can approximate any continuous function to any desired accuracy. This is proven by the Universal Approximation Theorem for width-bounded networks. In practice, MLPs rarely exceed the maximum width of 2^7 = 128 elements supported by this work, as networks tend to be made deeper to gain more expressiveness rather than wider.
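
Stated informally (a standard formulation of the guarantee, not quoted from the paper): for any continuous target function on a compact domain and any tolerance, there is an MLP, sufficiently deep but with width bounded in terms of the input and output dimensions, that matches the target to within that tolerance.

```latex
% Universal approximation, informal statement: f continuous on a compact set K,
% tolerance epsilon > 0; then there exists an MLP g of bounded width such that
\[
\sup_{x \in K} \lVert f(x) - g(x) \rVert < \varepsilon .
\]
```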

How are Fully-fused MLPs Implemented on Intel GPUs?

The implementation of fully-fused MLPs on Intel GPUs focuses on narrow MLPs, which consist of an arbitrary number of layers (depth) and a small, constant number of neurons per layer (width). These narrow MLPs are of particular interest because the performance they can reach is severely limited by the small width of the layers: it reduces the arithmetic intensity of the matrix multiplications required in each layer, for training and, in particular, for inference.
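
A rough, roofline-style estimate (back-of-the-envelope numbers, not figures from the paper) makes the bottleneck concrete for a single layer of width N and batch size M in bfloat16 (2 bytes per value), assuming inputs, weights, and outputs all travel through global memory:

```latex
\[
\text{FLOPs} = 2 M N^2, \qquad
\text{Bytes} \approx 2\,(MN + N^2 + MN) = 4MN + 2N^2,
\]
\[
\text{arithmetic intensity} = \frac{2 M N^2}{4MN + 2N^2}
\;\longrightarrow\; \frac{N}{2} = 32 \ \text{FLOP/byte} \quad (N = 64,\ M \gg N),
\]
```

which is well below the FLOP-per-byte ratio modern GPUs need to reach peak throughput, so an unfused narrow layer is bound by global-memory bandwidth rather than by compute.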

To alleviate the issues arising from the low arithmetic intensity and the limited bandwidth of global memory, a common strategy is to fuse the layers into a single kernel so that the relevant data stays in faster memories, i.e., the register file, shared memory, or faster caches. This approach, termed fully-fused MLPs, has been implemented for Nvidia GPUs utilizing Nvidia's CUDA language.
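
A minimal SYCL sketch of the fusion idea follows. It is not the paper's implementation: it assumes USM device pointers, float arithmetic, a plain multiply-accumulate loop, and a hypothetical fixed width of 64, whereas the actual kernels use Intel's joint_matrix extension, bfloat16 inputs, and the XMX units. What it illustrates is that the intermediate activations live in shared local memory for the whole depth of the network, and global memory is touched only for the input, the weights, and the final output.

```cpp
#include <sycl/sycl.hpp>

constexpr int WIDTH = 64;  // hypothetical fixed layer width (2^6)

// Fused forward pass: one work-group per input row, one work-item per neuron.
// `weights`, `input`, and `output` are assumed to be USM device allocations.
void fused_forward(sycl::queue& q,
                   const float* weights,  // n_layers * WIDTH * WIDTH, layer-major
                   const float* input,    // batch_size * WIDTH
                   float* output,         // batch_size * WIDTH
                   int n_layers, int batch_size) {
  q.submit([&](sycl::handler& h) {
    // Activations of the current layer stay in shared local memory (SLM).
    sycl::local_accessor<float, 1> act(sycl::range<1>(WIDTH), h);
    h.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(batch_size * WIDTH),
                          sycl::range<1>(WIDTH)),
        [=](sycl::nd_item<1> it) {
          const int row = it.get_group(0);
          const int col = it.get_local_id(0);
          act[col] = input[row * WIDTH + col];        // read input row once
          sycl::group_barrier(it.get_group());
          for (int l = 0; l < n_layers; ++l) {
            float sum = 0.0f;
            for (int k = 0; k < WIDTH; ++k)           // y[col] = sum_k W_l[k][col] * x[k]
              sum += weights[(l * WIDTH + k) * WIDTH + col] * act[k];
            sycl::group_barrier(it.get_group());      // all reads of act are done
            act[col] = sycl::fmax(sum, 0.0f);         // ReLU, result kept in SLM
            sycl::group_barrier(it.get_group());      // writes visible to the group
          }
          output[row * WIDTH + col] = act[col];       // write final row once
        });
  }).wait();
}
```

The intermediate results never round-trip through global memory, which is what restores arithmetic intensity for narrow layers.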

What is the SYCL Implementation for Intel GPUs?

The SYCL implementation of fully-fused MLPs for Intel GPUs focuses on MLPs with arbitrary depth and a fixed layer width of 2^i neurons, i ∈ {4, ..., 7} (i.e., 16, 32, 64, or 128 neurons per layer). The implementation is based on Intel's joint_matrix SYCL extension to utilize the XMX hardware in Intel's Data Center GPU Max 1550, the device targeted by the optimized implementation.
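
One plausible reason for restricting the width to a few fixed powers of two is that a compile-time width lets the kernel size its register and local-memory tiles statically and fully unroll its inner loops. A hypothetical dispatch wrapper over the supported widths might look like the following (the names fused_forward_tpl and fused_forward_for_width are illustrative, not the paper's API):

```cpp
#include <stdexcept>

// Stub for a fused kernel compiled for one specific layer width.
template <int WIDTH>
void fused_forward_tpl(/* queue, weight/input/output pointers, sizes */) {
  // Width-specialized kernel launch would go here (see the sketch above).
}

// Run-time widths are restricted to the compile-time specializations 2^4..2^7.
inline void fused_forward_for_width(int width /*, other arguments */) {
  switch (width) {
    case 16:  fused_forward_tpl<16>();  break;  // 2^4
    case 32:  fused_forward_tpl<32>();  break;  // 2^5
    case 64:  fused_forward_tpl<64>();  break;  // 2^6
    case 128: fused_forward_tpl<128>(); break;  // 2^7
    default:  throw std::invalid_argument("unsupported layer width");
  }
}
```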

This method is especially well suited to optimizing training and inference performance for models that require large data throughput, with batch sizes of 2^i elements, i ≥ 15 (i.e., 32,768 and larger), since those sizes maximize the occupancy of the device. The SYCL implementation on Intel hardware improves performance over an equivalent CUDA implementation for MLPs of width 64 by a factor of up to 2.84 in inference and 1.75 in training.

How does Fully-fused MLPs Improve Performance?

The approach to fully-fused MLPs is especially well suited to accelerating inference, and it significantly increases the arithmetic intensity, and thus the theoretical peak performance, compared to the approach shown in [6] by reducing accesses to global memory. This is demonstrated on a regression benchmark and on three applications: Image Compression, Neural Radiance Fields (NeRFs), and Physics-Informed Machine Learning.
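
The gain can be estimated with the same kind of back-of-the-envelope accounting as above (again my numbers, not the paper's model). For an MLP with L layers of width N, batch size M, and 2-byte values, counting only activation traffic to and from global memory:

```latex
\[
\text{Bytes}_{\text{layer-by-layer}}
\approx L\,(\underbrace{2MN}_{\text{read}} + \underbrace{2MN}_{\text{write}}) = 4MNL,
\qquad
\text{Bytes}_{\text{fused}}
\approx \underbrace{2MN}_{\text{input}} + \underbrace{2MN}_{\text{output}} = 4MN,
\]
```

so fusing the layers reduces activation traffic by roughly a factor of L and raises the arithmetic intensity accordingly (weight traffic is ignored in both estimates).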

In all cases, the implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 3.0, and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor of 1.9.

What are the Contributions of this Paper?

The paper presents the first SYCL implementation of fully-fused MLPs on Intel GPUs and demonstrates the performance improvements and potential applications of the implementation. It showcases the efficiency of the SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning, where, as noted above, it outperforms both the IPEX baseline on the same Intel GPU (by up to a factor of 3.0) and the CUDA PyTorch version on Nvidia's H100 GPU (by up to a factor of 1.9).

Publication details: “Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs”
Publication Date: 2024-03-26
Authors: Kai Yuan, Christoph Bauinger, Xiangyi Zhang, Pascal Baehr, et al.
Source: arXiv (Cornell University)
DOI: https://doi.org/10.48550/arxiv.2403.17607