The paper discusses Analog-Based In-Memory Computing (AIMC) inference accelerators used to execute Deep Neural Network (DNN) inference workloads. It proposes two Post-Training (PT) optimization methods that improve accuracy after training while reducing the complexity of, and the amount of hardware information required during, Hardware-Aware (HWA) training. These methods are applied to the pre-trained RoBERTa transformer model across all General Language Understanding Evaluation (GLUE) benchmark tasks. The study uses an experimentally calibrated hardware model for simulations, based on an array containing 1 million Phase-Change Memory (PCM) devices.
Introduction to Analog-Based In-Memory Computing (AIMC) Accelerators
Analog-Based In-Memory Computing (AIMC) inference accelerators are used to efficiently execute Deep Neural Network (DNN) inference workloads. However, to mitigate accuracy losses due to circuit and device non-idealities, Hardware-Aware (HWA) training methodologies must be employed. These methodologies typically require significant information about the underlying hardware.
Post-Training Optimization Methods
In this paper, two Post-Training (PT) optimization methods are proposed to improve accuracy after training has been performed. The first optimizes the conductance range of each column in a crossbar, and the second optimizes the input range, i.e., the Digital-to-Analog Converter (DAC) range. It is demonstrated that, when these methods are employed, both the complexity of training and the amount of information required about the underlying hardware can be reduced with no notable change in accuracy.
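A minimal sketch of how such a post-training calibration could look in practice is given below: (a) a per-column conductance-range scale and (b) the DAC input clipping range are chosen by grid search against floating-point reference outputs on a small calibration set. The noise/clipping model, search grids, and error metric are illustrative assumptions, not the paper's exact optimization procedure.

```python
# Hedged sketch: post-training calibration of (a) per-column conductance-range
# scales and (b) the DAC input range, via grid search against FP reference
# outputs on a calibration set. All modelling details are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def quantize_input(x, x_max, bits=8):
    """Clip to [-x_max, x_max] and quantize to a signed `bits`-bit grid (DAC model)."""
    levels = 2 ** (bits - 1) - 1
    xq = np.clip(x, -x_max, x_max)
    return np.round(xq / x_max * levels) / levels * x_max

def analog_vmm(x, W, col_scale, g_max=1.0, noise_std=0.1):
    """Toy analog VMM: per-column weight scaling, conductance clipping at g_max,
    and additive Gaussian read noise (all modelling assumptions)."""
    G = np.clip(W * col_scale, -g_max, g_max)      # programmed conductances
    y = x @ G + noise_std * rng.standard_normal((x.shape[0], W.shape[1]))
    return y / col_scale                           # per-column digital rescaling

# Calibration inputs and a "pre-trained" layer weight (random stand-ins).
X_cal = rng.standard_normal((256, 512))
W = rng.standard_normal((512, 512)) / np.sqrt(512)
Y_ref = X_cal @ W                                  # FP reference outputs

# (b) DAC range: choose the input clipping value that minimizes output MSE.
dac_grid = np.linspace(0.5, 4.0, 15) * X_cal.std()
dac_err = [np.mean((quantize_input(X_cal, m) @ W - Y_ref) ** 2) for m in dac_grid]
x_max = dac_grid[int(np.argmin(dac_err))]

# (a) Per-column conductance scale: evaluate a shared candidate grid and keep
# the best scale for each output column independently.
candidates = np.linspace(2.0, 12.0, 21)
per_col_err = np.stack([
    ((analog_vmm(quantize_input(X_cal, x_max), W, s) - Y_ref) ** 2).mean(axis=0)
    for s in candidates
])                                                  # shape: (n_candidates, 512)
col_scale = candidates[np.argmin(per_col_err, axis=0)]

print(f"calibrated DAC range: {x_max:.2f}, per-column scales in "
      f"[{col_scale.min():.1f}, {col_scale.max():.1f}]")
```

Because only forward passes on a small calibration set are needed, this kind of search requires neither gradients nor detailed device models at training time, which is the appeal of doing the optimization post-training.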
Application of PT Optimization Methods
The PT optimization methods are applied to the pre-trained RoBERTa transformer model for all General Language Understanding Evaluation (GLUE) benchmark tasks. The results show that further optimizing learned parameters post-training improves accuracy.
Analog-Based In-Memory Computing (AIMC) Accelerators
AIMC accelerators are capable of performing Vector-Matrix Multiplications (VMMs) in O(1)-time complexity. They have gained significant interest due to their ability to execute these operations efficiently. However, networks trained for deployment on conventional compute hardware with Floating-Point (FP) precision parameters, such as Graphics Processing Units (GPUs), require retraining when deployed on AIMC hardware to achieve near- or at-iso accuracy.
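The sketch below illustrates both points: every weight becomes a conductance, so Ohm's law performs the multiplications and Kirchhoff's current law sums each column in parallel (hence O(1) time), but imperfect programming perturbs the stored values and degrades the result. The Gaussian programming-noise model and its magnitude are assumptions; real PCM non-idealities are more complex.

```python
# Hedged sketch: why FP-trained weights lose accuracy when executed as an
# analog VMM. Conductances are perturbed with a simple Gaussian programming-
# noise model (an assumption) and compared against the exact FP result.
import numpy as np

rng = np.random.default_rng(1)

W = rng.standard_normal((512, 128)) / np.sqrt(512)   # FP-trained weights
x = rng.standard_normal(512)                          # one input activation vector

# Analog execution: all multiply-accumulates of a column happen concurrently
# in the crossbar, which is why the whole VMM completes in O(1) time steps.
sigma = 0.05 * np.abs(W).max()                        # assumed programming-noise level
G = W + sigma * rng.standard_normal(W.shape)          # noisy programmed conductances

y_fp = x @ W
y_analog = x @ G

rel_err = np.linalg.norm(y_analog - y_fp) / np.linalg.norm(y_fp)
print(f"relative output error from device noise: {rel_err:.3%}")
```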
Hardware-Aware Training Techniques
Hardware-Aware (HWA) training techniques, such as Quantization-Aware Training (QAT), are widely adopted due to the proliferation of reduced precision digital accelerators and deterministic execution flows. However, some of these techniques require instance-specific information and cannot easily be generalized for different hardware architectures.
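As a point of reference, the core building block of QAT is fake quantization with a straight-through estimator: weights are quantized in the forward pass while gradients flow to the full-precision copy. The sketch below is a generic illustration of that idea, not the specific HWA recipe used in the paper.

```python
# Hedged sketch: fake quantization with a straight-through estimator, the
# basic mechanism behind Quantization-Aware Training (QAT).
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, bits=8):
        # Symmetric uniform quantization to a signed `bits`-bit grid.
        levels = 2 ** (bits - 1) - 1
        scale = w.abs().max() / levels
        return torch.round(w / scale).clamp(-levels, levels) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged.
        return grad_output, None

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, 8)   # quantize weights on the fly
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(512, 512)
out = layer(torch.randn(4, 512))
out.sum().backward()                            # gradients reach the FP weights
print(layer.weight.grad.shape)                  # torch.Size([512, 512])
```

Techniques of this kind bake a specific quantization (or device) model into training, which is why they can require instance-specific hardware information and may not transfer easily across architectures.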
The GLUE Benchmark and RoBERTa
The GLUE benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. In this paper, the base RoBERTa model, a ubiquitous bidirectional transformer based on BERT with approximately 125M learnable parameters, is used.
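For orientation, the base RoBERTa model and the GLUE tasks are readily available through the Hugging Face `transformers` and `datasets` libraries. The sketch below loads the model, confirms the roughly 125M-parameter count mentioned above, and runs a few SST-2 validation sentences through an (untrained) classification head; the choice of SST-2 is arbitrary.

```python
# Hedged sketch: loading roberta-base and one GLUE task (SST-2) with Hugging
# Face libraries. Requires `transformers` and `datasets`; downloads on first run.
from datasets import load_dataset
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

n_params = sum(p.numel() for p in model.parameters())
print(f"learnable parameters: {n_params / 1e6:.1f}M")    # roughly 125M

sst2 = load_dataset("glue", "sst2")
batch = tokenizer(sst2["validation"]["sentence"][:4], padding=True, return_tensors="pt")
logits = model(**batch).logits
print(logits.shape)                                      # (4, 2)
```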
Hardware Model for Simulations
To perform realistic hardware simulations, an experimentally verified model, calibrated using extensive measurements on an array containing 1 million Phase-Change Memory (PCM) devices, is used. A tile size of 512×512 is assumed. Inputs are encoded using 8-bit Pulse-Width Modulation (PWM), and weights are represented using a standard differential weight-mapping scheme.
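The sketch below shows this mapping in digital form: a layer's weight matrix is partitioned into 512×512 tiles, each weight is split into a differential conductance pair (G+, G−), and inputs are quantized to 8-bit pulse-width durations. The conductance units, the scaling, and the restriction of inputs to [0, 1] (real PWM schemes handle signed inputs separately) are placeholders, not the paper's calibrated PCM values.

```python
# Hedged sketch: tiling, differential weight mapping, and 8-bit PWM input
# encoding. Units and scaling are illustrative placeholders.
import numpy as np

TILE = 512
G_MAX = 25.0          # assumed maximum device conductance (arbitrary units)
PWM_BITS = 8

def map_to_tiles(W):
    """Split W into (G_plus, G_minus) tile pairs of size at most TILE x TILE."""
    scale = G_MAX / np.abs(W).max()
    tiles = []
    for r in range(0, W.shape[0], TILE):
        for c in range(0, W.shape[1], TILE):
            w = W[r:r + TILE, c:c + TILE] * scale
            g_plus = np.clip(w, 0, None)          # positive part on the "+" devices
            g_minus = np.clip(-w, 0, None)        # negative part on the "-" devices
            tiles.append(((r, c), g_plus, g_minus))
    return tiles, scale

def pwm_encode(x):
    """Quantize inputs in [0, 1] to 8-bit pulse-width durations."""
    durations = np.round(np.clip(x, 0, 1) * (2 ** PWM_BITS - 1))
    return durations / (2 ** PWM_BITS - 1)

W = np.random.default_rng(2).standard_normal((768, 768)) / np.sqrt(768)
tiles, scale = map_to_tiles(W)
x = pwm_encode(np.random.default_rng(3).random(768))

# Ideal tile-by-tile execution: "+" and "-" column currents are subtracted,
# then partial results are accumulated across tiles and rescaled digitally.
y = np.zeros(768)
for (r, c), g_plus, g_minus in tiles:
    y[c:c + g_plus.shape[1]] += x[r:r + g_plus.shape[0]] @ (g_plus - g_minus) / scale

print(f"{len(tiles)} tiles of at most {TILE}x{TILE}, "
      f"max mapping error: {np.abs(y - x @ W).max():.2e}")
```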
“Improving the Accuracy of Analog-Based In-Memory Computing Accelerators Post-Training” by Corey Lammie, Athanasios Vasilopoulos, Julian Büchel, Giacomo Camposampiero, Manuel Le Gallo, Malte J. Rasch, and Abu Sebastian. Published on January 18, 2024. https://doi.org/10.48550/arxiv.2401.09859
