FlexLLM Achieves 12.68 WikiText-2 PPL with Novel LLM Accelerator Design

Researchers are tackling the challenge of efficiently deploying large language models with FlexLLM, a new composable High-Level Synthesis (HLS) library. Led by Jiahao Zhang, Zifan He, and Nicholas Fraser from the University of California, Los Angeles and AMD, alongside Michaela Blott, Yizhou Sun, and Jason Cong, this work significantly streamlines the development of custom LLM accelerators, achieving a complete inference system for the Llama-3.2 1B model in just two months with remarkably concise code. FlexLLM’s innovative approach to stage-customisation and quantization unlocks hybrid designs offering substantial performance and energy efficiency gains, demonstrated on a U280 FPGA by a 1.29x speedup and 3.14x better energy efficiency compared to an NVIDIA A100 GPU, paving the way for more accessible and powerful LLM applications.

FlexLLM accelerates LLM development via customizable HLS implementations

Scientists have unveiled FlexLLM, a composable High-Level Synthesis (HLS) library designed to dramatically accelerate the development of domain-specific Large Language Model (LLM) accelerators. The team achieved a complete inference system for the Llama-3.2 1B model in under two months, utilising only 1,000 lines of code, demonstrating the library’s efficiency and ease of use. FlexLLM distinguishes itself by exposing key architectural degrees of freedom, enabling hybrid designs that intelligently tailor temporal reuse and spatial dataflow for both prefill and decode stages of LLM inference, a critical innovation for optimising performance. This approach allows for customisation, addressing the conflicting optimisation goals inherent in these two distinct stages, and incorporates a comprehensive quantization suite for accurate low-bit deployment, pushing the boundaries of efficient LLM acceleration.
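To make the stage-customisation idea concrete, the sketch below shows, in plain HLS-style C++, how a prefill matmul and a decode matvec might be shaped differently: the prefill loop nest reuses each weight element across a whole tile of tokens (temporal reuse), while the decode kernel unrolls the dot product across parallel lanes (spatial dataflow). The kernel names, dimensions, and pragmas are illustrative assumptions, not FlexLLM’s actual code or API.

```cpp
// Illustrative sketch only: a hypothetical pair of HLS-style kernels showing how
// prefill (many tokens, weight reuse) and decode (one token, spatial parallelism)
// can be customised differently. This is NOT FlexLLM's actual API or code.
#include <cstdint>

constexpr int D_IN   = 2048;  // hypothetical hidden size
constexpr int D_OUT  = 2048;
constexpr int TOKENS = 128;   // hypothetical prefill tile of tokens

// Prefill: many tokens reuse the same weight element, so the loop nest favours
// temporal reuse (load a weight once, stream all tokens through it).
// Assumes y is zero-initialised by the caller.
void prefill_matmul(const float x[TOKENS][D_IN],
                    const float w[D_IN][D_OUT],
                    float y[TOKENS][D_OUT]) {
  for (int o = 0; o < D_OUT; ++o) {
    for (int i = 0; i < D_IN; ++i) {
      float w_io = w[i][o];              // weight element reused across all tokens
      for (int t = 0; t < TOKENS; ++t) {
#pragma HLS PIPELINE II=1
        y[t][o] += x[t][i] * w_io;
      }
    }
  }
}

// Decode: one token per step, so throughput comes from spatial parallelism
// (unrolled multiply-accumulate lanes) rather than weight reuse.
void decode_matvec(const float x[D_IN],
                   const float w[D_IN][D_OUT],
                   float y[D_OUT]) {
  for (int o = 0; o < D_OUT; ++o) {
#pragma HLS PIPELINE II=1
    float acc = 0.f;
    for (int i = 0; i < D_IN; ++i) {
#pragma HLS UNROLL factor=16
      acc += x[i] * w[i][o];
    }
    y[o] = acc;
  }
}
```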
The research establishes a stage-customized accelerator with a hardware-efficient quantization technique achieving a WikiText-2 Perplexity (PPL) of 12.68, surpassing the SpinQuant baseline and maintaining high accuracy even with reduced bit-widths. Furthermore, the team integrated a Hierarchical Memory Transformer (HMT) plug-in, designed to efficiently handle long-context processing, a significant challenge in current LLM architectures. Experiments conducted on the AMD U280 FPGA at 16nm revealed the accelerator achieves a 1.29x end-to-end speedup, 1.64x higher decode throughput, and 3.14x improved energy efficiency compared to an NVIDIA A100 GPU. Projected results on the V80 FPGA at 7nm anticipate even greater gains, reaching 4.71x, 6.55x, and 4.13x improvements respectively.
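The article does not spell out the hardware-efficient quantization scheme behind the 12.68 PPL figure, but low-bit weight deployment commonly follows a recipe like the sketch below: symmetric per-group quantization with one shared scale per group. The 4-bit width and group size here are assumptions chosen purely for illustration.

```cpp
// Illustrative sketch only: symmetric per-group low-bit weight quantization,
// a common hardware-efficient scheme. Bit-width (4-bit) and grouping are
// assumptions for illustration, not the paper's actual algorithm.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantGroup {
  float scale;               // one scale shared by the whole group of weights
  std::vector<int8_t> q;     // 4-bit codes, stored in int8 for simplicity
};

QuantGroup quantize_group(const std::vector<float>& w) {
  float max_abs = 0.f;
  for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
  const float qmax = 7.f;                          // signed 4-bit range [-8, 7]
  QuantGroup g;
  g.scale = (max_abs > 0.f) ? max_abs / qmax : 1.f;
  g.q.reserve(w.size());
  for (float v : w) {
    int code = static_cast<int>(std::lround(v / g.scale));
    g.q.push_back(static_cast<int8_t>(std::clamp(code, -8, 7)));
  }
  return g;
}

// Reconstruct an approximate weight from its stored code.
inline float dequantize(const QuantGroup& g, size_t i) {
  return g.scale * static_cast<float>(g.q[i]);
}
```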

In scenarios demanding long-context processing, the integration of the HMT plug-in demonstrably reduces prefill latency by 23.23x and extends the context window by 64x. This translates to 1.10x/4.86x lower end-to-end latency and 5.21x/6.27x higher energy efficiency on the U280/V80 relative to the A100 baseline. By bridging algorithmic innovation in LLM inference with high-performance accelerator design, FlexLLM minimises manual effort and accelerates the development cycle, paving the way for more accessible and efficient LLM deployments across diverse platforms. The work opens exciting possibilities for customisable, energy-efficient LLM solutions tailored to specific applications and environments.
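A back-of-the-envelope model helps explain why a 23.23x prefill-latency reduction yields a smaller, though still substantial, end-to-end gain: decode time is unaffected by the plug-in, so the overall speedup depends on how much of a request is spent in prefill. All workload numbers in the sketch below are hypothetical placeholders; only the 23.23x factor comes from the article.

```cpp
// Illustrative arithmetic only: how a prefill-latency reduction translates into
// end-to-end latency for a long-context request. The workload numbers are
// hypothetical placeholders, not measurements from the paper.
#include <cstdio>

int main() {
  double prefill_latency_s = 10.0;   // baseline prefill time (assumed)
  double decode_tokens     = 256.0;  // tokens generated (assumed)
  double decode_tput       = 50.0;   // tokens per second (assumed)
  double prefill_speedup   = 23.23;  // HMT prefill latency reduction (from the article)

  double decode_time  = decode_tokens / decode_tput;
  double baseline_e2e = prefill_latency_s + decode_time;
  double hmt_e2e      = prefill_latency_s / prefill_speedup + decode_time;

  std::printf("baseline end-to-end: %.2f s\n", baseline_e2e);
  std::printf("with HMT prefill:    %.2f s (%.2fx lower)\n",
              hmt_e2e, baseline_e2e / hmt_e2e);
  return 0;
}
```

The longer the prompt relative to the generated output, the closer the end-to-end gain approaches the prefill-only reduction, which is why long-context scenarios benefit most.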

The Scientists’ Method

Scientists unveiled FlexLLM, a composable High-Level Synthesis (HLS) library designed for the rapid development of domain-specific Large Language Model (LLM) accelerators. This innovative library empowers researchers to build stage-customized inference systems, enabling hybrid designs that optimise temporal reuse and spatial dataflow for both prefill and decode stages, a significant departure from unified-design paradigms. FlexLLM incorporates a comprehensive quantization suite, supporting accurate low-bit deployment and offering the most advanced quantization support currently available among LLM accelerator frameworks. The team engineered a complete inference system for the Llama-3.2 1B model in under two months, utilising only 1,000 lines of code.
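One way to picture what “composable” means for an HLS library is shown below: per-layer kernels expressed as small C++ callables that templates stitch into a pipeline. The types and stage names are hypothetical stand-ins, not FlexLLM’s real interfaces.

```cpp
// Illustrative sketch only: composing per-layer kernels into a pipeline with
// C++ templates. All types and names here are hypothetical, not FlexLLM's API.
#include <cstddef>

template <int DIM>
struct Tile { float data[DIM]; };

// Each stage is a callable taking an input tile and producing an output tile;
// Compose chains two stages into a single callable block.
template <typename Stage1, typename Stage2>
struct Compose {
  Stage1 s1;
  Stage2 s2;
  template <int DIM>
  Tile<DIM> operator()(const Tile<DIM>& in) const { return s2(s1(in)); }
};

struct RmsNorm {
  template <int DIM>
  Tile<DIM> operator()(const Tile<DIM>& in) const {
    // normalisation body omitted; placeholder pass-through
    return in;
  }
};

struct FeedForward {
  template <int DIM>
  Tile<DIM> operator()(const Tile<DIM>& in) const {
    // matmul + activation body omitted; placeholder pass-through
    return in;
  }
};

int main() {
  Compose<RmsNorm, FeedForward> block{};  // a decoder sub-block built by composition
  Tile<16> x{};                           // small tile for illustration
  Tile<16> y = block(x);
  (void)y;
  return 0;
}
```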

This system features a stage-customized accelerator with hardware-efficient quantization, achieving a WikiText-2 Perplexity (PPL) of 12.68, surpassing the baseline SpinQuant performance of 13.30. Furthermore, the study pioneered a Hierarchical Memory Transformer (HMT) plug-in to facilitate efficient processing of long-context sequences, addressing a critical limitation in current LLM architectures. Experiments employed the U280 FPGA at 16nm, demonstrating that the accelerator achieves a 1.29x end-to-end speedup, 1.64x higher decode throughput, and 3.14x improved energy efficiency over an NVIDIA A100 GPU, with projections on the V80 FPGA at 7nm reaching a 4.71x speedup, 6.55x higher decode throughput, and 4.13x improved energy efficiency.
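For readers unfamiliar with the metric, WikiText-2 perplexity figures such as 12.68 versus 13.30 are simply the exponential of the mean per-token negative log-likelihood over the evaluation set, with lower being better. A minimal sketch of that computation follows; the paper’s actual evaluation pipeline is not reproduced here.

```cpp
// Illustrative sketch only: computing perplexity (PPL) from per-token negative
// log-likelihoods. The token NLLs below are made-up placeholders.
#include <cmath>
#include <cstdio>
#include <vector>

double perplexity(const std::vector<double>& token_nll) {
  // PPL = exp(mean negative log-likelihood over all evaluated tokens)
  double sum = 0.0;
  for (double nll : token_nll) sum += nll;
  return std::exp(sum / static_cast<double>(token_nll.size()));
}

int main() {
  std::vector<double> nll = {2.1, 2.7, 2.4, 2.9, 2.5};  // hypothetical values
  std::printf("perplexity = %.2f\n", perplexity(nll));
  return 0;
}
```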

The researchers meticulously profiled compute and memory bandwidth utilisation during the prefill and decode stages of the BF16 Llama-3.2 1B model. Integrating the HMT plug-in demonstrably reduces prefill latency by 23.23x and extends the context window by 64x, delivering 1.10x/4.86x lower end-to-end latency and 5.21x/6.27x higher energy efficiency on the U280/V80 compared to the A100 baseline, with minimal resource overhead (less than 7.5%) and latency impact (0.6%). FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators, significantly reducing manual effort and enabling rapid prototyping of novel LLM acceleration techniques.
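The divergence between the two stages is easy to see with a rough roofline-style calculation: prefill amortises each weight fetch over a whole tile of tokens, while decode reads the same weights for a single token, so its arithmetic intensity is far lower and the stage tends to be memory-bandwidth-bound. The layer sizes in the sketch below are assumed placeholders, not the Llama-3.2 1B configuration.

```cpp
// Illustrative arithmetic only: arithmetic intensity (FLOPs per byte of weight
// traffic) for a d_in x d_out layer processed with T tokens. FLOPs grow with T,
// weight traffic does not, so prefill (large T) is compute-heavy while decode
// (T = 1) is memory-bound. Dimensions are hypothetical placeholders.
#include <cstdio>

int main() {
  const double d_in = 2048, d_out = 2048;   // assumed layer dimensions
  const double bytes_per_weight = 2.0;      // BF16 weights
  for (double tokens : {1.0, 128.0}) {      // decode vs prefill tile
    double flops = 2.0 * tokens * d_in * d_out;       // multiply-accumulate count
    double bytes = bytes_per_weight * d_in * d_out;   // weights read once per tile
    std::printf("tokens=%4.0f  arithmetic intensity = %.1f FLOP/byte\n",
                tokens, flops / bytes);
  }
  return 0;
}
```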

Using FlexLLM, the team built a complete inference system for the Llama-3.2 1B model in under two months, utilising only 1,000 lines of code and demonstrating the library’s efficiency. This system incorporates a stage-customized accelerator featuring hardware-efficient quantization, achieving a WikiText-2 Perplexity (PPL) of 12.68 and surpassing the performance of the SpinQuant baseline. Additionally, a Hierarchical Memory Transformer (HMT) plug-in was integrated for enhanced long-context processing capabilities.

Experiments conducted on the AMD U280 FPGA at 16nm revealed the accelerator achieves a 1.29× end-to-end speedup over an NVIDIA A100 GPU. The team measured a 1.64× higher decode throughput and a remarkable 3.14× improvement in energy efficiency using the same comparison point. Projected results on the V80 FPGA at 7nm indicate even greater performance gains, reaching 4.71× speedup, 6.55× higher decode throughput, and 4.13× better energy efficiency. These measurements confirm substantial improvements in computational performance and power consumption. Further investigation into long-context scenarios demonstrated the HMT plug-in’s effectiveness, reducing prefill latency by 23.23× and extending the context window by 64×.

Integrating the HMT plug-in delivered 1.10×/4.86× lower end-to-end latency and 5.21×/6.27× higher energy efficiency on the U280/V80 relative to the A100 baseline. Data shows the system effectively addresses the divergent compute and memory behaviours inherent in LLM prefill and decode stages, optimising each for its specific requirements. The breakthrough delivers a flexible architecture capable of stage-customisation, allowing for tailored temporal reuse and spatial dataflow for both prefill and decode processes. Researchers recorded that FlexLLM bridges the gap between algorithmic innovation in LLM inference and the development of high-performance accelerators with minimal manual effort, paving the way for faster and more efficient LLM deployment. This work represents a significant step towards creating domain-specific accelerators that can meet the growing demands of LLMs across diverse deployment environments.

FlexLLM rapidly builds efficient LLM accelerators

Scientists have developed FlexLLM, a composable High-Level Synthesis (HLS) library designed for the rapid creation of domain-specific Large Language Model (LLM) accelerators. This framework allows for customisation of accelerator architecture for different stages of inference, enabling hybrid designs that optimise both temporal reuse and spatial dataflow for prefill and decode processes. FlexLLM also incorporates a comprehensive quantization suite, supporting accurate low-bit deployment of LLMs. Researchers successfully used FlexLLM to build a complete inference system for the Llama-3.2 1B model in under two months with only 1,000 lines of code.

The resulting system features a stage-customized accelerator with improved hardware-efficient quantization, and a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing, demonstrating a 1.29x end-to-end speedup, 1.64x higher decode throughput, and 3.14x improved energy efficiency over an NVIDIA A100 GPU. The HMT plug-in reduced prefill latency by 23.23x and extended the context window by 64x, achieving up to 1.10x/4.86x lower latency and 5.21x/6.27x higher energy efficiency on the U280/V80. The authors acknowledge that their work focuses on a specific model size (Llama-3.2 1B) and hardware platforms (U280 and V80 FPGAs), which may limit the direct generalisability of their results. Future research directions include extending FlexLLM to support larger models and diverse hardware platforms, as well as exploring its application to next-generation Transformer models. This work establishes a foundation for domain-specific LLM acceleration and composable HLS methodologies, effectively bridging algorithmic innovation with high-performance hardware implementation with minimal manual effort.

👉 More information
🗞 FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design
🧠 ArXiv: https://arxiv.org/abs/2601.15710

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
