The increasing need for efficient processing of long sequences on devices like smartphones and embedded systems drives innovation in artificial intelligence models, and State Space Models such as Mamba offer a promising solution due to their computational advantages. Linfeng Zhong, Songqiang Xu, and Huifeng Wen, from the Institute for Artificial Intelligence at Peking University, along with colleagues Tong Xie, Qingyu Guo, and Yuan Wang, have overcome significant hurdles in accelerating Mamba inference using a technique called speculative decoding. Their work introduces SpecMamba, a novel accelerator designed for Field Programmable Gate Arrays (FPGAs) that combines innovative system design, algorithmic improvements, and custom hardware architecture. This co-design approach achieves a substantial 2.27x speedup over current GPU-based methods and a 2.85x improvement over previous FPGA implementations, while also dramatically increasing energy efficiency, marking a significant step forward in deploying powerful AI models on edge devices.
Researchers are exploring speculative decoding, a technique that pairs fast draft-model generation with target-model verification to accelerate processing. A small draft model quickly proposes a sequence of tokens, which the full, more accurate LLM then verifies. Tokens that pass verification are accepted, saving significant computation; at the first mismatch, the full LLM supplies the correct token itself. Many approaches use a tree structure to explore multiple draft sequences in parallel, increasing the probability of a successful draft and reducing latency. Directly applying speculative decoding to State Space Models, however, presents challenges: the recurrent hidden state must be managed and restored when drafts are rejected, the sequential recurrence is incompatible with tree-based parallel verification, and the resulting hardware workload is distributed inefficiently.
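To make the draft-and-verify loop concrete, here is a minimal greedy sketch in Python. The `draft_model` and `target_model` callables, the 1-D token tensor, and the greedy acceptance rule are illustrative assumptions for exposition, not the paper's implementation; production systems typically add probabilistic acceptance and state caching.

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One greedy speculative-decoding step: draft k tokens with the small
    model, then verify them all with a single pass of the target model."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    ctx = prefix.clone()
    draft_tokens = []
    for _ in range(k):
        logits = draft_model(ctx)               # (len(ctx), vocab_size)
        nxt = logits[-1].argmax().view(1)
        draft_tokens.append(nxt)
        ctx = torch.cat([ctx, nxt])

    # 2. Verify: the target model scores prefix + draft in one parallel pass.
    preds = target_model(ctx).argmax(dim=-1)    # target's next-token choices

    # 3. Accept the longest agreeing prefix of the draft; on the first
    #    mismatch, substitute the target model's own token and stop.
    accepted = prefix.clone()
    for i, tok in enumerate(draft_tokens):
        pos = len(prefix) + i - 1               # logits here predict token i
        if preds[pos] == tok:
            accepted = torch.cat([accepted, tok])
        else:
            accepted = torch.cat([accepted, preds[pos].view(1)])
            break
    else:
        # Every draft token matched: keep the target's free "bonus" token too.
        accepted = torch.cat([accepted, preds[-1].view(1)])
    return accepted
```

Note how step 2 is where the savings come from: the target model evaluates all k draft positions in one forward pass instead of k sequential ones.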
Prior work has implemented these techniques on FPGAs to achieve higher performance and energy efficiency than CPUs or GPUs, exploring approaches such as smaller models for draft generation, speculative sampling, and Snakes and Ladders. Verification strategies range from full LLM verification to dedicated verifier models. Reported experiments demonstrate significant gains: speedups in LLM inference latency often ranging from 2x to 10x or more, improved energy efficiency, increased throughput, and scalability to larger LLMs and workloads, with FPGA implementations consistently outperforming CPU and GPU implementations in both performance and energy efficiency. The key takeaways are that speculative decoding is a promising technique for accelerating LLM inference, that FPGA implementations offer significant performance and energy-efficiency advantages, and that hardware-software co-design is crucial for achieving optimal performance.
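For reference, the standard speculative-sampling acceptance rule used throughout this literature keeps a draft token with probability min(1, p_target/p_draft) and, on rejection, resamples from the normalized residual distribution, which makes the output follow the target distribution exactly. A minimal sketch, with illustrative names:

```python
import numpy as np

def accept_or_resample(p_target, p_draft, token, rng):
    """p_target, p_draft: probability vectors over the vocabulary;
    token: the draft model's sampled token id; rng: np.random.Generator."""
    # Keep the draft token with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    # Otherwise resample from the normalized residual distribution, so the
    # overall procedure still samples exactly from p_target.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```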
SpecMamba Accelerates State Space Models with Speculation
Researchers have developed SpecMamba, a groundbreaking FPGA-based accelerator that significantly boosts the performance of Mamba, a state-of-the-art State Space Model, by incorporating speculative decoding. This work addresses key challenges in adapting speculative decoding to the sequential nature of Mamba models, particularly for deployment on edge devices. The team overcame difficulties with hidden state recovery, incompatibility with tree-based parallel verification, and hardware workload imbalance to achieve substantial improvements in both speed and energy efficiency. At the core of SpecMamba is a novel memory-aware hybrid backtracking strategy, which intelligently combines off-chip storage of draft model states with on-chip caching of target model activations.
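The following toy sketch illustrates one plausible reading of that division of labor, using a simplified diagonal SSM recurrence; the class names, buffers, and recurrence are stand-ins for exposition, not SpecMamba's actual data structures:

```python
import numpy as np

class DraftSSM:
    """Draft model: checkpoint every hidden state to (slow) off-chip memory,
    so a rejected token is undone by a simple load."""
    def __init__(self, state):
        self.state = state
        self.off_chip = []                       # models DRAM checkpoints

    def step(self, x, A, B):
        self.off_chip.append(self.state.copy())  # save state before updating
        self.state = A * self.state + B * x      # simplified SSM recurrence
        return self.state

    def backtrack(self, n_rejected):
        # Roll back by reloading the checkpoint at the last accepted token.
        self.state = self.off_chip[-n_rejected]
        del self.off_chip[-n_rejected:]

class TargetSSM:
    """Target model: its states are too large to checkpoint off-chip every
    step, so cache the per-token activations on-chip and recompute instead."""
    def __init__(self, state):
        self.base_state = state.copy()           # state at last accepted token
        self.on_chip_x = []                      # models on-chip activation cache

    def verify(self, xs, A, B):
        # Sequential scan over the draft tokens; per-token states feed the
        # verification logits (omitted here).
        self.on_chip_x = list(xs)
        s, states = self.base_state.copy(), []
        for x in xs:
            s = A * s + B * x
            states.append(s.copy())
        return states

    def commit(self, n_accepted, A, B):
        # Recompute forward only through the accepted tokens; the rejected
        # suffix is discarded without ever touching off-chip memory.
        s = self.base_state
        for x in self.on_chip_x[:n_accepted]:
            s = A * s + B * x
        self.base_state = s
```

The asymmetry plausibly reflects the models' sizes: the draft model's states are small enough to checkpoint off-chip cheaply, while for the large target model it is cheaper to cache activations on-chip and recompute than to store every intermediate state.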
This system-level optimization is paired with a new algorithm, FIFO-based tree verification with tiling, which enables efficient full-tree verification within Mamba's sequential processing framework. Furthermore, the hardware architecture is customized with a linear-parallel, SSM-sequential dataflow that schedules linear layer computations in parallel and SSM computations in series to maximize hardware utilization and the overlap of operations (a rough sketch of this dataflow follows the results below).
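As a rough illustration of the FIFO idea (though not the paper's tiled hardware scheme), a breadth-first scan can carry each branch's hidden state through a queue so that every candidate path in the draft tree is scored in one pass; all names and the diagonal recurrence below are assumptions:

```python
from collections import deque

def fifo_tree_verify(base_state, tree, A, B):
    """tree: dict node_id -> (parent_id, x_embedding); root parents are None.
    Returns the SSM hidden state at every node of the draft token tree."""
    children = {}
    for nid, (pid, _) in tree.items():
        children.setdefault(pid, []).append(nid)

    states = {}
    fifo = deque(children.get(None, []))   # start from the root tokens
    while fifo:
        nid = fifo.popleft()
        pid, x = tree[nid]
        parent = base_state if pid is None else states[pid]
        states[nid] = A * parent + B * x   # branch-local recurrence step
        fifo.extend(children.get(nid, [])) # enqueue this node's children
    return states
```

Where a Transformer handles tree verification with an attention mask, an SSM's recurrence forces this kind of per-branch state bookkeeping, which is presumably why a dedicated scheme (plus tiling to fit on-chip buffers) is needed.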
Experiments demonstrate that SpecMamba achieves a 2.27x speedup compared to leading GPU baselines and a 5.41x improvement in energy efficiency. Compared to previous FPGA-based solutions, SpecMamba delivers a 2.85x increase in execution speed and a 1.26x improvement in energy efficiency. Specifically, on an AMD VCK190 FPGA platform with DDR memory, the system processes Mamba2-2.7B at a rate of 313 tokens per second, while on a VHK158 platform with HBM memory it achieves 276 tokens per second. These results confirm that SpecMamba represents a significant advancement in accelerating Mamba models for resource-constrained edge computing applications.
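Here is the promised sketch of the linear-parallel, SSM-sequential dataflow. On the real accelerator the two kinds of work are pipelined and overlapped; this toy Python version, with illustrative shapes and a simplified diagonal recurrence, only shows which work batches cleanly across tokens and which is inherently sequential:

```python
import numpy as np

def mamba_block_sketch(xs, W_in, W_out, A, B, C, state):
    """xs: (n_tokens, d_model) draft-token activations; state: (d_inner,)."""
    # Parallel phase: one GEMM projects every draft token at once.
    U = xs @ W_in                          # (n_tokens, d_inner)

    # Sequential phase: the SSM scan is inherently recurrent.
    ys = []
    for u in U:
        state = A * state + B * u          # simplified diagonal recurrence
        ys.append(C * state)
    ys = np.stack(ys)

    # Parallel phase again: output projection as a single GEMM.
    return ys @ W_out, state
```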
FPGA Accelerates Speculative State Space Model Inference
SpecMamba represents a significant advancement in efficient sequence modeling for edge devices, specifically addressing the computational demands of State Space Models like Mamba. Researchers developed the first FPGA-based accelerator capable of performing Mamba inference with speculative decoding, a technique that speeds up processing by generating draft results and verifying them. The team’s approach involved co-design at the system, algorithm, and hardware levels.
They implemented a memory-aware hybrid backtracking strategy to coordinate the draft and target models, a FIFO-based tree verification scheme to enable full candidate token verification, and a dataflow architecture that balances parallel and sequential processing. Implemented on FPGA platforms, SpecMamba achieves a 2.27x speedup compared to GPU baselines and a 2.85x improvement over previous FPGA designs, while also demonstrating substantially higher energy efficiency. These results validate the potential of State Space Models for high-throughput edge applications and establish a new paradigm for co-designing sequential state-space architectures.
👉 More information
🗞 SpecMamba: Accelerating Mamba Inference on FPGA with Speculative Decoding
🧠 ArXiv: https://arxiv.org/abs/2509.19873
