Ghidorah, a new large language model inference system, accelerates on-device processing by combining speculative decoding with workload distribution across heterogeneous processing units. Its hetero-core model parallelism and architecture-aware profiling, together with optimised sparse computation on ARM CPUs, achieve up to a 7.6x speedup in decoding on a Jetson NX.
The increasing demand for on-device artificial intelligence necessitates efficient methods for deploying large language models (LLMs) directly on end-user hardware. Current limitations in memory bandwidth often restrict the potential of modern, multi-core processors. Researchers at Sun Yat-sen University and Beijing Normal University address this challenge with a novel inference system detailed in their paper, ‘Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism’. Jinhui Wei, Ye Huang, Yuhui Zhou, Jiangsu Du, and Jiazhi Jiang present a system that combines speculative decoding – a technique to predict subsequent tokens and increase parallelism – with a heterogeneous computing approach, distributing workloads across diverse processing units to optimise performance. Their work introduces the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach, achieving a reported 7.6x speedup in LLM decoding on a Jetson NX platform.
Optimising Large Language Model Inference for End-User Devices
The efficient execution of large language models (LLMs) on end-user devices remains a significant challenge, with memory bandwidth consistently identified as a primary performance bottleneck. This limitation restricts the full utilisation of the heterogeneous processing units – comprising diverse cores with varying computational strengths – present in modern hardware. Current research focuses on enhancing parallelism and intelligently distributing computational workloads across these cores to overcome this constraint.
The Ghidorah system exemplifies this approach, achieving substantial performance gains through a combination of speculative decoding and hetero-core model parallelism (HCMP). Speculative decoding accelerates text generation by cheaply drafting several candidate tokens ahead of time and then verifying them together in a single pass of the full model, so each memory-bound decoding step yields more than one token on average. HCMP, in turn, exploits the unified memory architecture – where the CPU and GPU share the same physical memory – common in end-user devices to partition the LLM across heterogeneous cores, adapting the split to the computational demands of speculative decoding for more efficient workload distribution.
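To make the first of these ideas concrete, the sketch below shows the generic draft-and-verify loop that speculative decoding builds on: a small draft model proposes a handful of tokens, the full model checks them all in one parallel pass, and the longest agreeing prefix is kept. The `draft_model` and `target_model` objects and the greedy acceptance rule are illustrative assumptions for exposition, not Ghidorah's actual implementation.

```python
# Minimal sketch of generic speculative decoding (draft-and-verify).
# `draft_model` and `target_model` are assumed stand-ins for a small
# proposal model and the full LLM; Ghidorah's actual scheme may differ.

def speculative_decode_step(draft_model, target_model, context, gamma=4):
    """Generate up to gamma + 1 tokens from a single full-model pass."""
    # 1. Draft: the cheap model proposes `gamma` candidate tokens sequentially.
    draft_tokens = []
    draft_context = list(context)
    for _ in range(gamma):
        token = draft_model.greedy_next_token(draft_context)
        draft_tokens.append(token)
        draft_context.append(token)

    # 2. Verify: the full model scores every drafted position in one parallel
    #    pass (returning gamma + 1 predictions), which uses memory bandwidth
    #    far better than gamma separate sequential steps.
    target_tokens = target_model.greedy_next_tokens(context, draft_tokens)

    # 3. Accept the longest matching prefix; at the first disagreement,
    #    keep the full model's own token instead and stop.
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)
            break
    else:
        # All drafts accepted: the verification pass also yields a bonus token.
        accepted.append(target_tokens[-1])
    return accepted
```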
The researchers employ an architecture-aware profiling (ARCA) approach to determine the optimal speculative and partitioning strategies. ARCA balances the acceptance rate of predicted tokens – the proportion of drafted tokens the full model confirms – against the degree of achievable parallelism, maximising overall speedup. Optimisation also extends to sparse computation on ARM CPUs, further improving performance on resource-constrained devices.
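The trade-off ARCA navigates can be illustrated with the standard speculative-decoding expectation: with acceptance rate α and draft length γ, each verification pass emits (1 − α^(γ+1)) / (1 − α) tokens on average, so longer drafts only pay off if enough of them are accepted and the parallel verification stays cheap on the chosen core partition. The sketch below uses that expression with a simple, assumed cost model; the function names and profiling inputs are hypothetical, not the paper's actual formulation.

```python
# A simplified illustration of the acceptance-rate / parallelism trade-off
# that a profiling step like ARCA must resolve. The expectation below is the
# standard speculative-decoding result under an i.i.d. acceptance probability
# `alpha`; the cost model and the search are assumptions, not Ghidorah's
# actual profiler.

def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass with draft length gamma."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def best_draft_length(alpha: float, draft_ms: float, verify_ms, max_gamma=8):
    """Pick the draft length maximising expected tokens per millisecond.

    draft_ms:  measured cost of one draft-model step.
    verify_ms: callable mapping gamma to the cost of verifying gamma + 1
               tokens in parallel on the chosen CPU/GPU partition.
    """
    def throughput(gamma):
        round_ms = gamma * draft_ms + verify_ms(gamma)
        return expected_tokens_per_round(alpha, gamma) / round_ms
    return max(range(1, max_gamma + 1), key=throughput)

# Example: 80% acceptance, 1 ms per draft step, and a verification cost that
# grows mildly with the number of tokens checked in parallel.
print(best_draft_length(0.8, 1.0, lambda g: 6.0 + 0.5 * g))
```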
Experimental results demonstrate that Ghidorah achieves up to a 7.6x speedup in the dominant LLM decoding phase over sequential decoding on a Jetson NX device, a considerable improvement in processing speed and efficiency.
Beyond Ghidorah, broader research encompasses distributed inference (using multiple devices for computation), model parallelism (dividing a model across multiple processors), and hardware acceleration, all aimed at overcoming the limitations of single-device processing. The increasing availability of open-source LLMs, such as Llama, is also driving innovation and accessibility in the field.
Future work will likely focus on refining speculative decoding strategies, developing more sophisticated partitioning algorithms, and exploring novel hardware architectures specifically tailored for LLM inference. A key area for investigation involves dynamically adapting workload distribution based on real-time performance metrics and model characteristics, ensuring optimal performance and efficiency.
Continued development of open-source LLMs and exploration of novel hardware architectures promise to further accelerate progress. By addressing the challenges of memory bandwidth, computational complexity, and energy efficiency, researchers aim to unlock the full potential of LLMs and make them accessible to a wider range of users and applications.
👉 More information
🗞 Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
🧠 DOI: https://doi.org/10.48550/arXiv.2505.23219
