Faster On-Device AI: Ghidorah Optimises Large Language Model Inference

Ghidorah, a new large language model inference system, accelerates on-device processing by combining speculative decoding with workload distribution across heterogeneous processing units. Its hetero-core model parallelism and architecture-aware profiling deliver up to a 7.6x speedup in the decoding phase on an NVIDIA Jetson NX, aided by optimised sparse computation on ARM CPUs.

The increasing demand for on-device artificial intelligence necessitates efficient methods for deploying large language models (LLMs) directly on end-user hardware. Current limitations in memory bandwidth often restrict the potential of modern, multi-core processors. Researchers at Sun Yat-sen University and Beijing Normal University address this challenge with a novel inference system detailed in their paper, ‘Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism’. Jinhui Wei, Ye Huang, Yuhui Zhou, Jiangsu Du, and Jiazhi Jiang present a system that combines speculative decoding – a technique to predict subsequent tokens and increase parallelism – with a heterogeneous computing approach, distributing workloads across diverse processing units to optimise performance. Their work introduces the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach, achieving a reported 7.6x speedup in LLM decoding on a Jetson NX platform.

Optimising Large Language Model Inference for End-User Devices

The efficient execution of large language models (LLMs) on end-user devices remains a significant challenge, with memory bandwidth consistently identified as a primary performance bottleneck. This limitation restricts the full utilisation of the heterogeneous processing units – comprising diverse cores with varying computational strengths – present in modern hardware. Current research focuses on enhancing parallelism and intelligently distributing computational workloads across these cores to overcome this constraint.
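
To see why bandwidth, rather than raw compute, dominates, a rough back-of-envelope estimate helps (all numbers below are illustrative assumptions, not figures from the paper): during autoregressive decoding, each generated token requires streaming essentially the full set of model weights from memory.

```python
# Rough upper bound on memory-bound decode throughput.
# All figures are illustrative assumptions, not values from the paper.
model_bytes = 7e9 * 2        # e.g. a 7B-parameter model stored in FP16
bandwidth = 100e9            # e.g. ~100 GB/s on a Jetson-class device

# Each decoded token streams roughly all weights through memory once,
# so throughput is capped near bandwidth / model size.
tokens_per_second = bandwidth / model_bytes
print(f"~{tokens_per_second:.1f} tokens/s upper bound")  # ~7.1 tokens/s
```

Speculative decoding attacks exactly this cap: verifying several drafted tokens in one pass amortises each sweep through the weights over more than one output token.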

The Ghidorah system exemplifies this approach, achieving substantial performance gains through a combination of speculative decoding and hetero-core model parallelism (HCMP). Speculative decoding accelerates text generation by cheaply predicting several upcoming tokens and then verifying those predictions in parallel, rather than generating one token at a time. HCMP, in turn, exploits the unified memory architecture common in end-user devices – where the CPU and GPU share the same physical memory – to partition the LLM across heterogeneous cores, adapting the partition to the computational demands of speculative decoding.
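
A minimal sketch of the speculate-then-verify loop makes this concrete (a greedy-verification variant; `draft_model`, `target_model`, and their methods are hypothetical placeholders, not Ghidorah's actual interfaces):

```python
def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One speculate-then-verify step: a small draft model proposes k
    tokens sequentially; the large target model scores all k positions
    in one batched pass and the longest correct prefix is kept."""
    context = list(prefix)
    proposed = []
    for _ in range(k):                      # cheap sequential drafting
        tok = draft_model.next_token(context)
        proposed.append(tok)
        context.append(tok)

    # One batched forward pass over all k draft positions at once: this
    # is the parallelism that speculative decoding exposes.
    verified = target_model.next_tokens(prefix, proposed)

    accepted = []
    for p, v in zip(proposed, verified):
        if p == v:
            accepted.append(p)              # prediction confirmed
        else:
            accepted.append(v)              # first mismatch: keep the
            break                           # target's token and stop
    return accepted
```

Because the k verification positions are independent given the draft, the expensive model runs them as a single batched pass, and it is this widened workload that HCMP can spread across CPU and GPU cores.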

Researchers employ an architecture-aware profiling (ARCA) approach to determine the optimal speculative and partitioning strategies. ARCA balances the acceptance rate of predicted tokens – the proportion of correct predictions – with the degree of achievable parallelism, maximising overall speedup. Optimisation also extends to sparse computation on ARM CPUs, further enhancing performance on resource-constrained devices.
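
As a hypothetical illustration of that trade-off (a sketch in the spirit of ARCA, not the paper's actual algorithm), one could profile drafting and verification costs on the target device and pick the draft length that maximises expected throughput:

```python
# Hypothetical sketch of an architecture-aware profiling trade-off:
# choose the draft length k that maximises expected decode throughput,
# balancing the token acceptance rate against the profiled cost of
# wider parallel verification. Not the paper's actual algorithm.

def expected_tokens(alpha: float, k: int) -> float:
    # With per-token acceptance rate alpha, a length-k draft yields on
    # average (1 - alpha**(k + 1)) / (1 - alpha) tokens per verify step.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def best_draft_length(alpha, draft_cost, verify_cost, k_max=8):
    # draft_cost(k) and verify_cost(k) stand in for latencies measured
    # on the target device for drafting and batch-verifying k tokens.
    def throughput(k):
        return expected_tokens(alpha, k) / (draft_cost(k) + verify_cost(k))
    return max(range(1, k_max + 1), key=throughput)

# Example with made-up profiled costs: drafting is cheap and linear in
# k, while verification is one batched pass that widens slowly with k.
k = best_draft_length(alpha=0.7,
                      draft_cost=lambda k: 2.0 * k,
                      verify_cost=lambda k: 20.0 + 0.5 * k)
print(f"chosen draft length: {k}")
```

The intuition is that a higher acceptance rate rewards longer drafts, while slower verification on a given core mix penalises them; profiling both on the actual hardware lets the system settle the balance per device.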

Experimental results show that Ghidorah achieves up to a 7.6x speedup in the dominant decoding phase, relative to sequential decoding, on a Jetson NX device.

Beyond Ghidorah, broader research encompasses distributed inference (using multiple devices for a single computation), model parallelism (dividing a model across multiple processors), and hardware acceleration, all aimed at overcoming the limits of single-device processing. The increasing availability of open-source LLMs, such as Llama, is also driving innovation and accessibility in the field.

Future work will likely focus on refining speculative decoding strategies, developing more sophisticated partitioning algorithms, and exploring novel hardware architectures specifically tailored for LLM inference. A key area for investigation involves dynamically adapting workload distribution based on real-time performance metrics and model characteristics, ensuring optimal performance and efficiency.

Continued development of open-source LLMs and exploration of novel hardware architectures promise to further accelerate progress. By addressing the challenges of memory bandwidth, computational complexity, and energy efficiency, researchers aim to unlock the full potential of LLMs and make them accessible to a wider range of users and applications.

👉 More information
🗞 Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
🧠 DOI: https://doi.org/10.48550/arXiv.2505.23219
