Benchmarking of the Qualcomm Cloud AI 100 Ultra accelerator demonstrates competitive energy efficiency and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs when serving 15 open-source large language models, ranging from 117 million to 90 billion parameters, using the vLLM framework.
The increasing demand for artificial intelligence applications necessitates efficient hardware capable of supporting large language models (LLMs). This is particularly true within high-performance computing (HPC) environments where energy consumption and computational throughput are critical considerations. John J. Graham, Mohammad Firas Sada, and colleagues from the University of California, San Diego, address this challenge in their study, ‘Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs’. Their research benchmarks the Qualcomm Cloud AI 100 Ultra (QAic) accelerator against established GPUs from NVIDIA and AMD, evaluating performance and energy efficiency when serving fifteen open-source LLMs, ranging in size from 117 million to 90 billion parameters, within the National Research Platform (NRP) ecosystem. The team utilises the vLLM framework to assess the potential of QAic for demanding HPC applications.
Within the National Research Platform (NRP) ecosystem, Qualcomm’s Cloud AI 100 Ultra (QAic) accelerator delivers competitive large language model (LLM) inference performance against leading NVIDIA and AMD GPUs, combined with consistently lower power draw. The researchers evaluated fifteen diverse LLMs, spanning 117 million to 90 billion parameters, using the vLLM serving framework to measure both throughput, in tokens per second, and energy efficiency, in throughput per watt.
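Although the authors’ benchmark harness is not reproduced in the paper summary, the kind of tokens-per-second figure quoted throughout can be obtained with vLLM’s offline API. The following is a minimal sketch; the model ID, prompt, and batch size are illustrative assumptions, not the study’s actual configuration.

```python
# Hedged sketch: measuring generation throughput with vLLM's offline API.
# Model ID, prompt set and batch size are placeholders, not the paper's
# benchmark harness.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # any vLLM-supported model ID
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarise the role of AI accelerators in HPC clusters."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count only generated (output) tokens, the convention behind tok/s figures.
generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```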
The reported advantage of QAic grows with model size. The Llama-2-70B model, for example, reaches 11.2 tokens per second on QAic, well above the 7.1 tokens per second attained on comparable GPUs, suggesting that the architecture is particularly well suited to serving larger, more computationally demanding models.
The Mistral-7B model achieves the highest throughput on QAic, generating 62.3 tokens per second, though performance varies with model size and architecture. The researchers also report notable energy efficiency, a critical factor for high-performance computing deployments, where lower power draw translates directly into reduced operational cost, an increasingly important consideration as the energy demands of artificial intelligence continue to rise.
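Throughput per watt, the efficiency metric used in the study, can be approximated by sampling board power while a generation workload runs. The sketch below uses NVIDIA’s NVML bindings as an illustrative assumption; QAic power would instead be read through Qualcomm’s own management tools, and the sampling approach shown here is not the paper’s exact methodology.

```python
# Hedged sketch: estimate tokens per joule by sampling GPU board power
# with NVML while a workload runs. Illustrative only; QAic exposes power
# through Qualcomm's tooling, and the paper's method may differ.
import threading
import time

import pynvml

def tokens_per_joule(run_workload, device_index=0, period_s=0.1):
    """run_workload() must return the number of tokens it generated."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def sampler():  # record instantaneous board power in watts
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(period_s)

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.perf_counter()
    thread.start()
    tokens = run_workload()
    stop.set()
    thread.join()
    elapsed = time.perf_counter() - start
    mean_watts = sum(samples) / max(len(samples), 1)
    return (tokens / elapsed) / mean_watts  # (tok/s) / (J/s) = tok/J
```

Passing the generation loop from the previous sketch as `run_workload` yields a tokens-per-joule figure that is directly comparable across devices.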
Future work should investigate how QAic performance scales to even larger models and more complex architectures, to map both its limits and its headroom. Expanding the benchmark suite to a wider range of LLMs and tasks, such as question answering and text summarisation, would provide a more comprehensive assessment of QAic’s capabilities and versatility.
Exploring distributed inference across multiple QAic accelerators would enable still larger models and higher aggregate throughput; a sketch of the underlying technique appears below. Deeper comparative analysis, extending beyond the NVIDIA A100 and H200 and the AMD MI300A already examined, is needed to establish a fuller performance ranking across available accelerators. Finally, quantifying the latency of QAic inference, alongside its throughput, would give a more complete picture of its suitability for real-time, interactive applications.
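On GPUs, vLLM already shards a model across several devices via tensor parallelism, which is the natural baseline for the multi-accelerator scaling the authors propose. Whether this maps cleanly onto multiple QAic cards is exactly the open question they raise, so the snippet below illustrates the technique on GPUs, not a validated QAic deployment; the model ID and device count are assumptions.

```python
# Hedged sketch: vLLM tensor parallelism shards a large model across
# several devices. tensor_parallel_size=4 assumes four visible
# accelerators; multi-QAic serving remains future work in the paper.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```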
👉 More information
🗞 Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
🧠 DOI: https://doi.org/10.48550/arXiv.2507.00418
