The surging popularity of large language models (LLMs) presents a growing challenge: inference, the process of using these models to respond to queries, now consumes the vast majority of their energy. To address this critical need for understanding LLM power consumption, Chenxu Niu of Texas Tech University, Wei Zhang of the Texas Advanced Computing Center, and Jie Li, together with Yongjian Zhao, Tongyang Wang, and Xi Wang, introduce TokenPowerBench, a benchmark designed specifically for measuring and analysing the energy used during LLM inference. The tool combines a flexible configuration system, a measurement layer that captures power usage accurately without specialised equipment, and a detailed metrics pipeline that breaks down energy consumption by stage of the inference process. By making it easy to assess how different settings affect energy efficiency, TokenPowerBench helps researchers and developers optimise LLM deployments, reduce operating costs, and meet increasingly important sustainability goals. The team demonstrates its effectiveness across a range of widely used models, including Llama, Falcon, Qwen, and Mistral.
LLM Energy Use and Sustainable AI
This document summarizes research concerning sustainable artificial intelligence, focusing on the energy efficiency and performance of Large Language Models (LLMs). A central concern is the increasing energy demand of both training and, crucially, inference, which underscores the need to consider energy efficiency alongside traditional performance metrics like speed and accuracy. Several studies directly address benchmarking and measuring the energy consumption of LLM inference, with MLPerf repeatedly cited as a key benchmark suite that now includes power measurements. Researchers are also developing new tools and methodologies to measure energy consumption accurately at every level, from individual components to entire systems, while exploring hardware acceleration with GPUs and alternatives such as FPGAs.
Software optimization techniques, including quantization, pruning, distillation, and efficient kernel implementations, reduce model size and computational complexity without significant accuracy loss; Nvidia's TensorRT-LLM is cited as one such software framework. Researchers are also investigating specialized hardware architectures and exploring the use of LLMs themselves to automate and improve hardware design, assisting with RTL design, hardware Trojan detection, and code generation. Efficient data management and metadata handling are recognized as important for scientific data analysis and LLM applications, extending sustainability beyond energy efficiency to responsible resource utilization and reduced electronic waste. Together, these references depict a field actively grappling with the sustainability challenges of LLMs and moving in two key directions: reducing the energy footprint of LLMs, and leveraging LLMs to create more efficient and sustainable hardware designs, suggesting a potential path toward a more environmentally responsible AI future.
TokenPowerBench Measures LLM Inference Energy Use
Scientists developed TokenPowerBench, a novel benchmark designed to comprehensively measure and analyze power consumption during large language model (LLM) inference, motivated by the observation that inference accounts for over 90% of total power draw. The system integrates directly with vendor telemetry APIs to capture GPU, CPU, and memory power-sensor data without requiring specialized external power meters, making testing scalable and accessible. The core innovation lies in the metrics pipeline, which aligns every power sample with the prefill and decode phases of LLM inference, enabling researchers to pinpoint exactly where energy is consumed during the process. The study pioneers a phase-aware, token-level measurement approach, filling a critical gap left by existing benchmarks such as MLPerf. TokenPowerBench has been released as open source to facilitate broader adoption and further research, enabling researchers and developers to optimize energy efficiency and meet sustainability targets when deploying large language models.
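The phase-aligned idea described above can be sketched in a few lines: given a timestamped power trace (as sampled from vendor telemetry) and the prefill/decode boundary times of a request, integrate power over each interval to attribute joules per phase. This is a minimal illustration of the concept only; the function names and the trapezoidal-integration detail are assumptions, not TokenPowerBench's actual API.

```python
# Sketch of phase-aware energy attribution (illustrative names only,
# not TokenPowerBench's actual API). Power samples are (time_s, watts)
# pairs, e.g. read periodically from GPU telemetry.

def integrate_energy(samples, t_start, t_end):
    """Trapezoidal integration of (time_s, power_w) samples over [t_start, t_end]."""
    window = [(t, p) for t, p in samples if t_start <= t <= t_end]
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(window, window[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)  # area under the power curve
    return energy_j

def attribute_phases(samples, prefill_start, decode_start, decode_end):
    """Split one request's energy into prefill and decode joules."""
    return {
        "prefill_j": integrate_energy(samples, prefill_start, decode_start),
        "decode_j": integrate_energy(samples, decode_start, decode_end),
    }

# Example: 1 Hz trace; prefill runs 0-2 s at ~300 W, decode 2-6 s at ~150 W.
trace = [(0, 300.0), (1, 300.0), (2, 300.0),
         (3, 150.0), (4, 150.0), (5, 150.0), (6, 150.0)]
print(attribute_phases(trace, 0, 2, 6))
```

Dividing each phase's joules by the tokens it processed (prompt tokens for prefill, generated tokens for decode) then yields the token-level energy metrics the benchmark reports.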
Granular LLM Power Measurement with TokenPowerBench
The research team presents TokenPowerBench, a new benchmark that meticulously measures power consumption during large language model (LLM) inference, revealing detailed energy usage data not previously available. It tracks power usage at the GPU, node, and system levels and distinguishes between the prefill and decode stages of inference for each request. Results show that energy per token does not scale in proportion to parameter count within an LLM family: for Llama3, a 70-fold increase in parameters (from 1 billion to 70 billion) raised energy per token by 7.3×. Furthermore, the study demonstrates that inference engines such as TensorRT-LLM and vLLM reduce energy per token by 25-40% relative to Transformers, highlighting the impact of software optimization on energy efficiency. Analysis of batch size reveals that increasing the batch size initially lowers energy per token, while experiments also show that longer prompts increase energy consumption and that optimized parallelism strategies deliver significant energy savings.
LLM Inference Power Measurement and Metrics
TokenPowerBench represents a significant advancement in the evaluation of LLM inference energy consumption, giving researchers and operators the phase-level visibility needed to deploy models more efficiently and sustainably.
👉 More information
🗞 TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
🧠 ArXiv: https://arxiv.org/abs/2512.03024
