Inference Energy Consumption Diagnosed: LLM Tasks Show 25× Energy Differences

Researchers are increasingly focused on energy consumption within machine learning, recognising it as a critical computing resource. Jae-Won Chung, Ruofan Wu, and Jeff J. Ma, all from the University of Michigan and the ML.ENERGY Initiative, alongside Mosharaf Chowdhury, present a large-scale study investigating inference time and energy usage across 46 generative AI models, 7 tasks, and 1,858 configurations on H100 and B200 GPUs. This work is significant because it moves beyond simply measuring energy to diagnosing why differences occur, revealing order-of-magnitude variations: large language models can differ in energy use by a factor of 25 depending on the task, and video generation can consume over 100× the energy of image generation. The team proposes a framework linking time and energy consumption to latent metrics like memory and utilisation, offering a pathway to optimise performance and throughput per watt for power-constrained datacentres.

This research addresses a critical gap in understanding why certain AI configurations consume more energy than others, moving beyond simple measurement to detailed diagnosis and optimisation. The team's empirical findings reveal order-of-magnitude variations in energy usage, with large language model (LLM) task type leading to 25× energy differences and video generation sometimes consuming over 100× the energy of image generation. These observations highlight the urgent need for a deeper understanding of the underlying mechanisms governing energy consumption in modern AI systems.

The study unveils a framework for reasoning about time and energy consumption, positing that these metrics are determined by latent factors such as memory and GPU utilization, which are influenced by algorithm, software, and hardware layers. Researchers conducted controlled comparisons, examining knobs at both the model and system levels to pinpoint specific factors affecting energy consumption. Counterintuitively, the work establishes that lower precision does not always equate to faster or more energy-efficient performance, and increasing the number of GPUs can, in some cases, reduce total energy consumption by unlocking larger memory capacity. These unexpected findings provide crucial insight into the complex interplay of factors governing energy usage.
This framework extends beyond simple wall-clock time and energy measurements to encompass throughput per watt, a critical metric for power-constrained datacenters. Experiments show that energy consumption is governed by latent factors like memory availability, hardware utilization, and application constraints, which are not directly observable from end metrics alone. By meticulously analysing 1,858 configurations, the researchers observed that LLM task type significantly impacts energy usage, with problem-solving tasks consuming 25× more energy per response than text conversation tasks due to longer output token sequences. Furthermore, the research details how different model architectures (LLMs, multimodal LLMs, and diffusion models) vary in energy consumption, revealing that task type heavily influences output length and, consequently, energy usage. Specifically, the team compared Qwen 3 32B across tasks on a single B200 GPU, finding that problem-solving generated 11× more output tokens than text conversation, leading to a 23× increase in energy per response. This detailed analysis provides a foundation for optimising AI workloads and designing more energy-efficient AI infrastructure, paving the way for sustainable growth in the field.
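As a rough illustration of that decomposition, the sketch below treats energy per response as energy per token multiplied by output length, plugging in the reported ratios for Qwen 3 32B; the absolute baseline values are hypothetical, and only the 11× and 23× ratios come from the study.

```python
# Back-of-the-envelope decomposition: energy per response ≈ energy per token × output tokens.
# Baseline values are hypothetical; only the ratios (11× tokens, 23× energy) come from the article.

baseline_tokens = 300          # assumed output length for a text-conversation response
baseline_j_per_token = 0.5     # assumed energy per output token, in joules

conversation_energy = baseline_j_per_token * baseline_tokens

# Problem solving: ~11× more output tokens and ~23× more energy per response, which implies
# roughly a 23/11 ≈ 2.1× higher energy per token (longer sequences stress KV-cache memory,
# capping batch size and raising per-token cost).
problem_tokens = 11 * baseline_tokens
problem_energy = 23 * conversation_energy
problem_j_per_token = problem_energy / problem_tokens

print(f"Conversation:    {conversation_energy:.0f} J/response at {baseline_j_per_token:.2f} J/token")
print(f"Problem solving: {problem_energy:.0f} J/response at {problem_j_per_token:.2f} J/token "
      f"({problem_j_per_token / baseline_j_per_token:.1f}x per token)")
```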

Generative AI Inference on H100 and B200 GPUs

Scientists initiated a large-scale measurement study of inference time and energy consumption across 46 generative AI models, encompassing 7 tasks and 1,858 distinct configurations on both H100 and B200 GPUs. The research team meticulously measured GPU energy, recognising its dominance in datacenter power consumption at 50 to 70 percent, utilising the Zeus system for precise energy tracking and incorporating the latest LLMs like Qwen 3, DeepSeek R1, and GPT OSS alongside current diffusion models. A production-grade serving stack was implemented, employing vLLM 0.11.1 for LLMs and MLLMs and xDiT 0.4.5 for diffusion models, all running on NVIDIA H100 and B200 GPU nodes equipped with NVSwitch support. For LLMs, the study pioneered a method for identifying the steady state where batch size saturates, subsequently calculating energy per token by dividing the total steady-state energy consumption by the number of tokens generated during that period.
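A minimal sketch of that measurement loop, assuming Zeus's ZeusMonitor windowing API and a placeholder generation call, might look like the following; the exact steady-state window and token accounting used in the paper may differ.

```python
# Minimal sketch of per-token energy measurement with Zeus (assumptions noted in comments).
# generate_batch() and the output attribute names are placeholders, not the paper's serving stack.
from zeus.monitor import ZeusMonitor

def generate_batch(requests):
    """Placeholder for the serving loop (e.g. a vLLM engine running at a saturated batch size)."""
    raise NotImplementedError

monitor = ZeusMonitor(gpu_indices=[0])  # track energy on GPU 0 only

monitor.begin_window("steady_state")
outputs = generate_batch(requests)                 # run generation while the window is open
measurement = monitor.end_window("steady_state")   # returns measured time (s) and energy (J)

tokens_generated = sum(len(o.output_token_ids) for o in outputs)   # placeholder attribute name
energy_per_token = measurement.total_energy / tokens_generated     # joules per output token
print(f"{measurement.total_energy:.1f} J over {measurement.time:.1f} s -> {energy_per_token:.3f} J/token")
```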

Diffusion models, processed in batches, had their per-request energy determined by dividing the total energy consumed by the batch size. Researchers systematically swept batch sizes and GPU counts for each model, defaulting to BF16 precision, while also including LLMs and MLLMs with native FP8 weights as separate models for comparative analysis. This rigorous approach enabled the observation of order-of-magnitude variations, revealing that LLM task type could lead to 25× differences in energy consumption. The team further investigated the impact of input modality on energy usage, specifically comparing text, image, and video inputs for multimodal LLMs.
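A simplified, hypothetical version of such a sweep is sketched below; run_and_measure is a stand-in for the actual serving-and-measurement harness, and the sweep ranges are illustrative rather than the study's.

```python
# Hypothetical sweep over batch size and GPU count for a diffusion model.
# run_and_measure() is a stand-in that serves one measured batch and returns its total energy in joules.
batch_sizes = [1, 2, 4, 8, 16]
gpu_counts = [1, 2, 4, 8]

results = {}
for num_gpus in gpu_counts:
    for batch_size in batch_sizes:
        total_energy_j = run_and_measure(model="diffusion", num_gpus=num_gpus, batch_size=batch_size)
        # Requests in one diffusion batch finish together, so energy is split evenly per request.
        results[(num_gpus, batch_size)] = total_energy_j / batch_size

for (num_gpus, batch_size), j_per_request in sorted(results.items()):
    print(f"{num_gpus} GPU(s), batch {batch_size}: {j_per_request:.0f} J/request")
```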

Experiments employed Qwen 3 VL 8B on a single B200 GPU to isolate how input modality (text, image, or video) changes energy per token, while separate diffusion-model measurements showed that video generation sometimes consumes over 100× the energy of image generation. Crucially, the study connected energy consumption to latent metrics like memory and GPU utilization, establishing a framework for understanding the underlying mechanisms governing time and energy. By analysing KV cache utilization alongside batch size, scientists demonstrated that longer output sequences stress memory capacity, preventing larger batch sizes and increasing energy per token. This detailed methodology allowed the researchers to quantify how Problem Solving tasks, generating 10× more output tokens than Text Conversation, ultimately consumed 23× more energy per response, highlighting the significant influence of task type on overall energy expenditure. The work extends beyond simple measurement, providing a framework applicable to reasoning about the service capacity of power-constrained AI datacenters and offering insights into optimising resource allocation.
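The memory pressure behind that effect can be approximated with the standard KV-cache sizing rule of thumb; in the sketch below, every model dimension and the free-memory figure are illustrative assumptions, not values from the paper.

```python
# Rough KV-cache sizing: per-token cache = 2 (K and V) × layers × KV heads × head dim × bytes per element.
# All numbers below are illustrative; they are not the paper's measured configuration.
num_layers = 64
num_kv_heads = 8
head_dim = 128
bytes_per_element = 2  # BF16

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

free_memory_bytes = 100 * 1024**3  # assume ~100 GiB left for the KV cache after weights and activations

for avg_seq_len in (1_000, 11_000):  # short conversational outputs vs. long problem-solving outputs
    max_concurrent_requests = free_memory_bytes // (kv_bytes_per_token * avg_seq_len)
    print(f"~{avg_seq_len} tokens/request -> at most {max_concurrent_requests} requests fit in the KV cache")
```

Under these assumptions the long-output workload supports roughly an order of magnitude fewer concurrent requests, which is the mechanism the study points to for higher energy per token on Problem Solving tasks.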

Generative AI inference energy varies greatly

Scientists conducted a large-scale measurement study of inference time and energy consumption across 46 generative AI models, encompassing 7 tasks and 1,858 different configurations on H100 and B200 GPUs. Experiments revealed substantial variations in energy usage, with task type influencing energy consumption by a factor of 25; specifically, Problem Solving tasks consumed 25× more energy per response than Text Conversation. Data shows that longer output sequences in Problem Solving tasks limit batch size and increase energy per token, which for Qwen 3 32B resulted in 23× higher energy per response compared with Text Conversation. Researchers measured that video generation sometimes requires more than 100× the energy of image generation, highlighting the significant energy demands of visual generation tasks.

The team recorded that text plus image inputs used 1.1 to 5.2× the energy per token of text alone, while text plus video inputs consumed 1.3 to 15.0× more energy. Analysis of Qwen 3 VL models demonstrated that CPU-side vision preprocessing became a bottleneck, limiting batch size and increasing energy per token, even with ample GPU capacity. Measurements confirm that video inputs, which require more extensive CPU-side processing and generate more vision tokens, exhibited a smaller batch size and higher energy per token. Tests show that diffusion models exhibit energy consumption patterns not solely dictated by model size, but also by factors like the number of denoising steps, output resolution, and frame count.

Results demonstrate that generating a single video can consume between 26 kJ and 1.16 MJ, representing one to two orders of magnitude more energy than image generation. Specifically, CogVideoX 1.5 5B consumed more energy than Wan 2.1 14B due to higher resolution output, while HunyuanVideo reached 1.16 MJ by generating 129 frames at 720p. Scientists also investigated the relationship between batch size and performance metrics, observing that increasing batch size generally improves throughput but can also impact energy efficiency. Measurements for DeepSeek R1 and Qwen 3 Coder 30B showed how energy per token, tokens per second, median inter-token latency (ITL), and power trend as batch size varies, each normalized to a percentage of its maximum value. These findings establish a framework for understanding the latent metrics (memory and utilization) that govern time and energy consumption across the algorithm, software, and hardware layers, extending to throughput per watt for power-constrained datacenters.
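For a sense of scale, the reported per-video range converts directly to household units (1 kWh = 3.6 MJ); the snippet below performs only that unit conversion and adds no new data.

```python
# Convert the reported per-video energy range into watt-hours / kilowatt-hours for intuition.
J_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ

for label, joules in [("low end (26 kJ)", 26e3),
                      ("HunyuanVideo, 129 frames at 720p (1.16 MJ)", 1.16e6)]:
    kwh = joules / J_PER_KWH
    print(f"{label}: {kwh * 1000:.0f} Wh ({kwh:.3f} kWh)")
# -> roughly 7 Wh at the low end and about 0.32 kWh at the high end per generated video.
```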

Generative AI inference time and energy

Scientists have conducted a large-scale measurement study of inference time and energy consumption across 46 generative AI models, examining 7 tasks and 1,858 configurations on H100 and B200 GPUs. Their empirical findings reveal substantial variations in energy usage, with large language model task type potentially leading to a 25× difference and video generation sometimes consuming over 100× the energy of image generation. Differences in GPU utilization can also result in 3 to 5× variations in energy consumption. Researchers present a framework for understanding the mechanisms governing time and energy consumption, positing that these metrics are determined by latent factors such as memory and utilization.

These latent factors are, in turn, influenced by elements across the algorithm, software, and hardware layers, extending to throughput per watt, a key metric for power-constrained data centres. The authors acknowledge limitations including the fixed batch size of one used in their analysis, which could inflate energy numbers, and the generalizability of measurements taken on internal systems. This work establishes that understanding inference energy consumption, beyond simply measuring it, is crucial for optimising AI infrastructure. By mapping factors and their relationship to energy consumption, the framework moves beyond black-box observations, allowing for tracing how model, system, and application factors impact energy usage. Future research could build upon this framework to further refine energy optimisation strategies in machine learning systems.

👉 More information
🗞 Where Do the Joules Go? Diagnosing Inference Energy Consumption
🧠 ArXiv: https://arxiv.org/abs/2601.22076

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
