Large Language Models (LLMs) are rapidly transforming cloud services, but their intensive computational demands place a significant strain on energy resources. Qunyou Liu, Darong Huang, and colleagues at École Polytechnique Fédérale de Lausanne, along with Marina Zapater from HES-SO University of Applied Sciences and Arts Western Switzerland, address this challenge with a new framework called GreenLLM. The team recognises that LLM inference differs from typical GPU workloads, exhibiting distinct characteristics in its prefill and decode stages, and current power management systems fail to account for this asymmetry. GreenLLM dynamically adjusts GPU frequency based on request length and performance targets, minimising energy consumption without compromising speed or reliability, and represents a substantial step towards sustainable artificial intelligence. Across extensive testing with real-world data, the framework achieves up to a 34 percent reduction in energy use, demonstrating its potential for widespread impact.
LLM inference differs from traditional GPU workloads because it consists of two distinct stages with different characteristics: the compute-bound prefill phase, which processes all prompt tokens in parallel, prioritizes low time-to-first-token, and scales with prompt length; and the memory-bandwidth-bound decode phase, which generates output sequentially, token by token. Understanding these differing characteristics is crucial for optimising energy efficiency. Current hardware and software often treat both phases identically, missing opportunities for significant energy savings through tailored acceleration strategies. This research investigates the energy-performance trade-offs within LLM inference, aiming to develop a more nuanced approach to hardware-software co-design. The objective is to identify and implement acceleration techniques suited to each phase, reducing overall energy consumption without compromising performance.
SLO-Aware Dynamic Frequency Scaling for LLMs
Architecting GreenLLM for Energy-Efficient LLM Serving
This document details GreenLLM, a system designed to improve the energy efficiency of Large Language Model (LLM) serving while maintaining Service Level Objectives (SLOs). GreenLLM prioritizes maintaining performance targets like latency and throughput, making power optimizations without sacrificing these key metrics. The system dynamically adjusts the GPU clock frequency based on current workload characteristics, reducing power consumption when possible without impacting performance. GreenLLM recognizes that LLM workloads have distinct phases (prefill and decode) with different resource demands, and aims for fine-grained control over GPU frequency, allowing precise adjustments based on real-time workload demands.
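The core idea above can be sketched as a frequency-selection routine: given predicted latency and power at each candidate clock, pick the lowest-energy clock that still meets the SLO. The clock list and the latency/power models below are illustrative placeholders, not measurements or formulas from the paper.

```python
# Hypothetical sketch of SLO-aware frequency selection. The candidate clocks
# and the toy latency/power models are assumptions for illustration only.

# Candidate GPU core clocks in MHz (typical of a data-center GPU).
FREQS_MHZ = [705, 810, 945, 1080, 1215, 1350, 1410]

def predicted_latency_ms(freq_mhz: float, base_ms: float = 100.0) -> float:
    """Toy model: latency scales inversely with clock frequency."""
    return base_ms * (FREQS_MHZ[-1] / freq_mhz)

def predicted_power_w(freq_mhz: float) -> float:
    """Toy model: dynamic power grows roughly with the cube of frequency."""
    return 80.0 + 220.0 * (freq_mhz / FREQS_MHZ[-1]) ** 3

def pick_frequency(slo_ms: float) -> int:
    """Return the lowest-energy clock whose predicted latency meets the SLO."""
    feasible = [f for f in FREQS_MHZ if predicted_latency_ms(f) <= slo_ms]
    if not feasible:
        return FREQS_MHZ[-1]  # SLO too tight: fall back to the maximum clock
    # Energy per request ~ power x latency; minimize it over feasible clocks.
    return min(feasible,
               key=lambda f: predicted_power_w(f) * predicted_latency_ms(f))
```

Note that the energy-optimal clock is often not the lowest feasible one: slower clocks stretch latency, so static power is paid for longer, which is why the sketch minimizes the power-latency product rather than frequency itself.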
Optimizing Inference by Addressing Prefill and Decode Asymmetries
Researchers developed GreenLLM, a framework designed to significantly reduce the energy consumption of large language model (LLM) inference without sacrificing performance. The team addressed the asymmetry between the prefill and decode stages of LLM processing, optimizing energy use by controlling each stage separately. The system employs length-based request routing, steering shorter prompts away from queues behind long prefills to improve time-to-first-token (TTFT). During the prefill stage, GreenLLM analyzes traces to model the relationship between GPU frequency and latency, then selects the most energy-efficient clock speed for each prompt-length class.
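Length-based routing can be sketched as bucketing requests by prompt token count into per-class queues, so each class can later be served at its own clock. The class boundaries and names below are assumptions for illustration, not GreenLLM's published values.

```python
# Illustrative sketch of length-based request routing: prompts are bucketed
# by token count so short requests are not queued behind long prefills.
# Class boundaries and names are hypothetical, not taken from the paper.
from collections import deque

CLASS_BOUNDS = [(0, 128, "short"), (128, 1024, "medium"),
                (1024, float("inf"), "long")]

def classify(prompt_tokens: int) -> str:
    """Map a prompt's token count to its length class."""
    for lo, hi, name in CLASS_BOUNDS:
        if lo <= prompt_tokens < hi:
            return name
    raise ValueError("negative token count")

class Router:
    """Route each request to a per-class queue; each class can then be
    served at its own energy-efficient GPU clock."""
    def __init__(self):
        self.queues = {name: deque() for _, _, name in CLASS_BOUNDS}

    def submit(self, request_id: str, prompt_tokens: int) -> str:
        cls = classify(prompt_tokens)
        self.queues[cls].append(request_id)
        return cls
```

Keeping short prompts in their own queue is what protects TTFT: a 32-token prompt never waits behind a 4,000-token prefill running at a reduced clock.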
Demonstrating Energy Savings Using Production Workloads
Experiments demonstrate that for short and medium prompts, GreenLLM intentionally allows slightly higher TTFT to conserve energy, achieving savings of up to 20% while remaining within service level objectives (SLOs). For longer prompts, energy savings can reach 25–30% at mid-load. During the decode phase, a controller dynamically adjusts GPU frequency based on token generation throughput, maintaining tail latency within target bounds. Microbenchmarks reveal that GreenLLM closely tracks the performance of default GPU settings while reducing energy consumption by 8–25%, with the greatest gains observed when workloads have available headroom.
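The decode-phase controller described above can be sketched as a simple feedback loop: clock down when measured token throughput has headroom over the target, clock up when it falls behind. The clock list, margin, and step policy are illustrative assumptions, not the paper's exact controller.

```python
# Minimal sketch of a decode-phase frequency controller in the spirit of
# GreenLLM. Clocks, target, and margin are hypothetical values.

FREQS_MHZ = [705, 810, 945, 1080, 1215, 1350, 1410]

class DecodeController:
    def __init__(self, target_tokens_per_s: float, margin: float = 0.1):
        self.target = target_tokens_per_s
        self.margin = margin
        self.idx = len(FREQS_MHZ) - 1  # start at the highest clock

    def update(self, measured_tokens_per_s: float) -> int:
        """One control step: return the clock (MHz) for the next interval."""
        if measured_tokens_per_s > self.target * (1 + self.margin):
            self.idx = max(0, self.idx - 1)  # headroom: step the clock down
        elif measured_tokens_per_s < self.target:
            self.idx = min(len(FREQS_MHZ) - 1, self.idx + 1)  # behind: step up
        return FREQS_MHZ[self.idx]  # within the margin band: hold steady
```

The dead band between the target and the margin keeps the controller from oscillating between adjacent clocks when throughput hovers near the SLO.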
Evaluations using both synthetic and production traces demonstrate GreenLLM’s ability to adapt to changing demands, tracking workload intensity and modulating GPU clocks in real-time to minimize energy use while maintaining SLO compliance. Across evaluations using production traces, GreenLLM reduces total energy consumption by up to 34% compared to standard GPU settings, with no loss of throughput and less than 3.5% additional SLO violations.
Optimized LLM Serving with Dynamic Frequency Scaling
GreenLLM presents a new framework for improving the energy efficiency of large language model serving. The research addresses the distinct characteristics of LLM inference, which involves separate prefill and decode phases, and demonstrates that treating these phases uniformly wastes energy. GreenLLM optimizes energy use by controlling the two phases independently, employing length-based request routing to reduce delays and a model-driven approach to select efficient clock speeds during prefill. A lightweight controller then dynamically adjusts frequency during decoding to maintain performance targets while minimizing energy consumption. Across tests using real-world data, GreenLLM consistently reduced node energy by between 10 and 34 percent compared to standard GPU governors, without compromising throughput or significantly impacting service level objectives. The framework operates on existing GPUs and integrates seamlessly with current serving infrastructure, requiring no modifications to the language models themselves.
🗞 GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
🧠 ArXiv: https://arxiv.org/abs/2508.16449
The efficacy of GreenLLM relies on integrating precise knowledge of LLM computational graphs into the hardware scheduler. Unlike general-purpose workload schedulers, which treat GPU utilization as a single aggregate of busy compute units, GreenLLM requires awareness of the data dependencies and memory bandwidth bottlenecks inherent to the transformer architecture. This necessitates hardware-aware profiling tools that can accurately delineate, in real time, the shift in resource bottlenecks between the compute-bound, token-parallel prefill phase and the memory-bound, sequential decode phase.
From a system-level perspective, implementing dynamic frequency scaling adds complexity to maintaining stability and predicting thermal profiles. The framework must incorporate advanced control loops that model the dynamic relationship between power draw, clock frequency, and achievable latency variance. By treating these variables as coupled constraints within an optimization problem, GreenLLM moves beyond simple power capping to achieve genuine performance-aware energy minimization.
Furthermore, the demonstrated reduction in energy expenditure highlights a critical industry bottleneck: the energy cost of scaling AI. As model sizes increase and deployment moves toward the edge, optimization frameworks like GreenLLM become foundational infrastructure components. This technical leap suggests a shift in AI infrastructure focus, moving energy efficiency from an academic consideration to a core economic metric for cloud providers.
