Researchers develop GreenLLM framework to minimise GPU energy for Large Language Model inference

Large Language Models (LLMs) are rapidly transforming cloud services, but their intensive computational demands place a significant strain on energy resources. Qunyou Liu, Darong Huang, and colleagues at École Polytechnique Fédérale de Lausanne, along with Marina Zapater from HES-SO University of Applied Sciences and Arts Western Switzerland, address this challenge with a new framework called GreenLLM. The team recognises that LLM inference differs from typical GPU workloads, exhibiting distinct characteristics in its prefill and decode stages, and current power management systems fail to account for this asymmetry. GreenLLM dynamically adjusts GPU frequency based on request length and performance targets, minimising energy consumption without compromising speed or reliability, and represents a substantial step towards sustainable artificial intelligence. Across extensive testing with real-world data, the framework achieves up to a 34 percent reduction in energy use, demonstrating its potential for widespread impact.

LLM inference differs from traditional GPU workloads because it consists of two distinct stages with different characteristics: the prefill phase, which prioritizes low latency and scales with prompt length, and the decode phase, which progresses token by token. Understanding these differing characteristics is crucial for optimising energy efficiency. Current hardware and software often treat both phases identically, missing opportunities for significant energy savings through tailored acceleration strategies. This research investigates the energy-performance trade-offs within LLM inference, aiming to develop a more nuanced approach to hardware and software co-design. The objective is to identify and implement acceleration techniques specifically suited to each phase, thereby reducing overall energy consumption without compromising performance.
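
To make the asymmetry concrete, the sketch below separates the two standard serving metrics: time-to-first-token (TTFT), which is dominated by the prefill phase and grows with prompt length, and time per output token (TPOT), which reflects the token-by-token decode phase. The streaming interface and the toy stand-in generator are illustrative assumptions, not anything from the paper.

```python
import time

def timed_generate(generate_stream, prompt: str):
    """generate_stream is any callable that yields output tokens for the prompt."""
    t0 = time.perf_counter()
    token_times = []
    for _ in generate_stream(prompt):
        token_times.append(time.perf_counter())
    ttft = token_times[0] - t0  # prefill-dominated: scales with prompt length
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)  # decode-dominated
    return ttft, tpot

def fake_stream(prompt: str):
    """Toy stand-in so the sketch runs without a real model."""
    time.sleep(0.002 * len(prompt.split()))  # prefill cost grows with prompt length
    for _ in range(8):
        time.sleep(0.01)                     # roughly constant cost per decoded token
        yield "tok"

ttft, tpot = timed_generate(fake_stream, "a short example prompt")
print(f"TTFT={ttft:.3f}s  TPOT={tpot:.3f}s")
```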

SLO-Aware Dynamic Frequency Scaling for LLMs

The paper details GreenLLM, a system designed to improve the energy efficiency of Large Language Model (LLM) serving while maintaining Service Level Objectives (SLOs). GreenLLM prioritizes maintaining performance targets like latency and throughput, making power optimizations without sacrificing these key metrics. The system dynamically adjusts the GPU clock frequency based on current workload characteristics, reducing power consumption when possible without impacting performance. GreenLLM recognizes that LLM workloads have distinct phases (prefill and decode) with different resource demands, and aims for fine-grained control over GPU frequency, allowing precise adjustments based on real-time workload demands.
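
The paper targets stock GPUs, though the clock-control interface is not specified in the material summarised here. One plausible mechanism, shown below as an assumption rather than the authors' implementation, is NVIDIA's NVML locked-clocks API via the pynvml bindings; note that locking clocks generally requires administrator privileges.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the node

def set_gpu_clock(mhz: int) -> None:
    # Pin the graphics clock to a single frequency (min == max).
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, mhz, mhz)

def reset_gpu_clock() -> None:
    # Hand clock management back to the driver's default governor.
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)

def power_draw_watts() -> float:
    # NVML reports instantaneous board power draw in milliwatts.
    return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0

set_gpu_clock(1100)  # e.g. a lower clock for a decode-heavy interval
print(f"power: {power_draw_watts():.1f} W")
reset_gpu_clock()
pynvml.nvmlShutdown()
```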

GreenLLM Optimizes LLM Inference Energy Use

Researchers developed GreenLLM, a new framework designed to significantly reduce the energy consumption of large language model (LLM) inference without sacrificing performance. The team addressed the asymmetry between the prefill and decode stages of LLM processing, optimizing energy use by separating control of these stages. The system employs length-based request routing, steering shorter prompts away from long-running requests so they are not delayed behind them, which improves time-to-first-token (TTFT). During the prefill stage, GreenLLM analyzes traces to predict the relationship between GPU frequency and latency, then selects the most energy-efficient clock speed for each prompt class.
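
A minimal sketch of this prefill-side logic follows: route requests by prompt length, then pick the lowest-energy clock whose predicted TTFT still meets the SLO. The latency and power models, candidate clocks, and routing thresholds are toy stand-ins for the trace-fitted models in the paper, not its actual values.

```python
FREQS_MHZ = [900, 1100, 1300, 1500, 1700]  # candidate GPU clocks (illustrative)

def predicted_ttft(prompt_tokens: int, freq_mhz: int) -> float:
    """Toy model: prefill latency scales with prompt length and ~1/frequency."""
    return 0.0004 * prompt_tokens * (1500 / freq_mhz)

def predicted_energy(prompt_tokens: int, freq_mhz: int) -> float:
    """Toy model: power grows superlinearly with clock; energy = power * time."""
    power_w = 80 + 0.0001 * freq_mhz ** 2
    return power_w * predicted_ttft(prompt_tokens, freq_mhz)

def route(prompt_tokens: int) -> str:
    """Length-based routing: keep short prompts on a fast queue for low TTFT."""
    if prompt_tokens < 256:
        return "short"
    return "long" if prompt_tokens >= 2048 else "medium"

def pick_prefill_clock(prompt_tokens: int, ttft_slo_s: float) -> int:
    """Choose the most energy-efficient clock that still meets the TTFT SLO."""
    feasible = [f for f in FREQS_MHZ if predicted_ttft(prompt_tokens, f) <= ttft_slo_s]
    if not feasible:
        return max(FREQS_MHZ)  # SLO unattainable at any clock: run as fast as possible
    return min(feasible, key=lambda f: predicted_energy(prompt_tokens, f))

print(route(128), pick_prefill_clock(128, ttft_slo_s=0.5))
print(route(4096), pick_prefill_clock(4096, ttft_slo_s=2.0))
```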

Experiments demonstrate that for short and medium prompts, GreenLLM intentionally allows slightly higher TTFT to conserve energy, achieving savings of up to 20% while remaining within service level objectives (SLOs). For longer prompts, energy savings can reach 25–30% at mid-load. During the decode phase, a controller dynamically adjusts GPU frequency based on token generation throughput, maintaining tail latency within target bounds. Microbenchmarks reveal that GreenLLM closely tracks the performance of default GPU settings while reducing energy consumption by 8–25%, with the greatest gains observed when workloads have available headroom.
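
The decode-side behaviour can be sketched as simple throughput feedback with hysteresis: each control interval, compare observed token throughput to the SLO-derived target and step the clock up or down one level. The target, step policy, and slack band below are illustrative assumptions, not the paper's actual controller.

```python
FREQS_MHZ = [900, 1100, 1300, 1500, 1700]  # candidate GPU clocks (illustrative)

class DecodeClockController:
    def __init__(self, target_tok_per_s: float, slack: float = 0.1):
        self.target = target_tok_per_s
        self.slack = slack               # hysteresis band to avoid oscillation
        self.level = len(FREQS_MHZ) - 1  # start at the highest clock

    def update(self, observed_tok_per_s: float) -> int:
        """Return the clock (MHz) to apply for the next control interval."""
        if observed_tok_per_s < self.target:
            # Falling behind the SLO target: raise the clock if possible.
            self.level = min(self.level + 1, len(FREQS_MHZ) - 1)
        elif observed_tok_per_s > self.target * (1 + self.slack):
            # Comfortable headroom: lower the clock to save energy.
            self.level = max(self.level - 1, 0)
        return FREQS_MHZ[self.level]

ctl = DecodeClockController(target_tok_per_s=40.0)
for tps in [55.0, 52.0, 48.0, 39.0, 44.0]:
    print(f"{tps} tok/s -> {ctl.update(tps)} MHz")
```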

Evaluations using both synthetic and production traces demonstrate GreenLLM’s ability to adapt to changing demands, tracking workload intensity and modulating GPU clocks in real time to minimize energy use while maintaining SLO compliance. Across evaluations using production traces, GreenLLM reduces total energy consumption by up to 34% compared to standard GPU settings, with no loss of throughput and less than 3.5% additional SLO violations.

Optimized LLM Serving with Dynamic Frequency Scaling

GreenLLM presents a new framework for improving the energy efficiency of large language model serving. The research addresses the distinct characteristics of LLM inference, which involves separate prefill and decode phases, and demonstrates that treating these phases uniformly leads to wasted energy. GreenLLM optimizes energy use by separating control of these two phases, employing length-based request routing to reduce delays and a model-driven approach to select efficient clock speeds during prefill. A lightweight controller then dynamically adjusts frequency during decoding to maintain performance targets while minimizing energy consumption. Across tests using real-world data, GreenLLM consistently reduced node energy by between 10 and 34 percent compared to standard GPU governors, without compromising throughput or significantly impacting service level objectives. The framework operates on existing GPUs and integrates seamlessly with current serving infrastructure, requiring no modifications to the language models themselves.

👉 More information
🗞 GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
🧠 ArXiv: https://arxiv.org/abs/2508.16449
