The escalating demands of training large language models require a detailed understanding of how performance, power consumption, and thermal behavior interact across distributed computing systems. Seokjin Go, Joongun Park, Spandan More, and colleagues at Georgia Institute of Technology, along with Hanjiang Wu, Irene Wang, and Aaron Jezghani, present a comprehensive characterization of these interactions across diverse hardware platforms, including NVIDIA and AMD GPUs. Their work investigates how different training strategies, such as data, tensor, pipeline, and expert parallelism, impact hardware utilization and efficiency. The team demonstrates that simply increasing hardware capacity does not guarantee improved performance, and that careful configuration is crucial, with smaller systems sometimes outperforming larger ones. These findings reveal the complex interplay between hardware, system design, and model execution, offering vital recommendations for building more scalable and reliable systems for future large language models.
Distributed Training of Large Language Models
A comprehensive body of research explores the challenges and advancements in training large language models (LLMs) using distributed computing techniques, focusing on scaling training across multiple GPUs and nodes to accelerate development. Key areas of investigation include model, data, and pipeline parallelism, alongside hybrid approaches and techniques such as Mixture of Experts (MoE), which increases model capacity by activating only a subset of parameters for each input. Scientists are developing tools for benchmarking performance, analyzing power consumption, and managing thermal behavior, including investigations into hardware/software co-design for peak efficiency. Understanding the resource requirements, data needs, and performance bottlenecks of LLMs is central to driving innovation in LLM development.
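To make the MoE idea concrete, here is a minimal routing sketch in PyTorch. It is purely illustrative: the class name `TinyMoE`, the gating scheme, and the layer sizes are assumptions for exposition, not the models studied. A small gate scores the experts for each token and only the top-k experts run, so most parameters stay idle for any given input.

```python
# Minimal sketch of Mixture-of-Experts routing (illustrative only):
# a gating network picks the top-k experts per token, so only a fraction
# of the model's parameters are active for any given input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                        # x: [tokens, d_model]
        scores = self.gate(x)                    # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # route each token to its k-th choice
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(16, 64))               # 16 tokens, 2 of 8 experts each
```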
Fully Sharded Data Parallelism (FSDP), tensor parallelism, and LoRA, a parameter-efficient fine-tuning technique, are actively investigated, with frameworks such as PyTorch, NVIDIA’s NeMo, and Lightning AI used to simplify the training process. Researchers leverage tools such as torch.compile, NVML, ROCm SMI, Astra-Sim, and Zeus to optimize execution, monitor GPUs, simulate platforms, and analyze energy consumption. The research draws on large datasets such as The Pile and PubMedQA, with a growing emphasis on sustainability, reducing energy consumption, and minimizing environmental impact. Overarching trends point to a pressing need for scalability, for efficiency in computational cost, memory usage, and energy consumption, and for co-design of hardware and software at the cutting edge of large-scale machine learning.
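As a concrete illustration of the telemetry such tools expose, the sketch below samples per-GPU power, temperature, and SM clock through NVML via the `pynvml` bindings. This is an assumed minimal setup rather than the authors' measurement pipeline; AMD systems would use the ROCm SMI equivalents.

```python
# Minimal sketch: periodic per-GPU power/thermal/clock sampling via NVML.
# Assumes the `pynvml` package; the paper's telemetry pipeline may differ.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

def sample():
    """Return (power_W, temp_C, sm_clock_MHz) for every visible GPU."""
    readings = []
    for h in handles:
        power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # NVML reports mW
        temp_c = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
        readings.append((power_w, temp_c, sm_mhz))
    return readings

if __name__ == "__main__":
    for _ in range(10):            # sample once per second alongside a training run
        print(sample())
        time.sleep(1.0)
    pynvml.nvmlShutdown()
```

Logging the SM clock alongside power and temperature is what makes frequency throttling visible: a GPU whose clock drops while its temperature climbs is being thermally limited rather than starved of work.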
LLM Training Across Diverse Hardware Platforms
Scientists conducted a comprehensive study of Large Language Model (LLM) training across diverse hardware platforms, including systems with NVIDIA H200, H100, and AMD MI250 GPUs, aiming to understand the interplay between software and hardware during large-scale training. The team systematically analyzed various parallelism strategies (tensor, pipeline, data, and expert parallelism) and evaluated their impact on hardware utilization, power consumption, and thermal behavior. The study pioneered a detailed examination of optimization techniques, specifically activation recomputation and compute-communication overlap, to assess their effectiveness in improving training efficiency. Researchers carefully monitored how these optimizations affected memory usage, computational load, and communication overhead, revealing trade-offs tied to hardware bottlenecks and microbatch size. To probe the physical limits of LLM training, scientists meticulously tracked power consumption and thermal behavior across GPU clusters, observing and quantifying thermal hotspots, power capping, and frequency throttling. The research team analyzed how asymmetric GPU throttling disrupted synchronization and skewed performance, demonstrating that while scale-out systems generally offer higher aggregate compute, scale-up systems can outperform them in communication-heavy regimes under optimized configurations.
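Activation recomputation, one of the optimizations examined, discards intermediate activations during the forward pass and rebuilds them during backward, trading extra compute for lower memory. A minimal PyTorch sketch of the mechanism, far smaller than the training stacks the study actually profiles, looks like this:

```python
# Minimal sketch of activation recomputation with torch.utils.checkpoint:
# activations inside the checkpointed block are dropped after the forward
# pass and recomputed during backward, trading FLOPs for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
x = torch.randn(8, 4096, requires_grad=True)

# Without recomputation: the block's intermediate activations stay resident
# in memory until the backward pass consumes them.
y_plain = block(x)

# With recomputation: only the block's inputs and outputs are kept; the
# intermediates are rebuilt on the fly when gradients are computed.
y_ckpt = checkpoint(block, x, use_reentrant=False)
y_ckpt.sum().backward()
```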
Parallelism Strategy Dictates LLM Training Performance
The research team conducted a comprehensive characterization of large language model (LLM) training across modern datacenter hardware, revealing intricate interactions between software parallelism and underlying hardware behavior. Experiments on NVIDIA H100 and H200 GPUs and AMD MI250 GPUs demonstrated that performance is significantly influenced by the chosen parallelism strategy and system configuration. Results show that scale-up systems can outperform scale-out systems in communication-bound scenarios, but only with carefully tuned configurations, while scale-out deployments often achieve superior throughput when optimized effectively. The study also found that combining tensor and pipeline parallelism can leave PCIe bandwidth underutilized, and that increasing microbatch sizes beyond an optimal point reduces training efficiency due to communication saturation.
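To see why configuration matters so much, the hypothetical helper below (not the authors' tooling) enumerates the ways a fixed GPU count can be split into data-, tensor-, and pipeline-parallel degrees, together with the number of microbatches each split implies for a given global batch size. The study's point is that these mathematically equivalent factorizations behave very differently once communication patterns and thermal limits enter.

```python
# Hypothetical helper: enumerate (data, tensor, pipeline) parallel degrees
# for a fixed GPU count and the microbatch count each implies. Which layout
# performs best depends on interconnect bandwidth and thermal headroom,
# which is precisely the trade-off the study characterizes.
from itertools import product

def parallel_configs(num_gpus: int, global_batch: int, microbatch: int):
    for tp, pp in product(range(1, num_gpus + 1), repeat=2):
        if num_gpus % (tp * pp):
            continue                              # tp * pp must divide the GPU count
        dp = num_gpus // (tp * pp)
        if global_batch % (dp * microbatch):
            continue                              # batch must split evenly across replicas
        num_microbatches = global_batch // (dp * microbatch)
        yield {"dp": dp, "tp": tp, "pp": pp, "microbatches": num_microbatches}

for cfg in parallel_configs(num_gpus=16, global_batch=256, microbatch=4):
    print(cfg)
```

Even this toy case, 16 GPUs and a global batch of 256, yields over a dozen candidate layouts, each stressing the interconnect and the power envelope differently.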
Measurements confirmed that peak power draw and chip temperatures increase with larger microbatch sizes, intensifying thermal throttling and negatively impacting performance. The team observed significant performance variability due to thermal imbalances across GPUs, highlighting the need for strategies that more aggressively utilize cooler GPUs. In one instance, a node-level power failure caused GPUs to run more than four times slower, disrupting the entire training pipeline. These findings underscore the importance of co-designing parallelism strategies and system-level execution with awareness of both algorithmic structure and hardware characteristics to achieve robust and scalable LLM training performance.
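One way to surface the asymmetric-throttling effect described above is to compare per-rank step times at runtime. The sketch below is an assumed instrumentation pattern built on torch.distributed, not the paper's methodology: each rank shares its step time, and ranks running notably slower than the fastest one are flagged as likely throttled or otherwise degraded.

```python
# Sketch (assumed instrumentation, not the paper's): detect throttling-induced
# stragglers by gathering per-rank step times after each training step.
# Requires an initialized process group; with the NCCL backend the tensors
# below must live on the GPU.
import torch
import torch.distributed as dist

def report_stragglers(step_time_s: float, tolerance: float = 1.2):
    """Flag ranks whose step time exceeds `tolerance` x the fastest rank."""
    world = dist.get_world_size()
    local = torch.tensor([step_time_s], dtype=torch.float64)
    times = [torch.zeros_like(local) for _ in range(world)]
    dist.all_gather(times, local)        # every rank sees every rank's step time
    times = torch.cat(times)
    fastest = times.min().item()
    slow = (times > tolerance * fastest).nonzero().flatten().tolist()
    if dist.get_rank() == 0 and slow:
        print(f"ranks {slow} are >{tolerance:.1f}x slower than the fastest rank")
```

Because synchronous training advances at the pace of its slowest rank, even one throttled GPU drags down the whole job, which is why the authors argue for scheduling that shifts work toward cooler devices.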
Hardware Scaling Limits Large Language Models
This research presents a comprehensive characterization of large language model training across diverse hardware platforms and parallelism strategies, demonstrating that simply increasing hardware capacity does not guarantee improved performance because of system-level bottlenecks related to communication and thermal behavior. The team found that scale-up systems can outperform scale-out systems in certain communication-bound scenarios, provided configurations are carefully tuned, while scale-out deployments often achieve superior throughput when appropriately configured. The investigation further reveals that specific combinations of parallelism techniques can leave bandwidth underutilized, while excessively large microbatch sizes drive bursts of power draw that exacerbate thermal throttling and limit performance, highlighting the complex interplay between hardware, system topology, and model execution.
👉 More information
🗞 Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
🧠 ArXiv: https://arxiv.org/abs/2509.10371
