GPU Scheduling Achieves 35% Performance Boost with Fragmentation-Aware Online Management

The growing demand for efficient use of powerful graphics processing units (GPUs) is driving research into resource-sharing techniques. A team led by Hsu-Tzu Ting and Jerry Chou from National Tsing Hua University, together with Ming-Hung Chen and I-Hsin Chung from IBM T. J. Watson Research Center, now presents a novel solution for managing workloads on multi-instance GPUs. Their work addresses a key limitation of current GPU partitioning technology: the fragmentation that arises from both the limited set of valid configurations and the constraints on job placement. The researchers developed an online scheduling framework that dynamically balances load, adjusts GPU partitioning, and migrates jobs to minimise resource contention and combat fragmentation, substantially improving system efficiency and reducing overall makespan by up to 35 percent. The approach is a significant step towards maximising the utilisation of expensive GPU resources and managing increasingly complex computational workloads more effectively.

Dynamic MIG Allocation Reduces GPU Fragmentation

This research introduces a system for efficiently sharing GPUs using Multi-Instance GPU (MIG) technology, tackling the problem of GPU fragmentation. Fragmentation arises when resources are allocated to diverse workloads, leaving GPU slices that are too small or too awkwardly placed to serve new requests, which leads to underutilization and performance degradation. The authors aim to maximize GPU utilization and minimize latency by intelligently scheduling tasks on MIG partitions. Key findings demonstrate that intelligent, fragmentation-aware scheduling is crucial for maximizing the benefits of MIG technology and achieving efficient GPU sharing in cloud and data center environments.

The team's scheduler uses a Fragmentation Gradient Descent (FGD) technique, which quantifies fragmentation to guide allocation decisions and mitigate hotspots. The system was evaluated through simulations and experiments driven by BurstGPT, a real-world workload trace released to facilitate research and benchmarking of LLM serving systems. Results show significant improvements in GPU utilization, reduced latency, and better overall performance compared with baseline scheduling approaches. This work contributes to better resource management in cloud and data center environments, particularly for demanding workloads such as Large Language Models.
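The article does not reproduce the exact formulation, but the placement idea behind a fragmentation-gradient approach can be pictured in a few lines of Python. Everything below, including the Gpu class, the request mix, and the slice-based fragmentation measure, is an illustrative assumption rather than the authors' implementation: each candidate GPU is scored by how much a tentative placement would increase its fragmentation, and the job goes to the GPU with the smallest increase.

```python
# Minimal sketch of fragmentation-gradient placement for MIG slices.
# All names and the fragmentation measure are simplifying assumptions.
from dataclasses import dataclass

@dataclass
class Gpu:
    total_slices: int = 7          # e.g. seven compute slices on an A100-class GPU
    used_slices: int = 0

    @property
    def free_slices(self) -> int:
        return self.total_slices - self.used_slices

def fragmentation(gpu, request_mix):
    """Expected number of free slices that cannot serve a randomly drawn
    request, weighting each request size by its share of the workload."""
    return sum(share * gpu.free_slices
               for size, share in request_mix.items()
               if gpu.free_slices < size)

def place(job_slices, gpus, request_mix):
    """Assign the job to the GPU whose fragmentation grows the least
    (the 'gradient descent' step); return that GPU, or None if nothing fits."""
    best, best_delta = None, None
    for gpu in (g for g in gpus if g.free_slices >= job_slices):
        before = fragmentation(gpu, request_mix)
        gpu.used_slices += job_slices                     # tentative placement
        delta = fragmentation(gpu, request_mix) - before  # fragmentation increase
        gpu.used_slices -= job_slices                     # roll back
        if best_delta is None or delta < best_delta:
            best, best_delta = gpu, delta
    if best is not None:
        best.used_slices += job_slices
    return best

# Example mix: 60% of requests want 1 slice, 30% want 2, 10% want 4.
# The 2-slice job lands on the busier GPU, because filling the emptier
# one would strand more capacity for future 4-slice requests.
cluster = [Gpu(used_slices=3), Gpu(used_slices=4)]
print(cluster.index(place(2, cluster, {1: 0.6, 2: 0.3, 4: 0.1})))  # -> 1
```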

Dynamic GPU Scheduling for Efficient Resource Use

Scientists engineered a scheduling system that integrates conditional load balancing, dynamic partitioning, and job migration. The method continuously monitors GPU resource usage and proactively migrates jobs between MIG instances to balance the load and reduce contention for shared resources. The team also implemented dynamic partitioning, allowing the system to create or destroy MIG instances at runtime and adapt to changing workload demands without disrupting co-located jobs. Experiments show a significant improvement in system efficiency, with makespan improving by up to 35%, highlighting the potential to substantially enhance GPU utilization in modern data centers.
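On NVIDIA hardware, runtime repartitioning of this kind is typically driven through the `nvidia-smi mig` command-line interface. The wrapper below is only a hedged sketch of what such a step could look like: the profile names, GPU indices, and use of subprocess are assumptions about one possible deployment, not the paper's implementation, and the exact flags should be checked against the installed driver version.

```python
# Hedged sketch: creating and destroying MIG instances at runtime through
# the `nvidia-smi mig` CLI. Profile names and indices are examples only;
# verify the flags against the installed driver's documentation.
import subprocess

def _run(cmd):
    """Run a command, raise on failure, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def create_instance(gpu_index, profile):
    """Create a GPU instance (and its default compute instance) for a job,
    e.g. profile '1g.5gb' or '3g.20gb' on an A100-class GPU."""
    # -cgi creates the GPU instance; -C also creates the compute instance.
    return _run(["nvidia-smi", "mig", "-i", str(gpu_index), "-cgi", profile, "-C"])

def destroy_instance(gpu_index, gi_id):
    """Tear down one GPU instance after its job finishes or is migrated,
    freeing its slices without disturbing co-located instances."""
    _run(["nvidia-smi", "mig", "-i", str(gpu_index), "-dci", "-gi", str(gi_id)])
    _run(["nvidia-smi", "mig", "-i", str(gpu_index), "-dgi", "-gi", str(gi_id)])
```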

MIG Scheduling Optimisation Achieves 35% Improvement

Researchers have developed a novel scheduling framework to significantly improve the efficiency of Multi-Instance GPU (MIG) systems, achieving a makespan improvement of up to 35%. The work addresses critical challenges in MIG environments, specifically resource contention and fragmentation that arise from shared components like PCIe bandwidth and the limited number of valid MIG configurations. The team’s approach integrates conditional load balancing, dynamic partitioning, and job migration to optimize GPU resource utilization. Conditional load balancing actively distributes workloads across GPUs, preventing resource bottlenecks.
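A conditional balancing policy of this kind can be pictured as a simple monitoring loop that acts only when the load gap between GPUs justifies the cost of moving a job. The threshold, the monitoring period, the Job/Gpu bookkeeping, and the migrate callback below are all assumptions for the sketch, not details taken from the paper.

```python
# Rough sketch of a conditional load-balancing / migration loop. The
# threshold, monitoring period, and migrate() callback are assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    slices: int                    # MIG compute slices the job occupies

@dataclass
class Gpu:
    total_slices: int = 7
    jobs: list = field(default_factory=list)

    @property
    def used_slices(self):
        return sum(j.slices for j in self.jobs)

    @property
    def load(self):
        return self.used_slices / self.total_slices

IMBALANCE_THRESHOLD = 0.30         # act only when the load gap exceeds 30%
MONITOR_INTERVAL_S = 5             # assumed monitoring period in seconds

def rebalance_once(gpus, migrate):
    """Move one job from the busiest GPU to the lightest one, but only
    when the imbalance is large enough to justify the migration cost."""
    busiest = max(gpus, key=lambda g: g.load)
    lightest = min(gpus, key=lambda g: g.load)
    if busiest.load - lightest.load < IMBALANCE_THRESHOLD:
        return False                                   # condition not met
    free = lightest.total_slices - lightest.used_slices
    movable = [j for j in busiest.jobs if j.slices <= free]
    if not movable:
        return False                                   # nothing fits on the target
    job = min(movable, key=lambda j: j.slices)         # cheapest job to move
    busiest.jobs.remove(job)
    lightest.jobs.append(job)
    migrate(job, src=busiest, dst=lightest)            # e.g. checkpoint and restart
    return True

def control_loop(gpus, migrate):
    while True:
        rebalance_once(gpus, migrate)
        time.sleep(MONITOR_INTERVAL_S)
```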

To combat fragmentation, the researchers implemented dynamic partitioning and job migration, reorganizing GPU allocations to overcome both internal and external fragmentation issues. The study demonstrates that even with ample available resources, instance creation can be hindered by the fixed set of valid MIG configurations, introducing a new form of fragmentation. The team’s method dynamically adapts job placement and reorganizes allocations, effectively addressing this limitation and maximizing resource use.
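To see how an instance request can fail despite ample free capacity, consider a simplified model of A100-style MIG placement rules (the PLACEMENTS table below is an assumption used only for illustration): a 4-slice instance may only start at offset 0, so two 1-slice instances parked at offsets 0 and 4 block it even though five slices remain free, and a naive repack restores feasibility.

```python
# Sketch of configuration-induced fragmentation on a 7-slice MIG GPU and a
# naive repartitioning (repack) step. The placement table is a simplified
# model of A100-style constraints, used only for illustration.

# profile size in slices -> allowed start offsets (simplifying assumption)
PLACEMENTS = {1: (0, 1, 2, 3, 4, 5, 6), 2: (0, 2, 4), 3: (0, 4), 4: (0,), 7: (0,)}
TOTAL_SLICES = 7

def can_create(size, occupied):
    """True if an instance of `size` slices fits at some allowed offset."""
    return any(start + size <= TOTAL_SLICES
               and not (set(range(start, start + size)) & occupied)
               for start in PLACEMENTS[size])

def repack(instance_sizes):
    """Naive repartition: re-place the existing instances at the highest
    allowed offsets, freeing the low slices where large profiles must start."""
    occupied, layout = set(), []
    for size in sorted(instance_sizes):
        for start in sorted(PLACEMENTS[size], reverse=True):
            span = set(range(start, start + size))
            if start + size <= TOTAL_SLICES and not (span & occupied):
                occupied |= span
                layout.append((size, start))
                break
    return layout, occupied

# Two 1-slice instances at offsets 0 and 4 leave five slices free, yet a
# 4-slice instance cannot be created: its only allowed offset (0) is taken.
print(can_create(4, {0, 4}))              # False -> fragmented despite free capacity
layout, occupied = repack([1, 1])         # move them to offsets 6 and 5
print(layout, can_create(4, occupied))    # [(1, 6), (1, 5)] True
```

In practice a scheduler would also have to migrate the jobs running inside those instances before destroying and re-creating them, which is exactly where the job-migration mechanism described above comes in.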

Multi-Instance GPU Scheduling Optimizes System Performance

This research presents a novel scheduling framework designed to improve the efficiency of Multi-Instance GPU systems, which enable resource sharing through hardware-level partitioning. The team addressed key challenges arising from this technology, specifically resource contention and GPU fragmentation, a problem that differs from fragmentation in traditional GPU sharing because MIG permits only a limited set of valid configurations. Their approach integrates conditional load balancing, dynamic partitioning, and job migration to optimize GPU utilization. Experimental results demonstrate significant improvements in system performance through this integrated method: reductions in job wait time and execution time ultimately lead to a substantial decrease in overall makespan, with improvements ranging from 13% to 35%. This work highlights the practical benefits of intelligently managing GPU resources in shared environments and paves the way for more sophisticated algorithms for predicting and mitigating fragmentation.

👉 More information
🗞 An Online Fragmentation-Aware Scheduler for Managing GPU-Sharing Workloads on Multi-Instance GPUs
🧠 ArXiv: https://arxiv.org/abs/2512.16099

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
