Online Fragmentation-Aware GPU Scheduler Improves Multi-Tenant MIG Cloud Resource Allocation, Scheduling 10% More Workloads

The increasing demand for artificial intelligence applications places significant strain on available GPU resources, prompting cloud providers to seek efficient sharing technologies. Marco Zambianco, Lorenzo Fasol, and Roberto Doriguzzi-Corin from Fondazione Bruno Kessler address this challenge with a new scheduling framework for Multi-Instance GPU (MIG)-based cloud environments. Their research tackles the problem of GPU fragmentation, which arises from fixed partitioning and limits the number of workloads a cloud can accommodate, despite available resources. The team develops a novel, online scheduling algorithm that prioritises minimising fragmentation growth with each new workload, leading to demonstrably higher acceptance rates. Results show their method consistently schedules approximately 10% more workloads under heavy load, representing a substantial improvement in resource utilisation for AI-driven cloud services.

Sharing GPUs efficiently between different tenants is essential to maximise the number of scheduled workloads. Among the various GPU sharing technologies, NVIDIA’s Multi-Instance GPU (MIG) stands out by partitioning GPUs at hardware level into isolated slices with dedicated compute and memory, ensuring strong tenant isolation, preventing resource contention, and enhancing security. Despite these advantages, MIG’s fixed partitioning introduces scheduling rigidity, leading to severe GPU fragmentation in multi-tenant environments where workloads are continuously deployed and terminated. Fragmentation leaves GPUs underutilised, limiting the number of workloads that can be accommodated.

MIG Fragmentation and Multi-Tenant GPU Challenges

GPU fragmentation presents a significant obstacle to efficient resource allocation in multi-tenant GPU clusters. It arises from the combination of Multi-Instance GPU (MIG) technology's fixed partition geometry and the dynamic allocation and deallocation of resources. MIG allows a single physical GPU to be partitioned into multiple isolated instances, improving resource utilization, but as workloads arrive and terminate, the fixed slice layout leaves behind small, unusable chunks of GPU memory and compute units. In multi-tenant environments such as cloud data centers, efficient resource allocation is crucial, and fragmentation reduces scheduling efficiency, leading to wasted resources, increased latency, and reduced quality of service.
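The effect is easy to reproduce in a few lines. The sketch below uses an assumed 7-slice GPU model with simplified contiguity-only placement (real MIG profiles also have fixed start offsets and memory-slice requirements): after two jobs terminate, four slices are free, yet no profile wider than two slices fits.

```python
# Model a 7-slice GPU as a list: each entry names the job holding that slice.
gpu = ["A", "A", "B", "B", "B", "C", "C"]   # jobs A (2 slices), B (3), C (2)

def can_fit(gpu, size):
    """True if `size` contiguous free slices exist (simplified placement)."""
    run = 0
    for owner in gpu:
        run = run + 1 if owner is None else 0
        if run >= size:
            return True
    return False

# Jobs A and C terminate, freeing their slices at opposite ends of the GPU.
gpu = [None if owner in ("A", "C") else owner for owner in gpu]

print(gpu.count(None))   # 4 slices free in total
print(can_fit(gpu, 2))   # True: a 2-slice profile still fits
print(can_fit(gpu, 3))   # False: no 3 contiguous slices, despite 4 being free
```

The freed capacity is real but unusable for larger profiles: this stranded space is exactly what a fragmentation-aware scheduler tries to avoid creating.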

This problem is becoming increasingly relevant with the rise of deep learning, large language models, and other computationally intensive workloads that rely heavily on GPUs. To address it, the researchers developed a new metric, tailored specifically to MIG, that quantifies the severity of GPU fragmentation and the degree to which resources have become difficult to allocate. Building on this metric, they designed a scheduling algorithm that leverages it to make informed allocation decisions, proactively minimizing fragmentation during the scheduling process. The algorithm operates online, making scheduling decisions in real time without requiring prior knowledge of the workload distribution. Evaluations against existing scheduling strategies demonstrate consistent improvements in workload acceptance rates and reduced fragmentation across different load conditions and MIG profile distributions.

MIG Scheduling Framework Reduces GPU Fragmentation

This research delivers a novel scheduling framework designed to maximize workload acceptance and mitigate fragmentation in Multi-Instance GPU (MIG)-based cloud environments. Recognizing that efficient GPU sharing is crucial for cloud providers offering GPU-as-a-Service, scientists focused on overcoming the limitations of MIG’s fixed partitioning, which can lead to underutilized resources and reduced scheduling capacity. The team developed a new metric to analytically measure the severity of GPU fragmentation, providing a means to compare fragmentation levels across GPUs based on their current MIG profile allocations. This metric enables more informed decisions regarding future workload placements and facilitates a deeper understanding of resource inefficiency.

Building on this metric, the team designed a greedy scheduling algorithm that prioritizes minimizing fragmentation growth with each incoming workload. For each request for GPU resources, the algorithm evaluates the available GPUs and selects the MIG slices that result in the smallest increase in fragmentation. Experiments demonstrate that this approach consistently achieves higher workload acceptance rates, delivering an average 10% increase in the number of scheduled workloads under heavy load while maintaining GPU usage comparable to benchmark methods.

The research also illustrates how the arrival and termination of workloads drive resource inefficiency. By addressing this challenge with an online, workload-agnostic scheduling algorithm, the team demonstrates a significant improvement in GPU utilization, potentially increasing revenue for cloud providers by accommodating a greater number of applications. Together, the metric and the scheduling algorithm represent a substantial advance in managing MIG-based deployments in multi-tenant scenarios, offering a practical way to maximize resource efficiency and improve service delivery.
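The core greedy step described above, evaluating every candidate placement and keeping the one with the smallest fragmentation increase, could be sketched as follows. This is an illustrative reconstruction under simplified assumptions (a 7-slice GPU model, contiguity-only placement, a plausible stand-in for the paper's metric), not the authors' implementation.

```python
PROFILE_SIZES = [1, 2, 3, 4, 7]  # compute slices of common A100 MIG profiles

def fragmentation(free_mask):
    """Fraction of placeable profile sizes blocked by non-contiguous free slices."""
    total_free, run, longest = sum(free_mask), 0, 0
    for free in free_mask:
        run = run + 1 if free else 0
        longest = max(longest, run)
    schedulable = [p for p in PROFILE_SIZES if p <= total_free]
    blocked = [p for p in schedulable if p > longest]
    return len(blocked) / len(schedulable) if schedulable else 0.0

def schedule(gpus, size):
    """Greedily place a `size`-slice request where fragmentation grows least.

    `gpus` is a list of free-slice masks (True = free), mutated on success.
    Returns (gpu_index, start_offset), or None if nothing fits.
    """
    best = None  # (fragmentation delta, gpu index, start offset)
    for g, mask in enumerate(gpus):
        before = fragmentation(mask)
        for start in range(len(mask) - size + 1):
            if not all(mask[start:start + size]):
                continue  # not enough contiguous free slices here
            trial = list(mask)
            trial[start:start + size] = [False] * size
            delta = fragmentation(trial) - before
            if best is None or delta < best[0]:
                best = (delta, g, start)
    if best is None:
        return None
    _, g, start = best
    gpus[g][start:start + size] = [False] * size
    return g, start

gpus = [[True] * 7, [True, True, False, False, True, True, True]]
print(schedule(gpus, 2))   # (1, 0): filling in next to GPU 1's busy slices
```

Note the design choice the example surfaces: a plain best-fit or first-fit scheduler would happily open the empty GPU 0, whereas minimizing fragmentation growth packs the request against GPU 1's existing allocation, keeping GPU 0's large contiguous space intact for bigger profiles.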

MIG Scheduling Minimises Fragmentation and Improves Efficiency

This research presents a novel scheduling framework designed to improve the efficiency of Multi-Instance GPU (MIG) based cloud platforms, addressing a critical challenge in modern AI infrastructure. Scientists developed a method to mitigate GPU fragmentation, a common problem when sharing GPU resources among multiple users, which limits the number of workloads a system can accommodate. The team introduced a new fragmentation metric specifically tailored for MIG environments, quantifying resource inefficiency by measuring unschedulable MIG profiles. Building on this metric, researchers created a greedy scheduling algorithm that selects GPUs and MIG slices to minimise fragmentation growth with each incoming workload request.

Evaluations against existing scheduling strategies demonstrate a consistent improvement in workload acceptance rates, achieving an average 10% increase in scheduled workloads under heavy load conditions while maintaining comparable GPU usage. This advancement directly addresses the need for more efficient resource allocation in cloud environments supporting rapidly growing AI applications. The authors note that their algorithm operates without prior knowledge of workload characteristics, a design choice intended to maximise flexibility; future work could explore incorporating workload predictions to further optimise scheduling decisions. Even so, the current research marks a significant step forward in managing GPU fragmentation and improving the overall utilisation of MIG-based cloud infrastructure, offering a practical solution for cloud providers seeking to maximise resource efficiency and accommodate a growing demand for AI computing.

👉 More information
🗞 An Online Fragmentation-Aware GPU Scheduler for Multi-Tenant MIG-based Clouds
🧠 ArXiv: https://arxiv.org/abs/2511.18906

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
