GraphMend Fixes PyTorch 2 Graph Breaks, Cutting Latency by Up to 75% and Raising Throughput by Up to 8%

Fragmented computation graphs limit the performance of PyTorch 2 despite recent advances in just-in-time compilation. Savini Kashmira, Jayanaka Dantanarayana, and Thamirawaran Sathiyalogeswaran of the University of Michigan and Jaseci Labs, together with Yichao Yuan, Nishil Talati, and Krisztian Flautner, address this challenge with GraphMend, a novel high-level compiler. GraphMend automatically analyzes and transforms source code to eliminate breaks in these graphs, which arise from dynamic control flow and unsupported Python features and create costly performance bottlenecks. The team demonstrates that GraphMend removes all fixable graph breaks in several popular models, achieving latency reductions of up to 75% and throughput improvements of up to 8% on modern GPUs. The work is a significant step toward simplifying the development of high-performance machine learning models in the PyTorch ecosystem, improving both usability and speed.
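To see what a graph break looks like in practice, here is a minimal sketch, assuming PyTorch 2.1 or later: a data-dependent Python branch forces TorchDynamo to split the FX graph, and torch._dynamo.explain reports the resulting breaks.

```python
import torch

def forward(x):
    if x.sum() > 0:   # Python branch on a tensor value, known only at runtime
        return x * 2
    return x - 1

# torch._dynamo.explain traces the function and reports how many FX
# graphs were produced and why each break occurred.
explanation = torch._dynamo.explain(forward)(torch.randn(8))
print(explanation.graph_break_count)
print(explanation.break_reasons)
```

Every break like this splits the program into separate compiled regions stitched together by eager-mode Python, which is exactly the fragmentation GraphMend removes.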

Automated CUDA Graph Optimization for PyTorch Models

This research introduces a system that optimizes PyTorch programs by effectively utilizing CUDA graphs. The core idea is to automatically transform PyTorch models into CUDA graphs, reducing CPU overhead and improving GPU utilization, particularly for dynamic models. While PyTorch supports CUDA graphs, achieving optimal performance requires careful consideration of graph construction and execution; this system bridges that gap by automating the process and addressing challenges related to dynamic shapes and control flow. Key achievements include automated graph transformation, reducing the need for manual construction, and techniques for handling dynamic shapes by recompiling graphs when necessary.
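For context, the snippet below shows the kind of manual CUDA graph capture, following the pattern in the PyTorch CUDA graphs documentation, that such a system would automate; the model, buffer names, and sizes are illustrative.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream so lazy initialization happens outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a CUDA graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then launch the entire
# captured kernel sequence with a single CPU-side call.
static_input.copy_(torch.randn(64, 1024, device="cuda"))
g.replay()
print(static_output.shape)
```

Because capture pins input and output buffers to fixed addresses and shapes, a change in batch size invalidates the graph, which is why the system recompiles when shapes change.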

The system also optimizes control flow within CUDA graphs, reducing overhead associated with branching and conditional execution. Experiments demonstrate significant performance improvements, up to a 2x speedup, on various models and datasets, with the greatest benefits observed in models with high CPU overhead. The system integrates with the PyTorch profiler, allowing developers to easily identify and optimize performance bottlenecks.
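As an illustration of that profiler workflow, the sketch below uses the standard torch.profiler API to compare CPU time against GPU kernel time; launch-bound models show disproportionately large CPU totals, which is the signature of overhead that CUDA graphs can reclaim. The model here is a placeholder.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).cuda()
x = torch.randn(32, 512, device="cuda")

# Record both CPU-side launch activity and CUDA kernel execution.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```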

GraphMend Compiles PyTorch Programs Without Fragmentation

Scientists developed GraphMend, a high-level compiler that eliminates fragmentation in PyTorch 2 programs, substantially improving performance and usability. Existing dynamic JIT compilation pipelines often encounter breaks in FX graphs due to dynamic control flow and unsupported Python constructs, forcing inefficient switches between eager mode and graph mode. GraphMend addresses this limitation by analyzing and transforming source code before execution, enabling the compilation of larger, uninterrupted FX graphs without requiring manual code adjustments. The system builds upon the Jac compilation framework and implements two key code transformations specifically targeting dynamic control flow and Python I/O functions.
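The paper's exact rewrites are not reproduced here, but the sketch below illustrates their flavor: a data-dependent branch becomes a tensor-level select, and a Python print call is hoisted out of the compiled region. Note that torch.where evaluates both arms, so this particular rewrite assumes they are cheap and side-effect free.

```python
import torch

# Before: both patterns break the FX graph under torch.compile.
def step_before(x):
    print("step")                 # Python I/O inside the compiled region
    if x.sum() > 0:               # data-dependent Python branch
        return x * 2
    return x - 1

# After: an illustrative GraphMend-style rewrite (not the paper's exact
# output). The branch is expressed as a tensor-level select, so tracing
# never has to evaluate a Python bool.
def step_after(x):
    return torch.where(x.sum() > 0, x * 2, x - 1)

compiled = torch.compile(step_after)
x = torch.randn(8)
print("step")                     # I/O hoisted outside the compiled region
y = compiled(x)
```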

Experiments across eight Hugging Face models demonstrate GraphMend’s effectiveness, completely removing all fixable graph breaks in six models and reducing the break count in another. This transformation delivers significant performance gains, achieving up to 75% reductions in cold-start forward latency and up to 25% lower steady-state latency on NVIDIA RTX 3090 and A40 GPUs. Furthermore, the team measured up to 8% higher end-to-end throughput, demonstrating improved efficiency in processing data.
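For readers who want to reproduce measurements of this kind, here is a generic sketch of how cold-start versus steady-state forward latency can be timed; it is not the paper's benchmarking harness, and the model and sizes are placeholders.

```python
import time
import torch

def bench(fn, x, steady_iters=50):
    # Cold start: the first call pays for compilation and warm-up.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn(x)
    torch.cuda.synchronize()
    cold = time.perf_counter() - t0

    # Steady state: average latency once compiled graphs are cached.
    t0 = time.perf_counter()
    for _ in range(steady_iters):
        fn(x)
    torch.cuda.synchronize()
    return cold, (time.perf_counter() - t0) / steady_iters

model = torch.compile(torch.nn.Linear(1024, 1024).cuda())
cold, steady = bench(model, torch.randn(64, 1024, device="cuda"))
print(f"cold-start: {cold*1e3:.1f} ms, steady-state: {steady*1e3:.3f} ms")
```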

GraphMend Eliminates PyTorch 2 Compilation Fragmentation

GraphMend addresses a key limitation in PyTorch 2's compilation pipeline: dynamic control flow and unsupported Python constructs often split a model into multiple FX graphs, forcing frequent switches between compiled graph execution and the slower eager mode. By analyzing and transforming source code before execution, GraphMend proactively restructures code to avoid these breaks. The system operates within the Jac compilation framework, using abstract syntax trees and control-flow graphs to identify and eliminate patterns likely to cause fragmentation. Experiments demonstrate that GraphMend removes nearly all fixable graph breaks, enabling larger, more efficient computational graphs and yielding latency reductions of up to 75% and throughput improvements of up to 8% on modern graphics processing units.
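GraphMend's actual analysis passes are not reproduced in this summary, but an AST-based pattern scan of the sort described might look like the following sketch; the flagged constructs, class name, and line-number reporting are hypothetical illustrations, not the paper's implementation.

```python
import ast

# Hypothetical set of break-prone built-in calls to flag.
FLAGGED_CALLS = {"print", "input"}

class BreakFinder(ast.NodeVisitor):
    """Walk a model's source AST and flag constructs that commonly
    cause FX graph breaks: Python branches and I/O calls."""

    def __init__(self):
        self.hits = []

    def visit_If(self, node):
        self.hits.append(("control-flow", node.lineno))
        self.generic_visit(node)

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id in FLAGGED_CALLS:
            self.hits.append(("python-io", node.lineno))
        self.generic_visit(node)

src = """
def forward(x):
    print(x.shape)
    if x.sum() > 0:
        return x * 2
    return x - 1
"""
finder = BreakFinder()
finder.visit(ast.parse(src))
print(finder.hits)  # [('python-io', 3), ('control-flow', 4)]
```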

GraphMend Eliminates Compilation Fragmentation for Speed

Researchers have developed GraphMend, a technique that improves the performance of PyTorch 2 programs by eliminating fragmentation during compilation. Current systems often encounter breaks in the compilation process due to dynamic control flow and standard Python input/output operations, forcing a switch to slower execution modes. GraphMend addresses this limitation by analyzing and transforming source code before it runs, restructuring these problematic elements into forms compatible with sustained graph compilation. This research demonstrates the effectiveness of high-level code transformation as a complement to existing just-in-time compilation techniques, offering a pathway to both improved usability and enhanced performance in deep learning frameworks.

👉 More information
🗞 GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
🧠 ArXiv: https://arxiv.org/abs/2509.16248

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Lensed Gravitational Waves Detected with 98% Accuracy Using Novel Network (January 2, 2026)

Decoherence Enables Information Critical Phases and Fractional Logical Qubit Recovery (January 2, 2026)

Spin Hydrodynamics Enables Consistent Theory for Relativistic Fluids with Rank-3 Tensor Angular Momentum (January 2, 2026)