MT4G Tool Reliably Auto-Discovers GPU Compute and Memory Topologies Using 50+ Microbenchmarks

Understanding the intricate architecture of modern GPUs is crucial for optimising performance in demanding fields like high-performance computing and artificial intelligence, yet detailed information has historically been difficult to obtain. Stepan Vanecek, Manuel Walter Mussbacher, Dominik Groessler, and Urvij Saroliya from the Technical University of Munich, along with Martin Schulz, now present MT4G, a new tool that automatically and reliably maps the compute and memory topologies of both NVIDIA and AMD GPUs. MT4G overcomes the limitations of existing methods by combining established APIs with a comprehensive suite of microbenchmarks and applying statistical analysis to reveal previously inaccessible details, such as cache sizes and bandwidths. This achievement provides a portable, vendor-agnostic solution for characterising modern GPU systems and has already demonstrated impact through improvements in performance modelling, bottleneck analysis, and dynamic resource partitioning workflows.

GPU Performance Bottleneck Analysis Using Microbenchmarks

The underlying study analyses GPU performance with the aim of identifying bottlenecks in modern graphics processing units from NVIDIA and AMD. Microbenchmarks are used to pinpoint whether a limit comes from memory bandwidth, computational capacity, or another architectural feature. Interpreting the results requires a solid understanding of GPU architecture, including its memory hierarchy, compute units, and interconnects. The research draws on key concepts such as the roofline model, which plots attainable performance against arithmetic intensity to separate memory-bound from compute-bound applications, and the STREAM benchmark, a standard for measuring sustained memory bandwidth.
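The roofline bound mentioned above can be written down in a few lines. The following is a minimal sketch, not code from MT4G or the paper; the peak compute and bandwidth figures describe a hypothetical device and are assumptions chosen only to make the arithmetic easy to follow.

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, arithmetic_intensity):
    """Roofline bound: the lower of the compute roof and the bandwidth roof.

    arithmetic_intensity is in FLOP per byte moved to or from memory.
    """
    return min(peak_gflops, peak_bw_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, peak_bw_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / peak_bw_gbs


# Hypothetical device (assumed numbers, not measurements of any real GPU).
PEAK_GFLOPS = 10_000.0
PEAK_BW_GBS = 1_000.0

# STREAM triad, a[i] = b[i] + s * c[i], in double precision:
# 2 FLOP per element, 2 loads + 1 store of 8 bytes each = 24 bytes.
triad_intensity = 2 / 24

bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, triad_intensity)
ridge = ridge_point(PEAK_GFLOPS, PEAK_BW_GBS)
kind = "memory-bound" if triad_intensity < ridge else "compute-bound"
print(f"triad: {bound:.1f} GFLOP/s attainable, {kind} (ridge at {ridge:.1f} FLOP/byte)")
```

On this assumed device the triad kernel sits far to the left of the ridge point, so its performance is capped by memory bandwidth rather than by the compute units. Measured bandwidths of the kind a topology tool reports would replace the assumed constants.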

Optimizing data access patterns to maximize bandwidth utilization, known as data coalescing, is also investigated. Furthermore, the study utilizes tools like GPUscout to locate data movement-related bottlenecks on GPUs, and employs profiling tools including NVIDIA Nsight Compute and Systems, and AMD ROCprofiler and ROCm SMI. The work also considers architectural representations, including extensions for quantum computing, and virtualization technologies like NVIDIA’s Multi-Instance GPU (MIG) and Virtual Compute Server.
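To illustrate why coalesced access matters, the sketch below counts how many aligned memory segments a group of threads touches for a given access stride. It is a simplified model under stated assumptions: 32 threads per group, 4-byte elements, and 128-byte segments are illustrative defaults, and real hardware transaction sizes differ between vendors and microarchitectures.

```python
def segments_touched(stride_elems, elem_bytes=4, group_size=32, segment_bytes=128):
    """Count distinct aligned segments read when thread t loads element t * stride."""
    segments = set()
    for tid in range(group_size):
        first_byte = tid * stride_elems * elem_bytes
        last_byte = first_byte + elem_bytes - 1
        # An element can straddle a segment boundary, so record every segment it covers.
        segments.update(range(first_byte // segment_bytes,
                              last_byte // segment_bytes + 1))
    return len(segments)


def transfer_efficiency(stride_elems, elem_bytes=4, group_size=32, segment_bytes=128):
    """Fraction of transferred bytes that the threads actually requested."""
    useful = group_size * elem_bytes
    moved = segments_touched(stride_elems, elem_bytes, group_size, segment_bytes) * segment_bytes
    return useful / moved


for stride in (1, 2, 32):
    print(stride, segments_touched(stride), transfer_efficiency(stride))
```

With a stride of one element the whole group is served by a single segment, while a stride of 32 elements forces one segment per thread, so only a small fraction of the moved bytes is used. The segment size in this model is exactly the kind of parameter that topology information such as the cache line size would supply.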

GPU Topology Discovery With Microbenchmarks

Scientists have developed MT4G, a new open-source tool that automatically discovers and reports on the complex topologies of modern GPUs, addressing a significant gap in the field of high-performance computing and artificial intelligence. Unlike existing methods, which rely on incomplete or vendor-specific information, MT4G combines established APIs with a suite of over 50 carefully designed microbenchmarks to reveal crucial details about GPU architecture, including cache sizes, bandwidths, and physical layouts. The tool applies statistical methods such as the Kolmogorov-Smirnov test to keep its findings reliable even when measurements are noisy or system configurations vary. Researchers validated MT4G across ten different GPUs from both NVIDIA and AMD, comparing its output against official documentation and reverse-engineering efforts.

The utility of MT4G was further confirmed through integration into three distinct workflows: GPU performance modeling, bottleneck analysis with GPUscout, and dynamic resource partitioning via sys-sage. These integrations highlight the tool’s ability to improve performance and optimise resource allocation for demanding HPC and AI workloads. The authors acknowledge limitations in benchmarking AMD L3 cache performance and plan further investigation into the CDNA3 microarchitecture. Future work includes extending bandwidth benchmarking to lower-level caches and incorporating metrics related to compute capability, such as floating-point operations per second for different data types. The team also intends to characterise specialised hardware engines like tensor cores, providing detailed insights into their capabilities. To maintain its relevance, MT4G will be validated on emerging GPU architectures, including NVIDIA Blackwell and AMD CDNA4, with the benchmark suite adapted to address new architectural features.

MT4G automatically discovers and reports GPU compute and memory topologies, addressing a significant gap in understanding modern high-performance computing systems. The tool is vendor-agnostic, works reliably on both NVIDIA and AMD GPUs, and provides detailed information about cache sizes, memory layouts, and the accessibility of memory to compute resources. It combines existing vendor APIs with a comprehensive suite of over 50 microbenchmarks, which lets it detect configurations and extract topological attributes that standard interfaces do not expose. The team implemented a variant of the Kolmogorov-Smirnov test to differentiate genuine pattern changes from measurement outliers, ensuring reliable results across diverse GPU architectures.
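The idea of separating a genuine pattern change from an outlier can be sketched with the standard two-sample Kolmogorov-Smirnov statistic. This is not the variant the authors implemented, which the article does not describe; the sliding-window scheme, the `window` and `threshold` parameters, and the latency values below are all assumptions made for illustration.

```python
import bisect


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum gap is always reached at a sample value
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def find_change_point(samples, window=8, threshold=0.9):
    """Return the index where the windows before and after differ most.

    A single outlier shifts only one value in a window, so the statistic
    stays small; a real change shifts the whole distribution and drives it
    towards 1. Returns None if no split exceeds the threshold.
    """
    best_index, best_d = None, threshold
    for i in range(window, len(samples) - window + 1):
        d = ks_statistic(samples[i - window:i], samples[i:i + window])
        if d > best_d:
            best_index, best_d = i, d
    return best_index


# Synthetic latencies in cycles (made up): fast accesses with one outlier
# at index 5, then slower accesses starting at index 16.
fast = [30, 31, 29, 30, 32, 400, 30, 31, 29, 30, 31, 30, 29, 32, 30, 31]
slow = [200, 201, 199, 202, 200, 198, 201, 200, 199, 200, 202, 201]
latencies = fast + slow

print(find_change_point(latencies))  # -> 16
```

The spike at index 5 is ignored because it moves only one sample in its window, whereas the step at index 16 separates the two windows completely. In a cache-size microbenchmark, such a step in load latency as the working set grows would mark the boundary of a cache level.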

Experiments demonstrate MT4G’s functionality on all recent NVIDIA microarchitectures from Pascal onward, and all AMD CDNA GPUs, delivering both human-readable and machine-readable output. Measurements confirm that MT4G accurately reports critical properties such as cache line size, load latency, and memory sizes, providing a detailed map of the GPU’s internal organization. Researchers showcased MT4G’s versatility through three use cases: GPU performance modeling, bottleneck analysis with GPUscout, and dynamic resource partitioning. In these scenarios, MT4G provided essential topological information previously unavailable from a single source, supporting tasks in performance analysis and tuning. The tool accurately characterizes GPU architectures featuring tens of thousands of compute cores, organized hierarchically into Streaming Multiprocessors or Compute Units, and delivers a unified reporting mechanism for both NVIDIA and AMD GPUs.

👉 More information
🗞 MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies
🧠 ArXiv: https://arxiv.org/abs/2511.05958

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
