Researchers are tackling the critical challenge of understanding performance bottlenecks within large language model (LLM) inference engines. Bohua Zou, Debayan Roy, and Dhimankumar Yogesh Airao from Huawei Hilbert Research Center, alongside colleagues including Weihao Xu and Binqi Sun from the Technical University of Munich, present ProfInfer, a fine-grained profiling framework built on extended Berkeley Packet Filter (eBPF) technology. This work is significant because current LLM inference systems lack the operator-level visibility common in other engines, hindering optimisation and resource management; ProfInfer dynamically probes runtime functions without code modification, offering rich visualisations of operator behaviour and hardware trends with minimal overhead (less than 4%), finally making LLM inference transparent and diagnosable.
ProfInfer Reveals Key LLM Inference Bottlenecks
Scientists have developed a novel profiling framework, ProfInfer, to address the critical need for real-time performance understanding in large language model (LLM) inference engines. Unlike general-purpose engines such as ONNX Runtime, current LLM inference systems lack operator-level visibility, often leaving developers unable to pinpoint performance bottlenecks. The research team tackled this challenge by creating a fine-grained, non-intrusive profiling system, demonstrated on the llama.cpp runtime but adaptable to similar architectures. Its visualisations expose the behaviour of dense inference, Mixture-of-Experts routing, and operator offloading in practical scenarios. The team achieved less than 4% runtime overhead while maintaining high profiling fidelity, effectively making LLM inference transparent and diagnosable. This breakthrough transforms performance profiling into a practical tool for optimisation, scheduling, and resource-aware deployment of LLMs, particularly on constrained devices.
The study establishes a system capable of capturing every forward pass and operator invocation, including tensor dimensions and operator types, alongside the dynamically constructed computational DAG. ProfInfer ensures non-intrusive instrumentation through eBPF probes, enabling easy deployment on Linux-based mobile and edge operating systems. Furthermore, the framework supports integration with hardware counters, collecting performance monitoring counter (PMC) data such as cache misses and memory accesses, and presenting it through intuitive timeline views, DAG visualisations, and per-operator plots. Researchers implemented ProfInfer over the llama.cpp inference engine, demonstrating its ability to analyse compute and memory bottlenecks, KV-cache effects, workload interference, and backend performance divergence. The work opens new avenues for understanding how quantization, KV-cache reuse, and accelerator offloading affect end-to-end latency and efficiency on-device. By correlating low-level hardware metrics with high-level operator semantics, ProfInfer gives developers the observability needed to optimise LLM performance in challenging mobile and edge environments.
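To make the non-intrusive probing concrete, the sketch below shows how per-call latencies could be captured from an unmodified llama.cpp binary using eBPF user-space probes via the BCC Python bindings. The binary path, the `ggml_graph_compute` symbol, and the event layout are illustrative assumptions, not ProfInfer’s actual probe points or trace format.

```python
# Minimal BCC sketch: attach a uprobe/uretprobe pair to a llama.cpp binary
# without modifying or recompiling it. Binary path and symbol name are
# illustrative assumptions, not ProfInfer's actual probe points.
from bcc import BPF

BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(entry_ts, u32, u64);          // thread id -> entry timestamp

struct event_t {
    u32 tid;
    u64 duration_ns;
};
BPF_PERF_OUTPUT(events);               // per-call latency records

int on_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();   // low 32 bits = thread id
    u64 ts = bpf_ktime_get_ns();
    entry_ts.update(&tid, &ts);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = entry_ts.lookup(&tid);
    if (!tsp)
        return 0;
    struct event_t ev = {};
    ev.tid = tid;
    ev.duration_ns = bpf_ktime_get_ns() - *tsp;
    entry_ts.delete(&tid);
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=BPF_PROGRAM)
BINARY = "/usr/local/bin/llama-cli"   # assumed install path
SYMBOL = "ggml_graph_compute"         # assumed probe point
b.attach_uprobe(name=BINARY, sym=SYMBOL, fn_name="on_entry")
b.attach_uretprobe(name=BINARY, sym=SYMBOL, fn_name="on_return")

def handle(cpu, data, size):
    ev = b["events"].event(data)
    print(f"tid={ev.tid} graph compute took {ev.duration_ns / 1e6:.3f} ms")

b["events"].open_perf_buffer(handle)
while True:
    b.perf_buffer_poll()
```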
eBPF Profiling of LLM Inference Engines
Scientists engineered a fine-grained, non-intrusive profiling framework to analyse modern Large Language Model (LLM) inference engines, exemplified by llama.cpp but designed to be applicable to similar runtimes. This approach enables detailed observation of operator behaviour within LLM inference, addressing a critical gap in existing tools, which often lack operator-level visibility. Researchers collected trace data using eBPF probes and compiled it into rich visualisations encompassing operators, graphs, timelines, and hardware counter trends.
These visualisations expose the behaviour of dense inference, Mixture-of-Experts routing, and operator offloading in practical scenarios, providing insights previously unavailable to developers. The system delivers high profiling fidelity while maintaining a runtime overhead below 4%, a crucial property for practical deployment and continuous monitoring. In the experiments, probes were strategically inserted into key functions within the llama.cpp runtime, capturing data on execution time, memory access patterns, and hardware resource utilisation. The team harnessed eBPF to avoid the performance penalties associated with traditional profiling methods such as code instrumentation or sampling.
This was achieved by leveraging eBPF’s ability to run kernel-level code with minimal overhead, allowing for the collection of detailed performance data without significantly impacting the LLM’s inference speed. Data collection involved capturing timestamps, function arguments, and return values for each probed function, creating a comprehensive trace of the LLM’s execution flow. The resulting traces were then processed and visualised using custom tools, enabling researchers to identify performance bottlenecks and optimise resource allocation. This methodology achieves a level of transparency and diagnosability previously lacking in LLM inference systems, turning performance profiling into a practical tool. The framework’s ability to pinpoint whether a workload is memory-bound or compute-bound, for example, is a direct result of the detailed hardware counter trends captured by the eBPF probes. By providing this granular level of insight, the study’s innovative approach facilitates optimisation, scheduling, and resource-aware deployment of LLMs, ultimately paving the way for more efficient and scalable AI applications.
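The hardware-counter side can be sketched in a similar, hedged way: the snippet below samples last-level-cache misses with a perf event routed to an eBPF handler, in the spirit of the counter trends described above. The counter choice, sampling period, and per-CPU accounting are illustrative assumptions rather than ProfInfer’s configuration.

```python
# Minimal BCC sketch of hardware-counter sampling: estimate LLC misses per
# CPU while an inference process runs. Sample period and single-PID filter
# are illustrative choices, not ProfInfer's configuration.
from bcc import BPF, PerfType, PerfHWConfig
import sys
import time

BPF_PROGRAM = r"""
#include <linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>

BPF_ARRAY(miss_count, u64, 128);       // one slot per CPU

int on_cache_miss(struct bpf_perf_event_data *ctx) {
    int cpu = bpf_get_smp_processor_id();
    u64 *val = miss_count.lookup(&cpu);
    if (val)
        (*val) += ctx->sample_period;  // each sample stands for this many misses
    return 0;
}
"""

target_pid = int(sys.argv[1]) if len(sys.argv) > 1 else -1  # -1: all processes

b = BPF(text=BPF_PROGRAM)
b.attach_perf_event(ev_type=PerfType.HARDWARE,
                    ev_config=PerfHWConfig.CACHE_MISSES,
                    fn_name="on_cache_miss",
                    sample_period=10000,   # illustrative sampling period
                    pid=target_pid)

print("sampling LLC misses, Ctrl-C to stop")
try:
    while True:
        time.sleep(1)
        for cpu, count in enumerate(b["miss_count"].values()):
            if count.value:
                print(f"cpu{cpu}: ~{count.value} cache misses")
except KeyboardInterrupt:
    pass
```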
ProfInfer delivers low-overhead LLM performance profiling for developers
Scientists have developed ProfInfer, a novel eBPF-based profiling framework designed to provide fine-grained, non-intrusive performance analysis of large language model (LLM) inference engines. The research team successfully implemented ProfInfer over the llama.cpp inference engine, achieving a runtime overhead of less than 4% while maintaining high profiling fidelity, a crucial step towards making LLM inference transparent and diagnosable. This breakthrough delivers a practical tool for optimisation, scheduling, and resource-aware deployment of LLMs in diverse environments. In operation, ProfInfer dynamically attaches probes to runtime functions across multiple layers without modifying or recompiling the source code.
The system continuously gathers and logs data from these probes in kernel space, selectively disabling unnecessary probes to minimise overhead. Offline analysis parses the collected results, identifying computations and backend types, then presents structural and performance metrics at the token, graph, and operator levels. Measurements confirm that ProfInfer enables comprehensive visibility across the entire inference pipeline, covering forward passes, computation graphs, operator execution, and processor threads, while collecting a rich set of performance metrics through hardware counters. The team measured performance using key metrics including time to first token (TTFT) in the prefill stage and tokens per second (TPS) during the decoding stage.
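As a rough illustration of the offline analysis step, the following sketch derives TTFT and decode TPS from a list of per-token timestamps; the event schema here is invented for the example and is not ProfInfer’s trace format.

```python
# Toy offline analysis: derive time-to-first-token (TTFT) and decode
# tokens-per-second (TPS) from per-token timestamps. The record format
# (one "request_start" event followed by "token" events, timestamps in
# nanoseconds) is an illustrative assumption, not ProfInfer's trace schema.
from dataclasses import dataclass

@dataclass
class TraceEvent:
    kind: str      # "request_start" or "token"
    ts_ns: int     # monotonic timestamp in nanoseconds

def summarize(events: list[TraceEvent]) -> dict:
    start = next(e.ts_ns for e in events if e.kind == "request_start")
    token_ts = sorted(e.ts_ns for e in events if e.kind == "token")
    if not token_ts:
        return {"ttft_ms": None, "decode_tps": None}

    ttft_ms = (token_ts[0] - start) / 1e6           # prefill latency
    decode_span_s = (token_ts[-1] - token_ts[0]) / 1e9
    decode_tps = ((len(token_ts) - 1) / decode_span_s
                  if decode_span_s > 0 else None)   # steady-state decoding rate
    return {"ttft_ms": ttft_ms, "decode_tps": decode_tps}

# Example: 1 ms prefill, then one token every 20 ms.
events = [TraceEvent("request_start", 0)] + [
    TraceEvent("token", 1_000_000 + i * 20_000_000) for i in range(16)
]
print(summarize(events))   # TTFT = 1.0 ms, decode TPS = 50.0
```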
Data shows that ProfInfer facilitates analysis of compute and memory bottlenecks, KV-cache effects, workload interference, and backend performance divergence. Specifically, the framework exposes how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice, providing insights into the dynamic characteristics of MoE models. Tests show that the framework’s intuitive performance analytics, including timeline views, DAG visualisations, and operator-level plots, effectively correlate model structure with hardware behaviour. Researchers demonstrated the framework’s ability to analyse the impact of sparsity within LLMs, including dynamic pruning and mixture-of-experts architectures, on on-device inference performance.
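The operator-level plots rest on a simple aggregation of per-invocation durations by operator type, which can be sketched as follows; the tuple-based trace layout and the ggml-style operator names are assumptions made for illustration.

```python
# Sketch of the aggregation behind per-operator plots: group per-invocation
# durations by operator type and report totals, means, and call counts.
# The (op_type, duration_ns) tuple layout is an assumed trace format.
from collections import defaultdict

def per_operator_summary(trace: list[tuple[str, int]]) -> list[tuple[str, int, float, int]]:
    buckets: dict[str, list[int]] = defaultdict(list)
    for op_type, duration_ns in trace:
        buckets[op_type].append(duration_ns)
    rows = [(op, sum(d), sum(d) / len(d), len(d)) for op, d in buckets.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)   # heaviest operators first

# Illustrative trace with ggml-style operator names.
trace = [("MUL_MAT", 420_000), ("MUL_MAT", 410_000),
         ("SOFT_MAX", 35_000), ("RMS_NORM", 12_000)]
for op, total, mean, calls in per_operator_summary(trace):
    print(f"{op:<10} total={total/1e6:7.3f} ms  mean={mean/1e3:7.1f} us  calls={calls}")
```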
The work highlights how activating only a portion of weights online can accelerate inference, while also identifying potential runtime memory and disk I/O overheads. Furthermore, the study showcases ProfInfer’s capacity to assess the performance fluctuations introduced by dynamic workloads, such as those encountered in MoE models where the frequency of expert activation impacts runtime performance. This system dynamically attaches probes to runtime functions without requiring source code modification or recompilation, offering operator-level visibility currently lacking in many LLM inference systems. The framework collects traces and visualises them as operator graphs, timelines, and hardware counter trends, revealing the behaviour of dense inference, Mixture-of-Experts routing, and operator offloading in practical scenarios. Researchers demonstrated the framework’s effectiveness with less than 4% runtime overhead and high fidelity, enabling transparent and diagnosable LLM inference.
This allows for practical performance profiling, facilitating optimisation, scheduling, and resource-aware deployment of these models. Analysis using the framework revealed that decoding speed in dense models correlates linearly with hyperparameters, and that utilising libraries like BLIS can double prefill speed by improving cache locality on CPUs. Furthermore, the study showed that selective operator offloading to GPUs, based on tensor dimension, can potentially accelerate inference, although performance drops for large matrix sizes like those found in LM Heads. The authors acknowledge that the framework’s current implementation has limitations regarding the size of OpenCL kernels, preventing offloading of certain layers.
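A dimension-based offload policy of the kind described above can be illustrated with a toy decision function; the element-count thresholds below are invented, standing in for limits that would in practice be derived from profiling data and backend kernel constraints.

```python
# Toy illustration of a dimension-based offload decision: route a matmul to
# the GPU only when its size falls inside a band where transfer cost is
# amortised but the kernel still fits backend limits. Thresholds are invented.
def should_offload_matmul(rows: int, cols: int,
                          min_elems: int = 1 << 20,
                          max_elems: int = 1 << 27) -> bool:
    elems = rows * cols
    if elems < min_elems:
        return False   # too small: host-device transfer would dominate
    if elems > max_elems:
        return False   # too large (e.g. LM-head-sized): observed to slow down
    return True

print(should_offload_matmul(4096, 4096))      # mid-sized projection -> offload
print(should_offload_matmul(4096, 128_000))   # LM-head-sized matrix -> keep on CPU
```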
Future work could focus on extending the framework to support a wider range of LLM inference engines and hardware accelerators. The findings are significant because they provide a practical tool for understanding and optimising the performance of LLM inference, which is crucial as these models become increasingly prevalent. By exposing detailed performance characteristics, the framework empowers developers to make informed decisions about model deployment and resource allocation, ultimately leading to more efficient and scalable LLM applications.
👉 More information
🗞 ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
🧠 ArXiv: https://arxiv.org/abs/2601.20755
