RadixMLP Achieves Faster Causal Transformer Inference by Eliminating Redundancy on MS MARCO v1.1

Researchers have identified a significant inefficiency in running causal transformer models: redundant computation when processing batches containing sequences with shared prefixes. Michael Feil from Baseten, Julius Lipp (working in an independent research capacity), and David Zhou from Streamlit, alongside their colleagues, demonstrate this problem and introduce RadixMLP, a technique to tackle it head-on. RadixMLP exploits the structure of MLPs and other position-wise operations to deduplicate computation across shared prefixes, effectively compressing the representation and accelerating inference. Benchmarks using Qwen3 models on the MS MARCO v1.1 dataset reveal speedups of 1.44× to 1.59× in realistic reranking workloads, a substantial improvement for applications that depend on efficient transformer processing.

RadixMLP accelerates causal transformers via prefix sharing, achieving 1.44–1.59× speedups

Scientists have developed RadixMLP, a novel technique to dramatically accelerate batch inference for causal transformer models by eliminating redundant computations. The research addresses a key inefficiency in standard inference engines, which independently process sequences even when they share common prefixes, such as system prompts or shared queries, leading to repeated calculations of identical activations. RadixMLP exploits the position-wise nature of crucial transformer components like MLPs, LayerNorms, linear projections, and embeddings to achieve this optimisation. This innovative approach dynamically maps batches to a prefix trie, effectively compressing shared segments for efficient position-wise computation and scattering results only at attention boundaries.
The team achieved a stateless implementation, meaning RadixMLP operates within a single forward pass, avoiding the complexities of persistent state management required by KV caching methods. This is particularly advantageous for batch workloads where maintaining caches can be challenging. RadixMLP functions by recognising that tokens with identical causal history, those following the same prefix path, require identical computations for position-wise operations. By gathering these shared segments into a compact representation, the technique significantly reduces arithmetic inefficiency. Experiments demonstrate that RadixMLP is also compatible with autograd, potentially opening avenues for further performance gains in training systems.
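
To make the mechanism concrete, here is a minimal sketch, in plain Python, of the kind of CPU-side bookkeeping described above: mapping a batch of token-id sequences onto a prefix trie and deriving indices that route each original token to a single compact slot. The function name and data layout are illustrative assumptions, not the actual TEI/Candle implementation.

```python
# Minimal sketch (assumed names, not the authors' code): build a prefix trie
# over a batch of token-id sequences and derive gather/scatter indices so that
# positions sharing the same causal history are computed only once.

def build_radix_indices(batch):
    """batch: list of token-id lists.
    Returns (compact_ids, scatter_index) where compact_ids holds one token id
    per unique (prefix-path, token) trie node, and scatter_index[i][j] maps
    token j of sequence i to its compact slot."""
    trie = {}              # (parent_node, token_id) -> compact slot
    compact_ids = []       # token id stored at each compact slot
    scatter_index = []
    for seq in batch:
        node = -1          # virtual root of the trie
        row = []
        for tok in seq:
            key = (node, tok)
            if key not in trie:            # first time this prefix path is seen
                trie[key] = len(compact_ids)
                compact_ids.append(tok)
            node = trie[key]
            row.append(node)
        scatter_index.append(row)
    return compact_ids, scatter_index

# Two sequences sharing a 3-token prefix compact to 7 slots instead of 10.
compact, idx = build_radix_indices([[1, 2, 3, 4, 5], [1, 2, 3, 9, 7]])
assert len(compact) == 7
```

Position-wise operations (embeddings, LayerNorms, linear projections, MLPs) can then run on the compact slots, and the scatter index expands results back to the per-sequence layout wherever attention needs it.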

The work delivers substantial speedups in end-to-end serving benchmarks using Qwen3 models ranging from 0.6B to 8B parameters. Specifically, RadixMLP achieves speedups of 1.44× to 1.59× in realistic reranking workloads on the MS MARCO v1.1 dataset. Synthetic benchmarks with longer shared prefixes showed even larger gains, with speedups reaching up to 5×. The research establishes a new paradigm for efficient transformer inference, moving beyond padding and embracing ragged layouts to maximise GPU utilisation. The study also describes an efficient gather/scatter mechanism coupled with CPU-side index pre-computation, ensuring practical implementation and integration. RadixMLP has been open-sourced and upstreamed into both TEI and Candle, facilitating wider adoption and further development. By eliminating redundant computations and streamlining the inference process, this work opens possibilities for deploying large language models more efficiently and cost-effectively, paving the way for faster and more accessible AI applications.

RadixMLP enables efficient causal transformer batch inference

Scientists pioneered RadixMLP, a novel technique to accelerate batch inference for causal transformer models by eliminating redundant computations within shared sequence prefixes. The study addresses inefficiencies in standard inference where identical MLP activations are repeatedly calculated for shared prefixes across sequences, such as system prompts or few-shot examples. RadixMLP dynamically maps batches to a prefix trie, effectively compressing shared segments for position-wise computation and scattering results only at attention boundaries, all within a single forward pass. This stateless approach significantly reduces computational overhead without requiring modifications to the model architecture.
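
As a rough illustration of this forward-pass pattern, the following PyTorch sketch (module structure and names are assumptions for illustration; the actual implementation lives in TEI and Candle) runs the position-wise blocks on the compact token set and expands hidden states to the full per-sequence layout only around the attention call, where each token's full causal history is required.

```python
import torch
import torch.nn as nn

class RadixStyleLayer(nn.Module):
    """Illustrative transformer layer: position-wise work on compact slots,
    expansion to the ragged (batch, seq) layout only around attention."""
    def __init__(self, d_model: int, d_ff: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, compact_h, scatter_index, causal_mask):
        # LayerNorms, projections, and the MLP are position-wise, so they run
        # once per unique prefix position (the compact slots).
        h = compact_h + self._attend(self.norm1(compact_h), scatter_index, causal_mask)
        return h + self.mlp(self.norm2(h))

    def _attend(self, h, scatter_index, causal_mask):
        # Scatter compact slots out to the full per-sequence layout: attention
        # needs every token's causal history.
        full = h[scatter_index]                                   # (B, S, d_model)
        out, _ = self.attn(full, full, full, attn_mask=causal_mask)
        # Tokens on the same prefix path see identical causal history, so their
        # attention outputs agree and can be gathered back into one compact slot.
        compact = torch.zeros_like(h)
        compact.index_copy_(0, scatter_index.reshape(-1),
                            out.reshape(-1, out.shape[-1]))
        return compact

# Toy usage: two length-4 sequences sharing a 2-token prefix -> 6 compact slots.
layer = RadixStyleLayer(d_model=32, d_ff=64)
scatter_index = torch.tensor([[0, 1, 2, 3], [0, 1, 4, 5]])
compact_h = torch.randn(6, 32)
causal_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
print(layer(compact_h, scatter_index, causal_mask).shape)        # torch.Size([6, 32])
```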

Researchers engineered a system that leverages the position-wise nature of MLPs, LayerNorms, linear projections, and embeddings to achieve this deduplication. Experiments employed an NVIDIA H100 (80GB) GPU with FlashAttention-2 to evaluate RadixMLP’s performance across three variants of the Qwen3 embedding models: Qwen3-0.6B, Qwen3-4B, and Qwen3-8B, all running in float16 precision. The team utilized the text-embeddings-inference (TEI) framework with a candle-cuda backend and vllm 0.13.0, configuring a block size of 32 for consistent measurements. The study constructed three distinct inference benchmarks to rigorously assess RadixMLP’s efficacy.

A synthetic benchmark varied prefix and suffix lengths within batches of 32 sequences, simulating query-style sharing (prefixes of 32–256 tokens) and instruction-style sharing (prefixes of 512–2048 tokens). A real-world benchmark used the MS MARCO v1.1 validation split, pairing queries with passage options using the Qwen3Reranker template, resulting in sequences of 75–200 tokens. Finally, an augmented MS MARCO v1.1 dataset with a flipped query-document order and shortened system prompt (65–200 tokens) was used to explore scenarios with reduced prefix sharing. Results show that RadixMLP achieves speedups of 1.44×, 1.56×, and 1.59× on the Qwen3-0.6B, Qwen3-4B, and Qwen3-8B models respectively, when applied to the MS MARCO v1.1 dataset with a maximum batch size of 65,536 tokens. Synthetic benchmarks revealed that longer shared prefixes (up to 2048 tokens) yield greater speedups, with the Qwen3-8B model reaching up to a 5.0× acceleration, highlighting the method's scalability and potential for substantial gains in realistic reranking workloads. Trie construction and index computation are performed asynchronously by the CPU scheduler, so all reported timings reflect pure GPU execution.

RadixMLP boosts transformer inference speed significantly

Scientists achieved significant speedups in causal transformer inference through a novel technique called RadixMLP, demonstrating a marked improvement in batch-processing efficiency. The research team measured a 1.44–1.59× speedup in realistic reranking workloads using MS MARCO v1.1 and Qwen3 models ranging from 0.6B to 8B parameters. Experiments show that RadixMLP dynamically maps batches to a prefix trie, gathering shared segments into a compressed representation for position-wise computation and scattering results back only at attention boundaries. This approach eliminates redundant computation of MLP activations for shared prefixes within batches, substantially improving performance.

The team measured up to 5× speedups on synthetic benchmarks featuring longer shared prefixes, highlighting how the technique scales with increased redundancy. RadixMLP operates within a single forward pass and is entirely stateless, a critical advantage for batch workloads where persistent state management can be challenging. Tests confirm that RadixMLP is compatible with autograd, potentially opening avenues for further performance gains in training systems. The technique exploits the position-wise nature of MLPs, LayerNorms, linear projections, and embeddings to achieve these gains, focusing computation only where it is needed.
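
Both the equivalence and the autograd claims are straightforward to check on a toy example. The following self-contained sketch (arbitrary shapes, standard PyTorch modules, not the authors' code) verifies that running a position-wise block once per unique slot and indexing the result back matches dense per-token computation, and that gradients flow through the indexing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

# Two length-4 sequences sharing a 2-token prefix: 6 unique slots for 8 tokens.
scatter_index = torch.tensor([[0, 1, 2, 3], [0, 1, 4, 5]])
compact = torch.randn(6, 16, requires_grad=True)
dense = compact[scatter_index]                      # (2, 4, 16) per-token layout

dense_out = mlp(dense)                              # compute every token
compact_out = mlp(compact)[scatter_index]           # compute unique slots, then expand

assert torch.allclose(dense_out, compact_out, atol=1e-6)

compact_out.sum().backward()                        # indexing is differentiable
print(compact.grad.shape)                           # torch.Size([6, 16])
```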

Results demonstrate that RadixMLP handles ragged inference efficiently by scattering data only at attention operations and gathering it afterwards, ensuring causal correctness is maintained. The technique's efficiency stems from recognising that tokens sharing identical causal history require identical position-wise computations. This yields a substantial reduction in arithmetic inefficiency, particularly during prefill workloads, where position-wise components account for a large fraction of FLOPs. The MLP block, whose three matrix multiplications contribute approximately 6·d·d_int FLOPs per token (where d is the hidden dimension and d_int the MLP intermediate dimension), is a major computational cost addressed by RadixMLP.
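
For a rough sense of scale, the arithmetic below uses assumed dimensions (illustrative values, not figures from the paper) to show how quickly the three-matmul MLP dominates per-token cost:

```python
# Back-of-the-envelope cost of a gated (three-matmul) MLP block.
# d and d_int are assumed example dimensions, not values from the paper.
d, d_int = 4096, 11008
flops_per_token = 3 * 2 * d * d_int    # 3 matmuls, 2 FLOPs per multiply-accumulate
print(f"{flops_per_token / 1e9:.2f} GFLOPs per token per layer")   # ~0.27
```

Every shared-prefix token that is deduplicated saves this cost in each layer, which is why long shared prefixes translate directly into large prefill speedups.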

Furthermore, the study details the implementation of efficient gather/scatter kernels and CPU-side index pre-computation, enhancing the practicality of RadixMLP. The work introduces a stateless approach to prefix deduplication, offering an alternative to persistent KV caches while maintaining compute reuse within a single forward pass. The team has open-sourced RadixMLP, integrating it into TEI and Candle, facilitating wider adoption and further research. This advancement promises to accelerate inference speeds for a range of applications, including embedding models, cross-encoder rerankers, and classification systems.

RadixMLP accelerates causal transformers via prefix sharing, achieving up to 5× speedups on synthetic workloads

Scientists have developed RadixMLP, a novel technique designed to accelerate batch inference for causal transformer models. This approach addresses redundancy in processing sequences with shared prefixes, common in tasks like system prompts, few-shot examples, and shared queries, by dynamically mapping batches to a prefix trie. RadixMLP compresses shared segments for efficient position-wise computation, scattering results only at attention boundaries, and operates within a single forward pass without requiring statefulness. Researchers demonstrated significant speed improvements using RadixMLP with Qwen3 models ranging from 0.6 to 8 billion parameters.

End-to-end serving benchmarks on the MS MARCO v1.1 dataset showed speedups of 1.44× to 1.59× in realistic reranking workloads, and even greater gains were observed in synthetic benchmarks featuring longer shared prefixes. Ablation studies confirmed that RadixMLP maintains gradient correctness for the modified position-wise components, with any observed differences in backward passes primarily attributable to factors such as precision and attention kernel implementations rather than the compaction itself. The authors acknowledge that further investigation is needed to assess the technique's impact on large-scale training runs. Future work could explore the effects of RadixMLP in more extensive training scenarios and its compatibility with a wider range of transformer architectures. The current findings nonetheless establish RadixMLP as a promising method for enhancing the efficiency of causal transformer inference, particularly in applications involving substantial prefix sharing.

👉 More information
🗞 RadixMLP — Intra-batch Deduplication for Causal Transformers
🧠 ArXiv: https://arxiv.org/abs/2601.15013

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
