Scientists are tackling the computational burden of attention mechanisms, a core component of many modern artificial intelligence architectures, which currently scale quadratically with input sequence length. Tobias Schröder from Imperial College London and Lester Mackey from Microsoft Research New England, alongside colleagues, present WildCat, a novel approach that achieves high accuracy at a significantly reduced computational cost. WildCat bypasses quadratic costs by focusing attention on a small, weighted coreset selected using a fast and spectrally-accurate subsampling algorithm. This research is significant because it demonstrates near-linear time complexity while maintaining super-polynomial error decay, offering a substantial improvement over existing approximations that either lack error guarantees or demand quadratic runtime for comparable fidelity. The team supports this advance with a GPU-optimized PyTorch implementation and benchmark experiments across image generation, classification, and language modelling.
Coreset construction via randomised Cholesky decomposition enables efficient attention approximation
Scientists have unveiled WildCat, a new technique for compressing the attention mechanism used in neural networks, achieving both high accuracy and low computational cost. Attention, a crucial component of modern artificial intelligence architectures, is often limited by its quadratic scaling with input sequence length, creating a significant barrier to deployment.
WildCat overcomes this limitation by focusing computations on a small, weighted subset of the input data, termed a coreset. The coreset is selected using a spectrally accurate subsampling algorithm known as randomly pivoted Cholesky, and its elements are then weighted optimally to minimise reconstruction error.
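The selection step can be illustrated with a minimal NumPy sketch of randomly pivoted Cholesky, which draws each pivot with probability proportional to the current residual diagonal of the kernel matrix. This is an illustrative reimplementation under our own interface (the function name and kernel-access convention are assumptions), not the authors' parallelised GPU version:

```python
import numpy as np

def rp_cholesky(kernel_diag, kernel_col, r, rng=None):
    """Randomly pivoted Cholesky: pick r landmark indices from an
    implicit n x n PSD kernel matrix, sampling each pivot with
    probability proportional to the current residual diagonal.

    kernel_diag : (n,) array, diagonal of the kernel matrix
    kernel_col  : function i -> (n,) array, the i-th kernel column
    Returns the pivot indices and the n x r factor F, so that
    F @ F.T approximates the kernel matrix.
    """
    rng = np.random.default_rng(rng)
    n = len(kernel_diag)
    d = kernel_diag.astype(float).copy()          # residual diagonal
    F = np.zeros((n, r))
    pivots = []
    for j in range(r):
        p = rng.choice(n, p=d / d.sum())          # sample ∝ residual diag
        g = kernel_col(p) - F[:, :j] @ F[p, :j]   # residual column at p
        F[:, j] = g / np.sqrt(g[p])
        d = np.maximum(d - F[:, j] ** 2, 0.0)     # shrink residuals
        pivots.append(p)
    return np.array(pivots), F
```

Because pivots are drawn in proportion to the residual diagonal, directions already captured by earlier pivots are unlikely to be picked again, which is what gives the method its spectral accuracy.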
Remarkably, for bounded inputs, WildCat approximates exact attention with super-polynomial error decay while maintaining near-linear runtime. Prior practical approximations have typically lacked rigorous error guarantees or required quadratic runtime to achieve comparable fidelity. The researchers coupled this theoretical advance with a GPU-optimised implementation in PyTorch, facilitating practical application and benchmarking.
A suite of experiments demonstrates the benefits of WildCat across diverse tasks including image generation, image classification, and the compression of key-value caches in language models. Specifically, WildCat avoids the quadratic computational cost of standard attention by attending to only a small weighted coreset of r input keys.
These keys are selected in O(nr²) time using a parallelised randomly pivoted Cholesky algorithm and reweighted optimally in O(nrd) time to minimise attention reconstruction error. Consequently, WildCat achieves near-linear runtime whenever r is substantially smaller than n. The selection rule and optimal reweighting enable WildCat to approximate the attention output with near-optimal low-rank approximation error.
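The select-then-reweight pipeline can be sketched as follows. This is a hedged illustration: `coreset_attention` is our own name, the coreset indices are taken as given, and the Nyström-style correction shown here is one natural way to realise an optimal O(nrd) reweighting; the paper's exact weighting step may differ.

```python
import numpy as np

def coreset_attention(Q, K, V, S, reg=1e-8):
    """Approximate softmax attention by attending only over a coreset S
    of r keys, with values reweighted Nystrom-style to reduce the
    reconstruction error of exp(Q K^T) V.

    Illustrative sketch only -- not the paper's exact Algorithm 4.
    Q, K, V: (n, d) arrays; S: index array of size r.
    """
    KS = K[S]                                  # (r, d) coreset keys
    A_nS = np.exp(Q @ KS.T)                    # (n, r) query-vs-coreset scores
    A_Sn = np.exp(KS @ K.T)                    # (r, n) coreset-vs-all-keys
    A_SS = np.exp(KS @ KS.T)                   # (r, r) coreset Gram block
    # Nystrom correction: exp(Q K^T) ~= A_nS @ inv(A_SS) @ A_Sn, so fold
    # inv(A_SS) @ A_Sn into reweighted values and normalizers -- O(nrd).
    M = np.linalg.solve(A_SS + reg * np.eye(len(S)), A_Sn)   # (r, n)
    V_tilde = M @ V                            # (r, d) reweighted values
    z_tilde = M @ np.ones(len(K))              # (r,) reweighted row sums
    out = A_nS @ V_tilde                       # (n, d) unnormalised output
    denom = A_nS @ z_tilde                     # (n,) softmax normaliser
    return out / denom[:, None]
```

Note that no n × n matrix is ever formed: every intermediate is n × r, r × n, or r × r, which is where the near-linear runtime comes from when r ≪ n.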
Given bounded attention inputs, a near-constant coreset size r is sufficient for super-polynomial error decay of O(n^(−√(log log n))). This represents a significant improvement over existing methods, which either require quadratic time for fast error decay or only guarantee slow, near-constant error decay in near-linear time.
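To see why this rate beats every fixed polynomial rate, rewrite both as exponentials in log n (a one-line derivation using nothing beyond the stated rate):

```latex
n^{-\sqrt{\log\log n}} \;=\; e^{-(\log n)\,\sqrt{\log\log n}}
\;\le\; e^{-c\,\log n} \;=\; n^{-c}
\qquad \text{once } \sqrt{\log\log n} \ge c ,
```

which holds for all sufficiently large n, for every fixed constant c > 0. Since the exponent √(log log n) grows without bound, the decay eventually outpaces any polynomial rate n^(−c).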
Further, the work demonstrates that WildCat can deliver super-polynomial error decay in near-linear time even when input entries or dimensions grow super-logarithmically in n. Benchmark experiments reveal that WildCat generates higher-quality outputs more quickly than five leading attention approximations for image generation and classification. Moreover, it reduces the memory requirements of long-context language models more effectively than five leading key-value cache compression methods.
Coreset construction and reweighting for efficient attention approximation
A randomly pivoted Cholesky algorithm forms the core of WildCat, a new approach to compressing the attention mechanism in neural networks. This algorithm efficiently selects a coreset of r input keys from a sequence of length n in O(nr²) time, enabling a substantial reduction in computational cost.
The selected keys are then optimally reweighted to minimise attention reconstruction error, a process completed in O(nrd) time, where d represents the dimension of the keys and values. Consequently, WildCat achieves near-linear runtime complexity when r is proportional to n^0.2 or smaller. The methodology prioritises spectral accuracy, allowing for super-polynomial error decay for bounded inputs.
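To make the near-linear claim concrete, a back-of-the-envelope operation count (illustrative arithmetic only, not a wall-clock benchmark; the cost models ~n²d and ~nr² + nrd are the asymptotic terms from the text with constants dropped) compares exact attention with WildCat at r = n^0.2:

```python
def wildcat_speedup(n, d=64):
    """Rough operation-count ratio of exact attention (~n^2 * d) to
    WildCat (~n*r^2 + n*r*d) with coreset size r = n**0.2, the regime
    in which the near-linear runtime claim applies."""
    r = max(1, int(n ** 0.2))
    exact_ops = n * n * d
    wildcat_ops = n * r * r + n * r * d
    return exact_ops / wildcat_ops
```

The ratio grows with n: because r grows only like n^0.2, the coreset cost stays near-linear while exact attention grows quadratically, so the advantage widens as sequences get longer.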
WildCat approximates the attention output using this coreset, achieving an error decay of O(n^(−√(log log n))) with a near-constant coreset size. This performance surpasses prior practical approximations that either lack error guarantees or require quadratic runtime for comparable fidelity. The research demonstrates that WildCat can deliver super-polynomial error decay in near-linear time, even when input entries or dimensions grow super-logarithmically in n.
To validate its efficacy, the study incorporated a GPU-optimised PyTorch implementation of WildCat. Benchmark experiments were conducted across image generation, image classification, and long-context language model KV cache compression. These experiments compared WildCat against five leading attention approximations and five KV cache compression methods, respectively.
Results indicated that WildCat generated higher-quality outputs more quickly and reduced memory requirements more effectively than existing techniques, demonstrating its practical benefits across applications. The work also fixes notation, defining the set [n] as {1, 2, …, n} and using standard matrix norms such as the operator norm and the max norm.
Super-polynomial error decay and near-linear runtime for attention compression
WildCat achieves high-accuracy attention compression with a demonstrated error rate that exhibits super-polynomial decay while maintaining near-linear runtime. The research introduces a method that approximates exact attention with error that decreases at a super-polynomial rate, given bounded inputs, and operates in near-linear time.
This is accomplished by attending only over a small, weighted coreset selected using a spectrally accurate subsampling algorithm, randomly pivoted Cholesky, and weighting elements optimally to minimise reconstruction error. The core of the work, Algorithm 4 (WILDCAT), delivers an approximation guarantee in which the expected maximum error, E‖O − Ô_r‖_max, is bounded by 3‖V‖_max · n^(−a), provided the rank parameter r meets the threshold 1 + (1/√π) · n^((σ+δ)·Ent(σ/(σ+δ))) · log(n^(2a+σ+3γ)).
Specifically, for B ≥ 1 bins, the same result holds with adjusted effective sequence length and rank parameters. The comparisons detailed in Table 1 show that WildCat achieves O(n^(−Ω(log log n))) error decay, surpassing the polynomial decay of methods like Thinformer, BalanceKV, KDEformer, and HyperAttention, all within comparable runtime complexities.
Furthermore, under conditions where βR_Q R_K is in O(log(n)^α), d is negligible compared to log(n), and a(n) grows more slowly than log(log(n)), the research demonstrates that WildCat can achieve error decay of n^(−a(n)) with a rank parameter r remaining within n^(o(1)). This performance is particularly significant as it extends super-polynomial error decay to scenarios where both dimension and entries grow with sequence length. The compressed keys and values, requiring only O(rd) storage, can be integrated into autoregressive models for KV cache compression or used within a custom attention module for non-autoregressive models.
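As a sketch of the KV-cache use case: once a layer's cache has been compressed to r coreset keys and values with associated weights, each decoding step can attend over the compressed cache alone in O(rd) work. The cache layout and function below are hypothetical illustrations, not the paper's interface; folding the coreset log-weights into the logits is an assumed design choice.

```python
import numpy as np

def decode_step(q, cached_keys, cached_values, cached_log_w):
    """One autoregressive decoding step against a compressed KV cache:
    r weighted coreset keys/values in O(r*d) memory instead of the
    full n*d cache.

    q: (d,) query; cached_keys, cached_values: (r, d);
    cached_log_w: (r,) log-weights of the coreset elements.
    """
    scores = cached_keys @ q + cached_log_w   # weights fold into logits
    scores -= scores.max()                    # numerical stability
    p = np.exp(scores)
    p /= p.sum()                              # softmax over the coreset
    return p @ cached_values                  # (d,) attention output
```

With uniform weights (all log-weights zero) this reduces to ordinary softmax attention over the r cached keys, so the weighted coreset is a strict generalisation of simply truncating the cache.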
Coreset construction via Cholesky decomposition enables scalable attention mechanisms
Scientists have developed WildCat, a new method for compressing the attention mechanism used in modern artificial intelligence architectures. Attention mechanisms, while powerful, are computationally expensive due to their quadratic scaling with input sequence length. WildCat addresses this limitation by focusing attention on a small, weighted subset of the input data known as a coreset.
This coreset is selected using a fast and accurate subsampling algorithm based on Cholesky decomposition, with weights optimised to minimise reconstruction error. Remarkably, WildCat achieves super-polynomial error decay while maintaining near-linear runtime for bounded inputs, a significant improvement over existing practical approximations.
The approach supports unbounded entries and super-logarithmic dimension growth, offering greater flexibility than previous near-linear time theories which typically guarantee only polynomial error decay. Furthermore, WildCat facilitates substantial reductions in the memory footprint of KV caches, enabling more efficient long-context inference.
The authors acknowledge that the method’s performance relies on certain assumptions regarding the input data and model parameters. Future research could explore applications of this compression technique to a wider range of models and datasets, as well as investigate methods for adapting the algorithm to handle more complex input distributions.
👉 More information
🗞 WildCat: Near-Linear Attention in Theory and Practice
🧠 ArXiv: https://arxiv.org/abs/2602.10056
