AI ‘Steering’ Made Far More Precise with New Fine-Grained Technique

Researchers are increasingly focused on efficiently modifying the behaviour of large language models (LLMs), and a new study by Zijian Feng, Tianjiao Li, and Zixiao Zhu of the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore, together with Hanzhang Zhou, Junlang Qian, and Li Zhang, marks a significant advance in this field. Their work shows that current activation steering methods, which operate at the block level, are often imprecise because the activations within those blocks are heterogeneous, so a block-wide intervention manipulates beneficial and detrimental features alike. By introducing AUSteer, a technique that operates at the much finer ‘activation unit’ level, the team shows that intervening only on the beneficial components of an LLM can dramatically improve steering performance while reducing the number of activations adjusted. More precise steering, achieved by intervening less, yields demonstrably superior results across multiple LLMs and tasks.

Heterogeneity of block activations limits precise large language model control

Scientists have uncovered a fundamental limitation in current methods of controlling large language models (LLMs): intervening at the block level, i.e. modifying bundled activations within attention heads or feedforward networks, is inherently inefficient. The work demonstrates that block-level activations are surprisingly heterogeneous, containing a mixture of beneficial, irrelevant, and even harmful features.
Consequently, steering at this coarse granularity inadvertently amplifies unwanted signals, reducing precision and potentially degrading performance. To address this, the researchers decomposed block activations into fine-grained “activation unit” (AU)-level activations, effectively treating each dimension of the block as an independent control point.

Theoretical and empirical analysis confirms that this heterogeneity arises because different dimensions, or AUs, govern distinct aspects of the LLM’s output token distributions. Block-level steering, therefore, inevitably mixes helpful and harmful influences, diminishing its effectiveness. The study establishes that restricting intervention to only the beneficial AUs yields significantly more precise and efficient control.
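The decomposition the study relies on can be illustrated in a few lines of NumPy. The shapes below are toy values, not the paper's: the point is simply that a block output W·a is exactly the sum of per-dimension contributions a[i]·W[:, i], so each dimension is an independent control point.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 16                     # toy sizes, not the paper's
W = rng.normal(size=(d_out, d_in))      # block weight matrix
a = rng.normal(size=d_in)               # block activation vector

# The block output as one matrix-vector product ...
block_out = W @ a

# ... equals the sum of AU-level contributions a[i] * W[:, i],
# so every dimension i can be adjusted on its own.
au_contributions = [a[i] * W[:, i] for i in range(d_in)]
assert np.allclose(block_out, np.sum(au_contributions, axis=0))

# Steering a single AU rescales only that one contribution:
i, delta = 3, 2.0
steered = block_out + delta * W[:, i]   # same as shifting a[i] by delta
```

Block-level steering, by contrast, shifts all d_in contributions at once, which is why it mixes helpful and harmful influences.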

Building on this insight, the team proposes AUSteer, a novel method that operates at the AU level, identifying discriminative AUs using a new metric called activation momentum calculated on contrasting samples. AUSteer then assigns adaptive steering strengths tailored to both the input and the selected AU activations.
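The selection step can be sketched as follows. The score used here, the per-dimension gap between mean activations on contrastive samples, is an illustrative stand-in for the paper's activation-momentum metric, whose exact form may differ; `select_aus` and all shapes are hypothetical names for this sketch.

```python
import numpy as np

def select_aus(pos_acts, neg_acts, k):
    """Rank AUs by a contrastive score and keep the top-k.

    pos_acts, neg_acts: (num_samples, dim) arrays of AU activations
    collected on contrastive inputs (e.g. desirable vs. undesirable
    behaviour). The mean-activation gap below is a stand-in for the
    paper's activation-momentum metric.
    """
    momentum = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    top_k = np.argsort(-np.abs(momentum))[:k]   # largest |gap| first
    return top_k, momentum

rng = np.random.default_rng(1)
dim = 32
pos = rng.normal(size=(100, dim))
neg = rng.normal(size=(100, dim))
pos[:, 5] += 3.0                        # plant one strongly discriminative AU
selected, momentum = select_aus(pos, neg, k=4)
```

Because the score is computed over whole sample sets rather than a single input, it assesses each AU globally, in the spirit the study describes.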

Comprehensive experiments across multiple LLMs and tasks demonstrate that AUSteer consistently outperforms advanced baseline methods while steering a considerably smaller number of activations. This finding underscores a critical principle: steering less, in this case, achieves more. The research highlights a pathway towards more targeted and resource-efficient control of LLMs, potentially enabling more nuanced and effective adaptation for specific tasks and applications.

Decomposition of block activations into activation units and identification of steering dimensions

Activation unit (AU)-level activations form the basis of this research, representing single dimensions of block activations and corresponding to slices of the block weight matrix. The study decomposes block activations, typically from multi-head attention (MHA) or feedforward networks (FFN) within a transformer layer, into these finer-grained AU activations to address limitations of existing block-level steering methods.

This decomposition allows for targeted intervention on individual dimensions rather than bundled vectors, enabling a more precise approach to modifying large language model (LLM) behaviours. To identify discriminative AUs, the work computes activation momenta on contrastive samples. This process globally assesses the impact of each AU, effectively highlighting those most relevant for steering.

Subsequently, adaptive steering strengths are assigned to these selected AU activations, tailoring the intervention to diverse inputs and maximizing effectiveness. This contrasts with prior methods that apply uniform steering across entire blocks, potentially amplifying irrelevant or harmful signals. Experiments involved multiple LLMs and tasks, consistently demonstrating that AUSteer, the proposed AU-level steering method, surpasses advanced baselines.
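The adaptive-strength idea can be sketched as below. The rule shown, moving each selected AU toward the level associated with desirable behaviour, so that the applied strength shrinks when the input already activates that AU appropriately, is an illustrative assumption; AUSteer's actual adaptive rule may differ.

```python
import numpy as np

def adaptive_steer(a, selected, momentum):
    """Steer only the selected AUs with an input-dependent strength.

    Illustrative sketch: each chosen AU is shifted toward its
    contrastive target level, so the strength depends on the input.
    """
    a = a.copy()
    for i in selected:
        strength = momentum[i] - a[i]   # large gap -> strong intervention
        a[i] += strength                # unselected dimensions stay as-is
    return a

# Hypothetical example: 8 AUs, two of them selected for steering.
momentum = np.array([0.0, 0.1, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0])
steered = adaptive_steer(np.zeros(8), selected=[2, 5], momentum=momentum)
```

Note that only the selected dimensions change, which is precisely the contrast with block-level methods that shift every dimension uniformly.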

Crucially, AUSteer achieves superior performance while steering considerably fewer activations than existing techniques. This demonstrates the efficiency of focusing intervention on beneficial AUs, reducing the intervention footprint and minimising potential performance degradation. The research confirms that heterogeneity exists within block activations, with different AUs controlling distinct token distributions in LLM outputs, and that restricting intervention to beneficial AUs yields more precise and effective steering.

Selective dimension steering optimises performance in attention and feedforward networks

Intervention on the 7th attention head at the 27th layer revealed highly heterogeneous outcomes when steering individual activations, with accuracy varying widely depending on the specific dimension manipulated. The original model achieved 70.52% accuracy, while intervention using ITI reached 71.56% and SADI attained 73.70%.

Steering the 84th dimension alone resulted in 74.53% accuracy, exceeding the performance of the baseline, ITI, and SADI. Further refinement through steering only four positively contributing dimensions, termed Positive Combination, yielded even stronger results. Introducing detrimental dimensions into the combination, however, caused a performance drop.

Similar observations were made for FFN blocks, demonstrating that steering less achieves more. The Dimension Sweep experiment, which steers single dimensions rather than whole blocks, sampled one of every four dimensions in attention heads and one of every 100 in FFNs. Analysis of the 7th attention head's output in layer 27 and the FFN output in layer 20 showed that steering an individual dimension could outperform steering the full block.
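A sweep of this kind is straightforward to express in code. The evaluator below is a toy stand-in (it hard-codes the reported numbers for the 7th head of layer 27 as an example), and `dimension_sweep` is a hypothetical helper, not the authors' implementation; in practice the hook would run the model on a benchmark with the chosen dimension shifted.

```python
import numpy as np

def dimension_sweep(eval_accuracy, dim, stride, delta=1.0):
    """Steer one dimension at a time and record task accuracy.

    eval_accuracy(i, delta) is a user-supplied hook that evaluates the
    model with dimension i shifted by delta; sampling every `stride`-th
    dimension mirrors the paper's protocol (every 4th dimension in
    attention heads, every 100th in FFNs).
    """
    return {i: eval_accuracy(i, delta) for i in range(0, dim, stride)}

# Toy stand-in evaluator that pretends dimension 84 helps most,
# echoing the accuracies reported in the text.
toy_scores = {i: 70.52 for i in range(0, 128, 4)}
toy_scores[84] = 74.53
sweep = dimension_sweep(lambda i, d: toy_scores[i], dim=128, stride=4)
best = max(sweep, key=sweep.get)        # the most beneficial sampled AU
```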

For example, steering dimensions 0, 1300, 2300, and 3000 in the FFN block achieved improved performance. Theoretical and empirical analysis confirmed that this heterogeneity arises because different activations, or dimensions, control distinct token distributions in LLM outputs. Steering task-relevant activations promotes the probability of task-specific tokens, while steering irrelevant activations may increase the probability of uninformative or harmful tokens.

Examining the convergence behaviour of AU steering, the researchers observed that different activations govern different output token distributions. Scaling the activation coefficient from 10 to 100,000 and computing the normalized KL divergence between the outputs at each strength and those at 100,000 demonstrated this convergence.
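The convergence check can be reproduced on a toy model. Everything below (unembedding matrix, vocabulary size, the single steered AU) is a made-up stand-in, and raw rather than normalized KL is used for simplicity: as the coefficient grows, the steered direction dominates the logits, so the output distribution stops changing and the divergence to the strongest-steering output shrinks to zero.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL divergence with a small floor so a near-one-hot q stays finite.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(2)
vocab, dim = 50, 16                     # toy sizes, not the paper's
U = rng.normal(size=(vocab, dim))       # toy unembedding matrix
h = rng.normal(size=dim)                # hidden state
direction = np.eye(dim)[3]              # steer a single AU

strengths = [10, 100, 1_000, 10_000, 100_000]
dists = [softmax(U @ (h + s * direction)) for s in strengths]
ref = dists[-1]                         # output distribution at max strength
divergences = [kl(p, ref) for p in dists]
# divergences falls toward 0 as the strength approaches 100,000.
```

The same machinery, applied pairwise to two different AUs, exposes the heterogeneity result: distributions induced by distinct AUs drift apart as strength grows.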

Furthermore, steering the 84th activation promoted the correct answer token “yes” while suppressing “no” in a BoolQ dataset question, improving accuracy. Conversely, steering activations 44 or 100 elevated task-irrelevant or incorrect tokens, degrading performance. Pairwise KL divergence between the 44th and 84th activations increased with strength, indicating that these activations drive the model toward different output distributions. These findings support the conclusion that block-level steering is inefficient due to mixing beneficial and detrimental components, while fine-grained AU-level steering enables selective amplification of useful features.

Activation unit steering refines large language model control

Researchers have developed a new method, AUSteer, for modifying the behaviour of large language models (LLMs) with improved efficiency. Existing techniques typically adjust LLM behaviour at the level of entire activation blocks, but this approach proves imprecise because of the varied and entangled information those blocks contain.

AUSteer instead operates at a much finer granularity, focusing on individual dimensions, termed activation units (AUs), within these blocks. By selectively steering only the most relevant AUs, the method achieves more targeted and effective control over LLM outputs. The core innovation lies in identifying these salient AUs using activation momentum analysis on contrasting examples, and then applying adaptive steering strengths tailored to each input.

Extensive experimentation across multiple LLMs and tasks demonstrates that AUSteer consistently outperforms existing state-of-the-art methods while intervening in significantly fewer activations, confirming the principle that steering less can achieve more. This suggests that a more precise approach to LLM behaviour modification is both possible and beneficial.

The authors acknowledge that their method relies on identifying beneficial AUs, and they explored both promoting these units and suppressing unhelpful ones, finding that promotion consistently yielded better results. Future work could investigate the optimal balance between promotion and suppression strategies.

Further research may also explore the application of this fine-grained steering approach to other areas of LLM control, such as enhancing robustness or mitigating biases, and could extend to multimodal models. These findings establish a clear path toward more efficient and targeted control of large language models, reducing the computational cost and potential intrusiveness of behaviour modification techniques.

👉 More information
🗞 Fine-Grained Activation Steering: Steering Less, Achieving More
🧠 ArXiv: https://arxiv.org/abs/2602.04428

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
