Training-Free Policy Violation Detection via Activation-Space Whitening Enables Robust LLM Alignment

As organizations increasingly integrate large language models into sensitive areas like finance and legal services, ensuring these systems adhere to internal policies becomes paramount. Oren Rachmil, Roy Betser, Itay Gershon, and colleagues at various institutions now present a novel method for detecting policy violations within these models, offering a significant step beyond existing safety filters. Their research tackles the limitations of current approaches, such as slow response times and a lack of transparency, by framing policy violation detection as an out-of-distribution problem. The team demonstrates that a simple transformation of the model’s internal workings, effectively ‘whitening’ the activation space, allows accurate identification of non-compliant outputs without any additional training. The approach achieves state-of-the-art performance on challenging benchmarks and offers a practical framework for responsible AI governance.

Whitening Improves Policy-Guided Dialogue Calibration

Scientists developed a new technique for detecting when large language models (LLMs) depart from specified policies during dialogue generation. By applying whitening to the internal activations of LLMs, researchers aim to create a clearer representation space in which policy violations become more easily detectable. The core idea is that policy-specific boundaries exist within the model’s internal representations, and whitening sharpens these boundaries, making it easier to recognize when a response drifts outside them. The method involves extracting hidden activations and using a calibration dataset containing both policy-compliant and violating examples.
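As a rough sketch of this extraction step, the snippet below pulls a last-token hidden state from every layer of a Hugging Face causal language model for each calibration example. The model checkpoint, the prompt format that prepends the policy text, and the last-token pooling are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper does not prescribe a specific model here.
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def hidden_activations(policy: str, dialogue: str) -> torch.Tensor:
    """Return one hidden-state vector per layer for a (policy, dialogue) pair.

    Shape: (num_layers + 1, hidden_dim) -- the embedding layer plus each block.
    Pooling the final token is an assumption; any pooling over tokens would do.
    """
    text = f"Policy:\n{policy}\n\nDialogue:\n{dialogue}"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return torch.stack([h[0, -1, :] for h in out.hidden_states]).float().cpu()

# A tiny calibration set: illustrative compliant and violating examples.
policy = "The assistant must never give personalized financial advice."
compliant = [(policy, "User: What is an index fund?\nAssistant: A fund that tracks a market index.")]
violating = [(policy, "User: Should I buy this stock?\nAssistant: Yes, move your savings into it now.")]

in_policy_acts  = torch.stack([hidden_activations(p, d) for p, d in compliant])
out_policy_acts = torch.stack([hidden_activations(p, d) for p, d in violating])
```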

Researchers calculate the mean and covariance of in-policy activations and decompose the covariance matrix to construct a whitening transformation. By identifying the layer that maximizes the separation between in-policy and out-of-policy samples, they pinpoint the most effective point in the network for detecting violations. A threshold is then applied to a distance metric, such as the Mahalanobis distance, to decide whether a response is out-of-policy. Results demonstrate that this whitening-based approach substantially improves detection of policy violations, achieving higher accuracy on benchmark tests than existing methods. Applying whitening independently for each policy category further enhances performance, suggesting that different policies are encoded in distinct regions of the model’s internal representation. This research offers a practical and effective route to more reliable and trustworthy dialogue systems.
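The calibration stage can be sketched roughly as follows. The eigendecomposition-based (ZCA-style) whitening, the use of AUROC as the layer-selection criterion, and the function names are plausible instantiations for illustration, not the authors' exact implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_whitening(in_policy: np.ndarray, eps: float = 1e-5):
    """Fit a whitening transform from one layer's in-policy activations.

    in_policy has shape (n_samples, hidden_dim). Returns (mu, W) such that
    W @ (x - mu) has roughly identity covariance on the calibration data.
    """
    mu = in_policy.mean(axis=0)
    cov = np.cov(in_policy - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                        # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 0.0) + eps)) @ vecs.T
    return mu, W

def mahalanobis_scores(acts: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Mahalanobis distance == Euclidean norm of the whitened, centered activation."""
    return np.linalg.norm((acts - mu) @ W.T, axis=1)

def select_layer(in_by_layer, out_by_layer):
    """Pick the layer whose scores best separate in- from out-of-policy samples.

    AUROC is used as the separation measure; this is an illustrative choice.
    """
    best = (-1.0, None, None)                               # (auc, layer, (mu, W))
    for layer, (in_a, out_a) in enumerate(zip(in_by_layer, out_by_layer)):
        mu, W = fit_whitening(in_a)
        scores = np.concatenate([mahalanobis_scores(in_a, mu, W),
                                 mahalanobis_scores(out_a, mu, W)])
        labels = np.concatenate([np.zeros(len(in_a)), np.ones(len(out_a))])
        auc = roc_auc_score(labels, scores)
        if auc > best[0]:
            best = (auc, layer, (mu, W))
    return best

# Per-layer arrays can be sliced from the stacked activations in the earlier
# snippet, e.g. in_by_layer = [in_policy_acts[:, l, :].numpy() for l in range(num_layers)].
```

Per-policy whitening, as mentioned above, would simply repeat this fitting procedure separately on the calibration activations of each policy category.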

Hidden Activations Detect Policy Violations

Scientists developed a novel method for detecting policy violations in large language models (LLMs) without requiring model fine-tuning or external evaluators. This research addresses the challenge of ensuring LLMs comply with complex organizational policies, a critical need as these models are increasingly deployed in sensitive domains. Researchers treat policy violation detection as an out-of-distribution (OOD) problem, inspired by whitening techniques commonly used in image processing, and center their analysis on the hidden activations within the LLM. The team established an in-distribution manifold representing policy-compliant user-LLM interactions and applied a data-driven whitening transform to these activations, standardizing the features so that their covariance is approximately the identity.
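To make the near-identity covariance property concrete, the short check below fits a ZCA-style whitening matrix on synthetic stand-in activations and verifies that the whitened features have an empirical covariance close to the identity. Both the synthetic data and the ZCA construction are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for in-policy activations at one layer: correlated features.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 64)) @ rng.normal(size=(64, 64))   # (n_samples, hidden_dim)

mu = raw.mean(axis=0)
cov = np.cov(raw - mu, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 0.0) + 1e-5)) @ vecs.T   # whitening matrix

whitened = (raw - mu) @ W.T
emp_cov = np.cov(whitened, rowvar=False)

# On the calibration data, the whitened covariance should be close to identity.
print("max |cov - I|:", np.abs(emp_cov - np.eye(emp_cov.shape[0])).max())
```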

In this whitened space, policy compliance is quantified by the Euclidean norm of the whitened activation vector, giving a single compliance score for each interaction. To implement the method, scientists collect only a small number of illustrative samples together with the organization’s policy text, avoiding the need for extensive training data. At runtime, the team computes the compliance score on a selected operational layer and compares it to a calibrated threshold, flagging responses that exceed it as potential policy violations. This approach enables real-time monitoring and large-scale deployment, offering a flexible, low-overhead solution for continuous policy updates and oversight.
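A minimal sketch of this runtime step follows; the 95th-percentile calibration rule and the synthetic stand-in data are assumptions made only so the snippet runs on its own.

```python
import numpy as np

def compliance_score(activation: np.ndarray, mu: np.ndarray, W: np.ndarray) -> float:
    """Euclidean norm of the whitened activation; larger means further out-of-policy."""
    return float(np.linalg.norm(W @ (activation - mu)))

def calibrate_threshold(in_policy_scores: np.ndarray, quantile: float = 0.95) -> float:
    """One simple rule: flag anything above a high quantile of in-policy scores.

    The 95th-percentile default is an illustrative choice, not a value from the paper.
    """
    return float(np.quantile(in_policy_scores, quantile))

# Synthetic stand-ins so the snippet is self-contained; in practice mu and W come
# from the fitted whitening transform and the scores from calibration activations.
rng = np.random.default_rng(1)
dim = 64
mu, W = np.zeros(dim), np.eye(dim)
calib_acts = rng.normal(size=(200, dim))
new_act = 3.0 * rng.normal(size=dim)          # an unusually far-out activation

calib_scores = np.array([compliance_score(a, mu, W) for a in calib_acts])
threshold = calibrate_threshold(calib_scores)
print("violation flagged:", compliance_score(new_act, mu, W) > threshold)
```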

Hidden Activation Whitening Detects Policy Violations

Scientists developed a novel method for detecting policy violations in large language models (LLMs) without requiring any additional training or fine-tuning of the models themselves. This work addresses the critical need for organizations to ensure LLMs comply with internal policies and external regulations, particularly in sensitive domains. The team framed policy compliance as an out-of-distribution (OOD) detection problem, analyzing hidden activations within the LLM to identify deviations from expected behavior. Inspired by image processing techniques, researchers applied a whitening transformation to the model’s hidden activations, effectively decorrelating and standardizing the data to create a near-identity covariance matrix.

This transformation centers in-policy activations near the origin in the whitened space, while policy violations shift outward, creating a clear distinction between compliant and non-compliant responses. Compliance is then scored by calculating the Euclidean norm of the whitened activation vector, providing a single, interpretable metric for assessing policy adherence. Experiments on the challenging DynaBench policy dataset demonstrate the effectiveness of this approach, achieving state-of-the-art results and outperforming both an LLM-as-a-judge and a fine-tuned 8B DynaGuard model by up to 9%. Results show that in the whitened space, in-policy samples cluster closely around the origin with lower norms, while out-of-policy samples exhibit higher norms, enabling clear separation of compliant and violating responses.
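The separation described here can be illustrated with synthetic whitened activations: in-policy points behave roughly like standard normal noise, while a shifted group stands in for violations. The data and the 0.8 shift are purely illustrative, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 64

# After whitening, in-policy activations look roughly like standard normal noise,
# so their norms concentrate near sqrt(dim); this synthetic data mimics that.
in_whitened = rng.normal(size=(300, dim))
# Out-of-policy activations are shifted away from the in-policy mean, so their
# whitened norms are systematically larger (an illustrative shift, not real data).
out_whitened = rng.normal(size=(300, dim)) + 0.8

in_norms = np.linalg.norm(in_whitened, axis=1)
out_norms = np.linalg.norm(out_whitened, axis=1)

print(f"in-policy  norms: mean={in_norms.mean():.2f}")
print(f"out-policy norms: mean={out_norms.mean():.2f}")
# Rank-based separation (equivalent to AUROC computed from pairwise comparisons).
print("separation:", (out_norms[:, None] > in_norms[None, :]).mean())
```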

Whitening Activations Detect Policy Violations

This work introduces a framework for detecting policy violations when deploying large language models, marking a practical step toward trustworthy AI governance. Researchers addressed the limitations of existing content moderation techniques by framing the problem as out-of-distribution detection rather than relying on fine-tuning or external judges. The method whitens hidden activations within the language model to decorrelate and standardize the representations, then scores compliance using the Euclidean norm.
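To tie the pieces together, the sketch below packages the fitting and scoring steps into one small detector object; the interface, defaults, and synthetic usage are illustrative and not the authors' released code.

```python
import numpy as np

class WhiteningPolicyDetector:
    """Training-free sketch: fit a whitening transform on in-policy activations
    from one chosen layer, then score new activations by their Euclidean norm in
    the whitened space and flag those above a calibrated threshold."""

    def __init__(self, quantile: float = 0.95, eps: float = 1e-5):
        self.quantile, self.eps = quantile, eps

    def fit(self, in_policy_acts: np.ndarray) -> "WhiteningPolicyDetector":
        self.mu = in_policy_acts.mean(axis=0)
        cov = np.cov(in_policy_acts - self.mu, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        self.W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 0.0) + self.eps)) @ vecs.T
        self.threshold = float(np.quantile(self.score(in_policy_acts), self.quantile))
        return self

    def score(self, acts: np.ndarray) -> np.ndarray:
        return np.linalg.norm((acts - self.mu) @ self.W.T, axis=1)

    def is_violation(self, acts: np.ndarray) -> np.ndarray:
        return self.score(acts) > self.threshold

# Illustrative usage with synthetic activations (real ones would come from the LLM).
rng = np.random.default_rng(3)
detector = WhiteningPolicyDetector().fit(rng.normal(size=(400, 32)))
print(detector.is_violation(1.5 + rng.normal(size=(5, 32))))   # shifted points: likely all True
```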

👉 More information
🗞 Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs
🧠 ArXiv: https://arxiv.org/abs/2512.03994

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
