AI Security: Data Poisoning Attacks Evade Defenses

Researchers are increasingly concerned about the vulnerability of machine learning models to data-poisoning attacks, which can compromise model integrity without immediate detection. Diego Granziol of the Mathematical Institute, University of Oxford, together with collaborators, demonstrates a fundamental geometric mechanism underlying the success of these attacks. Their work proves that carefully clustered data poisons create a measurable spike in input curvature whose magnitude correlates directly with attack efficacy. Significantly, the team identifies a regime in which attacks remain potent yet become spectrally invisible, revealing an unavoidable trade-off between model safety and performance. The research provides the first comprehensive characterisation of data poisoning, its detectability, and effective defence strategies through analysis of input curvature, offering crucial insights for building more robust machine learning systems.

This work demonstrates that clustered dirty-label poisons induce a specific pattern, a rank-one spike, in the input Hessian of a model, with the magnitude of this spike scaling quadratically with the attack’s effectiveness.

Undetectable attacks survive in near-clone regime

Using kernel ridge regression as an accurate model for wide neural networks, researchers have proven that, crucially, a “near-clone regime” exists where attacks remain potent even as the induced input curvature vanishes, rendering them spectrally undetectable. This discovery establishes when backdoors become inherently invisible to standard detection methods, offering a new understanding of the interplay between attack design and model vulnerability.
Further investigation reveals that regularising the input gradient contracts poison-aligned Fisher and Hessian eigenmodes, creating an unavoidable trade-off between safety and efficacy by limiting the model's data-fitting capacity. For exponential kernels, this regularisation acts as an anisotropic high-pass penalty: it weights high-frequency modes most heavily, which effectively increases the kernel length scale and suppresses these near-clone poisons.

Extensive experiments conducted on both linear models and deep convolutional networks, utilising datasets such as MNIST, CIFAR-10, and CIFAR-100, validate the theoretical findings. These experiments consistently demonstrate a lag between attack success and spectral visibility, confirming the predictive power of the curvature-based analysis.

Defence strategies combined suppress poisoning attacks

The combined application of regularisation and data augmentation proves effective in suppressing poisoning attacks, whereas data augmentation alone is insufficient. This research provides the first end-to-end characterisation of data poisoning, its detectability, and effective defence strategies through the lens of input-space curvature.

Kernel ridge regression and geometric analysis of data poisoning effects reveal vulnerabilities in machine learning models

Geometric analysis defines model behaviour and vulnerabilities

Kernel ridge regression serves as the foundational model for this work, enabling a detailed analysis of the geometric mechanisms underlying data poisoning attacks. Researchers employed this technique to derive closed-form laws governing the impact of duplicated dirty-label poisons on the score, input Hessian, and input Fisher information.

The study began by defining a kernel ridge regression predictor, f(x), from a set of training samples {(xᵢ, yᵢ)} and a positive-definite kernel k, with predictions computed as a weighted sum of kernel evaluations, f(x) = Σᵢ αᵢ k(x, xᵢ). A ridge parameter λ regularised the model and prevented overfitting, with the coefficients given by α = (K + nλI)⁻¹y, where K is the kernel Gram matrix.
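To make this setup concrete, the following is a minimal NumPy sketch of the predictor described above; the Gaussian form of the kernel, the length scale, and the ridge value are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def gaussian_kernel(X, Z, length_scale=1.0):
    """Pairwise k(x, z) = exp(-||x - z||^2 / (2 l^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

def fit_krr(X, y, lam=1e-3, length_scale=1.0):
    """Coefficients alpha = (K + n*lam*I)^{-1} y of the ridge predictor."""
    n = len(X)
    K = gaussian_kernel(X, X, length_scale)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(X_train, alpha, X_test, length_scale=1.0):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return gaussian_kernel(X_test, X_train, length_scale) @ alpha
```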

Gradient calculations were then performed to characterise the model's sensitivity to input changes, specifically computing ∇ₓf(x) and ∇²ₓL(x, y) to understand how the prediction function and loss landscape respond to perturbations. A cloned-poison model was introduced, assuming a cluster of poisoned samples located at a trigger point ζ with label yₜ, and the researchers fixed a trigger point x₀ to analyse the impact of the poison block.
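Building on the sketch above, a toy version of the cloned-poison model and the closed-form input gradient of the Gaussian-kernel predictor might look as follows; the dataset, m, ζ, yₜ, and the trigger offset r are made-up illustrative values.

```python
rng = np.random.default_rng(0)
d = 10
X_clean = rng.normal(size=(200, d))
y_clean = np.sign(X_clean[:, 0])                  # toy labels in {-1, +1}

# Cloned-poison model: m duplicated samples at zeta, all labelled y_t;
# the trigger x0 sits at distance r from the poison cluster (near-clone setup).
m, y_t, r = 50, 1.0, 0.2
zeta = rng.normal(size=d)
u = rng.normal(size=d)
u /= np.linalg.norm(u)                            # unit offset direction
x0 = zeta + r * u
X = np.vstack([X_clean, np.tile(zeta, (m, 1))])
y = np.concatenate([y_clean, np.full(m, y_t)])
alpha = fit_krr(X, y)

def input_gradient(X_train, alpha, x, length_scale=1.0):
    """grad_x f(x) = sum_i alpha_i k(x, x_i) (x_i - x) / l^2 (Gaussian kernel)."""
    diffs = X_train - x                           # rows are (x_i - x)
    k = np.exp(-(diffs ** 2).sum(-1) / (2 * length_scale ** 2))
    return (alpha * k) @ diffs / length_scale ** 2
```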

The scalar gain S(m; λ) was defined to quantify the aggregate effect of the poison on the model's predictions, and Lemma 3.2 established that the aggregate poison gain equals yₜ·S(m; λ) under specific assumptions. Theorem 3.3 then demonstrated that the change in prediction at the trigger point, Δf(x₀), scales linearly with the number of poisoned samples m and the kernel value at the trigger, k₀ = k(x₀, ζ), while Theorem 3.4 revealed a rank-one spike in the input Hessian, Λ_GN(x₀), whose magnitude scales quadratically with attack efficacy.

This spike is quantified by the Gauss-Newton term ∥∇ₓf(x₀)∥² and is linked to efficacy through the spike-efficacy law. Notably, the study identified a near-clone regime for exponential kernels in which poison efficacy remains order one while the induced input curvature vanishes, rendering the attack spectrally undetectable.
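In code, the Gauss-Newton block of the input Hessian at the trigger is the rank-one outer product of this input gradient. The sketch below, reusing the variables defined earlier, assumes a squared loss so that the Gauss-Newton term reduces exactly to ∇ₓf∇ₓfᵀ.

```python
# Rank-one Gauss-Newton block of the input Hessian at the trigger x0
g = input_gradient(X, alpha, x0)                 # grad_x f at the trigger
H_gn = np.outer(g, g)                            # top eigenvalue is ||g||^2

eigvals, eigvecs = np.linalg.eigh(H_gn)
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

# Diagnostic from the text: for strong poisoning, the top eigenvector of the
# input Hessian should align with the poison direction g / ||g||.
alignment = abs(top_vec @ (g / np.linalg.norm(g)))
print(f"spike = {top_val:.3e}, alignment = {alignment:.3f}")
```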

Further analysis focused on the exponential kernel, for which ∥∇ₓk(x₀, ζ)∥² = (r²/l⁴)·k₀², where r = ∥x₀ − ζ∥ and l is the kernel length scale. Corollary 3.7 demonstrated that in the near-clone regime, where ∥x₀ − ζ∥ ≪ l, poison efficacy remains essentially constant while the induced input curvature diminishes quadratically. Experimental validation, using principal component analysis on CIFAR-10 data, confirmed the assumption that poisons become near-clones in feature space, supporting the theoretical findings and demonstrating feature collapse.
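A self-contained numerical check of this scaling (Gaussian kernel assumed): as r shrinks, the kernel value k₀, and with it poison efficacy, stays order one, while the curvature factor (r²/l⁴)k₀² vanishes quadratically.

```python
import numpy as np

l = 1.0
for r in [1.0, 0.3, 0.1, 0.03]:
    k0 = np.exp(-r ** 2 / (2 * l ** 2))          # kernel value at distance r
    curvature = (r ** 2 / l ** 4) * k0 ** 2      # ||grad_x k(x0, zeta)||^2
    print(f"r = {r:4.2f}   k0 = {k0:.4f}   curvature = {curvature:.2e}")
```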

Hessian spectral analysis characterises backdoor poisoning attack efficacy and detectability through curvature properties

Kernel ridge regression modelling of wide neural networks reveals that clustered dirty-label poisons induce a rank-one spike in the input Hessian, with magnitude scaling quadratically with attack efficacy. Specifically, the research demonstrates that for sufficiently strong backdoor data poisoning, the top eigenvector of the input Hessian aligns with the poison direction, providing a diagnostic for detectability.

Analysis of the exponential kernel identifies a near-clone regime where poison efficacy remains order one, while the induced input curvature vanishes, rendering the attack spectrally undetectable. The study establishes a precise relationship between efficacy and input curvature, showing that efficacy grows linearly with the number of poisoned samples, while curvature increases quadratically.
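Reusing the toy helpers from the earlier sketches, this scaling can be probed empirically: the prediction shift Δf(x₀) grows roughly linearly in m (for m small relative to nλ, where the toy ridge is set large to stay in that regime), while the Gauss-Newton spike ∥∇ₓf(x₀)∥² grows roughly quadratically. This is toy data, not the paper's experiments.

```python
alpha_clean = fit_krr(X_clean, y_clean, lam=1.0)
f_clean = predict(X_clean, alpha_clean, x0[None])[0]

for m in [10, 20, 40, 80]:
    Xp = np.vstack([X_clean, np.tile(zeta, (m, 1))])
    yp = np.concatenate([y_clean, np.full(m, y_t)])
    ap = fit_krr(Xp, yp, lam=1.0)                # large ridge keeps m well below n*lam
    delta_f = predict(Xp, ap, x0[None])[0] - f_clean          # efficacy, ~ m
    spike = np.linalg.norm(input_gradient(Xp, ap, x0)) ** 2   # curvature, ~ m^2
    print(f"m = {m:3d}   delta_f = {delta_f:+.4f}   spike = {spike:.3e}")
```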

For the exponential kernel, the research calculates the Gauss-Newton spike factor as (r²/l⁴)·Δf(x₀)², where r is the distance between the trigger point and the poisoned samples and l is the kernel length scale. In the near-clone regime, defined by r/l ≪ 1, the induced input curvature diminishes, allowing effective poisoning without spectral visibility.

Furthermore, the work proves that adding a term proportional to the square of the input gradient of the loss function demonstrably reduces the impact and efficacy of backdoor data poisoning, albeit at the cost of reduced data-fitting capacity. For the exponential kernel, this regularisation is interpreted as anisotropic damping, providing quadratic suppression of high-frequency modes.
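For deep networks, the natural implementation of this defence is a double-backprop penalty on the input gradient of the loss. The PyTorch sketch below is a hypothetical rendering of that idea; the penalty weight beta and the squared-norm form follow the description above rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def loss_with_input_gradient_penalty(model, x, y, beta=0.1):
    """Task loss plus beta * ||grad_x L||^2, with create_graph=True so the
    penalty itself is differentiable (double backpropagation)."""
    x = x.clone().requires_grad_(True)
    task_loss = F.cross_entropy(model(x), y)
    (grad_x,) = torch.autograd.grad(task_loss, x, create_graph=True)
    penalty = grad_x.pow(2).flatten(1).sum(dim=1).mean()
    return task_loss + beta * penalty
```

Minimising this objective contracts the poison-aligned curvature directions at the cost of some data-fitting capacity, which is exactly the safety-efficacy trade-off described above.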

Empirical validation across linear models and convolutional neural networks on MNIST, CIFAR-10, and CIFAR-100 consistently demonstrates a lag between attack success and spectral visibility. Joint application of regularisation and data augmentation effectively suppresses poisoning, whereas data augmentation alone does not. Increasing training duration further improves the safety-efficacy frontier.

Rank-one spikes in input Hessian reveal vulnerability to data poisoning attacks

Researchers have established a connection between the geometry of input space and the success of backdoor and data poisoning attacks on machine learning models. Using kernel ridge regression to model wide neural networks, they demonstrated that concentrated, maliciously labelled data introduces a specific pattern, a rank-one spike, in the input Hessian, with the size of this spike directly correlating with the attack’s effectiveness.

This geometric mechanism explains why existing spectral and optimisation-based defences often fail to detect these attacks. Importantly, the study identified a ‘near-clone’ regime where attacks remain highly effective despite inducing minimal curvature in the input space, rendering them undetectable by standard spectral methods.

Further investigation revealed that input gradient regularisation, a technique used to improve model robustness, works by suppressing the Fisher and Hessian eigenmodes aligned with the poison, but inevitably reduces the model's overall data-fitting capacity. For exponential kernels, this regularisation penalises high-frequency modes most heavily, effectively increasing the kernel length scale and diminishing the impact of near-clone poisons.

Experiments on both linear models and convolutional neural networks, using datasets such as MNIST and CIFAR-10/100, confirmed these theoretical findings, showing a consistent lag between attack success and spectral visibility and demonstrating the combined effectiveness of regularisation and data augmentation. The findings clarify the conditions under which backdoors become inherently invisible to current detection methods and highlight the limitations of relying solely on post-hoc detection techniques.

The research demonstrates an unavoidable trade-off between safety and efficacy, meaning that defending against data poisoning necessarily involves a reduction in the model’s expressive power. Future work could focus on exploring methods to mitigate this trade-off or develop novel defence strategies that address the geometric vulnerabilities identified in this study, providing a more robust foundation for analysing attacks and defences in overparameterised models.

👉 More information
🗞 Safety-Efficacy Trade Off: Robustness against Data-Poisoning
🧠 ArXiv: https://arxiv.org/abs/2602.00822
Muhammad Rohail T.
