Feature-Space Smoothing Achieves Certified Robustness for Multimodal Large Language Models

Multimodal large language models (MLLMs), despite their impressive performance, are susceptible to adversarial attacks that manipulate their internal feature representations and lead to incorrect outputs. Researchers Song Xia, Meiwen Ding, Chenqi Kong and Xudong Jiang of Nanyang Technological University, together with Wenhan Yang of Peng Cheng Laboratory, have now developed a novel method, Feature-space Smoothing (FS), to provably enhance the robustness of these models. Their work establishes a theoretical framework guaranteeing certified robustness on feature representations, ensuring a minimum level of similarity between clean and maliciously altered data, a significant step towards trustworthy AI systems. By introducing the Purifier and Smoothness Mapper (PSM) module, the team further boosts this robustness without requiring costly retraining, demonstrably reducing attack success rates from almost 90% to a mere 1% across a range of MLLMs and tasks.

Certified Robustness via Feature-space Smoothing offers provable guarantees

Scientists have developed a novel defence mechanism, Feature-space Smoothing (FS), to bolster the robustness of Multimodal Large Language Models (MLLMs) against adversarial attacks. These models, like GPT-5, Gemini 2.5, and Claude 3.7 Sonnet, are increasingly vital across numerous applications, yet remain susceptible to subtle input manipulations that can drastically alter their predictions. The research team rigorously proved that FS provides certified robustness for feature representations within MLLMs, guaranteeing a certified lower bound on the cosine similarity between clean and adversarially perturbed features under l2-norm bounded attacks. This innovative approach transforms any existing feature encoder into a smoothed variant, offering a quantifiable level of protection against malicious inputs.
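At its core, feature-space smoothing replaces the encoder's output with its expectation under Gaussian input noise, estimated by Monte Carlo sampling. The numpy sketch below illustrates the idea with a toy tanh encoder standing in for a real MLLM feature extractor; the encoder, noise level, and sample count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def smoothed_encoder(f, x, sigma=0.25, n_samples=64, seed=0):
    """Monte Carlo estimate of the smoothed encoder
    g(x) = E_{delta ~ N(0, sigma^2 I)}[f(x + delta)]."""
    rng = np.random.default_rng(seed)
    feats = np.stack([f(x + rng.normal(0.0, sigma, size=x.shape))
                      for _ in range(n_samples)])
    return feats.mean(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for a real feature encoder: a fixed random linear map + tanh.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
f = lambda x: np.tanh(W @ x)

x = rng.normal(size=16)
x_adv = x + 0.1 * rng.normal(size=16)   # small l2-bounded perturbation

g_clean = smoothed_encoder(f, x)
g_adv = smoothed_encoder(f, x_adv)
sim = cosine_similarity(g_clean, g_adv)
```

Under a small l2 perturbation, the cosine similarity between the smoothed clean and smoothed adversarial features stays high; the paper's contribution is a *certified* lower bound on exactly this quantity.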

Specifically, the team’s method centres on enhancing the Feature Cosine Similarity Bound (FCSB), a key metric determining the robustness of the smoothed encoder. They discovered that the FCSB is directly linked to the Gaussian robustness score of the original encoder, which measures prediction consistency under Gaussian noise. Recognising that existing MLLMs often exhibit limited Gaussian robustness, the researchers introduced the Purifier and Smoothness Mapper (PSM), a plug-and-play module designed to improve this score without requiring extensive retraining of the MLLM itself. The PSM comprises a guided diffusion model for denoising Gaussian perturbations and a noise-aware residual network for refining feature distributions, working synergistically to enhance certified robustness.
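The Gaussian robustness score is described as measuring prediction consistency under Gaussian noise; a simple illustrative proxy is the average cosine similarity between the clean feature and features of noise-perturbed inputs. The numpy sketch below implements that proxy with a toy encoder (the paper's exact score definition may differ):

```python
import numpy as np

def gaussian_robustness_score(f, x, sigma=0.25, n_samples=64, seed=0):
    """Illustrative proxy for a Gaussian robustness score: the mean cosine
    similarity between the clean feature f(x) and the features of
    Gaussian-noised copies of x. Higher = more consistent under noise."""
    rng = np.random.default_rng(seed)
    clean = f(x)
    clean = clean / np.linalg.norm(clean)
    sims = []
    for _ in range(n_samples):
        noisy = f(x + rng.normal(0.0, sigma, size=x.shape))
        sims.append(float(clean @ (noisy / np.linalg.norm(noisy))))
    return float(np.mean(sims))

# Toy stand-in for the original (unsmoothed) feature encoder.
rng = np.random.default_rng(2)
W = rng.normal(size=(8, 16))
encoder = lambda x: np.tanh(W @ x)

x = rng.normal(size=16)
score = gaussian_robustness_score(encoder, x)
```

A higher score for the original encoder translates, via the paper's theory, into a tighter FCSB for the smoothed encoder, which is why the PSM targets this score directly.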

Experiments demonstrate that integrating FS with PSM significantly outperforms adversarial training, reducing the Attack Success Rate (ASR) of various white-box attacks from approximately 90% to around 1%. This substantial improvement was achieved across diverse MLLMs and downstream tasks, highlighting the effectiveness and generalizability of the proposed method. The team employed a utility-robustness loss function to optimise the PSM, training it on data from various visual domains to maximise Gaussian robustness while preserving feature utility. This work establishes a new benchmark in certified defence for MLLMs, offering a provable guarantee of robustness against adversarial perturbations. By focusing on the feature space rather than direct prediction outputs, the researchers have overcome limitations of previous certified defence approaches and extended their applicability to more complex multimodal tasks. The development of FS-PSM opens exciting possibilities for deploying secure and reliable MLLMs in real-world applications, from autonomous systems to critical decision-making processes.

Feature-space smoothing for MLLM robustness guarantees improved generalization

Scientists developed Feature-space Smoothing (FS), a provable defence method guaranteeing robustness for the feature representations of Multimodal Large Language Models (MLLMs). The research pioneers a technique that transforms any feature encoder into a smoothed variant, ensuring a certified lower bound on the cosine similarity between clean and adversarial representations under l2-norm bounded attacks. Crucially, the study demonstrates that the Feature Cosine Similarity Bound (FCSB) of this smoothed encoder is directly linked to a Gaussian robustness score, quantifying the prediction consistency of the original feature extractor when subjected to Gaussian noise. To overcome limitations in existing Gaussian robustness scores, the team engineered the Purifier and Smoothness Mapper (PSM), a plug-and-play module designed to enhance this score without requiring any retraining of the MLLM itself.

The purifier component, implemented using a pre-trained guided diffusion model, operates before feature extraction to denoise Gaussian perturbations and maximise the Gaussian robustness score. Simultaneously, the smoothness mapper, a noise-aware residual network, refines the extracted features post-extraction, preserving the feature distribution while further boosting the score, all without modifying the core encoder parameters. Experiments employed a utility-robustness loss function to optimise the PSM, training it on data from diverse visual domains to improve Gaussian robustness and maintain feature utility for the encoder. The team rigorously evaluated FS-PSM against state-of-the-art adversarial attacks specifically designed for MLLMs in a white-box setting, comparing its performance to strong adversarial training defences and demonstrating strong protection across various downstream tasks. This approach delivers a certified lower bound on the feature cosine similarity between clean and adversarial representations under epsilon-bounded attacks, theoretically proving the robustness of the smoothed feature encoder.
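Put together, the PSM pipeline denoises the input before the frozen encoder and refines the features after it. The numpy sketch below shows this data flow only; the purifier and smoothness mapper here are trivial stand-ins for the pre-trained guided diffusion model and the noise-aware residual network described above:

```python
import numpy as np

# Toy stand-in for the frozen MLLM feature encoder (not retrained).
rng = np.random.default_rng(5)
W = rng.normal(size=(8, 16))
encoder = lambda x: np.tanh(W @ x)

def purifier(x_noisy):
    """Stand-in for the guided diffusion denoiser: a simple shrinkage
    toward zero that damps additive Gaussian noise."""
    return 0.9 * x_noisy

def smoothness_mapper(feat, sigma):
    """Stand-in for the noise-aware residual network: a small residual
    correction conditioned on the noise level sigma."""
    return feat + sigma * 0.1 * np.tanh(feat)

def psm_features(x, sigma=0.25):
    x_noisy = x + np.random.default_rng(0).normal(0.0, sigma, size=x.shape)
    x_pure = purifier(x_noisy)            # denoise BEFORE feature extraction
    feat = encoder(x_pure)                # frozen encoder, no retraining
    return smoothness_mapper(feat, sigma) # refine AFTER feature extraction

x = rng.normal(size=16)
feat = psm_features(x)
```

The key design point is that both learned components sit outside the encoder, which is why the method is plug-and-play: the MLLM's own parameters are never touched.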

Results demonstrate that FS transforms any feature encoder into a smoothed variant, guaranteeing this certified lower bound on feature cosine similarity. The researchers indicate that the Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the Gaussian robustness score of the original encoder. Building upon this, the introduced PSM acts as a plug-and-play module, enhancing the Gaussian robustness score of MLLMs without requiring any retraining, and thus improving certified robustness under FS. Tests show that the utility-robustness loss used to optimise the PSM, which employs a noise-aware residual network as the smoothness mapper, effectively enhances Gaussian robustness while preserving feature utility for the encoder.
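A plausible reading of the utility-robustness objective is a two-term loss: a utility term keeping PSM-refined clean features close to the original clean features, and a robustness term pulling PSM-refined noisy features toward the clean one. The numpy sketch below encodes that reading; the paper's exact formulation and weighting may differ:

```python
import numpy as np

def utility_robustness_loss(feat_clean, feat_clean_psm, feats_noisy_psm, lam=1.0):
    """Hedged sketch of a combined objective. The utility term penalises
    drift of the PSM-refined clean feature from the original clean feature;
    the robustness term penalises disagreement between PSM-refined noisy
    features and the clean one. `lam` trades the two off (an assumption)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    utility = 1.0 - cos(feat_clean, feat_clean_psm)
    robustness = np.mean([1.0 - cos(feat_clean_psm, fn) for fn in feats_noisy_psm])
    return float(utility + lam * robustness)

# Toy features: the PSM output barely perturbs the clean feature,
# and noisy features sit close to it, so the loss should be small.
rng = np.random.default_rng(3)
clean = rng.normal(size=8)
psm_clean = clean + 0.01 * rng.normal(size=8)
noisy = [clean + 0.1 * rng.normal(size=8) for _ in range(4)]
loss = utility_robustness_loss(clean, psm_clean, noisy)
```

Since both terms are of the form 1 − cosine similarity, the loss is non-negative and is minimised exactly when the PSM preserves the clean feature while collapsing the noisy features onto it, matching the stated goal of boosting Gaussian robustness without sacrificing utility.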

The PSM was trained on data from diverse visual domains to further enhance Gaussian robustness. The work proposes a theoretical framework in which the smoothed encoder maintains a certified lower bound on the feature cosine similarity between clean and adversarial representations, addressing limitations of existing randomised smoothing techniques. The scientists observed that successful adversarial attacks typically induce substantial distortions in the model’s feature representations, so a robust feature encoder that minimises the discrepancy between clean and adversarial features is crucial for trustworthy prediction. Measurements confirm that FS-PSM greatly enhances the adversarial performance of various MLLMs, dramatically reducing the ASR under various white-box attacks.

The team formalised the threat model: for any input x, an adversarial attack seeks a perturbed input x′ that misleads the model by maximising the loss function subject to a norm bound on the perturbation. The research introduces a method to ensure a robust feature encoder that minimises the discrepancy between clean and adversarial features, which is vital for reliable predictions. This advancement offers a pathway towards more trustworthy and secure MLLMs, with potential applications in safety-critical domains.
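The maximisation problem described above is typically solved with projected gradient ascent under an l2 budget: step along the loss gradient, then project back onto the epsilon-ball around the clean input. Below is a minimal numpy sketch using a toy encoder and a feature-distortion objective; the random start, step size, and budget are illustrative choices, not the attacks evaluated in the paper:

```python
import numpy as np

def pgd_l2(loss_grad, x, eps=0.5, step=0.1, n_steps=20, seed=0):
    """Projected gradient ascent: maximise the loss over x' subject to
    ||x' - x||_2 <= eps. `loss_grad(x_adv)` returns dLoss/dx_adv."""
    rng = np.random.default_rng(seed)
    x_adv = x + 0.01 * rng.normal(size=x.shape)  # random start avoids zero grad
    for _ in range(n_steps):
        g = loss_grad(x_adv)
        x_adv = x_adv + step * g / (np.linalg.norm(g) + 1e-12)  # ascent step
        delta = x_adv - x
        d = np.linalg.norm(delta)
        if d > eps:                                # project onto the l2 ball
            x_adv = x + delta * (eps / d)
    return x_adv

# Toy target: push the encoder's feature away from the clean feature,
# i.e. maximise ||f(x') - f(x)||^2 for f(x) = tanh(W x).
rng = np.random.default_rng(4)
W = rng.normal(size=(8, 16))
x = rng.normal(size=16)
f_clean = np.tanh(W @ x)

def loss_grad(x_adv):
    # Chain rule: d||f(x') - f_clean||^2 / dx' = W^T (2 (f' - f_clean) (1 - f'^2))
    f_adv = np.tanh(W @ x_adv)
    return W.T @ (2.0 * (f_adv - f_clean) * (1.0 - f_adv ** 2))

x_adv = pgd_l2(loss_grad, x, eps=0.5)
```

This is exactly the kind of feature-space distortion that FS certifies against: the attack stays inside the l2 ball, yet drives the clean and adversarial features apart.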

Feature-space smoothing for robust multimodal LLMs improves performance

Scientists have developed a new framework, Feature-space Smoothing (FS), to enhance the robustness of multimodal large language models (MLLMs) against adversarial attacks. This research pioneers a feature-space approach to establishing certified robustness, guaranteeing a lower bound on the similarity between clean and perturbed feature representations. The core innovation lies in transforming feature encoders into smoothed variants, providing a theoretical foundation for resilience against attacks, specifically, ensuring a certified level of similarity even when inputs are maliciously altered. Furthermore, researchers introduced the Purifier and Smoothness Mapper (PSM), a plug-and-play module designed to improve the Gaussian robustness score of MLLMs without requiring retraining.

Extensive experimentation across various MLLMs and tasks demonstrated that FS, combined with PSM, significantly reduces the success rate of white-box attacks, decreasing it from nearly 90% to approximately 1%. The authors acknowledge limitations related to the computational cost of certified robustness methods and the specific Gaussian robustness score used, suggesting that exploring alternative scoring functions could be beneficial. Future work could investigate extending this framework to other modalities and exploring its application in safety-critical applications where robustness is paramount.

👉 More information
🗞 Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing
🧠 ArXiv: https://arxiv.org/abs/2601.16200

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
