Researchers are tackling the challenge of reliably manipulating the outputs of powerful, closed-source multimodal large language models (MLLMs). Hui Lu, Yi Yu, and Yiming Yang from Nanyang Technological University, alongside Chenyu Yi, Xueyi Ke, and Qixing Zhang, present a new approach to crafting ‘universal’ adversarial perturbations: subtle image changes that consistently force these models to identify a specified target, regardless of the input image or the specific commercial MLLM being used. The work is significant because current attack methods are typically tailored to individual images, which limits their practical application; MCRMO-Attack overcomes the key difficulties of achieving universality, improving attack success rates by up to 23.7 percentage points on models such as GPT-4o and Gemini-2.0 compared with existing techniques.
Overcoming challenges in crafting reusable adversarial attacks on multimodal AI models
Scientists have demonstrated a novel method for crafting universal adversarial perturbations against closed-source multimodal large language models (MLLMs), achieving a significant breakthrough in the field of AI safety. The research introduces MCRMO-Attack, a technique designed to consistently steer arbitrary inputs towards a specified target across unknown commercial MLLMs, representing a more stringent challenge than previous sample-specific attacks.
This work addresses the limitations of existing methods, which often struggle with reusability across different inputs and lack generalisation to unseen images. The team achieved this by tackling three core difficulties inherent in universal attacks: high-variance target supervision due to random cropping, unreliable token-wise matching caused by the suppression of image-specific cues, and the initialisation sensitivity of few-source per-target adaptation.
MCRMO-Attack stabilises supervision through Multi-Crop Aggregation with an Attention-Guided Crop, effectively capturing consistent target characteristics and mitigating the impact of randomisation. Furthermore, the research improves token-level reliability via alignability-gated Token Routing, selectively emphasising informative tokens and preventing spurious supervision.
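The crop-aggregation idea can be illustrated with a toy sketch. All function names here are hypothetical, and pixel magnitude stands in for a real attention map, so this is not the paper's implementation: several random crops of the target image, plus one crop taken where the saliency proxy is strongest, are embedded and averaged so the target supervision varies less from step to step.

```python
import numpy as np

def random_crop(img, size, rng):
    """Take a random square crop of side `size` from a (H, W) image."""
    h, w = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def attention_guided_crop(img, size):
    """Pick the crop window with the highest total saliency.
    Pixel magnitude is a toy stand-in for an attention map."""
    h, w = img.shape
    best, best_score = None, -np.inf
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            window = img[y:y + size, x:x + size]
            score = np.abs(window).sum()
            if score > best_score:
                best, best_score = window, score
    return best

def embed(crop):
    """Toy 'encoder': mean-pool rows into a fixed-length feature vector."""
    return crop.mean(axis=1)

def aggregated_target_embedding(target_img, n_crops=8, size=4, seed=0):
    """Average embeddings over several random crops plus one
    attention-guided crop to reduce supervision variance."""
    rng = np.random.default_rng(seed)
    crops = [random_crop(target_img, size, rng) for _ in range(n_crops)]
    crops.append(attention_guided_crop(target_img, size))
    return np.mean([embed(c) for c in crops], axis=0)
```

Averaging over many crops damps the high variance that a single random target crop would otherwise inject into the optimisation signal.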
Experiments show a substantial boost in attack success rates across commercial MLLMs, with a +23.7% improvement on GPT-4o and a +19.9% improvement on Gemini-2.0 compared to the strongest existing universal baseline. The study also unveils a cross-target perturbation prior learned through meta-learning, which yields stronger per-target solutions and matches the performance of optimisation from scratch with far fewer iterations (50 steps with meta-initialisation versus 300 without).
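The 50-versus-300-step gap reflects what a meta-learned initialisation buys in general. A minimal Reptile-style sketch on toy quadratic objectives, which are hypothetical stand-ins for per-target losses rather than the paper's setup, shows why a cross-target prior adapts faster than starting from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
center = rng.normal(size=16)                                     # shared structure across targets
optima = [center + 0.1 * rng.normal(size=16) for _ in range(5)]  # per-target optima

def loss(delta, opt):
    """Toy per-target objective: squared distance to that target's optimum."""
    return np.sum((delta - opt) ** 2)

def inner_steps(delta, opt, steps, lr=0.2):
    """A few gradient-descent steps on one target's loss."""
    for _ in range(steps):
        delta = delta - lr * 2 * (delta - opt)   # gradient of the quadratic loss
        delta = np.clip(delta, -8, 8)            # epsilon-ball projection (toy bound)
    return delta

# Reptile-style meta-learning of a cross-target prior
meta = np.zeros(16)
for it in range(100):
    opt = optima[it % len(optima)]
    adapted = inner_steps(meta.copy(), opt, steps=5)
    meta = meta + 0.5 * (adapted - meta)         # move the prior toward the adapted solution

# Adapting to a new target from the prior needs far fewer steps than from scratch
new_opt = center + 0.1 * rng.normal(size=16)
from_prior = inner_steps(meta.copy(), new_opt, steps=5)
from_zero = inner_steps(np.zeros(16), new_opt, steps=5)
```

Because the per-target optima share a common centre, the learned prior starts close to any new target's solution, so the same small step budget reaches a much lower loss than zero initialisation.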
This breakthrough reveals the potential to consistently induce specific, attacker-chosen incorrect outputs in MLLMs, regardless of the input image. The work opens avenues for enhancing the robustness of these models and mitigating potential vulnerabilities in real-world applications, paving the way for more secure and reliable multimodal AI systems. The research establishes a new benchmark for evaluating the transferability and generalisation capabilities of adversarial attacks on closed-source MLLMs.
Multi-Crop Attention and Alignability-Gated Token Routing for Robust Universal Attacks enhance adversarial transferability
Scientists developed MCRMO-Attack to address limitations in universal targeted transferable adversarial attacks against closed-source multimodal large language models (MLLMs). The research focused on creating a single perturbation applicable to arbitrary inputs, consistently steering them towards a specified target across unknown commercial MLLMs like GPT-4o and Gemini-2.0.
Existing sample-wise attacks were found to be ineffective due to overfitting to local visual patterns and failing to generalise to unseen images, prompting the team to pioneer a more robust universal approach. To stabilise supervision, the study employed Multi-Crop Aggregation with an Attention-Guided Crop.
This technique mitigates high-variance caused by target-crop randomness by aggregating information from multiple crops, weighted by an attention mechanism that prioritises relevant image regions. Researchers then improved token-level reliability using alignability-gated Token Routing, addressing the issue of universality suppressing image-specific cues.
This method explicitly guides the optimisation process by focusing on tokens that are likely to align with the target output, reducing spurious supervision. The team further used meta-learning to obtain a cross-target perturbation prior that yields stronger per-target solutions. Experiments involved optimising perturbations on a source image pool, using random source selection and independent cropping of both source and target images.
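A minimal sketch of an alignability gate follows; the cosine routing and the threshold `tau` are illustrative assumptions, not the paper's exact formulation. Each adversarial token is routed to its best-matching target token, and tokens whose best alignability falls below the gate are excluded from the loss so they cannot inject spurious supervision.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def routed_matching_loss(adv_tokens, tgt_tokens, tau=0.5):
    """Token-wise matching loss with an alignability gate (toy sketch).
    Each adversarial token is routed to its best-matching target token;
    tokens whose best cosine alignability is below `tau` are dropped.
    Returns the mean loss over kept tokens and how many were kept."""
    total, used = 0.0, 0
    for a in adv_tokens:
        best = max(cosine(a, t) for t in tgt_tokens)
        if best >= tau:                  # gate: keep only alignable tokens
            total += 1.0 - best          # pull the token toward its match
            used += 1
    return (total / used if used else 0.0), used
```

The gate is what keeps universality-induced suppression of image-specific cues from turning unmatched tokens into noisy gradient sources.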
The system delivers a significant performance boost, achieving a +23.7% improvement in unseen-image attack success rate on GPT-4o and a +19.9% increase on Gemini-2.0 compared to the strongest universal baseline. This demonstrates the efficacy of MCRMO-Attack in generating transferable adversarial perturbations that consistently induce targeted mispredictions in commercial MLLMs.
Multi-Crop Attention Routing improves transferable adversarial attacks on multimodal LLMs by enhancing robustness against diverse input perturbations
Scientists have developed a new attack method, MCRMO-Attack, that significantly boosts the success rate of adversarial attacks on closed-source multimodal large language models (MLLMs). The research demonstrates a +23.7% improvement in unseen-image attack success rate on GPT-4o and a +19.9% increase on Gemini-2.0, compared to the strongest existing universal baseline.
This work focuses on Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation consistently steers arbitrary inputs towards a specified target across multiple unknown commercial MLLMs. Experiments revealed that MCRMO-Attack stabilises supervision through Multi-Crop Aggregation with an Attention-Guided Crop, enhancing the reliability of token-level matching via alignability-gated Token Routing.
The team measured attack success rate (ASR), defined as semantic similarity exceeding 0.3 between the target and the model's response to the adversarial image, and average similarity (AvgSim), both scored under an LLM-as-a-judge protocol. Results demonstrate that on seen samples, the method achieves 85.5% ASR on GPT-4o, versus 66.7% for UAP and 15.0% for UnivIntruder.
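Under these definitions the two metrics reduce to simple statistics over per-sample judge scores. A hedged sketch, with a hypothetical function name and similarity scores assumed to come from the LLM judge:

```python
def asr_and_avgsim(similarities, threshold=0.3):
    """Attack success rate: percentage of samples whose judged semantic
    similarity to the target exceeds `threshold`. AvgSim: mean judged
    similarity. `similarities` would be per-sample LLM-as-a-judge scores."""
    hits = sum(1 for s in similarities if s > threshold)
    asr = 100.0 * hits / len(similarities)
    avg = sum(similarities) / len(similarities)
    return asr, avg
```

For example, judge scores of 0.9, 0.2, 0.5, and 0.1 would give an ASR of 50% (two of four exceed 0.3) and an AvgSim of 0.425.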
Further tests on Claude showed the universal perturbation exceeded all sample-wise competitors. Data shows that on unseen samples, GPT-4o reached 61.7% ASR with MCRMO-Attack, compared to 38.0% for UAP, alongside a higher keyword matching rate (KMRa of 52.0% versus 37.5%). Varying the number of optimisation samples (N) revealed a clear trade-off between fitting and generalisation: increasing N injects more diverse gradients, yielding more source-invariant perturbations.
Analysis of the perturbations themselves showed that MCRMO-Attack best preserves natural image appearance while reliably steering outputs towards the target, indicating stronger cross-sample generalisation. Ablation studies confirmed the importance of each component (Multi-Crop Aggregation, Attention-Guided Crop, and Token Routing) in achieving optimal performance. The number of target crops used in multi-crop aggregation was also varied, with performance peaking at eight crops before plateauing.
Adaptable adversarial perturbations for robust multimodal model steering enable precise control and generalization
Scientists have demonstrated the first systematic study of universal targeted transferable adversarial attacks on closed-source multimodal large language models (MLLMs). Their research introduces MCRMO-Attack, a two-stage framework designed to learn a perturbation from limited source samples for adaptable target modification, subsequently refining it for target-specific universal perturbations.
This approach addresses challenges in achieving consistent adversarial steering across unknown commercial MLLMs, particularly concerning target supervision, token-level reliability, and initialisation sensitivity. The MCRMO-Attack framework integrates multi-crop aggregation with an attention-guided crop, alongside alignability-gated token routing, to concentrate updates on structurally alignable elements within the input data.
Experiments across commercial MLLMs, including GPT-4o and Gemini-2.0, reveal a significant improvement in unseen-image attack success rates, boosting performance by +23.7% and +19.9% respectively, compared to the strongest existing universal baseline. Furthermore, the method achieves competitive results with fewer updates and delivers substantially higher performance under limited computational budgets.
The authors acknowledge that their method is sensitive to the meta-initialisation stage: removing it degrades performance, especially under limited computational budgets. Future research could explore ways to reduce this sensitivity and further harden the attack. These findings highlight the vulnerability of current MLLMs to adversarial manipulation and underscore the need for more robust defence mechanisms, even in black-box scenarios where model internals are unknown.
👉 More information
🗞 Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization
🧠 ArXiv: https://arxiv.org/abs/2601.23179
