Researchers are increasingly questioning whether large vision-language models (VLMs) genuinely see or simply retrieve memorised information, a debate now fuelled by new findings concerning their responses to visual illusions. Xiaoxiao Sun, Mingyang Li, and Kun Yuan (University of Strasbourg, Technical University of Munich), alongside Min Woo Sun, Mark Endo, and Shengguang Wu, present compelling evidence that VLMs often fail to adapt to inverted visual illusions, even though the change is readily apparent to humans. The research is significant because it moves beyond simply noting this inconsistency, introducing a novel framework called VI-Probe to systematically disentangle visual perception from memory-driven recall. By measuring stability and sensitivity with new metrics, the team demonstrate that response persistence stems from diverse mechanisms across different models, from GPT-5’s memory override to limitations in Qwen’s visual processing, challenging the notion of a single underlying cause and advocating for more nuanced evaluation of these powerful systems.
VI-Probe dissects visual perception in large models by separating seeing from remembering
Scientists have demonstrated a new framework, VI-Probe, to investigate whether large vision-language models (VLMs) genuinely perceive visual changes or simply recall memorised patterns. The research addresses a puzzling phenomenon where VLMs often answer questions correctly about visual illusions in original images, yet consistently fail when the illusion’s factors are inverted, despite the change being readily apparent to humans. This inconsistency raises a fundamental question about the nature of visual processing within these models. Experiments were conducted across diverse VLM families, revealing that response persistence isn’t caused by a single mechanism, but rather by heterogeneous factors.
The work challenges the prevailing single-cause explanations for this behaviour and advocates for probing-based evaluation that assesses both knowledge and sensitivity to controlled visual alterations. The researchers constructed a comprehensive dataset comprising 27 illusion categories, alongside associated control images and linguistic prompts, enabling detailed analysis of model responses. This benchmark supports continuous generation of samples, providing a robust platform for future investigations. The findings reveal that different model families exhibit distinct failure modes, suggesting varying underlying mechanisms responsible for response persistence.
GPT-5’s memory override indicates a strong reliance on pre-existing knowledge, while Claude-Opus-4.1’s perception-memory competition suggests a more complex interplay between visual input and linguistic priors. Qwen variants, on the other hand, appear limited by their visual processing capabilities, struggling with representation entanglement and perception bottlenecks. These insights open avenues for developing more robust and perceptually accurate VLMs, capable of truly “seeing” rather than simply “remembering”. Data and code for VI-Probe are publicly available, facilitating further research and development in this critical area of artificial intelligence.
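To make the shape of that benchmark concrete, the following is a minimal sketch of what a single VI-Probe-style record could look like. The field names and values are illustrative assumptions rather than the released schema; continuous sample generation would then amount to rendering new records at arbitrary strengths.

```python
from dataclasses import dataclass

@dataclass
class IllusionSample:
    """One hypothetical benchmark item: an illusion image, its matched control,
    and a complementary prompt pair. Field names are illustrative assumptions,
    not the released VI-Probe schema."""
    illusion_category: str   # one of the 27 illusion categories, e.g. "Ebbinghaus"
    strength: float          # graded illusion strength; 0.0 means inducers removed
    illusion_image: str      # path to the rendered illusion image
    control_image: str       # matched control preserving layout without the inducers
    question: str            # forward-polarity question about the target attribute
    question_flipped: str    # the same question with reversed polarity
    ground_truth: str        # answer implied by the actual pixel content

# Example record (all values hypothetical):
sample = IllusionSample(
    illusion_category="Ebbinghaus",
    strength=0.6,
    illusion_image="images/ebbinghaus_s060.png",
    control_image="images/ebbinghaus_control.png",
    question="Is the left centre circle larger than the right one?",
    question_flipped="Is the right centre circle larger than the left one?",
    ground_truth="no",  # in this hypothetical rendering, the right circle is drawn larger
)
```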
VI-Probe framework and VLM stability metrics are crucial
The study employed graded perturbations and matched visual controls, devoid of illusion inducers, to systematically investigate how these models process visual changes. Experiments were conducted across fifteen VLMs spanning four families (OpenAI, Anthropic, Google, and Qwen), encompassing both closed-source and open-source models. Researchers accessed all models via APIs to ensure uniform inference conditions, prompting each in a unified zero-shot setting. To probe visual perception versus memory, the study analysed whether VLMs altered predictions when visual evidence contradicted prior knowledge, assessing ‘flip sensitivity’, and whether responses remained consistent across visual and prompt variations.
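As a rough illustration of what such a unified zero-shot query might look like, the sketch below sends one image and one question to a model through an OpenAI-compatible endpoint (the paper reports routing evaluation through OpenRouter, discussed next). The model slug, file path, and environment variable are placeholders, not values from the paper.

```python
import base64
import os

from openai import OpenAI  # OpenAI-compatible client, pointed at OpenRouter

# OpenRouter exposes an OpenAI-compatible chat completions API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # placeholder variable name
)

def ask_zero_shot(model: str, image_path: str, question: str) -> str:
    """Send a single image plus a question in a zero-shot setting.
    No temperature is set, so the provider's default is used (as the article
    notes for models without temperature control)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Placeholder model slug; the study spans OpenAI, Anthropic, Google and Qwen models.
answer = ask_zero_shot(
    model="qwen/qwen2.5-vl-72b-instruct",
    image_path="images/ebbinghaus_s060.png",
    question="Is the left centre circle larger than the right one? Answer yes or no.",
)
```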
OpenRouter facilitated evaluation, with default temperature settings used for models lacking temperature control. The team engineered a setup involving complementary question pairs, asking the same question with opposite polarity to disentangle linguistic polarity handling from visual judgement, then decomposed responses into three categories: PFA (both correct), CbW (complementary but wrong), and TFI (non-complementary). The analysis of these categories revealed that high Polarity-Flip Consistency did not necessarily equate to high accuracy, as models could consistently flip answers with reversed prompts while still providing incorrect responses. OpenAI models exhibited high PFC (GPT-5-Mini: 84.86%) alongside substantial CbW (GPT-5: 31.08%), indicating systematic visual errors masked by linguistic coherence.
Conversely, Claude models demonstrated more balanced ratios, while Google models showed moderate PFC with lower CbW, suggesting less systematic bias. The study further highlighted that model scale did not consistently reduce linguistic fixation or visual bias, with Qwen2.5-72B achieving higher PFC than Qwen3-8B, yet the latter demonstrating competitive visual accuracy. Furthermore, the team evaluated four image types, including Control and Perturbed variants, to isolate memory effects from visual perception, utilising matched controls to remove illusion patterns while preserving baseline visual processing. A Template Fixation Index exceeding 45% indicated insufficient capacity to parse question semantics, rendering accuracy unreliable, as observed in Qwen2.5-3B with a TFI of 46.82%. This methodology enabled the researchers to demonstrate that linguistic robustness is a prerequisite for reliable visual reasoning and that PFC serves as a quality threshold for assessing model performance.
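A small sketch of how such paired responses might be scored follows. The decision rules here (treating “complementary” as opposite yes/no answers, and PFC as the combined PFA and CbW share) are assumptions inferred from the article, not the paper’s verbatim definitions.

```python
from dataclasses import dataclass

@dataclass
class PairedResponse:
    """A model's yes/no answers to one complementary question pair
    (field names are illustrative, not the paper's)."""
    answer_forward: str   # answer to the original phrasing, "yes" or "no"
    answer_flipped: str   # answer to the polarity-reversed phrasing
    truth_forward: str    # correct answer to the original phrasing

def classify(pair: PairedResponse) -> str:
    """Assign one of the three categories summarised above; the decision rules
    are assumptions inferred from the article."""
    complementary = pair.answer_forward != pair.answer_flipped
    correct = pair.answer_forward == pair.truth_forward
    if complementary and correct:
        return "PFA"   # answers flip with polarity and are correct
    if complementary:
        return "CbW"   # answers flip consistently but are wrong
    return "TFI"       # answers do not flip: template fixation

def summarise(pairs: list[PairedResponse]) -> dict[str, float]:
    """Aggregate pair-level labels into the article's headline percentages."""
    counts = {"PFA": 0, "CbW": 0, "TFI": 0}
    for p in pairs:
        counts[classify(p)] += 1
    n = len(pairs)
    return {
        "PFC": (counts["PFA"] + counts["CbW"]) / n,  # assumed: PFC = PFA + CbW share
        "TFI": counts["TFI"] / n,
        "PFA": counts["PFA"] / n,
        "CbW": counts["CbW"] / n,
    }
```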
GPT-5 overrides perception with memorised patterns
Results demonstrate that GPT-5 exhibits near-complete memory override when presented with visual illusions, consistently relying on pre-existing knowledge rather than processing the altered visual input. Data shows that Qwen variants are limited by visual-processing bottlenecks, specifically representation entanglement and perceptual constraints. The researchers recorded distinct sensitivities to visual factors like colour, size, and spatial configuration across different model families, indicating heterogeneous mechanisms responsible for response persistence. Tests show that VI-Probe’s paired-prompt consistency and illusion-normalised effect size provide factor-isolated evaluation beyond standard accuracy measurements.
Specifically, Polarity-Flip Consistency and Template Fixation Index quantify linguistic robustness, while the illusion multiplier (R) isolates memory-driven from perception-driven behaviour. Measurements confirm that VLMs fail to accurately interpret illusions before human perceptual thresholds are reached, yet perform comparably to humans when presented with the matched visual controls, highlighting template-driven failures. The work delivers a controllable benchmark for probing the perception-memory boundary in VLMs, utilising graded illusion strengths and comprehensive vision and language variations. The study’s findings challenge the notion that response persistence is solely attributable to language priors, offering practical guidance for both evaluation and model design. This work establishes a new paradigm for assessing the balance between perception and memory in VLMs, paving the way for more robust and reliable visual reasoning systems.
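The paper’s exact formula for the illusion multiplier is not reproduced in this article, but one plausible, purely illustrative realisation is the ratio of error rates on illusion images to error rates on their matched controls: a large ratio points to failures tied to the memorised illusion pattern itself, while a ratio near one points to a general perception bottleneck.

```python
def error_rate(correct_flags: list[bool]) -> float:
    """Fraction of items the model answers incorrectly."""
    return 1.0 - sum(correct_flags) / len(correct_flags)

def illusion_multiplier(correct_on_illusion: list[bool],
                        correct_on_control: list[bool],
                        eps: float = 1e-6) -> float:
    """Hypothetical stand-in for an illusion multiplier R: how much worse the
    model performs on illusion images than on matched controls. A large R
    suggests memory-driven persistence tied to the illusion pattern; R near 1
    suggests a perception bottleneck that affects the controls as well.
    This formula is an illustrative assumption, not the paper's definition."""
    return error_rate(correct_on_illusion) / max(error_rate(correct_on_control), eps)
```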
Illusion inversion reveals VLM perception flaws and biases
The research demonstrates that while VLMs often succeed on standard visual illusions, they frequently fail when those illusions are inverted, persisting with the original, incorrect answer despite the obvious visual change. This suggests that VLMs may be relying more on recalling previously seen patterns than on actual visual understanding. Through systematic manipulation of visual cues and controls, the study reveals that this persistence stems from varied causes depending on the model architecture; in Qwen variants, for example, the failures point to limitations in visual processing capacity. The authors acknowledge that their work focuses on controlled illusions and may not fully generalise to complex, real-world scenes.
They propose future research directions including the development of perception-first architectures, counterfactual consistency objectives during training, and the creation of graded datasets to better distinguish between perception-driven and memory-driven responses. Furthermore, they suggest exploring inference-time prompting strategies that require explicit visual verification. Ultimately, this research underscores a broader challenge in integrating visual evidence with prior knowledge, advocating for models that genuinely “see” rather than simply “recall”, and extending evaluation techniques beyond illusions to more realistic visual data.
👉 More information
🗞 Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
🧠 ArXiv: https://arxiv.org/abs/2601.22150
